dataintegrityfingerprint
Data Integrity Fingerprint (DIF) - A reference implementation in Python
https://github.com/expyriment/dataintegrityfingerprint-python
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Data Integrity Fingerprint (DIF)
A reference implementation in Python
- Command line interface (CLI) application
- Graphical user interface (GUI) application
- Programming library (Python package)
by Oliver Lindemann & Florian Krause
Introduction
This software calculates the Data Integrity Fingerprint (DIF) of multi-file datasets. It can be used via the command line, via a graphical user interface, or as a Python library for embedding in other software. In each case, the user can choose among a variety of (cryptographic) hash algorithms and between serial (single CPU core) and parallel (multiple CPU cores) computing. In addition, a checksums file containing the fingerprints of the individual files in a dataset can be created. Such a file can itself serve as the basis for calculating the DIF, and it can be compared against a dataset to reveal content differences when a DIF cannot be verified.
Note: We strongly recommend using SHA-256 or one of the other cryptographic algorithms for calculating the DIF. The non-cryptographic algorithms are significantly faster, but also significantly less secure (i.e. collisions are much more likely, which breaks the uniqueness of a DIF and opens the door to manipulation). They might hence only be an option for very large datasets in scenarios where manipulation by a third party is not part of the threat model. The graphical user interface does not allow selecting non-cryptographic algorithms.
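To make the difference in collision resistance concrete, the following sketch compares the output size of CRC-32 and SHA-256 for the same bytes. It uses only the Python standard library and is not part of this package; the input bytes are an arbitrary example.

```python
import hashlib
import zlib

data = b"example dataset content"

# CRC-32: a non-cryptographic checksum with only 2**32 possible values
crc32_hex = format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

# SHA-256: a cryptographic hash with 2**256 possible values
sha256_hex = hashlib.sha256(data).hexdigest()

# Each hexadecimal digit encodes 4 bits
print(len(crc32_hex) * 4, "bits:", crc32_hex)
print(len(sha256_hex) * 4, "bits:", sha256_hex)
```

With only 32 bits of output, accidental collisions become plausible for large collections of files, and constructing a deliberate collision is cheap. Neither is practical for SHA-256.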
Installation
The quickest way to use the application is to install it with pipx:
pipx install dataintegrityfingerprint
To also use the programming library, a regular pip installation is possible:
python -m pip install dataintegrityfingerprint
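One way to confirm that the package is importable from a given Python environment is to query its installed version. This is a minimal sketch using the standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="dataintegrityfingerprint"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

Note that pipx installs applications into an isolated environment, so after a pipx-only installation this check is expected to report `None` when run from a different interpreter.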
Usage
Command line interface (CLI) application usage
After successful installation, the command line interface is available as dataintegrityfingerprint:
```
dataintegrityfingerprint [-h] [-f] [-a ALGORITHM] [-C] [-D] [-G] [-L] [-s]
                         [-d CHECKSUMSFILE] [-n] [-p] [--non-cryptographic]
                         [PATH]

positional arguments:
  PATH                  the path to the data directory

options:
  -h, --help            show this help message and exit
  -f, --from-checksums-file
                        Calculate dif from checksums file. PATH is a
                        checksums file
  -a ALGORITHM, --algorithm ALGORITHM
                        the hash algorithm to be used (default=SHA-256)
  -C, --checksums       print checksums only
  -D, --dif-only        print dif only
  -G, --gui             open graphical user interface
  -L, --list-available-algorithms
                        print available algorithms
  -s, --save-checksums-file
                        save checksums to file
  -d CHECKSUMSFILE, --diff-checksums-file CHECKSUMSFILE
                        Calculate differences of checksums to CHECKSUMSFILE
  -n, --no-multi-processing
                        switch of multi processing
  -p, --progress        show progressbar
  --non-cryptographic   allow non cryptographic algorithms (Not suggested,
                        please read documentation carefully!)
```
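When the command line tool needs to be driven from a script, the options listed above can be assembled programmatically. The sketch below is an illustration only: the helper names `build_dif_command` and `run_dif` are hypothetical, and it assumes that `-D` prints nothing but the DIF on standard output.

```python
import subprocess


def build_dif_command(path, algorithm="SHA-256", dif_only=True, progress=False):
    """Assemble an argument list for the dataintegrityfingerprint CLI."""
    command = ["dataintegrityfingerprint", "-a", algorithm]
    if dif_only:
        command.append("-D")  # print dif only
    if progress:
        command.append("-p")  # show progressbar
    command.append(path)
    return command


def run_dif(path, **options):
    """Run the CLI (must be on PATH) and return its stripped standard output."""
    result = subprocess.run(
        build_dif_command(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


print(build_dif_command("/path/to/dataset"))
# ['dataintegrityfingerprint', '-a', 'SHA-256', '-D', '/path/to/dataset']
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.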
Graphical user interface (GUI) application usage
After successful installation, the graphical user interface is available as dataintegrityfingerprint-gui:

- Button "Browse..." - Opens a file browser for selecting a data directory. The selected data directory will be shown at the top of the interface.
- Button "Generate DIF" - Generates the DIF for the selected data directory. The DIF will be shown at the bottom of the interface. In addition, the main area in the middle of the interface will show the checksums (fingerprints) of individual files.
- Button "Copy" - Copies the DIF into the clipboard for pasting into other applications.
- Menu item "File --> Open checksums" - Opens a checksums file. The DIF of that checksums file will be shown at the bottom of the interface. In addition, the main area in the middle of the interface will show the checksums (fingerprints) of individual files.
- Menu item "File --> Save checksums" - Saves the checksums (fingerprints) of individual files to a file.
- Menu item "File --> Quit" - Quits the application.
- Menu item "Edit --> Diff checksums" - Opens a checksums file and shows differences of checksums (fingerprints) of individual files to those currently shown in the main area in the middle of the interface.
- Menu item "Options --> Hash algorithm" - Selects the cryptographic hash algorithm used as basis for DIF calculation.
- Menu item "Options --> Progress updating" - Enables/disables progress updating via a progress bar.
- Menu item "Options --> Multi-core processing" - Enables/disables parallel computing (usage of multiple CPU cores).
Programming library (Python package) usage
After successful installation, the Python package is available as dataintegrityfingerprint:
```python3
import dataintegrityfingerprint
```
A DIF can then be created in the following way:
```python3
dif = dataintegrityfingerprint.DataIntegrityFingerprint("/path/to/dataset")
print(dif)  # get the DIF
print(dif.checksums)  # get the list of checksums of individual files
```
API documentation
The main functionality for usage in other code is made available via the class DataIntegrityFingerprint.
DataIntegrityFingerprint
Create a DataIntegrityFingerprint object.
```
DataIntegrityFingerprint(data, from_checksums_file=False, hash_algorithm='SHA-256', multiprocessing=True, allow_non_cryptographic_algorithms=False)

Parameters
----------
data : str
    the path to the data
from_checksums_file : bool
    data argument is a checksums file
hash_algorithm : str
    the hash algorithm (optional, default: SHA-256)
multiprocessing : bool
    using multi CPU cores (optional, default: True)
    speeds up creating of checksums for large data files
allow_non_cryptographic_algorithms : bool
    set True only, if you need non cryptographic algorithms (see
    notes!)

Note
----
We do not recommend using non-cryptographic algorithms.
Non-cryptographic algorithms are, while much faster, not secure (e.g.
can be tampered with). Only use these algorithms to check for technical
file damage and in cases where security is not of critical concern.
```
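The constructor parameters above can be wrapped in a small helper. This is a sketch, not an official recipe: `compute_dif` is a hypothetical name, and the import is done lazily inside the function so that the helper can be defined even where the package is not installed.

```python
def compute_dif(path, algorithm="SHA-256", parallel=True):
    """Return the DIF of the dataset at `path` (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the package
    from dataintegrityfingerprint import DataIntegrityFingerprint

    fingerprint = DataIntegrityFingerprint(
        path,
        hash_algorithm=algorithm,   # one of the documented algorithm names
        multiprocessing=parallel,   # False forces serial (single core) hashing
    )
    return fingerprint.dif          # documented read-only property
```

A call such as `compute_dif("/path/to/dataset", algorithm="SHA-512", parallel=False)` would then return the fingerprint. When parallel computing is enabled in a script, it is generally advisable to place the call under `if __name__ == "__main__":`, because Python's multiprocessing may re-import the main module on some platforms.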
The DataIntegrityFingerprint class includes a set of global variables which
affect all instances.
CHECKSUM_FILENAME_SEPARATOR
Global variable.
Default value = '␣␣' (i.e., two U+0020 whitespace characters)
CRYPTOGRAPHIC_ALGORITHMS
Global variable.
Default value = ['MD5', 'SHA-1', 'SHA-224', 'SHA-256', 'SHA-384', 'SHA-512',
'SHA3-224', 'SHA3-256', 'SHA3-384', 'SHA3-512']
NON_CRYPTOGRAPHIC_ALGORITHMS
Global variable.
Default value = ['ADLER-32', 'CRC-32']
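The two-space separator matters when checksums lines are read back by other tools. The following sketch splits one such line; it assumes that the checksum comes first and the filename second, which is not stated explicitly above, and `parse_checksum_line` is a hypothetical helper rather than part of the package.

```python
SEPARATOR = "  "  # two U+0020 characters, the documented default


def parse_checksum_line(line, separator=SEPARATOR):
    """Split one line into (checksum, filename); the field order is assumed."""
    # Split only at the first separator so that the filename is kept intact
    # even if it contains spaces itself
    checksum, filename = line.rstrip("\n").split(separator, 1)
    return checksum, filename


print(parse_checksum_line("9f86d081  data/file with spaces.txt\n"))
# ('9f86d081', 'data/file with spaces.txt')
```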
Once initialized, a DataIntegrityFingerprint object provides several methods and
attributes.
diff_checksums
Calculate differences of checksums to checksums file.
```
diff_checksums(filename)

Parameters
----------
filename : str
    the name of the checksums file

Returns
-------
diff : str
    the difference of checksums to the checksums file
    (minus means checksums is missing something from checksums file,
    plus means checksums has something in addition to checksums file)
```
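The minus/plus convention can be illustrated with a standalone sketch. This is not the library's implementation, and the exact output format of `diff_checksums` may differ; `diff_checksum_lines` is a hypothetical function that applies the same convention to two lists of checksum lines.

```python
def diff_checksum_lines(current, reference):
    """Compare two collections of checksum lines.

    Lines prefixed with '- ' exist only in the reference (the checksums file);
    lines prefixed with '+ ' exist only in the current checksums.
    """
    current_set, reference_set = set(current), set(reference)
    missing = sorted(reference_set - current_set)
    extra = sorted(current_set - reference_set)
    return [f"- {line}" for line in missing] + [f"+ {line}" for line in extra]


current = ["aa  a.txt", "bb  b.txt"]
reference = ["aa  a.txt", "cc  b.txt"]
print(diff_checksum_lines(current, reference))
# ['- cc  b.txt', '+ bb  b.txt']
```

A changed file therefore shows up twice: once with the reference checksum (minus) and once with the current checksum (plus).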
generate
Generate hash list to get Data Integrity Fingerprint.
```
generate(progress=None)

Parameters
----------
progress : function, optional
    a callback function for a progress reporting that takes the
    following parameters:
        count  -- the current count
        total  -- the total count
        status -- a string describing the status
```
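A progress callback only needs to accept the three documented parameters. The sketch below shows one possible shape; the function names are hypothetical, and what the library passes as `status` (for example, the current filename) is an assumption.

```python
def format_progress(count, total, status):
    """Build a one-line progress message from the documented parameters."""
    percent = 100 * count / total if total else 100.0
    return f"[{percent:5.1f}%] {count}/{total} {status}"


def print_progress(count, total, status):
    """Callback with the (count, total, status) parameters described above."""
    print(format_progress(count, total, status))


# Simulate the calls a hashing loop might make for four files
for index, name in enumerate(["a.txt", "b.txt", "c.txt", "d.txt"], start=1):
    print_progress(index, 4, name)
# first line printed: [ 25.0%] 1/4 a.txt
```

Such a function could then be passed as `generate(progress=print_progress)` on a DataIntegrityFingerprint object.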
get_files
Get all files to hash.
```
get_files()

Returns
-------
files : list
    the list of files to hash
```
save_checksums
Save the checksums to a file.
```
save_checksums(filename=None)

Parameters
----------
filename : str, optional
    the name of the file to save checksums to

Returns
-------
success : bool
    whether saving was successful
```
An initialized DataIntegrityFingerprint object also provides a set of
read-only properties.
allow_non_cryptographic_algorithms
Read-only property.
checksums
Read-only property.
data
Read-only property.
dif
Read-only property.
file_count
Read-only property.
file_hash_list
Read-only property.
hash_algorithm
Read-only property.
multiprocessing
Read-only property.
Support and contribution
For any questions, please use the discussion section of the code repository. If you wish to contribute or report an issue, please use the issue tracker and pull requests.
Citation
To cite this software conceptually, you can use the following general citation/DOI:
Lindemann, O., & Krause, F. Data Integrity Fingerprint (DIF) - A reference implementation in Python [Computer software]. https://doi.org/10.5281/zenodo.5866698
To cite a specific version (preferred), please see the corresponding citation/DOI under releases!
Owner
- Name: Expyriment
- Login: expyriment
- Kind: organization
- Email: info@expyriment.org
- Website: www.expyriment.org
- Repositories: 9
- Profile: https://github.com/expyriment
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Lindemann"
    given-names: "Oliver"
    orcid: "https://orcid.org/0000-0003-3789-5373"
  - family-names: "Krause"
    given-names: "Florian"
    orcid: "https://orcid.org/0000-0002-2754-3692"
title: "Data Integrity Fingerprint (DIF) - A reference implementation in Python"
doi: 10.5281/zenodo.5866698
url: "https://github.com/expyriment/dataintegrityfingerprint-python"
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 126
- Total Committers: 5
- Avg Commits per committer: 25.2
- Development Distribution Score (DDS): 0.381
Top Committers
| Name | Email | Commits |
|---|---|---|
| Florian Krause | s****n@g****m | 78 |
| Oliver Lindemann | l****n@c****u | 19 |
| fladd | f****e@f****e | 10 |
| Oliver Lindemann | l****9@u****m | 10 |
| Oliver Lindemann | l****9@g****m | 9 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 6
- Total pull requests: 4
- Average time to close issues: 10 months
- Average time to close pull requests: 2 months
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 1.83
- Average comments per pull request: 0.25
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- lindemann09 (3)
- fladd (3)
Pull Request Authors
- fladd (3)
- lindemann09 (1)
Packages
- Total packages: 1
- Total downloads: 46 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 8
- Total maintainers: 2
pypi.org: dataintegrityfingerprint
Data Integrity Fingerprint (DIF) - A reference implementation in Python
- Homepage: https://github.com/expyriment/dataintegrityfingerprint-python
- Documentation: https://dataintegrityfingerprint.readthedocs.io/
- License: MIT License
- Latest release: 0.7.6 (published about 4 years ago)