https://github.com/dasch-swiss/fileidentification
Repository
Basic Info
- Host: GitHub
- Owner: dasch-swiss
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 6.81 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
Fileidentification
A Python CLI to identify file formats and bulk convert files. It is designed for digital preservation workflows and is essentially a Python wrapper around several programs. It uses siegfried, ffmpeg, imagemagick (inkscape) and LibreOffice, so those must be installed for it to work. The most likely use case is testing, and possibly converting, a large number of files whose types you don't know in advance. It features:
- file format identification and extraction of technical metadata with siegfried, ffprobe and imagemagick
- file integrity testing with ffmpeg and imagemagick
- file conversion with ffmpeg, imagemagick and LibreOffice using a json file as a protocol
- detailed logging
Required Programs
Install siegfried, ffmpeg, imagemagick (inkscape) and LibreOffice if not already installed.
MacOS (using homebrew)
bash
brew install richardlehane/digipres/siegfried
brew install ffmpeg
brew install --cask inkscape
brew install imagemagick
brew install ghostscript
brew install --cask libreoffice
Linux
Depending on your distribution:
On Debian/Ubuntu, add siegfried to the apt sources:
bash
curl -sL "http://keyserver.ubuntu.com/pks/lookup?op=get&search=0x20F802FE798E6857" | gpg --dearmor | sudo tee /usr/share/keyrings/siegfried-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/siegfried-archive-keyring.gpg] https://www.itforarchivists.com/ buster main" | sudo tee -a /etc/apt/sources.list.d/siegfried.list
sudo apt-get update && sudo apt-get install siegfried
ffmpeg, inkscape, imagemagick and LibreOffice:
bash
sudo apt-get update
sudo apt-get install ffmpeg imagemagick ghostscript inkscape libreoffice
Python Dependencies
If you don't have uv installed, install it with
bash
curl -LsSf https://astral.sh/uv/install.sh | sh
Then, you can use uv run to run the fileidentification script.
This creates a venv and installs all necessary python dependencies:
bash
uv run identify.py --help
Quick Start
Generate policies for your files:
1. Generate policies for your files:
uv run identify.py path/to/directory
2. Review the generated policies: edit path/to/directory_policies.json to customize the conversion rules.
3. Test the files and apply the policies:
uv run identify.py path/to/directory -iar
Single Execution Steps
Detect File Formats - Generate Conversion Policies
uv run identify.py path/to/directory
This generates two json files:
path/to/directory_log.json : The technical metadata of all the files in the folder
path/to/directory_policies.json : A file conversion protocol for each file format
that was encountered in the folder according to the default policies located in
fileidentification/policies/default.py. Edit it to customize conversion rules.
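Before editing the generated policies file, it can help to see which formats would actually be converted. The following is a minimal sketch, assuming the file is a JSON object keyed by PUID with the fields described under Policy Specification. The helper names `load_policies` and `formats_to_convert` are hypothetical and not part of fileidentification.

```python
import json
from pathlib import Path


def load_policies(path):
    """Read a policies file such as path/to/directory_policies.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def formats_to_convert(policies):
    """Map each PUID whose policy has accepted == false to its target container."""
    return {
        puid: policy.get("target_container", "")
        for puid, policy in policies.items()
        if policy.get("accepted") is False
    }


# Inline sample mirroring the documented policy layout (illustrative, not real output).
sample = {
    "fmt/5": {"bin": "ffmpeg", "accepted": False, "target_container": "mp4"},
    "fmt/13": {"bin": "magick", "accepted": True},
}
print(formats_to_convert(sample))
```

With a real file, `formats_to_convert(load_policies("path/to/directory_policies.json"))` would give the same kind of overview.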
File Integrity Tests
uv run identify.py path/to/directory -i
Test the files for their integrity and move corrupted files to the folder in path/to/directory_WORKINGDIR/_REMOVED.
You can also add the flag -v (--verbose) for more detailed inspection. (see options below)
NOTE: Currently only audio/video and image files are tested.
File Conversion
uv run identify.py path/to/directory -a
Apply the policies defined in path/to/directory_policies.json and convert files into their target file format.
The converted files are temporarily stored in path/to/directory_WORKINGDIR (default), with the log output
of the program used saved as log.txt next to them.
Clean Up Temporary Files
uv run identify.py path/to/directory -r
Delete all temporary files and folders and move the converted files next to their parents.
Combining Steps - Custom Policies and Working Directory
If you don't need these intermediary steps, you can run the desired steps at once by combining their flags. Here is an example of how to do verbose testing, apply a custom policy and set a working directory other than the default (see Options below for more information about the flags):
uv run identify.py path/to/directory -ariv -p path/to/custom_policies.json -w path/to/workingdir
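If the script is driven from another Python program, the combined command can be assembled from the same flags. This is only a sketch: `build_command` and `run_identify` are hypothetical helpers, they use only the flags documented in this README, and they assume the command is run from the repository root where identify.py lives.

```python
import subprocess


def build_command(directory, flags="", policies_path=None, working_dir=None):
    """Assemble the argument list for `uv run identify.py` from documented flags."""
    cmd = ["uv", "run", "identify.py", directory]
    if flags:
        # Short flags such as a, r, i, v can be combined into one option, e.g. -ariv.
        cmd.append(f"-{flags}")
    if policies_path:
        cmd += ["-p", policies_path]
    if working_dir:
        cmd += ["-w", working_dir]
    return cmd


def run_identify(directory, **kwargs):
    """Run the assembled command and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(directory, **kwargs), check=True)


cmd = build_command(
    "path/to/directory",
    flags="ariv",
    policies_path="path/to/custom_policies.json",
    working_dir="path/to/workingdir",
)
print(" ".join(cmd))
```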
Log
The path/to/directory_log.json keeps track of all modifications in the target folder.
On each execution, the script checks whether such a log exists and, if so, reads it and appends to it.
Iterations of file conversions such as A -> B, B -> C, ... are therefore logged in the same file.
If you prefer a simpler csv output, you can add the flag --csv whenever you run the script;
it converts the log.json for the current state of the directory to a csv.
Advanced Usage
You can also create your own policies file, and with that, customise the file conversion output.
Simply edit the generated default file path/to/directory_policies.json before applying.
If you want to start from scratch, run uv run identify.py path/to/directory -b to create a
blank policies template with all the file formats encountered in the folder.
Policy Specification
A policy for a file type consists of the following fields and uses its PRONOM Unique Identifier (PUID) as a key
| Field | Type | Requirement |
|------------------|-----------|-------------------------------------|
| format_name | str | optional |
| bin | str | required |
| accepted | bool | required |
| target_container | str | required if field accepted is false |
| processing_args | str | required if field accepted is false |
| expected | list[str] | required if field accepted is false |
| remove_original | bool | required if field accepted is false |
- format_name: The name of the file format.
- bin: Program to convert or test the file. Literal["","magick","ffmpeg","soffice","inkscape"]. (Testing is currently only supported for image/audio/video, i.e. ffmpeg and magick.)
- accepted: false if the file needs to be converted, true if it doesn't.
- processing_args: The arguments used with bin. Can be an empty string if no arguments are needed.
- expected: The expected file format of the converted file, as PUID.
- remove_original: Whether to remove the parent of the converted file from the directory. Default is false (the parent is kept).
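The rules in the table above can be checked before a hand-edited policies file is applied. The following is a minimal sketch of such a check. `validate_policy` is a hypothetical helper, not a function provided by fileidentification, and it only covers the requirements stated in the table.

```python
from typing import List

ALLOWED_BINS = {"", "magick", "ffmpeg", "soffice", "inkscape"}

# Fields that are only required when accepted is false, with their expected types.
CONVERSION_FIELDS = {
    "target_container": str,
    "processing_args": str,
    "expected": list,
    "remove_original": bool,
}


def validate_policy(policy: dict) -> List[str]:
    """Return a list of problems found in one policy; an empty list means it passed."""
    errors = []
    bin_value = policy.get("bin")
    if not isinstance(bin_value, str) or bin_value not in ALLOWED_BINS:
        errors.append("bin must be one of: " + ", ".join(repr(b) for b in sorted(ALLOWED_BINS)))
    accepted = policy.get("accepted")
    if not isinstance(accepted, bool):
        errors.append("accepted must be a bool")
        return errors
    if not accepted:
        for field, expected_type in CONVERSION_FIELDS.items():
            if not isinstance(policy.get(field), expected_type):
                errors.append(f"{field} ({expected_type.__name__}) is required when accepted is false")
    return errors


# An accepted format only needs bin and accepted.
print(validate_policy({"bin": "magick", "accepted": True}))
# A format to be converted is missing its conversion fields here.
print(validate_policy({"bin": "ffmpeg", "accepted": False}))
```

Running it over every value of a loaded policies file gives a quick overview of incomplete entries before the -a step.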
Policy Examples
A policy for Audio/Video Interleaved Format (avi) files that need to be transcoded to MPEG-4 Media File (Codec: AVC/H.264, Audio: AAC) looks like this:
json
{
"fmt/5": {
"format_name": "Audio/Video Interleaved Format",
"bin": "ffmpeg",
"accepted": false,
"target_container": "mp4",
"processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
"expected": [
"fmt/199"
],
"remove_original": false
}
}
A policy for Portable Network Graphics that is accepted as is, but still gets tested:
json
{
"fmt/13": {
"format_name": "Portable Network Graphics",
"bin": "magick",
"accepted": true
}
}
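Since each policy is keyed by its PUID, a custom policies file can also be produced from Python. The sketch below combines the two examples above into one JSON object and writes it to disk. It assumes that a file passed with -p uses the same PUID-keyed layout as the generated path/to/directory_policies.json. `write_policies` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# The two example policies from above, combined under their PUID keys.
custom_policies = {
    "fmt/5": {
        "format_name": "Audio/Video Interleaved Format",
        "bin": "ffmpeg",
        "accepted": False,
        "target_container": "mp4",
        "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
        "expected": ["fmt/199"],
        "remove_original": False,
    },
    "fmt/13": {
        "format_name": "Portable Network Graphics",
        "bin": "magick",
        "accepted": True,
    },
}


def write_policies(policies, path):
    """Serialize the policies as indented JSON and return the written path."""
    path = Path(path)
    path.write_text(json.dumps(policies, indent=4), encoding="utf-8")
    return path


# Write to a temporary directory and read the file back to confirm the round trip.
with tempfile.TemporaryDirectory() as tmp:
    out = write_policies(custom_policies, Path(tmp) / "custom_policies.json")
    reloaded = json.loads(out.read_text(encoding="utf-8"))

print(sorted(reloaded))
```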
Policy Testing:
You can test the outcome of the conversion policies (given that the path is path/to/directory_policies.json, otherwise pass the path to the file with -p) with
uv run identify.py path/to/directory -t
The script takes the smallest file for each conversion policy and converts it. The converted files are located in WORKINGDIR/TEST.
If you just want to test a specific policy, append f and the PUID:
uv run identify.py path/to/directory -tf fmt/XXX
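The selection described above, the smallest file for each conversion policy, can be pictured as follows. This is an illustrative sketch only and not the script's actual implementation. The function `smallest_per_puid` and the (path, puid, size) input tuples are assumptions made for the example.

```python
def smallest_per_puid(files):
    """Pick, for each PUID, the path of the smallest file.

    files: iterable of (path, puid, size_in_bytes) tuples.
    """
    smallest = {}
    for path, puid, size in files:
        # Keep the first file seen for a PUID, then replace it only by strictly smaller ones.
        if puid not in smallest or size < smallest[puid][1]:
            smallest[puid] = (path, size)
    return {puid: path for puid, (path, _size) in smallest.items()}


# Made-up file listing for illustration.
files = [
    ("a.avi", "fmt/5", 9_000_000),
    ("b.avi", "fmt/5", 1_200_000),
    ("c.png", "fmt/13", 50_000),
]
print(smallest_per_puid(files))
```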
Modifying Default Settings
The default settings for file conversion are in fileidentification/policies/default.py; you can add or modify the entries there. All other settings, such as default path values or the hash algorithm, are in fileidentification/conf/settings.py
Options
-i
[--integrity-tests] tests the files for their integrity
-v
[--verbose] catches more warnings on video and image files during the integrity tests.
This can take significantly longer, depending on the files you have. In addition,
it treats some warnings as errors.
-a
[--apply] applies the policies
-r
[--remove-tmp] removes all temporary items and adds the converted files next to their parents.
-x
[--remove-original] overrides the remove_original value in the policies and sets it to true when removing
the tmp files. The original files are moved to the WORKINGDIR/_REMOVED folder.
When used while generating policies, it sets remove_original in the policies to true (default false).
-p
[--policies-path] load a custom policies json file
-w
[--working-dir] set a custom working directory. default is path/to/directory_WORKINGDIR
-s
[--strict] when run in strict mode, it moves the files that are not listed in policies.json to the folder _REMOVED
(instead of throwing a warning).
When used in generating policies, it does not add blank policies for formats that are not mentioned in
fileidentification/policies/default.py
-b
[--blank] creates a blank policies file based on the files encountered in the given directory.
-e
[--extend-policies] append filetypes found in the directory to the given policies if they are missing in it.
-q
[--quiet] just print errors and warnings
--csv
get an additional output as csv aside from the log.json
--convert
re-convert the files that failed during file conversion
Using It in Your Code
As long as you have all the dependencies installed, run Python version >=3.8 and have typer and pydantic installed in your project, you can copy the fileidentification folder into your project folder and import the FileHandler into your code:
```python
from fileidentification.filehandling import FileHandler

# This runs it with default parameters (flags -ivarq); change the parameters to your needs.
fh = FileHandler()
fh.run("path/to/directory")

# Or, if you just want to do integrity tests:
fh = FileHandler()
fh.integrity_tests("path/to/directory")

# Log it at some point and have an additional csv.
fh.writelogs("path/where/to/log", tocsv=True)
```
Updating Signatures
bash
uv run update.py
Useful Links
You'll find a good resource to query for fileformats on nationalarchives.gov.uk
The Homepage of siegfried itforarchivists.com/siegfried/
List of File Signatures on wikipedia
Preservation recommendations kost bundesarchiv
NOTE: If you want to convert to pdf/A, you need LibreOffice version 7.4+.
When you convert svg, you might run into errors, as the default svg library of imagemagick is not that good.
The easiest workaround is to install inkscape ( brew install --cask inkscape ) and then reinstall imagemagick,
so that it uses inkscape as the default for converting svg ( brew remove imagemagick , brew install imagemagick ).
Owner
- Name: DaSCH - Swiss National Data and Service Center for the Humanities
- Login: dasch-swiss
- Kind: organization
- Email: info@dasch.swiss
- Location: Switzerland
- Website: https://dasch.swiss
- Twitter: DaSCHSwiss
- Repositories: 35
- Profile: https://github.com/dasch-swiss
Development repositories of the DaSCH.
GitHub Events
Total
- Delete event: 10
- Issue comment event: 4
- Push event: 14
- Public event: 1
- Pull request review event: 1
- Pull request review comment event: 1
- Pull request event: 7
- Create event: 2
Last Year
- Delete event: 10
- Issue comment event: 4
- Push event: 14
- Public event: 1
- Pull request review event: 1
- Pull request review comment event: 1
- Pull request event: 7
- Create event: 2
Dependencies
- beautifulsoup4 >=4.13.1
- lxml >=5.1.0
- requests >=2.31.0
- rich >=13.7.1
- typer >=0.10.0
- beautifulsoup4 4.13.1
- certifi 2024.12.14
- charset-normalizer 3.4.1
- click 8.1.8
- colorama 0.4.6
- fileidentification 0.1.0
- idna 3.10
- lxml 5.3.0
- markdown-it-py 3.0.0
- mdurl 0.1.2
- pygments 2.19.1
- requests 2.32.3
- rich 13.9.4
- shellingham 1.5.4
- soupsieve 2.6
- typer 0.15.1
- typing-extensions 4.12.2
- urllib3 2.3.0