https://github.com/dasch-swiss/fileidentification

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dasch-swiss
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 6.81 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created over 2 years ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

Fileidentification

A Python CLI to identify file formats and bulk convert files. It is designed for digital preservation workflows and is essentially a Python wrapper around several programs: siegfried, ffmpeg, imagemagick (inkscape) and LibreOffice, all of which need to be installed for it to work. The most likely use case is testing, and possibly converting, a large number of files when you don't know in advance what file types you are dealing with. It features:

  • file format identification and extraction of technical metadata with siegfried, ffprobe and imagemagick
  • file integrity testing with ffmpeg and imagemagick
  • file conversion with ffmpeg, imagemagick and LibreOffice using a json file as a protocol
  • detailed logging

Required Programs

Install siegfried, ffmpeg, imagemagick (inkscape) and LibreOffice if not already installed.

MacOS (using homebrew)

    brew install richardlehane/digipres/siegfried
    brew install ffmpeg
    brew install --cask inkscape
    brew install imagemagick
    brew install ghostscript
    brew install --cask libreoffice

Linux

Depending on your distribution:

On Debian/Ubuntu, add siegfried to the apt sources:

    curl -sL "http://keyserver.ubuntu.com/pks/lookup?op=get&search=0x20F802FE798E6857" | gpg --dearmor | sudo tee /usr/share/keyrings/siegfried-archive-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/siegfried-archive-keyring.gpg] https://www.itforarchivists.com/ buster main" | sudo tee -a /etc/apt/sources.list.d/siegfried.list
    sudo apt-get update && sudo apt-get install siegfried

ffmpeg, inkscape, imagemagick and LibreOffice:

    sudo apt-get update
    sudo apt-get install ffmpeg imagemagick ghostscript inkscape libreoffice
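Before running the script, it can help to confirm that the external programs are actually reachable. The following is a minimal sketch, not part of this repository. The executable names are assumptions: `sf` is the command installed by siegfried, `ffprobe` ships with ffmpeg, and the remaining names match the `bin` values used in the policies. Older ImageMagick 6 packages may only provide `convert` instead of `magick`, and a tool reported as missing may still be installed outside of `PATH` (for example LibreOffice on macOS).

```python
import shutil

# Assumed executable names (see the note above); adjust them to your system.
REQUIRED_TOOLS = ("sf", "ffmpeg", "ffprobe", "magick", "soffice", "inkscape")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:10} {path or 'NOT FOUND'}")
```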

Python Dependencies

If you don't have uv installed, install it with

    curl -LsSf https://astral.sh/uv/install.sh | sh

Then, you can use uv run to run the fileidentification script. This creates a venv and installs all necessary python dependencies:

    uv run identify.py --help

Quick Start

  1. Generate policies for your files: uv run identify.py path/to/directory

  2. Review generated policies: Edit path/to/directory_policies.json to customize conversion rules

  3. Test files and apply the policies: uv run identify.py path/to/directory -iar

Single Execution Steps

Detect File Formats - Generate Conversion Policies

uv run identify.py path/to/directory

This generates two JSON files:

path/to/directory_log.json : The technical metadata of all the files in the folder

path/to/directory_policies.json : A file conversion protocol for each file format that was encountered in the folder according to the default policies located in fileidentification/policies/default.py. Edit it to customize conversion rules.
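The naming pattern of these companion files can be sketched in a few lines. This is an illustration of the convention described in this README (the `_log.json`, `_policies.json` and `_WORKINGDIR` suffixes next to the directory), not code taken from the tool; `derived_paths` is a hypothetical helper name.

```python
from pathlib import Path


def derived_paths(directory):
    """Return the default companion paths this README describes for a directory."""
    d = Path(directory)
    return {
        # Each path sits next to the directory and reuses its name as a prefix.
        "log": d.with_name(d.name + "_log.json"),
        "policies": d.with_name(d.name + "_policies.json"),
        "working_dir": d.with_name(d.name + "_WORKINGDIR"),
    }
```

For example, `derived_paths("path/to/directory")["policies"]` points at `path/to/directory_policies.json`, the file to edit before applying the policies.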

File Integrity Tests

uv run identify.py path/to/directory -i

Test the files for their integrity and move corrupted files to the folder in path/to/directory_WORKINGDIR/_REMOVED.

You can also add the flag -v (--verbose) for more detailed inspection. (see options below)

NOTE: Currently only audio/video and image files are tested.

File Conversion

uv run identify.py path/to/directory -a

Apply the policies defined in path/to/directory_policies.json and convert files into their target file format. The converted files are temporarily stored in path/to/directory_WORKINGDIR (default), with the log output of the program used saved as log.txt next to them.

Clean Up Temporary Files

uv run identify.py path/to/directory -r

Delete all temporary files and folders and move the converted files next to their parents.

Combining Steps - Custom Policies and Working Directory

If you don't need these intermediary steps, you can run the desired steps at once by combining their flags. Here is an example of how to do verbose testing, apply a custom policy and set the working directory to a location other than the default (see the options below for more information about the flags):

uv run identify.py path/to/directory -ariv -p path/to/custom_policies.json -w path/to/workingdir
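If you drive the CLI from another Python script, the combined call can be assembled programmatically. The sketch below only uses flags documented in this README; `build_command` and `run_identify` are hypothetical helper names. It assumes that the script is started from the repository root (where identify.py lives) and that a failed run exits with a non-zero status, which this README does not state.

```python
import subprocess


def build_command(directory, flags="", policies=None, working_dir=None):
    """Assemble the argument list for `uv run identify.py` from README flags."""
    cmd = ["uv", "run", "identify.py", str(directory)]
    if flags:
        cmd.append("-" + flags)  # combined short flags, e.g. "ariv"
    if policies:
        cmd += ["-p", str(policies)]
    if working_dir:
        cmd += ["-w", str(working_dir)]
    return cmd


def run_identify(directory, **kwargs):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(directory, **kwargs), check=True)
```

Calling `build_command("path/to/directory", "ariv", "path/to/custom_policies.json", "path/to/workingdir")` reproduces the argument list of the command shown above.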

Log

The path/to/directory_log.json keeps track of all modifications in the target folder.
On each execution, the script checks whether such a log exists and, if so, reads it and appends to it.
Iterations of file conversions such as A -> B, B -> C, ... are therefore logged in the same file.

If you want a simpler csv output, you can add the flag --csv whenever you run the script; it converts the log.json of the current state of the directory to a csv.

Advanced Usage

You can also create your own policies file and thereby customise the file conversion output. Simply edit the generated default file path/to/directory_policies.json before applying it. If you want to start from scratch, run uv run identify.py path/to/directory -b to create a blank policies template with all the file formats encountered in the folder.

Policy Specification

A policy for a file type consists of the following fields and uses its PRONOM Unique Identifier (PUID) as a key

| Field            | Type      | Required                            |
|------------------|-----------|-------------------------------------|
| format_name      | str       | optional                            |
| bin              | str       | required                            |
| accepted         | bool      | required                            |
| target_container | str       | required if field accepted is false |
| processing_args  | str       | required if field accepted is false |
| expected         | list[str] | required if field accepted is false |
| remove_original  | bool      | required if field accepted is false |

  • format_name: The name of the file format.
  • bin: Program to convert or test the file. Literal["", "magick", "ffmpeg", "soffice", "inkscape"]. (Testing currently only is supported on image/audio/video, i.e. ffmpeg and magick.)
  • accepted: false if the file needs to be converted, true if it doesn't.
  • processing_args: The arguments used with bin. Can also be an empty string if there is no need for such arguments.
  • expected: the expected file format for the converted file as PUID
  • remove_original: whether to remove the parent of the converted file from the directory, default is false
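The rules in the table can be checked mechanically before applying a policies file. The following is a minimal sketch that follows the table literally; it is not the validation the tool itself performs, and `policy_errors` is a hypothetical helper name.

```python
# Allowed values for `bin`, as listed in the field description above.
ALLOWED_BINS = ("", "magick", "ffmpeg", "soffice", "inkscape")

# Fields the table marks as required when `accepted` is false.
CONVERSION_FIELDS = {
    "target_container": str,
    "processing_args": str,
    "expected": list,
    "remove_original": bool,
}


def policy_errors(policy):
    """Return a list of problems found in one policy dict (empty if none)."""
    errors = []
    if policy.get("bin") not in ALLOWED_BINS:
        errors.append("bin must be one of: " + ", ".join(repr(b) for b in ALLOWED_BINS))
    if not isinstance(policy.get("accepted"), bool):
        errors.append("accepted must be true or false")
    elif not policy["accepted"]:
        # Conversion fields only matter when the format is not accepted.
        for field, expected_type in CONVERSION_FIELDS.items():
            if not isinstance(policy.get(field), expected_type):
                errors.append(f"{field} is required when accepted is false")
    return errors
```

A policy such as `{"bin": "magick", "accepted": True}` yields an empty list, while `{"bin": "ffmpeg", "accepted": False}` reports the four missing conversion fields.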

Policy Examples

A policy for Audio/Video Interleaved Format (avi) files that need to be transcoded to MPEG-4 Media File (Codec: AVC/H.264, Audio: AAC) looks like this:

    {
        "fmt/5": {
            "format_name": "Audio/Video Interleaved Format",
            "bin": "ffmpeg",
            "accepted": false,
            "target_container": "mp4",
            "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
            "expected": [
                "fmt/199"
            ],
            "remove_original": false
        }
    }

A policy for Portable Network Graphics that is accepted as it is, but gets tested:

    {
        "fmt/13": {
            "format_name": "Portable Network Graphics",
            "bin": "magick",
            "accepted": true
        }
    }
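A custom policies file can also be assembled in Python and then passed with -p. The sketch below combines the two example policies above into one dictionary keyed by PUID, serialises it, and lists which formats would be converted. It assumes that a policies file is a single JSON object of such entries, as the examples suggest; `policies_to_json` and `conversions` are hypothetical helper names.

```python
import json

# The two example policies from above, keyed by PUID.
policies = {
    "fmt/5": {
        "format_name": "Audio/Video Interleaved Format",
        "bin": "ffmpeg",
        "accepted": False,
        "target_container": "mp4",
        "processing_args": "-c:v libx264 -crf 18 -pix_fmt yuv420p -c:a aac",
        "expected": ["fmt/199"],
        "remove_original": False,
    },
    "fmt/13": {
        "format_name": "Portable Network Graphics",
        "bin": "magick",
        "accepted": True,
    },
}


def policies_to_json(policies):
    """Serialise the policies; write the result to a file to use it with -p."""
    return json.dumps(policies, indent=4)


def conversions(policies):
    """Map each PUID that is not accepted to its target container."""
    return {
        puid: policy["target_container"]
        for puid, policy in policies.items()
        if not policy["accepted"]
    }
```

Here only the avi policy triggers a conversion; the PNG policy is accepted and merely tested.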

Policy Testing:

You can test the outcome of the conversion policies (given that the path is path/to/directory_policies.json, otherwise pass the path to the file with -p) with

uv run identify.py path/to/directory -t

The script takes the smallest file for each conversion policy and converts it. The converted files are located in WORKINGDIR/TEST.

If you just want to test a specific policy, append f and the puid

uv run identify.py path/to/directory -tf fmt/XXX
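The idea of picking one small sample per policy can be illustrated as follows. This sketch is not the tool's implementation; it only shows the selection rule described above, under the assumption that each file is available as a (path, puid, size in bytes) tuple. `smallest_per_puid` is a hypothetical helper name.

```python
def smallest_per_puid(files):
    """Pick the smallest file for each PUID.

    `files` is an iterable of (path, puid, size_in_bytes) tuples.
    """
    smallest = {}
    for path, puid, size in files:
        # Keep the first file seen for a PUID, then replace it by smaller ones.
        if puid not in smallest or size < smallest[puid][1]:
            smallest[puid] = (path, size)
    return {puid: path for puid, (path, _size) in smallest.items()}
```

Converting only these samples keeps a policy test fast even on large collections.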

Modifying Default Settings

The default settings for file conversion are in fileidentification/policies/default.py; you can add or modify the entries there. All other settings, such as default path values or the hash algorithm, are in fileidentification/conf/settings.py

Options

-i [--integrity-tests] tests the files for their integrity

-v [--verbose] catches more warnings on video and image files during the integrity tests. This can take significantly longer, depending on what files you have. In addition, it treats some warnings as errors.

-a [--apply] applies the policies

-r [--remove-tmp] removes all temporary items and adds the converted files next to their parents.

-x [--remove-original] overrides the remove_original value in the policies and sets it to true when removing the tmp files. The original files are moved to the WORKINGDIR/_REMOVED folder. When used while generating policies, it sets remove_original in the policies to true (default false).

-p [--policies-path] load a custom policies json file

-w [--working-dir] set a custom working directory. default is path/to/directory_WORKINGDIR

-s [--strict] when run in strict mode, it moves the files that are not listed in policies.json to the folder _REMOVED (instead of throwing a warning). When used in generating policies, it does not add blank policies for formats that are not mentioned in fileidentification/policies/default.py

-b [--blank] creates a blank policies file based on the files encountered in the given directory.

-e [--extend-policies] append filetypes found in the directory to the given policies if they are missing in it.

-q [--quiet] just print errors and warnings

--csv get an additional output as csv aside from the log.json

--convert re-convert the files that failed during file conversion

Using It in Your Code

As long as you have all the dependencies installed, run Python version >=3.8 and have typer and pydantic installed in your project, you can copy the fileidentification folder into your project folder and import the FileHandler into your code:

```python
from fileidentification.filehandling import FileHandler

# this runs it with default parameters (flags -ivarq); change the parameters to your needs
fh = FileHandler()
fh.run("path/to/directory")

# or if you just want to do integrity tests
fh = FileHandler()
fh.integrity_tests("path/to/directory")

# log it at some point and have an additional csv
fh.writelogs("path/where/to/log", tocsv=True)
```

Updating Signatures

    uv run update.py

Useful Links

You'll find a good resource to query for fileformats on nationalarchives.gov.uk

The Homepage of siegfried itforarchivists.com/siegfried/

List of File Signatures on wikipedia

Preservation recommendations kost bundesarchiv

NOTE: if you want to convert to PDF/A, you need LibreOffice version 7.4+.

When you convert svg, you might run into errors, as the default library of imagemagick is not that good. The easiest workaround is installing inkscape ( brew install --cask inkscape ) and then reinstalling imagemagick, so that it uses inkscape as default for converting svg ( brew remove imagemagick , brew install imagemagick ).

Owner

  • Name: DaSCH - Swiss National Data and Service Center for the Humanities
  • Login: dasch-swiss
  • Kind: organization
  • Email: info@dasch.swiss
  • Location: Switzerland

Development repositories of the DaSCH.

GitHub Events

Total
  • Delete event: 10
  • Issue comment event: 4
  • Push event: 14
  • Public event: 1
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 7
  • Create event: 2
Last Year
  • Delete event: 10
  • Issue comment event: 4
  • Push event: 14
  • Public event: 1
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 7
  • Create event: 2

Dependencies

pyproject.toml pypi
  • beautifulsoup4 >=4.13.1
  • lxml >=5.1.0
  • requests >=2.31.0
  • rich >=13.7.1
  • typer >=0.10.0
uv.lock pypi
  • beautifulsoup4 4.13.1
  • certifi 2024.12.14
  • charset-normalizer 3.4.1
  • click 8.1.8
  • colorama 0.4.6
  • fileidentification 0.1.0
  • idna 3.10
  • lxml 5.3.0
  • markdown-it-py 3.0.0
  • mdurl 0.1.2
  • pygments 2.19.1
  • requests 2.32.3
  • rich 13.9.4
  • shellingham 1.5.4
  • soupsieve 2.6
  • typer 0.15.1
  • typing-extensions 4.12.2
  • urllib3 2.3.0