archiveautomation
Automate digital preservation workflow.
Statistics
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 9
- Releases: 1
Metadata Files
README.md
ArchiveAutomation

Automate the digital preservation workflow.
The workflow has the following steps:
- Run the antivirus.
- Create a bag file from the source folders.
- Create a Dublin Core XML file.
- Run Droid to extract the metadata.
- Run JHove to complement the metadata.
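To make the Dublin Core step more concrete, here is a minimal sketch of what such a file could look like when built with Python's standard library. The project itself lists the `dcxml` package as a dependency, so this is only a stand-in for illustration. The function name `build_dc_xml`, the `metadata` wrapper element, and the sample title are assumptions, not taken from the script.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_xml(fields):
    """Build a small Dublin Core record from a dict of element name -> value."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only
record = build_dc_xml({"identifier": "000000_0000", "title": "Boat trip pictures"})
print(record)
```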
The usage of the script is described first; the installation of the dependencies is covered in the Configuration section below.
Usage
The usage assumes that the repository is already cloned, and we are ready to run the script.
Note: We had problems with long pathnames and found that the Windows API limits the length of a path. After changing the registry as shown here and rebooting the server, everything worked.
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force
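Before changing the registry, it can help to check whether a source folder actually contains paths at or beyond the classic Windows limit of 260 characters (`MAX_PATH`). The following is a minimal sketch; the helper name `find_long_paths` is hypothetical and is not part of the script.

```python
import os

MAX_PATH = 260  # classic Windows API limit, including the terminating NUL


def find_long_paths(root, limit=MAX_PATH):
    """Return absolute paths under `root` whose length is `limit` or more."""
    long_paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                long_paths.append(full)
    return long_paths
```

Calling `find_long_paths` on each source directory lists the entries that would need the `LongPathsEnabled` setting.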
Update Local Repository
Update the repository to the latest version:
mgarcia@arda:~/Documents/Work/ArchiveAutomation$ git pull
Already up to date.
mgarcia@arda:~/Documents/Work/ArchiveAutomation$
Start Virtual Environment
Activate the virtual environment:
```
# Windows
PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation> .\venv\Scripts\activate
# Linux
mgarcia@mordor:~/Documents/Work/ArchiveAutomation$ . venv/bin/activate
```
Workflow Input File
The script takes a single argument: a file describing all the information for the workflow. The name and extension of the file are irrelevant; any valid filename is acceptable.
The steps in the workflow are represented by sections of the input file, such as ACCESSION, BAGGER, CLAMAV, etc. The order of the sections does not matter, but none of them can be removed.
An example input file:
```
# Configuration file for the digital preservation workflow

[ACCESSION]
accession_id = 000000_0000

[BAGGER]
# You can specify a comma separated list of directories as source: dir1, dir2, ...
sourcedir = C:\Users\joe\Work\boattrippictures, C:\Users\joe\Work\myevent_1
# Using Python ExtendedInterpolation to use the 'accession_id' as target directory
dest_dir = C:\Users\joe\Work\${ACCESSION:accession_id}
# If 'false' the script will stop if the destination directory already exists.

[CLAMAV]
avdir = C:\Program Files\ClamAV
avupdate = freshclam.exe
avclamav = clamscan.exe
avlogsroot = C:\Users\Desktop\john\clamscanlogs\clamAVlog
quarantinedays = 30
# Doesn't actually run the AV command, just prints it.
run_it = no

[DROID]
droiddir = C:\Users\joe\Downloads\droid-binary-6.5.2-bin-win32-with-jre
droidbin = droid.bat
# The profile is the database with the metadata in binary format.
keep_profile = yes

[JHOVE]
jhovedir = C:\Users\joe\Work\jhove
jhovebin = jhove.bat
jhove_xml = yes

[JHOVE MODULES]
AIFF-hul = no
ASCII-hul = no
GIF-hul = no
GZip-kb = no
HTML-hul = no
JPEG-hul = yes
JPEG2000-hul = no
PDF-hul = no
TIFF-hul = no
UTF8-hul = no
WARC-kb = no
WAVE-hul = no
XML-hul = no
```
The `BAGGER:dest_dir` directory must not exist; otherwise the script will stop. By default, the directory name is the accession number, but the user can customize it, for example: `dest_dir = /path/to/my/bag/dir`
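The input file rules can be illustrated with Python's `configparser`. This is a minimal sketch, assuming the key names `accession_id`, `sourcedir`, and `dest_dir`; the function names and the reduced `REQUIRED` tuple are hypothetical, and the real script may organize this differently. It shows the `ExtendedInterpolation` lookup, the comma separated source list, and the stop-if-exists rule.

```python
import configparser
import os

# A reduced sample; the key names are assumptions, adjust them to your file.
SAMPLE = r"""
[ACCESSION]
accession_id = 000000_0000

[BAGGER]
sourcedir = C:\Users\joe\Work\boattrippictures, C:\Users\joe\Work\myevent_1
dest_dir = C:\Users\joe\Work\${ACCESSION:accession_id}
"""

# Only two sections here; a full input file also has CLAMAV, DROID, JHOVE, ...
REQUIRED = ("ACCESSION", "BAGGER")


def load_workflow(text):
    """Parse the input text and return (source_dirs, dest_dir)."""
    config = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    config.read_string(text)
    missing = [name for name in REQUIRED if not config.has_section(name)]
    if missing:
        raise ValueError(f"missing sections: {', '.join(missing)}")
    bagger = config["BAGGER"]
    source_dirs = [d.strip() for d in bagger["sourcedir"].split(",")]
    return source_dirs, bagger["dest_dir"]


def check_dest_dir(dest_dir):
    """Mirror the documented rule: stop if the destination already exists."""
    if os.path.exists(dest_dir):
        raise FileExistsError(f"{dest_dir} already exists, stopping")
```

Because sections are looked up by name, their order in the file has no effect on the result.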
Droid and Jhove
There can't be any spaces in the `droid` path; otherwise the `droid.bat` script will fail. This means that installing `droid` in `C:\Program Files` will not work.
The sections for Droid and Jhove are very similar. Each has the path to the installation directory, the name of the executable program, and one parameter. For Droid, the parameter controls whether the Droid profile is kept after the script runs. For Jhove, it is the option to save the output in XML format.
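The two constraints above (no spaces in the Droid path, yes/no switches per module) are easy to check before running the tools. The sketch below is an illustration only: the function names are hypothetical, and the sample uses the key names as printed in the example input file. Note that `configparser` lowercases option names by default, so the sketch sets `optionxform = str` to keep module names such as `JPEG-hul` intact.

```python
import configparser

SAMPLE = r"""
[DROID]
droiddir = C:\Users\joe\Downloads\droid-binary-6.5.2-bin-win32-with-jre
droidbin = droid.bat
keep_profile = yes

[JHOVE MODULES]
JPEG-hul = yes
PDF-hul = no
"""


def load_tools_config(text):
    """Parse the tool sections, preserving the case of option names."""
    config = configparser.ConfigParser()
    config.optionxform = str  # must be set before reading
    config.read_string(text)
    return config


def check_no_spaces(path):
    """Reject installation paths containing spaces, which break droid.bat."""
    if " " in path:
        raise ValueError(f"path contains a space: {path!r}")
    return path


def enabled_modules(config):
    """Return the names of the modules switched on with yes/true/on/1."""
    section = config["JHOVE MODULES"]
    return [name for name in section if section.getboolean(name)]
```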
Running the Script
Once the input file is ready, simply call the script with the input file as parameter.
(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation> python archiveautomation.py .\my_accession.cfg
Have a nice day.
(venv) mgarcia@wsl2:~/Documents/Work/ArchiveAutomation$
The script prints a usage message when no input file is provided:
```
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$ archiveautomation
Usage: archiveautomation [OPTIONS] INPUT
Try 'archiveautomation --help' for help.

Error: Missing argument 'INPUT'.
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$
```
The script can also be called with the --help option:
```
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$ archiveautomation --help
Usage: archiveautomation [OPTIONS] INPUT

Automate digital preservation workflow.

From INPUT creates a BagIt directory, and DC core complaint file.
(...)
```
Configuration
Clone the repository
Clone the repository from GitHub:
PS C:\Users\garcm0b\Work> git clone https://github.com/kaust-library/ArchiveAutomation.git
Virtual Environment and Dependencies
Create a virtual environment for the project
```
# Windows
PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation> python -m venv venv
# Linux
mgarcia@mordor:~/Documents/Work/ArchiveAutomation$ python3 -m venv venv
```
Set up the environment:
pip install --editable .
This will install all dependencies listed in the section install_requires of the setup.py file.
Configuration File
The configuration details for the script are in the file etc/archiveautomation.cfg. After cloning the repository, the configuration file is only a template (with an example extension): edit it with the correct values and save it as archiveautomation.cfg.
Currently the only parameters in the configuration file are those for the ArchivEra API:
```
# Define the parameters for ArchivEra API
[API]
url = 'https://path.to.archive/public'
clientid = 'apiclient'
grant_type = 'my grant type'
username = 'my API user'
database = 'my database'
```
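One detail worth knowing when reading this file from Python: `configparser` keeps the surrounding quotes as part of each value. The sketch below strips them after loading. Whether the script does this is not stated here, so treat the function name `load_api_settings` and the quote handling as assumptions.

```python
import configparser

SAMPLE = """
[API]
url = 'https://path.to.archive/public'
clientid = 'apiclient'
grant_type = 'my grant type'
username = 'my API user'
database = 'my database'
"""


def load_api_settings(text):
    """Return the [API] section as a dict with surrounding quotes removed."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return {key: value.strip("'\"") for key, value in config["API"].items()}


settings = load_api_settings(SAMPLE)
```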
ArchivEra API Password
The API password can be provided in two ways: as an environment variable or via a .env file. For the first case, set the password according to your operating system:
```
# Windows
(venv) PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation\src> $ENV:ARCHIVERA_API_PW='hello_mg'
# Linux
(venv) mgarcia@mordor:~/Documents/Work/ArchiveAutomation/src$ export ARCHIVERA_API_PW="hello"
```
The second way of providing the password is via a .env file in the same directory as the main program. The file contains a simple key=value pair:
(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation> cat .env
ARCHIVERA_API_PW='hello_world'
(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation>
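The two lookup paths can be sketched as follows. The project lists `python-dotenv` as a dependency; to keep this example dependency-free, a tiny hand-written parser stands in for it, so this is not the library's API. The function names are hypothetical. The environment variable is checked first, which matches the default behaviour of `python-dotenv`, where values already present in the environment are not overridden.

```python
import os
from pathlib import Path

ENV_NAME = "ARCHIVERA_API_PW"


def read_dotenv(path):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_password(dotenv_path=".env", environ=os.environ):
    """Prefer the environment variable, then fall back to the .env file."""
    if ENV_NAME in environ:
        return environ[ENV_NAME]
    if Path(dotenv_path).is_file():
        return read_dotenv(dotenv_path).get(ENV_NAME)
    return None
```

Returning `None` when neither source is available lets the caller decide whether to prompt the user or stop.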
Owner
- Name: KAUST University Library
- Login: kaust-library
- Kind: organization
- Location: Saudi Arabia
- Website: https://library.kaust.edu.sa/
- Repositories: 5
- Profile: https://github.com/kaust-library
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Automate the digital preservation workflow
doi: 10.5281/zenodo.7052841
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcelo
    family-names: Garcia
    email: marcelo.garcia@kaust.edu.sa
    affiliation: >-
      King Abdullah University of Science and
      Technology
    orcid: 'https://orcid.org/0000-0002-2927-2371'
  - given-names: Eamon
    family-names: Smallwood
    email: eamon.smallwood@kaust.edu.sa
    affiliation: >-
      King Abdullah University of Science and
      Technology
Dependencies
- bagit ==1.8.1
- certifi ==2021.10.8
- charset-normalizer ==2.0.7
- dcxml ==0.1.2
- idna ==3.3
- importlib-metadata ==4.8.1
- keyring ==23.2.1
- lxml ==4.6.4
- python-dotenv ==0.19.2
- pywin32-ctypes ==0.2.0
- requests ==2.26.0
- urllib3 ==1.26.7
- zipp ==3.6.0
- Click *
- bagit *
- dcxml *
- python-dotenv *
- requests *