archiveautomation

Automate digital preservation workflow.

https://github.com/kaust-library/archiveautomation

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization kaust-library has institutional domain (library.kaust.edu.sa)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary

Keywords

python

Repository

Automate digital preservation workflow.

Basic Info
  • Host: GitHub
  • Owner: kaust-library
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 400 KB
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 0
  • Open Issues: 9
  • Releases: 1
Topics
python
Created over 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

ArchiveAutomation

Digital Preservation Workflow

Automate the digital preservation workflow.

The workflow has the following steps:

  1. Run the antivirus.
  2. Create a bag file from the source folders.
  3. Create a Dublin Core XML file.
  4. Run Droid to extract the metadata.
  5. Run Jhove to complement the metadata.

The usage of the script is described first; installing the dependencies is covered in the Configuration section further down.

Usage

The usage assumes that the repository is already cloned, and we are ready to run the script.

Note. We had problems with long pathnames, caused by a limit on the length of a path in the Windows API. We enabled long paths in the registry and rebooted the server, and now everything is working. The PowerShell command we used was:

New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1 -PropertyType DWORD -Force
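To confirm from Python that the registry value is set before running the workflow, a minimal sketch could look like the following. The function name `long_paths_enabled` is hypothetical and not part of the script; it only reads the same `LongPathsEnabled` value named in the note, and returns `None` on systems other than Windows.

```python
import sys


def long_paths_enabled():
    """Return True/False for the LongPathsEnabled registry value, or None off Windows."""
    if sys.platform != "win32":
        return None  # The setting only exists on Windows.
    import winreg  # Windows-only standard library module.

    key_path = r"SYSTEM\CurrentControlSet\Control\FileSystem"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "LongPathsEnabled")
    except FileNotFoundError:
        return False  # The value is absent, so long paths are not enabled.
    return value == 1


print("LongPathsEnabled:", long_paths_enabled())
```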

Update Local Repository

Update the repository to the latest version:

mgarcia@arda:~/Documents/Work/ArchiveAutomation$ git pull
Already up to date.
mgarcia@arda:~/Documents/Work/ArchiveAutomation$

Start Virtual Environment

Activate the virtual environment:

```
# Windows
PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation> .\venv\Scripts\activate

# Linux
mgarcia@mordor:~/Documents/Work/ArchiveAutomation$ . venv/bin/activate
```

Workflow Input File

The script takes a single argument: a file describing all the information for the workflow. The name and extension of the file are irrelevant; any valid filename is acceptable.

The steps in the workflow are represented by sections of the input file, such as ACCESSION, BAGGER, CLAMAV, etc. The order of the sections does not matter, but none of them can be removed.
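A quick way to catch a missing section before starting a run is to parse the input file and compare its sections against the expected set. The sketch below is an illustration only: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the list of sections is taken from the example input file in this README, so the script itself may check a different set.

```python
import configparser

# Section names as they appear in the example input file (an assumption about
# what the script requires).
REQUIRED_SECTIONS = ("ACCESSION", "BAGGER", "CLAMAV", "DROID", "JHOVE", "JHOVE MODULES")


def missing_sections(config_text):
    """Return the required sections that are absent from the given input text."""
    parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
    parser.read_string(config_text)
    return [name for name in REQUIRED_SECTIONS if not parser.has_section(name)]


# The order of the sections in the file does not affect the result.
sample = "[BAGGER]\n[ACCESSION]\n"
print(missing_sections(sample))
```

An empty list means every expected section is present.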

An example input file:

```
# Configuration file for the digital preservation workflow

[ACCESSION]
accession_id = 000000_0000

[BAGGER]
# You can specify a comma separated list of directories as source: dir1, dir2, ...
sourcedir = C:\Users\joe\Work\boattrippictures, C:\Users\joe\Work\myevent_1

# Using Python ExtendedInterpolation to use the 'accession_id' as target directory
dest_dir = C:\Users\joe\Work\${ACCESSION:accession_id}

# If 'false' the script will stop if the destination directory already exists.

[CLAMAV]
avdir = C:\Program Files\ClamAV
avupdate = freshclam.exe
avclamav = clamscan.exe
avlogsroot = C:\Users\Desktop\john\clamscanlogs\clamAVlog
quarantinedays = 30

# Doesn't actually run the AV command, just prints it.
run_it = no

[DROID]
droiddir = C:\Users\joe\Downloads\droid-binary-6.5.2-bin-win32-with-jre
droidbin = droid.bat

# The profile is the database with the metadata in binary format.
keep_profile = yes

[JHOVE]
jhovedir = C:\Users\joe\Work\jhove
jhovebin = jhove.bat
jhove_xml = yes

[JHOVE MODULES]
AIFF-hul = no
ASCII-hul = no
GIF-hul = no
GZip-kb = no
HTML-hul = no
JPEG-hul = yes
JPEG2000-hul = no
PDF-hul = no
TIFF-hul = no
UTF8-hul = no
WARC-kb = no
WAVE-hul = no
XML-hul = no
```

The directory given in BAGGER:dest_dir must not already exist; otherwise the script will stop. By default the directory is named after the accession number, but the user can customize it, for example: dest_dir = /path/to/my/bag/dir
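The interpolation and the destination check can be sketched with the standard library alone. This is a minimal illustration rather than the script's own code: `split_dirs` and `resolve_dest_dir` are hypothetical helpers, the key names `dest_dir` and `accession_id` follow the spelling used in the text above, and the `/tmp/bags` path is made up for the example.

```python
import configparser
from pathlib import Path


def split_dirs(value):
    """Split a comma separated list of directories, dropping surrounding blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_dest_dir(config_text):
    """Return BAGGER:dest_dir with ${section:key} references expanded."""
    parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
    parser.read_string(config_text)
    # The ${ACCESSION:accession_id} reference is expanded when the value is read.
    return Path(parser["BAGGER"]["dest_dir"])


sample = """
[ACCESSION]
accession_id = 000000_0000

[BAGGER]
dest_dir = /tmp/bags/${ACCESSION:accession_id}
"""

dest = resolve_dest_dir(sample)
print("Destination:", dest)
print("Already exists:", dest.exists())  # The workflow is described as stopping if this is True.
print(split_dirs(r"C:\Users\joe\Work\dir1, C:\Users\joe\Work\dir2"))
```

With `ExtendedInterpolation` a literal `$` in a value has to be written as `$$`, which is worth keeping in mind for unusual directory names.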

Droid and Jhove

The Droid path must not contain any spaces; otherwise the droid.bat script will fail. This means that installing Droid under C:\Program Files will not work.

The sections for Droid and Jhove are very similar. Each has the path to the installation directory, the name of the executable, and one parameter. For Droid, the parameter says whether the Droid profile should be kept after the script runs. For Jhove, it is the option to save the output in XML format.
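Both constraints are easy to check up front. The following sketch assumes the section and key names from the example input file; `path_has_spaces` and `enabled_modules` are hypothetical helpers, not functions of the script. Setting `optionxform` keeps module names such as `JPEG-hul` in their original case, since `configparser` lowercases keys by default.

```python
import configparser


def path_has_spaces(path):
    """Return True when the path contains a space, which droid.bat cannot handle."""
    return " " in path


def enabled_modules(config_text):
    """Return the names in [JHOVE MODULES] whose value is a true boolean (yes/on/true/1)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # Preserve the case of the module names.
    parser.read_string(config_text)
    section = parser["JHOVE MODULES"]
    return [name for name in section if section.getboolean(name)]


sample = r"""
[DROID]
droiddir = C:\Program Files\droid
keep_profile = yes

[JHOVE MODULES]
GIF-hul = no
JPEG-hul = yes
PDF-hul = yes
"""

parser = configparser.ConfigParser()
parser.read_string(sample)
print(path_has_spaces(parser["DROID"]["droiddir"]))  # True: this location would break droid.bat
print(parser["DROID"].getboolean("keep_profile"))
print(enabled_modules(sample))
```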

Running the Script

Once the input file is ready, simply call the script with the input file as parameter.

(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation> python archiveautomation.py .\my_accession.cfg
Have a nice day.
(venv) mgarcia@wsl2:~/Documents/Work/ArchiveAutomation$

The script prints a help message when no input file is provided:

```
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$ archiveautomation
Usage: archiveautomation [OPTIONS] INPUT
Try 'archiveautomation --help' for help.

Error: Missing argument 'INPUT'.
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$
```

The script can also be called with the --help option:

```
(venv) mgarcia@arda:~/Documents/Work/ArchiveAutomation$ archiveautomation --help
Usage: archiveautomation [OPTIONS] INPUT

Automate digital preservation workflow.

From INPUT creates a BagIt directory, and DC core complaint file.
(...)
```

Configuration

Clone the repository

Clone the repository

PS C:\Users\garcm0b\Work> git clone https://github.com/kaust-library/ArchiveAutomation.git

Virtual Environment and Dependencies

Create a virtual environment for the project

```
# Windows
PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation> python -m venv venv

# Linux
mgarcia@mordor:~/Documents/Work/ArchiveAutomation$ python3 -m venv venv
```

Set up the environment:

pip install --editable .

This will install all dependencies listed in the section install_requires of the setup.py file.

Configuration File

The configuration details for the script are in the file etc/archiveautomation.cfg. After cloning the repository, the configuration file is only a template (with an example extension): edit it with the correct values and save it as archiveautomation.cfg.

Currently the configuration file only holds the settings for the ArchivEra API:

```
# Define the parameters for ArchivEra API

[API]
url = 'https://path.to.archive/public'
clientid = 'apiclient'
grant_type = 'my grant type'
username = 'my API user'
database = 'my database'
```
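One detail worth noting when reading this file with Python's `configparser`: quotes around a value are kept as part of the value. How the script deals with them is not shown here, so the sketch below is only one possible approach, with `load_api_settings` as a hypothetical helper that strips the surrounding quotes.

```python
import configparser


def load_api_settings(config_text):
    """Return the [API] section as a dict with surrounding quotes removed."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # configparser returns "'apiclient'" (quotes included), so strip them here.
    return {key: value.strip("'\"") for key, value in parser["API"].items()}


sample = """
[API]
url = 'https://path.to.archive/public'
clientid = 'apiclient'
grant_type = 'my grant type'
"""

for key, value in load_api_settings(sample).items():
    print(f"{key}: {value}")
```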

ArchivEra API Password

The API password can be provided in two ways: as an environment variable, or via a .env file. For the first case, set the password according to your operating system:

```
# Windows
(venv) PS C:\Users\garcm0b\OneDrive - KAUST\Documents\Work\ArchiveAutomation\src> $ENV:ARCHIVERA_API_PW='hello_mg'

# Linux
(venv) mgarcia@mordor:~/Documents/Work/ArchiveAutomation/src$ export ARCHIVERA_API_PW="hello"
```

The second way of providing the password is via a .env file in the same directory as the main program. The file holds a simple key=value pair:

(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation> cat .env
ARCHIVERA_API_PW='hello_world'
(venv) PS C:\Users\garcm0b\Work\ArchiveAutomation>
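The lookup order can be illustrated with a standard-library sketch. The project lists python-dotenv as a dependency, which would normally load the .env file; the small parser below is only a stand-in so the example runs without it. `read_dotenv` and `get_api_password` are hypothetical names, and the variable name is the one shown in the .env example. The environment variable is checked first, which matches python-dotenv's default of not overriding variables that are already set.

```python
import os
from pathlib import Path

ENV_NAME = "ARCHIVERA_API_PW"  # Name taken from the .env example above.


def read_dotenv(path):
    """Parse simple KEY=value lines (a stand-in for python-dotenv)."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # Skip blanks, comments, and malformed lines.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_api_password(dotenv_path=".env"):
    """Return the password from the environment, else from the .env file, else None."""
    password = os.environ.get(ENV_NAME)
    if password:
        return password
    if Path(dotenv_path).is_file():
        return read_dotenv(dotenv_path).get(ENV_NAME)
    return None


# Report only whether a password was found, never the value itself.
print("Password configured:", get_api_password() is not None)
```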

Owner

  • Name: KAUST University Library
  • Login: kaust-library
  • Kind: organization
  • Location: Saudi Arabia

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Automate the digital preservation workflow
doi: 10.5281/zenodo.7052841
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcelo
    family-names: Garcia
    email: marcelo.garcia@kaust.edu.sa
    affiliation: >-
      King Abdullah University of Science and
      Technology
    orcid: 'https://orcid.org/0000-0002-2927-2371'
  - given-names: Eamon
    family-names: Smallwood
    email: eamon.smallwood@kaust.edu.sa
    affiliation: >-
      King Abdullah University of Science and
      Technology


Dependencies

requirements.txt pypi
  • bagit ==1.8.1
  • certifi ==2021.10.8
  • charset-normalizer ==2.0.7
  • dcxml ==0.1.2
  • idna ==3.3
  • importlib-metadata ==4.8.1
  • keyring ==23.2.1
  • lxml ==4.6.4
  • python-dotenv ==0.19.2
  • pywin32-ctypes ==0.2.0
  • requests ==2.26.0
  • urllib3 ==1.26.7
  • zipp ==3.6.0
setup.py pypi
  • Click *
  • bagit *
  • dcxml *
  • python-dotenv *
  • requests *