https://github.com/acdh-oeaw/arche-ingest

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 6 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary

Keywords

arche
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: acdh-oeaw
  • License: mit
  • Language: PHP
  • Default Branch: master
  • Homepage:
  • Size: 109 KB
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 1
  • Open Issues: 2
  • Releases: 46
Topics
arche
Created almost 6 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

A collection of ARCHE ingestion script templates

The REST API provided by ARCHE is quite low-level from the point of view of real-world data ingestion. To make ingestion simpler, the arche-lib-ingest library has been developed. While it provides a convenient high-level data ingestion API, it is still only a library, which requires you to write your own ingestion script.

This repository aims to close this gap: it provides a set of data ingestion scripts (built on top of arche-lib-ingest) which can be used by people with almost no programming skills.

Scripts provided

There are two script variants provided:

  • Console script variant, where parameters are passed through the command line.
    The benefit of this variant is ease of use, especially in CI/CD workflows.
    • bin/arche-import-metadata imports metadata from an RDF file
    • bin/arche-import-binary (re)ingests a single resource's binary content (to be used when file name and/or location changed)
    • bin/arche-delete-resource removes a given repository resource (allows recursion, etc.)
    • bin/arche-delete-triples removes metadata triples specified in the ttl file (but doesn't remove repository resources)
    • bin/arche-update-redmine updates a Redmine issue describing the data curation/ingestion process (see a dedicated section at the bottom of the README)
  • Template variant, where you adjust execution parameters and/or the way the script works by editing its content.
    The benefit of this variant is that the adjusted script can serve as documentation of the ingestion process and/or be tailored to your particular needs.
    • add_metadata_sample.php adds metadata triples specified in the ttl file preserving all existing metadata of repository resources
    • delete_metadata_sample.php removes metadata triples specified in the ttl file (but doesn't remove repository resources)
    • delete_resource_sample.php removes a given repository resource (allows recursion, etc.)
    • import_binary_sample.php imports binary data from the disk
    • import_metadata_sample.php imports metadata from an RDF file
    • reimport_single_binary.php reingests a single resource's binary content (to be used when file name and/or location changed)

Installation & Usage

Runtime environment

You need PHP and Composer.

You can also use the acdhch/arche-ingest Docker image (the {pathToDirectoryWithFilesToIngest} will be available at the /data location inside the Docker container):

```bash
docker run \
  --rm \
  -ti \
  --name arche-ingest \
  -v {pathToDirectoryWithFilesToIngest}:/data \
  acdhch/arche-ingest
```

Console script variant

  • Install with: composer require acdh-oeaw/arche-ingest
  • Update regularly with: composer update --no-dev
  • Run with: vendor/bin/{scriptOfYourChoice} {parametersGoHere}, e.g. vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
    • To get the list of available parameters, run vendor/bin/{scriptOfYourChoice} --help, e.g. vendor/bin/arche-import-metadata --help
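
The console call above can be wrapped so that credentials come from the environment instead of being typed on every run. What follows is only a minimal sketch: the function name and the ARCHE_LOGIN, ARCHE_PASSWORD, ARCHE_CONCURRENCY and ARCHE_IMPORT_BIN variables are assumptions made for illustration and are not part of arche-ingest. Only the --concurrency option and the argument order (RDF file, API URL, login, password) are taken from the example above.

```shell
# Hypothetical wrapper around vendor/bin/arche-import-metadata.
# All environment variable names below are illustrative assumptions.
arche_import_metadata() {
  rdf_file="$1"
  api_url="$2"
  # Refuse to run without credentials in the environment.
  if [ -z "${ARCHE_LOGIN:-}" ] || [ -z "${ARCHE_PASSWORD:-}" ]; then
    echo "ARCHE_LOGIN and ARCHE_PASSWORD must be set" >&2
    return 1
  fi
  # Fail early if the RDF file is missing.
  if [ ! -f "$rdf_file" ]; then
    echo "RDF file not found: $rdf_file" >&2
    return 1
  fi
  # Same argument order as in the example above:
  # --concurrency N rdfFile apiUrl login password
  "${ARCHE_IMPORT_BIN:-vendor/bin/arche-import-metadata}" \
    --concurrency "${ARCHE_CONCURRENCY:-4}" \
    "$rdf_file" "$api_url" "$ARCHE_LOGIN" "$ARCHE_PASSWORD"
}
```

With ARCHE_LOGIN and ARCHE_PASSWORD exported, a call would look like arche_import_metadata myRdf.ttl https://arche.acdh.oeaw.ac.at/api. Pointing ARCHE_IMPORT_BIN at a stub command is one way to check the assembled arguments without contacting a repository.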

Running inside GitHub Actions

Do not store your ARCHE credentials in the workflow configuration file. Use repository secrets instead (see example below).

A fragment of your workflow's YAML config may look like this:

```yaml
- name: ingestion dependencies
  run: |
    composer require acdh-oeaw/arche-ingest
- name: ingest arche
  run: |
    vendor/bin/arche-import-metadata myRdfFile.ttl https://arche-curation.acdh-dev.oeaw.ac.at/api ${{secrets.ARCHE_LOGIN}} ${{secrets.ARCHE_PASSWORD}}
    vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} https://redmine.acdh.oeaw.ac.at 1234 'Upload AIP to Curation Instance (Minerva)'
```

Running on ACDH Cluster

First, get the arche-ingestion workload console as described here

Then:

  • Run screen -S mySessionName
  • Go to your ingestion directory
  • Run scripts using {scriptName}, e.g. arche-import-metadata myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
  • If the script takes a long time to run, you can safely leave the console by pressing CTRL+a followed by d and then typing exit.
    • To get back to the script, log in again to repo-ingestion@hephaistos and run screen -r mySessionName

Template variant

  • Clone this repository.
  • Run composer update --no-dev
  • Adjust the script of your choice.
    • Available parameters are provided at the beginning of the script.
    • Don't adjust anything below the "// NO CHANGES NEEDED BELOW THIS LINE" line unless you consider yourself a programmer and want to change the way the script works.
  • Run the script with php -f {scriptOfYourChoice}
    • You can consider reading input from a file and/or saving output to a log file, e.g. with: php -f import_metadata_sample.php < inputData 2>&1 | tee logFile (see the section below for hints on the input file format)
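
One caveat with the tee pipeline above is that the shell reports the exit status of tee, not of the script. The following sketch, with a hypothetical run_logged helper that is not part of this repository, keeps the log file while still returning the exit status of the wrapped command. It uses only POSIX shell features.

```shell
# Hypothetical helper: run a command with stdin taken from a file,
# copy stdout and stderr to a log file, and return the command's exit status.
run_logged() {
  input_file="$1"
  log_file="$2"
  shift 2
  rc_file="$(mktemp)"
  # The exit status is stored in a temporary file because the status
  # of a pipeline is the status of its last command (tee).
  { "$@" < "$input_file" 2>&1; echo "$?" > "$rc_file"; } | tee "$log_file"
  rc="$(cat "$rc_file")"
  rm -f "$rc_file"
  return "$rc"
}
```

For example, run_logged inputData logFile php -f import_metadata_sample.php mirrors the command above but returns a non-zero status when the script fails.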

Long runs

If you are performing time-consuming operations, e.g. a large data ingestion, consider running the scripts in a way that they won't stop when you turn your computer off.

You can use nohup or screen for that, e.g.:

  • nohup - run with:
    ```bash
    # console script variant
    nohup vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword > logFile 2>&1 &
    # template variant
    nohup php -f import_metadata_sample.php < input > logFile 2>&1 &
    ```
    • If you want to run template script variants that way, you have to prepare the input data file.
      It should look as follows:
      ```
      {arche instance API URL}
      yes
      {login}
      {password}
      ```
      e.g.
      ```
      https://arche-dev.acdh-dev.oeaw.ac.at
      yes
      myLogin
      myPassword
      ```
  • screen
    • Start a screen session with screen -S mySessionName
    • Then run your commands as usual.
    • Hit CTRL+a followed by d to leave the screen session.
    • You can get back to the screen session with screen -r mySessionName
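
The input data file for the template variant can be generated rather than typed by hand. This is a minimal sketch with a hypothetical write_arche_input helper. It assumes that the four values go on separate lines in the order listed above, and it copies the literal yes from the example without interpreting it.

```shell
# Hypothetical helper: write the input file for a template script.
# $1 = output path, $2 = ARCHE API URL, $3 = login, $4 = password
write_arche_input() {
  (
    # Restrictive umask so that a newly created file is readable only by its owner.
    umask 077
    # One value per line, in the order given in the list above.
    printf '%s\n' "$2" yes "$3" "$4" > "$1"
  )
}
```

After write_arche_input input https://arche-dev.acdh-dev.oeaw.ac.at myLogin myPassword, the file can be passed to a template script with nohup php -f import_metadata_sample.php < input > logFile 2>&1 &.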

Reporting errors

Create a subtask of the Redmine issue #17641.

  • Provide the exact location of the ingestion script (including the script file itself) and any other information which may be required to replicate the problem.
  • Assign Mateusz and Norbert as watchers.

Using arche-update-redmine in a GitHub workflow

The basic idea is to execute data processing steps in the following way:

  • note down the step name so we can read it in case of a failure
  • perform the step
  • call arche-update-redmine

and have a separate on-failure job step which makes an arche-update-redmine call noting the failure.

Remarks:

  • As a good practice we should include the GitHub job URL in the Redmine issue note. For that we set up a dedicated environment variable.
  • It goes without saying Redmine access credentials are stored as a repository secret.
  • The way you store the main Redmine issue ID doesn't matter as it's not secret. Do it any way you want (here we just hardcode it in the workflow using an environment variable).

```yaml
name: sample

on:
  push: ~

jobs:
  dockerhub:
    runs-on: ubuntu-latest
    env:
      REDMINE_ID: 21085
    steps:
      - uses: actions/checkout@v4
      - name: init
        run: |
          composer require acdh-oeaw/arche-ingest
          echo "RUNURL=$GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID" >> $GITHUB_ENV
      - name: virus scan
        run: |
          echo 'STEP=Virus Scan' >> $GITHUB_ENV
          # ...perform the virus scan...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUNURL" $REDMINE_ID 'Virus scan'
      - name: repo-filechecker
        run: |
          echo 'STEP=Run repo-file-checker' >> $GITHUB_ENV
          # ...run the repo-filechecker...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUNURL" $REDMINE_ID 'Run repo-file-checker'
      - name: check3
        run: |
          echo 'STEP=Upload AIP to Curation Instance (Minerva)' >> $GITHUB_ENV
          # ...perform the ingestion...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUNURL" $REDMINE_ID 'Upload AIP to Curation Instance (Minerva)'
      - name: on failure
        if: ${{ failure() }}
        run: |
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUNURL" --statusCode 1 $REDMINE_ID "$STEP"
```
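
Outside of GitHub Actions, the same "perform the step, then report it" idea can be approximated in a plain shell script. The sketch below is a variation, not the workflow-level on-failure step shown above: run_step and REDMINE_NOTIFY are hypothetical names, and REDMINE_TOKEN, RUNURL and REDMINE_ID are assumed to be set in the environment. Only options that appear in the workflow (--token, --append, --statusCode 1) are used.

```shell
# Hypothetical helper: run one processing step and report the outcome to Redmine.
# REDMINE_NOTIFY defaults to the real script and can be overridden for dry runs.
run_step() {
  step_name="$1"
  shift
  notify="${REDMINE_NOTIFY:-vendor/bin/arche-update-redmine}"
  if "$@"; then
    # Step succeeded: same call as in the regular workflow steps.
    "$notify" --token "$REDMINE_TOKEN" --append "$RUNURL" "$REDMINE_ID" "$step_name"
  else
    # Step failed: same call as in the on-failure step, then propagate the failure.
    "$notify" --token "$REDMINE_TOKEN" --append "$RUNURL" --statusCode 1 "$REDMINE_ID" "$step_name"
    return 1
  fi
}
```

A call such as run_step 'Virus scan' ./scan.sh (where scan.sh is a placeholder for your own step) reports success or failure for that step and returns non-zero on failure, so a calling script can stop at the first failed step.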

Owner

  • Name: Austrian Centre for Digital Humanities & Cultural Heritage
  • Login: acdh-oeaw
  • Kind: organization
  • Email: acdh@oeaw.ac.at
  • Location: Vienna, Austria

GitHub Events

Total
  • Release event: 19
  • Delete event: 2
  • Push event: 15
  • Create event: 18
Last Year
  • Release event: 19
  • Delete event: 2
  • Push event: 15
  • Create event: 18

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 79
  • Total Committers: 6
  • Avg Commits per committer: 13.167
  • Development Distribution Score (DDS): 0.316
Top Committers
Name Email Commits
Mateusz Żółtak z****k@z****g 54
Mateusz Żółtak m****k@o****t 18
Peter Andorfer p****r@o****t 3
Martina b****s@u****m 2
aureon249 3****9@u****m 1
bellerophons-pegasus b****s@y****e 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 9
  • Total pull requests: 0
  • Average time to close issues: 19 days
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 1.11
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zozlak (3)
  • csae8092 (3)
  • bellerophons-pegasus (2)
  • fsanzl (1)
Pull Request Authors
Top Labels
Issue Labels
enhancement (2) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • packagist 819 total
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 50
  • Total maintainers: 1
packagist.org: acdh-oeaw/arche-ingest

A set of sample ARCHE ingestion scripts

  • Versions: 50
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 819 Total
Rankings
Dependent packages count: 19.1%
Forks count: 24.6%
Average: 28.2%
Downloads: 31.4%
Stargazers count: 32.6%
Dependent repos count: 33.5%
Maintainers (1)
Funding
Last synced: 6 months ago

Dependencies

composer.json packagist
  • phpstan/phpstan * development
  • acdh-oeaw/arche-lib >=4.3 <6
  • acdh-oeaw/arche-lib-ingest ^3.1
  • zozlak/argparse ^1
.github/workflows/build.yml actions
  • actions/checkout v3 composite
  • docker/login-action v2 composite
docker/Dockerfile docker
  • php 8.1 build