https://github.com/acdh-oeaw/arche-ingest
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails (2 of 6 committers, 33.3%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 13.3%, to scientific vocabulary)
Repository
Statistics
- Stars: 0
- Watchers: 5
- Forks: 1
- Open Issues: 2
- Releases: 46
Topics
Metadata Files
README.md
A collection of ARCHE ingestion script templates
The REST API provided by ARCHE is quite low-level from the point of view of real-world data ingestions. To make ingestions simpler, the arche-lib-ingest library has been developed. While it provides a convenient high-level data ingestion API, it is still only a library, which requires you to write your own ingestion script.
This repository aims to close this gap: it provides a set of data ingestion scripts (built on top of arche-lib-ingest) which can be used by people with almost no programming skills.
Scripts provided
There are two script variants provided:
- Console scripts variant, where parameters are passed through the command line.
  The benefit of this variant is ease of use, especially in CI/CD workflows.
  - `bin/arche-import-metadata` imports metadata from an RDF file
  - `bin/arche-import-binary` (re)ingests a single resource's binary content (to be used when the file name and/or location changed)
  - `bin/arche-delete-resource` removes a given repository resource (allows recursion, etc.)
  - `bin/arche-delete-triples` removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - `bin/arche-update-redmine` updates a Redmine issue describing the data curation/ingestion process (see a dedicated section at the bottom of the README)
- Template variant, where you adjust execution parameters and/or the way the script works by editing its content.
  The benefit of this variant is that it lets you treat the adjusted script as documentation of the ingestion process and/or adjust it to your particular needs.
  - `add_metadata_sample.php` adds metadata triples specified in the ttl file, preserving all existing metadata of repository resources
  - `delete_metadata_sample.php` removes metadata triples specified in the ttl file (but doesn't remove repository resources)
  - `delete_resource_sample.php` removes a given repository resource (allows recursion, etc.)
  - `import_binary_sample.php` imports binary data from the disk
  - `import_metadata_sample.php` imports metadata from an RDF file
  - `reimport_single_binary.php` reingests a single resource's binary content (to be used when the file name and/or location changed)
Installation & Usage
Runtime environment
You can also use the acdhch/arche-ingest Docker image
(the {pathToDirectoryWithFilesToIngest} will be available at the /data location inside the Docker container):
```bash
docker run \
    --rm \
    -ti \
    --name arche-ingest \
    -v {pathToDirectoryWithFilesToIngest}:/data \
    acdhch/arche-ingest
```
Console script variant
- Install with:
  ```bash
  composer require acdh-oeaw/arche-ingest
  ```
- Update regularly with:
  ```bash
  composer update --no-dev
  ```
- Run with:
  ```bash
  vendor/bin/{scriptOfYourChoice} {parametersGoHere}
  ```
  e.g.
  ```bash
  vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
  ```
  - To get the list of available parameters run
    ```bash
    vendor/bin/{scriptOfYourChoice} --help
    ```
    e.g.
    ```bash
    vendor/bin/arche-import-metadata --help
    ```
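When a console script is called from another shell script or a CI job, it can help to fail early if the credentials or the input file are missing. The following is a minimal sketch, assuming the credentials are kept in the `ARCHE_LOGIN` and `ARCHE_PASSWORD` environment variables. The function name `import_metadata` and the `DRY_RUN` switch are hypothetical and not part of arche-ingest; only the final `vendor/bin/arche-import-metadata` call follows the argument order shown above.

```bash
#!/bin/sh
# Hypothetical wrapper around vendor/bin/arche-import-metadata.
# Usage: import_metadata RDF_FILE API_URL
# Assumes ARCHE_LOGIN and ARCHE_PASSWORD are set in the environment.
import_metadata() {
    rdf_file=$1
    api_url=$2
    if [ -z "${ARCHE_LOGIN:-}" ] || [ -z "${ARCHE_PASSWORD:-}" ]; then
        echo "ARCHE_LOGIN and ARCHE_PASSWORD must be set" >&2
        return 2
    fi
    if [ ! -f "$rdf_file" ]; then
        echo "RDF file not found: $rdf_file" >&2
        return 3
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command (with the password masked) instead of running it.
        echo "vendor/bin/arche-import-metadata $rdf_file $api_url $ARCHE_LOGIN ***"
        return 0
    fi
    vendor/bin/arche-import-metadata "$rdf_file" "$api_url" "$ARCHE_LOGIN" "$ARCHE_PASSWORD"
}
```

With `DRY_RUN=1` set, the function only prints the command it would run, which is a cheap way to check the parameters before a real ingestion.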
Running inside GitHub Actions
Do not store your ARCHE credentials in the workflow configuration file. Use repository secrets instead (see example below).
A fragment of your workflow's yaml config may look like this:
```yaml
- name: ingestion dependencies
  run: |
    composer require acdh-oeaw/arche-ingest
- name: ingest arche
  run: |
    vendor/bin/arche-import-metadata myRdfFile.ttl https://arche-curation.acdh-dev.oeaw.ac.at/api ${{secrets.ARCHE_LOGIN}} ${{secrets.ARCHE_PASSWORD}}
    vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} https://redmine.acdh.oeaw.ac.at 1234 'Upload AIP to Curation Instance (Minerva)'
```
Running on ACDH Cluster
First, get the arche-ingestion workload console as described here
Then:
- Run `screen -S mySessionName`
- Go to your ingestion directory
- Run scripts using `{scriptName}`, e.g.
  ```bash
  arche-import-metadata myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword
  ```
- If the script will take long to run, you may safely quit the console with `CTRL+a+d` followed by `exit`.
  - To get back to the script, log in again to `repo-ingestion@hephaistos` and run
    ```bash
    screen -r mySessionName
    ```
Template variant
- Clone this repository.
- Run
  ```bash
  composer update --no-dev
  ```
- Adjust the script of your choice.
  - Available parameters are provided at the beginning of the script.
  - Don't adjust anything below the
    ```php
    // NO CHANGES NEEDED BELOW THIS LINE
    ```
    line unless you consider yourself a programmer and would like to change the way the script works.
- Run the script with
  ```bash
  php -f {scriptOfYourChoice}
  ```
  - You can consider reading input from a file and/or saving output to a log file, e.g. with:
    ```bash
    php -f import_metadata_sample.php < inputData 2>&1 | tee logFile
    ```
    (see the section below for hints on the input file format)
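One caveat with the `| tee logFile` pipeline: in a plain POSIX shell the pipeline reports the exit status of `tee`, not of the PHP script, so a failed ingestion can look successful to a calling script. A minimal sketch of a workaround in portable shell follows; `run_logged` is a hypothetical helper name, not part of this repository.

```bash
#!/bin/sh
# Hypothetical helper: run a command, copy its output to a log file
# and return the command's own exit status rather than tee's.
# Usage: run_logged LOG_FILE COMMAND [ARGS...]
run_logged() {
    log_file=$1
    shift
    status_file="$log_file.status"
    # The command's exit status is written to a side file because
    # the pipeline itself only reports the status of tee.
    {
        rc=0
        "$@" 2>&1 || rc=$?
        echo "$rc" > "$status_file"
    } | tee "$log_file"
    status=$(cat "$status_file")
    rm -f "$status_file"
    return "$status"
}

# Example (template variant), reading the answers from a file:
# run_logged logFile php -f import_metadata_sample.php < inputData
```

In bash, `set -o pipefail` is a shorter alternative; the side-file approach is used here only so the sketch also works in shells without that option.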
Long runs
If you are performing time-consuming operations, e.g. a large data ingestion, you may consider running the scripts so that they keep running after you close your terminal or disconnect from the machine they run on.
You can use nohup or screen for that, e.g.:
- nohup - run with:
  ```bash
  # console script variant
  nohup vendor/bin/arche-import-metadata --concurrency 4 myRdf.ttl https://arche.acdh.oeaw.ac.at/api myLogin myPassword > logFile 2>&1 &
  # template variant
  nohup php -f import_metadata_sample.php < input > logFile 2>&1 &
  ```
  - If you want to run template script variants that way, you have to prepare the input data file.
    It should look as follows:
    ```
    {arche instance API URL}
    yes
    {login}
    {password}
    ```
    e.g.
    ```
    https://arche-dev.acdh-dev.oeaw.ac.at
    yes
    myLogin
    myPassword
    ```
- screen
  - Start a `screen` session with
    ```bash
    screen -S mySessionName
    ```
  - Then run your commands as usual.
  - Hit `CTRL+a` followed by `d` to leave the `screen` session.
  - You can get back to the `screen` session with
    ```bash
    screen -r mySessionName
    ```
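The input data file for the template variant can also be generated rather than typed by hand. The following is a minimal sketch, assuming the template script reads one answer per line in the order given above (API URL, the literal `yes`, login, password); `write_input_file` is a hypothetical helper name.

```bash
#!/bin/sh
# Hypothetical helper: print the answers expected by a template script,
# one per line (assumed format), in the order given in this README.
# Usage: write_input_file API_URL LOGIN PASSWORD > input
write_input_file() {
    if [ "$#" -ne 3 ]; then
        echo "usage: write_input_file API_URL LOGIN PASSWORD" >&2
        return 1
    fi
    printf '%s\n' "$1" yes "$2" "$3"
}

# Example: create the file readable only by its owner, as it holds a password,
# then start the template script in the background.
# (umask 077; write_input_file https://arche-dev.acdh-dev.oeaw.ac.at myLogin myPassword > input)
# nohup php -f import_metadata_sample.php < input > logFile 2>&1 &
```

Since the file contains a password in plain text, it is worth deleting it once the ingestion has finished.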
Reporting errors
Create a subtask of the Redmine issue #17641.
- Provide the exact location of the ingestion script (including the script file itself) and any other information which may be required to replicate the problem.
- Assign Mateusz and Norbert as watchers.
Using arche-update-redmine in a GitHub workflow
The basic idea is to execute data processing steps in the following way:
- note down the step name so we can read it in case of a failure
- perform the step
- call the arche-update-redmine
and to have a separate on-failure job step which makes an arche-update-redmine call noting the failure.
Remarks:
- As a good practice we should include the GitHub job URL in the Redmine issue note. For that we set up a dedicated environment variable.
- It goes without saying Redmine access credentials are stored as a repository secret.
- The way you store the main Redmine issue ID doesn't matter as it's not secret. Do it any way you want (here we just hardcode it in the workflow using an environment variable).
```yaml
name: sample
on:
  push: ~
jobs:
  dockerhub:
    runs-on: ubuntu-latest
    env:
      REDMINE_ID: 21085
    steps:
      - uses: actions/checkout@v4
      - name: init
        run: |
          composer require acdh-oeaw/arche-ingest
          echo "RUN_URL=$GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID" >> $GITHUB_ENV
      - name: virus scan
        run: |
          echo 'STEP=Virus Scan' >> $GITHUB_ENV
          # ...perform the virus scan...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Virus scan'
      - name: repo-filechecker
        run: |
          echo 'STEP=Run repo-file-checker' >> $GITHUB_ENV
          # ...run the repo-filechecker...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Run repo-file-checker'
      - name: check3
        run: |
          echo 'STEP=Upload AIP to Curation Instance (Minerva)' >> $GITHUB_ENV
          # ...perform the ingestion...
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" $REDMINE_ID 'Upload AIP to Curation Instance (Minerva)'
      - name: on failure
        if: ${{ failure() }}
        run: |
          vendor/bin/arche-update-redmine --token ${{ secrets.REDMINE_TOKEN }} --append "$RUN_URL" --statusCode 1 $REDMINE_ID "$STEP"
```
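The same pattern (note the step, run it, report the outcome) can be sketched outside of GitHub Actions as a plain shell function. This is only an illustration: `run_step` and `report` are hypothetical names, `report` merely prints the `arche-update-redmine` call it would make (using the options that appear in the workflow above, with the token masked), and the `REDMINE_ID` and `RUN_URL` variables are assumed to be set by the caller.

```bash
#!/bin/sh
# Hypothetical reporter: prints the arche-update-redmine call instead of
# executing it, so the sketch can be tried without Redmine access.
# Usage: report STEP_NAME [STATUS_CODE]
report() {
    if [ -n "${2:-}" ]; then
        echo "vendor/bin/arche-update-redmine --token *** --append $RUN_URL --statusCode $2 $REDMINE_ID '$1'"
    else
        echo "vendor/bin/arche-update-redmine --token *** --append $RUN_URL $REDMINE_ID '$1'"
    fi
}

# Run one processing step and report its outcome.
# Usage: run_step STEP_NAME COMMAND [ARGS...]
run_step() {
    step=$1
    shift
    if "$@"; then
        report "$step"
    else
        # Mirror the workflow's on-failure step: report with status code 1.
        report "$step" 1
        return 1
    fi
}
```

For example, `run_step 'Virus scan' true` prints the success call, while `run_step 'Virus scan' false` prints the call with `--statusCode 1` and returns a non-zero status so that a calling script can stop.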
Owner
- Name: Austrian Centre for Digital Humanities & Cultural Heritage
- Login: acdh-oeaw
- Kind: organization
- Email: acdh@oeaw.ac.at
- Location: Vienna, Austria
- Website: https://www.oeaw.ac.at/acdh
- Repositories: 476
- Profile: https://github.com/acdh-oeaw
GitHub Events
Total
- Release event: 19
- Delete event: 2
- Push event: 15
- Create event: 18
Last Year
- Release event: 19
- Delete event: 2
- Push event: 15
- Create event: 18
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 79
- Total Committers: 6
- Avg Commits per committer: 13.167
- Development Distribution Score (DDS): 0.316
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mateusz Żółtak | z****k@z****g | 54 |
| Mateusz Żółtak | m****k@o****t | 18 |
| Peter Andorfer | p****r@o****t | 3 |
| Martina | b****s@u****m | 2 |
| aureon249 | 3****9@u****m | 1 |
| bellerophons-pegasus | b****s@y****e | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 9
- Total pull requests: 0
- Average time to close issues: 19 days
- Average time to close pull requests: N/A
- Total issue authors: 4
- Total pull request authors: 0
- Average comments per issue: 1.11
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- zozlak (3)
- csae8092 (3)
- bellerophons-pegasus (2)
- fsanzl (1)
Packages
- Total packages: 1
- Total downloads:
  - packagist: 819 total
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 50
- Total maintainers: 1
packagist.org: acdh-oeaw/arche-ingest
A set of sample ARCHE ingestion scripts
- Homepage: https://github.com/acdh-oeaw/arche-ingest
- License: MIT
- Latest release: 3.0.1 (published about 4 years ago)
Dependencies
- phpstan/phpstan * development
- acdh-oeaw/arche-lib >=4.3 <6
- acdh-oeaw/arche-lib-ingest ^3.1
- zozlak/argparse ^1
- actions/checkout v3 composite
- docker/login-action v2 composite
- php 8.1 build