https://github.com/datacite/shiba-inu
Pipeline for DOI Resolution Logs processing
Repository
Pipeline for DOI Resolution Logs processing
Basic Info
- Host: GitHub
- Owner: datacite
- License: mit
- Language: Ruby
- Default Branch: master
- Size: 256 KB
Statistics
- Stars: 5
- Watchers: 5
- Forks: 6
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Pipeline for DOI Resolution Logs processing
Shiba-Inu is a pipeline for processing DOI resolution logs. It processes the logs following the Code of Practice for Research Data Usage Metrics. It is based on Logstash.

Installation
Requirements
- An Elasticsearch instance
- Single line logs with DOI names.
You can run the logs processor using Docker. You will need to set the following environment variables:
```
ESHOST=http://elasticsearch:9200
ESINDEX=resolutions
INPUTDIR=/usr/share/logstash/tmp/DataCite-access.log-201805
OUTPUTDIR=/usr/share/logstash/tmp/output.json
LOGSTASH_HOST=localhost:9600
S3MERGEDLOGSBUCKET=/usr/share/logstash/monthlylogs
S3RESOLUTIONLOGSBUCKET=/usr/share/logstash/
ELASTICPASSWORD=changeme
LOGS_TAG=[Resolution Logs]
HUBTOKEN=eyJhbGciOiJSUzI1NiJ9
HUBURL=https://api.test.datacite.org
```
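Before starting the container it can help to confirm that none of these variables was left out. The following is a minimal sketch of such a check, not part of shiba-inu: the constant `REQUIRED_VARS` and the helper `missing_vars` are hypothetical names, and treating every variable in the list as mandatory is an assumption.

```ruby
# Hypothetical pre-flight check. The variable names are copied from the list
# above; treating all of them as required is an assumption.
REQUIRED_VARS = %w[
  ESHOST ESINDEX INPUTDIR OUTPUTDIR LOGSTASH_HOST
  S3MERGEDLOGSBUCKET S3RESOLUTIONLOGSBUCKET
  ELASTICPASSWORD LOGS_TAG HUBTOKEN HUBURL
].freeze

# Returns the names of the variables that are unset or blank in `env`.
# `env` can be ENV or any Hash with string keys.
def missing_vars(env = ENV)
  REQUIRED_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_vars
if missing.empty?
  puts "All #{REQUIRED_VARS.size} variables are set."
else
  puts "Missing environment variables: #{missing.join(', ')}"
end
```

Passing a plain Hash instead of `ENV` makes the check easy to try out without touching the real environment.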
and run the container like this:
docker run -p 8090:9200 datacite/shiba-inu
Alternatively, you can use docker-compose to run the log processor without an existing Elasticsearch instance:
docker-compose up
Usage logs
Your logs need to fulfill two requirements:
- The logs must be single-line logs.
- Each log line MUST include the following data:
  - doi => DOI name
  - occurred_at => timestamp (ISO 8601)
  - clientip => IP address (IPv4 or IPv6)
  - user_agent => user agent
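The requirements above can be sketched as a small check on an already-parsed log event. This is only an illustration under stated assumptions: `REQUIRED_FIELDS` and `event_problems` are hypothetical names, the event is assumed to be a Hash with string keys and string values, and the DOI test is a loose sanity check rather than a full DOI syntax validation.

```ruby
require "ipaddr"
require "time"

# Field names taken from the list above.
REQUIRED_FIELDS = %w[doi occurred_at clientip user_agent].freeze

# Returns a list of problems found in `event`; an empty list means it passes.
def event_problems(event)
  problems = REQUIRED_FIELDS.reject { |f| event.key?(f) }.map { |f| "missing #{f}" }
  return problems unless problems.empty?

  begin
    Time.iso8601(event["occurred_at"].to_s)
  rescue ArgumentError
    problems << "occurred_at is not ISO 8601"
  end

  begin
    IPAddr.new(event["clientip"].to_s) # accepts IPv4 and IPv6
  rescue IPAddr::Error
    problems << "clientip is not a valid IPv4 or IPv6 address"
  end

  # Loose check (assumption): "10.", a registrant code, a slash, a suffix.
  unless event["doi"].to_s.match?(%r{\A10\.[^/\s]+/\S+\z})
    problems << "doi does not look like a DOI name"
  end

  problems
end

event = {
  "doi" => "10.5277/ppmp1850",
  "occurred_at" => "2018-09-30T23:40:39.132Z",
  "clientip" => "46.229.168.146",
  "user_agent" => ""
}
puts event_problems(event).inspect
```

An empty `user_agent` is accepted here because the field only has to be present.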
You will need to provide the configuration of your log lines following the grok filter documentation. You can enter the configuration in the file /vendor/docker/log_configuration.tmpl.
For example, for a log file with the following style:
```text
46.229.168.146 HTTP:HDL "2018-09-30 23:40:39.132Z" 1 1 3ms 10.5277/ppmp1850 "300:10.admin/codata" "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8) Gecko/20051111 Firefox/1.5"
131.180.162.29 HTTP:HDL "2018-09-30 23:40:42.731Z" 1 1 71ms 10.4233/uuid:9798fb4a-9201-4efa-b324-3e50bbdc7ca5 "300:10.admin/codata" "" ""
131.180.162.29 HTTP:HDL "2018-09-30 23:40:44.846Z" 1 100 111ms 10.4233/uuid:a92fc858-da92-4339-8f80-b608aaa09741 "" "" ""
```
One would need the following configuration:
```logstash
"^%{IP:clientip} (?
```
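As a rough illustration of what such a pattern has to capture, the following Ruby sketch extracts the four required fields from the sample lines with a plain regular expression. It is not the grok configuration itself and is not used by the pipeline. `LOG_LINE` and `parse_line` are hypothetical names, and the column layout is inferred only from the three sample lines above.

```ruby
# Illustrative Ruby regex, not the project's grok pattern.
# Column layout inferred from the sample lines (assumption).
LOG_LINE = /
  \A(?<clientip>\S+)\s+         # client IP address
  \S+\s+                        # protocol column, such as HTTP:HDL
  "(?<occurred_at>[^"]+)"\s+    # quoted timestamp
  \d+\s+\d+\s+\d+ms\s+          # two numeric columns and a duration in ms
  (?<doi>\S+)\s+                # DOI name
  "[^"]*"\s+"[^"]*"\s+          # two quoted columns that are not needed here
  "(?<user_agent>[^"]*)"\z      # quoted user agent, possibly empty
/x

# Returns a Hash with the four required fields, or nil if the line does not match.
def parse_line(line)
  match = LOG_LINE.match(line.strip)
  match && match.named_captures
end

sample = '131.180.162.29 HTTP:HDL "2018-09-30 23:40:44.846Z" 1 100 111ms ' \
         '10.4233/uuid:a92fc858-da92-4339-8f80-b608aaa09741 "" "" ""'
puts parse_line(sample).inspect
```

Named groups play the same role here as the `%{IP:clientip}` style captures in a grok pattern: each one becomes a field on the resulting event.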
How to create reports
There are 3 basic steps to create a report:
- Copy your usage logs to /usage_logs.
- Trigger the logs processing.
- Generate the report.
1. Copying the usage logs
The logs processor is restricted to processing logs on a monthly basis, from either a single file or a set of ordered files. You need to merge all your logs into a single file or rename them so that they sort in order. Log files must be placed in /usage_logs.
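Merging several log files into one can be sketched as follows. This is a minimal example, not a tool shipped with shiba-inu: `merge_logs` is a hypothetical helper, and it assumes the file names sort lexicographically into chronological order (for example because they end in a date such as 20180501).

```ruby
require "fileutils"

# Concatenates `sources` into `target`, taking the files in sorted name order.
# Assumption: sorting the file names yields chronological order.
def merge_logs(sources, target)
  FileUtils.mkdir_p(File.dirname(target))
  File.open(target, "w") do |out|
    sources.sort.each do |path|
      File.foreach(path) do |line|
        # Keep one log entry per line even if a file lacks a final newline.
        out.write(line.end_with?("\n") ? line : "#{line}\n")
      end
    end
  end
  target
end

# Example call (the paths are placeholders):
# merge_logs(Dir.glob("raw_logs/*.log"), "/usage_logs/merged.log")
```

Reading line by line keeps memory use flat for large monthly logs, and the newline check prevents the last entry of one file from fusing with the first entry of the next.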
2. Trigger the logs processing
The logs processor starts working automatically once new logs arrive in the logs folder.
3. Generate the report.
Usage reports can be generated locally, pushed, and/or streamed to the MDC Hub. The kishu client can generate a report in any of these ways. To run the kishu client, you need to be inside the Logstash Docker container. The kishu client does not need parameters describing the report to generate (e.g. the month), as it automatically generates the report from whatever is in the logs processor pipeline.
```shell
source /usr/local/rvm/scripts/rvm
rvm user gemsets
```
To generate a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command. This will generate a usage report in the folder /reports.
```shell
bundle exec kishu sushi generate_report --created_by {YOUR DATACITE CLIENT ID}
```
To generate and push a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command.
```shell
bundle exec kishu sushi push_report --created_by {YOUR DATACITE CLIENT ID}
```
To stream a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command. This option should only be used for reports with more than 50,000 datasets or larger than 10MB. We compress all reports that are streamed to the MDC Hub.
```shell
bundle exec kishu sushi stream --created_by {YOUR DATACITE CLIENT ID} --schema resolution --aggs_size 200 --report_size 90000
```
Further information about parametrizing the streaming can be found in the kishu client.
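The choice between pushing and streaming can be sketched as a small helper that builds the command line from the thresholds stated above. This is an illustration only: `kishu_command` and the two constants are hypothetical names, the caller is assumed to already know the dataset count and report size, and 10MB is read here as 10 × 1024 × 1024 bytes. The subcommands and flags are the ones shown in the commands above.

```ruby
STREAM_DATASET_THRESHOLD = 50_000
STREAM_SIZE_THRESHOLD = 10 * 1024 * 1024 # 10MB, read as MiB (assumption)

# Builds the kishu command as an argument list, picking `stream` for large
# reports and `push_report` otherwise.
def kishu_command(client_id:, dataset_count:, report_bytes:)
  base = %w[bundle exec kishu sushi]
  large = dataset_count > STREAM_DATASET_THRESHOLD ||
          report_bytes > STREAM_SIZE_THRESHOLD

  if large
    base + ["stream", "--created_by", client_id,
            "--schema", "resolution",
            "--aggs_size", "200", "--report_size", "90000"]
  else
    base + ["push_report", "--created_by", client_id]
  end
end

# "MY.CLIENT" is a placeholder client ID.
puts kishu_command(client_id: "MY.CLIENT", dataset_count: 1_200, report_bytes: 300_000).join(" ")
```

Returning an argument list rather than a single string means the result can be passed to `system(*args)` without shell quoting concerns.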
Development
We use RSpec for unit and acceptance testing:
ruby -S bundle exec rspec
Follow along via GitHub Issues.
Note on Patches/Pull Requests
- Fork the project
- Write tests for your new feature or a test that reproduces a bug
- Implement your feature or make a bug fix
- Do not mess with Rakefile, version or history
- Commit, push and make a pull request. Bonus points for topical branches.
License
shiba-inu is released under the MIT License.
Owner
- Name: DataCite
- Login: datacite
- Kind: organization
- Email: info@datacite.org
- Website: https://www.datacite.org
- Twitter: DataCite
- Repositories: 111
- Profile: https://github.com/datacite
Connecting research, identifying knowledge