https://github.com/datacite/shiba-inu

Pipeline for DOI Resolution Logs processing

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: datacite
  • License: mit
  • Language: Ruby
  • Default Branch: master
  • Size: 256 KB
Statistics
  • Stars: 5
  • Watchers: 5
  • Forks: 6
  • Open Issues: 1
  • Releases: 0
Created about 8 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

Pipeline for DOI Resolution Logs processing

Shiba-Inu is a pipeline for processing DOI resolution logs. It processes the logs following the Code of Practice for Research Data Usage Metrics and is based on Logstash.

The Shiba Inu is the smallest of the six original and distinct spitz breeds of dog from Japan.

Installation

Requirements

  • An Elasticsearch instance
  • Single-line logs with DOI names

You can run the logs processor using Docker. You will need to set the following environment variables:

```
ESHOST=http://elasticsearch:9200
ESINDEX=resolutions
INPUTDIR=/usr/share/logstash/tmp/DataCite-access.log-201805
OUTPUTDIR=/usr/share/logstash/tmp/output.json
LOGSTASH_HOST=localhost:9600
S3MERGEDLOGSBUCKET=/usr/share/logstash/monthlylogs
S3RESOLUTIONLOGSBUCKET=/usr/share/logstash/
ELASTICPASSWORD=changeme
LOGS_TAG=[Resolution Logs]
HUBTOKEN=eyJhbGciOiJSUzI1NiJ9
HUBURL=https://api.test.datacite.org
```

and run the container like this:

docker run -p 8090:9200 datacite/shiba-inu

Alternatively, you can use docker-compose to run the log processor without a separate Elasticsearch instance:

docker-compose up

Usage logs

Your logs need to fulfill two requirements:

  • The logs must be single line logs.
  • MUST include the following data:
    • doi => DOI name
    • occurred_at => timestamp (ISO8601)
    • clientip => IP address (IPv4 or IPv6)
    • user_agent => user agent
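
The field requirements above can be checked on a parsed record before the logs go into the pipeline. The following is only a minimal sketch: it assumes a record is a Ruby Hash with string keys, and `record_problems` is a hypothetical helper that is not part of shiba-inu.

```ruby
require "time"
require "ipaddr"

# Fields every log record must carry, as listed above.
REQUIRED_FIELDS = %w[doi occurred_at clientip user_agent].freeze

# Hypothetical helper: returns a list of problems found in one parsed
# log record (a Hash with string keys). An empty list means the record
# satisfies the requirements.
def record_problems(record)
  problems = REQUIRED_FIELDS.reject { |field| record.key?(field) }
                            .map { |field| "missing #{field}" }

  if record.key?("occurred_at")
    begin
      Time.iso8601(record["occurred_at"].to_s)
    rescue ArgumentError
      problems << "occurred_at is not ISO8601"
    end
  end

  if record.key?("clientip")
    begin
      IPAddr.new(record["clientip"].to_s) # accepts IPv4 and IPv6
    rescue IPAddr::Error
      problems << "clientip is not an IP address"
    end
  end

  problems
end
```

A record that passes returns an empty array; anything else lists what has to be fixed in the log format or in the grok configuration.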

You will need to provide the configuration of your log lines following the grok filter documentation. You can enter the configuration in the file /vendor/docker/log_configuration.tmpl.

For example, for a log file with the following style:

```text
46.229.168.146 HTTP:HDL "2018-09-30 23:40:39.132Z" 1 1 3ms 10.5277/ppmp1850 "300:10.admin/codata" "" "Mozilla/5.0 (Windows; U; Windows NT 5.1; fr; rv:1.8) Gecko/20051111 Firefox/1.5"
131.180.162.29 HTTP:HDL "2018-09-30 23:40:42.731Z" 1 1 71ms 10.4233/uuid:9798fb4a-9201-4efa-b324-3e50bbdc7ca5 "300:10.admin/codata" "" ""
131.180.162.29 HTTP:HDL "2018-09-30 23:40:44.846Z" 1 100 111ms 10.4233/uuid:a92fc858-da92-4339-8f80-b608aaa09741 "" "" ""
```

One would need the following configuration:

```logstash

"^%{IP:clientip} (?(HTTP:HDL)) %{QS:occurred_at} %{INT:ld} %{INT:respcode} (?((.+ms))) %{DOI:doi} %{QS:server} %{QS:something} %{QS:user_agent}"

```
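
To check what such a pattern should capture before editing the template, the same line layout can be written as a plain Ruby regular expression. This is only a sketch for experimenting outside Logstash. The group names `protocol`, `resp_code` and `resp_time` are hypothetical, `10\.\S+` is an assumed stand-in for the `DOI` grok pattern (which is not shown here), and unlike grok's `QS` this version drops the surrounding quotes from captured values.

```ruby
# Rough Ruby equivalent of the log layout shown above.
# Group names and the DOI sub-pattern are assumptions for illustration.
LINE_RE = /
  \A(?<clientip>\S+)\s
  (?<protocol>HTTP:HDL)\s
  "(?<occurred_at>[^"]*)"\s
  (?<ld>\d+)\s
  (?<resp_code>\d+)\s
  (?<resp_time>\d+ms)\s
  (?<doi>10\.\S+)\s
  "(?<server>[^"]*)"\s
  "(?<something>[^"]*)"\s
  "(?<user_agent>[^"]*)"\z
/x

# Returns a Hash of captured fields, or nil when the line does not match.
def parse_line(line)
  match = LINE_RE.match(line.chomp)
  match && match.named_captures
end
```

Running `parse_line` over a few lines of a real log file is a quick way to spot lines that would not match the pattern.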

How to create reports

There are three basic steps to create a report.

  1. Copy your usage logs to /usage_logs
  2. Trigger the logs processing.
  3. Generate the report.

1. Copying the usage logs

The logs processor only processes logs on a monthly basis, from either a single file or a set of ordered files. You need to merge all your logs into a single file or rename them so that they sort in order. Log files must be placed in /usage_logs.
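
Merging can be done with any tool. As one possible sketch in Ruby, the hypothetical helper below concatenates files in file-name order, which assumes the daily file names sort chronologically (for example a date suffix such as `-20180501`). The target file must not match the source glob.

```ruby
# Hypothetical helper: concatenate log files into one monthly file,
# in file-name order. Returns the list of source files that were merged.
def merge_logs(source_glob, target_path)
  sources = Dir.glob(source_glob).sort
  File.open(target_path, "w") do |out|
    sources.each do |path|
      # puts + chomp keeps one record per line even when a source file
      # lacks a trailing newline.
      File.foreach(path) { |line| out.puts(line.chomp) }
    end
  end
  sources
end

# Example call (paths are illustrative only):
#   merge_logs("daily/DataCite-access.log-201805*",
#              "/usage_logs/DataCite-access.log-201805")
```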

2. Trigger the logs processing

The logs processor starts working automatically once new logs arrive in the logs folder.

3. Generate the report.

Usage reports can be generated locally, pushed, and/or streamed to the MDC Hub. The kishu client can generate a report from the processed logs in any of these ways. To run the kishu client you need to be inside the Logstash Docker container. The kishu client does not need parameters describing the report to generate (e.g. the month), because it automatically generates the report from whatever is in the logs processor pipeline.

source /usr/local/rvm/scripts/rvm
rvm user gemsets

To generate a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command. This will generate a usage report in the folder /reports.

bundle exec kishu sushi generate_report --created_by {YOUR DATACITE CLIENT ID}

To generate and push a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command.

bundle exec kishu sushi push_report --created_by {YOUR DATACITE CLIENT ID}

To stream a usage report in JSON format following the Code of Practice for Usage Metrics, you can use the following command. This option should only be used for reports with more than 50,000 datasets or larger than 10MB. All reports streamed to the MDC Hub are compressed.

bundle exec kishu sushi stream --created_by {YOUR DATACITE CLIENT ID} --schema resolution --aggs_size 200 --report_size 90000
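
The choice between pushing and streaming follows directly from the two thresholds mentioned above. The sketch below only illustrates that rule and is not part of kishu. The helper name is hypothetical, and it assumes "10MB" means 10 × 1024 × 1024 bytes.

```ruby
DATASET_LIMIT = 50_000
SIZE_LIMIT_BYTES = 10 * 1024 * 1024 # assumption: 10MB read as 10 MiB

# Hypothetical helper: returns the kishu sushi subcommand suited to a
# report with the given number of datasets and size in bytes.
def report_command(dataset_count:, size_bytes:)
  if dataset_count > DATASET_LIMIT || size_bytes > SIZE_LIMIT_BYTES
    "stream"
  else
    "push_report"
  end
end
```

For example, `report_command(dataset_count: 1_200, size_bytes: 350_000)` selects `push_report`, while a report with 60,000 datasets selects `stream` regardless of its size.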

Further information about parametrizing the streaming can be found in the kishu client.

Development

We use RSpec for unit and acceptance testing:

ruby -S bundle exec rspec

Follow along via GitHub Issues.

Note on Patches/Pull Requests

  • Fork the project
  • Write tests for your new feature or a test that reproduces a bug
  • Implement your feature or make a bug fix
  • Do not mess with Rakefile, version or history
  • Commit, push and make a pull request. Bonus points for topical branches.

License

shiba-inu is released under the MIT License.

Owner

  • Name: DataCite
  • Login: datacite
  • Kind: organization
  • Email: info@datacite.org

Connecting research, identifying knowledge
