affiliation-matcher

Matcher for affiliations - link raw affiliation to ROR ids, country and RNSR

https://github.com/dataesr/affiliation-matcher

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: dataesr
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 5.5 MB
Statistics
  • Stars: 24
  • Watchers: 5
  • Forks: 1
  • Open Issues: 8
  • Releases: 48
Created over 5 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Affiliation matcher


Goal

The affiliation matcher aims to automatically align a raw affiliation string with several reference systems, including countries, GRID and ROR.

For French affiliations specifically, it also matches against RNSR.

Methodology

The methodology is fully explained in a publication freely available on HAL: https://hal.archives-ouvertes.fr/hal-03365806.

Run it locally

:warning: Please use a docker-compose version from 1.27.0 up to 1.29.2.

```shell
git clone git@github.com:dataesr/affiliation-matcher.git
cd affiliation-matcher
make docker-build start
```

Wait for Elasticsearch to be up, then run:

```shell
make load
```
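The wait for Elasticsearch can be scripted rather than done by hand. The following is a minimal sketch and not part of the repository: the helper names `is_up` and `wait_until_up`, the retry count and the delay are assumptions made for illustration. Only the URL http://localhost:9200/ comes from this README.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if an HTTP GET on `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False


def wait_until_up(url, probe=is_up, retries=30, delay=2.0, sleep=time.sleep):
    """Poll `url` until `probe` succeeds; return the number of attempts used.

    `probe` and `sleep` are injectable so the loop can be exercised
    without a running service. Raises TimeoutError after `retries` failures.
    """
    for attempt in range(1, retries + 1):
        if probe(url):
            return attempt
        if attempt < retries:
            sleep(delay)
    raise TimeoutError(f"{url} did not answer after {retries} attempts")


# Example (hypothetical usage, requires the Docker stack to be started):
# wait_until_up("http://localhost:9200/")
```

Once `wait_until_up` returns, `make load` can be run. A plain status check like this only shows that the HTTP endpoint answers. It does not guarantee that the cluster is fully ready for indexing.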

The following services are now reachable from your browser:

  • Elasticsearch: http://localhost:9200/
  • RabbitMQ: http://localhost:9181/
  • Matcher: http://localhost:5004/

In Python, you can call the matcher this way:

```python
import requests

# Matcher endpoint exposed by the local Docker stack
url = 'http://localhost:5004/match'

# Match against ROR using structured criteria
r = requests.post(url, json={
    "type": "ror",
    "name": "Paris Dauphine University",
    "city": "Paris",
    "country": "France",
    "verbose": False,
})
r.json()
```

For ROR, the available criteria are: id, grid_id, name, city, country, supervisor_name, acronym, city_zone_emploi, city_nuts_level2, web_url, web_domain. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_ror.py

For RNSR, the available criteria are: year, id, code_number, acronym, name, supervisor_name, supervisor_acronym, zone_emploi, city, web_url. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_rnsr.py

Run unit tests

```shell
make test
```

Build docker image

```shell
make docker-build
```

Build python package

To generate the tarball package in the dist folder:

```shell
make python-build
```

To install the generated package into your project:

```shell
pip install /path/to/your/package.tar.gz
```

Then import the package into your python file

python import affiliation-matcher

Release

Releases follow semantic versioning (semver).

To create a new release:

```shell
make release VERSION=x.x.x
```

API

Match a single query /match

Query the API by setting your own strategies:

```shell
curl "YOUR_API_IP/match" -X POST -d '{"type": "YOUR_TYPE", "query": "YOUR_QUERY", "strategies": "YOUR_STRATEGIES", "year": "YOUR_YEAR"}'
```

YOUR_TYPE is optional. It has to be a string and can be one of:

  • "country"
  • "grid"
  • "rnsr"
  • "ror"

By default, YOUR_TYPE is equal to "rnsr".

YOUR_QUERY is mandatory. It has to be a string containing your affiliation text.

For example: IPAG Institut de Planétologie et d'Astrophysique de Grenoble.

YOUR_STRATEGIES is optional. It has to be a 3-dimensional array of criteria (see the Criteria section below).

For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].

YOUR_YEAR is optional. It has to be a string and can only be used with the "rnsr" matcher type.

For example: 1998.

By default, YOUR_YEAR is not set, i.e. the query is matched over all years.
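The parameter rules for /match can be checked on the client side before a request is sent. The sketch below is an illustration only: `build_match_payload` and `ALLOWED_TYPES` are hypothetical names that do not exist in the repository, and the checks simply restate the rules described in this section.

```python
import json

# Matcher types listed in this README
ALLOWED_TYPES = {"country", "grid", "rnsr", "ror"}


def build_match_payload(query, match_type=None, strategies=None, year=None):
    """Build the JSON body for POST /match, omitting unset optional keys."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is mandatory and must be a non-empty string")
    payload = {"query": query}
    if match_type is not None:
        if match_type not in ALLOWED_TYPES:
            raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
        payload["type"] = match_type
    if strategies is not None:
        payload["strategies"] = strategies
    if year is not None:
        # The server-side default type is "rnsr", so an omitted type is accepted here.
        if payload.get("type", "rnsr") != "rnsr":
            raise ValueError("year can only be used with the 'rnsr' matcher type")
        payload["year"] = str(year)  # the API expects a string
    return payload


payload = build_match_payload(
    "IPAG Institut de Planétologie et d'Astrophysique de Grenoble",
    match_type="rnsr",
    year=1998,
)
print(json.dumps(payload, ensure_ascii=False))
```

The resulting dictionary can then be sent with, for instance, `requests.post("http://localhost:5004/match", json=payload)` when the matcher runs locally.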

Match multiple queries /match_list

```shell
curl "YOUR_API_IP/match_list" -X POST -d '{"match_types": "YOUR_TYPES", "affiliations": "YOUR_AFFILIATIONS"}'
```

YOUR_TYPES is optional. It has to be a list of strings and can contain any of:

  • "country"
  • "grid"
  • "rnsr"
  • "ror"

By default, YOUR_TYPES is equal to ["grid", "rnsr"].

YOUR_AFFILIATIONS is optional. It has to be a list of strings. For example: `["affiliation_01", "affiliation_02"]`.

By default, YOUR_AFFILIATIONS is equal to [].
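A body for /match_list can be assembled in the same spirit. This is a sketch under assumptions: `build_match_list_payload` is a hypothetical helper, not a function of the project. It only encodes the rules stated above and leaves optional keys out so that the server defaults apply.

```python
import json

# Matcher types listed in this README
ALLOWED_TYPES = {"country", "grid", "rnsr", "ror"}


def build_match_list_payload(affiliations=None, match_types=None):
    """Build the JSON body for POST /match_list, omitting unset keys."""
    payload = {}
    if match_types is not None:
        unknown = [t for t in match_types if t not in ALLOWED_TYPES]
        if unknown:
            raise ValueError(f"unknown matcher types: {unknown}")
        payload["match_types"] = list(match_types)
    if affiliations is not None:
        if not all(isinstance(a, str) for a in affiliations):
            raise ValueError("affiliations must be a list of strings")
        payload["affiliations"] = list(affiliations)
    return payload


payload = build_match_list_payload(
    affiliations=["affiliation_01", "affiliation_02"],
    match_types=["grid", "rnsr"],
)
print(json.dumps(payload))
```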

Criteria

Here is a list of the criteria available for the country matcher:

  • country_alpha3
  • country_name
  • country_subdivision_code
  • country_subdivision_name

Here is a list of the criteria available for the grid matcher:

  • grid_acronym
  • grid_acronym_unique
  • grid_cities_by_region [indirect]
  • grid_city
  • grid_country
  • grid_country_code
  • grid_department
  • grid_id
  • grid_name
  • grid_name_unique
  • grid_parent
  • grid_region

Here is a list of the criteria available for the rnsr matcher:

  • rnsr_acronym
  • rnsr_city
  • rnsr_code_number
  • rnsr_code_prefix
  • rnsr_country_code
  • rnsr_id
  • rnsr_name
  • rnsr_name_txt
  • rnsr_supervisor_acronym
  • rnsr_supervisor_name
  • rnsr_urban_unit
  • rnsr_web_url
  • rnsr_year
  • rnsr_zone_emploi [indirect]

Here is a list of the criteria available for the ror matcher:

  • ror_acronym
  • ror_acronym_unique
  • ror_city
  • ror_country
  • ror_country_code
  • ror_grid_id
  • ror_id
  • ror_name
  • ror_name_unique

  1. You can combine criteria to create a strategy.
  2. You can group several strategies to create a family of strategies.
  3. You can then group several families of strategies to create the final object.
  4. This final strategies object is therefore a 3-dimensional array that you pass as an argument to the "/match" API endpoint. For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].
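The three nesting levels described above (criteria, strategies, families) can be made explicit in code. The following is a minimal sketch: `validate_strategies` is a hypothetical helper written for this illustration, and it only checks the shape of the object, not whether each criterion name is known to the matcher.

```python
def validate_strategies(strategies):
    """Check that `strategies` is families > strategies > criteria.

    Returns the total number of criteria found, or raises ValueError.
    """
    if not isinstance(strategies, list) or not strategies:
        raise ValueError("strategies must be a non-empty list of families")
    total = 0
    for family in strategies:
        if not isinstance(family, list) or not family:
            raise ValueError("each family must be a non-empty list of strategies")
        for strategy in family:
            if not isinstance(strategy, list) or not strategy:
                raise ValueError("each strategy must be a non-empty list of criteria")
            for criterion in strategy:
                if not isinstance(criterion, str):
                    raise ValueError("each criterion must be a string")
                total += 1
    return total


# Level 1: criteria combined into strategies
by_country_name = ["grid_name", "grid_country"]
by_country_code = ["grid_name", "grid_country_code"]
# Level 2: strategies grouped into a family
family = [by_country_name, by_country_code]
# Level 3: families grouped into the final object
strategies = [family]

print(strategies)
print(validate_strategies(strategies))
```

The printed object has the same shape as the example given in the list above and can be passed as the "strategies" value of a /match request.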

Results

| matcher | precision | recall |
| ----- | ----- | ----- |
| country | 0.9953 | 0.9690 |
| grid | 0.7946 | 0.5944 |
| rnsr | 0.9654 | 0.8192 |
| ror | 0.8891 | 0.2356 |

(TBC)
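Precision and recall are sometimes summarised by their harmonic mean, the F1 score, F1 = 2PR / (P + R). The README does not report F1. The snippet below is only a convenience sketch that derives it from the figures in the table, and `f1_score` and `RESULTS` are names chosen for this illustration.

```python
# (precision, recall) pairs copied from the results table
RESULTS = {
    "country": (0.9953, 0.9690),
    "grid": (0.7946, 0.5944),
    "rnsr": (0.9654, 0.8192),
    "ror": (0.8891, 0.2356),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for matcher, (precision, recall) in RESULTS.items():
    print(f"{matcher:8s} F1 = {f1_score(precision, recall):.4f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a matcher with high precision but low recall still ends up with a low F1.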

Owner

  • Name: #dataESR
  • Login: dataesr
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite our publication as below."
title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
authors:
- family-names: "L'Hôte"
  given-names: "Anne"
  orcid: "https://orcid.org/0000-0003-1353-5584"
- family-names: "Jeangirard"
  given-names: "Eric"
  orcid: "https://orcid.org/0000-0002-3767-7125"
preferred-citation:
  type: unpublished
  title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
  authors:
  - family-names: "L'Hôte"
    given-names: "Anne"
    orcid: "https://orcid.org/0000-0003-1353-5584"
  - family-names: "Jeangirard"
    given-names: "Eric"
    orcid: "https://orcid.org/0000-0002-3767-7125"
  url: "https://hal.science/hal-03365806"
  notes: "working paper or preprint"
  year: 2021
  month: 10
  keywords:
  - Elasticsearch
  - affiliation disambiguation
  - entity recognition
  - open science
  start: 1 # First page number
  end: 12 # Last page number
  version: 1.0.0
  date-released: 2021-10-07

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Push event: 5
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 2
  • Push event: 5
  • Create event: 1

Dependencies

.github/workflows/build.yml actions
  • actions/checkout v3 composite
  • dataesr/mm-notifier-action v1 composite
  • svenstaro/upload-release-action v2 composite
.github/workflows/tests.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v2 composite
Dockerfile docker
  • python 3.6 build
docker-compose.yml docker
  • dataesr/dashboard-crawler 1.1
  • dataesr/es_icu 7.12.0
  • redis 5.0.7-alpine
requirements.txt pypi
  • Flask ==1.1.1
  • Flask-Bootstrap ==3.3.7.1
  • XlsxWriter ==1.0.4
  • beautifulsoup4 ==4.8.2
  • elasticsearch ==7.8.0
  • elasticsearch-dsl ==7.2.1
  • fuzzywuzzy ==0.18.0
  • geopy ==2.1.0
  • lxml ==4.9.1
  • pandas ==0.25.3
  • pycountry ==20.7.3
  • pytest ==6.2.3
  • pytest-mock ==3.5.1
  • python-Levenshtein ==0.21.1
  • redis ==3.5.3
  • requests ==2.25.0
  • requests-mock ==1.9.2
  • rq ==1.9.0
  • xlrd ==1.1.0
setup.py pypi
  • Flask ==1.1.1
  • Flask-Bootstrap ==3.3.7.1
  • XlsxWriter ==1.0.4
  • beautifulsoup4 ==4.8.2
  • elasticsearch ==7.8.0
  • elasticsearch-dsl ==7.2.1
  • geopy ==2.1.0
  • lxml ==4.9.1
  • pandas ==0.25.3
  • pycountry ==20.7.3
  • redis ==3.3.11
  • requests ==2.20.0
  • rq ==1.1.0
  • xlrd ==1.1.0