affiliation-matcher

Matcher for affiliations - link raw affiliation to ROR ids, country and RNSR

https://github.com/dataesr/affiliation-matcher

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

* ✓ CITATION.cff file: found
* ✓ codemeta.json file: found
* ✓ .zenodo.json file: found
* ○ DOI references
* ○ Academic publication links
* ○ Academic email domains
* ○ Institutional organization owner
* ○ JOSS paper metadata
* ○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary


Repository


Basic Info

Host: GitHub
Owner: dataesr
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 5.5 MB

Statistics

Stars: 24
Watchers: 5
Forks: 1
Open Issues: 8
Releases: 48

Created almost 6 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

Affiliation matcher


Goal

The affiliation matcher aims to automatically align a raw affiliation string with several reference systems. The matcher types documented in the API section are country, GRID and ROR, plus RNSR, which is specific to French affiliations.

Methodology

The methodology is fully explained in a publication freely available on HAL: https://hal.archives-ouvertes.fr/hal-03365806.

Run it locally

:warning: Please use docker-compose version 1.27.0 up to 1.19.2.

    git clone git@github.com:dataesr/affiliation-matcher.git
    cd affiliation-matcher
    make docker-build start

Wait for Elasticsearch to be up. Then run :

    make load

In your browser, you now have :

Elasticsearch : http://localhost:9200/
RabbitMQ : http://localhost:9181/
Matcher : http://localhost:5004/

In Python, you can call the matcher this way:

    import requests

    url = 'http://localhost:5004/match'
    r = requests.post(url, json={
        "type": "ror",
        "name": "Paris Dauphine University",
        "city": "Paris",
        "country": "France",
        "verbose": False,
    })
    r.json()

For ROR, available criteria are: id, grid_id, name, city, country, supervisor_name, acronym, city_zone_emploi, city_nuts_level2, web_url, web_domain. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_ror.py

For RNSR, available criteria are: year, id, code_number, acronym, name, supervisor_name, supervisor_acronym, zone_emploi, city, web_url. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_rnsr.py
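The structured call shown above can be wrapped in a small helper that rejects field names before the request is sent. This is only a sketch: `build_structured_payload` and `ALLOWED_FIELDS` are hypothetical names that are not part of the project, and the allowed sets below hold an illustrative subset of the criteria listed above rather than the full list.

```python
from typing import Any, Dict

# Illustrative subset of the documented criteria (assumption: extend as needed).
ALLOWED_FIELDS = {
    "ror": {"id", "name", "city", "country", "acronym"},
    "rnsr": {"year", "id", "name", "city", "acronym"},
}


def build_structured_payload(matcher_type: str, verbose: bool = False,
                             **fields: Any) -> Dict[str, Any]:
    """Build the JSON body for a structured /match call (hypothetical helper)."""
    if matcher_type not in ALLOWED_FIELDS:
        raise ValueError(f"unsupported matcher type: {matcher_type!r}")
    unknown = set(fields) - ALLOWED_FIELDS[matcher_type]
    if unknown:
        raise ValueError(f"unknown criteria for {matcher_type}: {sorted(unknown)}")
    payload: Dict[str, Any] = {"type": matcher_type, "verbose": verbose}
    payload.update(fields)
    return payload


payload = build_structured_payload(
    "ror", name="Paris Dauphine University", city="Paris", country="France"
)

# To send it to a locally running matcher (requires the `requests` package):
# import requests
# r = requests.post("http://localhost:5004/match", json=payload)
```

Validating on the client side keeps typos in field names from silently turning into an unmatched query.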

Run unit tests

    make test

Build docker image

    make docker-build

Build python package

To generate the tarball package into the dist folder :

    make python-build

To install the generated package into your project :

    pip install /path/to/your/package.tar.gz

Then import the package into your Python file:

    import affiliation-matcher

Release

It uses semver.

To create a new release:

    make release VERSION=x.x.x

API

Match a single query `/match`

Query the API by setting your own strategies :

    curl "YOUR_API_IP/match" -X POST -d '{"type": "YOUR_TYPE", "query": "YOUR_QUERY", "strategies": "YOUR_STRATEGIES", "year": "YOUR_YEAR"}'

YOUR_TYPE is optional, has to be a string and can be one of:

* "country"
* "grid"
* "rnsr"
* "ror"

By default, YOUR_TYPE is equal to "rnsr".

YOUR_QUERY is mandatory, has to be a string and is your affiliation text.

For example: IPAG Institut de Planétologie et d'Astrophysique de Grenoble.

YOUR_STRATEGIES is optional and has to be a 3-dimensional array of criteria (see the Criteria section below).

For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].

YOUR_YEAR is optional, has to be a string, and can be used only with the "rnsr" matcher type.

For example: 1998.

By default, YOUR_YEAR is not set, i.e. the query is matched over all years.
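The parameter rules above can be checked before a request is sent. The sketch below builds the JSON body for `/match`; `build_match_body` is a hypothetical helper, not part of the project, and it assumes that optional parameters may simply be left out of the body when unset.

```python
import json
from typing import List, Optional

MATCHER_TYPES = ("country", "grid", "rnsr", "ror")


def build_match_body(query: str,
                     matcher_type: str = "rnsr",
                     strategies: Optional[List[List[List[str]]]] = None,
                     year: Optional[str] = None) -> str:
    """Return the JSON body for a /match request (hypothetical helper)."""
    if not isinstance(query, str) or not query:
        raise ValueError("query is mandatory and must be a non-empty string")
    if matcher_type not in MATCHER_TYPES:
        raise ValueError(f"type must be one of {MATCHER_TYPES}")
    if year is not None and matcher_type != "rnsr":
        raise ValueError("year can only be used with the 'rnsr' matcher type")
    body = {"type": matcher_type, "query": query}
    if strategies is not None:
        body["strategies"] = strategies
    if year is not None:
        body["year"] = str(year)  # the API expects the year as a string
    return json.dumps(body, ensure_ascii=False)


body = build_match_body(
    "IPAG Institut de Planétologie et d'Astrophysique de Grenoble",
    year="1998",
)
```

The resulting string can be passed to `curl -d` or posted with any HTTP client.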

Match multiple queries `/match_list`

    curl "YOUR_API_IP/match_list" -X POST -d '{"match_types": "YOUR_TYPES", "affiliations": "YOUR_AFFILIATIONS"}'

YOUR_TYPES is optional, has to be a list of strings and can contain any of:

* "country"
* "grid"
* "rnsr"
* "ror"

By default, YOUR_TYPES is equal to ["grid", "rnsr"].

YOUR_AFFILIATIONS is optional and has to be a list of strings. For example: `["affiliation_01", "affiliation_02"]`.

By default, YOUR_AFFILIATIONS is equal to [].
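When many affiliations have to be matched, it can help to split them into several `/match_list` requests. The sketch below yields one request body per batch; `match_list_bodies` and its `batch_size` parameter are hypothetical, and the README does not state any limit on the number of affiliations per call.

```python
from typing import Dict, Iterator, List, Sequence

MATCHER_TYPES = {"country", "grid", "rnsr", "ror"}


def match_list_bodies(affiliations: Sequence[str],
                      match_types: Sequence[str] = ("grid", "rnsr"),
                      batch_size: int = 100) -> Iterator[Dict[str, List[str]]]:
    """Yield one /match_list body per batch of affiliations (hypothetical helper)."""
    unknown = set(match_types) - MATCHER_TYPES
    if unknown:
        raise ValueError(f"unknown matcher types: {sorted(unknown)}")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(affiliations), batch_size):
        yield {
            "match_types": list(match_types),
            "affiliations": list(affiliations[start:start + batch_size]),
        }


bodies = list(match_list_bodies(
    ["affiliation_01", "affiliation_02", "affiliation_03"], batch_size=2
))
```

Each dictionary can then be serialized to JSON and posted to the endpoint.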

Criteria

Here is a list of the criteria available for the country matcher:

* country_alpha3
* country_name
* country_subdivision_code
* country_subdivision_name

Here is a list of the criteria available for the grid matcher:

* grid_acronym
* grid_acronym_unique
* grid_cities_by_region [indirect]
* grid_city
* grid_country
* grid_country_code
* grid_department
* grid_id
* grid_name
* grid_name_unique
* grid_parent
* grid_region

Here is a list of the criteria available for the rnsr matcher:

* rnsr_acronym
* rnsr_city
* rnsr_code_number
* rnsr_code_prefix
* rnsr_country_code
* rnsr_id
* rnsr_name
* rnsr_name_txt
* rnsr_supervisor_acronym
* rnsr_supervisor_name
* rnsr_urban_unit
* rnsr_web_url
* rnsr_year
* rnsr_zone_emploi [indirect]

Here is a list of the criteria available for the ror matcher:

* ror_acronym
* ror_acronym_unique
* ror_city
* ror_country
* ror_country_code
* ror_grid_id
* ror_id
* ror_name
* ror_name_unique

You can combine criteria to create a strategy.
You can group strategies to create a family of strategies.
You can then group families of strategies to create the final object.
This final object, strategies, is therefore a 3-dimensional array that you pass as an argument to the "/match" API endpoint. For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].
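The nesting described above (criteria inside a strategy, strategies inside a family, families inside the final object) can be checked with a few lines of Python. This is a sketch under two assumptions that the README does not state: empty lists are rejected, and every criterion starts with the matcher type followed by an underscore, as in the `grid_name` example. `validate_strategies` is a hypothetical helper.

```python
from typing import List

Strategy = List[str]        # a combination of criteria
Family = List[Strategy]     # a group of strategies
Strategies = List[Family]   # the final 3-dimensional object


def validate_strategies(strategies: Strategies, matcher_type: str) -> None:
    """Raise ValueError if `strategies` is not a 3-level list of criteria."""
    prefix = matcher_type + "_"
    if not isinstance(strategies, list) or not strategies:
        raise ValueError("strategies must be a non-empty list of families")
    for family in strategies:
        if not isinstance(family, list) or not family:
            raise ValueError("each family must be a non-empty list of strategies")
        for strategy in family:
            if not isinstance(strategy, list) or not strategy:
                raise ValueError("each strategy must be a non-empty list of criteria")
            for criterion in strategy:
                if not isinstance(criterion, str) or not criterion.startswith(prefix):
                    raise ValueError(f"unexpected criterion: {criterion!r}")


strategies = [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]]
validate_strategies(strategies, "grid")  # passes silently
```

Here the final object contains one family made of two strategies, each combining two criteria.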

Results

| matcher | precision | recall |
| ----- | ----- | ----- |
| country | 0.9953 | 0.9690 |
| grid | 0.7946 | 0.5944 |
| rnsr | 0.9654 | 0.8192 |
| ror | 0.8891 | 0.2356 |

(TBC ???)

Owner

Name: #dataESR
Login: dataesr
Kind: organization

Repositories: 62
Profile: https://github.com/dataesr

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite our publication as below."
title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
authors:
- family-names: "L'Hôte"
  given-names: "Anne"
  orcid: "https://orcid.org/0000-0003-1353-5584"
- family-names: "Jeangirard"
  given-names: "Eric"
  orcid: "https://orcid.org/0000-0002-3767-7125"
preferred-citation:
  type: unpublished
  title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
  authors:
  - family-names: "L'Hôte"
    given-names: "Anne"
    orcid: "https://orcid.org/0000-0003-1353-5584"
  - family-names: "Jeangirard"
    given-names: "Eric"
    orcid: "https://orcid.org/0000-0002-3767-7125"
  url: "https://hal.science/hal-03365806"
  notes: "working paper or preprint"
  year: 2021
  month: Oct,
  keywords:
  - Elasticsearch
  - affiliation disambiguation
  - entity recognition
  - open science
  start: 1 # First page number
  end: 12 # Last page number
  version: 1.0.0
  date-released: 2021-10-07

GitHub Events

Total

Release event: 1
Watch event: 2
Push event: 5
Create event: 1

Last Year

Release event: 1
Watch event: 2
Push event: 5
Create event: 1

Dependencies

.github/workflows/build.yml actions

actions/checkout v3 composite
dataesr/mm-notifier-action v1 composite
svenstaro/upload-release-action v2 composite

.github/workflows/tests.yml actions

actions/checkout v3 composite
actions/setup-python v2 composite

Dockerfile docker

python 3.6 build

docker-compose.yml docker

dataesr/dashboard-crawler 1.1
dataesr/es_icu 7.12.0
redis 5.0.7-alpine

requirements.txt pypi

Flask ==1.1.1
Flask-Bootstrap ==3.3.7.1
XlsxWriter ==1.0.4
beautifulsoup4 ==4.8.2
elasticsearch ==7.8.0
elasticsearch-dsl ==7.2.1
fuzzywuzzy ==0.18.0
geopy ==2.1.0
lxml ==4.9.1
pandas ==0.25.3
pycountry ==20.7.3
pytest ==6.2.3
pytest-mock ==3.5.1
python-Levenshtein ==0.21.1
redis ==3.5.3
requests ==2.25.0
requests-mock ==1.9.2
rq ==1.9.0
xlrd ==1.1.0

setup.py pypi

Flask ==1.1.1
Flask-Bootstrap ==3.3.7.1
XlsxWriter ==1.0.4
beautifulsoup4 ==4.8.2
elasticsearch ==7.8.0
elasticsearch-dsl ==7.2.1
geopy ==2.1.0
lxml ==4.9.1
pandas ==0.25.3
pycountry ==20.7.3
redis ==3.3.11
requests ==2.20.0
rq ==1.1.0
xlrd ==1.1.0
