affiliation-matcher
Matcher for affiliations - link raw affiliation to ROR ids, country and RNSR
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.4%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 24
- Watchers: 5
- Forks: 1
- Open Issues: 8
- Releases: 48
Metadata Files
README.md
Affiliation matcher
Goal
The affiliation matcher aims to automatically align a raw affiliation with different reference systems, including countries, GRID and ROR, and, specifically for French affiliations, RNSR.
Methodology
The methodology is fully explained in a publication freely available on HAL: https://hal.archives-ouvertes.fr/hal-03365806.
Run it locally
:warning: Please use a docker-compose version from 1.27.0 up to 1.29.2.
```shell
git clone git@github.com:dataesr/affiliation-matcher.git
cd affiliation-matcher
make docker-build start
```
Wait for Elasticsearch to be up, then run:
```shell
make load
```
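The "wait for Elasticsearch to be up" step can be automated. The following is only a minimal sketch, not part of the project: the helper names `is_up` and `wait_until_up` are hypothetical, and it assumes Elasticsearch answers plain HTTP on http://localhost:9200/ as listed among the local services.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if an HTTP GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.
        return False


def wait_until_up(url, retries=30, delay=2.0, probe=is_up):
    """Poll `url` until `probe` succeeds or `retries` attempts are used up."""
    for attempt in range(retries):
        if probe(url):
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_up("http://localhost:9200/")` before `make load` returns `True` as soon as the service answers, and `False` if it never does within the allotted attempts. The `probe` parameter exists only so the polling loop can be exercised without a running cluster.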
In your browser, you now have :
- Elasticsearch : http://localhost:9200/
- RabbitMQ : http://localhost:9181/
- Matcher : http://localhost:5004/
In Python, you can call the matcher this way:
```python
import requests

# The matcher API is served locally on port 5004.
url = 'http://localhost:5004/match'
r = requests.post(url, json={
    "type": "ror",
    "name": "Paris Dauphine University",
    "city": "Paris",
    "country": "France",
    "verbose": False,
})
r.json()
```
For ROR, available criteria are: id, grid_id, name, city, country, supervisor_name, acronym, city_zone_emploi, city_nuts_level2, web_url, web_domain. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_ror.py
For RNSR, available criteria are: year, id, code_number, acronym, name, supervisor_name, supervisor_acronym, zone_emploi, city, web_url. Default strategies are detailed in https://github.com/dataesr/affiliation-matcher/blob/master/project/server/main/match_rnsr.py
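To avoid sending a criterion that a matcher does not know, the request body can be checked on the client side first. This is a hedged sketch: `build_criteria_payload` and `CRITERIA_BY_TYPE` are hypothetical names, the underscore spelling of the criteria is an assumption, and how the API itself reacts to unknown keys is not documented here.

```python
# Criteria names copied from the lists above; the underscore spelling
# (e.g. "grid_id", "web_url") is an assumption.
CRITERIA_BY_TYPE = {
    "ror": {
        "id", "grid_id", "name", "city", "country", "supervisor_name",
        "acronym", "city_zone_emploi", "city_nuts_level2", "web_url",
        "web_domain",
    },
    "rnsr": {
        "year", "id", "code_number", "acronym", "name", "supervisor_name",
        "supervisor_acronym", "zone_emploi", "city", "web_url",
    },
}


def build_criteria_payload(matcher_type, verbose=False, **criteria):
    """Build a /match JSON body from keyword criteria, rejecting unknown ones."""
    if matcher_type not in CRITERIA_BY_TYPE:
        raise ValueError(f"Unsupported matcher type: {matcher_type!r}")
    unknown = sorted(set(criteria) - CRITERIA_BY_TYPE[matcher_type])
    if unknown:
        raise ValueError(f"Unknown criteria for {matcher_type}: {unknown}")
    return {"type": matcher_type, "verbose": verbose, **criteria}


payload = build_criteria_payload(
    "ror", name="Paris Dauphine University", city="Paris", country="France"
)
```

The resulting `payload` can then be passed as the `json` argument of `requests.post`, as in the call shown earlier.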
Run unit tests
```shell
make test
```
Build docker image
```shell
make docker-build
```
Build python package
To generate the tarball package into the dist folder:
```shell
make python-build
```
To install the generated package into your project:
```shell
pip install /path/to/your/package.tar.gz
```
Then import the package into your Python file:
```python
# A hyphen is not valid in a Python module name; the underscore form is an assumption.
import affiliation_matcher
```
Release
Releases follow semantic versioning (semver).
To create a new release:
```shell
make release VERSION=x.x.x
```
API
Match a single query /match
Query the API, optionally setting your own strategies:
curl "YOUR_API_IP/match" -X POST -d '{"type": "YOUR_TYPE", "query": "YOUR_QUERY", "strategies": "YOUR_STRATEGIES", "year": "YOUR_YEAR"}'
YOUR_TYPE is optional, has to be a string and can be one of:
- "country"
- "grid"
- "rnsr"
- "ror"

By default, YOUR_TYPE is equal to "rnsr".
YOUR_QUERY is mandatory, has to be a string and is your affiliation text.
For example: IPAG Institut de Planétologie et d'Astrophysique de Grenoble.
YOUR_STRATEGIES is optional and has to be a 3-dimensional array of criteria (see the Criteria section).
For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].
YOUR_YEAR is optional, has to be a string, and can be used only with the "rnsr" matcher type.
For example: 1998.
By default, YOUR_YEAR is not set, i.e. the query is matched over all years.
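The rules above can be sketched as a small helper that assembles the JSON body for "/match". This is an illustration only: `build_match_body` is a hypothetical name, and the checks mirror the constraints described in this section rather than the server's actual validation.

```python
import json

MATCHER_TYPES = ("country", "grid", "rnsr", "ror")


def build_match_body(query, matcher_type=None, strategies=None, year=None):
    """Assemble a /match request body following the documented constraints."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is mandatory and has to be a non-empty string")
    body = {"query": query}
    if matcher_type is not None:
        if matcher_type not in MATCHER_TYPES:
            raise ValueError(f"type must be one of {MATCHER_TYPES}")
        body["type"] = matcher_type
    if strategies is not None:
        body["strategies"] = strategies
    if year is not None:
        # When the type is omitted, the documented default is "rnsr".
        if (matcher_type or "rnsr") != "rnsr":
            raise ValueError('year can only be used with the "rnsr" matcher type')
        body["year"] = str(year)  # the year has to be sent as a string
    return body


body = build_match_body(
    "IPAG Institut de Planétologie et d'Astrophysique de Grenoble",
    matcher_type="rnsr",
    year=1998,
)
encoded = json.dumps(body)  # string suitable for curl's -d option
```

Leaving optional fields out of the body, rather than sending empty values, lets the server apply its own defaults.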
Match multiple queries /match_list
curl "YOUR_API_IP/match_list" -X POST -d '{"match_types": "YOUR_TYPES", "affiliations": "YOUR_AFFILIATIONS"}'
YOUR_TYPES is optional, has to be a list of strings and can contain any of:
- "country"
- "grid"
- "rnsr"
- "ror"

By default, YOUR_TYPES is equal to ["grid", "rnsr"].
YOUR_AFFILIATIONS is optional and has to be a list of strings. For example: `["affiliation_01", "affiliation_02"]`.
By default, YOUR_AFFILIATIONS is equal to [].
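A body for "/match_list" can be assembled the same way. The sketch below is an assumption-laden illustration: `build_match_list_body` is a hypothetical helper, and it only enforces the types described above on the client side.

```python
MATCHER_TYPES = ("country", "grid", "rnsr", "ror")


def build_match_list_body(affiliations=None, match_types=None):
    """Assemble a /match_list request body; omitted fields use server defaults."""
    body = {}
    if match_types is not None:
        invalid = [t for t in match_types if t not in MATCHER_TYPES]
        if invalid:
            raise ValueError(f"Unknown matcher types: {invalid}")
        body["match_types"] = list(match_types)
    if affiliations is not None:
        # A bare string would silently be treated as a list of characters.
        if isinstance(affiliations, str) or not all(
            isinstance(a, str) for a in affiliations
        ):
            raise TypeError("affiliations has to be a list of strings")
        body["affiliations"] = list(affiliations)
    return body


body = build_match_list_body(
    affiliations=["affiliation_01", "affiliation_02"],
    match_types=["grid", "rnsr"],
)
```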
Criteria
Here is a list of the criteria available for the country matcher:
- country_alpha3
- country_name
- country_subdivision_code
- country_subdivision_name

Here is a list of the criteria available for the grid matcher:
- grid_acronym
- grid_acronym_unique
- grid_cities_by_region [indirect]
- grid_city
- grid_country
- grid_country_code
- grid_department
- grid_id
- grid_name
- grid_name_unique
- grid_parent
- grid_region

Here is a list of the criteria available for the rnsr matcher:
- rnsr_acronym
- rnsr_city
- rnsr_code_number
- rnsr_code_prefix
- rnsr_country_code
- rnsr_id
- rnsr_name
- rnsr_name_txt
- rnsr_supervisor_acronym
- rnsr_supervisor_name
- rnsr_urban_unit
- rnsr_web_url
- rnsr_year
- rnsr_zone_emploi [indirect]

Here is a list of the criteria available for the ror matcher:
- ror_acronym
- ror_acronym_unique
- ror_city
- ror_country
- ror_country_code
- ror_grid_id
- ror_id
- ror_name
- ror_name_unique
- You can combine criteria to create a strategy.
- You can combine strategies to create a family of strategies.
- You can then combine families of strategies to create the final object.
- This final object, strategies, is a 3-dimensional array that you pass as an argument to the "/match" API endpoint. For example: [[["grid_name", "grid_country"], ["grid_name", "grid_country_code"]]].
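The nesting described above (criteria, then strategies, then families) can be sketched in Python as follows. The helper names `make_family` and `validate_strategies` are hypothetical, and the shape check reflects only the structure described here, not the matcher's internal validation.

```python
from typing import Iterable, List


def make_family(*strategies: Iterable[str]) -> List[List[str]]:
    """Group several strategies (each a list of criteria) into one family."""
    return [list(strategy) for strategy in strategies]


def validate_strategies(strategies) -> None:
    """Raise ValueError unless `strategies` is a list of lists of lists of str."""
    if not isinstance(strategies, list) or not strategies:
        raise ValueError("expected a non-empty list of families")
    for family in strategies:
        if not isinstance(family, list) or not family:
            raise ValueError("each family must be a non-empty list of strategies")
        for strategy in family:
            if not isinstance(strategy, list) or not strategy:
                raise ValueError("each strategy must be a non-empty list of criteria")
            if not all(isinstance(criterion, str) for criterion in strategy):
                raise ValueError("each criterion must be a string")


# One family holding two strategies, each combining two grid criteria.
family = make_family(
    ["grid_name", "grid_country"],
    ["grid_name", "grid_country_code"],
)
strategies = [family]
validate_strategies(strategies)
```

Checking the depth before sending the request catches the common mistake of passing a flat list of criteria where three levels are expected.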
Results
| matcher | precision | recall |
| ----- | ----- | ----- |
| country | 0.9953 | 0.9690 |
| grid | 0.7946 | 0.5944 |
| rnsr | 0.9654 | 0.8192 |
| ror | 0.8891 | 0.2356 |

(TBC ???)
Owner
- Name: #dataESR
- Login: dataesr
- Kind: organization
- Repositories: 62
- Profile: https://github.com/dataesr
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite our publication as below."
title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
authors:
  - family-names: "L'Hôte"
    given-names: "Anne"
    orcid: "https://orcid.org/0000-0003-1353-5584"
  - family-names: "Jeangirard"
    given-names: "Eric"
    orcid: "https://orcid.org/0000-0002-3767-7125"
preferred-citation:
  type: unpublished
  title: "Using Elasticsearch for entity recognition in affiliation disambiguation"
  authors:
    - family-names: "L'Hôte"
      given-names: "Anne"
      orcid: "https://orcid.org/0000-0003-1353-5584"
    - family-names: "Jeangirard"
      given-names: "Eric"
      orcid: "https://orcid.org/0000-0002-3767-7125"
  url: "https://hal.science/hal-03365806"
  notes: "working paper or preprint"
  year: 2021
  month: Oct,
  keywords:
    - Elasticsearch
    - affiliation disambiguation
    - entity recognition
    - open science
  start: 1 # First page number
  end: 12 # Last page number
version: 1.0.0
date-released: 2021-10-07
GitHub Events
Total
- Release event: 1
- Watch event: 2
- Push event: 5
- Create event: 1
Last Year
- Release event: 1
- Watch event: 2
- Push event: 5
- Create event: 1
Dependencies
- actions/checkout v3 composite
- dataesr/mm-notifier-action v1 composite
- svenstaro/upload-release-action v2 composite
- actions/checkout v3 composite
- actions/setup-python v2 composite
- python 3.6 build
- dataesr/dashboard-crawler 1.1
- dataesr/es_icu 7.12.0
- redis 5.0.7-alpine
- Flask ==1.1.1
- Flask-Bootstrap ==3.3.7.1
- XlsxWriter ==1.0.4
- beautifulsoup4 ==4.8.2
- elasticsearch ==7.8.0
- elasticsearch-dsl ==7.2.1
- fuzzywuzzy ==0.18.0
- geopy ==2.1.0
- lxml ==4.9.1
- pandas ==0.25.3
- pycountry ==20.7.3
- pytest ==6.2.3
- pytest-mock ==3.5.1
- python-Levenshtein ==0.21.1
- redis ==3.5.3
- requests ==2.25.0
- requests-mock ==1.9.2
- rq ==1.9.0
- xlrd ==1.1.0
- Flask ==1.1.1
- Flask-Bootstrap ==3.3.7.1
- XlsxWriter ==1.0.4
- beautifulsoup4 ==4.8.2
- elasticsearch ==7.8.0
- elasticsearch-dsl ==7.2.1
- geopy ==2.1.0
- lxml ==4.9.1
- pandas ==0.25.3
- pycountry ==20.7.3
- redis ==3.3.11
- requests ==2.20.0
- rq ==1.1.0
- xlrd ==1.1.0