nedextract
Extract information on persons and organisation from Dutch PDF files
https://github.com/transparency-in-the-non-profit-sector/nedextract
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 8 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary
Keywords
Repository
Extract information on persons and organisation from Dutch PDF files
Basic Info
- Host: GitHub
- Owner: Transparency-in-the-non-profit-sector
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://research-software-directory.org/projects/transparency-in-non-profit
- Size: 5.32 MB
Statistics
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 2
- Releases: 4
Topics
Metadata Files
README.md
Nedextract
nedextract is being developed to extract specific information from annual report PDF files that are written in Dutch. Currently it tries to do the following:
- Read the PDF file, and perform Named Entity Recognition (NER) using Stanza to extract all persons and all organisations named in the document, which are then passed to the processing steps listed below.
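As an illustration of this step, the following is a minimal sketch of collecting person and organisation candidates with Stanza's Dutch NER pipeline; the sample sentence and variable names are assumptions for the example and this is not nedextract's internal code.

```python
# Minimal sketch of Dutch NER with Stanza; the sample text is illustrative.
import stanza

stanza.download("nl")  # download the Dutch models once
nlp = stanza.Pipeline(lang="nl", processors="tokenize,ner")

doc = nlp("Jane Doe is directeur van Stichting Voorbeeld.")
persons = [ent.text for ent in doc.entities if ent.type == "PER"]
orgs = [ent.text for ent in doc.entities if ent.type == "ORG"]
print(persons, orgs)  # e.g. ['Jane Doe'] ['Stichting Voorbeeld']
```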
Extract persons: using a rule-based method that searches for specific keywords, this module tries to identify:
- Ambassadors
- People in important positions in the organisation. The code tries to determine a main job description (e.g. director or board) and a sub-job description (e.g. chairman or treasurer). Note that these positions are identified and output in Dutch.
The main jobs that are considered are:
- directeur
- raad van toezicht
- bestuur
- ledenraad
- kascommissie
- controlecommissie

The sub-positions that are considered are:
- directeur
- voorzitter
- vicevoorzitter
- lid
- penningmeester
- commissaris
- adviseur
For each person that is identified, the code searches for keywords in the sentences in which the name appears, or in the sentence directly before or after those, to determine the main position. Sub-positions are determined based on words appearing directly before or after the name of a person for whom a main job has been determined. For both the main jobs and the sub-positions, various spellings are considered in the keywords. Before the job identification starts, name deduplication is performed by creating lists of names that (likely) refer to one and the same person (e.g. Jane Doe and J. Doe).
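As an illustration of the deduplication idea, here is a minimal, hypothetical sketch of grouping surface forms that likely refer to the same person; the helper name and matching rule are assumptions, not nedextract's actual implementation.

```python
# Hypothetical sketch: decide whether two name variants likely refer to the
# same person (e.g. "Jane Doe" and "J. Doe"); not nedextract's actual code.
def likely_same_person(name_a: str, name_b: str) -> bool:
    parts_a, parts_b = name_a.split(), name_b.split()
    # Require matching surnames (last token).
    if parts_a[-1].lower() != parts_b[-1].lower():
        return False
    # Accept when one given name is an initial/prefix of the other.
    first_a = parts_a[0].rstrip(".").lower()
    first_b = parts_b[0].rstrip(".").lower()
    return first_a.startswith(first_b) or first_b.startswith(first_a)

print(likely_same_person("Jane Doe", "J. Doe"))    # True
print(likely_same_person("Jane Doe", "John Doe"))  # False
```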
Extract related organisations:
- After Stanza NER collects all candidates for mentioned organisations, post-processing tries to determine which of these candidates are most likely true organisations. This is done by considering: how often the term is mentioned in the document, how often the term was identified as an organisation by Stanza NER, whether the term contains keywords that make it likely to be a true positive, and whether the term contains keywords that make it likely to be a false positive. For candidates that are mentioned only once in the text, it is also considered whether the term by itself (i.e. without context) is identified as an organisation by Stanza NER. Additionally, for candidates that are mentioned only once, an extra check is performed to determine whether part of the candidate organisation appears in the list of organisations already identified as true, and whether that true organisation is common within the text. In that case the candidate is considered 'already part of another true organisation' and is not added to the true organisations. This is done because NER sometimes identifies an additional random word as being part of an organisation's name. A simplified sketch of these post-processing steps is shown after this list.
- For those terms that are identified as true organisations, the number of occurrences of each of them in the document (in its entirety, enclosed by word boundaries) is determined.
- Finally, an attempt is made to match the identified organisations against a list of provided organisations, passed via the anbis argument, to collect their rsin number for further analysis. An empty file ./Data/Anbis_clean.csv is available that serves as a template for such a file. Matching is attempted on both currentStatutoryName and shortBusinessName. Only full matches (independent of capitals) and full matches with the additional term 'Stichting' at the start of the identified organisation (again independent of capitals) are considered. Fuzzy matching is not used here, because during testing it was found to lead to a significant number of false positives.
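The following is a simplified, hypothetical sketch of the steps above (candidate filtering, whole-word counting, and anbis matching); the function names, signals, and column handling are assumptions for illustration and not nedextract's actual code.

```python
# Hypothetical sketches of the post-processing steps described above.
import re
import pandas as pd

def is_true_org(term, n_mentions, n_ner_hits, pos_keywords, neg_keywords):
    """Combine simple signals: frequency, NER agreement, and keyword hints."""
    score = 0
    score += 1 if n_mentions > 1 else 0                 # mentioned repeatedly
    score += 1 if n_ner_hits >= n_mentions / 2 else 0   # NER usually agrees
    score += 1 if any(k in term.lower() for k in pos_keywords) else 0
    score -= 1 if any(k in term.lower() for k in neg_keywords) else 0
    return score >= 2

def count_occurrences(org, text):
    """Count occurrences of the full name, enclosed by word boundaries."""
    return len(re.findall(r"\b" + re.escape(org) + r"\b", text))

def match_rsin(org, anbis: pd.DataFrame):
    """Full case-insensitive match on currentStatutoryName/shortBusinessName,
    also allowing a leading 'Stichting' on the identified organisation."""
    candidates = {org.lower(), "stichting " + org.lower()}
    for _, row in anbis.iterrows():
        for col in ("currentStatutoryName", "shortBusinessName"):
            if str(row[col]).lower() in candidates:
                return row["rsin"]
    return None
```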
Classify the sector in which the organisation is active: the code uses a pre-trained model to assign one of eight sectors. The model is trained on the 2020 annual report PDF files of CBF-certified organisations.
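As an illustration of the general approach (a classifier plus tf-idf vectors and a label encoding, as mentioned under Usage), here is a minimal, hypothetical scikit-learn sketch; the training texts, sector labels, and classifier choice are assumptions for the example, not the shipped model.

```python
# Hypothetical sketch of training a sector classifier on report texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

texts = ["...annual report text A...", "...annual report text B..."]  # training docs
sectors = ["health", "education"]                                     # example labels

labels = LabelEncoder()
y = labels.fit_transform(sectors)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB().fit(X, y)

# Predict the sector of a new report.
new_X = vectorizer.transform(["...new report text..."])
print(labels.inverse_transform(model.predict(new_X)))
```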
Prerequisites
- Python 3.8, 3.9, 3.10, 3.11
- Poppler; poppler is a prerequisite for installing pdftotext, and instructions can be found here: https://pypi.org/project/pdftotext/. Please note that to install poppler on a Windows machine using conda-forge, the Microsoft Visual C++ build tools have to be installed first.
Installation
nedextract can be installed using pip:
```console
pip install nedextract
```
The required packages that are installed are: FuzzyWuzzy, NumPy, openpyxl, poppler, pandas, pdftotext, python-Levenshtein, scikit-learn, Stanza, and xlsxwriter.[^1]
[^1]: If you encounter problems with the installation, these often arise from the installation of poppler, which is a requirement for pdftotext. Help can generally be found on the pdftotext PyPI page.
Usage
Input
The full pipeline can be executed from the command line using:
```console
python3 -m nedextract.run_nedextract
```
Followed by one or more of the following arguments:
- Input data, one or more pdf files, using one of the following arguments:
  - -f file: path to a single pdf file
  - -d directory: path to a directory containing pdf files
  - -u url: link to a pdf file
  - -uf urlf: a text file containing one or multiple urls to pdf files. The text file should contain one url per line, without headers and footers.
- -t tasks (optional): can be 'people', 'orgs', 'sectors' or 'all'. Indicates which tasks to perform. Defaults to 'people'.
- -a anbis (optional): path to a .csv file which will be used with the orgs task. The file should contain (at least) the columns rsin, currentStatutoryName, and shortBusinessName. An empty example file, which is also the default file, can be found in the folder 'Data'. The data in the file will be used to try to match identified organisations, in order to collect the rsin numbers provided in the file.
- model (-m), labels (-l), vectors (-v) (optional): each referring to a path containing a pre-trained classifier model, label encoding and tf-idf vectors respectively. These will be used for the sector classification task. A model can be trained using the classify_organisation.train function.
- -wo write_output: TRUE/FALSE, defaults to TRUE, setting whether to write the output data to an excel file.
For example:
```console
python3 -m nedextract.run_nedextract -f pathtomypdf.pdf -t all -a ansbis.csv
```
Returns:
Three dataframes: one for the 'people' task, one for the 'sectors' task, and one for the 'orgs' task. If write_output=True, the gathered information is written to auto-named xlsx files in the folder Output. The output of the different tasks is written to separate xlsx files with the following naming convention:
- './Output/outputYYYYMMDDHHMMSSpeople.xlsx'
- './Output/outputYYYYMMDDHHMMSSrelated_organisations.xlsx'
- './Output/outputYYYYMMDDHHMMSSgeneral.xlsx'
Here YYYYMMDD and HHMMSS refer to the date and time at which the execution started.
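For example, an output file can be loaded back with pandas for further analysis; this is a minimal sketch, and the timestamp in the file name below is only an illustration of the naming convention described above.

```python
# Read an auto-named 'people' output file back into a dataframe.
import pandas as pd

people = pd.read_excel("./Output/output20230901120000people.xlsx")
print(people.head())
```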
Tutorials
Tutorials on the full pipeline and (individual) useful analysis tools can be found in the Tutorials folder.
Contributing
If you want to contribute to the development of nedextract,
have a look at the contribution guidelines.
How to cite us
If you use this package for your scientific work, please consider citing it as:
Ootes, L.S. (2023). nedextract ([VERSION YOU USED]). Zenodo. https://doi.org/10.5281/zenodo.8286578
See also the Zenodo page for exporting the citation to BibTeX and other formats.
Credits
This package was created with Cookiecutter and the NLeSC/python-template.
Owner
- Name: Transparency in the non-profit sector
- Login: Transparency-in-the-non-profit-sector
- Kind: organization
- Website: https://research-software-directory.org/projects/transparency-in-non-profit
- Repositories: 1
- Profile: https://github.com/Transparency-in-the-non-profit-sector
Software written for the 'Transparency' project by Vrije Universiteit Amsterdam and the Netherlands eScience Center
Citation (CITATION.cff)
# YAML 1.2
---
cff-version: "1.1.0"
title: "nedextract"
authors:
  - family-names: Ootes
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-2800-8309"
date-released: 2022-03-16
doi: 10.0000/FIXME
version: "0.2.0"
repository-code: "https://github.com/Transparency-in-the-non-profit-sector/nedextract"
keywords:
- NLP
- Philanthropy
message: "If you use this software, please cite it using these metadata."
license: Apache-2.0
GitHub Events
Total
- Create event: 1
Last Year
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
Packages
- Total packages: 1
- Total downloads: pypi 13 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: nedextract
extract specific information from annual report files
- Homepage: https://github.com/Transparency-in-the-non-profit-sector/nedextract
- Documentation: https://nedextract.readthedocs.io/
- License: Apache Software License
- Latest release: 0.2.1 (published almost 2 years ago)
Rankings
Maintainers (1)
Dependencies
- fuzzywuzzy ==0.18.0
- numpy >=1.21.5
- openpyxl ==3.0.9
- pandas >=1.3.5
- pdftotext ==2.2.2
- python-Levenshtein ==0.12.2
- scikit-learn ==1.0.2
- stanza ==1.3.0
- xlsxwriter *
- actions/checkout v1 composite
- actions/checkout v2 composite
- conda-incubator/setup-miniconda v2 composite
- coverallsapp/github-action v2 composite
- actions/checkout v2 composite
- citation-file-format/cffconvert-github-action main composite
- actions/checkout main composite
- gaurav-nelson/github-action-markdown-link-check v1 composite
- poppler *