nedextract
Extract information on persons and organisation from Dutch PDF files
https://github.com/transparency-in-the-non-profit-sector/nedextract
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 8 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.0%) to scientific vocabulary
Keywords
Repository
Extract information on persons and organisation from Dutch PDF files
Basic Info
- Host: GitHub
- Owner: Transparency-in-the-non-profit-sector
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://research-software-directory.org/projects/transparency-in-non-profit
- Size: 5.32 MB
Statistics
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 2
- Releases: 4
Topics
Metadata Files
README.md
Nedextract
nedextract is being developed to extract specific information from annual report PDF files that are written in Dutch. Currently it tries to do the following:
- Read the PDF file, and perform Named Entity Recognition (NER) using Stanza to extract all persons and all organisations named in the document, which are then passed to the processing steps listed below.
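As an illustration of this step, the following is a minimal sketch of collecting person and organisation candidates with Stanza's Dutch NER pipeline; the sample sentence and variable names are assumptions for the example and this is not nedextract's internal code.

```python
# Minimal sketch of Dutch NER with Stanza; the sample text is illustrative.
import stanza

stanza.download("nl")  # download the Dutch models once
nlp = stanza.Pipeline(lang="nl", processors="tokenize,ner")

doc = nlp("Jane Doe is directeur van Stichting Voorbeeld.")
persons = [ent.text for ent in doc.entities if ent.type == "PER"]
orgs = [ent.text for ent in doc.entities if ent.type == "ORG"]
print(persons, orgs)  # e.g. ['Jane Doe'] ['Stichting Voorbeeld']
```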
Extract persons: using a rule-based method that searches for specific keywords, this module tries to identify:
- Ambassadors
- People in important positions in the organisation. The code tries to determine a main job description (e.g. director or board) and a sub-job description (e.g. chairman or treasurer). Note that these positions are identified and output in Dutch.
The main jobs that are considered are:
- directeur
- raad van toezicht
- bestuur
- ledenraad
- kascommissie
- controlecommissie

The sub-positions that are considered are:
- directeur
- voorzitter
- vicevoorzitter
- lid
- penningmeester
- commissaris
- adviseur
For each person that is identified, the code searches for keywords in the sentences in which the name appears, or in the sentence directly before or after those, to determine the main position. Sub-positions are determined based on words appearing directly before or after the name of a person for whom a main job has been determined. For both the main jobs and the sub-positions, various spellings are considered in the keywords. Before the job identification starts, name deduplication is performed by creating lists of names that (likely) refer to one and the same person (e.g. Jane Doe and J. Doe).
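As an illustration of the deduplication idea, here is a minimal, hypothetical sketch of grouping surface forms that likely refer to the same person; the helper name and matching rule are assumptions, not nedextract's actual implementation.

```python
# Hypothetical sketch: decide whether two name variants likely refer to the
# same person (e.g. "Jane Doe" and "J. Doe"); not nedextract's actual code.
def likely_same_person(name_a: str, name_b: str) -> bool:
    parts_a, parts_b = name_a.split(), name_b.split()
    # Require matching surnames (last token).
    if parts_a[-1].lower() != parts_b[-1].lower():
        return False
    # Accept when one given name is an initial/prefix of the other.
    first_a = parts_a[0].rstrip(".").lower()
    first_b = parts_b[0].rstrip(".").lower()
    return first_a.startswith(first_b) or first_b.startswith(first_a)

print(likely_same_person("Jane Doe", "J. Doe"))    # True
print(likely_same_person("Jane Doe", "John Doe"))  # False
```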
Extract related organisations:
- After Stanza NER collects all candidates for mentioned organisations, post-processing tries to determine which of these candidates are most likely true organisations. This is done by considering: how often the term is mentioned in the document, how often the term was identified as an organisation by Stanza NER, whether the term contains keywords that make it likely to be a true positive, and whether the term contains keywords that make it likely to be a false positive. For candidates that are mentioned only once in the text, it is also considered whether the term by itself (i.e. without context) is identified as an organisation by Stanza NER. Additionally, for candidates that are mentioned only once, an extra check is performed to determine whether part of the candidate organisation appears in the list of organisations already identified as true, and whether that true organisation is common within the text. In that case the candidate is considered 'already part of another true organisation' and is not added to the true organisations. This is done because NER sometimes identifies an additional random word as being part of an organisation's name. A simplified sketch of these post-processing steps is shown after this list.
- For those terms that are identified as true organisations, the number of occurrences of each of them in the document (in its entirety, enclosed by word boundaries) is determined.
- Finally, an attempt is made to match the identified organisations against a list of provided organisations, passed via the anbis argument, to collect their rsin number for further analysis. An empty file ./Data/Anbis_clean.csv is available that serves as a template for such a file. Matching is attempted on both currentStatutoryName and shortBusinessName. Only full matches (independent of capitals) and full matches with the additional term 'Stichting' at the start of the identified organisation (again independent of capitals) are considered. Fuzzy matching is not used here, because during testing it was found to lead to a significant number of false positives.
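The following is a simplified, hypothetical sketch of the steps above (candidate filtering, whole-word counting, and anbis matching); the function names, signals, and column handling are assumptions for illustration and not nedextract's actual code.

```python
# Hypothetical sketches of the post-processing steps described above.
import re
import pandas as pd

def is_true_org(term, n_mentions, n_ner_hits, pos_keywords, neg_keywords):
    """Combine simple signals: frequency, NER agreement, and keyword hints."""
    score = 0
    score += 1 if n_mentions > 1 else 0                 # mentioned repeatedly
    score += 1 if n_ner_hits >= n_mentions / 2 else 0   # NER usually agrees
    score += 1 if any(k in term.lower() for k in pos_keywords) else 0
    score -= 1 if any(k in term.lower() for k in neg_keywords) else 0
    return score >= 2

def count_occurrences(org, text):
    """Count occurrences of the full name, enclosed by word boundaries."""
    return len(re.findall(r"\b" + re.escape(org) + r"\b", text))

def match_rsin(org, anbis: pd.DataFrame):
    """Full case-insensitive match on currentStatutoryName/shortBusinessName,
    also allowing a leading 'Stichting' on the identified organisation."""
    candidates = {org.lower(), "stichting " + org.lower()}
    for _, row in anbis.iterrows():
        for col in ("currentStatutoryName", "shortBusinessName"):
            if str(row[col]).lower() in candidates:
                return row["rsin"]
    return None
```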
Classify the sector in which the organisation is active: the code uses a pre-trained model to assign one of eight sectors. The model is trained on the 2020 annual report PDF files of CBF-certified organisations.
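As an illustration of the general approach (a classifier plus tf-idf vectors and a label encoding, as mentioned under Usage), here is a minimal, hypothetical scikit-learn sketch; the training texts, sector labels, and classifier choice are assumptions for the example, not the shipped model.

```python
# Hypothetical sketch of training a sector classifier on report texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

texts = ["...annual report text A...", "...annual report text B..."]  # training docs
sectors = ["health", "education"]                                     # example labels

labels = LabelEncoder()
y = labels.fit_transform(sectors)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB().fit(X, y)

# Predict the sector of a new report.
new_X = vectorizer.transform(["...new report text..."])
print(labels.inverse_transform(model.predict(new_X)))
```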
Prerequisites
- Python 3.8, 3.9, 3.10, 3.11
- Poppler; poppler is a prerequisite for installing pdftotext, and instructions can be found here: https://pypi.org/project/pdftotext/. Please note that to install poppler on a Windows machine using conda-forge, the Microsoft Visual C++ build tools have to be installed first.
Installation
nedextract can be installed using pip:
```console
pip install nedextract
```
The required packages that are installed are: FuzzyWuzzy, NumPy, openpyxl, poppler, pandas, pdftotext, python-Levenshtein, scikit-learn, Stanza, and xlsxwriter.[^1]
[^1]: If you encounter problems with the installation, these often arise from the installation of poppler, which is a requirement for pdftotext. Help can generally be found on the pdftotext PyPI page.
Usage
Input
The full pipeline can be executed from the command line using:
```console
python3 -m nedextract.run_nedextract
```
Followed by one or more of the following arguments:
- Input data, one or more pdf files, using one of the following arguments:
  - -f file: path to a single pdf file
  - -d directory: path to a directory containing pdf files
  - -u url: link to a pdf file
  - -uf urlf: a text file containing one or multiple urls to pdf files. The text file should contain one url per line, without headers and footers.
- -t tasks (optional): can be 'people', 'orgs', 'sectors' or 'all'. Indicates which tasks to perform. Defaults to 'people'.
- -a anbis (optional): path to a .csv file which will be used with the orgs task. The file should contain (at least) the columns rsin, currentStatutoryName, and shortBusinessName. An empty example file, which is also the default file, can be found in the folder 'Data'. The data in the file will be used to try to match identified organisations, in order to collect the rsin numbers provided in the file.
- model (-m), labels (-l), vectors (-v) (optional): each referring to a path containing a pre-trained classifier model, label encoding and tf-idf vectors respectively. These will be used for the sector classification task. A model can be trained using the classify_organisation.train function.
- -wo write_output: TRUE/FALSE, defaults to TRUE, setting whether to write the output data to an excel file.
For example:
```console
python3 -m nedextract.run_nedextract -f pathtomypdf.pdf -t all -a ansbis.csv
```
Returns:
Three dataframes: one for the 'people' task, one for the 'sectors' task, and one for the 'orgs' task. If write_output=True, the gathered information is written to auto-named xlsx files in the folder Output. The output of the different tasks is written to separate xlsx files with the following naming convention:
- './Output/outputYYYYMMDDHHMMSSpeople.xlsx'
- './Output/outputYYYYMMDDHHMMSSrelated_organisations.xlsx'
- './Output/outputYYYYMMDDHHMMSSgeneral.xlsx'
Here YYYYMMDD and HHMMSS refer to the date and time at which the execution started.
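For example, an output file can be loaded back with pandas for further analysis; this is a minimal sketch, and the timestamp in the file name below is only an illustration of the naming convention described above.

```python
# Read an auto-named 'people' output file back into a dataframe.
import pandas as pd

people = pd.read_excel("./Output/output20230901120000people.xlsx")
print(people.head())
```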
Tutorials
Tutorials on the full pipeline and (individual) useful analysis tools can be found in the Tutorials folder.
Contributing
If you want to contribute to the development of nedextract,
have a look at the contribution guidelines.
How to cite us
If you use this package for your scientific work, please consider citing it as:
Ootes, L.S. (2023). nedextract ([VERSION YOU USED]). Zenodo. https://doi.org/10.5281/zenodo.8286578
See also the Zenodo page for exporting the citation to BibTeX and other formats.
Credits
This package was created with Cookiecutter and the NLeSC/python-template.
Owner
- Name: Transparency in the non-profit sector
- Login: Transparency-in-the-non-profit-sector
- Kind: organization
- Website: https://research-software-directory.org/projects/transparency-in-non-profit
- Repositories: 1
- Profile: https://github.com/Transparency-in-the-non-profit-sector
Software written for the 'Transparency' project by Vrije Universiteit Amsterdam and the Netherlands eScience Center
Citation (CITATION.cff)
# YAML 1.2
---
cff-version: "1.1.0"
title: "nedextract"
authors:
  - family-names: Ootes
    given-names: Laura
    orcid: "https://orcid.org/0000-0002-2800-8309"
date-released: 2022-03-16
doi: 10.0000/FIXME
version: "0.2.0"
repository-code: "https://github.com/Transparency-in-the-non-profit-sector/nedextract"
keywords:
- NLP
- Philanthropy
message: "If you use this software, please cite it using these metadata."
license: Apache-2.0
GitHub Events
Total
- Create event: 1
Last Year
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
Packages
- Total packages: 1
- Total downloads: pypi 13 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: nedextract
extract specific information from annual report files
- Homepage: https://github.com/Transparency-in-the-non-profit-sector/nedextract
- Documentation: https://nedextract.readthedocs.io/
- License: Apache Software License
- Latest release: 0.2.1 (published almost 2 years ago)
Rankings
Maintainers (1)
Dependencies
- fuzzywuzzy ==0.18.0
- numpy >=1.21.5
- openpyxl ==3.0.9
- pandas >=1.3.5
- pdftotext ==2.2.2
- python-Levenshtein ==0.12.2
- scikit-learn ==1.0.2
- stanza ==1.3.0
- xlsxwriter *
- actions/checkout v1 composite
- actions/checkout v2 composite
- conda-incubator/setup-miniconda v2 composite
- coverallsapp/github-action v2 composite
- actions/checkout v2 composite
- citation-file-format/cffconvert-github-action main composite
- actions/checkout main composite
- gaurav-nelson/github-action-markdown-link-check v1 composite
- poppler *