https://github.com/centrefordigitalhumanities/textminer

A script to detect named entities and store them in an Elasticsearch annotated_text field

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (7.9%) to scientific vocabulary

Keywords

annotation elasticsearch ner spacy

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: CentreForDigitalHumanities
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 45.9 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 0

Topics

annotation elasticsearch ner spacy

Created over 2 years ago · Last pushed 12 months ago

Metadata Files

Readme License

TextMiNER

TextMiNER is a collection of scripts to perform named entity recognition (NER) in text, using the Python library spaCy. The detected named entities are saved in an Elasticsearch annotated-text field.
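As a rough illustration of the target format (not necessarily how TextMiNER implements it), the sketch below turns entity spans into the markdown-like `[text](LABEL)` syntax that Elasticsearch's annotated-text field expects. The function name `annotate` and the sample sentence are hypothetical. With spaCy, the spans could be collected as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
from urllib.parse import quote


def annotate(text, entities):
    """Wrap each (start, end, label) character span in [text](label) markup.

    Hypothetical helper: overlapping spans are skipped, and labels are
    URL-encoded because annotation values must be URL-encoded.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            continue  # skip a span that overlaps the previous one
        parts.append(text[cursor:start])
        parts.append(f"[{text[start:end]}]({quote(label, safe='')})")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


# Spans as spaCy might report them for this sentence (assumed, not measured).
example = annotate("Ada lives in Utrecht.", [(0, 3, "PERSON"), (13, 20, "GPE")])
print(example)  # [Ada](PERSON) lives in [Utrecht](GPE).
```

This sketch does not escape square brackets that already occur in the source text; a real implementation would need to decide how to handle them.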

Requirements

Python 3.10 or newer
Elasticsearch 8 or newer
Elasticsearch's annotated-text plugin (mapper-annotated-text). To install it, run: sudo bin/elasticsearch-plugin install mapper-annotated-text
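Once the plugin is installed, an index can declare a field of type `annotated_text`. The following is a minimal sketch of such a mapping; the helper name and the field name `content_ner` are made up for illustration, and the mapping TextMiNER actually creates may differ.

```python
def annotated_text_mapping(field_name):
    """Return index mappings that declare `field_name` as annotated_text."""
    return {"properties": {field_name: {"type": "annotated_text"}}}


mappings = annotated_text_mapping("content_ner")  # hypothetical field name

# With the official Python client (elasticsearch 8.x), this could be applied as:
#   from elasticsearch import Elasticsearch
#   client = Elasticsearch("http://localhost:9200")
#   client.indices.create(index="test", mappings=mappings)
```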

Docker

This repository contains Docker images and a docker-compose file for running and testing the scripts locally. docker-compose requires an .env file, created next to docker-compose.yaml, with the following values:

ES_HOST=elasticsearch
ELASTIC_ROOT_PASSWORD={password-of-your-choice}

Usage

Environment

Before running the script, set the following environment variables as needed: ES_HOST if Elasticsearch does not run on localhost, and API_ID, API_KEY and CERTS_LOCATION if you access an Elasticsearch cluster using an API key.
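To make the role of these variables concrete, here is a sketch of how they might be translated into connection arguments for the official Python client. It assumes port 9200, HTTPS whenever an API key is used, and that CERTS_LOCATION is the path to a CA certificate file; none of these details are confirmed by the README, and `client_kwargs` is a hypothetical helper.

```python
import os


def client_kwargs(env=None):
    """Build keyword arguments for elasticsearch.Elasticsearch(...).

    Assumptions (not from the README): port 9200, HTTPS for API-key
    access, and CERTS_LOCATION pointing at a CA certificate file.
    """
    env = os.environ if env is None else env
    host = env.get("ES_HOST", "localhost")
    api_id, api_key = env.get("API_ID"), env.get("API_KEY")
    if api_id and api_key:
        kwargs = {
            "hosts": [f"https://{host}:9200"],
            "api_key": (api_id, api_key),
        }
        if env.get("CERTS_LOCATION"):
            kwargs["ca_certs"] = env["CERTS_LOCATION"]
        return kwargs
    return {"hosts": [f"http://{host}:9200"]}


local = client_kwargs({})
remote = client_kwargs(
    {"ES_HOST": "es.example.org", "API_ID": "id", "API_KEY": "key",
     "CERTS_LOCATION": "/certs/ca.crt"}
)
# Usage (requires the elasticsearch package): Elasticsearch(**remote)
```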

Run the script (without Docker)

To analyze data from an Elasticsearch index with spaCy and save the results back into an annotated-text field, change to the code directory (cd code) and run the following command:

python process_documents.py -i {index_name} -f {field_name} -l {language_code} -o {output_dir}

To run this for an English-language corpus indexed as "test", with its text stored in the field "content", you could run:

python process_documents.py -i test -f content -l english
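The flags used above could be parsed along the following lines. This is only a sketch: the repository lists click as a dependency, whereas this example uses the standard library's argparse to stay self-contained, and the destination names and the choice of which flags are required are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -i, -f, -l and -o flags."""
    parser = argparse.ArgumentParser(prog="process_documents.py")
    parser.add_argument("-i", dest="index", required=True,
                        help="name of the Elasticsearch index to read from")
    parser.add_argument("-f", dest="field", required=True,
                        help="name of the field containing the text")
    parser.add_argument("-l", dest="language", required=True,
                        help="language of the corpus, e.g. english")
    parser.add_argument("-o", dest="output_dir", default=None,
                        help="optional output directory")
    return parser


# Equivalent to: python process_documents.py -i test -f content -l english
args = build_parser().parse_args(["-i", "test", "-f", "content", "-l", "english"])
print(args.index, args.field, args.language, args.output_dir)
```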

Run the script locally (with Docker)

Alternatively, to run with Docker, without changing to the code directory first, run:

docker-compose run --rm backend python process_documents.py -i {index_name} -f {field_name} -l {language}

Owner

Name: Centre for Digital Humanities
Login: CentreForDigitalHumanities
Kind: organization
Email: cdh@uu.nl
Location: Netherlands

Website: https://cdh.uu.nl/
Repositories: 39
Profile: https://github.com/CentreForDigitalHumanities

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

GitHub Events

Total

Issues event: 4
Member event: 1
Push event: 2
Create event: 1

Last Year

Issues event: 4
Member event: 1
Push event: 2
Create event: 1

Committers

Last synced: 11 months ago

All Time

Total Commits: 29
Total Committers: 1
Avg Commits per committer: 29.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 16
Committers: 1
Avg Commits per committer: 16.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
BeritJanssen	b**n@g**m	29

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 7
Total pull requests: 1
Average time to close issues: 6 months
Average time to close pull requests: 1 minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.43
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: 4 months
Average time to close pull requests: 1 minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.5
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

BeritJanssen (8)

Pull Request Authors

BeritJanssen (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

Dockerfile docker

python 3.10-buster build

requirements.in pypi

click *
elasticsearch *
pytest *
spacy *

requirements.txt pypi

annotated-types ==0.5.0
blis ==0.7.11
catalogue ==2.0.10
certifi ==2023.7.22
charset-normalizer ==3.3.0
click ==8.1.7
cloudpathlib ==0.15.1
confection ==0.1.3
cymem ==2.0.8
elastic-transport ==8.4.1
elasticsearch ==8.10.0
exceptiongroup ==1.1.3
idna ==3.4
iniconfig ==2.0.0
jinja2 ==3.1.2
langcodes ==3.3.0
markupsafe ==2.1.3
murmurhash ==1.0.10
numpy ==1.26.0
packaging ==23.2
pathy ==0.10.2
pluggy ==1.3.0
preshed ==3.0.9
pydantic ==2.4.2
pydantic-core ==2.10.1
pytest ==7.4.2
requests ==2.31.0
smart-open ==6.4.0
spacy ==3.7.1
spacy-legacy ==3.0.12
spacy-loggers ==1.0.5
srsly ==2.4.8
thinc ==8.2.1
tomli ==2.0.1
tqdm ==4.66.1
typer ==0.9.0
typing-extensions ==4.8.0
urllib3 ==1.26.17
wasabi ==1.1.2
weasel ==0.3.2
