https://github.com/centrefordigitalhumanities/textminer
A script to detect named entities and store them in an Elasticsearch annotated_text field
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.9%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
TextMiNER
TextMiNER is a collection of scripts to perform named entity recognition (NER) in text, using the Python library spaCy. The detected named entities are saved in an Elasticsearch annotated-text field.
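As a rough illustration of what storing entities in an annotated-text field involves, the sketch below wraps entity spans in the `[text](value)` markup that Elasticsearch's annotated-text fields accept. It is not taken from the repository: the function name `to_annotated_text` and the choice to use the entity label as the annotation value are assumptions, and the input tuples merely mimic spaCy's `ent.start_char`, `ent.end_char` and `ent.label_` so that the example runs without a spaCy model.

```python
from urllib.parse import quote


def to_annotated_text(text, entities):
    """Wrap each (start, end, label) span in [text](label) markup.

    `entities` holds character offsets, comparable to spaCy's
    ent.start_char, ent.end_char and ent.label_ (hypothetical input shape).
    Overlapping spans are rejected rather than resolved.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            raise ValueError("overlapping entity spans are not supported")
        # Copy the plain text between the previous entity and this one.
        parts.append(text[cursor:start])
        # Annotation values are URL-encoded in annotated-text markup.
        parts.append(f"[{text[start:end]}]({quote(label, safe='')})")
        cursor = end
    # Copy whatever follows the last entity.
    parts.append(text[cursor:])
    return "".join(parts)


example = "Ada works in Utrecht."
annotated = to_annotated_text(example, [(0, 3, "PERSON"), (13, 20, "GPE")])
print(annotated)  # [Ada](PERSON) works in [Utrecht](GPE).
```

The sketch does not escape square brackets or parentheses that already occur in the source text; real input would need that handled before indexing.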
Requirements
- Python 3.10 or newer
- Elasticsearch 8 or newer
- Elasticsearch's annotated-text mapper plugin. To install, run:
sudo bin/elasticsearch-plugin install mapper-annotated-text
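Once the plugin is installed, an index can declare a field of type `annotated_text`. The following is a minimal sketch of such a mapping, built as a Python dictionary. The helper `build_mapping` and the field name `content_ner` are hypothetical; the README does not say which field name the scripts use.

```python
import json


def build_mapping(annotated_field):
    """Return an index mapping with one annotated_text field.

    `annotated_field` is a hypothetical field name chosen by the caller.
    """
    return {"properties": {annotated_field: {"type": "annotated_text"}}}


mapping = build_mapping("content_ner")
print(json.dumps(mapping, indent=2))
```

A dictionary of this shape could be passed as the mappings of a new index when creating it with the Elasticsearch Python client.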
Docker
This repository contains Docker images and a docker-compose file for running and testing the scripts locally. docker-compose requires a .env file, created next to docker-compose.yaml, with the following values:
ES_HOST=elasticsearch
ELASTIC_ROOT_PASSWORD={password-of-your-choice}
Usage
Environment
Before running the script, set the following environment variables: ES_HOST, if Elasticsearch does not run on localhost; and API_ID, API_KEY and CERTS_LOCATION, if you access an Elasticsearch cluster using an API key.
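To make the role of these variables concrete, here is a sketch of how they might be turned into connection settings for the Elasticsearch Python client. It is not the repository's code: the function name `client_kwargs`, port 9200, and the choice of `https` whenever an API key is present are all assumptions. The returned keys (`hosts`, `api_key`, `ca_certs`) are keyword arguments of the client's `Elasticsearch` constructor.

```python
import os


def client_kwargs(env=None):
    """Build keyword arguments for elasticsearch.Elasticsearch(...).

    Reads ES_HOST, API_ID, API_KEY and CERTS_LOCATION. Port 9200 and the
    http/https choice are assumptions, not taken from the repository.
    """
    env = os.environ if env is None else env
    host = env.get("ES_HOST", "localhost")
    api_id = env.get("API_ID")
    api_key = env.get("API_KEY")
    certs = env.get("CERTS_LOCATION")

    if api_id and api_key:
        # API-key access: assume a TLS-secured cluster.
        kwargs = {
            "hosts": [f"https://{host}:9200"],
            "api_key": (api_id, api_key),
        }
        if certs:
            kwargs["ca_certs"] = certs
        return kwargs

    # No API key: assume a plain local or Docker setup.
    return {"hosts": [f"http://{host}:9200"]}


print(client_kwargs({"ES_HOST": "elasticsearch"}))
```

With the elasticsearch package installed, the result could be used as `Elasticsearch(**client_kwargs())`.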
Run the script (without Docker)
To analyze data from an Elasticsearch index with spaCy and save the results back into an annotated-text field, change to the code directory (cd code) and then run the following command:
python process_documents.py -i {index_name} -f {field_name} -l {language_code} -o {output_dir}
To run this for an English-language corpus indexed as "test", with its text stored in the field "content", you could run:
python process_documents.py -i test -f content -l english
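The flags in the commands above can be read as a small command-line interface. The sketch below mirrors them with `argparse` from the standard library so that it runs without extra packages; the repository lists click among its dependencies, so the actual script may well be built differently. The long option names, and the assumption that `-o` is optional while the other flags are required, are guesses based only on the two example commands.

```python
import argparse


def build_parser():
    """Parser mirroring the -i, -f, -l and -o flags shown above.

    Long option names and required/optional status are assumptions.
    """
    parser = argparse.ArgumentParser(prog="process_documents.py")
    parser.add_argument("-i", "--index", required=True,
                        help="name of the Elasticsearch index to read")
    parser.add_argument("-f", "--field", required=True,
                        help="name of the field holding the text")
    parser.add_argument("-l", "--language", required=True,
                        help="language of the corpus, e.g. english")
    parser.add_argument("-o", "--output-dir", default=None,
                        help="optional directory for output")
    return parser


args = build_parser().parse_args(["-i", "test", "-f", "content", "-l", "english"])
print(args.index, args.field, args.language, args.output_dir)
```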
Run the script locally (with Docker)
Alternatively, to run with Docker, without changing to the code directory first, run:
docker-compose run --rm backend python process_documents.py -i {index_name} -f {field_name} -l {language}
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
GitHub Events
Total
- Issues event: 4
- Member event: 1
- Push event: 2
- Create event: 1
Last Year
- Issues event: 4
- Member event: 1
- Push event: 2
- Create event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| BeritJanssen | b****n@g****m | 29 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 7
- Total pull requests: 1
- Average time to close issues: 6 months
- Average time to close pull requests: 1 minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.43
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 1
- Average time to close issues: 4 months
- Average time to close pull requests: 1 minute
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- BeritJanssen (8)
Pull Request Authors
- BeritJanssen (1)
Dependencies
- python 3.10-buster build
- click *
- elasticsearch *
- pytest *
- spacy *
- annotated-types ==0.5.0
- blis ==0.7.11
- catalogue ==2.0.10
- certifi ==2023.7.22
- charset-normalizer ==3.3.0
- click ==8.1.7
- cloudpathlib ==0.15.1
- confection ==0.1.3
- cymem ==2.0.8
- elastic-transport ==8.4.1
- elasticsearch ==8.10.0
- exceptiongroup ==1.1.3
- idna ==3.4
- iniconfig ==2.0.0
- jinja2 ==3.1.2
- langcodes ==3.3.0
- markupsafe ==2.1.3
- murmurhash ==1.0.10
- numpy ==1.26.0
- packaging ==23.2
- pathy ==0.10.2
- pluggy ==1.3.0
- preshed ==3.0.9
- pydantic ==2.4.2
- pydantic-core ==2.10.1
- pytest ==7.4.2
- requests ==2.31.0
- smart-open ==6.4.0
- spacy ==3.7.1
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.5
- srsly ==2.4.8
- thinc ==8.2.1
- tomli ==2.0.1
- tqdm ==4.66.1
- typer ==0.9.0
- typing-extensions ==4.8.0
- urllib3 ==1.26.17
- wasabi ==1.1.2
- weasel ==0.3.2