semanticat

Annotation tool (NER) for XML documents (TEI, EAD) - WIP

https://github.com/lucaterre/semanticat

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary
Last synced: 8 months ago

Repository

Annotation tool (NER) for XML documents (TEI, EAD) - WIP

Basic Info
  • Host: GitHub
  • Owner: Lucaterre
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 22.2 MB
Statistics
  • Stars: 10
  • Watchers: 3
  • Forks: 0
  • Open Issues: 4
  • Releases: 1
Created almost 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md


Semanti🐱

WORK-IN-PROGRESS! (experimental usage)

Semantic@ (or semanticat) is a GUI prototype for annotating XML documents in TEI or EAD and embedding the annotations directly in them.

This tool follows a linear workflow: import the document(s), apply a NER model and correct its predictions (or annotate manually from scratch), then export and/or publish the XML with the annotations embedded.

The platform is also designed to adapt to a wide range of publishing projects: its components are highly customisable.

:movie_camera: Demo

semanticat_demo

:battery: Installation

  1. Clone the GitHub repository

```bash
git clone https://github.com/Lucaterre/semanticat.git
```

  2. Move into the directory

```bash
cd semanticat
```

  3. Create a virtual environment with virtualenv

```bash
virtualenv --python=/usr/bin/python3.8 venv
```

  4. Activate the virtual environment

```bash
source venv/bin/activate
```

  5. Install the dependencies

```bash
pip install -r requirements.txt
```

:rocket: Run Locally

:fire: This application is intended to be simple and local for the moment. Please note that the application is currently optimized for the Firefox browser.

Use the Semantic@ CLI; inside the semanticat/ directory, launch:

```bash
python run.py
```

The application is available on port 3000.

Other arguments:

| Argument              | Details                                        |
|-----------------------|------------------------------------------------|
| --dev_mode            | Launch the application in development mode     |
| --erase_recreate_db   | Wipe and recreate the whole database :warning: |

:arrow_forward: Getting started

  • Start by creating a project with the Create a new project button, then open your project;
  • Go to Menu > Manage documents and import your XML files; your documents now appear in the Project workflow view (you can mix EAD and TEI);
  • In the Project workflow view, apply the Parse feature to documents one by one, or Parse All to process all documents;
  • Go to Menu > Configuration; there are two use cases:

  1. You do not want to apply a NER model and prefer to annotate your data manually:
     • Define the annotation mapping (see the "Mapping" section);
     • Add labels with Add new pair to mapping scope;
     • Go to Project workflow > Correct named entities and start annotating.

  2. You want to use a NER (recommender) model to predict named entities and correct them afterwards (see the "NER configuration" section):
     • First, select the NER Recommenders checkbox;
     • Choose the language corresponding to your resources;
     • Select the model and save;
     • Wait until the pre-mapping appears; you can then adapt it (see the "Mapping" section);
     • Go to Project workflow > Launch Ner (or Launch Ner on all);
     • When the process is complete, go to Correct named entities and correct the predictions or add annotations.

Whichever scenario you choose, once the correction is finished you can export your document (see the "Export" section)!

:dart: Detail sections

Mapping

The mapping table matches a NER category with the tag used in the output XML markup:

  • Ner Label: the default label used when annotating manually, or the label used by your model;
  • Preferred Index label: the label that will appear in the output;
  • Color: the label colour in the annotation view.

You can add new labels to your existing schema via Add new pair to mapping scope.

Be careful: if you remove a label from the table after your model has made predictions, or after you have started correcting a document, all annotations will be destroyed.
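The mapping described above can be sketched as a small lookup structure. This is a hypothetical illustration of the idea (the names `MAPPING` and `output_tag` are not the tool's actual data model):

```python
# Hypothetical sketch of an annotation mapping: each NER label is matched
# to a preferred output tag (used in the exported XML) and a display color.
MAPPING = {
    "PER": {"output_tag": "persName", "color": "#e63946"},
    "LOC": {"output_tag": "placeName", "color": "#457b9d"},
}


def output_tag(ner_label: str) -> str:
    """Return the tag to emit for a NER label, falling back to the label itself."""
    entry = MAPPING.get(ner_label)
    return entry["output_tag"] if entry else ner_label


print(output_tag("PER"))   # -> persName
print(output_tag("MISC"))  # -> MISC (no mapping entry, label kept as-is)
```

Removing an entry from such a mapping is what orphans (and therefore destroys) annotations already made with that label.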

NER configuration

Currently, Semantic@ uses the spaCy NER framework.

When installing Semantic@, two pre-trained models, for French (fr_core_news_sm) and English (en_core_web_sm), are already available.

To add another spaCy pre-trained model, run the following in a terminal before starting Semantic@:

```bash
python -m spacy download <name-pretrained-model>
```

then restart the application.

The new pre-trained model will then be directly available in the model list of the configuration view.

Sometimes spaCy's default pre-trained NER models are too slow or too generic for your data (a generic model is far from perfect, so it does not necessarily detect your labels). If you have trained a better statistical NER model with spaCy, you can place your model folder under /instance_config/my_features/my_models/.

Your model will then be directly available in the model list of the configuration view.
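Whatever model is selected, a spaCy NER pass yields entities with a surface form, a label, and character offsets. The stdlib stand-in below mimics that `(text, label, start_char, end_char)` shape with a naive gazetteer, purely to illustrate the structure of the predictions that the correction view works with (the names `GAZETTEER` and `fake_ner` are invented for this sketch; a real run would use `spacy.load(...)` and `doc.ents`):

```python
import re

# Illustration only: spaCy entities expose text, label_, start_char and
# end_char. This naive gazetteer lookup mimics that output shape without
# requiring a downloaded model.
GAZETTEER = {"Lucas Terriel": "PER", "Paris": "LOC"}


def fake_ner(text: str):
    """Return (surface, label, start, end) tuples, sorted by position."""
    ents = []
    for surface, label in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            ents.append((m.group(), label, m.start(), m.end()))
    return sorted(ents, key=lambda e: e[2])


print(fake_ner("Lucas Terriel works in Paris."))
# -> [('Lucas Terriel', 'PER', 0, 13), ('Paris', 'LOC', 23, 28)]
```

The character offsets are what make the precise, standoff-based export possible; the surface forms are what the faster matching-based export relies on.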

Export

There are different XML export solutions:

  • Inline annotations (based on character offsets) (TEI-specific): this export mode uses standoff-converter and the position of each annotation in the text to produce the output. It is precise, but can take a long time.
  • Annotations at controlaccess level (EAD-specific): this export mode inserts the annotation tags in a `<controlaccess>` level.
  • Inline annotations (based on surface form matching) (TEI & EAD): this export mode uses the surface form of the annotated mentions to tag the output. It is fast but sometimes less precise (correct your document before exporting it).
  • Annotations in JSON: this export keeps track of annotations in a JSON format; it can be re-imported directly into the annotation view.
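The surface-form matching mode can be sketched in a few lines. This is a minimal illustration of the principle under stated assumptions (the function `tag_mentions` is invented here, and a real export would respect XML structure rather than doing plain text substitution), which also shows why this mode is fast but less precise: every occurrence of a surface form gets tagged, even unannotated ones.

```python
import re

# Minimal sketch of surface-form-matching export: wrap every occurrence
# of each annotated mention in the output tag chosen by the mapping.
def tag_mentions(xml_text: str, mentions: dict) -> str:
    for surface, tag in mentions.items():
        xml_text = re.sub(re.escape(surface),
                          f"<{tag}>{surface}</{tag}>", xml_text)
    return xml_text


print(tag_mentions("Victor Hugo was born in Besançon.",
                   {"Victor Hugo": "persName", "Besançon": "placeName"}))
# -> <persName>Victor Hugo</persName> was born in <placeName>Besançon</placeName>.
```

The offset-based mode avoids this imprecision by inserting tags at recorded character positions instead of re-matching strings, at the cost of a slower export.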

:crying_cat_face: Bug reports

Feel free to create a new issue (new features, bug reports, documentation, etc.).

:computer: Stack

Interface

Flask SQLite Bootstrap

Main components

  • Spacy

  • RecogitoJS

  • Standoffconverter

:bust_in_silhouette: Maintainers

:black_nib: How to cite

Please use the following citation:

@misc{terriel-2022-semanticat,
    title = "Semanticat: Annotation tool (NER) for XML documents",
    author = "Terriel, Lucas",
    year = "2022",
    url = "https://github.com/Lucaterre/semanticat",
}

Owner

  • Name: Lucas Terriel
  • Login: Lucaterre
  • Kind: user
  • Location: Paris, France
  • Company: École Nationale des Chartes

Engineer @chartes | before @ INRIA (ALMAnaCH team)

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Terriel"
  given-names: "Lucas"
  orcid: "https://orcid.org/0000-0002-9189-258X"
title: "Semanticat"
version: 0.0.1
date-released: 2022-05-5
url: "https://github.com/Lucaterre/semanticat"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 38
  • Total Committers: 2
  • Avg Commits per committer: 19.0
  • Development Distribution Score (DDS): 0.342
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Lucas Terriel 4****e 25
Lucaterre l****l@g****m 13

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 7
  • Total pull requests: 3
  • Average time to close issues: 1 day
  • Average time to close pull requests: 19 minutes
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Lucaterre (6)
  • ancatmara (1)
Pull Request Authors
  • Lucaterre (3)
Top Labels
Issue Labels
:new: Enhancement (3) :hammer: Refactoring (2) python (2) :dizzy: Nice to have (2) Module: Annotation (2) Needs attention (2) Module: Export - Publication (2) javascript (1) :sunglasses: UX (1) Module: NER (1) Critical (1) :bug: Bug (1) Module: Workflow - Dashboard (1) Module: Import (1) :bookmark_tabs: Documentation (1) :large_orange_diamond: Quality (1) :robot: Tests (1) WIP (1)
Pull Request Labels
:new: Enhancement (2) :hammer: Refactoring (1) python (1) Module: Export - Publication (1) Module: Import (1)

Dependencies

requirements.txt pypi
  • Flask ==2.1.2
  • Flask-SQLAlchemy ==2.5.1
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.1
  • SQLAlchemy ==1.4.36
  • Werkzeug ==2.1.2
  • attrs ==21.4.0
  • blis ==0.7.7
  • catalogue ==2.0.7
  • cchardet ==2.1.7
  • certifi ==2021.10.8
  • charset-normalizer ==2.0.12
  • click ==8.1.3
  • colorlog ==6.6.0
  • cymem ==2.0.6
  • greenlet ==1.1.2
  • idna ==3.3
  • importlib-metadata ==4.11.3
  • iniconfig ==1.1.1
  • itsdangerous ==2.1.2
  • langcodes ==3.3.0
  • lxml ==4.8.0
  • murmurhash ==1.0.7
  • numpy ==1.22.3
  • packaging ==21.3
  • pandas ==1.4.2
  • pathy ==0.6.1
  • pluggy ==1.0.0
  • preshed ==3.0.6
  • py ==1.11.0
  • pydantic ==1.8.2
  • pyfiglet ==0.8.post1
  • pylint ==2.13.8
  • pyparsing ==3.0.8
  • pytest ==7.1.2
  • pytest-flask ==1.2.0
  • python-dateutil ==2.8.2
  • pytz ==2022.1
  • requests ==2.27.1
  • six ==1.16.0
  • smart-open ==5.2.1
  • spacy ==3.3.0
  • spacy-legacy ==3.0.9
  • spacy-loggers ==1.0.2
  • srsly ==2.4.3
  • thinc ==8.0.15
  • tomli ==2.0.1
  • tqdm ==4.64.0
  • typer ==0.4.1
  • typing_extensions ==4.2.0
  • urllib3 ==1.26.9
  • wasabi ==0.9.1
  • zipp ==3.8.0