inception2corpus

A CLI for retrieving a corpus annotated with named entities from INCEpTION to an archived, reusable and versionable corpus designed as part of the NER4Archives project.

https://github.com/ner4archives-project/inception2corpus-cli

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.5%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: NER4Archives-project
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 2.6 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 5
  • Releases: 6
Created almost 4 years ago · Last pushed over 3 years ago

README.md

inception2corpus-CLI

A CLI that retrieves a corpus annotated with named entities from an INCEpTION instance and turns it into an archived, reusable corpus for any NER project.

This tool was created in the context of the NER4Archives project (INRIA/Archives nationales); it is adaptable and reusable for any other project under the terms of the MIT license.


The CLI launches a linear process, called a "pipeline", which executes the components in the following order:

  • Fetch the curated documents (XMI format) from an INCEpTION instance (check the state of each document in the INCEpTION "Monitoring" window);
  • Preprocess the curated documents (retokenize, remove unprintable characters, etc.);
  • Convert the XMI files to CONLL files (inception2corpus uses the xmi2conll CLI as a module);
  • Merge the CONLL files into one;
  • Produce a report containing statistics and metadata about the corpus;
  • Reduce the dataset (keep only the annotated sentences and discard the others) and serialize it into 2 sets (train/dev) and 3 sets (train/dev/test) according to a ratio defined by the user.
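
The cleaning, reduction and split steps can be pictured with a minimal sketch in plain Python. This is not the tool's actual implementation: the function names, the token/label column layout, the use of "O" as the non-entity label and the fixed random seed are all assumptions made for illustration.

```python
import random
import unicodedata


def clean_token(token):
    """Drop control/format characters (Unicode categories starting with 'C')."""
    return "".join(ch for ch in token if not unicodedata.category(ch).startswith("C"))


def read_sentences(conll_text):
    """Split CONLL text into sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in conll_text.splitlines():
        parts = line.split()
        if not parts:                 # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: first column is the token, last column is the label.
        current.append((clean_token(parts[0]), parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def reduce_sentences(sentences):
    """Keep only sentences that carry at least one label other than 'O'."""
    return [s for s in sentences if any(label != "O" for _, label in s)]


def split_dataset(sentences, ratios, seed=42):
    """Shuffle, then cut the sentences according to ratios such as (0.8, 0.1, 0.1)."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for ratio in ratios[:-1]:
        end = start + round(len(shuffled) * ratio)
        splits.append(shuffled[start:end])
        start = end
    splits.append(shuffled[start:])   # the last set takes the remainder
    return splits


# Tiny hand-written sample: three sentences, two of them annotated.
sample = "Paris B-LOC\nis O\nbig O\n\nHello O\nworld O\n\nLucas B-PER\nwrote O\n"
sentences = read_sentences(sample)
reduced = reduce_sentences(sentences)
train, dev = split_dataset(reduced, (0.5, 0.5))
```

Passing three ratios instead of two yields the train/dev/test variant; the last set always receives whatever remains after rounding.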

When the program finishes, an output_annotated_corpus folder/ directory is created at the root of the working directory; see "Full output folder description" below for details.

🛠️ Installation (easy way)

  1. You need Python 3.7 or higher installed (if not, install it from python.org).

  2. Create a new directory and set up a virtual environment with virtualenv and the correct Python version, following the steps for your OS:

    macOS / Linux

    virtualenv --python=/usr/bin/python3.7 venv

    then activate the new environment with:

    source venv/bin/activate

    Windows

    py -m venv venv

    then activate the new environment with:

    .\venv\Scripts\activate

  3. Finally, install the inception2corpus CLI via pip:

    pip install inception2corpus

🛠️ Installation (for developers only)

```bash
# 1. Clone the git repository
git clone https://github.com/NER4Archives-project/inception2corpus-CLI.git

# 2. Go to the repository and create a new virtual env
#    (follow the steps in the easy-way installation)
cd inception2corpus-CLI

# 3. Install the packages
# (on macOS/Linux):
pip install -r requirements.txt
# (on Windows):
pip install -r .\requirements.txt
```

▶️ Usage

  1. The inception2corpus CLI takes a YAML file as its argument; the file specifies the INCEpTION host information, corpus metadata, CONLL format, serialization options, etc. You can start from the USER_VAR_ENV.yml template and update it.

  2. Once the YAML configuration file is complete, run:

    inception2corpus ./USER_VAR_ENV.yml

  3. When the process finishes, a new output directory (./output_annotated_corpus folder/) is created at the root of the working directory; it contains your final corpus, ready for training. A temp_files/ folder is also created at the root; you can keep it or delete it.
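
If you would rather launch the pipeline from a Python script, the documented command can be wrapped with the standard library. This is only a sketch: build_command and run_pipeline are hypothetical helper names, not part of inception2corpus, and whether the CLI asks for interactive input while running is not covered here.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(config_path):
    """Return the argument list for the documented call: inception2corpus <config>."""
    return ["inception2corpus", str(config_path)]


def run_pipeline(config_path):
    """Run the CLI once the config file and the executable have been located."""
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"Configuration file not found: {config}")
    if shutil.which("inception2corpus") is None:
        raise RuntimeError("inception2corpus is not on PATH; activate the virtual environment first")
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    return subprocess.run(build_command(config), check=True)


# Hypothetical usage, from the directory that holds the YAML file:
# run_pipeline("./USER_VAR_ENV.yml")
```

Checking for the file and the executable up front gives a clearer error than a failed subprocess call would.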

📁 Full output folder description

```
./output_annotated_corpus folder/
|
|- output_annotated_corpus folder.zip/
|   |
|   |- datasplitn2/ : The allreduced.conll divided into 2 sets (train, dev)
|   |
|   |- datasplitn3/ : The allreduced.conll divided into 3 sets (train, dev, test)
|   |
|   |- datasplitn3idx/ : The allreduced.conll divided into 3 sets (train, dev, test) with sentences ID
|   |
|   |- datasplitn2idx/ : The allreduced.conll divided into 2 sets (train, dev) with sentences ID
|   |
|   |- XMIcurated/ : Original XMI to import into INCEpTION
|   |
|   |- all.conll : All documents in CONLL format
|   |- allreduced.conll : All documents in CONLL format reduced to only annotated sentences
|   |- meta_corpus.json : corpus metadata and statistics
```
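
Once the archive is extracted, the CONLL and JSON files can be inspected with a few lines of Python. The sketch below assumes that the label sits in the last whitespace-separated column of each CONLL line, and it makes no assumption about the keys of meta_corpus.json; count_labels and summarize are hypothetical helper names, and the paths in the usage comment are placeholders.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count last-column labels over an iterable of CONLL lines, skipping blank ones."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if parts:
            counts[parts[-1]] += 1
    return counts


def summarize(conll_path, meta_path):
    """Return (label counts, metadata dict) for files from an extracted output folder."""
    with open(conll_path, encoding="utf-8") as handle:
        labels = count_labels(handle)
    with open(meta_path, encoding="utf-8") as handle:
        meta = json.load(handle)  # structure of the report is not documented here
    return labels, meta


# Hypothetical usage, assuming the archive was extracted to ./corpus/:
# labels, meta = summarize("corpus/all.conll", "corpus/meta_corpus.json")
# print(labels.most_common(5))
```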

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Terriel"
  given-names: "Lucas"
  orcid: "https://orcid.org/0000-0002-9189-258X"
title: "Inception2Corpus-CLI"
description: "A pipeline CLI for retrieving a corpus annotated with named entities from INCEpTION instance to an archived and reusable corpus designed for NER project."
date-released: 2022-07-01
url: "https://github.com/NER4Archives-project/inception2corpus-CLI"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 32
  • Total Committers: 4
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.438
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Lucaterre | l****l@g****m | 18 |
| Lucas Terriel | 4****e@u****m | 11 |
| PaulineCharbo | p****r@g****m | 2 |
| dependabot[bot] | 4****]@u****m | 1 |

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 5
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.2
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Lucaterre (5)
Pull Request Authors
  • dependabot[bot] (1)
Top Labels
Issue Labels
  • enhancement (3)
  • documentation (2)
Pull Request Labels
  • dependencies (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 36 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
pypi.org: inception2corpus

A CLI for retrieving a corpus annotated with named entities from INCEpTION to an archived, reusable and versionable corpus.

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 36 Last month
Rankings
Dependent packages count: 6.6%
Downloads: 26.3%
Average: 26.6%
Forks count: 30.5%
Dependent repos count: 30.6%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 8 months ago

Dependencies

requirements.txt pypi
  • PyYAML ==6.0
  • altgraph ==0.17.2
  • attrs ==21.2.0
  • certifi ==2022.6.15
  • charset-normalizer ==2.1.0
  • click ==8.1.3
  • deprecation ==2.1.0
  • dkpro-cassis ==0.7.2
  • idna ==3.3
  • importlib-metadata ==4.12.0
  • importlib-resources ==5.4.0
  • joblib ==1.1.0
  • lxml ==4.9.1
  • more-itertools ==8.12.0
  • nltk ==3.7
  • numpy ==1.21.6
  • packaging ==21.3
  • pandas ==1.3.5
  • prompt-toolkit ==3.0.30
  • pycaprio ==0.2.1
  • pyfiglet ==0.8.post1
  • pyinstaller ==5.1
  • pyinstaller-hooks-contrib ==2022.7
  • pyparsing ==3.0.9
  • python-dateutil ==2.8.2
  • pytz ==2022.1
  • pyzmq ==23.2.0
  • regex ==2022.6.2
  • requests ==2.28.1
  • requests-toolbelt ==0.9.1
  • scikit-learn ==1.0.2
  • scipy ==1.7.3
  • six ==1.16.0
  • sortedcontainers ==2.4.0
  • tenacity ==5.1.5
  • termcolor ==2.0.1
  • threadpoolctl ==3.1.0
  • toposort ==1.7
  • tqdm ==4.64.1
  • typing_extensions ==4.2.0
  • urllib3 ==1.26.9
  • wcwidth ==0.2.5
  • xmi2conll ==0.1.2
  • zipp ==3.8.0
setup.py pypi
  • requirements *