architxt

ArchiTXT is an open source Python library that transforms unstructured text into structured, searchable, and AI-ready data. It enables automated database generation and seamless data integration.

https://github.com/neplex/architxt

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary

Keywords

architxt data-analysis database nlp open-source python python-library research structured-data text-analysis text-mining
Last synced: 4 months ago

Repository

Basic Info
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 9
  • Releases: 6
Created over 1 year ago · Last pushed 4 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Authors Codemeta

README.md

ArchiTXT: Text-to-Database Structuring Tool

ArchiTXT converts unstructured textual data into structured formats that are ready for database storage. It automates the generation of database schemas and creates the corresponding data instances, simplifying the integration of text-based information into database systems.

Unstructured text is hard to store and query in a structured database. ArchiTXT bridges this gap by transforming raw text into organized, query-friendly structures.

Installation

To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:

```sh
pip install architxt
```

For the development version, install it directly from Git:

```sh
pip install git+https://github.com/Neplex/ArchiTXT.git
```

Usage

ArchiTXT is built to work with BRAT-annotated corpora that include pre-labeled named entities. It also requires access to a CoreNLP server, which you can set up using the Docker configuration available in the source repository.
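For readers unfamiliar with BRAT, the sketch below parses the entity lines of a BRAT standoff `.ann` file, which is where pre-labeled named entities are stored. It only illustrates the input format and does not use ArchiTXT's own loader; the `Entity` class, the `parse_entities` helper, and the sample annotations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One text-bound annotation (a `T` line) from a BRAT .ann file."""

    id: str
    label: str
    spans: tuple[tuple[int, int], ...]  # (start, end) character offsets
    text: str


def parse_entities(ann_text: str) -> list[Entity]:
    """Return the named entities found in BRAT standoff annotation text."""
    entities = []
    for line in ann_text.splitlines():
        # Only text-bound annotations (IDs starting with "T") are entities;
        # relations (R), events (E), attributes (A) and notes (#) are skipped.
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, covered_text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities list several "start end" fragments joined by ";".
        spans = tuple(
            (int(start), int(end))
            for start, end in (fragment.split() for fragment in offsets.split(";"))
        )
        entities.append(Entity(ann_id, label, spans, covered_text))
    return entities


# Hypothetical annotations for the sentence "Alice lives in Paris".
SAMPLE_ANN = (
    "T1\tPerson 0 5\tAlice\n"
    "T2\tCity 15 20\tParis\n"
    "R1\tLivesIn Arg1:T1 Arg2:T2\n"
)

for entity in parse_entities(SAMPLE_ANN):
    print(entity.id, entity.label, entity.spans, entity.text)
```

Each entity carries its label and character offsets into the companion `.txt` file, which is the information a structuring tool needs to locate the named entities in the raw text.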

```sh
$ architxt --help

Usage: architxt [OPTIONS] COMMAND [ARGS]...

ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion  Install completion for the current shell.                                        │
│ --show-completion     Show completion for the current shell, to copy it or customize the installation. │
│ --help                Show this message and exit.                                                      │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ run  Extract a database schema form a corpus.                                                          │
│ ui   Launch the web-based UI.                                                                          │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

```sh
$ architxt run --help

Usage: architxt run [OPTIONS] CORPUS_PATH

Extract a database schema form a corpus.

╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────╮
│ * corpus_path   PATH        Path to the input corpus. [default: None] [required]                       │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────╮
│ --tau           FLOAT       The similarity threshold. [default: 0.7]                                   │
│ --epoch         INTEGER     Number of iteration for tree rewriting. [default: 100]                     │
│ --min-support   INTEGER     Minimum support for tree patterns. [default: 20]                           │
│ --corenlp-url   TEXT        URL of the CoreNLP server. [default: http://localhost:9000]                │
│ --gen-instances INTEGER     Number of synthetic instances to generate. [default: 0]                    │
│ --language      TEXT        Language of the input corpus. [default: French]                            │
│ --debug         --no-debug  Enable debug mode for more verbose output. [default: no-debug]             │
│ --help                      Show this message and exit.                                                │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
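The options listed in the help output can be combined in a single call. The sketch below builds the argument list for `architxt run` from Python, so that it could be passed to `subprocess.run`. The helper name `build_run_command`, the corpus path, and the chosen values are illustrative assumptions; only flags shown in the help output are used, and the defaults mirror the ones printed there.

```python
import shlex


def build_run_command(
    corpus_path: str,
    *,
    tau: float = 0.7,
    epoch: int = 100,
    min_support: int = 20,
    corenlp_url: str = "http://localhost:9000",
    gen_instances: int = 0,
    language: str = "French",
    debug: bool = False,
) -> list[str]:
    """Build the argv list for `architxt run` (hypothetical helper)."""
    command = [
        "architxt", "run", corpus_path,
        "--tau", str(tau),
        "--epoch", str(epoch),
        "--min-support", str(min_support),
        "--corenlp-url", corenlp_url,
        "--gen-instances", str(gen_instances),
        "--language", language,
    ]
    command.append("--debug" if debug else "--no-debug")
    return command


# Hypothetical corpus location and settings, for illustration only.
command = build_run_command("path/to/corpus", tau=0.5, min_support=10)
print(shlex.join(command))

# To actually run it (requires ArchiTXT installed and a CoreNLP server):
#   import subprocess
#   subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the corpus path contains spaces.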

To deploy the CoreNLP server using the source repository, you can use Docker Compose with the following command:

```sh
docker compose up -d corenlp
```
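Before launching a run, it can help to confirm that the CoreNLP container is actually answering. The following is a minimal sketch using only the Python standard library; the helper name `corenlp_is_reachable` is hypothetical, and the URL is the default value of `--corenlp-url` shown in the help output. It only checks that an HTTP server responds at that address, not that the server is fully initialized.

```python
import urllib.error
import urllib.request


def corenlp_is_reachable(url: str = "http://localhost:9000", timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return False


if corenlp_is_reachable():
    print("CoreNLP server is up")
else:
    print("CoreNLP server is not reachable; is the container running?")
```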

Owner

  • Name: Nicolas Hiot
  • Login: Neplex
  • Kind: user
  • Location: France
  • Company: Ennov

PhD student

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: ArchiTXT
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jacques
    family-names: Chabin
    orcid: 'https://orcid.org/0000-0003-1460-9979'
    email: jacques.chabin@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
  - given-names: Mirian
    family-names: Halfeld-Ferrari
    email: mirian@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
    orcid: 'https://orcid.org/0000-0003-2601-3224'
  - given-names: Nicolas
    family-names: Hiot
    email: nicolas.hiot@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
    orcid: 'https://orcid.org/0000-0003-4318-4906'
repository-code: 'https://github.com/Neplex/ArchiTXT'
license: CC-BY-NC-4.0
identifiers:
  - type: swh
    value: 'swh:1:dir:8f5b7716e68a8b4261e93ac40be26ce0d9593f40'

CodeMeta (codemeta.json)

{
  "@context": [
    "https://w3id.org/codemeta/3.0",
    "https://w3id.org/software-iodata",
    "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
    "https://schema.org",
    "https://w3id.org/software-types"
  ],
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "@id": "https://orcid.org/0000-0003-1460-9979",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Chabin",
      "givenName": "Jacques"
    },
    {
      "@id": "https://orcid.org/0000-0003-2601-3224",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Halfeld-Ferrari",
      "givenName": "Mirian"
    },
    {
      "@id": "https://orcid.org/0000-0003-4318-4906",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Hiot",
      "givenName": "Nicolas"
    }
  ],
  "codeRepository": "https://github.com/Neplex/ArchiTXT",
  "contributor": [
    {
      "@id": "https://orcid.org/0000-0003-1460-9979",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Chabin",
      "givenName": "Jacques"
    },
    {
      "@id": "https://orcid.org/0000-0003-2601-3224",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Halfeld-Ferrari",
      "givenName": "Mirian"
    },
    {
      "@id": "https://orcid.org/0000-0003-4318-4906",
      "@type": "Person",
      "affiliation": {
        "@type": "Organization",
        "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
      },
      "familyName": "Hiot",
      "givenName": "Nicolas"
    }
  ],
  "developmentStatus": "https://www.repostatus.org/#wip",
  "identifier": "data",
  "license": "http://spdx.org/licenses/CC-BY-NC-4.0",
  "maintainer": {
    "@id": "https://orcid.org/0000-0003-1460-9979",
    "@type": "Person",
    "affiliation": {
      "@type": "Organization",
      "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
    },
    "familyName": "Chabin",
    "givenName": "Jacques"
  },
  "name": "ArchiTXT",
  "producer": {
    "@type": "Organization",
    "legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
  },
  "url": "https://github.com/Neplex/ArchiTXT"
}

GitHub Events

Total
  • Create event: 69
  • Release event: 6
  • Issues event: 10
  • Watch event: 2
  • Delete event: 72
  • Member event: 1
  • Issue comment event: 30
  • Push event: 380
  • Pull request review comment event: 23
  • Pull request review event: 21
  • Pull request event: 142
Last Year
  • Create event: 69
  • Release event: 6
  • Issues event: 10
  • Watch event: 2
  • Delete event: 72
  • Member event: 1
  • Issue comment event: 30
  • Push event: 380
  • Pull request review comment event: 23
  • Pull request review event: 21
  • Pull request event: 142

Issues and Pull Requests

All Time
  • Total issues: 9
  • Total pull requests: 68
  • Average time to close issues: 6 days
  • Average time to close pull requests: 8 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.13
  • Merged pull requests: 46
  • Bot issues: 1
  • Bot pull requests: 29
Past Year
  • Issues: 9
  • Pull requests: 68
  • Average time to close issues: 6 days
  • Average time to close pull requests: 8 days
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.13
  • Merged pull requests: 46
  • Bot issues: 1
  • Bot pull requests: 29
Top Authors
Issue Authors
  • Neplex (8)
  • dependabot[bot] (1)
Pull Request Authors
  • dependabot[bot] (39)
  • Neplex (36)
  • bap-haudebourg (5)
Top Labels
Issue Labels
enhancement (5) bug (2) dependencies (1)
Pull Request Labels
dependencies (44) python (25) enhancement (13) bug (4) documentation (1)

Dependencies

corenlp/Dockerfile docker
  • openjdk 8u272 build
docker-compose.yml docker
  • corenlp latest
  • nlpbox/corenlp latest
pyproject.toml pypi
.github/workflows/python-test.yml actions
  • rcmdnk/python-action v1 composite