architxt
ArchiTXT is an open source Python library that transforms unstructured text into structured, searchable, and AI-ready data. It enables automated database generation and seamless data integration.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Neplex
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://neplex.github.io/ArchiTXT/
- Size: 4.45 MB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 9
- Releases: 6
Metadata Files
README.md
ArchiTXT: Text-to-Database Structuring Tool
ArchiTXT is a robust tool designed to convert unstructured textual data into structured formats that are ready for database storage. It automates the generation of database schemas and creates corresponding data instances, simplifying the integration of text-based information into database systems.
Working with unstructured text can be challenging when you need to store and query it in a structured database. ArchiTXT bridges this gap by transforming raw text into organized, query-friendly structures. By automating both schema generation and data instance creation, it streamlines the entire process of managing textual information in databases.
Installation
To install ArchiTXT, make sure you have Python 3.10+ and pip installed. Then, run:
```sh
pip install architxt
```
To install the development version directly from the Git repository, run:
```sh
pip install git+https://github.com/Neplex/ArchiTXT.git
```
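The prerequisites above (Python 3.10+ and an installed `architxt` package) can be checked from Python before going further. The following is a minimal sketch that uses only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and are not part of ArchiTXT.

```python
from __future__ import annotations

import sys
from importlib import metadata

# Minimum interpreter version stated in the installation instructions.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the interpreter meets the documented minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "architxt") -> str | None:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("architxt version:", installed_version() or "not installed")
```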
Usage
ArchiTXT works with BRAT-annotated corpora that include pre-labeled named entities. It also requires access to a CoreNLP server, which you can set up using the Docker configuration available in the source repository.
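To illustrate what "pre-labeled named entities" look like in a BRAT corpus, here is a small sketch that reads the text-bound (`T`) lines of a BRAT standoff `.ann` file. It is not ArchiTXT's own loader; the `Entity` class, the `parse_brat_entities` function, and the sample annotations are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    ident: str                           # annotation ID, e.g. "T1"
    label: str                           # entity type, e.g. "Person"
    spans: tuple[tuple[int, int], ...]   # (start, end) character offsets
    text: str                            # the annotated surface text


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Extract text-bound annotations from the content of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, and notes
        ident, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities separate their fragments with ";".
        spans = tuple(
            (int(start), int(end))
            for start, end in (frag.split(" ") for frag in offsets.split(";"))
        )
        entities.append(Entity(ident, label, spans, text))
    return entities


# Hypothetical annotations for the sentence "Alice lives in Paris".
SAMPLE = "T1\tPerson 0 5\tAlice\nT2\tCity 15 20\tParis\nR1\tLivesIn Arg1:T1 Arg2:T2\n"

if __name__ == "__main__":
    for entity in parse_brat_entities(SAMPLE):
        print(entity.ident, entity.label, entity.spans, entity.text)
```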
```sh
$ architxt --help

Usage: architxt [OPTIONS] COMMAND [ARGS]...

ArchiTXT is a tool for structuring textual data into a valid database model.
It is guided by a meta-grammar and uses an iterative process of tree rewriting.

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion  Install completion for the current shell.                                        │
│ --show-completion     Show completion for the current shell, to copy it or customize the installation. │
│ --help                Show this message and exit.                                                      │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ run  Extract a database schema form a corpus.                                                          │
│ ui   Launch the web-based UI.                                                                          │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
```sh
$ architxt run --help

Usage: architxt run [OPTIONS] CORPUS_PATH

Extract a database schema form a corpus.

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────╮
│ * corpus_path    PATH        Path to the input corpus. [default: None] [required]            │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────╮
│ --tau            FLOAT       The similarity threshold. [default: 0.7]                        │
│ --epoch          INTEGER     Number of iteration for tree rewriting. [default: 100]          │
│ --min-support    INTEGER     Minimum support for tree patterns. [default: 20]                │
│ --corenlp-url    TEXT        URL of the CoreNLP server. [default: http://localhost:9000]     │
│ --gen-instances  INTEGER     Number of synthetic instances to generate. [default: 0]         │
│ --language       TEXT        Language of the input corpus. [default: French]                 │
│ --debug          --no-debug  Enable debug mode for more verbose output. [default: no-debug]  │
│ --help                       Show this message and exit.                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
```
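When the `run` command needs to be launched from a script, the options listed in the help output can be assembled into an argument list. The sketch below only builds and prints the command; the function name `build_run_command` and the corpus path `my_corpus` are hypothetical, while the flags and default values are taken from the help text above.

```python
from __future__ import annotations

import shlex


def build_run_command(
    corpus_path: str,
    *,
    tau: float = 0.7,
    epoch: int = 100,
    min_support: int = 20,
    corenlp_url: str = "http://localhost:9000",
    gen_instances: int = 0,
    language: str = "French",
    debug: bool = False,
) -> list[str]:
    """Build the argument list for `architxt run` from keyword options."""
    cmd = [
        "architxt", "run",
        "--tau", str(tau),
        "--epoch", str(epoch),
        "--min-support", str(min_support),
        "--corenlp-url", corenlp_url,
        "--gen-instances", str(gen_instances),
        "--language", language,
        "--debug" if debug else "--no-debug",
    ]
    cmd.append(str(corpus_path))  # positional CORPUS_PATH goes last
    return cmd


if __name__ == "__main__":
    # "my_corpus" is a placeholder path, not a file shipped with ArchiTXT.
    print(shlex.join(build_run_command("my_corpus", tau=0.5, language="English")))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, provided `architxt` is installed and a CoreNLP server is reachable at the given URL.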
To deploy the CoreNLP server using the source repository, you can use Docker Compose with the following command:
```sh
docker compose up -d corenlp
```
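Once the container is started, it can be useful to confirm that something is answering at the CoreNLP URL before running ArchiTXT. The sketch below is a generic HTTP reachability check written with the standard library; it assumes the default URL from the `--corenlp-url` option, and the function name `server_is_reachable` is illustrative. It does not verify that the responding service is actually CoreNLP.

```python
import urllib.error
import urllib.request


def server_is_reachable(url: str = "http://localhost:9000", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    # Bypass any proxy configured in the environment: the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, albeit with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return False


if __name__ == "__main__":
    print("CoreNLP reachable:", server_is_reachable())
```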
Owner
- Name: Nicolas Hiot
- Login: Neplex
- Kind: user
- Location: France
- Company: Ennov
- Website: https://fr.linkedin.com/in/nicolas-hiot-b092b5108
- Repositories: 22
- Profile: https://github.com/Neplex
PhD student
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: ArchiTXT
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jacques
    family-names: Chabin
    orcid: 'https://orcid.org/0000-0003-1460-9979'
    email: jacques.chabin@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
  - given-names: Mirian
    family-names: Halfeld-Ferrari
    email: mirian@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
    orcid: 'https://orcid.org/0000-0003-2601-3224'
  - given-names: Nicolas
    family-names: Hiot
    email: nicolas.hiot@univ-orleans.fr
    affiliation: 'Université d''Orléans, INSA CVL, LIFO, UR 4022, Orléans, France'
    orcid: 'https://orcid.org/0000-0003-4318-4906'
repository-code: 'https://github.com/Neplex/ArchiTXT'
license: CC-BY-NC-4.0
identifiers:
  - type: swh
    value: 'swh:1:dir:8f5b7716e68a8b4261e93ac40be26ce0d9593f40'
CodeMeta (codemeta.json)
{
"@context": [
"https://w3id.org/codemeta/3.0",
"https://w3id.org/software-iodata",
"https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
"https://schema.org",
"https://w3id.org/software-types"
],
"@type": "SoftwareSourceCode",
"author": [
{
"@id": "https://orcid.org/0000-0003-1460-9979",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Chabin",
"givenName": "Jacques"
},
{
"@id": "https://orcid.org/0000-0003-2601-3224",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Halfeld-Ferrari",
"givenName": "Mirian"
},
{
"@id": "https://orcid.org/0000-0003-4318-4906",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Hiot",
"givenName": "Nicolas"
}
],
"codeRepository": "https://github.com/Neplex/ArchiTXT",
"contributor": [
{
"@id": "https://orcid.org/0000-0003-1460-9979",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Chabin",
"givenName": "Jacques"
},
{
"@id": "https://orcid.org/0000-0003-2601-3224",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Halfeld-Ferrari",
"givenName": "Mirian"
},
{
"@id": "https://orcid.org/0000-0003-4318-4906",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Hiot",
"givenName": "Nicolas"
}
],
"developmentStatus": "https://www.repostatus.org/#wip",
"identifier": "data",
"license": "http://spdx.org/licenses/CC-BY-NC-4.0",
"maintainer": {
"@id": "https://orcid.org/0000-0003-1460-9979",
"@type": "Person",
"affiliation": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"familyName": "Chabin",
"givenName": "Jacques"
},
"name": "ArchiTXT",
"producer": {
"@type": "Organization",
"legalName": "Université d'Orléans, INSA CVL, LIFO, UR 4022, Orléans, France"
},
"url": "https://github.com/Neplex/ArchiTXT"
}
GitHub Events
Total
- Create event: 69
- Release event: 6
- Issues event: 10
- Watch event: 2
- Delete event: 72
- Member event: 1
- Issue comment event: 30
- Push event: 380
- Pull request review comment event: 23
- Pull request review event: 21
- Pull request event: 142
Last Year
- Create event: 69
- Release event: 6
- Issues event: 10
- Watch event: 2
- Delete event: 72
- Member event: 1
- Issue comment event: 30
- Push event: 380
- Pull request review comment event: 23
- Pull request review event: 21
- Pull request event: 142
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 9
- Total pull requests: 68
- Average time to close issues: 6 days
- Average time to close pull requests: 8 days
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.13
- Merged pull requests: 46
- Bot issues: 1
- Bot pull requests: 29
Past Year
- Issues: 9
- Pull requests: 68
- Average time to close issues: 6 days
- Average time to close pull requests: 8 days
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.13
- Merged pull requests: 46
- Bot issues: 1
- Bot pull requests: 29
Top Authors
Issue Authors
- Neplex (8)
- dependabot[bot] (1)
Pull Request Authors
- dependabot[bot] (39)
- Neplex (36)
- bap-haudebourg (5)
Dependencies
- openjdk 8u272 build
- corenlp latest
- nlpbox/corenlp latest
- rcmdnk/python-action v1 composite