etymdb

[LREC 2020] EtymDB, an Etymological DataBase (v2.1)

https://github.com/clefourrier/etymdb

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Keywords

borrowings cognates database etymology etymology-data extract lrec2020 tei wiktionary wiktionary-parser
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: clefourrier
  • License: cc-by-sa-4.0
  • Language: Perl
  • Default Branch: master
  • Homepage:
  • Size: 25.2 MB
Statistics
  • Stars: 24
  • Watchers: 3
  • Forks: 2
  • Open Issues: 1
  • Releases: 0
Created almost 6 years ago · Last pushed about 4 years ago
Metadata Files
Readme Citation

README.md

EtymDB 2.1

EtymDB 2.1: an etymological database extracted from the Wiktionary (described in "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0").

Previous versions available here. Logo upgraded by Alix Chagu.

Organisation of the repo (and the database)

  • data

    • etymdb.csv is the raw extracted DB csv file
      • Extracted from wiktionary.xml, itself extracted from enwiktionary-latest-pages-articles.xml. Neither file has been added to the repo because of its size; if you need them, please contact the repo owner
    • split_etymdb contains the extracted database, separated in several files for easier data analysis
      • etymdb_values: Word ix, Lang identifier (in wiki code), Lexeme, Gloss (English translation)
      • etymdb_links_info: Direct relation type, child word ix, parent word ix
        • If the parent index is negative (usually for derivation or compounding relations), the word has several parents: look up the negative index in etymdb_links_index, where it is listed together with the indices of all its parents
      • etymdb_links_index: Multiple parents relation ix, parent 1 ix, parent 2 ix, ... parent n ix
  • extraction_scripts contains all the scripts used for data extraction, included for reproducibility

  • analysis_notebooks contains two Jupyter notebooks to help you get a quick start with the database; one of them reproduces part 7 of the paper

  • static contains the logos
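
As an illustration of the negative-index convention described above, here is a minimal Python sketch. It assumes the split files are tab-separated with the column order listed above; the sample rows, the relation labels ("compound", "inherited") and the helper names are hypothetical and only serve the example.

```python
import csv
import io

# Hypothetical sample rows (assumption: tab-separated, columns as listed above).
VALUES = "1\ten\tblackbird\t\n2\ten\tblack\t\n3\ten\tbird\t\n4\tang\tblæc\tblack\n"
LINKS_INFO = "compound\t1\t-1\ninherited\t2\t4\n"  # relation, child ix, parent ix
LINKS_INDEX = "-1\t2\t3\n"                         # group ix, parent 1 ix, parent 2 ix


def read_tsv(text):
    """Parse tab-separated text into a list of rows, without quote handling."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def load_parent_groups(index_rows):
    """Map each negative group index to the list of its parent word indices."""
    return {int(row[0]): [int(ix) for ix in row[1:] if ix] for row in index_rows}


def resolve_parents(parent_ix, groups):
    """Return the real parent indices behind a (possibly negative) parent index."""
    if parent_ix < 0:
        return groups.get(parent_ix, [])
    return [parent_ix]


words = {int(row[0]): (row[1], row[2]) for row in read_tsv(VALUES)}
groups = load_parent_groups(read_tsv(LINKS_INDEX))

# Expand every link into one (relation, child ix, parent ix) edge per real parent.
edges = []
for relation, child, parent in read_tsv(LINKS_INFO):
    for real_parent in resolve_parents(int(parent), groups):
        edges.append((relation, int(child), real_parent))

for relation, child, parent in edges:
    print(relation, words[child], "<-", words[parent])
```

With the real data, the three strings would be replaced by the contents of the files in data/split_etymdb.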

Data extraction

You can reproduce all steps of data extraction by using the following commands on your data dump of interest.

Extract your data dump

Download and extract the xml data dump that you want to use, and put it in data/.

bunzip2 -k enwiktionary-date-pages-articles.xml.bz2
mv enwiktionary-date-pages-articles.xml data/
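
If you prefer to stay in Python, the same step can be sketched with the standard library. This assumes the dump is a plain bzip2-compressed XML file; the function name decompress_dump and its defaults are hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def decompress_dump(src, dest_dir="data"):
    """Stream-decompress a .bz2 file into dest_dir and return the output path."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.with_suffix("").name  # drop the trailing ".bz2"
    # Copy in 1 MiB chunks so the whole dump never has to fit in memory.
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dest


# Example call (the file name is a placeholder for your own dump):
# decompress_dump("enwiktionary-date-pages-articles.xml.bz2", "data")
```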

From xml to csv

Then, from the extraction_scripts folder:

cat ../data/enwiktionary-date-pages-articles.xml | perl enwiktionary2xml.pl > ../data/enwiktionary.xml
cat ../data/enwiktionary.xml | perl etymology_analyser.pl > ../data/enwiktionary.csv

From csv to split csv

From the data folder.

```
# Get only links_index (rows whose first field is a negative index)
awk '$1 ~ /^-/' etymdb.csv > split_etymdb/etymdb_links_index.csv

# Get everything except links_index
awk '$1 !~ /^-/' etymdb.csv > split_etymdb/etymdb_not_links_index.csv

# Get only lexeme info (values)
awk 'NF > 3 { print $0 }' split_etymdb/etymdb_not_links_index.csv > split_etymdb/etymdb_values.csv

# Get only links_info
awk 'NF == 3 { print $0 }' split_etymdb/etymdb_not_links_index.csv > split_etymdb/etymdb_links_info.csv
```
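
The same three-way split can be sketched in Python. This version mirrors awk's default behaviour of splitting fields on runs of whitespace; the function name split_rows and the sample lines are hypothetical and not taken from the repository.

```python
def split_rows(lines):
    """Classify raw lines the way the awk commands do.

    Returns (links_index, values, links_info), each a list of field lists.
    """
    links_index, values, links_info = [], [], []
    for line in lines:
        fields = line.split()  # whitespace splitting, like awk's default
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith("-"):
            links_index.append(fields)   # first field is a negative index
        elif len(fields) > 3:
            values.append(fields)        # more than three fields: lexeme info
        elif len(fields) == 3:
            links_info.append(fields)    # exactly three fields: a single link
    return links_index, values, links_info


# Hypothetical sample lines for illustration only.
sample = [
    "-1\t2\t3",
    "1\ten\tblackbird\tbird",
    "compound\t1\t-1",
]
links_index, values, links_info = split_rows(sample)
print(len(links_index), len(values), len(links_info))
```

Rows with fewer than three fields that do not start with "-" are dropped here, as they are by the awk commands.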

Citation

@inproceedings{fourrier-sagot-2020-methodological,
    title = "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing {E}tym{DB}-2.0",
    author = "Fourrier, Cl{\'e}mentine and Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.392",
    pages = "3207--3216",
    ISBN = "979-10-95546-34-4",
}

Owner

  • Name: Clémentine Fourrier
  • Login: clefourrier
  • Kind: user
  • Location: France
  • Company: @huggingface

Researcher at 🤗

Citation (CITATION.bib)

@inproceedings{fourrier-sagot-2020-methodological,
    title = "Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing {E}tym{DB}-2.0",
    author = "Fourrier, Cl{\'e}mentine  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.392",
    pages = "3207--3216",
    abstract = "Diachronic lexical information is not only important in the field of historical linguistics, but is also increasingly used in NLP, most recently for machine translation of low resource languages. Therefore, there is a need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation or medieval languages study.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 6.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 6.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • p-acharya (1)