dbnary

DBnary extractor mirror - See https://gitlab.com/gilles.serasset/dbnary

https://github.com/serasset/dbnary

Keywords

java lua ontolex-lexicography rdf wiktionary wiktionary-parser

Repository

Basic Info
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 9
Created almost 4 years ago · Last pushed 5 months ago

README.md

DBnary extractor

DBnary is an attempt to extract as much lexical data as possible from as many Wiktionary language editions as possible, in a structured (RDF) way, using the standard lexicon ontology vocabulary (ontolex).

The extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).

The current repository contains the extraction programs, currently handling 26 language editions.

Using the extracted data

Extracted data is available in RDF. You will have to load it into an RDF database or read it through an RDF API (Jena in Java, or equivalent libraries in other languages). You may download the data from the above web page.

You may also query the data from the above web page, using SPARQL.

This repository hosts the programs that extracted the data from Wiktionary. It does not contain tools to use it.

Installing the extractor (without compiling)

The DBnary extractor utilities are now packaged as a Homebrew Java application. Install it with:

```bash
brew install serasset/tap/dbnary
```

This will install the DBnary commands along with all dependencies in the Homebrew directory.

Compiling the extractor

You do not have to compile this extractor if your only purpose is to use the extracted data. As stated above, the extracted data is made available in sync with Wiktionary dumps.

However, you are free (and encouraged) to compile and enhance the extractors.

  • The DBnary extractor is built with Maven and is written in Java (with small parts in Scala)
  • Dependencies should be taken care of by Maven
  • There is no database to configure; the extractor works directly on the dump files

Using the extractor?

The easiest way is to use the command-line interface packaged as a Java app.

You should either install the DBnary extractor using Homebrew or, if you are debugging the extractor, make sure the dbnary shell script located at YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin/dbnary comes first in your PATH. Note that this file only exists after a full mvn package run from the root folder of the dbnary source code.
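For the debugging case, the PATH setup described above might look like the following sketch. It assumes `YOUR_SOURCE_DIRECTORY` is the folder that contains your `dbnary` checkout; the `$HOME/src` default and the `DBNARY_BIN` variable name are hypothetical and only used for illustration.

```shell
# Assumption: YOUR_SOURCE_DIRECTORY is the folder containing the dbnary checkout.
# The $HOME/src default is a placeholder, adjust it to your setup.
YOUR_SOURCE_DIRECTORY="${YOUR_SOURCE_DIRECTORY:-$HOME/src}"

# Folder holding the dbnary launcher script (only present after `mvn package`).
DBNARY_BIN="$YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin"

# Prepend it so this build is found before any Homebrew-installed dbnary.
export PATH="$DBNARY_BIN:$PATH"

# Show which dbnary the shell would now pick, if any.
command -v dbnary || echo "dbnary not found: has 'mvn package' been run?"
```

Putting the folder at the front of PATH, rather than the end, is what makes the locally built script take precedence over an installed one.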

```bash
$ dbnary --version
3.0.9
$ dbnary help
Usage: dbnary [-hvV] [--dir=] [--debug=[,...]]... [--trace=]... [@...] [COMMAND]
DBnary is a set of tools used to extract lexical data from several editions of
wiktionaries. All extracted data is made available as Linked Open Data, using
ontolex, lexinfo, olia and several other specialized vocabularies.
      [@...]             One or more argument files containing options.
      --debug=[,...]
      --dir=<dbnaryDir>
  -h, --help             Show this help message and exit.
      --trace=
  -v                     Print extra information.
  -V, --version          Print version information and exit.

Commands:

The dbnary commands are:
  check    check the mediawiki syntax of all pages of a dump.
  extract  extract all pages from a dump and write resulting RDF files.
  help     Displays help information about the specified command
  update   Update dumps for all specified languages, then extract them.
  sample   extract the specified pages from a dump and write resulting RDF files to stdout.
  tree     Parse the specified entries wikitext and display the parse tree to stdout.
  source   get the wikitext source of the specified pages.
  compare  fetch and compare extracts from different dates.
  grep     grep a given pattern in all pages of a dump.
```

All subcommands are also documented through the help subcommand. For example:

```bash
$ dbnary help grep
grep a given pattern in all pages of a dump.
Usage: dbnary grep [-hlvV] [--all-matches] [--[no-]compress] [--[no-]tdb] [--plain] [--dir=] [-F=NUMBER] [-T=NUMBER] [--debug=[,...]]... [--trace=]...
This command looks for a given pattern in all pages of a dump and output the
matching pages.
                          The dump file of the wiki to be extracted.
                          The pattern to be searched for.
      --all-matches       show all matches.
      --debug=[,...]
      --dir=<dbnaryDir>
  -F, --frompage=NUMBER   Begin the extraction at the specified page number.
  -h, --help              Show this help message and exit.
  -l, --pagename          only show the name of the page.
      --[no-]compress     Compress the resulting extracted files using BZip2. set by default.
      --[no-]tdb          Use TDB2 (temporary file storage for extracted models, usefull/necessary for big dumps. set by default.
      --plain             match is displayed without specific formatting.
  -T, --topage=NUMBER     Stop the extraction at the specified page number.
      --trace=
  -v                      Print extra information.
  -V, --version           Print version information and exit.
```

Performing releases

The DBnary project uses the git flow branching model. To release the code with Maven, we use the gitflow plugin.

```bash
mvn gitflow:release-start

# edit all scripts in kaiko/ to use the correct (non SNAPSHOT) version.

mvn deploy site:site site:deploy
mvn gitflow:release-finish
```

Using CI/CD to validate changes in the extractors

Because DBnary now extracts more than 20 language editions, which use very diverse microstructures for their entry descriptions, a change (especially one at the DataHandler level) is very likely to break the extraction of another language.

Hence, it is essential to be able to evaluate the impact of a set of changes on the extraction of all languages. To do so, a CI/CD setup has been created that launches the extraction of a sample of 10000 pages from each language and computes the diffs between the new and previous versions.

This CI/CD pipeline is triggered when a Merge Request is created on the GitLab platform.

As we are using the gitflow strategy, here are the steps to be performed:

  • Features
    • mvn gitflow:feature-start -DpushRemote=true
    • Develop the feature on its branch (don't forget to push the feature branch)
    • Create a Merge Request to the develop branch on GitLab. This triggers the CI/CD evaluation of the request: the pipeline extracts a sample of pages from the latest Wiktionary dumps and compares the results. The ttl files are available as a pipeline artifact for 14 days after evaluation. Keep in mind that the evaluation can take a very long time (several hours).
    • When the MR has been evaluated, checked and approved, finish it using the gitflow plugin
    • mvn gitflow:feature-finish
    • OR merge it using the MR on GitLab (and delete the feature branch).
  • Releases
    • TBD

Controlling CI/CD extractors validation

To avoid re-evaluating every language when it is not necessary, the validation process can be controlled in two ways:

  1. Globally, by setting the VALIDATION_LANGUAGES variable on the repository (see repository variables on GitLab)
  2. Per change, by specifying the languages in the commit message
    • The commit message that triggers the evaluation (the last message of the MR) should contain the string VALIDATION_LANGUAGES="la es fr" (note that the quotes are mandatory)
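To make the expected message format concrete, here is a minimal sketch of how such a string could be read back out of a commit message. The `validation_languages` function and the sample message are hypothetical; the actual pipeline script is not shown in this README and may work differently.

```shell
# Hypothetical helper, not the pipeline's real code: print the language list
# found in a commit message, or nothing if the quoted form is absent.
validation_languages() {
  # Keep only the text between the mandatory double quotes.
  printf '%s\n' "$1" \
    | sed -n 's/.*VALIDATION_LANGUAGES="\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Sample commit message (made up for illustration).
msg='Fix a template handling issue VALIDATION_LANGUAGES="la es fr"'

# Word splitting on the unquoted result yields one language code per iteration.
for lang in $(validation_languages "$msg"); do
  echo "would validate: $lang"
done
```

The pattern only matches when the double quotes are present, which is one plausible reason for the quotes being mandatory: they delimit a list that itself contains spaces.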

Contribution guidelines

  • Writing tests
  • Code review
  • Other guidelines

Contacts

  • Contact Gilles Sérasset <Gilles.Serasset@imag.fr>

Owner

  • Name: Gilles Sérasset
  • Login: serasset
  • Kind: user
  • Location: Grenoble, France
  • Company: Université Grenoble Alpes

Teacher/Researcher at Université Grenoble Alpes, I'm the author and maintainer of the DBnary dataset (wiktionaries in RDF)

CodeMeta (codemeta.json)

{
  "@context": "https://w3id.org/codemeta/3.0",
  "type": "SoftwareSourceCode",
  "applicationCategory": "Natural Language Processing",
  "author": [
    {
      "id": "https://orcid.org/0000-0003-2761-7353",
      "type": "Person",
      "affiliation": {
        "type": "Organization",
        "name": "Laboratoire d'Informatique de Grenoble, Université Grenoble Alpes"
      },
      "email": "gilles.serasset@imag.fr",
      "familyName": "Sérasset",
      "givenName": "Gilles"
    },
    {
      "type": "Role",
      "schema:author": "https://orcid.org/0000-0003-2761-7353",
      "roleName": "Developper",
      "startDate": "2010-04-27"
    }
  ],
  "codeRepository": "git+https://gitlab.com/gilles.serasset/dbnary.git",
  "dateCreated": "2010-04-27",
  "dateModified": "2024-08-20",
  "datePublished": "2013-05-07",
  "description": "DBnary is an attempt to extract as many lexical data as possible from as many Wiktionary Language Editions as possible, in a structured (RDF) way, using standard lexicon ontology vocabulary (ontolex).\nThe extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).\nThe current repository contains the extraction programs, currently handling 25 language editions.",
  "downloadUrl": "https://github.com/serasset/dbnary/releases/download/v3.1.23/dbnary-commands-3.1.23.zip",
  "isPartOf": "https://kaiko.getalp.org/about-dbnary",
  "keywords": [
    "Lexicon",
    "ontolex",
    "wiktionary",
    "dbnary"
  ],
  "license": "https://spdx.org/licenses/MIT",
  "name": "DBnary extractor",
  "operatingSystem": [
    "Linux",
    "MacOS",
    "Windows"
  ],
  "programmingLanguage": [
    "Java",
    "scala",
    "bash",
    "SPARQL"
  ],
  "runtimePlatform": "JVM",
  "version": "3.1.23",
  "codemeta:contIntegration": {
    "id": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines"
  },
  "continuousIntegration": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines",
  "developmentStatus": "active",
  "isSourceCodeOf": "DBnary dataset",
  "issueTracker": "https://gitlab.com/gilles.serasset/dbnary/-/issues/"
}

GitHub Events

Total
  • Release event: 22
  • Watch event: 4
  • Delete event: 15
  • Push event: 59
  • Create event: 28
Last Year
  • Release event: 22
  • Watch event: 4
  • Delete event: 15
  • Push event: 59
  • Create event: 28

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 11
  • Average time to close issues: N/A
  • Average time to close pull requests: 21 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • snyk-bot (11)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

dbnary-commands/pom.xml maven
  • info.picocli:picocli
  • org.apache.commons:commons-compress
  • org.getalp:dbnary-extractor 3.0.11
  • org.getalp:rdf-utils 3.0.11
  • org.slf4j:slf4j-api
  • org.slf4j:slf4j-simple
dbnary-commons/pom.xml maven
  • org.slf4j:slf4j-api
  • junit:junit test
  • org.junit.jupiter:junit-jupiter-api test
  • org.junit.jupiter:junit-jupiter-engine test
  • org.junit.vintage:junit-vintage-engine test
dbnary-enhancer/pom.xml maven
  • com.h2database:h2 2.1.210
  • com.memetix:microsoft-translator-java-api 0.6.2
  • com.wcohen:secondstring 20120620
  • commons-cli:commons-cli
  • org.apache.commons:commons-compress
  • org.apache.jena:apache-jena-libs
  • org.getalp:dbnary-commons 3.0.11
  • org.getalp:dbnary-ontology 3.0.11
  • org.slf4j:slf4j-api
dbnary-extractor/pom.xml maven
  • com.fasterxml.jackson.core:jackson-databind
  • com.fasterxml.woodstox:woodstox-core
  • com.github.rwitzel.streamflyer:streamflyer-core 1.2.0
  • com.sun.xml.bind:jaxb-impl
  • com.typesafe.scala-logging:scala-logging_${scalaBinaryVersion}
  • commons-cli:commons-cli
  • commons-io:commons-io
  • info.bliki.wiki:bliki-core
  • jakarta.xml.bind:jakarta.xml.bind-api
  • net.sf.ehcache:ehcache 2.10.9.2
  • org.apache.commons:commons-compress
  • org.apache.commons:commons-text
  • org.apache.httpcomponents:httpclient
  • org.apache.httpcomponents:httpcore
  • org.apache.jena:apache-jena-libs
  • org.codehaus.woodstox:stax2-api
  • org.getalp:dbnary-commons ${project.version}
  • org.getalp:dbnary-enhancer ${project.version}
  • org.getalp:dbnary-hdt ${project.version}
  • org.getalp:dbnary-ontology ${project.version}
  • org.getalp:dbnary-wikitext ${project.version}
  • org.jsoup:jsoup 1.14.3
  • org.scala-lang.modules:scala-parser-combinators_${scalaBinaryVersion}
  • org.scala-lang:scala-library
  • org.slf4j:jcl-over-slf4j
  • org.slf4j:slf4j-api
  • junit:junit test
  • org.hamcrest:hamcrest-core test
  • org.junit.jupiter:junit-jupiter-api test
  • org.junit.jupiter:junit-jupiter-engine test
  • org.junit.vintage:junit-vintage-engine test
  • org.slf4j:slf4j-simple test
dbnary-hdt/pom.xml maven
  • org.apache.commons:commons-compress
  • org.apache.jena:apache-jena-libs
  • org.rdfhdt:hdt-java-core 2.1.2
  • org.slf4j:slf4j-api
  • org.hamcrest:hamcrest-core test
  • org.junit.jupiter:junit-jupiter-api test
  • org.junit.jupiter:junit-jupiter-engine test
  • org.slf4j:slf4j-simple test
dbnary-ontology/pom.xml maven
  • org.apache.jena:apache-jena-libs
  • org.getalp:rdf-utils 3.0.11 test
  • org.slf4j:slf4j-api test
  • org.slf4j:slf4j-simple test
dbnary-wikitext/pom.xml maven
  • org.apache.commons:commons-text
  • org.slf4j:slf4j-api
  • junit:junit test
  • org.junit.jupiter:junit-jupiter-api test
  • org.junit.jupiter:junit-jupiter-engine test
  • org.junit.vintage:junit-vintage-engine test
  • org.slf4j:slf4j-simple test
experiments/OUPLinker/pom.xml maven
  • commons-cli:commons-cli 1.2 compile
  • org.apache.commons:commons-compress 1.0
  • org.apache.commons:commons-lang3 3.0
  • org.apache.httpcomponents:httpclient 4.2.6
  • org.apache.jena:apache-jena-libs 2.12.1
  • org.getalp.dbnary:ontology 1.5-SNAPSHOT
  • org.getalp:org.getalp.lexsema-ontolex-api 1.0-SNAPSHOT
  • org.getalp:org.getalp.lexsema-ontolex-dbnary 1.0-SNAPSHOT
  • org.getalp:org.getalp.lexsema-similarity 1.0-SNAPSHOT
  • org.slf4j:slf4j-api 1.7.7
  • org.slf4j:slf4j-simple 1.7.7
  • junit:junit 4.9 test
experiments/ldl2014/pom.xml maven
  • com.h2database:h2 1.4.177
  • com.memetix:microsoft-translator-java-api 0.6.2
  • com.wcohen:secondstring 20120620
  • commons-cli:commons-cli 1.2
  • org.getalp:dbnary-extractor 2.0-SNAPSHOT
  • junit:junit 4.9 test
experiments/trans2links/pom.xml maven
  • com.h2database:h2 1.4.177
  • com.memetix:microsoft-translator-java-api 0.6.2
  • com.wcohen:secondstring 20120620
  • commons-cli:commons-cli 1.2
  • org.getalp:dbnary-extractor 2.0-SNAPSHOT
  • org.jgrapht:jgrapht-core 1.0.1
  • org.jgrapht:jgrapht-ext 1.0.1
  • junit:junit 4.9 test
pom.xml maven
  • org.junit:junit-bom 5.8.2 import
  • com.fasterxml.jackson.core:jackson-databind 2.13.2.1
  • com.fasterxml.woodstox:woodstox-core 6.2.8
  • com.sun.xml.bind:jaxb-impl 3.0.1
  • com.typesafe.scala-logging:scala-logging_2.13 3.9.4
  • commons-cli:commons-cli 1.5.0
  • commons-io:commons-io 2.11.0
  • info.bliki.wiki:bliki-core 3.1.2G
  • info.picocli:picocli 4.6.3
  • jakarta.xml.bind:jakarta.xml.bind-api 3.0.1
  • org.apache.commons:commons-compress 1.21
  • org.apache.commons:commons-text 1.9
  • org.apache.httpcomponents:httpclient 4.5.13
  • org.apache.httpcomponents:httpcore 4.4.15
  • org.apache.jena:apache-jena-base 4.5.0
  • org.apache.jena:apache-jena-libs 4.5.0
  • org.apache.jena:jena-cmds 4.5.0
  • org.codehaus.woodstox:stax2-api 4.2.1
  • org.scala-lang.modules:scala-parser-combinators_2.13 1.1.2
  • org.scala-lang:scala-library 2.13.8
  • org.slf4j:jcl-over-slf4j 1.7.36
  • org.slf4j:slf4j-api 1.7.36
  • org.slf4j:slf4j-simple 1.7.36
  • junit:junit 4.13.2 test
  • org.hamcrest:hamcrest-core 2.2 test
rdf-utils/pom.xml maven
  • com.slack.api:slack-api-client 1.23.0
  • commons-cli:commons-cli
  • commons-io:commons-io
  • org.apache.commons:commons-compress
  • org.apache.jena:apache-jena-libs
  • org.apache.jena:jena-cmds
  • org.getalp:dbnary-commons ${project.version}
  • org.slf4j:slf4j-api
  • org.slf4j:slf4j-simple
  • junit:junit test
tutorials/jupyter-python/docker-compose.yml docker
  • serasset/sparql-jupyterlab latest
dbnary-jena-libs/pom.xml maven
  • com.fasterxml.jackson.core:jackson-databind
  • com.fasterxml.woodstox:woodstox-core
  • com.google.protobuf:protobuf-java
  • org.apache.commons:commons-compress
  • org.apache.jena:apache-jena-libs ${jena.version}
  • org.codehaus.woodstox:stax2-api
  • org.slf4j:slf4j-api
  • org.slf4j:slf4j-simple test
build-tools/pom.xml maven