dbnary

DBnary extractor mirror - See https://gitlab.com/gilles.serasset/dbnary

https://github.com/serasset/dbnary

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
✓ .zenodo.json file: found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary

Keywords

java lua ontolex-lexicography rdf wiktionary wiktionary-parser

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: serasset
License: mit
Language: Java
Default Branch: master
Homepage: http://kaiko.getalp.org/about-dbnary
Size: 52.3 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 9

Created about 4 years ago · Last pushed 7 months ago

Metadata Files

Readme License Codemeta

DBnary extractor

DBnary is an attempt to extract as much lexical data as possible from as many Wiktionary language editions as possible, in a structured (RDF) form, using a standard lexicon ontology vocabulary (ontolex).

The extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).

The current repository contains the extraction programs, currently handling 26 language editions.

Using the extracted data

Extracted data is available in RDF. To use it, load it into an RDF database or read it with an RDF API (Jena in Java, or an equivalent library in another language). You may download the data from the above web page.

You may also query the data from the above web page, using SPARQL.

This repository hosts the programs that extract the data from Wiktionary. It does not contain tools for using the extracted data.

Installing the extractor (without compiling)

The DBnary extractor utilities are now packaged as a Homebrew Java application. Install it with:

```bash
brew install serasset/tap/dbnary
```

This installs the DBnary commands, along with all dependencies, in the Homebrew directory.

Compiling the extractor

You do not have to compile the extractor if your only purpose is to use the extracted data. As stated in the previous section, the extracted data is made available in sync with the Wiktionary dumps.

However, you are free (and encouraged) to compile and enhance the extractors.

- The DBnary extractor uses Maven and is written in Java (with small parts in Scala).
- Dependencies should be taken care of by Maven.
- There is no database to configure; the extractor directly uses the dump files.

Using the extractor

The easiest way is to use the command-line interface, which is packaged as a Java application.

Either install the DBnary extractor using Homebrew or, if you are debugging the extractor, make sure the dbnary shell script located at YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin/dbnary comes first in your PATH. Note that this file only exists after a full mvn package run from the root folder of the dbnary source code.
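As a minimal sketch of the PATH setup described above: the snippet below prepends the appassembler bin directory so that a locally built dbnary takes precedence over a Homebrew-installed one. The default value `$HOME/src` for YOUR_SOURCE_DIRECTORY is only an assumption for illustration; replace it with the directory that actually contains your checkout.

```bash
# Placeholder from the text above: the directory that contains your dbnary checkout.
# "$HOME/src" is an assumed default, not a project convention.
YOUR_SOURCE_DIRECTORY="${YOUR_SOURCE_DIRECTORY:-$HOME/src}"
DBNARY_BIN="$YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin"

# The launcher script only exists after a full build from the source root:
#   (cd "$YOUR_SOURCE_DIRECTORY/dbnary" && mvn package)

# Prepend the build directory so it is searched before any other install.
export PATH="$DBNARY_BIN:$PATH"

# Show which dbnary the shell now resolves, if any.
command -v dbnary || echo "dbnary not found: run mvn package first"
```

Prepending (rather than appending) matters here because the shell uses the first match on PATH.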

```bash
$ dbnary --version
3.0.9
$ dbnary help
Usage: dbnary [-hvV] [--dir=] [--debug=[,...]]... [--trace=]... [@...] [COMMAND]
DBnary is a set of tools used to extract lexical data from several editions of
wiktionaries. All extracted data is made available as Linked Open Data, using
ontolex, lexinfo, olia and several other specialized vocabularies.
      [@...]          One or more argument files containing options.
      --debug=[,...]
      --dir=<dbnaryDir>
  -h, --help          Show this help message and exit.
      --trace=
  -v                  Print extra information.
  -V, --version       Print version information and exit.

Commands:

The dbnary commands are:
  check    check the mediawiki syntax of all pages of a dump.
  extract  extract all pages from a dump and write resulting RDF files.
  help     Displays help information about the specified command
  update   Update dumps for all specified languages, then extract them.
  sample   extract the specified pages from a dump and write resulting RDF files to stdout.
  tree     Parse the specified entries wikitext and display the parse tree to stdout.
  source   get the wikitext source of the specified pages.
  compare  fetch and compare extracts from different dates.
  grep     grep a given pattern in all pages of a dump.
```

All subcommands are also documented using the help subcommand. For example:

```bash
$ dbnary help grep
grep a given pattern in all pages of a dump.
Usage: dbnary grep [-hlvV] [--all-matches] [--[no-]compress] [--[no-]tdb]
                   [--plain] [--dir=] [-F=NUMBER] [-T=NUMBER]
                   [--debug=[,...]]... [--trace=]...
This command looks for a given pattern in all pages of a dump and output the
matching pages.
                          The dump file of the wiki to be extracted.
                          The pattern to be searched for.
      --all-matches       show all matches.
      --debug=[,...]
      --dir=<dbnaryDir>
  -F, --frompage=NUMBER   Begin the extraction at the specified page number.
  -h, --help              Show this help message and exit.
  -l, --pagename          only show the name of the page.
      --[no-]compress     Compress the resulting extracted files using BZip2.
                          set by default.
      --[no-]tdb          Use TDB2 (temporary file storage for extracted models,
                          usefull/necessary for big dumps. set by default.
      --plain             match is displayed without specific formatting.
  -T, --topage=NUMBER     Stop the extraction at the specified page number.
      --trace=
  -v                      Print extra information.
  -V, --version           Print version information and exit.
```

Performing releases

The DBnary project uses the git flow branching model. To successfully release the code using maven, we use the git flow plugin.

```bash
mvn gitflow:release-start
# edit all scripts in kaiko/ to use the correct (non SNAPSHOT) version.
mvn deploy site:site site:deploy
mvn gitflow:release-finish
```

Using CI/CD to validate changes in the extractors

As DBnary now extracts 22 different language editions, which use very diverse microstructures for their entry descriptions, it is very likely that a change (especially one at the DataHandler level) breaks the extraction of another language.

Hence, it is essential to be able to evaluate the impact of a set of changes on the extraction of all languages. In order to evaluate this, a CI/CD setup has been created that launches the extraction of a sample of 10000 pages from each language and computes the diffs between the new and previous versions.

This CI/CD pipeline is triggered when a Merge Request is created on the GitLab platform.

As we are using the gitflow strategy, here are the different steps to perform:

Features
- mvn gitflow:feature-start -DpushRemote=true
- Develop the feature on its branch (don't forget to push the feature branch).
- Create a Merge Request to the develop branch on GitLab. This triggers the CI/CD evaluation of the request: the pipeline extracts a sample of pages from the latest Wiktionary dumps and compares the results. The ttl files are available as a pipeline artifact for 14 days after the evaluation. Keep in mind that the evaluation can take a very long time (several hours).
- When the request has been evaluated, checked and approved, finish it using the gitflow plugin:
- mvn gitflow:feature-finish
- OR merge it using the Merge Request on GitLab (and delete the feature branch).
Releases
- TBD

Controlling CI/CD extractors validation

To avoid re-evaluating all languages when it is not necessary, you can control the validation process in two ways:

- Globally, by setting the VALIDATION_LANGUAGES variable on the repository (see repository variables on GitLab).
- Per change, by specifying the languages in the commit message.
- The commit message that triggers the evaluation (the last message of the Merge Request) should contain the string VALIDATION_LANGUAGES="la es fr" (note that the quotes are mandatory).
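To illustrate the commit-message convention, here is a hedged sketch of how such a marker could be recognized. It is not the pipeline's actual implementation, and the function name `validation_languages` is hypothetical. It only accepts the double-quoted form, in line with the note that the quotes are mandatory.

```bash
# Hypothetical helper: print the languages named in a commit message, one per line.
# Prints nothing when the VALIDATION_LANGUAGES="..." marker is absent or unquoted.
validation_languages() {
  printf '%s\n' "$1" |
    # Keep only the text between the mandatory double quotes.
    sed -n 's/.*VALIDATION_LANGUAGES="\([^"]*\)".*/\1/p' |
    # Turn the space-separated list into one language code per line.
    tr -s ' ' '\n'
}

# Example: prints la, es and fr on separate lines.
validation_languages 'Fix Latin morphology VALIDATION_LANGUAGES="la es fr"'
```

A message such as `VALIDATION_LANGUAGES=la` (no quotes) yields no output with this sketch, so a caller could fall back to the repository-level variable in that case.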

Contribution guidelines

Writing tests
Code review
Other guidelines

Contacts

Contact Gilles Sérasset <Gilles.Serasset@imag.fr>

Owner

Name: Gilles Sérasset
Login: serasset
Kind: user
Location: Grenoble, France
Company: Université Grenoble Alpes

Website: http://serasset.bitbucket.io/
Repositories: 18
Profile: https://github.com/serasset

Teacher/Researcher at Université Grenoble Alpes, I'm the author and maintainer of the DBnary dataset (wiktionaries in RDF)

CodeMeta (codemeta.json)

{
  "@context": "https://w3id.org/codemeta/3.0",
  "type": "SoftwareSourceCode",
  "applicationCategory": "Natural Language Processing",
  "author": [
    {
      "id": "https://orcid.org/0000-0003-2761-7353",
      "type": "Person",
      "affiliation": {
        "type": "Organization",
        "name": "Laboratoire d'Informatique de Grenoble, Université Grenoble Alpes"
      },
      "email": "gilles.serasset@imag.fr",
      "familyName": "Sérasset",
      "givenName": "Gilles"
    },
    {
      "type": "Role",
      "schema:author": "https://orcid.org/0000-0003-2761-7353",
      "roleName": "Developper",
      "startDate": "2010-04-27"
    }
  ],
  "codeRepository": "git+https://gitlab.com/gilles.serasset/dbnary.git",
  "dateCreated": "2010-04-27",
  "dateModified": "2024-08-20",
  "datePublished": "2013-05-07",
  "description": "DBnary is an attempt to extract as many lexical data as possible from as many Wiktionary Language Editions as possible, in a structured (RDF) way, using standard lexicon ontology vocabulary (ontolex).\nThe extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).\nThe current repository contains the extraction programs, currently handling 25 language editions.",
  "downloadUrl": "https://github.com/serasset/dbnary/releases/download/v3.1.23/dbnary-commands-3.1.23.zip",
  "isPartOf": "https://kaiko.getalp.org/about-dbnary",
  "keywords": [
    "Lexicon",
    "ontolex",
    "wiktionary",
    "dbnary"
  ],
  "license": "https://spdx.org/licenses/MIT",
  "name": "DBnary extractor",
  "operatingSystem": [
    "Linux",
    "MacOS",
    "Windows"
  ],
  "programmingLanguage": [
    "Java",
    "scala",
    "bash",
    "SPARQL"
  ],
  "runtimePlatform": "JVM",
  "version": "3.1.23",
  "codemeta:contIntegration": {
    "id": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines"
  },
  "continuousIntegration": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines",
  "developmentStatus": "active",
  "isSourceCodeOf": "DBnary dataset",
  "issueTracker": "https://gitlab.com/gilles.serasset/dbnary/-/issues/"
}

GitHub Events

Total

Release event: 22
Watch event: 4
Delete event: 15
Push event: 59
Create event: 28

Last Year

Release event: 22
Watch event: 4
Delete event: 15
Push event: 59
Create event: 28

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 11
Average time to close issues: N/A
Average time to close pull requests: 21 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

snyk-bot (11)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

dbnary-commands/pom.xml maven

info.picocli:picocli
org.apache.commons:commons-compress
org.getalp:dbnary-extractor 3.0.11
org.getalp:rdf-utils 3.0.11
org.slf4j:slf4j-api
org.slf4j:slf4j-simple

dbnary-commons/pom.xml maven

org.slf4j:slf4j-api
junit:junit test
org.junit.jupiter:junit-jupiter-api test
org.junit.jupiter:junit-jupiter-engine test
org.junit.vintage:junit-vintage-engine test

dbnary-enhancer/pom.xml maven

com.h2database:h2 2.1.210
com.memetix:microsoft-translator-java-api 0.6.2
com.wcohen:secondstring 20120620
commons-cli:commons-cli
org.apache.commons:commons-compress
org.apache.jena:apache-jena-libs
org.getalp:dbnary-commons 3.0.11
org.getalp:dbnary-ontology 3.0.11
org.slf4j:slf4j-api

dbnary-extractor/pom.xml maven

com.fasterxml.jackson.core:jackson-databind
com.fasterxml.woodstox:woodstox-core
com.github.rwitzel.streamflyer:streamflyer-core 1.2.0
com.sun.xml.bind:jaxb-impl
com.typesafe.scala-logging:scala-logging_${scalaBinaryVersion}
commons-cli:commons-cli
commons-io:commons-io
info.bliki.wiki:bliki-core
jakarta.xml.bind:jakarta.xml.bind-api
net.sf.ehcache:ehcache 2.10.9.2
org.apache.commons:commons-compress
org.apache.commons:commons-text
org.apache.httpcomponents:httpclient
org.apache.httpcomponents:httpcore
org.apache.jena:apache-jena-libs
org.codehaus.woodstox:stax2-api
org.getalp:dbnary-commons ${project.version}
org.getalp:dbnary-enhancer ${project.version}
org.getalp:dbnary-hdt ${project.version}
org.getalp:dbnary-ontology ${project.version}
org.getalp:dbnary-wikitext ${project.version}
org.jsoup:jsoup 1.14.3
org.scala-lang.modules:scala-parser-combinators_${scalaBinaryVersion}
org.scala-lang:scala-library
org.slf4j:jcl-over-slf4j
org.slf4j:slf4j-api
junit:junit test
org.hamcrest:hamcrest-core test
org.junit.jupiter:junit-jupiter-api test
org.junit.jupiter:junit-jupiter-engine test
org.junit.vintage:junit-vintage-engine test
org.slf4j:slf4j-simple test

dbnary-hdt/pom.xml maven

org.apache.commons:commons-compress
org.apache.jena:apache-jena-libs
org.rdfhdt:hdt-java-core 2.1.2
org.slf4j:slf4j-api
org.hamcrest:hamcrest-core test
org.junit.jupiter:junit-jupiter-api test
org.junit.jupiter:junit-jupiter-engine test
org.slf4j:slf4j-simple test

dbnary-ontology/pom.xml maven

org.apache.jena:apache-jena-libs
org.getalp:rdf-utils 3.0.11 test
org.slf4j:slf4j-api test
org.slf4j:slf4j-simple test

dbnary-wikitext/pom.xml maven

org.apache.commons:commons-text
org.slf4j:slf4j-api
junit:junit test
org.junit.jupiter:junit-jupiter-api test
org.junit.jupiter:junit-jupiter-engine test
org.junit.vintage:junit-vintage-engine test
org.slf4j:slf4j-simple test

experiments/OUPLinker/pom.xml maven

commons-cli:commons-cli 1.2 compile
org.apache.commons:commons-compress 1.0
org.apache.commons:commons-lang3 3.0
org.apache.httpcomponents:httpclient 4.2.6
org.apache.jena:apache-jena-libs 2.12.1
org.getalp.dbnary:ontology 1.5-SNAPSHOT
org.getalp:org.getalp.lexsema-ontolex-api 1.0-SNAPSHOT
org.getalp:org.getalp.lexsema-ontolex-dbnary 1.0-SNAPSHOT
org.getalp:org.getalp.lexsema-similarity 1.0-SNAPSHOT
org.slf4j:slf4j-api 1.7.7
org.slf4j:slf4j-simple 1.7.7
junit:junit 4.9 test

experiments/ldl2014/pom.xml maven

com.h2database:h2 1.4.177
com.memetix:microsoft-translator-java-api 0.6.2
com.wcohen:secondstring 20120620
commons-cli:commons-cli 1.2
org.getalp:dbnary-extractor 2.0-SNAPSHOT
junit:junit 4.9 test

experiments/trans2links/pom.xml maven

com.h2database:h2 1.4.177
com.memetix:microsoft-translator-java-api 0.6.2
com.wcohen:secondstring 20120620
commons-cli:commons-cli 1.2
org.getalp:dbnary-extractor 2.0-SNAPSHOT
org.jgrapht:jgrapht-core 1.0.1
org.jgrapht:jgrapht-ext 1.0.1
junit:junit 4.9 test

pom.xml maven

org.junit:junit-bom 5.8.2 import
com.fasterxml.jackson.core:jackson-databind 2.13.2.1
com.fasterxml.woodstox:woodstox-core 6.2.8
com.sun.xml.bind:jaxb-impl 3.0.1
com.typesafe.scala-logging:scala-logging_2.13 3.9.4
commons-cli:commons-cli 1.5.0
commons-io:commons-io 2.11.0
info.bliki.wiki:bliki-core 3.1.2G
info.picocli:picocli 4.6.3
jakarta.xml.bind:jakarta.xml.bind-api 3.0.1
org.apache.commons:commons-compress 1.21
org.apache.commons:commons-text 1.9
org.apache.httpcomponents:httpclient 4.5.13
org.apache.httpcomponents:httpcore 4.4.15
org.apache.jena:apache-jena-base 4.5.0
org.apache.jena:apache-jena-libs 4.5.0
org.apache.jena:jena-cmds 4.5.0
org.codehaus.woodstox:stax2-api 4.2.1
org.scala-lang.modules:scala-parser-combinators_2.13 1.1.2
org.scala-lang:scala-library 2.13.8
org.slf4j:jcl-over-slf4j 1.7.36
org.slf4j:slf4j-api 1.7.36
org.slf4j:slf4j-simple 1.7.36
junit:junit 4.13.2 test
org.hamcrest:hamcrest-core 2.2 test

rdf-utils/pom.xml maven

com.slack.api:slack-api-client 1.23.0
commons-cli:commons-cli
commons-io:commons-io
org.apache.commons:commons-compress
org.apache.jena:apache-jena-libs
org.apache.jena:jena-cmds
org.getalp:dbnary-commons ${project.version}
org.slf4j:slf4j-api
org.slf4j:slf4j-simple
junit:junit test

tutorials/jupyter-python/docker-compose.yml docker

serasset/sparql-jupyterlab latest

dbnary-jena-libs/pom.xml maven

com.fasterxml.jackson.core:jackson-databind
com.fasterxml.woodstox:woodstox-core
com.google.protobuf:protobuf-java
org.apache.commons:commons-compress
org.apache.jena:apache-jena-libs ${jena.version}
org.codehaus.woodstox:stax2-api
org.slf4j:slf4j-api
org.slf4j:slf4j-simple test

build-tools/pom.xml maven