dbnary
DBnary extractor mirror - See https://gitlab.com/gilles.serasset/dbnary
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.9%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: serasset
- License: MIT
- Language: Java
- Default Branch: master
- Homepage: http://kaiko.getalp.org/about-dbnary
- Size: 52.3 MB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 9
Topics
Metadata Files
README.md
DBnary extractor
DBnary is an attempt to extract as much lexical data as possible from as many Wiktionary Language Editions as possible, in a structured (RDF) way, using the standard lexicon ontology vocabulary (ontolex).
The extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).
The current repository contains the extraction programs, currently handling 26 language editions.
Using the extracted data
Extracted data is available in RDF. You will need to load it into an RDF database or read it with an RDF API (Jena in Java, or an equivalent in other languages). You may download the data from the above web page.
You may also query the data from the above web page, using SPARQL.
This repository hosts the programs that extract the data from Wiktionary. It does not contain tools to use the data.
Installing the extractor (without compiling)
The DBnary extractor utilities are now packaged as a Homebrew Java application. Install it with:
```bash
brew install serasset/tap/dbnary
```
This will install the DBnary commands along with all their dependencies in the Homebrew directory.
Compiling the extractor
First, you do not have to compile this extractor if your only purpose is to use the extracted data. As stated above, the extracted data is made available in sync with Wiktionary dumps.
However, you are free (and encouraged) to compile and enhance the extractors.
- The DBnary extractor uses Maven and is written in Java (with small parts in Scala).
- Dependencies should be taken care of by Maven.
- There is no database to configure: the extractor directly uses the dump files.
Using the extractor?
The easiest way is to use the command line interfaces packaged as a Java app.
You should either install the DBnary extractor using Homebrew or, if you are debugging the extractor,
make sure the dbnary shell script located at YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin/dbnary comes first in your PATH. Note that this file only exists after a full mvn package run from the dbnary source code root folder.
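For the debugging case, the PATH setup described above could look like the following sketch. The default value given to YOUR_SOURCE_DIRECTORY (`$HOME/src`) and the `DBNARY_BIN` variable name are assumptions made for illustration; only the `dbnary/dbnary-commands/target/appassembler/bin` suffix comes from the text above.

```bash
# Hypothetical checkout location: adjust YOUR_SOURCE_DIRECTORY to your setup.
YOUR_SOURCE_DIRECTORY="${YOUR_SOURCE_DIRECTORY:-$HOME/src}"
DBNARY_BIN="$YOUR_SOURCE_DIRECTORY/dbnary/dbnary-commands/target/appassembler/bin"

# Prepend the appassembler bin directory so that the locally built script
# is found before any other dbnary installed on the machine.
export PATH="$DBNARY_BIN:$PATH"

# Report which dbnary script (if any) the shell now resolves.
if command -v dbnary >/dev/null 2>&1; then
  echo "dbnary resolved to: $(command -v dbnary)"
else
  echo "dbnary not found: run 'mvn package' from the source root first"
fi
```

Prepending rather than appending matters here: if a Homebrew-installed dbnary is also present, the locally built script only takes precedence when its directory comes first.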
```bash
$ dbnary --version
3.0.9
$ dbnary help
Usage: dbnary [-hvV] [--dir=
--dir=<dbnaryDir>
-h, --help Show this help message and exit.
--trace=
Commands:
The dbnary commands are:
  check    check the mediawiki syntax of all pages of a dump.
  extract  extract all pages from a dump and write resulting RDF files.
  help     Displays help information about the specified command
  update   Update dumps for all specified languages, then extract them.
  sample   extract the specified pages from a dump and write resulting RDF files to stdout.
  tree     Parse the specified entries wikitext and display the parse tree to stdout.
  source   get the wikitext source of the specified pages.
  compare  fetch and compare extracts from different dates.
  grep     grep a given pattern in all pages of a dump.
```
All subcommands are also documented using the help subcommand. E.g.
```bash
$ dbnary help grep
grep a given pattern in all pages of a dump.
Usage: dbnary grep [-hlvV] [--all-matches] [--[no-]compress] [--[no-]tdb]
[--plain] [--dir=
--dir=<dbnaryDir>
-F, --frompage=NUMBER Begin the extraction at the specified page number.
-h, --help Show this help message and exit.
-l, --pagename only show the name of the page.
--[no-]compress Compress the resulting extracted files using BZip2.
set by default.
--[no-]tdb Use TDB2 (temporary file storage for extracted
models, usefull/necessary for big dumps. set by
default.
--plain match is displayed without specific formatting.
-T, --topage=NUMBER Stop the extraction at the specified page number.
--trace=
```
Performing releases
The DBnary project uses the git flow branching model. To release the code with Maven, we use the gitflow Maven plugin.
```bash
mvn gitflow:release-start
# edit all scripts in kaiko/ to use the correct (non SNAPSHOT) version.
mvn deploy site:site site:deploy
mvn gitflow:release-finish
```
Using CI/CD to validate changes in the extractors
As DBnary now extracts 22 different language editions, which use very diverse microstructures for their entry descriptions, it is very likely that a change (especially one at the DataHandler level) breaks the extraction of another language.
Hence, it is essential to be able to evaluate the impact of a set of changes on the extraction of all languages. In order to evaluate this, a CI/CD setup has been created that launches the extraction of a sample of 10000 pages from each language and computes the diffs between the new and previous versions.
This CI/CD pipeline is triggered when a Merge Request is created on the GitLab platform.
As we are using the gitflow strategy, here are the steps to perform:
- Features
  - Start the feature branch: `mvn gitflow:feature-start -DpushRemote=true`
  - Develop the feature on its branch (don't forget to push the feature branch).
  - Create a Merge Request to the develop branch on GitLab. This triggers the CI/CD evaluation of the Merge Request: the pipeline extracts a sample of pages from the latest Wiktionary dumps and compares the results. The ttl files are available as a pipeline artefact for 14 days after evaluation. Keep in mind that evaluation can take a very long time (several hours).
  - When the Merge Request has been evaluated, checked and approved, finish it using the gitflow plugin: `mvn gitflow:feature-finish`
  - OR merge it using the Merge Request on GitLab (and delete the feature branch).
- Releases
  - TBD
Controlling CI/CD extractors validation
To avoid re-evaluating all languages when it is not necessary, the validation process can be controlled in two ways:
- Globally, by setting the VALIDATION_LANGUAGES variable on the repository (see repository variables on GitLab)
- By specifying the languages in the COMMIT MESSAGE
  - The commit message THAT TRIGGERS THE EVALUATION (the last message of the Merge Request) should contain the string `VALIDATION_LANGUAGES="la es fr"` (note that the quotes are mandatory)
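To make the expected commit message format concrete, here is a minimal sketch of how such a marker could be read from a message. The `validation_languages` function and the sample message are hypothetical and are not taken from the actual pipeline scripts, which may parse the message differently.

```bash
# Hypothetical helper: print the languages requested by a commit message,
# or nothing when the message has no VALIDATION_LANGUAGES="..." marker.
validation_languages() {
  printf '%s\n' "$1" \
    | sed -n 's/.*VALIDATION_LANGUAGES="\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Example commit message (made up for illustration).
msg='Fix etymology extraction VALIDATION_LANGUAGES="la es fr"'

# The quoted value is a space-separated list of language codes.
for lang in $(validation_languages "$msg"); do
  echo "would validate: $lang"
done
```

In this sketch the pattern only matches a double-quoted value, which illustrates why a message containing `VALIDATION_LANGUAGES=la` without quotes would select no language.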
Contribution guidelines
- Writing tests
- Code review
- Other guidelines
Contacts
- Contact
Gilles Sérasset <Gilles.Serasset@imag.fr>
Owner
- Name: Gilles Sérasset
- Login: serasset
- Kind: user
- Location: Grenoble, France
- Company: Université Grenoble Alpes
- Website: http://serasset.bitbucket.io/
- Repositories: 18
- Profile: https://github.com/serasset
Teacher/Researcher at Université Grenoble Alpes, I'm the author and maintainer of the DBnary dataset (wiktionaries in RDF)
CodeMeta (codemeta.json)
{
"@context": "https://w3id.org/codemeta/3.0",
"type": "SoftwareSourceCode",
"applicationCategory": "Natural Language Processing",
"author": [
{
"id": "https://orcid.org/0000-0003-2761-7353",
"type": "Person",
"affiliation": {
"type": "Organization",
"name": "Laboratoire d'Informatique de Grenoble, Université Grenoble Alpes"
},
"email": "gilles.serasset@imag.fr",
"familyName": "Sérasset",
"givenName": "Gilles"
},
{
"type": "Role",
"schema:author": "https://orcid.org/0000-0003-2761-7353",
"roleName": "Developper",
"startDate": "2010-04-27"
}
],
"codeRepository": "git+https://gitlab.com/gilles.serasset/dbnary.git",
"dateCreated": "2010-04-27",
"dateModified": "2024-08-20",
"datePublished": "2013-05-07",
"description": "DBnary is an attempt to extract as many lexical data as possible from as many Wiktionary Language Editions as possible, in a structured (RDF) way, using standard lexicon ontology vocabulary (ontolex).\nThe extracted data is kept in sync with Wiktionary each time a new dump is generated and is available from http://kaiko.getalp.org/about-dbnary (more info is contained there).\nThe current repository contains the extraction programs, currently handling 25 language editions.",
"downloadUrl": "https://github.com/serasset/dbnary/releases/download/v3.1.23/dbnary-commands-3.1.23.zip",
"isPartOf": "https://kaiko.getalp.org/about-dbnary",
"keywords": [
"Lexicon",
"ontolex",
"wiktionary",
"dbnary"
],
"license": "https://spdx.org/licenses/MIT",
"name": "DBnary extractor",
"operatingSystem": [
"Linux",
"MacOS",
"Windows"
],
"programmingLanguage": [
"Java",
"scala",
"bash",
"SPARQL"
],
"runtimePlatform": "JVM",
"version": "3.1.23",
"codemeta:contIntegration": {
"id": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines"
},
"continuousIntegration": "https://gitlab.com/gilles.serasset/dbnary/-/pipelines",
"developmentStatus": "active",
"isSourceCodeOf": "DBnary dataset",
"issueTracker": "https://gitlab.com/gilles.serasset/dbnary/-/issues/"
}
GitHub Events
Total
- Release event: 22
- Watch event: 4
- Delete event: 15
- Push event: 59
- Create event: 28
Last Year
- Release event: 22
- Watch event: 4
- Delete event: 15
- Push event: 59
- Create event: 28
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 11
- Average time to close issues: N/A
- Average time to close pull requests: 21 days
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- snyk-bot (11)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- info.picocli:picocli
- org.apache.commons:commons-compress
- org.getalp:dbnary-extractor 3.0.11
- org.getalp:rdf-utils 3.0.11
- org.slf4j:slf4j-api
- org.slf4j:slf4j-simple
- org.slf4j:slf4j-api
- junit:junit test
- org.junit.jupiter:junit-jupiter-api test
- org.junit.jupiter:junit-jupiter-engine test
- org.junit.vintage:junit-vintage-engine test
- com.h2database:h2 2.1.210
- com.memetix:microsoft-translator-java-api 0.6.2
- com.wcohen:secondstring 20120620
- commons-cli:commons-cli
- org.apache.commons:commons-compress
- org.apache.jena:apache-jena-libs
- org.getalp:dbnary-commons 3.0.11
- org.getalp:dbnary-ontology 3.0.11
- org.slf4j:slf4j-api
- com.fasterxml.jackson.core:jackson-databind
- com.fasterxml.woodstox:woodstox-core
- com.github.rwitzel.streamflyer:streamflyer-core 1.2.0
- com.sun.xml.bind:jaxb-impl
- com.typesafe.scala-logging:scala-logging_${scalaBinaryVersion}
- commons-cli:commons-cli
- commons-io:commons-io
- info.bliki.wiki:bliki-core
- jakarta.xml.bind:jakarta.xml.bind-api
- net.sf.ehcache:ehcache 2.10.9.2
- org.apache.commons:commons-compress
- org.apache.commons:commons-text
- org.apache.httpcomponents:httpclient
- org.apache.httpcomponents:httpcore
- org.apache.jena:apache-jena-libs
- org.codehaus.woodstox:stax2-api
- org.getalp:dbnary-commons ${project.version}
- org.getalp:dbnary-enhancer ${project.version}
- org.getalp:dbnary-hdt ${project.version}
- org.getalp:dbnary-ontology ${project.version}
- org.getalp:dbnary-wikitext ${project.version}
- org.jsoup:jsoup 1.14.3
- org.scala-lang.modules:scala-parser-combinators_${scalaBinaryVersion}
- org.scala-lang:scala-library
- org.slf4j:jcl-over-slf4j
- org.slf4j:slf4j-api
- junit:junit test
- org.hamcrest:hamcrest-core test
- org.junit.jupiter:junit-jupiter-api test
- org.junit.jupiter:junit-jupiter-engine test
- org.junit.vintage:junit-vintage-engine test
- org.slf4j:slf4j-simple test
- org.apache.commons:commons-compress
- org.apache.jena:apache-jena-libs
- org.rdfhdt:hdt-java-core 2.1.2
- org.slf4j:slf4j-api
- org.hamcrest:hamcrest-core test
- org.junit.jupiter:junit-jupiter-api test
- org.junit.jupiter:junit-jupiter-engine test
- org.slf4j:slf4j-simple test
- org.apache.jena:apache-jena-libs
- org.getalp:rdf-utils 3.0.11 test
- org.slf4j:slf4j-api test
- org.slf4j:slf4j-simple test
- org.apache.commons:commons-text
- org.slf4j:slf4j-api
- junit:junit test
- org.junit.jupiter:junit-jupiter-api test
- org.junit.jupiter:junit-jupiter-engine test
- org.junit.vintage:junit-vintage-engine test
- org.slf4j:slf4j-simple test
- commons-cli:commons-cli 1.2 compile
- org.apache.commons:commons-compress 1.0
- org.apache.commons:commons-lang3 3.0
- org.apache.httpcomponents:httpclient 4.2.6
- org.apache.jena:apache-jena-libs 2.12.1
- org.getalp.dbnary:ontology 1.5-SNAPSHOT
- org.getalp:org.getalp.lexsema-ontolex-api 1.0-SNAPSHOT
- org.getalp:org.getalp.lexsema-ontolex-dbnary 1.0-SNAPSHOT
- org.getalp:org.getalp.lexsema-similarity 1.0-SNAPSHOT
- org.slf4j:slf4j-api 1.7.7
- org.slf4j:slf4j-simple 1.7.7
- junit:junit 4.9 test
- com.h2database:h2 1.4.177
- com.memetix:microsoft-translator-java-api 0.6.2
- com.wcohen:secondstring 20120620
- commons-cli:commons-cli 1.2
- org.getalp:dbnary-extractor 2.0-SNAPSHOT
- junit:junit 4.9 test
- com.h2database:h2 1.4.177
- com.memetix:microsoft-translator-java-api 0.6.2
- com.wcohen:secondstring 20120620
- commons-cli:commons-cli 1.2
- org.getalp:dbnary-extractor 2.0-SNAPSHOT
- org.jgrapht:jgrapht-core 1.0.1
- org.jgrapht:jgrapht-ext 1.0.1
- junit:junit 4.9 test
- org.junit:junit-bom 5.8.2 import
- com.fasterxml.jackson.core:jackson-databind 2.13.2.1
- com.fasterxml.woodstox:woodstox-core 6.2.8
- com.sun.xml.bind:jaxb-impl 3.0.1
- com.typesafe.scala-logging:scala-logging_2.13 3.9.4
- commons-cli:commons-cli 1.5.0
- commons-io:commons-io 2.11.0
- info.bliki.wiki:bliki-core 3.1.2G
- info.picocli:picocli 4.6.3
- jakarta.xml.bind:jakarta.xml.bind-api 3.0.1
- org.apache.commons:commons-compress 1.21
- org.apache.commons:commons-text 1.9
- org.apache.httpcomponents:httpclient 4.5.13
- org.apache.httpcomponents:httpcore 4.4.15
- org.apache.jena:apache-jena-base 4.5.0
- org.apache.jena:apache-jena-libs 4.5.0
- org.apache.jena:jena-cmds 4.5.0
- org.codehaus.woodstox:stax2-api 4.2.1
- org.scala-lang.modules:scala-parser-combinators_2.13 1.1.2
- org.scala-lang:scala-library 2.13.8
- org.slf4j:jcl-over-slf4j 1.7.36
- org.slf4j:slf4j-api 1.7.36
- org.slf4j:slf4j-simple 1.7.36
- junit:junit 4.13.2 test
- org.hamcrest:hamcrest-core 2.2 test
- com.slack.api:slack-api-client 1.23.0
- commons-cli:commons-cli
- commons-io:commons-io
- org.apache.commons:commons-compress
- org.apache.jena:apache-jena-libs
- org.apache.jena:jena-cmds
- org.getalp:dbnary-commons ${project.version}
- org.slf4j:slf4j-api
- org.slf4j:slf4j-simple
- junit:junit test
- serasset/sparql-jupyterlab latest
- com.fasterxml.jackson.core:jackson-databind
- com.fasterxml.woodstox:woodstox-core
- com.google.protobuf:protobuf-java
- org.apache.commons:commons-compress
- org.apache.jena:apache-jena-libs ${jena.version}
- org.codehaus.woodstox:stax2-api
- org.slf4j:slf4j-api
- org.slf4j:slf4j-simple test