https://github.com/cldf/cldfbench
Tooling to create CLDF datasets from existing data
Science Score: 46.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ✓ Committers with academic emails: 4 of 11 committers (36.4%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.9%) to scientific vocabulary
Keywords from Contributors
Repository
Tooling to create CLDF datasets from existing data
Basic Info
Statistics
- Stars: 10
- Watchers: 6
- Forks: 5
- Open Issues: 5
- Releases: 1
Metadata Files
README.md
cldfbench
Tooling to create CLDF datasets from existing data.
Overview
This package provides tools to curate cross-linguistic data, with the goal of packaging it as CLDF datasets.
In particular, it supports a workflow where:
- "raw" source data is downloaded to a raw/ subdirectory,
- and subsequently converted to one or more CLDF datasets in a cldf/ subdirectory, with the help of:
- configuration data in an etc/ directory and
- custom Python code (a subclass of cldfbench.Dataset which implements the workflow actions).
This workflow is supported via:
- a commandline interface cldfbench which calls the workflow actions as subcommands,
- a cldfbench.Dataset base class, which must be subclassed in a custom module
to hook custom code into the workflow.
With this workflow and the separation of the data into three directories we want
to provide a workbench for transparently deriving CLDF data from data that has been
published before. In particular we want to delineate clearly:
- what forms part of the original or source data (raw),
- what kind of information is added by the curators of the CLDF dataset (etc)
- and what data was derived using the workbench (cldf).
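The raw/etc/cldf separation can be illustrated with a minimal, stdlib-only sketch (the file names and column names here are hypothetical examples, not part of the cldfbench API):

```python
import csv
import pathlib

def derive_cldf(base: pathlib.Path) -> pathlib.Path:
    """Toy derivation: combine raw/ source data with etc/ curator
    configuration and write the derived result to cldf/."""
    # Curator-supplied mapping from source language names to codes (etc/).
    with open(base / "etc" / "languages.csv", newline="", encoding="utf8") as f:
        codes = {r["name"]: r["code"] for r in csv.DictReader(f)}
    # Source data exactly as downloaded (raw/).
    with open(base / "raw" / "data.csv", newline="", encoding="utf8") as f:
        rows = list(csv.DictReader(f))
    # Derived data (cldf/) - regenerable from raw/ + etc/ at any time.
    out = base / "cldf" / "values.csv"
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", newline="", encoding="utf8") as f:
        w = csv.DictWriter(f, fieldnames=["Language_ID", "Value"])
        w.writeheader()
        for r in rows:
            w.writerow({"Language_ID": codes[r["language"]], "Value": r["value"]})
    return out
```

Here etc/languages.csv plays the role of the curator-added configuration, while cldf/values.csv is fully derived and never edited by hand.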
Further reading
This paper introduces cldfbench and uses an extended, real-world example:
Forkel, R., & List, J.-M. (2020). CLDFBench: Give your cross-linguistic data a lift. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6995-7002). Paris: European Language Resources Association (ELRA). [PDF]
Installation
cldfbench can be installed via pip - preferably in a
virtual environment - by running:
```shell
pip install cldfbench
```
cldfbench provides some functionality that relies on python
packages which are not needed for the core functionality. These are specified as extras and can be installed using syntax like:
```shell
pip install cldfbench[<extras>]
```
where <extras> is a comma-separated list of names from the following list:
- excel: support for reading spreadsheet data.
- glottolog: support to access Glottolog data.
- concepticon: support to access Concepticon data.
- clts: support to access CLTS data.
The command line interface cldfbench
Installing the python package will also install a command cldfbench available on
the command line:
```shell script
$ cldfbench -h
usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ...

optional arguments:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level ERROR|WARN|INFO|DEBUG

available commands:
  Run "COMMAND -h" to get help for a specific command.

  COMMAND
    check               Run generic CLDF checks
    ...
```
As shown above, run cldfbench -h to get help, and cldfbench COMMAND -h to get
help on individual subcommands, e.g. cldfbench new -h to read about the usage
of the new subcommand.
Dataset discovery
Most cldfbench commands operate on an existing dataset (unlike new, which
creates a new one). Datasets can be discovered in two ways:
- Via the python module (i.e. the *.py file containing the Dataset subclass). To use this mode of discovery, pass the path to the python module as DATASET argument when required by a command.
- Via entry point and dataset ID. To use this mode, specify the name of the entry point as value of the --entry-point option (or use the default name cldfbench.dataset) and the Dataset.id as DATASET argument.
Discovery via entry point is particularly useful for commands that can operate
on multiple datasets. To select all datasets advertising a given entry point,
pass "_" (i.e. an underscore) as DATASET argument.
Workflow
For a full example of the cldfbench curation workflow, see the tutorial.
Creating a skeleton for a new dataset directory
A directory containing stub entries for a dataset can be created by running

```bash
cldfbench new
```
This will create the following layout (where <ID> stands for the chosen dataset ID):
<ID>/
├── cldf # A stub directory for the CLDF data
│ └── README.md
├── cldfbench_<ID>.py # The python module, providing the Dataset subclass
├── etc # A stub directory for the configuration data
│ └── README.md
├── metadata.json # The metadata provided to the new subcommand, serialized as JSON
├── raw # A stub directory for the raw data
│ └── README.md
├── setup.cfg # Python setup config, providing defaults for test integration
├── setup.py # Python setup file, making the dataset "installable"
├── test.py # The python code to run for dataset validation
└── .github # Integrate the validation with GitHub actions
Implementing CLDF creation
cldfbench provides tools to make CLDF creation simple. Still, each dataset is
different, and so each dataset will have to provide its own custom code to do so.
This custom code goes into the cmd_makecldf method of the Dataset subclass in
the dataset's python module.
(See also the API documentation of cldfbench.Dataset.)
Typically, this code will make use of one or more
cldfbench.CLDFSpec instances, which describe what kind of CLDF to create. A CLDFSpec also gives access to a
cldfbench.CLDFWriter instance, which wraps a pycldf.Dataset.
The main interfaces to these objects are:
- cldfbench.Dataset.cldf_specs: a method returning specifications of all CLDF datasets
that are created by the dataset,
- cldfbench.Dataset.cldf_writer: a method returning an initialized CLDFWriter
associated with a particular CLDFSpec.
cldfbench supports several scenarios of CLDF creation:
- The typical use case is turning raw data into a single CLDF dataset. This would
require instantiating one CLDFWriter in the cmd_makecldf method, and
the defaults of CLDFSpec will probably be ok. Since this is the most common and
simplest case, it is supported with some extra "sugar": the initialized CLDFWriter
is available as args.writer when cmd_makecldf is called.
- But it is also possible to create multiple CLDF datasets:
- For a dataset containing both lexical and typological data, it may be appropriate
to create a Wordlist and a StructureDataset. To do so, one would have to
call cldf_writer twice, passing in an appropriate CLDFSpec. Note that if
both CLDF datasets are created in the same directory, they can share the
LanguageTable - but would have to specify distinct file names for the
ParameterTable, passing distinct values to CLDFSpec.data_fnames.
- When creating multiple datasets of the same CLDF module, e.g. to split a large dataset into smaller chunks, care must be taken to also disambiguate the name
of the metadata file, passing distinct values to CLDFSpec.metadata_fname.
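The typical single-dataset case can be sketched as follows. This is a sketch only: the guarded import keeps the snippet self-contained when cldfbench is not installed, and the raw file name and column names are hypothetical:

```python
import pathlib

try:
    from cldfbench import Dataset  # real base class when cldfbench is installed
except ImportError:
    Dataset = object  # stub so this sketch stays self-contained

class MyDataset(Dataset):
    id = "mydataset"
    dir = pathlib.Path(".")  # conventionally pathlib.Path(__file__).parent

    def cmd_makecldf(self, args):
        # args.writer is the CLDFWriter initialized from the default CLDFSpec.
        for row in self.raw_dir.read_csv("data.csv", dicts=True):  # hypothetical raw file
            args.writer.objects["ValueTable"].append({
                "ID": row["id"],            # column names are examples only
                "Language_ID": row["language"],
                "Parameter_ID": row["feature"],
                "Value": row["value"],
            })
```

Running cldfbench makecldf on this module would invoke cmd_makecldf and serialize the collected objects as CLDF in the cldf/ directory.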
When creating CLDF, it is also often useful to have standard reference catalogs
accessible, in particular Glottolog. See the section on Catalogs for
a description of how this is supported by cldfbench.
Catalogs
Linking data to reference catalogs is a major goal of CLDF, thus cldfbench
provides tools to make catalog access and maintenance easier. Catalog data must be
accessible in local clones of the data repository. cldfbench provides commands:
- catconfig to create the clones and make them known through a configuration file,
- catinfo to get an overview of the installed catalogs and their versions,
- catupdate to update local clones from the upstream repositories.
See:
- https://cldfbench.readthedocs.io/en/latest/catalogs.html
for a list of reference catalogs which are currently supported in cldfbench.
Note: Cloning glottolog/glottolog - due to the deeply nested directories of the language classification - results in long path names. On Windows this may require disabling the maximum path length limitation.
Curating a dataset on GitHub
One of the design goals of CLDF was to specify a data format that plays well with version control. Thus, it's natural - and actually recommended - to curate a CLDF dataset in a version controlled repository. The most popular way to do this in a collaborative fashion is by using a git repository hosted on GitHub.
The directory layout supported by cldfbench caters to this use case in several ways:
- Each directory contains a file README.md, which will be rendered as human readable
description when browsing the repository at GitHub.
- The file .travis.yml contains the configuration for hooking up a repository with
Travis CI, to provide continuous consistency checking
of the data.
Archiving a dataset with Zenodo
Curating a dataset on GitHub also provides a simple way to archive and publish released versions of the data. You can hook up your repository with Zenodo (following this guide). Then, Zenodo will pick up any released package, assign a DOI to it, archive it, and make it accessible in the long term.
Some notes:
- Hook-up with Zenodo requires the repository to be public (not private).
- You should consider using an institutional account on GitHub and Zenodo to associate the repository with. Currently, only the user account registering a repository on Zenodo can change any metadata of releases later on.
- Once released and archived with Zenodo, it's a good idea to add the DOI assigned by Zenodo to the release description on GitHub.
- To make sure a release is picked up by Zenodo, the version number must start with a letter, e.g. "v1.0" - not "1.0".
Thus, with a setup as described here, you can make sure you create FAIR data.
Extending cldfbench
cldfbench can be extended or built upon in various ways - typically by customizing core functionality in new python packages. To support particular types of raw data, you might want a custom Dataset class; to support a particular type of CLDF data, you would customize CLDFWriter.
In addition to extending cldfbench using the standard methods of object-oriented programming, there are two more ways of extending cldfbench: commands and dataset templates. Both are implemented using entry points, so packages which provide custom commands or dataset templates must declare these in metadata that is made known to other Python packages (in particular the cldfbench package) upon installation.
Commands
A python package (or a dataset) can provide additional subcommands to be run from cldfbench.
For more info, see the commands README.
Custom dataset templates
A python package can provide alternative dataset templates to be run with cldfbench new.
Such templates are implemented by:
- a subclass of cldfbench.Template,
- which is advertised using an entry point cldfbench.scaffold:
```python
entry_points={
    'cldfbench.scaffold': [
        'template_name=mypackage.scaffold:DerivedTemplate',
    ],
},
```
Owner
- Name: Cross-Linguistic Data Formats
- Login: cldf
- Kind: organization
- Website: https://cldf.clld.org
- Repositories: 15
- Profile: https://github.com/cldf
GitHub Events
Total
- Issues event: 5
- Watch event: 1
- Delete event: 1
- Issue comment event: 12
- Push event: 7
- Pull request event: 1
- Pull request review event: 1
- Create event: 3
Last Year
- Issues event: 5
- Watch event: 1
- Delete event: 1
- Issue comment event: 12
- Push event: 7
- Pull request event: 1
- Pull request review event: 1
- Create event: 3
Committers
Last synced: about 3 years ago
All Time
- Total Commits: 223
- Total Committers: 11
- Avg Commits per committer: 20.273
- Development Distribution Score (DDS): 0.143
Top Committers
| Name | Email | Commits |
|---|---|---|
| xrotwang | x****g@g****m | 191 |
| Hans-Jörg Bibiko | b****o@s****e | 14 |
| Johannes Englisch | e****h@s****e | 4 |
| bambooforest@gmail.com | b****t@g****m | 3 |
| chrzyki | c****h@f****t | 3 |
| Simon J Greenhill | S****l@u****m | 2 |
| Hans-Jörg Bibiko | h****o@e****e | 2 |
| Johannes Englisch | j****h@t****e | 1 |
| lingulist | m****t@l****g | 1 |
| Tiago Tresoldi | t****i@s****e | 1 |
| Tiago Tresoldi | t****i@l****e | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 74
- Total pull requests: 23
- Average time to close issues: 2 months
- Average time to close pull requests: 4 days
- Total issue authors: 11
- Total pull request authors: 8
- Average comments per issue: 1.36
- Average comments per pull request: 0.61
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 1
- Average time to close issues: about 20 hours
- Average time to close pull requests: about 19 hours
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- xrotwang (48)
- SimonGreenhill (6)
- martino-vic (4)
- chrzyki (4)
- johenglisch (3)
- SamPassmore (2)
- tresoldi (2)
- LinguList (2)
- PromiseDodzi (1)
- Bibiko (1)
- Anaphory (1)
Pull Request Authors
- johenglisch (8)
- Bibiko (4)
- SimonGreenhill (3)
- chrzyki (3)
- bambooforest (2)
- tresoldi (2)
- LinguList (1)
- xrotwang (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads (pypi): 1,882 last month
- Total dependent packages: 9
- Total dependent repositories: 148
- Total versions: 30
- Total maintainers: 3
pypi.org: cldfbench
Python library implementing a CLDF workbench
- Homepage: https://github.com/cldf/cldfbench
- Documentation: https://cldfbench.readthedocs.io/
- License: Apache 2.0
- Latest release: 1.14.1 (published about 1 year ago)
Rankings
Dependencies
- sphinx-autodoc-typehints *
- sphinx-rtd-theme *
- actions/checkout v3 composite
- actions/setup-python v3 composite