pycldf

python package to read and write CLDF datasets

https://github.com/cldf/pycldf

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 7 committers (28.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords from Contributors

linguistics concepts cross-linguistic-data phylogenetics glottolog
Last synced: 7 months ago · JSON representation

Repository

python package to read and write CLDF datasets

Basic Info
  • Host: GitHub
  • Owner: cldf
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage: https://cldf.clld.org
  • Size: 871 KB
Statistics
  • Stars: 18
  • Watchers: 11
  • Forks: 7
  • Open Issues: 1
  • Releases: 3
Created almost 10 years ago · Last pushed 8 months ago
Metadata Files
Readme Changelog Contributing License

README.md

pycldf

A Python package to read and write CLDF datasets.

Build Status Documentation Status PyPI

Install

Install pycldf from PyPI:

```shell
pip install pycldf
```

Command line usage

Installing the pycldf package will also install a command line interface cldf, which provides some sub-commands to manage CLDF datasets.

Dataset discovery

cldf subcommands support dataset discovery as specified in the standard.

So a typical workflow involving a remote dataset could look as follows.

Create a local directory to which to download the dataset (ideally including version info):

```shell
$ mkdir wacl-1.0.0
```

Accessing CLDF datasets on Zenodo requires installing cldfzenodo (via pip install cldfzenodo). Validating a dataset from Zenodo will implicitly download it, so running

```shell
$ cldf validate https://zenodo.org/record/7322688#rdf:ID=wacl --download-dir wacl-1.0.0/
```

will download the dataset to wacl-1.0.0.

Subsequently we can access the data locally for better performance:

```shell
$ cldf stats wacl-1.0.0/#rdf:ID=wacl
                          value
dc:bibliographicCitation  Her, One-Soon, Harald Hammarström and Marc Allassonnière-Tang. 2022.
dc:conformsTo             http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
dc:identifier             https://wacl.clld.org
dc:license                https://creativecommons.org/licenses/by/4.0/
dc:source                 sources.bib
dc:title                  World Atlas of Classifier Languages
dcat:accessURL            https://github.com/cldf-datasets/wacl
rdf:ID                    wacl
rdf:type                  http://www.w3.org/ns/dcat#Distribution

                Type            Rows
values.csv      ValueTable      3338
parameters.csv  ParameterTable  1
languages.csv   LanguageTable   3338
codes.csv       CodeTable       2
sources.bib     Sources         2000
```

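Dataset locators like `https://zenodo.org/record/7322688#rdf:ID=wacl` combine a download location with a dataset ID in the URL fragment. A minimal sketch of taking such a locator apart (`parse_locator` is a hypothetical helper for illustration, not part of the pycldf API):

```python
from urllib.parse import urldefrag

def parse_locator(locator: str):
    """Split a dataset locator into (base URL/path, dataset ID or None).

    Hypothetical helper illustrating the '#rdf:ID=<id>' convention shown above.
    """
    base, frag = urldefrag(locator)
    if frag.startswith("rdf:ID="):
        return base, frag[len("rdf:ID="):]
    return base, None

print(parse_locator("https://zenodo.org/record/7322688#rdf:ID=wacl"))
# -> ('https://zenodo.org/record/7322688', 'wacl')
```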

Summary statistics

```shell
$ cldf stats tests/data/wordlist_with_cognates/metadata.json
               value
dc:conformsTo  http://cldf.clld.org/v1.0/terms.rdf#Wordlist
dc:source      sources.bib

                 Type             Rows
languages.csv    LanguageTable    2
parameters.csv   ParameterTable   2
forms.csv        FormTable        3
cognates.csv     CognateTable     2
cognatesets.csv  CognatesetTable  1
sources.bib      Sources          1
```

Validation

Arguably the most important functionality of pycldf is validating CLDF datasets.

By default, data files are read in strict-mode, i.e. invalid rows will result in an exception being raised. To validate a data file, it can be read in validating-mode.

For example the following output is generated

```shell
$ cldf validate mydataset/forms.csv
WARNING forms.csv: duplicate primary key: (u'1',)
WARNING forms.csv:4:Source missing source key: Mei2005
```

when reading the file

```
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
```
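The duplicate-primary-key check behind the first warning can be sketched with the standard library's csv module. This is a simplified illustration, not pycldf's actual validation code (which, among other things, also resolves source keys against sources.bib):

```python
import csv
import io

# The file contents from the example above, inlined for a self-contained demo.
data = """\
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
1,stan1295,1277,hand,,,Meier2005[3-7]
2,stan1295,1277,hand,,,Mei2005[3-7]
"""

# Collect IDs that occur more than once - the essence of the primary-key check.
seen, duplicates = set(), []
for row in csv.DictReader(io.StringIO(data)):
    if row["ID"] in seen:
        duplicates.append(row["ID"])
    seen.add(row["ID"])

print(duplicates)  # -> ['1']
```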

Extracting human readable metadata

The information in a CLDF metadata file can be converted to markdown (a human readable markup language) by running

```shell
cldf markdown PATH/TO/metadata.json
```

A typical usage of this feature is to create a README.md for your dataset (which, when uploaded to e.g. GitHub, will be rendered nicely in the browser).

Downloading media listed in a dataset's MediaTable

Typically, CLDF datasets only reference media items. The MediaTable provides enough information, though, to download and save an item's content. This can be done by running

```shell
cldf downloadmedia PATH/TO/metadata.json PATH/TO/DOWNLOAD/DIR
```

To minimize bandwidth usage, relevant items can be filtered by passing selection criteria in the form COLUMN_NAME=SUBSTRING as optional arguments. E.g. downloading could be limited to audio files by passing Media_Type=audio/ (provided Media_Type is the name of the column with propertyUrl http://cldf.clld.org/v1.0/terms.rdf#mediaType).
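The COLUMN_NAME=SUBSTRING selection amounts to simple substring filtering over table rows. A minimal sketch of the idea (`matches` is a hypothetical helper, not the actual CLI implementation):

```python
def matches(row: dict, criteria: list) -> bool:
    """True if every "COLUMN=SUBSTRING" criterion matches the row.

    Hypothetical helper mirroring the filtering behaviour described above.
    """
    for criterion in criteria:
        column, _, substring = criterion.partition("=")
        if substring not in row.get(column, ""):
            return False
    return True

# Toy MediaTable rows; only the audio item passes the filter.
rows = [
    {"ID": "1", "Media_Type": "audio/wav"},
    {"ID": "2", "Media_Type": "image/png"},
]
print([r["ID"] for r in rows if matches(r, ["Media_Type=audio/"])])  # -> ['1']
```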

Converting a CLDF dataset to an SQLite database

A very useful feature of CSVW in general and CLDF in particular is that it provides enough metadata for a set of CSV files to load them into a relational database, including relations between tables. This can be done with the cldf createdb command:

```shell
$ cldf createdb -h
usage: cldf createdb [-h] [--infer-primary-keys] DATASET SQLITE_DB_PATH

Load a CLDF dataset into a SQLite DB

positional arguments:
  DATASET         Dataset specification (i.e. path to a CLDF metadata file or
                  to the data file)
  SQLITE_DB_PATH  Path to the SQLite db file
```

For a specification of the resulting database schema refer to the documentation in src/pycldf/db.py.
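To give an idea of what working with such a database looks like, here is a toy schema in the spirit of what cldf createdb produces (component tables linked by foreign keys), queried with the standard library's sqlite3 module. The actual table and column names are documented in src/pycldf/db.py; treat the names below as illustrative assumptions:

```python
import sqlite3

# Toy schema mimicking the shape of a createdb result: a LanguageTable
# referenced by a ValueTable via a foreign key. Column names are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE LanguageTable (cldf_id TEXT PRIMARY KEY, cldf_name TEXT);
CREATE TABLE ValueTable (
    cldf_id TEXT PRIMARY KEY,
    cldf_languageReference TEXT REFERENCES LanguageTable(cldf_id),
    cldf_value TEXT);
INSERT INTO LanguageTable VALUES ('abc', 'Some Language');
INSERT INTO ValueTable VALUES ('1', 'abc', 'x');
""")

# The metadata-derived foreign keys make joins across components natural.
row = con.execute("""
    SELECT l.cldf_name, v.cldf_value
    FROM ValueTable AS v
    JOIN LanguageTable AS l ON v.cldf_languageReference = l.cldf_id
""").fetchone()
print(row)  # -> ('Some Language', 'x')
```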

Handling large media files

Often, platforms like GitHub impose limits on the size of individual files in a repository. Thus, in order to facilitate curation of datasets with large media files on such platforms, pycldf provides a pragmatic solution as follows:

Running

```shell
cldf splitmedia <dataset-locator>
```

on a dataset will split all media files with sizes bigger than a configurable threshold into multiple files, just like UNIX' split command would. A file named audio.wav will be split into files audio.wav.aa, audio.wav.ab and so on.

> [!CAUTION]
> With large files split (and removed) the dataset will not validate anymore.

In order to restore the files, the corresponding command

```shell
cldf catmedia <dataset-locator>
```

can be used.

Thus, in a typical workflow each commit to the repository would be wrapped in a cldf splitmedia and a cldf catmedia call (possibly automated via git hooks).
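The split/cat round-trip described above is just fixed-size chunking. A minimal pure-Python sketch of the idea, operating on bytes instead of files:

```python
# Illustrative sketch of split/cat chunking; the real commands operate on
# files on disk and use a configurable size threshold.
def split_bytes(data: bytes, chunk_size: int) -> list:
    """Split data into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def cat_bytes(chunks: list) -> bytes:
    """Concatenate chunks back into the original data."""
    return b"".join(chunks)

payload = b"0123456789" * 5          # stand-in for a large media file (50 bytes)
chunks = split_bytes(payload, 16)    # threshold of 16 bytes for illustration
assert cat_bytes(chunks) == payload  # the round-trip is lossless
print(len(chunks))  # -> 4 (16 + 16 + 16 + 2 bytes)
```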

Python API

For a detailed documentation of the Python API, refer to the docs on ReadTheDocs.

Reading CLDF

As an example, we'll read data from WALS Online, v2020:

```python
>>> from pycldf import Dataset
>>> wals2020 = Dataset.from_metadata('https://raw.githubusercontent.com/cldf-datasets/wals/v2020/cldf/StructureDataset-metadata.json')
```

For exploratory purposes, accessing a remote dataset over HTTP is fine. But for real analysis, you'd want to download the datasets first and then access them locally, passing a local file path to Dataset.from_metadata.

Let's look at what we got:

```python
>>> for c in wals2020.components:
...     print(c)
...
ValueTable
ParameterTable
CodeTable
LanguageTable
ExampleTable
```

As expected, we got a [StructureDataset](https://github.com/cldf/cldf/tree/master/modules/StructureDataset), and in addition to the required `ValueTable`, we also have a couple more components.

We can investigate the values using pycldf's ORM functionality, i.e. mapping rows in the CLDF data files to convenient Python objects. (Take note of the limitations described in orm.py, though.)

```python
>>> for value in wals2020.objects('ValueTable'):
...     break
...
>>> value.language.cldf
Namespace(glottocode=None, id='aab', iso639P3code=None, latitude=Decimal('-3.45'), longitude=Decimal('142.95'), macroarea=None, name='Arapesh (Abu)')
>>> value.parameter.cldf
Namespace(description=None, id='81A', name='Order of Subject, Object and Verb')
>>> print(value.references[0].source.bibtex())
@misc{Nekitel-1985,
    olac_field = {syntax; general_linguistics; typology},
    school     = {Australian National University},
    title      = {Sociolinguistic Aspects of Abu', a Papuan Language of the Sepik Area, Papua New Guinea},
    wals_code  = {aab},
    year       = {1985},
    author     = {Nekitel, Otto I. M. S.}
}
```

If performance is important, you can just read rows of data as python dicts, in which case the references between tables must be resolved "by hand":

```python
>>> params = {r['id']: r for r in wals2020.iter_rows('ParameterTable', 'id', 'name')}
>>> for v in wals2020.iter_rows('ValueTable', 'parameterReference'):
...     print(params[v['parameterReference']]['name'])
...     break
...
Order of Subject, Object and Verb
```

Note that we passed names of CLDF terms to Dataset.iter_rows (e.g. id), specifying which columns we want to access by CLDF term rather than by the column names they are mapped to in the dataset.

Writing CLDF

Warning: Writing CLDF with pycldf does not automatically result in valid CLDF! It does, however, produce data that can be checked via cldf validate (see below), so you should always validate after writing.

```python
from pycldf import Wordlist, Source

dataset = Wordlist.in_dir('mydataset')
dataset.add_sources(Source('book', 'Meier2005', author='Hans Meier', year='2005', title='The Book'))
dataset.write(FormTable=[
    {
        'ID': '1',
        'Form': 'word',
        'Language_ID': 'abcd1234',
        'Parameter_ID': '1277',
        'Source': ['Meier2005[3-7]'],
    }])
```

results in

```shell
$ ls -1 mydataset/
forms.csv
sources.bib
Wordlist-metadata.json
```

  • mydataset/forms.csv

```
ID,Language_ID,Parameter_ID,Value,Segments,Comment,Source
1,abcd1234,1277,word,,,Meier2005[3-7]
```

  • mydataset/sources.bib

```bibtex
@book{Meier2005,
    author = {Meier, Hans},
    year = {2005},
    title = {The Book}
}
```

  • mydataset/Wordlist-metadata.json

Advanced writing

To add predefined CLDF components to a dataset, use the add_component method:

```python
from pycldf import StructureDataset, term_uri

dataset = StructureDataset.in_dir('mydataset')
dataset.add_component('ParameterTable')
dataset.write(
    ValueTable=[{'ID': '1', 'Language_ID': 'abc', 'Parameter_ID': '1', 'Value': 'x'}],
    ParameterTable=[{'ID': '1', 'Name': 'Grammatical Feature'}])
```

It is also possible to add generic tables:

```python
dataset.add_table('contributors.csv', term_uri('id'), term_uri('name'))
```

which can also be linked to other tables:

```python
dataset.add_columns('ParameterTable', 'Contributor_ID')
dataset.add_foreign_key('ParameterTable', 'Contributor_ID', 'contributors.csv', 'ID')
```

Addressing tables and columns

Tables in a dataset can be referenced using a Dataset's __getitem__ method, passing
  • a full CLDF Ontology URI for the corresponding component,
  • the local name of the component in the CLDF Ontology,
  • the url of the table.

Columns in a dataset can be referenced using a Dataset's __getitem__ method, passing a tuple (<TABLE>, <COLUMN>), where <TABLE> specifies a table as explained above and <COLUMN> is
  • a full CLDF Ontology URI used as propertyUrl of the column,
  • the name property of the column.

See also https://pycldf.readthedocs.io/en/latest/dataset.html#accessing-schema-objects-components-tables-columns-etc

Object oriented access to CLDF data

The pycldf.orm module implements functionality to access CLDF data via an ORM. See https://pycldf.readthedocs.io/en/latest/orm.html for details.

Accessing CLDF data via SQL

The pycldf.db module implements functionality to load CLDF data into a SQLite database. See https://pycldf.readthedocs.io/en/latest/ext_sql.html for details.

See also

  • https://github.com/frictionlessdata/datapackage-py

Owner

  • Name: Cross-Linguistic Data Formats
  • Login: cldf
  • Kind: organization

GitHub Events

Total
  • Issues event: 19
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 31
  • Push event: 35
  • Pull request review event: 2
  • Pull request event: 4
  • Fork event: 1
  • Create event: 9
Last Year
  • Issues event: 19
  • Watch event: 3
  • Delete event: 2
  • Issue comment event: 31
  • Push event: 35
  • Pull request review event: 2
  • Pull request event: 4
  • Fork event: 1
  • Create event: 9

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 408
  • Total Committers: 7
  • Avg Commits per committer: 58.286
  • Development Distribution Score (DDS): 0.076
Past Year
  • Commits: 39
  • Committers: 1
  • Avg Commits per committer: 39.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
xrotwang x****g@g****m 377
Sebastian Bank s****k@u****e 18
SimonGreenhill s****n@s****z 5
Christoph Rzymski c****h@f****t 4
Gereon Kaiping g****g@h****l 2
Gereon Kaiping a****y@y****e 1
Simon J Greenhill S****l 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 101
  • Total pull requests: 23
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 21 days
  • Total issue authors: 11
  • Total pull request authors: 8
  • Average comments per issue: 2.36
  • Average comments per pull request: 1.65
  • Merged pull requests: 16
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 9
  • Pull requests: 2
  • Average time to close issues: 29 days
  • Average time to close pull requests: 21 minutes
  • Issue authors: 5
  • Pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • xrotwang (66)
  • Anaphory (17)
  • SimonGreenhill (9)
  • chrzyki (3)
  • nataliacp (2)
  • SamPassmore (1)
  • fmatter (1)
  • LinguList (1)
  • FredericBlum (1)
  • JPapir (1)
Pull Request Authors
  • xrotwang (9)
  • SimonGreenhill (4)
  • chrzyki (3)
  • Anaphory (2)
  • fmatter (2)
  • Bibiko (2)
  • marph91 (1)
  • dependabot[bot] (1)
  • johenglisch (1)
Top Labels
Issue Labels
bug (18) enhancement (8) documentation (4) wontfix (1) duplicate (1)
Pull Request Labels
dependencies (1)

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 16,469 last-month
  • Total dependent packages: 30
    (may contain duplicates)
  • Total dependent repositories: 182
    (may contain duplicates)
  • Total versions: 147
  • Total maintainers: 5
pypi.org: pycldf

A python library to read and write CLDF datasets

  • Versions: 103
  • Dependent Packages: 26
  • Dependent Repositories: 168
  • Downloads: 15,437 Last month
Rankings
Dependent packages count: 0.6%
Dependent repos count: 1.2%
Downloads: 5.9%
Average: 7.2%
Forks count: 12.6%
Stargazers count: 15.6%
Maintainers (4)
Last synced: 7 months ago
pypi.org: python-nexus

A nexus (phylogenetics) file reader (.nex, .trees)

  • Versions: 44
  • Dependent Packages: 4
  • Dependent Repositories: 14
  • Downloads: 1,032 Last month
Rankings
Dependent packages count: 1.9%
Dependent repos count: 3.9%
Downloads: 7.2%
Average: 9.3%
Stargazers count: 16.6%
Forks count: 16.9%
Maintainers (2)
Last synced: 7 months ago

Dependencies

docs/requirements.txt pypi
  • sphinx-autodoc-typehints *
  • sphinx-rtd-theme *
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite