Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (7.8%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
CLDF: Cross-linguistic Data Formats
Basic Info
- Host: GitHub
- Owner: cysouw
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: http://cldf.clld.org
- Size: 979 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of cldf/cldf
Created over 8 years ago
· Last pushed over 8 years ago
https://github.com/cysouw/cldf/blob/master/
# CLDF: Cross-linguistic Data Formats
**Table of Contents**
* [Conformance Levels](#conformance-levels)
* [Metadata-free conformance](#metadata-free-conformance)
* [Extended conformance](#extended-conformance)
* [CLDF Ontology](#cldf-ontology)
* [CLDF Dataset](#cldf-dataset)
* [CLDF Metadata file](#cldf-metadata-file)
* [CDLF Data files](#cldf-data-files)
* [Sources file](#sources-file)
* [CLDF Modules](#cldf-modules)
* [CLDF Components](#cldf-components)
* [Compatibility](#compatibility)
* [Examples](#examples)
* [Versioning](#versioning)
* [History](#history)
## Conformance Levels
A CLDF dataset is
- a set of UTF-8 encoded CSV files
- described by a [TableGroup](http://w3c.github.io/csvw/metadata/#table-groups) serialized as JSON file
- with a [common property](http://w3c.github.io/csvw/metadata/#dfn-common-property) `dc:conformsTo` having one of the [CLDF module](#modules) URIs as value.
While the [JSON-LD dialect](https://www.w3.org/TR/tabular-metadata/#json-ld-dialect) to be used for metadata according to the [Metadata Vocabulary for Tabular Data](https://www.w3.org/TR/tabular-metadata/) can be edited by hand, this may already be beyond what can be expected by typical linguists. Thus, CLDF specifies two conformance levels for datasets: metadata-free or extended.
### Metadata-free conformance
A dataset can be CLDF conformant without providing a separate metadata description file. To do so, the dataset must *exactly* follow the default specification for the appropriate module regarding:
- file names
- column names (for specified columns)
- CSV dialect
Thus, rather than not *having* any metadata, the dataset does not *specify*
any; and instead it falls back to using the defaults, i.e. "free" as in "beer" not
as in "gluten-free". The CSV file may contain additional columns not specified in
the default module descriptions.
The default file names and column names are described in [`components`](components). The default CSV dialect is [RFC4180](http://tools.ietf.org/html/rfc4180) using the [UTF-8](http://en.wikipedia.org/wiki/UTF-8) character encoding, i.e. use the CSV dialect specified by
```
{
"encoding": "utf-8",
"lineTerminators": ["\r\n", "\n"],
"quoteChar": "\"",
"doubleQuote": true,
"skipRows": 0,
"commentPrefix": "#",
"header": true,
"headerRowCount": 1,
"delimiter": ",",
"skipColumns": 0,
"skipBlankRows": false,
"skipInitialSpace": false,
"trim": false
}
```
Some of the effects of this metadata-free conformance are:
- The first line of each file must contain the comma-separated list of column names.
- No comment lines are allowed.
### Extended conformance
A dataset is CLDF conformant if it uses a custom metadata file, derived from the default profile for the appropriate module, possibly overriding/customizing:
- the CSV [dialect description](http://w3c.github.io/csvw/metadata/#dialect-descriptions) (possibly per table), e.g. to:
- allow comment lines (if appropriately prefixed with [`commentPrefix`](http://w3c.github.io/csvw/metadata/#dialect-commentPrefix))
- omit a header line (if appropriately indicated by `"header": false`)
- use tab-separated data files (if appropriately indicated by `"delimiter": ","`)
- the table property [url](http://w3c.github.io/csvw/metadata/#tables)
- the column property [titles](http://w3c.github.io/csvw/metadata/#columns)
- the inherited column properties
- [default](http://w3c.github.io/csvw/metadata/#cell-default)
- [null](http://w3c.github.io/csvw/metadata/#cell-null)
- [separator](http://w3c.github.io/csvw/metadata/#cell-separator)
- adding common properties,
- adding [foreign keys](#foreign-keys), to specify relations between tables of the dataset.
Thus, using extended conformance via metadata, a dataset may
- use tab-separated data files,
- use non-standard file names,
- use non-standard column names,
- add metadata describing attribution and provenance of the data,
- specify [relations between multiple tables](http://w3c.github.io/csvw/metadata/#common-properties) in a dataset,
- supply default values for required columns like `Language_ID`, using [virtual columns](http://w3c.github.io/csvw/metadata/#use-of-virtual-columns).
In particular, since the metadata description resides in a separate file, it is often possible to retrofit existing CSV files into the CLDF framework by adding a metadata description.
## CLDF Ontology
CLDF data uses terms from the [CLDF Ontology](http://cldf.clld.org/v1.0/terms.rdf), as specified in the file `terms.rdf`, to mark [`TableGroup`](http://w3c.github.io/csvw/metadata/#table-groups) or [`Table`](http://w3c.github.io/csvw/metadata/#tables) objects which have special meaning within the CLDF framework.
The CLDF Ontology also provides a set of [properties](http://cldf.clld.org/v1.0/terms.rdf#properties) to attach semantics to individual columns. While many of these properties are similar (or identical) to properties defined elsewhere - most notably in the [General Ontology for Linguistic Description - GOLD](http://linguistics-ontology.org/) - we opted for inclusion to avoid ambiguity, but made sure to reference the related related properties in the Ontology.
Note that the column *names* in the default table descriptions (e.g. [`formTable`](components/forms)) are not always the same as the column *properties*. Each column has both a `csvw:name` and a separate `propertyURL` linking the column to the ontology. Each property also has a `rdf:label` which might also be different.
An more easily readable list of all properties is available in the file [`properties.md`](properties.md). Please note that this file is just for easier reference, but is not normative: in case of discrepancy, the description in `terms.rdf` holds.
Note: For better human readability the [CLDF Ontology](http://cldf.clld.org/v1.0/terms.rdf) should
be visited with a browser capable of rendering XSLT - such as Firefox.
## CLDF Dataset
### CLDF Metadata file
A CLDF dataset is described with metadata provided as JSON file following the [Metadata Vocabulary for Tabular Data](https://www.w3.org/TR/tabular-metadata/). To make tooling simpler, we restrict the metadata specification as follows:
- Metadata files must specify a `tables` property on top-level, i.e. must describe a
[`TableGroup`](http://w3c.github.io/csvw/metadata/#table-groups). While this adds a
bit of verbosity to the metadata description, it makes it possible to describe mutiple
tables in one metadata file.
- The common property `dc:conformsTo` of the `TableGroup` is used to indicate the
CLDF module, e.g.
`"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist"`
- The common property `dc:conformsTo` of a `Table` is used to associate tables with
a particular role in a CLDF module using appropriate classes from the
[CLDF Ontology](http://cldf.clld.org/v1.0/terms.rdf).
- If each row in the data file corresponds to a resource on the web (i.e. a resource
identified by a dereferenceable HTTP URI), the `tableSchema` property should provide an
`aboutUrl` property.
- If individual cells in a row correspond to resources on the web, the corresponding
column specification should provide a `valueUrl` property.
Each dataset should provide a dataset distribution description using the [DCAT vocabulary](http://www.w3.org/TR/vocab-dcat/#class-distribution). This will make it easy to [catalog](http://www.w3.org/TR/vocab-dcat/#class-catalog) cross-linguistic datasets. In particular, each dataset description should include properties
- `dc:bibliographicCitation` and
- `dc:license`.
Thus, an example for a CLDF dataset description could look as follows:
```
{
"@context": "http://www.w3.org/ns/csvw",
"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
"dc:title": "The Dataset",
"dc:bibliographicCitation": "Cite me like this!",
"dc:license": "http://creativecommons.org/licenses/by/4.0/",
"null": "?",
"tables": [
{
"url": "ds1.csv",
"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
"tableSchema": {
"columns": [
{
"name": "ID",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"
},
{
"name": "Language_ID",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference",
"valueUrl": "http://glottolog.org/resource/languoid/id/{Language_ID}"
},
{
"name": "Parameter_ID",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"
},
{
"name": "Value",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"
},
{
"name": "Comment",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#comment"
},
{
"name": "Source",
"datatype": "string",
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#source"
},
{
"name": "Glottocode",
"virtual": true,
"propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#glottocode",
"valueUrl": "{Language_ID}"
},
],
"aboutUrl": "http://example.org/valuesets/{ID}",
"primaryKey": "ID"
}
}
]
}
```
### CLDF Data files
While it is possible to add any kind of CSV files to a CLDF dataset, the CLDF standard
recognizes (and attaches specified semantics) to tables described with a common property `dc:conformsTo` with one of the [table type](#cldf-components) URIs of the [CLDF ontology](http://cldf.cld.org/v1.0/terms.rdf) as value.
Additionally, CLDF semantics can be assigned to individual columns by
assigning one of the property URIs defined in the
[CLDF ontology](http://cldf.cld.org/v1.0/terms.rdf) as `propertyUrl`.
Note: CLDF column properties are assumed to have a complete row (or rather the
entity a row stores data about) as scope; e.g. a [source column](#column-source)
is assumed to provide source information for any piece of data in the row.
Thus, each property can be used only once per table, which makes processing simpler.
#### Identifier
Each CLDF data table should contain a column which uniquely identifies a row in
the table. This column must be marked using
- a `propertyUrl` of `http://cldf.cld.org/v1.0/terms.rdf#id`
- the column name `ID` in the case of metadata-free conformance.
To allow usage of identifiers as path components of URIs and ensure they are
portable across systems, identifiers must be composed of
[alphanumeric characters](https://en.wikipedia.org/wiki/Alphanumeric),
underscore `_` and hyphen `-` only, i.e. match the regular expression
`[a-zA-Z0-9\-_]+` (see [rfc3986](https://tools.ietf.org/html/rfc3986#section-2.3)).
Following our design goal to reference rather than duplicate data, identifiers
may be used to reference existing entities (e.g. Glottolog languages, WALS features,
etc.). This can be done as follows:
- If the identifier can be interpreted as links to other entities, e.g.
using the WALS three-letted language codes to identify languages, this should be
indicated by assigning the column an appropriate `valueUrl` property, e.g.
`http://wals.info/languoid/lect/wals_code_{ID}`
- If the identifier follows a specified identification scheme, e.g. ISO 639-3 for
languages, this can be indicated by adding [a virtual column](http://w3c.github.io/csvw/metadata/#x5-6-1-1-use-of-virtual-columns) with a suitable `propertyUrl`
to the table's list of columns.
#### Sources
Considering that any single step in collecting (cross-)linguistic data involves some
amount of analysis and judgement calls, it is essential to make it easy to trace
assertions back to their source.
Each CLDF data table may contain a column listing sources for the data asserted in the
row. This column must be marked using
- a `propertyUrl` of `http://cldf.cld.org/v1.0/terms.rdf#source`
- the column name `Source` in the case of metadata-free conformance.
Sources are specified as semicolon-separated source specifications, of the form
*source_ID[source context]*, e.g. *meier2015[3-12]* where *meier2015* is a citation key in the accompanying [sources file](#sources).
#### Foreign keys
Often cross-linguistic data is [relational](https://en.wikipedia.org/wiki/Relational_model), e.g. *cognate judgements* group *forms* into *cognate sets*, creating a [many-to-many relationship](https://en.wikipedia.org/wiki/Many-to-many_(data_model)) between a `FormTable` and a `CognatesetTable`.
To make such relations explicit, the CLDF Ontology provides a set of
[reference properties](http://cldf.cld.org/v1.0/terms.rdf#reference-properties).
Reference properties are interpreted as *optional* foreign key, i.e.
- if a `table1.csv` makes reference to a `table2.csv`, and both are part of the dataset, then mentioning the `ID` from table2 in a column of table1 (typically using the column-name `table2_ID`) is sufficient as a reference, and this is implicit equivalent to a [foreignKeys](http://w3c.github.io/csvw/metadata/#schema-foreignKeys) property of `table1.csv`:
```
"columns": [
"name": "table2_ID",
...
]
"foreignKeys": [
{
"columnReference": "table2_ID",
"reference": {
"resource": "table2.csv",
"columnReference": "ID"
}
}
]
```
- otherwise values in the column are interpreted as identifiers of the referenced
entities (in which case the actual entities can only be resolved by context
or via additonal `valueUrl` properties on the column).
### Sources reference file
References to sources - if not referenced by Glottolog ID - can be supplied as part of a CLDF dataset as an UTF-8 encoded BibTeX file (with the citation keys serving as local Source IDs). The filename of this BibTeX file must be either:
- `sources.bib` in case of metadata-free conformance
- or specified as top-level common property `dc:source` in the dataset's metadata.
## CLDF Modules
Much like
[Dublin Core Application Profiles](http://dublinco.org/documents/profile-guidelines/),
CLDF Modules group terms of the CLDF Ontology into tables.
Thus, CLDF module specifications are recommendations for groups
of tables modeling typical cross-linguistic datatypes. Currently, the CLDF
specification recognizes the following modules:
- [Wordlist](modules/Wordlist/)
- [Structure dataset](modules/StructureDataset/)
- [Dictionary](modules/Dictionary/)
- [Parallel text](modules/ParallelText)
In addition, a CLDF dataset can be specified as
[*Generic*](http://cldf.cld.org/v1.0/terms.rdf#Generic), imposing no requirements
on tables or columns. Thus, *Generic* datasets are a way to evolve new data types
(to become recognized modules), while already providing (generic) tool support.
In the CLDF Ontology [modules](http://cldf.clld.org/v1.0/terms.rdf#modules) are modeled
as subclasses of [`dcat:Distribution`](http://www.w3.org/ns/dcat#Distribution), thus
additional metadata as recommended in the
[DCAT specification](https://www.w3.org/TR/vocab-dcat/#class-distribution) should be
provided.
For each type of CLDF dataset there is a *CLDF module*, i.e. a default metadata profile
describing the required tables, columns and datatypes.
*metadata-free conformance* means data files will be read as if they were accompanied by
the corresponding default metadata.
## CLDF Components
Some types of cross-linguistic data may be part of different CLDF modules. These
types are specified as *components* in a way that can be re-used across modules (typically as [table descriptions](http://w3c.github.io/csvw/metadata/#tables), which can be appended to the `tables` property of a module's metadata).
- [Language metadata](components/languages/)
- [Parameter metadata](components/parameters/)
- [Values](components/values) - as defined for a [`StructureDataset`](modules/StructureDataset/)
- [Codes](components/codes/)
- [Entries](components/entries)
- [Senses](components/senses)
- [Examples](components/examples/)
- [Forms](components/forms) - as defined for a [`Wordlist`](modules/Wordlist/)
- [Cognates](components/cognates/)
- [CognateSets](components/cognatesets)
- [Borrowings](components/borrowings/)
- [Functional Equivalents](components/functionalequivalents)
- [Functional Equivalents Sets](components/functionalequivalentssets)
A component corresponds to a certain type of data. Thus, to make sure all instances of
such a type have the same set of properties, we allow at most one component for each type
in a CLDF dataset.
## Examples
To stipulate further discussion and help experiments with tools, some examples of CLDF datasets are available in the [examples directory](examples/).
## Compatibility
- Using UTF-8 as character encoding means editing these files with MS Excel is not completely trivial, because Excel assumes cp1252 as default character encoding - Libre Office Calc on the other hand handles these files just fine.
- The tool support for csv files is getting better and better due to the hype around "data science". Some particularly useful tools are
- [csvkit](https://csvkit.readthedocs.org/en/stable/)
- [q - Text as Data](http://harelba.github.io/q/)
## Versioning
Changes to the CLDF specification will be released as new versions, using
a [Semantic Versioning](http://semver.org/) number scheme. While older versions
can be accessed via [releases of this repository](releases) or from
[ZENODO](https://zenodo.org), where releases will be archived, the latest
released version is also reflected in the `master` branch of this repository,
i.e. whatever you see navigating the directory tree at [https://github.com/glottobank/cldf](https://github.com/glottobank/cldf/tree/master)
reflects the latest released version of the specification.
## History
Work on this proposal for a cross-linguistic data format was triggered by the [LANCLID 2 workshop](http://www.eva.mpg.de/linguistics/conferences/2014-ws-lanclid2/index.html) held in April 2015 in Leipzig -
in particular by Harald Hammarstrm's presentation [A Proposal for Data Interface Formats for Cross-Linguistic Data](https://github.com/clld/lanclid2/blob/master/presentations/hammarstrom.pdf).
Owner
- Name: Michael Cysouw
- Login: cysouw
- Kind: user
- Location: Marburg, Germany
- Company: Philipps-Universität Marburg
- Website: cysouw.de/home
- Repositories: 39
- Profile: https://github.com/cysouw