clics4

CLDF dataset providing data underlying "CLICS⁴" from 2023

https://github.com/clics/clics4

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: clics
License: cc-by-4.0
Language: TeX
Default Branch: main
Size: 828 MB

Statistics

Stars: 0
Watchers: 5
Forks: 0
Open Issues: 6
Releases: 3

Created over 3 years ago · Last pushed 10 months ago

Metadata Files

Readme License Zenodo

CLICS 4

How to cite

If you use these data, please cite

- the original source:
  Tjuka, Annika; Forkel, Robert; Rzymski, Christoph; and List, Johann-Mattis (2025): CLICS 4: An Improved Database of Cross-Linguistic Colexifications [Dataset, Version 0.5]. Passau: MCL Chair at the University of Passau.
- the derived dataset, using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at https://github.com/clics/clics4

Notes

CLICS 4: Workflow for Data Generation

The CLICS 4 workflow differs slightly from the workflow used for CLICS 3. We have drastically increased the number of datasets, but we also apply stricter selection criteria for the languages to be included. As a result, the number of concepts and the number of language varieties differ from those in CLICS 3.

How to Cite CLICS 4?

If you use the data in your work, make sure to cite the version that you are actually using. For the most recent version, we recommend citing it as follows:

Tjuka, Annika; Forkel, Robert; Rzymski, Christoph; and List, Johann-Mattis (2025): CLICS 4: An Improved Database of Cross-Linguistic Colexifications [Dataset Version 0.5]. Passau: MCL Chair at the University of Passau. https://github.com/clics/clics4/

Since the whole workflow underlying CLICS 4, regardless of the individual versions, will be presented in a freely available publication, we would also appreciate it if you cite this forthcoming paper (already available as a preprint):

Tjuka, Annika; Forkel, Robert; Rzymski, Christoph; and List, Johann-Mattis (forthcoming): Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data. Proceedings of the 16th International Conference on Computational Semantics (IWCS). Düsseldorf: Association for Computational Linguistics. 1-15. Preprint: https://doi.org/10.48550/arXiv.2503.11377

What is New in Comparison with CLICS 3?

The following points summarize major differences between CLICS 3 and CLICS 4:

- more datasets in CLICS 4: CLICS 4 now uses 98 datasets, while CLICS 3 used 30
- fully transcribed data instead of data in orthography: CLICS 4 now uses data fully transcribed to IPA, ignoring all datasets that only offer orthography (this results in fewer languages at times, despite the increase in datasets)
- treatment of concepts: some concept identifiers in Concepticon cover two separate concepts that are frequently colexified and were therefore treated as one single concept; we now model these previously ignored "hidden" colexifications as separate concepts (these are marked in the CLDF representation)
- full-fledged CLDF dataset: we now provide a full-fledged CLDF dataset, in which the concept network is also modeled with the help of CLDF
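To make the notion of a colexification concrete, the following minimal sketch counts, for each pair of concepts, the number of languages in which both concepts are expressed by the same form. It is an illustration only and not the code used to build CLICS 4; the function name `count_colexifications` and the toy data are invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def count_colexifications(forms):
    """Count, per concept pair, the languages that express both with one form.

    `forms` is an iterable of (language, concept, form) triples.
    """
    # Group concepts by (language, form): concepts sharing a key are colexified.
    concepts_by_form = defaultdict(set)
    for language, concept, form in forms:
        concepts_by_form[(language, form)].add(concept)

    # Collect the set of languages attesting each unordered concept pair.
    languages_by_pair = defaultdict(set)
    for (language, _form), concepts in concepts_by_form.items():
        for pair in combinations(sorted(concepts), 2):
            languages_by_pair[pair].add(language)

    return {pair: len(langs) for pair, langs in languages_by_pair.items()}


# Invented toy data (hypothetical languages and forms), for illustration only.
toy_forms = [
    ("lang_a", "ARM", "form1"),
    ("lang_a", "HAND", "form1"),
    ("lang_b", "ARM", "form2"),
    ("lang_b", "HAND", "form2"),
    ("lang_c", "ARM", "form3"),
    ("lang_c", "HAND", "form4"),
]
colexifications = count_colexifications(toy_forms)
print(colexifications)
```

In this toy example, ARM and HAND are colexified in two of the three languages. Comparing forms only makes sense when they are written consistently, which is one reason to rely on fully transcribed data.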

Workflow for Data Aggregation

In the following, we walk those interested in running the workflow that we applied to construct CLICS 4 on their own machines through the individual steps in detail. We assume that users have enough experience with Python to create a fresh virtual environment and to run commands in the terminal.

W1: Install Packages

To install the required packages, download the clics4 package with Git and install it with pip in a fresh virtual environment. The following lines also check out the version that we used in this demo.

$ git clone https://github.com/clics/clics4.git
$ cd clics4
$ git checkout v0.5
$ pip install -e .
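After installation, it can be useful to confirm that the key packages are visible in the active virtual environment. The following sketch uses only the Python standard library; the helper name `installed_versions` is invented here, and the package names are taken from the dependency list of this repository.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Package names as listed in the dependencies of this repository.
report = installed_versions(["cldfbench", "pylexibank", "pyclics"])
for name, found in report.items():
    print(f"{name}: {found or 'not installed'}")
```

If a package is reported as not installed, the most likely cause is that the virtual environment used for the installation is not the one currently activated.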

W2: Download Data

In order to do a fresh download of all the data that we use in CLICS 4, you need to run the following command:

$ cldfbench download lexibank_clics4.py

W3: Create CLICS 4 Dataset

Before you can run the code, make sure that you have downloaded all data and obtained local copies of Glottolog, Concepticon, and CLTS. An easy way to obtain these with the help of cldfbench is to run the command cldfbench catconfig and follow the instructions there. If you use a Windows machine, some additional preparations are needed, so we kindly ask you to follow the respective instructions in Snee (2024).

If you have successfully run the catconfig subcommand, just type:

$ cldfbench lexibank.makecldf --glottolog-version=v5.2.1 --concepticon-version=v3.4.0 --clts-version=v2.3.0 lexibank_clics4.py

Otherwise, specify the explicit locations of the repositories for Glottolog, Concepticon, and CLTS as follows:

$ cldfbench lexibank.makecldf --glottolog-repos=Path2Glottolog --concepticon-repos=Path2Concepticon --clts-repos=Path2CLTS --glottolog-version=v5.2.1 --concepticon-version=v3.4.0 --clts-version=v2.3.0 lexibank_clics4.py
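When the command is run from a script, assembling it programmatically avoids typos in the long option list. The sketch below only builds the argument list from the options shown above; the helper `makecldf_command` and the example path are hypothetical, and nothing is executed.

```python
import shlex

# Catalog versions as given in the commands of this section.
CATALOG_VERSIONS = {
    "glottolog": "v5.2.1",
    "concepticon": "v3.4.0",
    "clts": "v2.3.0",
}


def makecldf_command(repos=None):
    """Build the argument list for `cldfbench lexibank.makecldf`.

    `repos` optionally maps a catalog name to the local path of its repository;
    catalogs without an entry get no explicit --<catalog>-repos option.
    """
    repos = repos or {}
    command = ["cldfbench", "lexibank.makecldf"]
    for catalog in CATALOG_VERSIONS:
        if catalog in repos:
            command.append(f"--{catalog}-repos={repos[catalog]}")
    for catalog, tag in CATALOG_VERSIONS.items():
        command.append(f"--{catalog}-version={tag}")
    command.append("lexibank_clics4.py")
    return command


# Without explicit repository paths (catalogs configured via catconfig).
print(shlex.join(makecldf_command()))
# With a hypothetical local path for one catalog.
print(shlex.join(makecldf_command({"clts": "/path/to/clts"})))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from within the cloned clics4 directory.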

W4: CLLD Version of CLICS 4

We consider the data in this release of CLICS 4 generally good enough to be used in publications, although small errors are always possible when such a large amount of data is aggregated from different sources. However, we emphasize that there are a couple of shortcomings for now that we will try to handle before publishing a new web-based version of CLICS that succeeds the current version 3.0 at https://clics.clld.org. Before publishing this new CLLD version of CLICS 4, we will implement a new representation of the data in order to adhere to the representation of ParameterNetworks in the new CLDF specification.

Statistics

Varieties: 3,432 (linked to 2,152 different Glottocodes)
Concepts: 1,730 (linked to 1,730 different Concepticon concept sets)
Lexemes: 1,445,845
Sources: 95
Synonymy: 1.10
Invalid lexemes: 0
Tokens: 8,120,261
Segments: 2,039 (0 BIPA errors, 0 CLTS sound class errors, 2031 CLTS modified)
Inventory size (avg): 40.78
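A few ratios derived from the figures above help to put them into perspective. The following sketch is plain arithmetic on the reported numbers; the derived measures are not part of the official statistics.

```python
# Figures copied from the statistics listed above.
stats = {
    "varieties": 3432,
    "glottocodes": 2152,
    "lexemes": 1445845,
    "tokens": 8120261,
}

# Average number of lexemes recorded per language variety.
lexemes_per_variety = stats["lexemes"] / stats["varieties"]
# Average number of sound tokens (segments in running forms) per lexeme.
tokens_per_lexeme = stats["tokens"] / stats["lexemes"]
# Average number of varieties linked to the same Glottocode.
varieties_per_glottocode = stats["varieties"] / stats["glottocodes"]

print(f"Lexemes per variety: {lexemes_per_variety:.2f}")
print(f"Tokens per lexeme: {tokens_per_lexeme:.2f}")
print(f"Varieties per Glottocode: {varieties_per_glottocode:.2f}")
```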

Possible Improvements:

Languages linked to bookkeeping languoids in Glottolog:
- Laisaw Thu Htay Kung lait1239
- Songlai-Hettui 8Karchaung (Hettui) song1313
- Songlai-Maung Um (Song) 1Maung Um (Song) song1313
- Laitu (Khuasung) lait1239
- Doitu (Hetsawlay) song1313
- Thaiphum (Rengkheng) thai1262
- Laitu Ahongdong lait1239
- Taungtha (Wethet) rung1263
- Khalaj khal1270

Contributors

Name | GitHub user | Description | Role
--- | --- | --- | ---
Annika Tjuka | @annikatjuka | maintainer | Author
Robert Forkel | @xrotwang | maintainer | Author
Christoph Rzymski | @chrzyki | maintainer | Author
Johann-Mattis List | @LinguList | maintainer | Author

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF Wordlist at cldf/Wordlist-metadata.json
CLDF StructureDataset at cldf/StructureDataset-metadata.json
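A CLDF metadata file is a JSON document that lists the tables of the dataset along with the CLDF component each table conforms to. As a minimal sketch using only the standard library, the following helper (the name `component_files` is invented here) maps component names to the data files declared for them. The example metadata is a small stand-in written to a temporary file, not the actual content of cldf/Wordlist-metadata.json, and the file names in it are assumptions.

```python
import json
import tempfile
from pathlib import Path


def component_files(metadata_path):
    """Map CLDF component names (e.g. 'FormTable') to their declared data files."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    files = {}
    for table in metadata.get("tables", []):
        conforms_to = table.get("dc:conformsTo") or ""
        # Component URIs end in a fragment such as '#FormTable'.
        if "#" in conforms_to:
            files[conforms_to.rsplit("#", 1)[1]] = table["url"]
    return files


# Stand-in metadata with assumed file names, for illustration only.
TERMS = "http://cldf.clld.org/v1.0/terms.rdf#"
example = {
    "dc:conformsTo": TERMS + "Wordlist",
    "tables": [
        {"url": "forms.csv", "dc:conformsTo": TERMS + "FormTable"},
        {"url": "languages.csv", "dc:conformsTo": TERMS + "LanguageTable"},
        {"url": "parameters.csv", "dc:conformsTo": TERMS + "ParameterTable"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Wordlist-metadata.json"
    path.write_text(json.dumps(example), encoding="utf-8")
    found = component_files(path)

print(found)
```

For actual work with the data, the pycldf package listed in the dependencies offers fuller access, for example via `pycldf.Dataset.from_metadata`.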

Owner

Name: CLICS
Login: clics
Kind: organization
Email: clics@lingpy.org

Website: http://clics.clld.org
Repositories: 12
Profile: https://github.com/clics

Database of Cross-Linguistic Colexifications

GitHub Events

Total

Release event: 1
Delete event: 1
Member event: 1
Issue comment event: 7
Push event: 6
Public event: 1
Pull request review event: 3
Pull request event: 3
Create event: 2

Last Year

Release event: 1
Delete event: 1
Member event: 1
Issue comment event: 7
Push event: 6
Public event: 1
Pull request review event: 3
Pull request event: 3
Create event: 2

Dependencies

cldf/requirements.txt pypi

Jinja2 ==3.1.5
Markdown ==3.7
MarkupSafe ==3.0.2
SQLAlchemy ==1.4.54
Unidecode ==1.3.8
appdirs ==1.4.4
arrow ==1.3.0
attrs ==24.3.0
babel ==2.16.0
bibtexparser ==2.0.0b8
bs4 ==0.0.2
certifi ==2024.12.14
chardet ==5.2.0
cldfbench ==1.14.0
cldfcatalog ==1.5.1
cldfviz ==1.3.0
cldfzenodo ==2.1.2
clldutils ==3.24.0
cltoolkit ==0.2.0
colorama ==0.4.6
colorlog ==6.9.0
commonnexus ==1.9.2
csvw ==3.5.1
cycler ==0.12.1
geojson ==3.2.0
gitdb ==4.0.12
greenlet ==3.1.1
idna ==3.10
igraph ==0.11.8
iniconfig ==2.0.0
isodate ==0.7.2
jmespath ==1.0.1
jsonschema ==4.23.0
kiwisolver ==1.4.8
lingpy ==2.6.13
lxml ==5.3.0
matplotlib ==3.10.0
multipledispatch ==1.0.0
nameparser ==1.1.3
networkx ==3.4.2
newick ==1.9.0
numpy ==2.2.1
openpyxl ==3.1.5
packaging ==24.2
pluggy ==1.5.0
pybtex ==0.24.0
pycldf ==1.40.3
pyclics ==3.1.0
pyclts ==3.2.0
pyconcepticon ==3.1.0
pycountry ==24.6.1
pyglottolog ==3.14.0
pylatexenc ==2.10
pylexibank ==3.5.0
pyparsing ==3.2.1
pytest ==8.3.4
python-dateutil ==2.9.0.post0
python-frontmatter ==1.1.0
python-igraph ==0.11.8
rdflib ==7.1.1
referencing ==0.35.1
regex ==2024.11.6
reportlab ==4.2.5
requests ==2.32.3
rfc3986 ==1.5.0
segments ==2.2.1
six ==1.17.0
smmap ==5.0.2
soupsieve ==2.6
tabulate ==0.9.0
termcolor ==2.5.0
texttable ==1.7.0
toyplot ==2.0.0
toytree ==2.0.5
tqdm ==4.67.1
uritemplate ==4.1.1
urllib3 ==2.3.0
xlrd ==2.0.1
zenodoclient ==0.5.1

setup.py pypi

collabutils *
