cwb-ccc

Python wrapper for the CWB to extract concordances and score frequency lists

https://github.com/ausgerechnet/cwb-ccc

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 7 months ago · JSON representation ·

Repository

Python wrapper for the CWB to extract concordances and score frequency lists

Basic Info
  • Host: GitHub
  • Owner: ausgerechnet
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 5.46 MB
Statistics
  • Stars: 22
  • Watchers: 4
  • Forks: 5
  • Open Issues: 11
  • Releases: 26
Created about 6 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

Collocation and Concordance Computation

Build Build PyPI version PyPI Downloads License Imports: association-measures

cwb-ccc is a Python 3 wrapper around the IMS Open Corpus Workbench (CWB). Main purpose of the module is to run queries (including queries with more than two anchor points), extract concordance lines, and score frequency lists (particularly to extract collocates and keywords).

The Quickstart here gives a rough overview. For a more detailed dive into the functionality, see the Vignette.

Installation

System requirements: The module is developed for Ubuntu (currently 24.04 LTS) but also runs on other Debian-based systems and MacOS. On a fresh install of Ubuntu, you will need to install the following packages: sudo apt install libncurses5-dev libglib2.0-dev libpcre3 libpcre3-dev

CWB: The module needs a working installation of CWB and operates on CWB-indexed corpora. If you want to run queries with more than two anchor points, you will need CWB version 3.4.16 or later. We recommend installing the 3.5.x package.

On Ubuntu, you will also need to install the corresponding cwb-dev package: wget https://sourceforge.net/projects/cwb/files/cwb/cwb-3.5/deb/cwb_3.5.0-1_amd64.deb wget https://sourceforge.net/projects/cwb/files/cwb/cwb-3.5/deb/cwb-dev_3.5.0-1_amd64.deb sudo apt install ./cwb_3.5.0-1_amd64.deb sudo apt install ./cwb-dev_3.5.0-1_amd64.deb

On MacOS, you can simply brew install cwb3

Python dependencies: Python dependencies are specified in requirements.txt and will be installed automatically if you follow the instructions below. Note that since version v0.13.0, cwb-ccc uses pandas2 and numpy2, which requires Python 3.9 or above.

In all cases, we recommend installing dependencies in a virtual environment to avoid conflicts with other installs on your machine.

Installation using pip: You can install cwb-ccc with pip from PyPI: python3 -m pip install cwb-ccc

Installation from source: You can also clone the source from github, cd in the respective folder, install all dependencies python3 -m pip install -U pip setuptools wheel twine python3 -m pip install -r requirements-dev.txt compile the C-extension python3 -m cython -2 ccc/cl.pyx and build it python3 setup.py bdist_ext --inplace

Quickstart

Accessing Corpora

To list all available corpora, you can use python from ccc import Corpora corpora = Corpora(registry_dir="/usr/local/share/cwb/registry/")

Most functionality is tied to the Corpus class, which establishes the connection to your CWB-indexed corpus: python from ccc import Corpus corpus = Corpus(corpus_name="GERMAPARL1386", registry_dir="tests/corpora/registry/") This will raise a KeyError if the named corpus is not in the specified registry.

Queries and SubCorpora

The usual starting point is to run a query with corpus.query(). This method accepts valid CQP queries such as python subcorpus = corpus.query('[lemma="Arbeit"]', context_break='s')

The result is a SubCorpus; at its core this is a pandas DataFrame with corpus positions (similar to CWB dumps of NQRs).

You can also query structural attributes, e.g. python corpus.query(s_query='text_party', s_values={'CDU', 'CSU'})

Concordancing

You can access concordance lines via the concordance() method of subcorpora. This method returns a DataFrame with information about the query matches in context:

subcorpus.concordance()

| *match* | *matchend* | word | |--------:|-----------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------| | 151 | 151 | Er brachte diese Erfahrung in seine Arbeit im Ausschuß für Familie , Senioren , Frauen und Jugend sowie im Petitionsausschuß ein , wo er sich vor allem | | 227 | 227 | Seine Arbeit und sein Rat werden uns fehlen . | | 1493 | 1493 | Ausschuß für Arbeit und Sozialordnung | | 1555 | 1555 | Ausschuß für Arbeit und Sozialordnung | | 1598 | 1598 | Ausschuß für Arbeit und Sozialordnung | | ... | ... | ... | | | | |

By default, it retrieves concordance lines in simple format in the order in which they appear in the corpus. In most situations it is more useful to get random concordance lines in KWIC formatting:

subcorpus.concordance(form='kwic', order='random')

| *match* | *matchend* | left\_word | node\_word | right\_word | |--------:|-----------:|:----------------------------------------------------------------------------------------------------------------------------------------------|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------| | 81769 | 81769 | Ich unterstütze daher nachträglich die Forderung , daß die Durchführung des Gesetzes auch künftig durch die Bundesanstalt für | Arbeit | vorgenommen wird ; denn beim Bund gibt es die entsprechend ausgebildeten Sachbearbeiter . | | 8774 | 8774 | Glauben Sie im Ernst , Sie könnten am Ende ein Bündnis für | Arbeit | , eine Wende in der deutschen Politik , die Bekämpfung der Arbeitslosigkeit erreichen , wenn Sie nicht die Länder , | | 8994 | 8994 | alle Entscheidungen gemeinsam zu treffen , die sich gegen Schwarzarbeit und illegale | Arbeit | wenden , und gemeinsam nach einem Weg zu suchen , | | 80098 | 80098 | : Was der Vermittlungsausschuß mit Mehrheit zum Meister-BAföG beschlossen hat , heißt , daß die bewährten Institutionen der Bundesanstalt für | Arbeit | , die die Ausbildungsförderung für Meister bis zum Jahr 1993 durchgeführt haben , die darin große Erfahrung haben , die | | 61056 | 61056 | Selbst wenn Sie ein Konstrukt anbieten , das tendenziell die zusätzliche Belastung der Bundesanstalt für | Arbeit | etwas geringer hielte als die Entlastung bei der gesetzlichen Rentenversicherung , so wäre dies bei einem deutlichen Aufwuchs der Arbeitslosigkeit | | ... | ... | ... | ... | ... | | | | | | |

Use cut_off to specify the maximum number of lines.

Collocation Analyses

After executing a query, you can use subcorpus.collocates() to extract collocates (see the vignette for parameter settings). The result is a DataFrame with lemmata as index and frequency signatures and association measures as columns:

subcorpus.collocates()

| *item* | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z\_score | t\_score | log\_likelihood | simple\_ll | min\_sensitivity | liddell | dice | log\_ratio | conservative\_log\_ratio | mutual\_information | local\_mutual\_information | ipm | ipm\_reference | ipm\_expected | in\_nodes | marginal | |:-------|----:|----:|-----:|-------:|----:|-------:|-----:|-------:|-------:|--------:|--------:|--------:|-------:|---------:|---------:|----------------:|-----------:|-----------------:|---------:|---------:|-----------:|-------------------------:|--------------------:|---------------------------:|--------:|---------------:|--------------:|----------:|---------:| | für | 46 | 730 | 831 | 148102 | 776 | 148933 | 877 | 148832 | 149709 | 4.54583 | 771.454 | 872.454 | 148061 | 19.4429 | 6.11208 | 134.301 | 130.019 | 0.052452 | 0.047547 | 0.055656 | 3.40925 | 2.26335 | 1.00514 | 46.2366 | 59278.4 | 5579.69 | 5858.03 | 0 | 877 | | , | 43 | 733 | 7827 | 141106 | 776 | 148933 | 7870 | 141839 | 149709 | 40.7933 | 735.207 | 7829.21 | 141104 | 0.345505 | 0.336523 | 0.124564 | 0.117278 | 0.005464 | 0.000296 | 0.009947 | 0.076412 | 0 | 0.02288 | 0.983836 | 55412.4 | 52553.8 | 52568.6 | 0 | 7870 | | . | 33 | 743 | 5626 | 143307 | 776 | 148933 | 5659 | 144050 | 149709 | 29.3328 | 746.667 | 5629.67 | 143303 | 0.677108 | 0.638378 | 0.461005 | 0.440481 | 0.005831 | 0.000673 | 0.010256 | 0.170891 | 0 | 0.05116 | 1.68829 | 42525.8 | 37775.4 | 37800 | 0 | 5659 | | und | 32 | 744 | 2848 | 146085 | 776 | 148933 | 2880 | 146829 | 149709 | 14.9282 | 761.072 | 2865.07 | 146068 | 4.41852 | 3.0179 | 15.1452 | 14.6555 | 0.011111 | 0.006044 | 0.017505 | 1.10866 | 0 | 0.331144 | 10.5966 | 41237.1 | 19122.7 | 19237.3 | 0 | 2880 | | in | 24 | 752 | 2474 | 146459 | 776 | 148933 | 2498 | 147211 | 149709 | 12.9481 | 763.052 | 2485.05 | 146448 | 3.07138 | 2.25596 | 7.72813 | 7.51722 | 0.009608 | 0.004499 | 0.014661 | 0.896724 | 0 | 0.268005 | 6.43212 | 30927.8 | 16611.5 | 16685.7 | 0 | 2498 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Setting p_query allows calculating scores for arbitrary combinations of positional attributes, e.g. p_query=['lemma', 'pos']. The dataframe contains the observed counts in contingency notation and is annotated with all available association measures from the pandas-association-measures package (parameter ams).

Keyword Analyses

Having created a subcorpus python subcorpus = corpus.query(s_query='text_party', s_values={'CDU', 'CSU'}) you can use its keywords() method for retrieving keywords:

subcorpus.keywords(order='conservative_log_ratio')

| *item* | O11 | O12 | O21 | O22 | R1 | R2 | C1 | C2 | N | E11 | E12 | E21 | E22 | z\_score | t\_score | log\_likelihood | simple\_ll | min\_sensitivity | liddell | dice | log\_ratio | conservative\_log\_ratio | mutual\_information | local\_mutual\_information | ipm | ipm\_reference | ipm\_expected | |:-----------|----:|------:|-----:|-------:|------:|-------:|-----:|-------:|-------:|--------:|--------:|--------:|-------:|---------:|---------:|----------------:|-----------:|-----------------:|---------:|---------:|-----------:|-------------------------:|--------------------:|---------------------------:|--------:|---------------:|--------------:| | deswegen | 55 | 41296 | 37 | 108412 | 41351 | 108449 | 92 | 149708 | 149800 | 25.3958 | 41325.6 | 66.6042 | 108382 | 5.87452 | 3.99183 | 41.5308 | 25.794 | 0.00133 | 0.321982 | 0.002654 | 1.96293 | 0.404166 | 0.335601 | 18.458 | 1330.08 | 341.174 | 614.152 | | CSU | 255 | 41096 | 380 | 108069 | 41351 | 108449 | 635 | 149165 | 149800 | 175.286 | 41175.7 | 459.714 | 107989 | 6.02087 | 4.99187 | 46.6543 | 31.7425 | 0.006167 | 0.126068 | 0.012147 | 0.81552 | 0.212301 | 0.162792 | 41.512 | 6166.72 | 3503.95 | 4238.99 | | CDU | 260 | 41091 | 390 | 108059 | 41351 | 108449 | 650 | 149150 | 149800 | 179.427 | 41171.6 | 470.573 | 107978 | 6.01515 | 4.99693 | 46.6055 | 31.7289 | 0.006288 | 0.124499 | 0.012381 | 0.80606 | 0.209511 | 0.161086 | 41.8823 | 6287.64 | 3596.16 | 4339.12 | | in | 867 | 40484 | 1631 | 106818 | 41351 | 108449 | 2498 | 147302 | 149800 | 689.551 | 40661.4 | 1808.45 | 106641 | 6.75755 | 6.02647 | 61.2663 | 42.1849 | 0.020967 | 0.072241 | 0.039545 | 0.47937 | 0.168901 | 0.099452 | 86.2253 | 20966.8 | 15039.3 | 16675.6 | | Wirtschaft | 39 | 41312 | 25 | 108424 | 41351 | 108449 | 64 | 149736 | 149800 | 17.6666 | 41333.3 | 46.3334 | 108403 | 5.07554 | 3.41607 | 30.9328 | 19.1002 | 0.000943 | 0.333476 | 0.001883 | 2.03257 | 0.150982 | 0.34391 | 13.4125 | 943.145 | 230.523 | 427.236 | | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

Just as with collocates, the result is a DataFrame with lemmata as index and frequency signatures and association measures as columns.

Testing

The module ships with a small test corpus ("GERMAPARL1386"), which contains all speeches of the 86th session of the 13th German Bundestag on Feburary 8, 1996. python corpus = Corpus("GERMAPARL1386", registry_dir="tests/corpora/registry/") This corpus consists of 149,800 tokens in 7332 paragraphs (s-attribute "p" with annotation "type" ("regular" or "interjection")) split into 11,364 sentences (s-attribute "s"). The p-attributes are "pos" and "lemma":

corpus.available_attributes()

| type | attribute | annotation | active | |:-------|:---------------------------|:-------------|:---------| | p-Att | word | False | True | | p-Att | pos | False | False | | p-Att | lemma | False | False | | s-Att | corpus | False | False | | s-Att | corpus\_name | True | False | | s-Att | sitzung | False | False | | s-Att | sitzung\_date | True | False | | s-Att | sitzung\_period | True | False | | s-Att | sitzung\_session | True | False | | s-Att | div | False | False | | s-Att | div\_desc | True | False | | s-Att | div\_n | True | False | | s-Att | div\_type | True | False | | s-Att | div\_what | True | False | | s-Att | text | False | False | | s-Att | text\_id | True | False | | s-Att | text\_name | True | False | | s-Att | text\_parliamentary\_group | True | False | | s-Att | text\_party | True | False | | s-Att | text\_position | True | False | | s-Att | text\_role | True | False | | s-Att | text\_who | True | False | | s-Att | p | False | False | | s-Att | p\_type | True | False | | s-Att | s | False | False |

The corpus is located in this repository. All tests are written using this corpus as well as some reference counts and scores obtained from the UCS toolkit and some additional frequency lists. Make sure you install all development dependencies (especially pytest). You can then pytest -m "not benchmark" pytest -m benchmark pytest --cov-report term-missing -v --cov=ccc/

Acknowledgements

Owner

  • Name: Philipp Heinrich
  • Login: ausgerechnet
  • Kind: user
  • Location: Erlangen
  • Company: @fau-klue

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Heinrich"
  given-names: "Philipp"
  orcid: "https://orcid.org/0000-0002-4785-9205"
title: "cwb-ccc"
version: 0.13.0
date-released: 2025-05-02
url: "https://github.com/ausgerechnet/cwb-ccc"

GitHub Events

Total
  • Create event: 6
  • Release event: 3
  • Issues event: 8
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 31
  • Pull request event: 8
Last Year
  • Create event: 6
  • Release event: 3
  • Issues event: 8
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 31
  • Pull request event: 8

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 508
  • Total Committers: 3
  • Avg Commits per committer: 169.333
  • Development Distribution Score (DDS): 0.006
Top Committers
Name Email Commits
Philipp Heinrich p****h@f****e 505
dependabot[bot] 4****]@u****m 2
Stephanie Evert e****n@S****l 1
Committer Domains (Top 20 + Academic)
fau.de: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 31
  • Total pull requests: 54
  • Average time to close issues: 7 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 11
  • Total pull request authors: 3
  • Average comments per issue: 0.97
  • Average comments per pull request: 0.15
  • Merged pull requests: 47
  • Bot issues: 0
  • Bot pull requests: 9
Past Year
  • Issues: 3
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 13 minutes
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ausgerechnet (20)
  • debajyotibd (2)
  • peteruhrig (1)
  • EspinosaLeal (1)
  • nfdykes (1)
  • iulusoy (1)
  • debajyotin16 (1)
  • cometbridge1998 (1)
  • anlinguist (1)
  • julesbouton (1)
  • fussballlinguist (1)
Pull Request Authors
  • ausgerechnet (46)
  • dependabot[bot] (9)
  • schtepf (1)
Top Labels
Issue Labels
enhancement (8) bug (2) documentation (1)
Pull Request Labels
dependencies (9)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 20 last-month
  • Total docker downloads: 8
  • Total dependent packages: 2
  • Total dependent repositories: 2
  • Total versions: 34
  • Total maintainers: 1
pypi.org: cwb-ccc

CWB wrapper to extract concordances and score frequency lists

  • Versions: 34
  • Dependent Packages: 2
  • Dependent Repositories: 2
  • Downloads: 20 Last month
  • Docker Downloads: 8
Rankings
Dependent packages count: 3.2%
Docker downloads count: 4.3%
Dependent repos count: 11.5%
Average: 13.0%
Stargazers count: 13.4%
Forks count: 15.4%
Downloads: 30.1%
Maintainers (1)
Last synced: 8 months ago

Dependencies

.github/workflows/build-test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
Dockerfile docker
  • ubuntu 20.04 build
requirements.txt pypi
  • alabaster ==0.7.13
  • association-measures ==0.2.6
  • astroid ==2.11.7
  • attrs ==22.2.0
  • babel ==2.11.0
  • bleach ==6.0.0
  • bottleneck ==1.3.6
  • certifi ==2022.12.7
  • cffi ==1.15.1
  • charset-normalizer ==3.0.1
  • colorama ==0.4.6
  • coverage ==7.1.0
  • cryptography ==39.0.0
  • cython ==0.29.30
  • dill ==0.3.6
  • docutils ==0.18.1
  • enthought-sphinx-theme ==0.7.1
  • exceptiongroup ==1.1.0
  • idna ==3.4
  • imagesize ==1.4.1
  • importlib-metadata ==6.0.0
  • iniconfig ==2.0.0
  • isort ==5.11.4
  • jaraco.classes ==3.2.3
  • jeepney ==0.8.0
  • jinja2 ==3.1.2
  • keyring ==23.13.1
  • lazy-object-proxy ==1.9.0
  • markupsafe ==2.1.2
  • mccabe ==0.7.0
  • more-itertools ==9.0.0
  • numexpr ==2.8.4
  • numpy ==1.24.1
  • packaging ==23.0
  • pandas ==1.5.3
  • pkginfo ==1.9.6
  • platformdirs ==2.6.2
  • pluggy ==1.0.0
  • py-cpuinfo ==9.0.0
  • pycparser ==2.21
  • pygments ==2.14.0
  • pylint ==2.13.9
  • pyperclip ==1.8.2
  • pytest ==7.2.0
  • pytest-benchmark ==4.0.0
  • pytest-cov ==3.0.0
  • python-dateutil ==2.8.2
  • pytz ==2022.7.1
  • pyyaml ==6.0
  • readme-renderer ==37.3
  • requests ==2.28.2
  • requests-toolbelt ==0.10.1
  • rfc3986 ==2.0.0
  • scipy ==1.10.0
  • secretstorage ==3.3.3
  • setuptools ==65.5.1
  • six ==1.16.0
  • snowballstemmer ==2.2.0
  • sphinx ==5.0.0
  • sphinxcontrib-applehelp ==1.0.4
  • sphinxcontrib-devhelp ==1.0.2
  • sphinxcontrib-htmlhelp ==2.0.0
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.3
  • sphinxcontrib-serializinghtml ==1.1.5
  • tabulate ==0.8.9
  • tomli ==2.0.1
  • tqdm ==4.64.1
  • twine ==3.7.1
  • typing-extensions ==4.4.0
  • unidecode ==1.3.6
  • urllib3 ==1.26.14
  • webencodings ==0.5.1
  • wheel ==0.38.4
  • wrapt ==1.14.1
  • zipp ==3.11.0
Pipfile pypi
  • cython ==0.29.30 develop
  • enthought-sphinx-theme ==0.7.1 develop
  • pylint ==2.13.9 develop
  • pyperclip ==1.8.2 develop
  • pytest ==7.0.1 develop
  • pytest-benchmark 3.4.1 develop
  • pytest-cov ==3.0.0 develop
  • setuptools ==59.6.0 develop
  • sphinx ==5.0.0 develop
  • tabulate ==0.8.9 develop
  • twine ==3.7.1 develop
  • Bottleneck >=1.3.4
  • association-measures >=0.2.4
  • numexpr >=2.7.1
  • pandas >=1.1.5
  • pyyaml >=6.0
  • unidecode >=1.3.4
  • wheel >=0.37.1
setup.py pypi
  • Bottleneck >=1.3.4
  • association-measures >=0.2.4
  • numexpr >=2.7.1
  • pandas >=1.1.5
  • pyyaml >=6.0
  • unidecode >=1.3.4
  • wheel >=0.37.1
pyproject.toml pypi