https://github.com/ausgerechnet/cwb-vrt

Tools for processing VRT files and CWB import/export


Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: ausgerechnet
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 617 KB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created over 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

cwb-vrt

cwb-vrt is a Python 3 module and command line interface for processing VRT files.

Installation

pip install git+https://github.com/ausgerechnet/cwb-vrt.git

VRT files

VRT files are XML files containing verticalised text and are used as an import (and export) format of the IMS Open Corpus Workbench (CWB). The CWB distinguishes positional attributes (p-atts) on a token-level, which are stored in tab-separated lines, and structural attributes (s-atts) stored in XML elements (i.e. matching pairs of start and end tags).

    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>
    <!-- A Thrilling Experience -->
    <story num="4" title="A Thrilling Experience">
    <p>
    <s>
    Tick	NN	tick
    .	SENT	.
    </s>
    <s>
    A	DT	a
    clock	NN	clock
    .	SENT	.
    </s>
    <s>
    Tick	VB	tick
    ,	,	,
    tick	VB	tick
    .	SENT	.
    </s>
    </p>
    ...
    </story>
    ...

The VRT file above contains three p-atts: by default, the first or primary layer is called word; the other p-atts here would be pos and lemma. Note that the names of p-atts are usually not explicitly encoded in VRT files.
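To make the column layout concrete, the token lines of such a file can be read with a few lines of standard-library Python. This is a minimal sketch and not part of the cwb-vrt API: the function name `iter_tokens` is hypothetical, and the p-att names word, pos and lemma are passed in explicitly because the VRT file does not encode them. It assumes that every line starting with `<` is a tag or comment.

```python
import io

# Assumed p-att names; a VRT file does not state them itself.
P_ATTS = ["word", "pos", "lemma"]


def iter_tokens(lines, p_atts=P_ATTS):
    """Yield one dict per token line, skipping tags, comments and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Simplification: anything starting with "<" is treated as markup.
        if not line or line.startswith("<"):
            continue
        yield dict(zip(p_atts, line.split("\t")))


sample = "<s>\nTick\tNN\ttick\n.\tSENT\t.\n</s>\n"
tokens = list(iter_tokens(io.StringIO(sample)))
for token in tokens:
    print(token)
```

The same generator works on a file object opened with `gzip.open(path, "rt")` for compressed `.vrt.gz` files.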

There are also three s-atts encoded in XML elements: story, p, and s. story has two attribute-value pairs:

    <story num="4" title="A Thrilling Experience">

Note that in the CWB, each attribute is stored separately with its annotation (here: story_num and story_title); story itself is not encoded as an s-att. p and s do not have any annotation. cwb-vrt usually refers to the name of the XML element (e.g. story) as the "level" of the s-att and stores it alongside the other key-value pairs in a dictionary.
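The dictionary representation described here can be illustrated as follows. This is a sketch under assumptions, not the code cwb-vrt uses: `parse_start_tag` and `cwb_names` are hypothetical helpers, attribute values are assumed to be double-quoted, and an attribute that is itself called "level" would clash with the key used for the element name.

```python
import re

# Start tag with zero or more double-quoted attributes, e.g. <story num="4">
TAG_RE = re.compile(r'<(\w+)((?:\s+[\w-]+="[^"]*")*)\s*>')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_start_tag(line):
    """Return {'level': element name, attribute: value, ...} or None."""
    match = TAG_RE.fullmatch(line.strip())
    if match is None:
        return None  # end tag, comment, or token line
    record = {"level": match.group(1)}
    record.update(ATTR_RE.findall(match.group(2)))
    return record


def cwb_names(record):
    """Names under which the annotated attributes would be stored in the CWB."""
    return [f"{record['level']}_{key}" for key in record if key != "level"]


story = parse_start_tag('<story num="4" title="A Thrilling Experience">')
print(story)
print(cwb_names(story))  # ['story_num', 'story_title']
```

For `<p>` and `<s>` the record contains only the level, which matches the statement that these s-atts carry no annotation.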

VRT files for CQPweb

Not all VRT files accepted by the CWB can be used for CQPweb:
  • there has to be an XML element called "text" with unique IDs: <text id="...">
  • only a relatively small number of <text>s are possible (~ 10,000,000)
  • meta data stored in <text> attributes can be used for subcorpus creation and restricted queries, but they have to be marked as categorical in CQPweb, and this only works if all values are valid MySQL identifiers
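The first and third requirement can be checked before import with a short script. The following is only a sketch: `check_texts` is a hypothetical helper, not what vrt-cqpweb does, and "valid MySQL identifier" is assumed here to mean a non-empty string of ASCII letters, digits and underscores, which may be stricter or looser than what CQPweb actually enforces.

```python
import re
from collections import Counter

TEXT_RE = re.compile(r"<text[\s>]")
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')
# Assumption: what counts as a valid identifier for categorical values.
HANDLE_RE = re.compile(r"[A-Za-z0-9_]+")


def check_texts(lines, categorical=()):
    """Report duplicate <text> IDs and invalid values of categorical attributes."""
    ids = []
    invalid = set()
    for line in lines:
        if not TEXT_RE.match(line):
            continue
        attrs = dict(ATTR_RE.findall(line))
        if "id" in attrs:
            ids.append(attrs["id"])
        for key in categorical:
            value = attrs.get(key)
            if value is not None and not HANDLE_RE.fullmatch(value):
                invalid.add((key, value))
    counts = Counter(ids)
    return {
        "n_texts": len(ids),
        "duplicate_ids": sorted(i for i, n in counts.items() if n > 1),
        "invalid_values": sorted(invalid),
    }


# Invented sample lines; "rubrik" is used as the categorical attribute.
lines = [
    '<text id="a1" rubrik="inland">',
    "Hallo\tITJ\thallo",
    "</text>",
    '<text id="a1" rubrik="in land">',
    '<text id="b2" rubrik="sport">',
]
report = check_texts(lines, categorical=["rubrik"])
print(report)
```

A clean file yields empty `duplicate_ids` and `invalid_values` lists; `n_texts` can be compared against the upper bound mentioned above.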

Using the CLI

vrt-cohort: conflate texts according to meta data into cohorts

    vrt-cohort -m tagesschau-mini.vrt.gz -c month rubrik --level-old article --level-new article

vrt-cqpweb: make VRT file compatible with CQPweb

    vrt-cqpweb tagesschau-mini.vrt.gz --level article

vrt-deduplicate: check regions enclosed by level for duplicates

    vrt-deduplicate tagesschau-mini.vrt.gz --level s
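The idea behind a duplicate check on regions can be sketched in plain Python. This is not the implementation of vrt-deduplicate, whose exact behaviour (for instance whether duplicates are removed or only reported) is not documented here. The hypothetical function `find_duplicate_regions` compares the token lines enclosed by each start and end tag of the given level and reports regions whose content has been seen before; it assumes that regions of the same level are not nested.

```python
import re


def find_duplicate_regions(lines, level="s"):
    """Return (first_index, duplicate_index) pairs for regions with identical token lines."""
    start = re.compile(rf"<{re.escape(level)}[\s>]")
    end = f"</{level}>"
    seen = {}        # region content -> index of its first occurrence
    duplicates = []
    region = None    # token lines of the currently open region, if any
    index = 0        # running number of closed regions
    for line in lines:
        line = line.rstrip("\n")
        if start.match(line):
            region = []
        elif line == end and region is not None:
            key = "\n".join(region)
            if key in seen:
                duplicates.append((seen[key], index))
            else:
                seen[key] = index
            index += 1
            region = None
        elif region is not None and not line.startswith("<"):
            region.append(line)
    return duplicates


sample = [
    "<s>", "Tick\tNN\ttick", ".\tSENT\t.", "</s>",
    "<s>", "A\tDT\ta", "clock\tNN\tclock", ".\tSENT\t.", "</s>",
    "<s>", "Tick\tNN\ttick", ".\tSENT\t.", "</s>",
]
duplicates = find_duplicate_regions(sample)
print(duplicates)  # [(0, 2)]
```

For large corpora, storing a hash of each region (e.g. via `hashlib`) instead of the full text would keep memory use lower.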

vrt-index: create CWB import script from VRT file

    vrt-index tagesschau-mini.vrt.gz

vrt-meta: create TSV table of meta data stored in s-atts

    vrt-meta tagesschau-mini.vrt.gz --level article
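What such a table looks like can be shown with the standard library alone. The sketch below is not the vrt-meta implementation: `meta_table` and `to_tsv` are hypothetical helpers, the sample lines are invented, and the attribute names id, month and rubrik are assumptions (month and rubrik appear in the vrt-cohort call above, id does not).

```python
import csv
import io
import re

ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def meta_table(lines, level):
    """Collect the attribute-value pairs of every start tag of the given level."""
    start = re.compile(rf"<{re.escape(level)}[\s>]")
    return [dict(ATTR_RE.findall(line)) for line in lines if start.match(line)]


def to_tsv(rows):
    """Serialise rows as TSV; columns appear in order of first occurrence."""
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample lines for illustration.
lines = [
    '<article id="a1" month="2020-01" rubrik="inland">',
    "Hallo\tITJ\thallo",
    "</article>",
    '<article id="a2" month="2020-02">',
]
rows = meta_table(lines, "article")
tsv = to_tsv(rows)
print(tsv)
```

Missing attributes are written as empty cells, so regions with incomplete meta data remain visible in the table.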

Owner

  • Name: Philipp Heinrich
  • Login: ausgerechnet
  • Kind: user
  • Location: Erlangen
  • Company: @fau-klue

GitHub Events

Total
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 4
Last Year
  • Issues event: 1
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 4

Dependencies

requirements.txt pypi
  • pandas *
  • pytest *
  • tqdm *
setup.py pypi
  • pandas >=1.1.5
  • tqdm >=4.64.0