https://github.com/cdli-gh/conll-merge

Tools for manipulating CoNLL TSV and related formats

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: springer.com
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: cdli-gh
  • License: apache-2.0
  • Language: Java
  • Default Branch: master
  • Size: 412 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of acoli-repo/conll-merge
Created almost 4 years ago · Last pushed almost 4 years ago

CoNLL-Merge

Tools for manipulating CoNLL TSV and related formats

  • Special focus on processing multi-layer corpora, including annotations with conflicting tokenizations and/or textual variations (CoNLL-Merge)
  • Native support for CoNLL-X, CoNLL-U and any other TSV format as previously used for parts of speech, morphosyntactic features, chunking, dependency syntax, named entity annotation, semantic roles, etc. Note that we require tab-separated values. Space- or comma-separated files must be converted beforehand.
  • Converters from a variety of source representations, including
    • Penn Treebank syntax
    • PropBank/NomBank semantic role annotations (standoff)
    • OntoNotes named entity annotations
    • OntoNotes coreference
    • Penn Discourse Graph
    • Penn Discourse Treebank (PDTB 2)
    • RST Discourse Treebank
  • For routines for parsing, transforming and manipulating annotation graphs, see our CoNLL-RDF library at https://github.com/acoli-repo/conll-rdf.
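
Since only tab-separated input is accepted, space-separated files need a preprocessing step. The following is a minimal sketch of such a step, not part of this repository: the class name SpaceToTsv and its methods are hypothetical, and it assumes that no column value itself contains whitespace and that lines starting with "#" are comments (as in CoNLL-U).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of conll-merge: turns space-separated
// CoNLL-style lines into tab-separated ones.
public class SpaceToTsv {

    /** Converts a single line; assumes column values contain no whitespace. */
    public static String toTsvLine(String line) {
        String trimmed = line.trim();
        // Blank lines separate sentences in CoNLL files; emit them as empty lines.
        if (trimmed.isEmpty()) {
            return "";
        }
        // Comment lines are passed through unchanged.
        if (trimmed.startsWith("#")) {
            return line;
        }
        // Collapse every run of whitespace between columns into one tab.
        return String.join("\t", trimmed.split("\\s+"));
    }

    /** Converts a whole file given as a list of lines. */
    public static List<String> toTsv(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(toTsvLine(line));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "# sent_id = 1",
                "1  The   DT",
                "2  dog   NN",
                "");
        for (String line : toTsv(sample)) {
            System.out.println(line);
        }
    }
}
```

Comma-separated input would need a proper CSV parser instead, because quoted fields may contain commas.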

Usage

  • Open Source, Apache license 2.0, see LICENSE
  • In scientific publications, please cite both of the following publications:
    • Chiarcos, Christian, Julia Ritz, and Manfred Stede (2012), By all these lovely tokens ... Merging conflicting tokenizations. Journal of Language Resources & Evaluation 46(1):53-74.
    • Chiarcos, Christian, and Niko Schenk (2018), The ACoLi CoNLL Libraries: Beyond Tab-Separated Values, In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, p.571-576.
  • For other forms of usage and redistribution, please refer to https://github.com/acoli-repo/conll and preserve the attached NOTICE file

Subdirectories

  • cmd/ converters from different source formats to CoNLL
  • src/ java classes for manipulating CoNLL files, in particular a tokenization-independent merging routine
  • lib/ jars for src
  • data/ sample data for merging various linguistic and semantic annotations of the same text
  • data_phil sample data and pipeline for merging multiple versions of the same text (i.e., similar text)
  • experimental/ experimental extension, currently containing a partial reimplementation in Python

History

  • May 2009 version 0.1 ("PAULA merge") Original implementation of different strategies for merging multi-layer annotations with different tokenizations as described by Chiarcos et al. (2009, 2012), developed at the University of Potsdam, Germany, funded by the DFG Collaborative Research Center 632. Originally, this was part of the PAULA framework, which uses a standoff XML format. Although it was used locally, processing standoff XML is cumbersome, so the implementation was not widely adopted.
  • June 2012 version 0.2 ("inline merge") Reimplementation of the merging using an inline XML format, specific to the Penn Treebank and its subcorpora (OntoNotes, PropBank, RST-DTB, PDTB, PDGB, etc.), conducted at the Information Sciences Institute of the University of Southern California (ISI/USC), funded by a DAAD PostDoc stipend. This reimplementation was used for a number of experiments (e.g., Chiarcos 2012), but it was specific to these and never formally released. In particular, dependency annotation was used to convert original span-based annotation to token-level annotations.
  • Oct 2016 version 0.3 ("CoNLL merge") Reimplementation of the merging routine using generic CoNLL data structures, conducted at the Applied Computational Linguistics (ACoLi) Lab at Goethe University Frankfurt, supported by the LOEWE cluster "Digital Humanities". The intention of the reimplementation is to separate application-specific and generic aspects of the 0.2 processing pipeline. Application-specific components are to be published at a later stage. The code for the merging and a number of converters is now published under an Apache license via GitHub. The release includes a data sample for an NLP/semantic annotation workflow (wsj0655), which is, however, password protected for reasons of copyright. Contact us to check whether we can give you access; alternatively, ask the LDC for file wsj0655 in PTB3, PDTB2, RST-DTB, PDGB, OntoNotes, PropBank, NomBank.
  • Apr 2017 version 0.31 ("CoNLL merge") Code base partially restructured, with backward-compatible functionality and parameters. Added a philological use case (editions of historical texts, see data_phil): new flag -lev for Levenshtein-based alignment (to be used when working with similar text rather than different annotations of the same text). This version is documented in Chiarcos and Schenk (2018).
  • Oct 2021 (no version change) Partial reimplementation in Python provided for convenience. The Java version remains the default version.
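
The Levenshtein-based alignment mentioned for version 0.31 builds on edit distance between similar texts. How the -lev flag is implemented is not described here, so the following is only a generic textbook sketch of token-level Levenshtein distance; the class name TokenLevenshtein and the choice of unit costs are assumptions made for illustration.

```java
// Hypothetical illustration, not the conll-merge implementation:
// classic dynamic-programming Levenshtein distance over token sequences.
public class TokenLevenshtein {

    /** Returns the minimum number of token insertions, deletions and substitutions. */
    public static int distance(String[] a, String[] b) {
        int[][] d = new int[a.length + 1][b.length + 1];
        // Turning a prefix of a into the empty sequence takes one deletion per token.
        for (int i = 0; i <= a.length; i++) {
            d[i][0] = i;
        }
        // Turning the empty sequence into a prefix of b takes one insertion per token.
        for (int j = 0; j <= b.length; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                // Identical tokens align at no cost; different tokens cost one substitution.
                int substitution = a[i - 1].equals(b[j - 1]) ? 0 : 1;
                int viaDeletion = d[i - 1][j] + 1;
                int viaInsertion = d[i][j - 1] + 1;
                int viaSubstitution = d[i - 1][j - 1] + substitution;
                d[i][j] = Math.min(Math.min(viaDeletion, viaInsertion), viaSubstitution);
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // Two made-up variants of the same sentence, as in two editions of a text.
        String[] editionA = {"the", "olde", "kinge"};
        String[] editionB = {"the", "old", "king", "spoke"};
        System.out.println(distance(editionA, editionB));
    }
}
```

An actual alignment would additionally trace back through the table to recover which tokens correspond to each other; the sketch stops at the distance to stay short.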

References

Contributors

  • CC - Christian Chiarcos, christian.chiarcos@web.de, Applied Computational Linguistics (ACoLi) lab, Goethe University Frankfurt, Germany

Owner

  • Name: CDLI
  • Login: cdli-gh
  • Kind: organization
  • Email: cdli@orinst.ox.ac.uk
  • Location: Los Angeles, Oxford, Berlin

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 78
  • Total Committers: 5
  • Avg Commits per committer: 15.6
  • Development Distribution Score (DDS): 0.064
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
chiarcos c****s@w****e 73
chiarcos c****s@g****m 2
Lars Willighagen l****n@g****m 1
cfaeth f****h@e****e 1
lpmi-13 l****s@g****m 1

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0