https://github.com/cdli-gh/conll-merge
Tools for manipulating CoNLL TSV and related formats
Science Score: 33.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to springer.com
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: cdli-gh
- License: apache-2.0
- Language: Java
- Default Branch: master
- Size: 412 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
CoNLL-Merge
Tools for manipulating CoNLL TSV and related formats

- Special focus on processing multi-layer corpora, including annotations with conflicting tokenizations and/or textual variations (CoNLL-Merge)
- Native support for CoNLL-X, CoNLL-U and any other TSV format as previously used for parts of speech, morphosyntactic features, chunking, dependency syntax, named entity annotation, semantic roles, etc. Note that we require tab-separated values; space- or comma-separated files must be converted beforehand.
- Converters from manifold source representations, including
  - Penn Treebank syntax
  - PropBank/NomBank semantic role annotations (standoff)
  - OntoNotes named entity annotations
  - OntoNotes coreference annotations
  - Penn Discourse Graph
  - Penn Discourse Treebank (PDTB 2)
  - RST Discourse Treebank
- For routines for parsing, transforming and manipulating annotation graphs, see our CoNLL-RDF library under https://github.com/acoli-repo/conll-rdf.
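Since only tab-separated input is accepted, space- or comma-separated files need a small preprocessing step. The following is a minimal sketch of such a step, not part of this repository: the class `ToTsv` and its methods are hypothetical names, and it assumes that fields contain no quoted or escaped delimiters. Empty lines (sentence boundaries) and comment lines starting with `#` are passed through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of CoNLL-Merge.
class ToTsv {

    /**
     * Converts one line to tab-separated form.
     * delimiterRegex is e.g. " +" for runs of spaces or "," for commas.
     * Assumption: no field contains a quoted or escaped delimiter.
     */
    static String convertLine(String line, String delimiterRegex) {
        // Keep sentence-separating empty lines and comment lines as they are.
        if (line.trim().isEmpty() || line.startsWith("#")) {
            return line;
        }
        // The limit -1 keeps trailing empty columns instead of dropping them.
        return String.join("\t", line.trim().split(delimiterRegex, -1));
    }

    /** Converts a whole file given as a list of lines. */
    static List<String> convert(List<String> lines, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(convertLine(line, delimiterRegex));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> spaceSeparated = List.of("# sent_id = 1", "1 The DT", "2 dog NN", "");
        for (String line : convert(spaceSeparated, " +")) {
            System.out.println(line);
        }
    }
}
```

For real CSV files with quoted fields, a proper CSV parser would be the safer choice; the regular-expression split above is only adequate for simple column files.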
Usage
- Open Source, Apache license 2.0, see LICENSE
- In scientific publications, please cite both of the following publications:
- Chiarcos, Christian, Julia Ritz, and Manfred Stede (2012), By all these lovely tokens ... Merging conflicting tokenizations. Journal of Language Resources & Evaluation 46(1):53-74.
- Chiarcos, Christian, and Niko Schenk (2018), The ACoLi CoNLL Libraries: Beyond Tab-Separated Values, In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, p.571-576.
- For other forms of usage and redistribution, please refer to https://github.com/acoli-repo/conll and preserve the attached NOTICE file
Subdirectories
- cmd/ converters from different source formats to CoNLL
- src/ java classes for manipulating CoNLL files, in particular a tokenization-independent merging routine
- lib/ jars for src
- data/ sample data for merging various linguistic and semantic annotations of the same text
- data_phil sample data and pipeline for merging multiple versions of the same text (i.e., similar text)
- experimental/ experimental extension, currently containing a partial reimplementation in Python
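To illustrate what merging independently of tokenization can mean, here is a minimal sketch of one simple strategy: take the union of the token boundaries of two tokenizations of the same text, so that every original token can be expressed as a sequence of the resulting units. This is an illustrative assumption, not necessarily what the Java classes in src/ do; the class `BoundaryUnion` and its methods are hypothetical, and whitespace between tokens is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration; not the merging routine shipped in src/.
class BoundaryUnion {

    /** Character offsets at which tokens end, counted without inter-token whitespace. */
    static TreeSet<Integer> boundaries(List<String> tokens) {
        TreeSet<Integer> ends = new TreeSet<>();
        int pos = 0;
        for (String token : tokens) {
            pos += token.length();
            ends.add(pos);
        }
        return ends;
    }

    /** Returns the coarsest segmentation that respects the boundaries of both inputs. */
    static List<String> merge(List<String> a, List<String> b) {
        String text = String.join("", a);
        if (!text.equals(String.join("", b))) {
            throw new IllegalArgumentException("tokenizations cover different text");
        }
        TreeSet<Integer> cuts = boundaries(a);
        cuts.addAll(boundaries(b));
        List<String> units = new ArrayList<>();
        int start = 0;
        for (int end : cuts) {
            if (end > start) {          // skip zero-length units from empty tokens
                units.add(text.substring(start, end));
            }
            start = end;
        }
        return units;
    }

    public static void main(String[] args) {
        List<String> a = List.of("don't", "go");
        List<String> b = List.of("do", "n't", "go");
        System.out.println(merge(a, b));
    }
}
```

Annotations from either layer can then be attached to the shared units, for example by repeating a token-level label on each unit the token covers. The sketch requires the underlying text to be identical in both inputs.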
History
- May 2009 version 0.1 ("PAULA merge") Original implementation of different strategies for merging multi-layer annotations with different tokenizations as described by Chiarcos et al. (2009, 2012), developed at the University of Potsdam, Germany, funded by the DFG Collaborative Research Center 632. Originally, this was part of the PAULA framework, using a standoff XML format. Although it was used locally, processing standoff XML is cumbersome, and the implementation was thus not widely adopted.
- June 2012 version 0.2 ("inline merge") Reimplementation of the merging using an inline XML format, specific to the Penn Treebank and its subcorpora (OntoNotes, PropBank, RST-DTB, PDTB, PDGB, etc.), conducted at the Information Sciences Institute of the University of Southern California (ISI/USC), funded by a DAAD PostDoc stipend. This reimplementation was used for a number of experiments (e.g., Chiarcos 2012), but it was specific to these and never formally released. In particular, dependency annotation was used to convert original span-based annotations to token-level annotations.
- Oct 2016 version 0.3 ("CoNLL merge") Reimplementation of the merging routine using generic CoNLL data structures, conducted at the Applied Computational Linguistics (ACoLi) Lab at Goethe University Frankfurt, supported by the LOEWE cluster "Digital Humanities". The intention of the reimplementation is to separate application-specific and generic aspects of the 0.2 processing pipeline. Application-specific components are to be published at a later stage. The code for the merging and a number of converters is now published under an Apache license via GitHub. The release includes a data sample for an NLP/semantic annotation workflow (wsj0655), which is, however, password protected for reasons of copyright. Contact us to check whether we can give you access; alternatively, ask the LDC for file wsj0655 in PTB3, PDTB2, RST-DTB, PDGB, OntoNotes, PropBank, NomBank.
- Apr 2017 version 0.31 ("CoNLL merge") Code base partially restructured, with backward-compatible functionality and parameters. Added a philological use case (editions of historical texts, see data_phil) and a new flag -lev for Levenshtein-based alignment (to be used when working with similar text rather than different annotations of the same text). This version is documented in Chiarcos and Schenk (2018).
- Oct 2021 (no version change) Partial reimplementation in Python provided for convenience. The Java version remains the default version.
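The Levenshtein-based alignment mentioned for version 0.31 can be illustrated with a generic sketch. The code below is a textbook edit-distance alignment over token sequences with unit costs, written for illustration only: the class `LevAlign` is hypothetical, and whether the -lev flag works on tokens or characters, or uses these costs, is not stated above and is assumed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of Levenshtein alignment; not the -lev implementation.
class LevAlign {

    /**
     * Aligns two token sequences with unit costs for substitution, insertion
     * and deletion. Each result pair is {tokenFromA, tokenFromB}; null marks a gap.
     */
    static List<String[]> align(List<String> a, List<String> b) {
        int n = a.size(), m = b.size();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // delete all of a
        for (int j = 0; j <= m; j++) d[0][j] = j;   // insert all of b
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = d[i - 1][j - 1] + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1);
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j], d[i][j - 1]) + 1);
            }
        }
        // Trace back from the bottom-right cell to recover one optimal alignment.
        List<String[]> pairs = new ArrayList<>();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0
                    && d[i][j] == d[i - 1][j - 1] + (a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1)) {
                pairs.add(new String[] {a.get(i - 1), b.get(j - 1)});  // match or substitution
                i--; j--;
            } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
                pairs.add(new String[] {a.get(i - 1), null});          // token only in a
                i--;
            } else {
                pairs.add(new String[] {null, b.get(j - 1)});          // token only in b
                j--;
            }
        }
        Collections.reverse(pairs);
        return pairs;
    }

    public static void main(String[] args) {
        List<String> editionA = List.of("the", "olde", "king");
        List<String> editionB = List.of("the", "old", "king", "spoke");
        for (String[] pair : align(editionA, editionB)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Unlike a merge of different annotations of the same text, this kind of alignment tolerates spelling variants and missing or added words, which is what makes it suitable for comparing editions of a historical text.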
References
- Chiarcos, C., Ritz, J. & Stede, M. (2009), By all these lovely tokens ... Merging conflicting tokenizations. In: Proceedings of the Third Linguistic Annotation Workshop (LAW-2009), held in conjunction with ACL-IJCNLP 2009, Suntec, Singapore, p. 35-43.
- Chiarcos, C., Ritz, J. & Stede, M. (2012), By all these lovely tokens ... Merging conflicting tokenizations. Journal of Language Resources & Evaluation 46(1):53-74.
- Chiarcos, C. (2012), Towards the unsupervised acquisition of discourse relations, In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Jeju Island, Korea, p. 213-217.
- Chiarcos, C. & Schenk, N. (2018), The ACoLi CoNLL Libraries: Beyond Tab-Separated Values, In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018, p.571-576.
Contributors
- CC - Christian Chiarcos, christian.chiarcos@web.de, Applied Computational Linguistics (ACoLi) lab, Goethe University Frankfurt, Germany
Owner
- Name: CDLI
- Login: cdli-gh
- Kind: organization
- Email: cdli@orinst.ox.ac.uk
- Location: Los Angeles, Oxford, Berlin
- Website: https://cdli.ucla.edu
- Repositories: 83
- Profile: https://github.com/cdli-gh
Committers
Last synced: 12 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| chiarcos | c****s@w****e | 73 |
| chiarcos | c****s@g****m | 2 |
| Lars Willighagen | l****n@g****m | 1 |
| cfaeth | f****h@e****e | 1 |
| lpmi-13 | l****s@g****m | 1 |
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0