https://github.com/dcavar/treebankparser

Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary

Keywords

bnf bnfc context-free-grammar lexical-functional-grammar parser penn-treebank probabilistic-context-free-grammar syntax treebank
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dcavar
  • License: apache-2.0
  • Language: C
  • Default Branch: master
  • Homepage: http://damir.cavar.me/
  • Size: 186 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created over 7 years ago · Last pushed over 7 years ago
Metadata Files
Readme License

README.md

TreebankParser

(C) 2016-2018 by Damir Cavar <dcavar@iu.edu>

This code and the binaries are made available under the Apache License, Version 2.0, January 2004. For details see the included LICENSE.txt file.

This is a tool that reads treebank files and generates a probabilistic grammar for use in FLE.

Currently it can generate all context-free grammar rules from a treebank in the Penn Treebank format.
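To illustrate what extracting rules from a bracketed tree involves, here is a minimal sketch in C++11. It is not the code used in treebankparser; the function names (tokenize, countNode, countRules) and the rootLabel parameter are hypothetical and only mirror the behaviour described in this README. Each node contributes one rule: its label on the left-hand side and the labels of its children (or the terminal words) on the right-hand side.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split a bracketed tree into "(", ")" and symbol tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        bool bracket = (c == '(' || c == ')');
        if (bracket || std::isspace(static_cast<unsigned char>(c))) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
            if (bracket) tokens.push_back(std::string(1, c));
        } else {
            current += c;  // bytes of multi-byte UTF-8 symbols are kept together
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Parse the node that starts at tokens[pos], count its rule, return its label.
// A node without a label (here: the empty root) receives rootLabel.
std::string countNode(const std::vector<std::string>& tokens, std::size_t& pos,
                      const std::string& rootLabel,
                      std::map<std::string, long>& counts) {
    assert(pos < tokens.size() && tokens[pos] == "(");
    ++pos;  // consume "("
    std::string label = rootLabel;
    if (pos < tokens.size() && tokens[pos] != "(" && tokens[pos] != ")") {
        label = tokens[pos++];
    }
    std::string rhs;
    while (pos < tokens.size() && tokens[pos] != ")") {
        std::string child = (tokens[pos] == "(")
            ? countNode(tokens, pos, rootLabel, counts)  // non-terminal child
            : tokens[pos++];                             // terminal word
        if (!rhs.empty()) rhs += ' ';
        rhs += child;
    }
    if (pos < tokens.size()) ++pos;  // consume ")"
    if (!rhs.empty()) ++counts[label + " --> " + rhs];
    return label;
}

// Count every rule in a text that contains one or more bracketed trees.
std::map<std::string, long> countRules(const std::string& treebank,
                                       const std::string& rootLabel = "ROOT") {
    std::map<std::string, long> counts;
    std::vector<std::string> tokens = tokenize(treebank);
    std::size_t pos = 0;
    while (pos < tokens.size()) {
        if (tokens[pos] == "(") countNode(tokens, pos, rootLabel, counts);
        else ++pos;  // skip stray tokens outside any tree
    }
    return counts;
}
```

Calling countRules on the text "( (IP (NP (NN a) (NN b)) (VP (VV c))))" with "S" as the root label would count seven rules, among them S --> IP and NP --> NN NN. The std::map keeps the rules sorted by their text, which is one simple way to obtain an alphabetically ordered listing.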

Take for example the test1.txt file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:

./treebankparser -y S test1.txt

The -y S parameter assigns the label S to root nodes that have no label, as in test1.txt. By default, such root nodes are labelled ROOT.

The output should look like this:

1   ADJP --> JJ
1   IP-HLN --> VP
1   JJ --> 重要
1   NN --> 企业
1   NN --> 增长点
1   NN --> 外商
1   NN --> 外贸
1   NN --> 投资
2   NP --> NN
1   NP --> NP
1   NP-OBJ --> NP
1   NP-PN --> NR
1   NP-SBJ --> NN NN NN
1   NR --> 中国
1   S --> IP-HLN
1   VP --> NP-OBJ
1   VV --> 成为

The frequency is separated from the rule by a tab. With the -r parameter, the tool prints relative frequencies as floats instead of absolute counts:

./treebankparser -r -y S test1.txt > res.log

The output should look like:

0.0555556       ADJP --> JJ
0.0555556       IP-HLN --> VP
0.0555556       JJ --> 重要
0.0555556       NN --> 企业
0.0555556       NN --> 增长点
0.0555556       NN --> 外商
0.0555556       NN --> 外贸
0.0555556       NN --> 投资
0.111111        NP --> NN
0.0555556       NP --> NP
0.0555556       NP-OBJ --> NP
0.0555556       NP-PN --> NR
0.0555556       NP-SBJ --> NN NN NN
0.0555556       NR --> 中国
0.0555556       S --> IP-HLN
0.0555556       VP --> NP-OBJ
0.0555556       VV --> 成为

The rules are printed to standard output with either absolute or relative frequencies.
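The relative frequencies in the sample are consistent with dividing each count by the total number of rule occurrences (18 in test1.txt: sixteen rules seen once and one rule seen twice), not with normalising per left-hand-side symbol. That reading is inferred from the numbers shown, not from the source code. A minimal sketch under that assumption, with a hypothetical function name formatRelative:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Format rule counts as "relative-frequency<TAB>rule" lines, where the
// relative frequency is count / (sum of all counts).
std::string formatRelative(const std::map<std::string, long>& counts) {
    long total = 0;
    for (const auto& entry : counts) total += entry.second;
    assert(total > 0 && "need at least one counted rule");

    std::ostringstream out;  // default stream precision: 6 significant digits
    for (const auto& entry : counts) {
        out << static_cast<double>(entry.second) / total
            << '\t' << entry.first << '\n';
    }
    return out.str();
}
```

With the default stream precision, 1/18 is printed as 0.0555556 and 2/18 as 0.111111, matching the values in the listing.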

I am adding more features, e.g.:

  • reloading existing grammars (multi-batch cycles for larger corpus collections)
  • elimination of terminal rules
  • parsing alternative coding formats for syntactic trees or treebanks (e.g. XML, TEI XML)
  • output probabilities for Left-hand-side symbols only, rather than rules
  • generation of a Weighted Finite State Transducer representation, as coded in FLE

If you have ideas or suggestions, let me know.

Prerequisites

The tool is written in C++11 and requires the following libraries:

Compile

Use CLion or otherwise run:

cmake CMakeLists.txt
make

Owner

  • Name: Damir Cavar
  • Login: dcavar
  • Kind: user
  • Location: Bloomington, IN
  • Company: Indiana University

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 5
  • Total Committers: 2
  • Avg Commits per committer: 2.5
  • Development Distribution Score (DDS): 0.2
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Damir Cavar d****r@g****m 4
Damir Cavar d****r@m****m 1
Committer Domains (Top 20 + Academic)
me.com: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0