https://github.com/dcavar/treebankparser
Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: dcavar
- License: apache-2.0
- Language: C
- Default Branch: master
- Homepage: http://damir.cavar.me/
- Size: 186 KB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
TreebankParser
(C) 2016-2018 by Damir Cavar <dcavar@iu.edu>
This code and the binaries are made available under the Apache License, Version 2.0, January 2004. For details see the included LICENSE.txt file.
This is a tool that reads treebank files and generates a probabilistic grammar for use in FLE.
Currently it can generate all Context-free Grammar rules from a treebank in the Penn-treebank format.
Take for example the test1.txt file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:
./treebankparser -y S test1.txt
The -y S parameter assigns the label S to root nodes that have no label, as in test1.txt. By default, such root nodes are labeled ROOT.
The output should look like this:
1 ADJP --> JJ
1 IP-HLN --> VP
1 JJ --> 重要
1 NN --> 企业
1 NN --> 增长点
1 NN --> 外商
1 NN --> 外贸
1 NN --> 投资
2 NP --> NN
1 NP --> NP
1 NP-OBJ --> NP
1 NP-PN --> NR
1 NP-SBJ --> NN NN NN
1 NR --> 中国
1 S --> IP-HLN
1 VP --> NP-OBJ
1 VV --> 成为
Each count is separated from its rule by a tab. With the -r parameter, the tool prints relative frequencies as floats instead:
./treebankparser -r -y S test1.txt > res.log
The output should look like:
0.0555556 ADJP --> JJ
0.0555556 IP-HLN --> VP
0.0555556 JJ --> 重要
0.0555556 NN --> 企业
0.0555556 NN --> 增长点
0.0555556 NN --> 外商
0.0555556 NN --> 外贸
0.0555556 NN --> 投资
0.111111 NP --> NN
0.0555556 NP --> NP
0.0555556 NP-OBJ --> NP
0.0555556 NP-PN --> NR
0.0555556 NP-SBJ --> NN NN NN
0.0555556 NR --> 中国
0.0555556 S --> IP-HLN
0.0555556 VP --> NP-OBJ
0.0555556 VV --> 成为
The rules are printed to standard output with either absolute or relative frequencies.
I am adding more features, e.g.:
- reloading existing grammars (multi-batch cycles for larger corpus collections)
- elimination of terminal rules
- parsing alternative coding formats for syntactic trees or treebanks (e.g. XML, TEI XML)
- output probabilities for Left-hand-side symbols only, rather than rules
- generation of a Weighted Finite State Transducer representation, as coded in FLE
If you have ideas or suggestions, let me know.
Prerequisites
The tool is written in C++11 and requires the following libraries:
Compile
Use CLion or otherwise run:
cmake CMakeLists.txt
make
Owner
- Name: Damir Cavar
- Login: dcavar
- Kind: user
- Location: Bloomington, IN
- Company: Indiana University
- Website: http://damir.cavar.me/
- Repositories: 29
- Profile: https://github.com/dcavar
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Damir Cavar | d****r@g****m | 4 |
| Damir Cavar | d****r@m****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0