https://github.com/dcavar/treebankparser

Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.5%) to scientific vocabulary

Keywords

bnf bnfc context-free-grammar lexical-functional-grammar parser penn-treebank probabilistic-context-free-grammar syntax treebank

Last synced: 4 months ago

Repository

Basic Info

Host: GitHub
Owner: dcavar
License: apache-2.0
Language: C
Default Branch: master
Homepage: http://damir.cavar.me/
Size: 186 KB

Statistics

Stars: 3
Watchers: 2
Forks: 3
Open Issues: 0
Releases: 0

Created over 7 years ago · Last pushed over 7 years ago

Metadata Files

Readme License

README.md

TreebankParser

This code and the binaries are made available under the Apache License, Version 2.0, January 2004. For details see the included LICENSE.txt file.

This is a tool that reads treebank files and generates a probabilistic grammar for use in FLE.

Currently it can generate all context-free grammar rules from a treebank in the Penn Treebank format.
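To illustrate what extracting rules from a bracketed tree involves, here is a minimal sketch in C++11. It is not taken from the treebankparser source: the names `tokenize`, `parseNode`, and `countRules` are hypothetical, and the handling of unlabeled root nodes is an assumption modeled on the `-y` option described in this README.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split a bracketed tree into "(", ")" and whitespace-delimited symbols.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        bool space = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (c == '(' || c == ')' || space) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
            if (!space) tokens.push_back(std::string(1, c));
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Parse the node that starts at tokens[pos] == "(", count its rule, and
// return its label. A node without a label receives rootLabel (assumption).
std::string parseNode(const std::vector<std::string>& tokens, std::size_t& pos,
                      const std::string& rootLabel,
                      std::map<std::string, int>& counts) {
    ++pos;  // skip "("
    std::string label = rootLabel;
    if (pos < tokens.size() && tokens[pos] != "(" && tokens[pos] != ")") {
        label = tokens[pos++];
    }
    std::string rhs;
    while (pos < tokens.size() && tokens[pos] != ")") {
        std::string child;
        if (tokens[pos] == "(") {
            child = parseNode(tokens, pos, rootLabel, counts);  // nonterminal
        } else {
            child = tokens[pos++];  // terminal word
        }
        if (!rhs.empty()) rhs += " ";
        rhs += child;
    }
    if (pos < tokens.size()) ++pos;  // skip ")"
    if (!rhs.empty()) ++counts[label + " --> " + rhs];
    return label;
}

// Count every rule "LHS --> RHS" found in all trees of the input text.
std::map<std::string, int> countRules(const std::string& text,
                                      const std::string& rootLabel = "ROOT") {
    std::vector<std::string> tokens = tokenize(text);
    std::map<std::string, int> counts;
    std::size_t pos = 0;
    while (pos < tokens.size()) {
        if (tokens[pos] == "(") {
            parseNode(tokens, pos, rootLabel, counts);
        } else {
            ++pos;  // ignore stray tokens outside a tree
        }
    }
    return counts;
}
```

Under these assumptions, calling `countRules("( (IP (NP (NN dog)) (VP (VV barks))) )", "S")` yields six rules, including `S --> IP` for the unlabeled root and `IP --> NP VP`. A `std::map` keeps the rules sorted by their text, which matches the alphabetical order of the sample output in this README.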

Take for example the test1.txt file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:

./treebankparser -y S test1.txt

The -y S parameter assigns the label S to root nodes that have no label, as in test1.txt. By default, such root nodes are labeled ROOT.

The output should look like this:

1   ADJP --> JJ
1   IP-HLN --> VP
1   JJ --> 重要
1   NN --> 企业
1   NN --> 增长点
1   NN --> 外商
1   NN --> 外贸
1   NN --> 投资
2   NP --> NN
1   NP --> NP
1   NP-OBJ --> NP
1   NP-PN --> NR
1   NP-SBJ --> NN NN NN
1   NR --> 中国
1   S --> IP-HLN
1   VP --> NP-OBJ
1   VV --> 成为

The frequency is separated from the rule by a tab. To output relative frequencies as floats instead of absolute counts, use the -r parameter:

./treebankparser -r -y S test1.txt > res.log

The output should look like:

0.0555556       ADJP --> JJ
0.0555556       IP-HLN --> VP
0.0555556       JJ --> 重要
0.0555556       NN --> 企业
0.0555556       NN --> 增长点
0.0555556       NN --> 外商
0.0555556       NN --> 外贸
0.0555556       NN --> 投资
0.111111        NP --> NN
0.0555556       NP --> NP
0.0555556       NP-OBJ --> NP
0.0555556       NP-PN --> NR
0.0555556       NP-SBJ --> NN NN NN
0.0555556       NR --> 中国
0.0555556       S --> IP-HLN
0.0555556       VP --> NP-OBJ
0.0555556       VV --> 成为

The rules are printed to standard output with absolute or relative frequencies.
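The sample numbers above are consistent with dividing each rule count by the total number of rule occurrences: the counts sum to 18, and 1/18 ≈ 0.0555556 while 2/18 ≈ 0.111111. This is a normalization over all rules, not per left-hand-side symbol. The following sketch shows that computation; the function names `relativeFrequencies` and `formatFrequencies` are hypothetical and not taken from the tool's source.

```cpp
#include <map>
#include <sstream>
#include <string>

// Divide each rule count by the total number of rule occurrences.
std::map<std::string, double> relativeFrequencies(
        const std::map<std::string, int>& counts) {
    long total = 0;
    for (const auto& entry : counts) total += entry.second;

    std::map<std::string, double> result;
    if (total == 0) return result;  // no rules, nothing to normalize
    for (const auto& entry : counts) {
        result[entry.first] = static_cast<double>(entry.second) / total;
    }
    return result;
}

// Render "frequency<TAB>rule" lines using the stream's default
// formatting (six significant digits).
std::string formatFrequencies(const std::map<std::string, double>& freqs) {
    std::ostringstream out;
    for (const auto& entry : freqs) {
        out << entry.second << '\t' << entry.first << '\n';
    }
    return out.str();
}
```

With the default stream precision, a rule that occurs 2 times out of 18 is rendered as `0.111111`, the same value shown for `NP --> NN` in the sample output.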

I am adding more features, e.g.:

reloading existing grammars (multi-batch cycles for larger corpus collections)
elimination of terminal rules
parsing alternative coding formats for syntactic trees or treebanks (e.g. XML, TEI XML)
output probabilities for Left-hand-side symbols only, rather than rules
generation of a Weighted Finite State Transducer representation, as coded in FLE
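As an illustration of the first planned feature, reloading could amount to reading a previously written count file and adding its counts to the current ones. The sketch below is only an assumption about how that might work, not the author's design: `mergeCounts` is a hypothetical name, and the input is assumed to be the absolute-frequency output shown earlier (a count, whitespace or a tab, then the rule).

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Read "count<TAB>rule" lines and add each count to the matching rule.
// Lines that do not start with an integer are skipped.
void mergeCounts(std::istream& in, std::map<std::string, int>& counts) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int count = 0;
        if (!(fields >> count)) continue;  // no leading count on this line

        std::string rule;
        std::getline(fields >> std::ws, rule);  // rest of line is the rule
        if (!rule.empty()) counts[rule] += count;
    }
}
```

Merging absolute counts rather than relative frequencies keeps the batches additive, so relative frequencies can be recomputed once after the last batch.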

If you have ideas or suggestions, let me know.

Prerequisites

The tool is written in C++11 and requires the following libraries:

Compile

Use CLion or otherwise run:

cmake CMakeLists.txt
make

Owner

Name: Damir Cavar
Login: dcavar
Kind: user
Location: Bloomington, IN
Company: Indiana University

Website: http://damir.cavar.me/
Repositories: 29
Profile: https://github.com/dcavar

Committers

Last synced: 8 months ago

All Time

Total Commits: 5
Total Committers: 2
Avg Commits per committer: 2.5
Development Distribution Score (DDS): 0.2

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Damir Cavar	d**r@g**m	4
Damir Cavar	d**r@m**m	1

Committer Domains (Top 20 + Academic)

me.com: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
