https://github.com/cdli-gh/conll2graphml
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: cdli-gh
- License: mit
- Language: Python
- Default Branch: main
- Size: 33.2 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
conll2graphml
This script creates a co-occurrence network of named entities mentioned in annotated texts. It reads input data from CoNLL formatted files and extracts relevant information to construct a graph. The graph represents the relationships between named entities based on their co-occurrence in the texts.
Todo
Priority
- Strip any morphological annotations before proceeding
- Read the POS tags to look for from the command line or a file instead of hard-coding them in the script
- Make the period attribute from the sheet optional
- add text attributes from artifact metadata (use CDLI API when stand-alone or db or local dump for the online version)
- add node metadata from sense entry in dictionary (When available, todo... )
Nice to have
- Extend the use of POS/NE tags to additional tags in the MISC field (e.g. commodity, profession)
- Use the MISC column to add additional attributes to the nodes, such as roles
- Use the syntax information to create directed edges
- Use the syntax information to infer roles (as per @Chiarcos' idea)
Usage
python conll2graph.py conll node_attr edge_attr NE_to_remove
example:
python conll2graph.py adab.conll nodes_attributes.tsv edges_attributes.tsv NEs_to_remove.txt
Arguments
- conll: The path to the CDLI-CoNLL formatted file containing the annotations.
- node_attr: The path to the TSV file containing node attributes.
- edge_attr: The path to the TSV file containing edge attributes.
- NE_to_remove: The path to the TXT file of named entities to remove from the graph.
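The four positional arguments above could be declared with argparse, which is among the listed dependencies. This is a minimal sketch and not the script's actual parser: the help strings paraphrase the argument list, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Build a parser for the four positional arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="conll2graph.py",
        description="Create a co-occurrence network of named entities.",
    )
    parser.add_argument("conll", help="CDLI-CoNLL file containing the annotations")
    parser.add_argument("node_attr", help="TSV file containing node attributes")
    parser.add_argument("edge_attr", help="TSV file containing edge attributes")
    parser.add_argument("NE_to_remove", help="TXT file of named entities to remove")
    return parser


# Parse the example invocation from the Usage section.
args = build_parser().parse_args(
    ["adab.conll", "nodes_attributes.tsv", "edges_attributes.tsv", "NEs_to_remove.txt"]
)
print(args.conll, args.node_attr, args.edge_attr, args.NE_to_remove)
```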
Requirements
CoNLL
- The lemma should be followed by the sense in square brackets, e.g. Jean-Jacques[1]
- A POS tag is required for the lemmata to be added to the graph
- Morphological annotations must be stripped prior to running the script
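To illustrate the lemma requirement, the sketch below splits a value such as Jean-Jacques[1] into its lemma and sense. The regular expression and the `split_lemma` helper are assumptions made for illustration; the script may parse this field differently.

```python
import re

# Assumed shape: any text, then a sense enclosed in one pair of square brackets.
LEMMA_SENSE = re.compile(r"(?P<lemma>.+)\[(?P<sense>[^\[\]]*)\]")


def split_lemma(value):
    """Return (lemma, sense) as strings, or None if no bracketed sense is present."""
    match = LEMMA_SENSE.fullmatch(value)
    if match is None:
        return None
    return match.group("lemma"), match.group("sense")


print(split_lemma("Jean-Jacques[1]"))
print(split_lemma("Jean-Jacques"))
```

Returning None for values without a sense makes it easy to skip or report rows that do not meet the requirement.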
node_attr and edge_attr
- Column names are used as attribute names
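One way to turn TSV columns into attribute names is sketched below with the standard csv module. The key column name `id` and the sample row are invented for illustration; the real files may use a different identifier column, and the script itself lists pandas among its dependencies.

```python
import csv
import io


def read_attributes(handle, key_column):
    """Map each row's key to a dict of the remaining columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        key = row.pop(key_column)  # the remaining columns become attributes
        table[key] = dict(row)
    return table


# Hypothetical sample data; the column names and values are made up.
sample = io.StringIO("id\tperiod\trole\nJean-Jacques[1]\tp1\tscribe\n")
node_attrs = read_attributes(sample, key_column="id")
print(node_attrs)
```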
NE_to_remove
- One item per line, used for broken forms
e.g. ...[0] and ur-x[0]
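A removal list in this one-item-per-line format could be applied as in the sketch below. Both helper names are hypothetical, and the script may remove nodes from the graph directly instead of filtering a list.

```python
def load_removal_list(lines):
    """Collect non-empty, stripped lines into a set of entities to drop."""
    return {line.strip() for line in lines if line.strip()}


def drop_entities(entities, to_remove):
    """Keep only the entities that are not on the removal list."""
    return [entity for entity in entities if entity not in to_remove]


# The two broken forms given as examples, as they would appear in the TXT file.
to_remove = load_removal_list(["...[0]\n", "\n", "ur-x[0]\n"])
kept = drop_entities(["Jean-Jacques[1]", "ur-x[0]", "...[0]"], to_remove)
print(kept)
```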
Dependencies
Make sure you have the following dependencies installed:
- re
- csv
- sys
- argparse
- pandas
- numpy
- networkx
- community
- string
Steps
The script follows these steps to create the co-occurrence network:
- Read in the data and extract relevant information from the CDLI-CoNLL file.
- Create a graph and add nodes and edges representing the co-occurrence relationships between named entities.
- Read in node attributes from a TSV file and add them to the nodes in the graph.
- Read in edge attributes from a TSV file and add them to the edges in the graph.
- Remove specified named entities from the graph.
- Perform pre-computations on the graph, such as calculating node degrees and weighted degrees, modularity, and identifying k-plex communities.
- Assign node sizes based on the degree values.
- Output the graph data to a GraphML file and create separate graph files for each period attribute.
- Output unmatched node IDs and names to separate files.
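The graph-building and degree steps above can be sketched with the standard library alone. This assumes that two entities co-occur when they appear in the same text, which the README does not spell out; the text IDs and entity names are placeholders, and the script itself uses networkx for these computations.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(texts):
    """Count how many texts each unordered pair of entities shares."""
    edges = Counter()
    for entities in texts.values():
        # Deduplicate and sort so (a, b) and (b, a) count as one edge.
        for pair in combinations(sorted(set(entities)), 2):
            edges[pair] += 1
    return edges


def degrees(edges):
    """Return (degree, weighted degree) counters for every node."""
    degree, weighted = Counter(), Counter()
    for (a, b), weight in edges.items():
        for node in (a, b):
            degree[node] += 1
            weighted[node] += weight
    return degree, weighted


# Placeholder input: text ID -> named entities found in that text.
texts = {
    "text1": ["A[1]", "B[1]", "C[1]"],
    "text2": ["A[1]", "B[1]"],
}
edges = cooccurrence_edges(texts)
degree, weighted = degrees(edges)
print(dict(edges))
print(dict(degree), dict(weighted))
```

Edge weights here are the number of shared texts, so the weighted degree of a node is the sum of its co-occurrence counts.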
Output
The script produces the following output files:
- output-graph.graphml: The co-occurrence network graph in GraphML format.
- graph_{period}.graphml: Separate graph files for each period attribute, where {period} represents the specific period attribute.
- unmatched_ids.txt: A file containing unmatched node IDs.
- unmatched_names.txt: A file containing unmatched node names.
Note: Make sure you have write permissions in the script's directory for generating the output files.
Please ensure that you have the necessary input files in the specified formats and run the script with the appropriate command line arguments.
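The GraphML outputs could be produced along the lines of the sketch below, using networkx. It assumes that the period is stored as a node attribute named `period`, which the README does not confirm; `write_period_graphs` is a hypothetical helper and the sample graph is invented.

```python
import os
import tempfile

import networkx as nx


def write_period_graphs(graph, out_dir, period_key="period"):
    """Write the full graph plus one GraphML file per distinct period value."""
    written = []
    full_path = os.path.join(out_dir, "output-graph.graphml")
    nx.write_graphml(graph, full_path)
    written.append(full_path)

    periods = {
        data[period_key] for _, data in graph.nodes(data=True) if period_key in data
    }
    for period in sorted(periods):
        nodes = [n for n, d in graph.nodes(data=True) if d.get(period_key) == period]
        path = os.path.join(out_dir, f"graph_{period}.graphml")
        # The subgraph keeps only edges whose endpoints share this period.
        nx.write_graphml(graph.subgraph(nodes), path)
        written.append(path)
    return written


# Invented sample graph with two period values.
graph = nx.Graph()
graph.add_node("A[1]", period="p1")
graph.add_node("B[1]", period="p1")
graph.add_node("C[1]", period="p2")
graph.add_edge("A[1]", "B[1]", weight=2)
graph.add_edge("A[1]", "C[1]", weight=1)

with tempfile.TemporaryDirectory() as out_dir:
    paths = write_period_graphs(graph, out_dir)
    names = [os.path.basename(p) for p in paths]
print(names)
```

Period values containing spaces or path separators would need to be sanitized before being used in a file name.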
Owner
- Name: CDLI
- Login: cdli-gh
- Kind: organization
- Email: cdli@orinst.ox.ac.uk
- Location: Los Angeles, Oxford, Berlin
- Website: https://cdli.ucla.edu
- Repositories: 83
- Profile: https://github.com/cdli-gh
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Emilie Page-Perron | e****n@w****k | 15 |
| sengarsameer | s****r@g****m | 1 |
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 1 month
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- sengarsameer (1)