https://github.com/cdli-gh/conll2graphml

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cdli-gh
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 33.2 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed about 1 year ago

README.md

conll2graphml

This script creates a co-occurrence network of named entities mentioned in annotated texts. It reads input data from CoNLL formatted files and extracts relevant information to construct a graph. The graph represents the relationships between named entities based on their co-occurrence in the texts.

Todo

Priority

  • Strip any morphological annotations before proceeding
  • Read the POS tags to look for from the command line or from a file instead of hard-coding them in the script
  • The period attribute from the sheet should be optional
  • Add text attributes from artifact metadata (use the CDLI API when running stand-alone, or the db or a local dump for the online version)
  • Add node metadata from the sense entry in the dictionary (when available, todo... )

Nice to have

  • Extend the functionality of using POS/NE tags to some additional tags in the MISC field (e.g. commodity, profession, etc.)
  • Use the MISC column to add additional attributes to the nodes, such as roles
  • Use the syntax information to create directed edges
  • Use the syntax information to infer roles (as per @Chiarcos' idea)

Usage

python conll2graph.py conll node_attr edge_attr NE_to_remove

example:
python conll2graph.py adab.conll nodes_attributes.tsv edges_attributes.tsv NEs_to_remove.txt

Arguments

  • conll: The path to the CDLI-CoNLL formatted file containing the annotations.
  • node_attr: The path to the TSV file containing node attributes.
  • edge_attr: The path to the TSV file containing edge attributes.
  • NE_to_remove: The path to the TXT file of named entities to remove from the graph.
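
The four positional arguments above could be declared with argparse roughly as follows. This is a minimal sketch, not the script's actual parser: the function name build_parser and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the four positional arguments listed above.

    Hypothetical helper: the real script may declare its arguments differently.
    """
    parser = argparse.ArgumentParser(
        description="Build a co-occurrence network from a CDLI-CoNLL file."
    )
    parser.add_argument("conll", help="path to the CDLI-CoNLL file")
    parser.add_argument("node_attr", help="path to the node attributes TSV")
    parser.add_argument("edge_attr", help="path to the edge attributes TSV")
    parser.add_argument(
        "NE_to_remove", help="path to the TXT list of named entities to remove"
    )
    return parser


# Parse an explicit argument list (same values as the usage example)
# instead of reading sys.argv, so the sketch can be run anywhere.
args = build_parser().parse_args(
    ["adab.conll", "nodes_attributes.tsv", "edges_attributes.tsv", "NEs_to_remove.txt"]
)
print(args.conll, args.node_attr, args.edge_attr, args.NE_to_remove)
```

Keeping the arguments positional mirrors the usage line; optional flags (for example for the POS tags mentioned in the todo list) would be added with a leading `--`.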

Requirements

CoNLL

  • Each lemma should be followed by its sense in square brackets, e.g. Jean-Jacques[1]
  • A POS tag is required for every lemma that should be added to the graph
  • Morphological annotations must be stripped prior to running the script
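
To illustrate the lemma[sense] convention, here is a small sketch that splits such a value into its two parts. The regular expression and the helper name split_lemma_sense are assumptions; the script itself may parse lemmata differently.

```python
import re

# Assumed pattern: everything before a trailing "[...]" is the lemma,
# and the bracket content is the sense.
LEMMA_SENSE = re.compile(r"^(?P<lemma>.+)\[(?P<sense>[^\[\]]*)\]$")


def split_lemma_sense(value):
    """Return (lemma, sense); sense is None when no trailing [sense] is present."""
    match = LEMMA_SENSE.match(value)
    if match is None:
        return value, None
    return match.group("lemma"), match.group("sense")


print(split_lemma_sense("Jean-Jacques[1]"))
print(split_lemma_sense("lugal"))
```

A value without a bracketed sense is returned unchanged with a sense of None, which makes it easy to skip or report entries that do not follow the convention.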

node_attr and edge_attr

  • Column names are used as attribute names
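
The sketch below shows one way column names can become attribute names, using only the standard library. The key column name "id", the other column names, and the sample rows are invented for illustration; the real sheets may be laid out differently.

```python
import csv
import io


def load_attributes(tsv_file, key_column):
    """Map each key to a dict of the remaining columns (column name -> value)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {
        row[key_column]: {k: v for k, v in row.items() if k != key_column}
        for row in reader
    }


def attach_attributes(nodes, attributes):
    """Copy attributes onto matching nodes; return the ids that had no match."""
    unmatched = []
    for node_id, data in nodes.items():
        if node_id in attributes:
            data.update(attributes[node_id])
        else:
            unmatched.append(node_id)
    return unmatched


# Made-up sample sheet: "id", "period" and "role" are hypothetical column names.
sample = io.StringIO("id\tperiod\trole\nJean-Jacques[1]\tmodern\tauthor\n")
attrs = load_attributes(sample, key_column="id")

nodes = {"Jean-Jacques[1]": {}, "ur-x[0]": {}}
unmatched = attach_attributes(nodes, attrs)
print(nodes)
print(unmatched)
```

Collecting the ids without a matching row is in the spirit of the unmatched_ids.txt output, although the script's own matching logic is not documented here.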

NE_to_remove

  • One item per line; used for broken forms, e.g. ...[0] and ur-x[0]
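
As an illustration of the one-item-per-line format, the following sketch reads such a list and filters entities against it. The helper names are hypothetical, and an in-memory file stands in for the TXT file.

```python
import io


def read_removal_list(lines):
    """Return the set of non-empty, stripped lines (one named entity per line)."""
    return {line.strip() for line in lines if line.strip()}


def drop_entities(entities, to_remove):
    """Keep only the entities that are not on the removal list."""
    return [entity for entity in entities if entity not in to_remove]


# In-memory stand-in for the removal file, using the broken forms named above.
removal_file = io.StringIO("...[0]\nur-x[0]\n\n")
to_remove = read_removal_list(removal_file)

kept = drop_entities(["Jean-Jacques[1]", "ur-x[0]", "...[0]"], to_remove)
print(kept)
```

Matching here is exact string equality, so an entry must be written in the list exactly as it appears in the lemma column, including the bracketed sense.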

Dependencies

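The script imports the following modules. Those from the Python standard library need no installation:

  • re
  • csv
  • sys
  • argparse
  • string

Make sure you have the following third-party packages installed:

  • pandas
  • numpy
  • networkx
  • community (on PyPI this module is typically distributed as python-louvain)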

Steps

The script follows these steps to create the co-occurrence network:

  1. Read in the data and extract relevant information from the CDLI-CoNLL file.
  2. Create a graph and add nodes and edges representing the co-occurrence relationships between named entities.
  3. Read in node attributes from a TSV file and add them to the nodes in the graph.
  4. Read in edge attributes from a TSV file and add them to the edges in the graph.
  5. Remove specified named entities from the graph.
  6. Perform pre-computations on the graph, such as calculating node degrees and weighted degrees, modularity, and identifying k-plex communities.
  7. Assign node sizes based on the degree values.
  8. Output the graph data to a GraphML file and create separate graph files for each period attribute.
  9. Output unmatched node IDs and names to separate files.
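
The core of steps 2 and 6 can be sketched with the standard library alone. The example assumes that two entities co-occur when they appear in the same text and that edge weight counts the number of shared texts; the script may define co-occurrence differently, and the entity ids and function names below are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(texts):
    """Count, for each unordered pair of entities, the texts in which both occur.

    texts: mapping of text id -> iterable of entity ids (assumed input shape).
    """
    weights = Counter()
    for entities in texts.values():
        # De-duplicate within a text so each text contributes at most 1 per pair.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


def weighted_degree(weights):
    """Sum the weights of all edges incident to each node."""
    degree = Counter()
    for (a, b), weight in weights.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Made-up sample: two texts sharing the pair A[1] / B[1].
texts = {"text1": ["A[1]", "B[1]", "C[1]"], "text2": ["A[1]", "B[1]"]}
edges = cooccurrence_edges(texts)
degrees = weighted_degree(edges)
print(dict(edges))
print(dict(degrees))
```

In the script these counts would feed a networkx graph; the plain dictionaries here only show how the weights and weighted degrees come about.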

Output

The script produces the following output files:

  • output-graph.graphml: The co-occurrence network graph in GraphML format.
  • graph_{period}.graphml: Separate graph files for each period attribute, where {period} represents the specific period attribute.
  • unmatched_ids.txt: A file containing unmatched node IDs.
  • unmatched_names.txt: A file containing unmatched node names.
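
A possible way to produce the GraphML files named above with networkx is sketched here. It assumes that the period is stored as a node attribute called "period" and that each per-period file holds the subgraph of nodes sharing that value; both are assumptions, as are the helper name and the sample graph.

```python
import os
import tempfile

import networkx as nx


def write_period_graphs(graph, out_dir, attribute="period"):
    """Write one GraphML file per distinct value of a node attribute (assumed layout)."""
    paths = []
    periods = {
        data[attribute] for _, data in graph.nodes(data=True) if attribute in data
    }
    for period in sorted(periods):
        members = [
            node for node, data in graph.nodes(data=True)
            if data.get(attribute) == period
        ]
        path = os.path.join(out_dir, f"graph_{period}.graphml")
        nx.write_graphml(graph.subgraph(members), path)
        paths.append(path)
    return paths


# Made-up sample graph with two hypothetical period values.
G = nx.Graph()
G.add_node("A[1]", period="early")
G.add_node("B[1]", period="early")
G.add_node("C[1]", period="late")
G.add_edge("A[1]", "B[1]", weight=2)

out_dir = tempfile.mkdtemp()  # temporary directory instead of the script's folder
full_path = os.path.join(out_dir, "output-graph.graphml")
nx.write_graphml(G, full_path)
paths = write_period_graphs(G, out_dir)
print(sorted(os.path.basename(p) for p in paths))
```

Edges whose endpoints belong to different periods are dropped from the per-period files in this sketch, because a node-induced subgraph keeps only edges between its own members.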

Note: Make sure you have write permissions in the script's directory for generating the output files.

Please ensure that you have the necessary input files in the specified formats and run the script with the appropriate command line arguments.

Owner

  • Name: CDLI
  • Login: cdli-gh
  • Kind: organization
  • Email: cdli@orinst.ox.ac.uk
  • Location: Los Angeles, Oxford, Berlin

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 16
  • Total Committers: 2
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.063
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Emilie Page-Perron e****n@w****k 15
sengarsameer s****r@g****m 1

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • sengarsameer (1)