
Repository

MSC project

Basic Info
  • Host: GitHub
  • Owner: omegatro
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 106 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed about 2 years ago

README.md

Introduction

This repository contains the code developed for the thesis "UNSUPERVISED MACHINE LEARNING APPROACH FOR HIERARCHICAL GRAPH-BASED REPRESENTATION OF NATURAL LANGUAGE TEXT COLLECTIONS". It consists of Python scripts for data preprocessing and Jupyter notebooks that outline the learning process and summarize the final results. The preprocessed data are also included.

Setup & Installation

  • Prerequisites
  • Installation

        git clone https://github.com/omegatro/MSC
        cd <path_to_MSC_folder>/MSC/
        conda env create -f msc_project_env.yaml
        mkdir backup
        touch .secrets  # API key goes here
        python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

  • Command-line interface for preprocessing

```
options:
  -h, --help              show this help message and exit
  --l, --libraryname      Name of the Zotero library to parse.
  --c, --collectionname   Name of the Zotero collection to parse
  --m, --modelfilename    Name of model to name the files (required for
                          compatibility with gensim LDA module)
  --ng, --nramlen         Length of the n-gram
  --ul, --uselocal        Flag to analyze locally-stored copies of publications
                          instead of downloading from links.
  --sm, --skipmodel       Flag to skip LDA topic modelling
  --wc, --wordclouds      Flag to generate wordcloud plot for each pdf file
  --tfp, --tf_plots       Flag to generate term frequency bar plot for each
                          pdf file

Required arguments:
  --o, --outputpath       Full path to the folder where the downloaded files
                          should be saved.

Command example:
  python main.py --l 'My Library' --o ./Data/bioitset/ --c 'Bioinformatics set' --m bioitset1 --sm --ng 1
```

  • Results
    • Visual results include hARTM topic models and plots generated based on those models.
    • Preliminary results were presented at the RaTSIf conference.
  • Datasets
    • Contains two preprocessed datasets from the thesis:
      • Bioinformatics dataset
      • NLP dataset
    • Each dataset contains:
      • Unigram total frequency for each dataset (files with tf_1.csv suffix).
      • Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix).
      • Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt).
      • Vocabulary of unique bigrams for each dataset (vocab_2.txt).
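
The Vowpal-Wabbit files described above can be inspected with plain Python. The sketch below is a minimal, standard-library-only illustration and is not code from this repository. It assumes the common layout `doc_id |modality token:count token ...`, where a bare token counts once. The function names `parse_vw_line` and `total_frequency`, the modality name `text`, and the sample lines are all hypothetical; check them against the actual `_vw.txt` files before relying on them.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def parse_vw_line(line: str) -> Tuple[str, Dict[str, Counter]]:
    """Parse one Vowpal-Wabbit-style line into (doc_id, {modality: counts})."""
    head, *sections = line.strip().split("|")
    doc_id = head.strip()
    modalities: Dict[str, Counter] = {}
    for section in sections:
        parts = section.split()
        if not parts:
            continue
        if section[:1].isspace():
            # "| token ..." has no namespace; file it under a default name.
            name, tokens = "default", parts
        else:
            # "|text token ..." names the namespace right after the bar.
            name, tokens = parts[0], parts[1:]
        counts = modalities.setdefault(name, Counter())
        for token in tokens:
            # "token:3" carries an explicit count; a bare token counts once.
            term, sep, weight = token.rpartition(":")
            if sep and term and weight.isdigit():
                counts[term] += int(weight)
            else:
                counts[token] += 1
    return doc_id, modalities


def total_frequency(lines: Iterable[str], modality: str = "text") -> Counter:
    """Sum token counts over all documents for one modality."""
    totals: Counter = Counter()
    for line in lines:
        _, modalities = parse_vw_line(line)
        totals.update(modalities.get(modality, Counter()))
    return totals


# Hypothetical sample lines, made up for illustration only.
sample = [
    "doc_1 |text topic_model:2 gene_expression",
    "doc_2 |text topic_model language_model:2",
]
totals = total_frequency(sample)
print(totals.most_common(2))  # [('topic_model', 3), ('language_model', 2)]
```

To read a real file, pass an open file handle instead of the sample list, for example `total_frequency(open(path, encoding="utf-8"))`, where `path` points at one of the `_vw.txt` files.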

Owner

  • Name: Eugene Bodrenko
  • Login: omegatro
  • Kind: user
  • Company: @NMRL

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: hARTM dataset for Master Thesis
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Jevgenijs
    family-names: Bodrenko
    email: omegatro@gmail.com
    orcid: 'https://orcid.org/0009-0007-7085-7762'
repository-code: 'https://github.com/omegatro/MSC/'
abstract: >-
  Code companion for the thesis - "Unsupervised machine learning approach for hierarchical graph-based representation of natural language text collections".
  The repository contains python scripts for data preprocessing, as well as jupyter notebooks outlining the learning process and summarizing final results.
  The datasets contain a selection of tokenized full-text articles. 
  BIOIT dataset covers a variety of bioinformatics topics, including single cell data analysis, variant calling, genome assembly and antimicrobial surveillance. 
  NLP dataset covers a variety of natural language processing topics, including data preparation techniques, topic modelling, document clustering methods, deep learning methods and large language models. 
  Each dataset contains Unigram total frequency (files with tf_1.csv suffix); Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix);
  Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt); Vocabulary of unique bigrams for each dataset (vocab_2.txt).
