Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Repository
MSC project
Basic Info
- Host: GitHub
- Owner: omegatro
- Language: Jupyter Notebook
- Default Branch: main
- Size: 106 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Introduction
This repository contains the code developed as part of the thesis "UNSUPERVISED MACHINE LEARNING APPROACH FOR HIERARCHICAL GRAPH-BASED REPRESENTATION OF NATURAL LANGUAGE TEXT COLLECTIONS". It consists of Python scripts for data preprocessing and Jupyter notebooks that outline the learning process and summarize the final results. The preprocessed data are also available.
Setup & Installation
- Prerequisites
- Installation
```
git clone https://github.com/omegatro/MSC
cd <path_to_MSC_folder>/MSC/
conda env create -f msc_project_env.yaml
mkdir backup
touch .secrets #API key goes here
python -c """
import nltk
nltk.download('stopwords')
nltk.download('punkt')
"""
```
- Command-line interface for preprocessing
```
options:
  -h, --help             show this help message and exit
  --l, --libraryname     Name of the Zotero library to parse.
  --c, --collectionname  Name of the Zotero collection to parse
  --m, --modelfilename   Name of model to name the files (required for compatibility with gensim LDA module)
  --ng, --nramlen        Length of the n-gram
  --ul, --uselocal       Flag to analyze locally-stored copies of publications instead of downloading from links.
  --sm, --skipmodel      Flag to skip LDA topic modelling
  --wc, --wordclouds     Flag to generate wordcloud plot for each pdf file
  --tfp, --tf_plots      Flag to generate term frequency bar plot for each pdf file

Required arguments:
  --o, --outputpath      Full path to the folder where the downloaded files should be saved.

- Command example:
python main.py --l 'My Library' --o ./Data/bioitset/ --c 'Bioinformatics set' --m bioitset1 --sm --ng 1
```
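The option list above could be declared with Python's standard `argparse` module roughly as follows. This is only a sketch: the repository's `main.py` is not reproduced here, and the `build_parser` helper, the `dest` names, and the integer type for `--ng` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the actual main.py)."""
    parser = argparse.ArgumentParser(prog="main.py")

    # Options that take a value.
    parser.add_argument("--l", "--libraryname", dest="libraryname",
                        help="Name of the Zotero library to parse.")
    parser.add_argument("--c", "--collectionname", dest="collectionname",
                        help="Name of the Zotero collection to parse")
    parser.add_argument("--m", "--modelfilename", dest="modelfilename",
                        help="Name of model to name the files")
    # Assumption: the n-gram length is parsed as an integer.
    parser.add_argument("--ng", "--nramlen", dest="nramlen", type=int,
                        help="Length of the n-gram")

    # Boolean flags: present means True, absent means False.
    flags = [
        ("--ul", "--uselocal", "Analyze locally-stored copies of publications"),
        ("--sm", "--skipmodel", "Skip LDA topic modelling"),
        ("--wc", "--wordclouds", "Generate wordcloud plot for each pdf file"),
        ("--tfp", "--tf_plots", "Generate term frequency bar plot for each pdf file"),
    ]
    for short, long, text in flags:
        parser.add_argument(short, long, dest=long.lstrip("-"),
                            action="store_true", help=text)

    # The output path is the only argument documented as required.
    required = parser.add_argument_group("Required arguments")
    required.add_argument("--o", "--outputpath", dest="outputpath", required=True,
                          help="Folder where the downloaded files should be saved.")
    return parser


# Parse the documented command example without touching sys.argv.
args = build_parser().parse_args([
    "--l", "My Library", "--o", "./Data/bioitset/",
    "--c", "Bioinformatics set", "--m", "bioitset1", "--sm", "--ng", "1",
])
print(args)
```

Passing an explicit list to `parse_args` keeps the sketch testable; a real script would normally call `parse_args()` with no arguments so that the values come from the command line.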
- Results
- Visual results include hARTM topic models and plots generated based on those models.
- Preliminary results presented at RaTSIf conference
- Datasets
- Contains two preprocessed datasets from the thesis:
- Bioinformatics dataset
- NLP dataset
- Each dataset contains:
- Unigram total frequency for each dataset (files with tf_1.csv suffix).
- Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix).
- Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt).
- Vocabulary of unique bigrams for each dataset (vocab_2.txt).
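The Vowpal-Wabbit files can be read line by line. The sketch below assumes the common layout `document_id |namespace token:count token ...`, with a count of 1 when none is given. The namespace names and the way bigrams are joined in the `_vw.txt` files are not stated here, so the sample line and the `parse_vw_line` helper are hypothetical.

```python
from collections import Counter


def parse_vw_line(line):
    """Split one Vowpal-Wabbit-style line into (doc_id, {namespace: Counter})."""
    head, *sections = line.strip().split("|")
    doc_id = head.strip()
    namespaces = {}
    for section in sections:
        parts = section.split()
        if not parts:
            continue
        if section[:1].isspace():
            # "| token ..." (space after the pipe) means the default namespace.
            name, items = "", parts
        else:
            # "|name token ..." means a named namespace.
            name, items = parts[0], parts[1:]
        counts = namespaces.setdefault(name, Counter())
        for item in items:
            token, sep, value = item.rpartition(":")
            try:
                # "token:3" carries an explicit count.
                count = float(value) if sep else 1.0
            except ValueError:
                # The text after ":" is not numeric, so treat the whole item as a token.
                token, count = item, 1.0
            if not sep:
                token = item
            counts[token] += count
    return doc_id, namespaces


# Hypothetical sample line; real files may use different names and separators.
sample = "doc1 |text genome_assembly:3 variant_calling single_cell:2"
doc_id, namespaces = parse_vw_line(sample)
print(doc_id, dict(namespaces["text"]))
```

Returning a `Counter` per namespace keeps repeated tokens on one line additive, which is convenient when the counts feed a term-frequency table.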
Owner
- Name: Eugene Bodrenko
- Login: omegatro
- Kind: user
- Company: @NMRL
- Repositories: 2
- Profile: https://github.com/omegatro
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: hARTM dataset for Master Thesis
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Jevgenijs
    family-names: Bodrenko
    email: omegatro@gmail.com
    orcid: 'https://orcid.org/0009-0007-7085-7762'
repository-code: 'https://github.com/omegatro/MSC/'
abstract: >-
  Code companion for the thesis - "Unsupervised machine learning approach for hierarchical graph-based representation of natural language text collections".
  The repository contains python scripts for data preprocessing, as well as jupyter notebooks outlining the learning process and summarizing final results.
  The datasets contain a selection of tokenized full-text articles.
  BIOIT dataset covers a variety of bioinformatics topics, including single cell data analysis, variant calling, genome assembly and antimicrobial surveillance.
  NLP dataset covers a variety of natural language processing topics, including data preparation techniques, topic modelling, document clustering methods, deep learning methods and large language models.
  Each dataset contains Unigram total frequency (files with tf_1.csv suffix); Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix)
  Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt); Vocabulary of unique bigrams for each dataset (vocab_2.txt).