msc

MSC project

https://github.com/omegatro/msc

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary

Last synced: 11 months ago

Repository

MSC project

Basic Info

Host: GitHub
Owner: omegatro
Language: Jupyter Notebook
Default Branch: main
Size: 106 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme Citation

Introduction

This repository contains the code developed as part of the thesis "UNSUPERVISED MACHINE LEARNING APPROACH FOR HIERARCHICAL GRAPH-BASED REPRESENTATION OF NATURAL LANGUAGE TEXT COLLECTIONS". It consists of Python scripts for data preprocessing and Jupyter notebooks that outline the learning process and summarize the final results. The preprocessed data are also available in the repository.

Setup & Installation

Prerequisites
- Anaconda
- WSL2
- Jupyter under WSL2 or Colab
- Zotero
- Zotero private API key available in <path_to_MSC_folder>/MSC/.secrets file
Installation

    git clone https://github.com/omegatro/MSC
    cd <path_to_MSC_folder>/MSC/
    conda env create -f msc_project_env.yaml
    mkdir backup
    touch .secrets  # API key goes here
    python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
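The prerequisites place the Zotero private API key in the `.secrets` file, but the layout of that file is not described. As a minimal sketch, assuming the file holds the key on a single line (possibly alongside blank lines or `#` comments), it could be read as follows. The function name `read_api_key` is hypothetical and is not taken from the repository.

```python
from pathlib import Path


def read_api_key(secrets_path):
    """Return the first non-empty, non-comment line of the secrets file.

    Assumption: the key is stored as plain text on its own line.
    """
    text = Path(secrets_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines used as comments.
        if line and not line.startswith("#"):
            return line
    raise ValueError(f"No API key found in {secrets_path}")
```

Raising an error when no key is found keeps a missing or empty `.secrets` file from surfacing later as an opaque authentication failure.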
Command-line interface for preprocessing
```
options:
  -h, --help             show this help message and exit
  --l, --libraryname     Name of the Zotero library to parse.
  --c, --collectionname  Name of the Zotero collection to parse
  --m, --modelfilename   Name of model to name the files (required for compatibility with gensim LDA module)
  --ng, --nramlen        Length of the n-gram
  --ul, --uselocal       Flag to analyze locally-stored copies of publications instead of downloading from links.
  --sm, --skipmodel      Flag to skip LDA topic modelling
  --wc, --wordclouds     Flag to generate wordcloud plot for each pdf file
  --tfp, --tf_plots      Flag to generate term frequency bar plot for each pdf file

Required arguments:
  --o, --outputpath      Full path to the folder where the downloaded files should be saved.

Command example:
  python main.py --l 'My Library' --o ./Data/bioitset/ --c 'Bioinformatics set' --m bioitset1 --sm --ng 1
```
- Results
  - Visual results include hARTM topic models and plots generated based on those models.
  - Preliminary results were presented at the RaTSIf conference.
- Datasets
  - Contains two preprocessed datasets from the thesis:
    - Bioinformatics dataset
    - NLP dataset
  - Each dataset contains:
    - Unigram total frequency for each dataset (files with tf_1.csv suffix).
    - Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix).
    - Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt).
    - Vocabulary of unique bigrams for each dataset (vocab_2.txt).
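The `_vw.txt` files are described as Vowpal-Wabbit format, but their exact layout is not documented here. The sketch below assumes each line has the form `doc_id |namespace token:count token ...`, a layout commonly used for topic-modelling input, with a missing count treated as 1. The function name `parse_vw_line`, the namespace `bigrams`, and the sample tokens are hypothetical and are not taken from the datasets.

```python
from collections import Counter


def parse_vw_line(line):
    """Parse 'doc_id |namespace token[:count] ...' into (doc_id, {namespace: Counter}).

    Assumption: the first whitespace-separated field is the document id.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty line")
    doc_id = parts[0]
    namespaces = {}
    current = "default"  # used if tokens appear before any '|namespace' marker
    for item in parts[1:]:
        if item.startswith("|"):
            current = item[1:] or "default"
            namespaces.setdefault(current, Counter())
            continue
        token, sep, count = item.rpartition(":")
        if sep and count.replace(".", "", 1).isdigit():
            weight = float(count)
        else:
            # No numeric suffix: treat the whole item as a token seen once.
            token, weight = item, 1.0
        namespaces.setdefault(current, Counter())[token] += weight
    return doc_id, namespaces


# Usage with a made-up line:
doc_id, counts = parse_vw_line("doc1 |bigrams topic_model:3 gene_expression single_cell:2")
```

Using `rpartition` splits on the last colon only, so a token that itself contains a colon keeps its earlier colons intact.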

Owner

Name: Eugene Bodrenko
Login: omegatro
Kind: user
Company: @NMRL

Repositories: 2
Profile: https://github.com/omegatro

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: hARTM dataset for Master Thesis
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Jevgenijs
    family-names: Bodrenko
    email: omegatro@gmail.com
    orcid: 'https://orcid.org/0009-0007-7085-7762'
repository-code: 'https://github.com/omegatro/MSC/'
abstract: >-
  Code companion for the thesis - "Unsupervised machine learning approach for hierarchical graph-based representation of natural language text collections".
  The repository contains python scripts for data preprocessing, as well as jupyter notebooks outlining the learning process and summarizing final results.
  The datasets contain a selection of tokenized full-text articles.
  BIOIT dataset covers a variety of bioinformatics topics, including single cell data analysis, variant calling, genome assembly and antimicrobial surveillance.
  NLP dataset covers a variety of natural language processing topics, including data preparation techniques, topic modelling, document clustering methods, deep learning methods and large language models.
  Each dataset contains Unigram total frequency (files with tf_1.csv suffix); Bigram-based representation for each document in Vowpal-Wabbit format (files with _vw.txt suffix);
  Bigram co-occurrence count dictionary used for coherence calculations (cooc_2.txt); Vocabulary of unique bigrams for each dataset (vocab_2.txt).
