Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: seancze
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 12.3 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 11 months ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

sgexams-data-analysis

This repository contains data collected from the r/SGExams subreddit from July 2024 to March 2025.

It also contains code to train ML models that classify whether a thread should be marked as "popular". We define a "popular" thread as one whose upvote count is in the top 25% of all threads in the dataset. For this dataset, that means a thread must have received at least 41 upvotes.
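The labelling rule above can be sketched as follows. This is a minimal illustration rather than the repository's own code: the function names are hypothetical, and it assumes the cutoff is the 75th percentile of upvote counts, with threads at or above the cutoff labelled as popular.

```python
from statistics import quantiles


def popularity_threshold(upvotes):
    """Return the 75th-percentile upvote count (assumed cutoff for the top 25%)."""
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return quantiles(upvotes, n=4, method="inclusive")[2]


def label_popular(upvotes):
    """Label each thread 1 (popular) or 0 (not popular) using the assumed cutoff."""
    cutoff = popularity_threshold(upvotes)
    return [1 if u >= cutoff else 0 for u in upvotes]


# Toy upvote counts for illustration only, not taken from the dataset.
toy_upvotes = [1, 2, 3, 4, 5, 6, 7, 8]
toy_labels = label_popular(toy_upvotes)
```

On the real dataset, the README reports that this kind of cutoff works out to 41 upvotes; how ties at the cutoff are handled in the repository is not stated.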

Data

The data was collected using the Python Reddit API Wrapper by running a daily cron job. The dataset has also been filtered to remove:

1. All data that has been marked as removed by the official Python Reddit API Wrapper
2. All data that contains media (is_self=False)
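The two filters above could look roughly like the sketch below, which works on plain dictionaries rather than live API objects. It is an assumption-laden illustration: `keep_thread` is a hypothetical helper, and using a non-null `removed_by_category` field to detect removed threads is a guess, since the README does not say which attribute the repository checks. Only `is_self` comes from the text above.

```python
def keep_thread(thread):
    """Return True if a thread record passes both filters described above."""
    # Assumption: a non-null `removed_by_category` marks a removed thread.
    is_removed = thread.get("removed_by_category") is not None
    # `is_self` is False for link or media posts, which are dropped.
    is_text_post = thread.get("is_self", False)
    return (not is_removed) and is_text_post


# Hypothetical records for illustration only.
raw_threads = [
    {"id": "a1", "is_self": True, "removed_by_category": None},
    {"id": "b2", "is_self": False, "removed_by_category": None},
    {"id": "c3", "is_self": True, "removed_by_category": "moderator"},
]
kept_ids = [t["id"] for t in raw_threads if keep_thread(t)]
```

Here only the first record survives: the second is a media post and the third is flagged as removed.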

There are 4 datasets in the data folder:

1. data/threads_jul24_mar25.csv = Full dataset containing 9,703 rows
2. data/threads_jul24_mar25_train.csv = Training dataset containing 7,762 rows (80%)
3. data/threads_jul24_mar25_val.csv = Validation dataset containing 970 rows (10%)
4. data/threads_jul24_mar25_test.csv = Test dataset containing 971 rows (10%)
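The README does not describe how the split was produced. As one plausible reading, the sketch below holds out 20% of the rows (rounded up) and then halves that holdout into validation and test sets, which happens to reproduce the row counts listed above. The function name, the seed, and the two-step procedure are all assumptions.

```python
import math
import random


def split_indices(n_rows, seed=0):
    """Shuffle row indices and split them roughly 80/10/10 (assumed procedure)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Hold out 20% of the rows, rounding up, then give half (rounded up) to test.
    n_holdout = math.ceil(0.2 * n_rows)
    n_test = math.ceil(0.5 * n_holdout)
    n_train = n_rows - n_holdout

    train = indices[:n_train]
    val = indices[n_train:n_rows - n_test]
    test = indices[n_rows - n_test:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(9703)
split_sizes = (len(train_idx), len(val_idx), len(test_idx))  # (7762, 970, 971)
```

The actual row assignment in the published CSV files may differ, since the shuffle and seed used here are arbitrary.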

ML Classifier

This section outlines how to train your own ML classifier and how to replicate the results in data/model_comparison_results.csv, which contains the results of all experiments conducted.

1. Install dependencies

This project uses Python 3.12. It is recommended to create a virtual environment to install the dependencies.

```bash
pip install -r requirements.txt
```

2. Run the experiments

To run all experiments and replicate the results found in data/model_comparison_results.csv, run the following command:

```bash
python main.py --all --test
```

To run the experiment with the best subset of features, run the following command:

```bash
python main.py
```

To run the experiment with all features, run the following command:

```bash
python main.py --mode 1
```

A summary of all command line arguments is shown below:

| Flag | Type | Default | Variable | Description |
|------|------|---------|----------|-------------|
| --mode {0,1} | choice (0/1) | 0 | TRAIN_BEST_MODEL, TRAIN_ALL_FEATURES | Selects which model to train: 0 = train best model (default); 1 = train model with all features. |
| --all | boolean flag | false | RUN_ALL_EXPERIMENTS | If set, all experiments will be run (overrides --mode). Warning: may cause a segfault if memory is low. |
| --test | boolean flag | false | GET_TEST_PERFORMANCE | If set, evaluate on the test dataset; otherwise only validation is performed. |
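For readers who want to see how such flags are typically wired up, here is a minimal `argparse` sketch that mirrors the table above. It is not the repository's main.py: `build_parser` and the `run_all` destination name are hypothetical, and only the flag names, choices, and defaults are taken from the table.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags in the table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Train thread popularity classifiers.")
    parser.add_argument(
        "--mode", type=int, choices=[0, 1], default=0,
        help="0: train best model (default); 1: train model with all features",
    )
    parser.add_argument(
        "--all", action="store_true", dest="run_all",
        help="run all experiments (overrides --mode)",
    )
    parser.add_argument(
        "--test", action="store_true",
        help="also evaluate on the test dataset",
    )
    return parser


# Equivalent to running: python main.py --all --test
args = build_parser().parse_args(["--all", "--test"])
```

With these arguments, `args.run_all` and `args.test` are both true and `args.mode` keeps its default of 0, which a real script would ignore because --all takes precedence.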

Owner

  • Login: seancze
  • Kind: user

Citation (citation.cff)

cff-version: 1.2.0
type: dataset
message: "If you use this repository, please cite it as below."
authors:
  - family-names: "Chen"
    given-names: "Sean"
    orcid: "https://orcid.org/0000-0001-7160-4037"
title: "SGExams Data Analysis"
version: 1.0.0
doi: 10.5281/zenodo.15269590
date-released: 2025-04-23
url: "https://github.com/seancze/sgexams-data-analysis"

GitHub Events

Total
  • Release event: 1
  • Public event: 1
  • Push event: 1
Last Year
  • Release event: 1
  • Public event: 1
  • Push event: 1

Dependencies

requirements.txt pypi
  • Jinja2 ==3.1.6
  • joblib ==1.4.2
  • matplotlib ==3.10.1
  • numpy ==2.2.5
  • openpyxl ==3.1.5
  • pandas ==2.2.3
  • scikit_learn ==1.6.1
  • shap ==0.47.2
  • xgboost ==2.1.4