sgexams-data-analysis

https://github.com/seancze/sgexams-data-analysis

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: seancze
License: mit
Language: Python
Default Branch: main
Size: 12.3 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

sgexams-data-analysis

This repository contains data collected from the r/SGExams subreddit from July 2024 to March 2025.

It also contains code to train ML models that classify whether a thread should be marked as "popular". We define a "popular" thread as one whose upvote count is in the top 25% of all threads in the dataset. For this dataset, that means a thread must have received at least 41 upvotes.
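The labelling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names are hypothetical, the quantile method (`inclusive`) is an assumption, and only the 41-upvote cutoff comes from the text above.

```python
from statistics import quantiles


def top_quartile_threshold(upvotes):
    """Return the 75th percentile of the upvote counts.

    Assumption: the "inclusive" method is used here for illustration;
    the repository may compute the cutoff differently.
    """
    return quantiles(upvotes, n=4, method="inclusive")[2]


def label_popular(upvotes, threshold=41):
    """Mark each thread as popular if it has at least `threshold` upvotes.

    The default of 41 is the cutoff stated for this dataset.
    """
    return [count >= threshold for count in upvotes]


# Toy upvote counts, for illustration only (not taken from the dataset).
toy_upvotes = [0, 10, 20, 30, 40]
print(top_quartile_threshold(toy_upvotes))
print(label_popular([10, 41, 40, 100]))
```

A fixed threshold keeps the label reproducible across the train, validation, and test files, whereas recomputing the percentile per file would give each split a slightly different definition of "popular".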

Data

The data was collected with the Python Reddit API Wrapper (PRAW) by running a daily cron job. The dataset has been filtered to remove:

1. All data marked as removed by the official Python Reddit API Wrapper
2. All data that contains media (is_self=False)
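The two filters could look roughly like the sketch below. It works on plain dictionaries rather than live PRAW objects. The `is_self` field is named in the text above; detecting removal through a `"[removed]"` body in `selftext` is an assumption made for illustration, and `keep_thread` is a hypothetical helper, not a function from the repository.

```python
def keep_thread(thread: dict) -> bool:
    """Return True if a thread passes both filters described above."""
    # Assumption: a removed thread is recognised by a "[removed]" body.
    # The repository may rely on a different marker.
    is_removed = thread.get("selftext") == "[removed]"
    # is_self is True for text-only posts; False means a link or media post.
    is_text_post = bool(thread.get("is_self", False))
    return is_text_post and not is_removed


# Hypothetical sample rows, for illustration only.
threads = [
    {"id": "a1", "is_self": True, "selftext": "How do I revise for exams?"},
    {"id": "b2", "is_self": False, "selftext": ""},
    {"id": "c3", "is_self": True, "selftext": "[removed]"},
]
kept_ids = [t["id"] for t in threads if keep_thread(t)]
print(kept_ids)
```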

There are 4 datasets in the data folder:

1. data/threads_jul24_mar25.csv = Full dataset containing 9,703 rows
2. data/threads_jul24_mar25_train.csv = Training dataset containing 7,762 rows (80%)
3. data/threads_jul24_mar25_val.csv = Validation dataset containing 970 rows (10%)
4. data/threads_jul24_mar25_test.csv = Test dataset containing 971 rows (10%)
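The row counts above are consistent with a simple 80/10/10 split of 9,703 rows. The sketch below shows one way to arrive at the same sizes. It is an illustration only: the shuffling, the seed, and the rounding rule are assumptions, and `split_80_10_10` is a hypothetical helper rather than the procedure used to produce the published files.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle rows and split them into train, validation, and test parts."""
    rows = list(rows)
    # A fixed seed makes the split repeatable (the seed value is an assumption).
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10          # 80% for training, rounded down
    n_val = (len(rows) - n_train) // 2     # half of the remainder for validation
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]          # whatever is left goes to the test set
    return train, val, test


# Row indices stand in for the 9,703 threads of the full dataset.
train, val, test = split_80_10_10(range(9703))
print(len(train), len(val), len(test))  # 7762 970 971
```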

ML Classifier

This section outlines how to train your own ML classifier and how to replicate the results in data/model_comparison_results.csv, which contains the results of all experiments conducted.

1. Install dependencies

This project uses Python 3.12. It is recommended to create a virtual environment to install the dependencies.

    pip install -r requirements.txt

2. Run the experiments

To run all experiments and replicate the results found in data/model_comparison_results.csv, run the following command:

    python main.py --all --test

To run the experiment with the best subset of features, run the following command:

    python main.py

To run the experiment with all features, run the following command:

    python main.py --mode 1

A summary of all command line arguments is shown below:

| Flag | Type | Default | Variable | Description |
|------|------|---------|----------|-------------|
| --mode {0,1} | choice (0/1) | 0 | TRAIN_BEST_MODEL, TRAIN_ALL_FEATURES | Selects which model to train: 0 = train best model (default); 1 = train model with all features. |
| --all | boolean flag | false | RUN_ALL_EXPERIMENTS | If set, all experiments will be run (overrides --mode). Warning: may cause a segfault if memory is low. |
| --test | boolean flag | false | GET_TEST_PERFORMANCE | If set, evaluate on the test dataset; otherwise only validation is performed. |
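For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch reconstructed from the table. It is not the contents of main.py: the `dest` names and help strings are assumptions chosen to mirror the Variable column.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the flags documented in the table."""
    parser = argparse.ArgumentParser(description="Train thread popularity classifiers.")
    parser.add_argument(
        "--mode", type=int, choices=[0, 1], default=0,
        help="0: train best model (default); 1: train model with all features",
    )
    parser.add_argument(
        "--all", dest="run_all_experiments", action="store_true",
        help="run all experiments (overrides --mode)",
    )
    parser.add_argument(
        "--test", dest="get_test_performance", action="store_true",
        help="also evaluate on the test dataset",
    )
    return parser


# Parse explicit argument lists so the sketch does not depend on sys.argv.
defaults = build_parser().parse_args([])
full_run = build_parser().parse_args(["--all", "--test"])
print(defaults.mode, defaults.run_all_experiments, defaults.get_test_performance)
print(full_run.mode, full_run.run_all_experiments, full_run.get_test_performance)
```

Using `store_true` gives both boolean flags a default of false, which matches the Default column.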

Owner

Login: seancze
Kind: user

Repositories: 16
Profile: https://github.com/seancze

Citation (citation.cff)

cff-version: 1.2.0
type: dataset
message: "If you use this repository, please cite it as below."
authors:
  - family-names: "Chen"
    given-names: "Sean"
    orcid: "https://orcid.org/0000-0001-7160-4037"
title: "SGExams Data Analysis"
version: 1.0.0
doi: 10.5281/zenodo.15269590
date-released: 2025-04-23
url: "https://github.com/seancze/sgexams-data-analysis"

GitHub Events

Total

Release event: 1
Public event: 1
Push event: 1

Last Year

Release event: 1
Public event: 1
Push event: 1

Dependencies

requirements.txt pypi

Jinja2 ==3.1.6
joblib ==1.4.2
matplotlib ==3.10.1
numpy ==2.2.5
openpyxl ==3.1.5
pandas ==2.2.3
scikit_learn ==1.6.1
shap ==0.47.2
xgboost ==2.1.4
