Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.8%) to scientific vocabulary
Repository
SDG classification of texts using LDA topic model
Basic Info
- Host: GitHub
- Owner: SeaCelo
- License: gpl-3.0
- Language: Shell
- Default Branch: master
- Size: 429 MB
Statistics
- Stars: 7
- Watchers: 4
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SDGclassy
SDG classification of texts using LDA topic modeling
This tool classifies texts based on the 17 Sustainable Development Goals (SDGs). Each SDG is defined by training texts derived from official UN publications. Since February 2024, the training set also includes synthetic texts generated by ChatGPT, enhancing the classifier's coverage.
To learn more about the methodology, see:
- "Art is long, life is short: An SDG Classification System for DESA Publications".
- "Using large language models to help train machine learning SDG classifiers".
Note: This tool does not determine if a text is SDG-related. Instead, it calculates scores based on how well the text fits within the SDG vocabulary. For details, see the section "Interpreting the Results."
Requirements
- Mallet 2.0.8 (Download here)
- Text files to classify (in `.txt` format)
Text Preparation:
- Ensure your text files are cleaned to exclude irrelevant material (e.g., front matter). Cleaned data yields better classification results.
Supported Platforms:
- Mac OS X (Zsh shell recommended for newer macOS versions)
- Windows (requires additional configuration, see below)
- Linux
On Windows, the Mallet bigrams command may need fixing. Refer to this GitHub issue.
Installation
For Mac OS and UNIX:
Clone the repository:
```bash
git clone https://github.com/SeaCelo/SDGclassy.git SDGclassy
cd SDGclassy
chmod +x infer-scores.sh
```
Install Mallet inside the cloned SDGclassy directory:
```bash
wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
unzip mallet-2.0.8.zip
rm mallet-2.0.8.zip
```
Keeping Mallet in the same directory ensures all dependencies remain self-contained.
Download the required model files:
- `sdgclassy.mallet`: Download from Google Drive
- `inferring-temp.mallet`: Download from Google Drive
After downloading, manually place both files into the /SDGclassy/classifier/ directory.
These files are too large to be included in the repository, so they must be downloaded and added manually.
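Before running the classifier, it can help to confirm that everything landed where the scripts expect it. The sketch below assumes the layout described above: Mallet unzipped to `mallet-2.0.8/` inside the repository and both model files in `classifier/`. The function name `check_sdgclassy_install` is hypothetical and is not part of the repository.

```bash
# Hypothetical helper (not part of SDGclassy): report which expected files
# are present. Assumes Mallet was unzipped to mallet-2.0.8/ in the repo root.
check_sdgclassy_install() {
  local root="${1:-.}" missing=0 path
  for path in \
    "infer-scores.sh" \
    "mallet-2.0.8/bin/mallet" \
    "classifier/sdgclassy.mallet" \
    "classifier/inferring-temp.mallet"
  do
    if [ -e "$root/$path" ]; then
      echo "ok       $path"
    else
      echo "MISSING  $path"
      missing=1
    fi
  done
  # Exit status 0 when everything was found, 1 otherwise.
  return "$missing"
}
```

After defining the function, calling `check_sdgclassy_install .` from the repository root prints one line per file; a non-zero exit status means at least one file is missing.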
Usage
For Mac OS and UNIX:
Option 1: Use Predefined Directories
Prepare your text files:
- Ensure all files are in plain `.txt` format (no PDFs or directories).
- Clean the files by removing irrelevant material (e.g., front matter or unrelated content).
Place the files in the predefined input directory:
`/SDGclassy/target/input/`
Run the classification script:
```bash
./infer-scores.sh
```
Find the results in the predefined output directory:
`/SDGclassy/target/output/SDG-scores-out.txt`
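Because the input directory should contain only plain `.txt` files, a quick pre-flight check can catch stray PDFs or subdirectories before the script runs. This is a minimal sketch; the function name `find_non_txt_inputs` is hypothetical, and the default path assumes you are in the repository root.

```bash
# Hypothetical helper (not part of SDGclassy): list entries in the input
# directory that are not regular files named *.txt (subdirectories, PDFs,
# hidden files, and so on). Prints nothing when the directory is clean.
find_non_txt_inputs() {
  local input_dir="${1:-target/input}"
  # -mindepth 1 skips the directory itself; -maxdepth 1 stays at the top level.
  find "$input_dir" -mindepth 1 -maxdepth 1 \( ! -type f -o ! -name '*.txt' \)
}
```

Empty output from `find_non_txt_inputs` means every top-level entry is a `.txt` file. The check says nothing about whether the contents have been cleaned.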
Option 2: Specify Custom Directories
If your input files are stored elsewhere or you want to save results in a different location:
- Use the alternative script `infer-scores2.sh`.
Run the script with custom paths:
```bash
./infer-scores2.sh -i /path/to/your/input -o /path/to/your/output
```
Check the specified output directory for the results.
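If you have several collections of texts, the custom-path script can be called once per collection. The sketch below is an illustration only: `classify_each_corpus` is a hypothetical helper, it uses nothing beyond the `-i` and `-o` options shown above, and it assumes that each subdirectory of a parent folder is one collection and that creating the output folder in advance is harmless.

```bash
# Hypothetical helper (not part of SDGclassy): run the custom-path script once
# per subdirectory of a parent folder, writing each result set to its own folder.
# Arguments: path to infer-scores2.sh, parent input folder, parent output folder.
classify_each_corpus() {
  local script="$1" parent_in="$2" parent_out="$3" corpus name
  for corpus in "$parent_in"/*/; do
    [ -d "$corpus" ] || continue        # no subdirectories: nothing to do
    name="$(basename "$corpus")"
    mkdir -p "$parent_out/$name"
    # Stop at the first failure so a broken run is not silently skipped.
    "$script" -i "${corpus%/}" -o "$parent_out/$name" || return 1
  done
}
```

For example, `classify_each_corpus ./infer-scores2.sh /path/to/corpora /path/to/results` would process each folder under `/path/to/corpora` in turn (both paths are placeholders).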
For Windows:
Prepare your text files:
- Convert files to plain `.txt` format.
- Clean up irrelevant content.
Place the text files in the `/SDGclassy/target/input/` directory.
Run the script:
- Right-click `infer-scores.ps1` and select "Run with PowerShell".
Results will be saved in:
`/SDGclassy/target/output/SDG-scores-out.txt`
Interpreting the Results
- The output file `SDG-scores-out.txt` lists topics (0–18) and their corresponding scores. Each topic maps to an SDG, except one filter topic, which should be ignored.
- Use `/classifier/topic-sdg_mapping.csv` to match topics with SDGs.
- Scores do not sum to 100% due to the extra category. Rescale them if necessary for your analysis.
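Rescaling amounts to dropping the filter topic and dividing each remaining score by the sum of the remaining scores. The sketch below shows one way to do that with `awk`. It rests on assumptions that should be checked against your actual output file: that each line is tab-separated with a document index, a document name, and then one score per topic in topic order, and that header lines start with `#`. The function name `rescale_scores` is hypothetical, and the filter topic index must be looked up in `topic-sdg_mapping.csv`.

```bash
# Hypothetical helper (not part of SDGclassy): drop the filter topic and
# rescale the remaining scores so they sum to 1 for each document.
# ASSUMED layout: index <TAB> name <TAB> score_topic0 <TAB> score_topic1 ...
rescale_scores() {
  local scores_file="$1" filter_topic="$2"
  awk -F'\t' -v OFS='\t' -v skip="$filter_topic" '
    /^#/ { next }                          # ignore header or comment lines
    {
      total = 0
      for (i = 3; i <= NF; i++)            # field 3 holds topic 0
        if (i - 3 != skip) total += $i
      line = $1 OFS $2
      for (i = 3; i <= NF; i++) {
        if (i - 3 == skip) continue        # leave the filter topic out
        line = line OFS sprintf("%.4f", (total > 0 ? $i / total : 0))
      }
      print line
    }
  ' "$scores_file"
}
```

Note that the output has one column fewer than the input, so topic numbers above the filter topic shift left by one position.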
Additional Notes
- You can install Mallet elsewhere and adjust the scripts accordingly. Alternatively, add Mallet to your `$PATH` variable.
- If Mallet runs out of memory during processing, allocate more memory:
  - Navigate to the Mallet installation directory:
    ```bash
    cd /path/to/mallet-2.0.8/bin
    ```
  - Edit the `mallet` launcher script:
    ```bash
    nano mallet
    ```
  - Set the memory allocation:
    ```bash
    MEMORY=8g
    ```
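The memory setting can also be changed without opening an editor. The following is a minimal sketch that assumes the launcher script contains a line beginning with `MEMORY=`; the function name `set_mallet_memory` is hypothetical.

```bash
# Hypothetical helper: rewrite the MEMORY= line in the Mallet launcher script.
# Assumes the launcher contains a line that starts with "MEMORY=".
set_mallet_memory() {
  local launcher="$1" size="$2"
  if ! grep -q '^MEMORY=' "$launcher"; then
    echo "no MEMORY= line found in $launcher" >&2
    return 1
  fi
  # -i.bak edits in place and keeps a backup copy next to the original;
  # this spelling is accepted by both GNU sed (Linux) and BSD sed (macOS).
  sed -i.bak "s/^MEMORY=.*/MEMORY=${size}/" "$launcher"
}
```

For example, `set_mallet_memory /path/to/mallet-2.0.8/bin/mallet 8g` makes the same change as the manual steps and leaves the previous version in `mallet.bak`.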
Owner
- Name: Marcelo LaFleur
- Login: SeaCelo
- Kind: user
- Repositories: 7
- Profile: https://github.com/SeaCelo
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: SDGClassy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcelo T.
    family-names: LaFleur
    email: mtlafleur@gmail.com
repository-code: 'https://github.com/SeaCelo/SDGclassy'
abstract: >-
  SDG classification of texts using LDA topic model
  This script is based on my work to classify UN
  publications according to each SDG. This tool provides a
  way to easily compute "SDG scores" for individual or a
  collection of texts. Each SDG is defined by a collection
  of training texts for each of the 17 SDGs taken from
  official UN publications.
  To read the details of the methodology: "Art is long, life
  is short: An SDG Classification System for DESA
  Publications"
  (https://www.un.org/development/desa/publications/working-paper/wp159).
license: GPL-3.0
commit: 4cc71a631f9b8133e97ee6f9b6e8fafdfa01f517
date-released: '2022-10-12'
GitHub Events
Total
- Watch event: 1
- Delete event: 2
- Push event: 3
- Pull request event: 1
- Create event: 1
Last Year
- Watch event: 1
- Delete event: 2
- Push event: 3
- Pull request event: 1
- Create event: 1