sdgclassy

SDG classification of texts using LDA topic model

https://github.com/seacelo/sdgclassy

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: SeaCelo
  • License: gpl-3.0
  • Language: Shell
  • Default Branch: master
  • Size: 429 MB
Statistics
  • Stars: 7
  • Watchers: 4
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created almost 7 years ago · Last pushed about 1 year ago

README.md

SDGclassy

SDG classification of texts using LDA topic modeling

This tool classifies texts based on the 17 Sustainable Development Goals (SDGs). Each SDG is defined by training texts derived from official UN publications. Since February 2024, the training set also includes synthetic texts generated by ChatGPT, enhancing the classifier's coverage.

To learn more about the methodology, see:
- "Art is long, life is short: An SDG Classification System for DESA Publications".
- "Using large language models to help train machine learning SDG classifiers".

Note: This tool does not determine if a text is SDG-related. Instead, it calculates scores based on how well the text fits within the SDG vocabulary. For details, see the section "Interpreting the Results."


Requirements

  • Mallet 2.0.8 (Download here)
  • Text files to classify (in .txt format)

Text Preparation:

  • Ensure your text files are cleaned to exclude irrelevant material (e.g., front matter). Cleaned data yields better classification results.

Supported Platforms:

  • Mac OS X (Zsh shell recommended for newer macOS versions)
  • Windows (requires additional configuration, see below)
  • Linux

On Windows, the Mallet bigrams command may need fixing. Refer to this GitHub issue.


Installation

For Mac OS and UNIX:

  1. Clone the repository and make the script executable:

         git clone https://github.com/SeaCelo/SDGclassy.git SDGclassy
         cd SDGclassy
         chmod +x infer-scores.sh

  2. Install Mallet inside the cloned SDGclassy directory:

         wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
         unzip mallet-2.0.8.zip
         rm mallet-2.0.8.zip

    Keeping Mallet in the same directory ensures all dependencies remain self-contained.

  3. Download the required model files:

After downloading, manually place both files into the /SDGclassy/classifier/ directory.

These files are too large to be included in the repository, so they must be downloaded and added manually.
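
The installation steps above can be checked with a small helper before the first run. This is only a sketch: it assumes the layout described in this README (mallet-2.0.8/ and classifier/ inside the repository root), and the function name check_sdgclassy_setup is hypothetical. The model file names are not assumed; pass the names of the files you downloaded as extra arguments.

```shell
# Sketch: sanity-check an SDGclassy checkout.
# Usage: check_sdgclassy_setup ROOT [MODEL_FILE...]
# MODEL_FILE names are supplied by the caller; none are assumed here.
check_sdgclassy_setup() {
  sdg_root="${1:-.}"
  [ "$#" -gt 0 ] && shift
  sdg_status=0

  # The Mallet launcher lives in bin/ inside the unpacked archive.
  if [ ! -f "$sdg_root/mallet-2.0.8/bin/mallet" ]; then
    echo "missing: $sdg_root/mallet-2.0.8/bin/mallet" >&2
    sdg_status=1
  fi

  # The classification script needs its execute bit (see the chmod step).
  if [ ! -x "$sdg_root/infer-scores.sh" ]; then
    echo "not executable: $sdg_root/infer-scores.sh" >&2
    sdg_status=1
  fi

  # Each model file named by the caller must be present in classifier/.
  for sdg_model in "$@"; do
    if [ ! -f "$sdg_root/classifier/$sdg_model" ]; then
      echo "missing: $sdg_root/classifier/$sdg_model" >&2
      sdg_status=1
    fi
  done

  return "$sdg_status"
}
```

Run from the repository root, for example as check_sdgclassy_setup . first-model-file second-model-file, with the two placeholder names replaced by the files you actually downloaded. The function prints one line per problem and returns a non-zero status if anything is missing.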


Usage

For Mac OS and UNIX:

Option 1: Use Predefined Directories

  1. Prepare your text files:

    • Ensure all files are in plain .txt format (no PDFs or directories).
    • Clean the files by removing irrelevant material (e.g., front matter or unrelated content).
  2. Place the files in the predefined input directory: /SDGclassy/target/input/

  3. Run the classification script: ./infer-scores.sh

  4. Find the results in the predefined output directory: /SDGclassy/target/output/SDG-scores-out.txt
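
Since the input directory should hold only plain .txt files, a quick check before running the script can catch stray PDFs or subdirectories. The helper below is a minimal sketch; the name list_non_txt is hypothetical, and it relies on the -mindepth and -maxdepth options available in GNU and BSD find.

```shell
# Sketch: print every entry in the input directory that is not a regular *.txt file.
# Usage: list_non_txt [DIR]   (DIR defaults to target/input)
list_non_txt() {
  find "${1:-target/input}" -mindepth 1 -maxdepth 1 \
    ! \( -type f -name '*.txt' \) -print
}
```

An empty result from list_non_txt target/input means the directory contains only .txt files. Hidden files such as .DS_Store are reported too.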

Option 2: Specify Custom Directories

  1. If your input files are stored elsewhere or you want to save results in a different location:

    • Use the alternative script infer-scores2.sh.
  2. Run the script with custom paths: ./infer-scores2.sh -i /path/to/your/input -o /path/to/your/output

  3. Check the specified output directory for the results.
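
When calling the custom-directory script from other tooling, it can help to validate the paths first. The wrapper below is a sketch, not part of the repository: the name run_custom is hypothetical, and it creates the output directory up front because this README does not say whether infer-scores2.sh does so itself. Only the -i and -o flags shown above are used.

```shell
# Sketch: validate paths, then call infer-scores2.sh with -i and -o.
# Usage: run_custom INPUT_DIR OUTPUT_DIR [SCRIPT]
# SCRIPT defaults to ./infer-scores2.sh in the current directory.
run_custom() {
  sdg_in="$1"
  sdg_out="$2"
  sdg_script="${3:-./infer-scores2.sh}"

  if [ ! -d "$sdg_in" ]; then
    echo "input directory not found: $sdg_in" >&2
    return 1
  fi

  # Assumption: creating the output directory beforehand is harmless.
  mkdir -p "$sdg_out" || return 1

  "$sdg_script" -i "$sdg_in" -o "$sdg_out"
}
```

For example, run_custom /path/to/your/input /path/to/your/output from the repository root.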


For Windows:

  1. Prepare your text files:

    • Convert files to plain .txt format.
    • Clean up irrelevant content.
  2. Place the text files in the /SDGclassy/target/input/ directory.

  3. Run the script:

    • Right-click infer-scores.ps1 and select "Run with PowerShell".
  4. Results will be saved in: /SDGclassy/target/output/SDG-scores-out.txt


Interpreting the Results

  • The output file SDG-scores-out.txt lists topics (0–18) and their corresponding scores. Each topic maps to an SDG, except one filter topic, which should be ignored.
  • Use /classifier/topic-sdg_mapping.csv to match topics with SDGs.
  • Part of each text's score falls on the filter topic, so the SDG scores alone do not sum to 100%. Rescale them if your analysis needs them to.
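
Rescaling can be done with a short awk pass: drop the filter topic and divide each remaining score by their sum. The sketch below makes assumptions about the file layout that this README does not confirm: tab-separated lines with a document index, a document name, then one score per topic in topic order, and optional header lines starting with "#". The function name rescale_sdg_scores is hypothetical, and the filter topic index must be looked up in /classifier/topic-sdg_mapping.csv.

```shell
# Sketch: rescale per-document scores so the non-filter topics sum to 1.
# Usage: rescale_sdg_scores SCORES_FILE FILTER_TOPIC_INDEX
# Assumed layout: index <TAB> name <TAB> score_0 <TAB> score_1 ...
rescale_sdg_scores() {
  awk -F '\t' -v OFS='\t' -v skip="$2" '
    /^#/ { next }                          # ignore header or comment lines
    {
      total = 0
      for (i = 3; i <= NF; i++)            # field i holds topic i - 3
        if (i - 3 != skip) total += $i
      line = $2                            # start the output with the document name
      for (i = 3; i <= NF; i++) {
        if (i - 3 == skip || $i == "") continue
        line = line OFS (i - 3) ":" sprintf("%.4f", total > 0 ? $i / total : 0)
      }
      print line
    }
  ' "$1"
}
```

Each output line holds the document name followed by topic:score pairs. A call looks like rescale_sdg_scores target/output/SDG-scores-out.txt N, where N is the filter topic index taken from the mapping file.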

Additional Notes

  • You can install Mallet elsewhere and adjust the scripts accordingly. Alternatively, add Mallet to your $PATH variable.
  • If Mallet runs out of memory during processing, allocate more memory:
    1. Navigate to the Mallet installation directory: cd /path/to/mallet-2.0.8/bin
    2. Open the mallet launcher in an editor (it is a shell script, not a compiled binary): nano mallet
    3. Set the memory allocation: MEMORY=8g
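
The memory edit above can also be scripted. This is a minimal sketch that assumes the launcher contains a line starting with MEMORY=, as in the steps above. The function name set_mallet_memory is hypothetical. The -i.bak form of sed keeps a backup copy and works with both GNU and BSD sed.

```shell
# Sketch: set the MEMORY value in the Mallet launcher script in place.
# Usage: set_mallet_memory LAUNCHER SIZE   (e.g. SIZE=8g)
# A backup of the original file is kept as LAUNCHER.bak.
set_mallet_memory() {
  sed -i.bak "s/^MEMORY=.*/MEMORY=$2/" "$1"
}
```

For example, set_mallet_memory /path/to/mallet-2.0.8/bin/mallet 8g. If the launcher has no line beginning with MEMORY=, the file is left unchanged.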

Owner

  • Name: Marcelo LaFleur
  • Login: SeaCelo
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: SDGClassy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Marcelo T.
    family-names: LaFleur
    email: mtlafleur@gmail.com
repository-code: 'https://github.com/SeaCelo/SDGclassy'
abstract: >-
  SDG classification of texts using LDA topic model


  This script is based on my work to classify UN
  publications according to each SDG. This tool provides a
  way to easily compute "SDG scores" for individual or a
  collection of texts. Each SDG is defined by a collection
  of training texts for each of the 17 SDGs taken from
  official UN publications.


  To read the details of the methodology: "Art is long, life
  is short: An SDG Classification System for DESA
  Publications"
  (https://www.un.org/development/desa/publications/working-paper/wp159).
license: GPL-3.0
commit: 4cc71a631f9b8133e97ee6f9b6e8fafdfa01f517
date-released: '2022-10-12'

GitHub Events

Total
  • Watch event: 1
  • Delete event: 2
  • Push event: 3
  • Pull request event: 1
  • Create event: 1
Last Year
  • Watch event: 1
  • Delete event: 2
  • Push event: 3
  • Pull request event: 1
  • Create event: 1