Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: victorhu3
  • License: cc0-1.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 10.8 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Code of conduct Citation Codeowners Support

README.md

NAICS Repository Classification

This repository contains Jupyter notebooks and datasets for classifying GitHub repositories into 2022 North American Industry Classification System (NAICS) codes. Pull requests and suggestions are welcome.

Data

Data collection was automated through the GitHub REST API.

See feature_extraction for the data collection scripts, data processing scripts, and datasets. For convenience, finished_dataset contains the data already processed with various embedding models.
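
As a rough illustration of the collection step, the sketch below builds the two GitHub REST API endpoints that expose a repository's description, topics, and README, and extracts those fields from already-decoded JSON responses. The function names `repo_endpoints` and `extract_features` are hypothetical and do not come from the scripts in feature_extraction; fetching, authentication, and rate limiting are deliberately left out.

```python
import base64

API_ROOT = "https://api.github.com"


def repo_endpoints(owner: str, repo: str) -> dict:
    """Return the REST endpoints for repository metadata and the README.

    Hypothetical helper: the actual scripts may structure requests differently.
    """
    base = f"{API_ROOT}/repos/{owner}/{repo}"
    return {"metadata": base, "readme": f"{base}/readme"}


def extract_features(metadata: dict, readme: dict) -> dict:
    """Extract text features from decoded JSON responses.

    `metadata` is the body of GET /repos/{owner}/{repo};
    `readme` is the body of GET /repos/{owner}/{repo}/readme,
    whose `content` field is base64-encoded.
    """
    raw = readme.get("content", "")
    text = base64.b64decode(raw).decode("utf-8", errors="replace") if raw else ""
    return {
        # The API returns null for a missing description; normalize to "".
        "description": metadata.get("description") or "",
        "topics": metadata.get("topics", []),
        "readme": text,
    }
```

The responses can be fetched with any HTTP client; keeping the parsing separate from the network calls makes the extraction easy to test on saved payloads.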

Models

See baseline_models for classification attempts using similarity-based approaches. See the models folder for the neural network, linear regression, and UMAP and clustering approaches.
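
For readers unfamiliar with similarity-based classification, here is a minimal sketch of one common variant: assign a repository the NAICS code whose reference embedding has the highest cosine similarity to the repository's embedding. This is an assumption about the general technique, not a description of the notebooks in baseline_models; the names `cosine` and `classify` and the toy two-dimensional vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(query, references):
    """Return the code whose reference embedding is most similar to `query`.

    `references` maps a NAICS code (str) to one embedding per code,
    for example an embedding of the official sector description.
    """
    return max(references, key=lambda code: cosine(query, references[code]))


# Toy example with made-up 2-D embeddings.
references = {"11": [1.0, 0.0], "51": [0.0, 1.0]}
predicted = classify([0.2, 0.9], references)
```

In practice the vectors would come from an embedding model applied to the repository description, README, or topics.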

Remarks

  • Repository description, README, and GitHub topics were found to be the most predictive features for classifying repositories into NAICS codes.
  • More novel features, such as images and organization names, are potentially helpful but difficult to generalize across all repositories. For instance, the classes of images are extremely broad (logos, artwork, real-life, abstract).
  • Currently, the trained models classify into the 20 top-level NAICS codes. Supporting more specific industry sub-codes (e.g., 11111 - Soybean Farming) would significantly expand the set of possible classes.
  • The data suffers from class imbalance. Certain codes are overrepresented (51 - Information), while quality data for others (55 - Management of Companies) is difficult to find.
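
One standard way to reduce the effect of the class imbalance noted above is to weight each class inversely to its frequency during training. The sketch below computes such weights with the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). It is an illustrative assumption, not something the models in this repository are known to use, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {code: n_samples / (n_classes * count) for code, count in counts.items()}


# Toy label set: sector 51 is overrepresented, sector 55 is rare.
labels = ["51"] * 8 + ["55"] * 2
weights = balanced_class_weights(labels)
```

Rare classes receive weights above 1 and common classes weights below 1, so errors on underrepresented codes count for more in a weighted loss.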

License

This project is released under CC0-1.0.

Maintainers

See CODEOWNERS

Support

See SUPPORT

Owner

  • Login: victorhu3
  • Kind: user
