naics-repository-classification
https://github.com/victorhu3/naics-repository-classification
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: victorhu3
- License: cc0-1.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 10.8 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
NAICS Repository Classification
This repository contains Jupyter notebooks and datasets for classifying GitHub repositories into 2022 North American Industry Classification System (NAICS) codes. Pull requests and suggestions are welcome.
Data
Data collection was automated through the GitHub REST API.
See feature_extraction for data collection scripts, data processing scripts, and datasets. See finished_dataset for the data post-processed with various embedding models for convenience.
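As a rough illustration of what collection through the GitHub REST API can look like, the sketch below fetches a repository's description, topics, and README. It is not the code in feature_extraction: the function names (fetch_json, extract_features, collect) are hypothetical, and it assumes the standard `GET /repos/{owner}/{repo}` and `GET /repos/{owner}/{repo}/readme` endpoints, where the README body arrives base64-encoded.

```python
import base64
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_json(path, token=None):
    """GET a GitHub REST API path and decode the JSON body."""
    request = urllib.request.Request(API_ROOT + path)
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # A personal access token raises the rate limit; optional here.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_features(repo_payload, readme_payload):
    """Pull description, topics, and README text out of fetched payloads."""
    readme_b64 = readme_payload.get("content", "")
    # The README endpoint returns base64 text, possibly with line breaks.
    readme_text = base64.b64decode(readme_b64).decode("utf-8", errors="replace")
    return {
        "description": repo_payload.get("description") or "",
        "topics": repo_payload.get("topics", []),
        "readme": readme_text,
    }


def collect(owner, repo, token=None):
    """Hypothetical helper: fetch both payloads for one repository."""
    repo_payload = fetch_json(f"/repos/{owner}/{repo}", token)
    readme_payload = fetch_json(f"/repos/{owner}/{repo}/readme", token)
    return extract_features(repo_payload, readme_payload)
```

Calling `collect("victorhu3", "naics-repository-classification")` would return a dictionary with the three text fields. Unauthenticated requests are rate limited, so a token is advisable for bulk collection.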
Models
See baseline_models for classification attempts using similarity-based approaches. See the models folder for the neural network, linear regression, and UMAP and clustering approaches.
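To give a feel for a similarity-based baseline, here is a minimal sketch that scores a repository's text against a short description of each NAICS sector and picks the closest one. It is an assumption-laden stand-in, not the method in baseline_models: it uses plain bag-of-words counts with cosine similarity instead of an embedding model, covers only three sectors, and the keyword strings in SECTOR_TEXT are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Hypothetical keyword descriptions for three of the 20 NAICS sectors.
SECTOR_TEXT = {
    "11": "agriculture forestry fishing hunting crop farming livestock",
    "51": "information software publishing telecommunications data web",
    "52": "finance insurance banking credit securities investment",
}


def classify(repo_text, sector_text=SECTOR_TEXT):
    """Return the best-matching sector code and all similarity scores."""
    repo_vector = bag_of_words(repo_text)
    scores = {
        code: cosine(repo_vector, bag_of_words(text))
        for code, text in sector_text.items()
    }
    return max(scores, key=scores.get), scores
```

Swapping bag_of_words for an embedding model would keep the same structure: embed the repository text, embed each sector description, and rank by cosine similarity.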
Remarks
- Repository description, README, and GitHub topics were found to be the most predictive features for classifying into NAICS codes.
- More novel features like images and organization names are potentially helpful but difficult to generalize for all repos. For instance, classes of images are extremely broad (logos, artwork, real-life, abstract).
- Currently, the trained models can classify into the 20 overarching NAICS codes. More specific industry sub-codes (e.g., 11111 - Soybean Farming) would significantly expand the possible classes.
- The data suffers from class imbalance. Certain codes are overrepresented (51 - Information) while other codes (55 - Management of Companies) are difficult to find quality data for.
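One common way to soften the class imbalance noted above is to weight each class inversely to its frequency during training. The sketch below is only an illustration of that idea, not something the notebooks are known to do: the function name and the label counts are hypothetical, and the formula is the usual "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution: sector 51 dominates, 55 is rare.
labels = ["51"] * 8 + ["11"] * 1 + ["55"] * 1
weights = balanced_class_weights(labels)
```

With these made-up counts, the rare sectors receive a larger weight than sector 51, so errors on them contribute more to a weighted loss.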
License
This project is released under CC0-1.0.
Maintainers
See CODEOWNERS
Support
See SUPPORT
Owner
- Login: victorhu3
- Kind: user
- Repositories: 6
- Profile: https://github.com/victorhu3