ptmtorrent

Code to generate the PTMTorrent dataset

https://github.com/softwaresystemslaboratory/ptmtorrent

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization softwaresystemslaboratory has institutional domain (ssl.cs.luc.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

model-hub pre-trained-model ptm torrent
Last synced: 6 months ago

Repository

Code to generate the PTMTorrent dataset

Basic Info
  • Host: GitHub
  • Owner: SoftwareSystemsLaboratory
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Size: 746 KB
Statistics
  • Stars: 10
  • Watchers: 0
  • Forks: 11
  • Open Issues: 3
  • Releases: 7
Created about 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation Zenodo

README.md

PTMTorrent

Code to generate the PTMTorrent dataset



About

This repository contains the scripts to generate the PTMTorrent dataset. The dataset contains the git repositories of pre-trained machine learning models (PTMs) hosted on popular model hubs. It also includes supporting metadata from each model hub, as well as standardized metadata specified by this JSON Schema.

Supported Model Hubs

The following model hubs are supported by our software:

  • Hugging Face
  • Modelhub
  • ModelZoo
  • ONNX Model Zoo
  • PyTorch Hub

Dependencies

This project is dependent upon the following software:

Package dependencies are given in pyproject.toml and handled by Poetry.

How To Install

To run this project, it must be packaged and installed first.

The package can either be installed from our GitHub Releases or built and installed From Source.

From GitHub Releases

  1. Download the latest .tar.gz or .whl file from the project's GitHub Releases page.
  2. Install via pip: python3.10 -m pip install ptm_torrent-*

From Source

These instructions were written for Linux operating systems.

  1. Clone the project locally: git clone https://github.com/SoftwareSystemsLaboratory/PTM-Torrent
  2. cd into the project: cd PTM-Torrent
  3. Create a Python 3.10 virtual environment: python3.10 -m venv env
  4. Activate virtual environment: source env/bin/activate
  5. Upgrade pip: python -m pip install --upgrade pip
  6. Install poetry: python -m pip install poetry
  7. Install Python dependencies through poetry: python -m poetry install
  8. Build with poetry: python -m poetry build
  9. Install with pip: python -m pip install dist/ptm_torrent*.tar.gz
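
Once the steps above finish, a quick check from Python can confirm that the interpreter is recent enough and that the package is visible to pip. This is a minimal sketch and not part of the project: it assumes the installed distribution is named `ptm_torrent` (inferred from the `ptm_torrent-*` archive name), and `check_environment` is a hypothetical helper.

```python
import sys
from importlib import metadata


def check_environment(dist_name: str = "ptm_torrent") -> dict:
    """Report whether the interpreter and the installed package look usable."""
    try:
        # Raises PackageNotFoundError when the distribution is not installed.
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {
        # The install instructions target Python 3.10.
        "python_ok": sys.version_info >= (3, 10),
        "installed_version": version,
    }


if __name__ == "__main__":
    print(check_environment())
```

An `installed_version` of `None` suggests the final `pip install` step did not run inside the active virtual environment.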

How to Run

After installing the package, this project can be run as individual scripts per model hub.

As Individual Scripts

Each model hub's scripts are separated by directory in the ptm_torrent directory. The scripts directory, the main runner script, and the download size for each model hub are listed in the table below:

| Model Hub      | Scripts Directory          | Script Name   | Download Size |
| -------------- | -------------------------- | ------------- | ------------- |
| Hugging Face   | ptm_torrent/huggingface    | __main__.py   | 61 TB         |
| Modelhub       | ptm_torrent/modelhub       | __main__.py   | 721 MB        |
| ModelZoo       | ptm_torrent/modelzoo       | __main__.py   | 151 GB        |
| ONNX Model Zoo | ptm_torrent/onnxmodelzoo   | __main__.py   | 441 MB        |
| PyTorch Hub    | ptm_torrent/pytorchhub     | __main__.py   | 1.5 GB        |
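
Given the download sizes in the table, it may be worth checking free disk space before starting a run. The sketch below is illustrative and not part of the project: the dictionary keys are the hub directory names, the sizes are copied from the table and interpreted as decimal units (an assumption), and `has_enough_space` is a hypothetical helper.

```python
import shutil

# Approximate download sizes copied from the table above, in bytes.
# Decimal units (1 GB = 10**9 bytes) are an assumption.
DOWNLOAD_SIZE_BYTES = {
    "huggingface": 61 * 10**12,
    "modelhub": 721 * 10**6,
    "modelzoo": 151 * 10**9,
    "onnxmodelzoo": 441 * 10**6,
    "pytorchhub": 1500 * 10**6,
}


def has_enough_space(hub: str, path: str = ".") -> bool:
    """Return True if the filesystem holding `path` has room for `hub`."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= DOWNLOAD_SIZE_BYTES[hub]


if __name__ == "__main__":
    for hub in DOWNLOAD_SIZE_BYTES:
        print(hub, has_enough_space(hub))
```

The check only compares free space against the listed download size; metadata files and any temporary files are not accounted for.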

There are other supporting scripts within each model hub's scripts directory. These scripts are run in order by the model hub's __main__.py file. The order in which to run these scripts (should the __main__.py file be insufficient) is described in each model hub's README.md file within the scripts directory.

NOTE: Hugging Face's __main__.py can be parameterized to download only a specific percentage of the model hub. By default, it downloads the first 0.1 (10%) of models, sorted by downloads in descending order.
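
To make the selection rule in the note concrete, here is a minimal sketch of "take the top fraction of models by downloads". It is not the project's implementation: the `downloads` key, the `top_fraction` helper, and the choice to round up are all assumptions made for illustration.

```python
import math


def top_fraction(models: list[dict], fraction: float = 0.1) -> list[dict]:
    """Return the most-downloaded `fraction` of `models`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Sort by download count, most downloaded first.
    ranked = sorted(models, key=lambda m: m["downloads"], reverse=True)
    # Round the product first to absorb floating-point noise, then round up
    # so that any non-empty list yields at least one model.
    keep = math.ceil(round(len(ranked) * fraction, 9))
    return ranked[:keep]


if __name__ == "__main__":
    sample = [{"id": f"model-{i}", "downloads": i} for i in range(10)]
    print(top_fraction(sample, 0.3))
```

With the default `fraction=0.1`, a list of 10 models yields only the single most-downloaded entry.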

To run any of the scripts, execute the following command pattern:

  • python 'Scripts Directory'/'Script Name'

For example, to run Hugging Face's scripts:

  • python ptm_torrent/huggingface/__main__.py
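
The same command pattern can be driven from Python, for example to run several hubs in a loop. This is a sketch under stated assumptions: it assumes every hub's scripts directory sits under `ptm_torrent` in the current working directory, and `build_command` and `run_hub` are hypothetical helpers, not part of the package.

```python
import subprocess
import sys
from pathlib import Path

# Hub names mirror the script directories described in this section.
HUBS = ("huggingface", "modelhub", "modelzoo", "onnxmodelzoo", "pytorchhub")


def build_command(hub: str, package_dir: str = "ptm_torrent") -> list[str]:
    """Build the `python <Scripts Directory>/<Script Name>` command for `hub`."""
    if hub not in HUBS:
        raise ValueError(f"unknown model hub: {hub!r}")
    script = Path(package_dir) / hub / "__main__.py"
    # sys.executable keeps the run inside the active virtual environment.
    return [sys.executable, str(script)]


def run_hub(hub: str) -> int:
    """Run one hub's scripts and return the process exit code."""
    return subprocess.run(build_command(hub), check=False).returncode


if __name__ == "__main__":
    print(build_command("huggingface"))
```

Calling `run_hub("modelhub")` starts the actual download, so `build_command` is the safer function to experiment with first.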

Data Representation

Each model hub script generates the following directory structure per model hub:

📦data
 ┗ 📂MODELHUB
   ┣ 📂html
   ┃ ┗ 📂metadata
   ┃   ┗ 📂models
   ┣ 📂json
   ┃ ┗ 📂metadata
   ┃   ┗ 📂models
   ┗ 📂repos
     ┗ 📂AUTHOR
       ┗ 📂MODEL

This directory structure is generated relative to where the script is run from. Example: if the script is run from the home directory (~), then the data directory is stored at ~/data.

Where:

  • data/MODELHUB takes the name of the Python module directory that contains the script.
  • data/MODELHUB/repos/AUTHOR is the author name of the repository that was cloned.
  • data/MODELHUB/repos/AUTHOR/MODEL is the name of the repository that was cloned.
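
The layout above can be walked with `pathlib` to enumerate what has been cloned. The sketch below is illustrative only: `list_cloned_repos` is a hypothetical helper, and it relies solely on the `repos/AUTHOR/MODEL` nesting described in this section.

```python
from pathlib import Path


def list_cloned_repos(data_root: str, hub: str) -> list[tuple[str, str]]:
    """Return sorted (AUTHOR, MODEL) pairs found under <data_root>/<hub>/repos."""
    repos_dir = Path(data_root) / hub / "repos"
    pairs: list[tuple[str, str]] = []
    if not repos_dir.is_dir():
        # Nothing has been downloaded for this hub yet.
        return pairs
    for author_dir in repos_dir.iterdir():
        if not author_dir.is_dir():
            continue
        for model_dir in author_dir.iterdir():
            if model_dir.is_dir():
                pairs.append((author_dir.name, model_dir.name))
    return sorted(pairs)


if __name__ == "__main__":
    for author, model in list_cloned_repos("data", "huggingface"):
        print(f"{author}/{model}")
```

Here `data_root` is the data directory itself, for example `~/data` when the scripts were run from the home directory.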

Model hub scripts do not overwrite the data directory, so it is safe to run multiple model hub scripts from the same directory, either sequentially or concurrently.
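
One common way to get this non-overwriting behavior is to create directories idempotently. The sketch below shows that pattern only; it is an assumption about how such a guarantee can be achieved, not a description of the project's code, and `ensure_layout` is a hypothetical helper that creates just the top-level html, json, and repos directories.

```python
from pathlib import Path


def ensure_layout(hub: str, root: str = ".") -> Path:
    """Create data/<hub>/{html,json,repos} under `root`, keeping existing files."""
    hub_dir = Path(root) / "data" / hub
    for name in ("html", "json", "repos"):
        # exist_ok=True turns a repeated call into a no-op instead of an error,
        # and parents=True creates the shared data directory on first use.
        (hub_dir / name).mkdir(parents=True, exist_ok=True)
    return hub_dir


if __name__ == "__main__":
    print(ensure_layout("modelhub"))
    print(ensure_layout("modelzoo"))
```

Because each hub writes only beneath its own `data/<hub>` directory, calls for different hubs do not touch each other's files.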

Specifics about the types of metadata files and content that are produced by the scripts can be found in each model hub's script directory's README.md file.

Pre-Packaged Dataset

An existing dataset is available on this Purdue University Globus share.

If you are unfamiliar with Globus, we prepared a guide in the globus-docs/ directory.

Example Usage of Dataset

An example usage of the dataset is described within the example directory.

How to Cite


This project has a DOI on Zenodo. Please visit our Zenodo page for the latest citation information.

References

References are sorted alphabetically, not in the order in which they appear in this document.

[1] “Git.” https://git-scm.com/ (accessed Jan. 25, 2023).

[2] “Git Large File Storage,” Git Large File Storage. https://git-lfs.com/ (accessed Jan. 25, 2023).

[3] “Hugging Face – The AI community building the future.,” Jan. 03, 2023. https://huggingface.co/ (accessed Jan. 25, 2023).

[4] “Model Zoo - Deep learning code and pretrained models for transfer learning, educational purposes, and more.” https://modelzoo.co/ (accessed Jan. 25, 2023).

[5] “Modelhub.” https://modelhub.ai/ (accessed Jan. 25, 2023).

[6] “MSR 2023 - Data and Tool Showcase Track - MSR 2023.” https://conf.researchr.org/track/msr-2023/msr-2023-data-showcase (accessed Jan. 25, 2023).

[7] “ONNX Model Zoo.” Open Neural Network Exchange, Jan. 25, 2023. Accessed: Jan. 25, 2023. [Online]. Available: https://github.com/onnx/models

[8] “pip documentation v22.3.1.” https://pip.pypa.io/en/stable/ (accessed Jan. 25, 2023).

[9] “Poetry - Python dependency management and packaging made easy.” https://python-poetry.org/ (accessed Jan. 25, 2023).

[10] “Python Release Python 3.10.9,” Python.org. https://www.python.org/downloads/release/python-3109/ (accessed Jan. 25, 2023).

[11] “PyTorch Hub.” https://www.pytorch.org/hub (accessed Jan. 25, 2023).

[12] W. Jiang et al., “SoftwareSystemsLaboratory/PTMTorrent.” Zenodo, Jan. 25, 2023. doi: 10.5281/zenodo.7570357.

Owner

  • Name: Software and Systems Laboratory
  • Login: SoftwareSystemsLaboratory
  • Kind: organization
  • Email: ssl@cs.luc.edu
  • Location: Loyola University Chicago

Fostering innovation via experimentation and collaboration with an emphasis on openness.


GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 28
  • Average time to close issues: about 3 hours
  • Average time to close pull requests: about 11 hours
  • Total issue authors: 1
  • Total pull request authors: 8
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.04
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • davisjam (4)
Pull Request Authors
  • NicholasSynovic (15)
  • tschorlemmer (3)
  • pjjajal (3)
  • bhavesh-pareek (2)
  • dependabot[bot] (2)
  • Wenxin-Jiang (1)
  • davisjam (1)
  • AravTewari (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (2)

Dependencies

.github/workflows/release.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • softprops/action-gh-release v0.1.14 composite
example/requirements.txt pypi
  • clime-metrics *
poetry.lock pypi
  • black 22.12.0 develop
  • click 8.1.3 develop
  • isort 5.11.4 develop
  • mypy-extensions 0.4.3 develop
  • pathspec 0.10.3 develop
  • platformdirs 2.6.2 develop
  • tomli 2.0.1 develop
  • beautifulsoup4 4.11.1
  • bs4 0.0.1
  • certifi 2022.12.7
  • charset-normalizer 3.0.1
  • colorama 0.4.6
  • filelock 3.9.0
  • huggingface-hub 0.11.1
  • idna 3.4
  • lxml 4.9.2
  • markdown 3.4.1
  • numpy 1.24.1
  • packaging 23.0
  • pandas 1.5.3
  • progress 1.6
  • python-dateutil 2.8.2
  • pytz 2022.7.1
  • pyyaml 6.0
  • requests 2.28.2
  • six 1.16.0
  • soupsieve 2.3.2.post1
  • tqdm 4.64.1
  • typing-extensions 4.4.0
  • urllib3 1.26.14
pyproject.toml pypi
  • bs4 ^0.0.1
  • huggingface-hub ^0.11.1
  • lxml ^4.9.2
  • markdown ^3.4.1
  • pandas ^1.5.2
  • progress ^1.6
  • python ^3.10
  • requests ^2.28.2