hfcommunity

HFCommunity offers an offline, up-to-date relational database built from the data available at the Hugging Face Hub, providing queryable data about the repositories hosted in the Hub.

https://github.com/som-research/hfcommunity

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization som-research has institutional domain (som-research.uoc.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

data-science database dataset huggingface
Last synced: 9 months ago

Repository

Basic Info
Statistics
  • Stars: 15
  • Watchers: 4
  • Forks: 2
  • Open Issues: 0
  • Releases: 1
Topics
data-science database dataset huggingface
Created almost 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Code of conduct Citation Governance

README.md

HFCommunity

HFCommunity is a dataset built via a data collection process relying on the Hugging Face Hub (HFH) API and Git.

The HFCommunity dataset is provided as a relational database, so it can be queried with SQL to enable empirical analysis of ML projects.
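
As a rough illustration of the kind of analysis a relational dataset enables, the sketch below runs a SQL aggregation from Python. To stay self-contained it uses an in-memory SQLite database with a hypothetical, simplified two-table schema (`repository` and `discussion`) and made-up rows. The actual HFCommunity schema, table names, and database engine are described in the project documentation and may differ.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only;
# it is NOT the real HFCommunity schema.
SCHEMA = """
CREATE TABLE repository (
    id   TEXT PRIMARY KEY,
    type TEXT NOT NULL              -- 'model', 'dataset' or 'space'
);
CREATE TABLE discussion (
    num             INTEGER NOT NULL,
    repo_id         TEXT NOT NULL REFERENCES repository(id),
    is_pull_request INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO repository VALUES (?, ?)",
        [("org/model-a", "model"), ("org/model-b", "model"), ("org/data-c", "dataset")],
    )
    conn.executemany(
        "INSERT INTO discussion VALUES (?, ?, ?)",
        [(1, "org/model-a", 0), (2, "org/model-a", 1), (1, "org/data-c", 0)],
    )
    conn.commit()
    return conn


def discussions_per_type(conn):
    """Count discussions per repository type, keeping types with none."""
    query = """
        SELECT r.type, COUNT(d.num) AS n_discussions
        FROM repository AS r
        LEFT JOIN discussion AS d ON d.repo_id = r.id
        GROUP BY r.type
        ORDER BY r.type
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for repo_type, n_discussions in discussions_per_type(build_demo_db()):
        print(repo_type, n_discussions)
```

The `LEFT JOIN` keeps repositories without discussions in the result, which matters when comparing activity across repository types.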

The following figure shows the architecture of HFCommunity.

[Figure: HFCommunity Architecture]

As can be seen, HFCommunity is composed of two main components:

  • Dataset Extractor. The Dataset Extractor includes extractors for the different HFH data elements (i.e., datasets, models, and spaces) and a database importer that stores the extracted data. The database importer follows the conceptual schema for HFCommunity, which includes the main entities and relationships needed to query HFH data (e.g., model, dataset, space, issue, or discussion elements).

  • Website. The Website is a web application that hosts the main technical documentation of the tool and the latest HFCommunity dataset dumps for download. A new HFCommunity release is published every month.
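
The split between extractors and a database importer described above can be sketched roughly as follows. This is not the project's actual code: the `RepoRow` record, the `to_row` and `import_rows` helpers, and the single `repository` table are hypothetical, and SQLite stands in for the real database. `fetch_models` relies on `HfApi.list_models` from the `huggingface_hub` client, assumes the returned objects expose `id`, `author`, and `likes`, needs network access, and only runs when the script is executed directly.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class RepoRow:
    """Flat record handed from an extractor to the importer (hypothetical)."""
    repo_id: str
    repo_type: str
    author: Optional[str]
    likes: int


def to_row(info, repo_type: str) -> RepoRow:
    """Extractor step: map an API info object to a flat row."""
    return RepoRow(
        repo_id=info.id,
        repo_type=repo_type,
        author=getattr(info, "author", None),
        likes=getattr(info, "likes", 0) or 0,  # treat a missing value as 0
    )


def import_rows(conn: sqlite3.Connection, rows: Iterable[RepoRow]) -> int:
    """Importer step: upsert rows into a hypothetical `repository` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repository ("
        "id TEXT PRIMARY KEY, type TEXT, author TEXT, likes INTEGER)"
    )
    batch = [astuple(row) for row in rows]
    conn.executemany("INSERT OR REPLACE INTO repository VALUES (?, ?, ?, ?)", batch)
    conn.commit()
    return len(batch)


def fetch_models(limit: int = 5) -> Iterator[RepoRow]:
    """Pull a few models from the Hub (requires network access)."""
    from huggingface_hub import HfApi  # third-party; imported lazily

    for info in HfApi().list_models(limit=limit):
        yield to_row(info, "model")


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(import_rows(connection, fetch_models()), "rows imported")
```

Keeping the mapping (`to_row`) separate from the storage step (`import_rows`) means the same importer could serve the dataset and space extractors as well.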

Dataset Extractor

The Dataset Extractor has been developed in Python and is in charge of importing the HFH data into the HFCommunity dataset.

To execute the Dataset Extractor please refer to the docs.

Website

The website of HFCommunity is located at https://som-research.github.io/HFCommunity/.

The technical documentation of the tool is located at https://som-research.github.io/HFCommunity/docs/index.html.

How to cite HFCommunity

This repository includes a CITATION.cff file, which activates the "Cite this repository" button in the About section (right side of the repository page). The citation is available in APA and BibTeX formats.

Contributing

This project is part of a research line of the SOM Research Lab and the BESSER project, but we are open to contributions from the community. Any comment is more than welcome!

If you are interested in contributing to this project, please read the CONTRIBUTING.md file.

Code of Conduct

At SOM Research Lab and BESSER we are dedicated to creating and maintaining welcoming, inclusive, safe, and harassment-free development spaces. Anyone participating is subject to, and agrees to abide by, our Code of Conduct.

Governance

The development and community management of this project follows the governance rules described in the GOVERNANCE.md document.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The CC BY-SA license allows users to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

Owner

  • Name: SOM Research Lab
  • Login: SOM-Research
  • Kind: organization
  • Email: rclariso@uoc.edu
  • Location: Barcelona

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  HFCommunity: A Tool to Analyze the Hugging Face Hub
  Community
message: >-
  If you use this software, please cite the article from
  preferred-citation.
type: software
authors:
  - orcid: 'https://orcid.org/0000-0002-5334-9041'
    affiliation: IN3 - UOC
    email: aait_mimoune@uoc.edu
    given-names: Adem
    family-names: Ait
  - orcid: 'https://orcid.org/0000-0002-2326-1700'
    affiliation: IN3 - UOC
    email: jcanovasi@uoc.edu
    given-names: Javier Luis
    family-names: Cánovas Izquierdo
  - orcid: 'https://orcid.org/0000-0003-2418-2489'
    given-names: Jordi
    family-names: Cabot
    email: jordi.cabot@list.lu
    affiliation: >-
      Luxembourg Institute of Science and Technology –
      University of Luxembourg
identifiers:
  - type: doi
    value: 10.1109/SANER56733.2023.00080
    description: The paper presenting the tool
repository-code: 'https://github.com/SOM-Research/HFCommunity'
url: 'https://som-research.github.io/HFCommunity/'
repository: 'https://som-research.github.io/HFCommunity/docs/index.html'
keywords:
  - Mining Software Repositories
  - Data Analysis
  - Hugging Face
license: CC-BY-SA-4.0
version: '1.0'
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "Ait"
    given-names: "Adem"
    orcid: "https://orcid.org/0000-0002-5334-9041"
  - family-names: "Cánovas Izquierdo"
    given-names: "Javier Luis"
    orcid: "https://orcid.org/0000-0002-2326-1700"
  - family-names: "Cabot"
    given-names: "Jordi"
    orcid: "https://orcid.org/0000-0003-2418-2489"
  title: "HFCommunity: A Tool to Analyze the Hugging Face Hub Community"
  conference: 
    name: "IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023, Taipa, Macao, March 21-24, 2023"
  start: 728 
  end: 732 
  year: 2023
  publisher:
    name: IEEE
  url: "https://doi.org/10.1109/SANER56733.2023.00080"
  doi: "10.1109/SANER56733.2023.00080"

GitHub Events

Total
  • Watch event: 6
Last Year
  • Watch event: 6

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 10
  • Total pull requests: 4
  • Average time to close issues: 4 days
  • Average time to close pull requests: 2 months
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 2.6
  • Average comments per pull request: 0.5
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jlcanovas (5)
  • jcabot (3)
  • lynette-1 (1)
  • zhimin-z (1)
Pull Request Authors
  • dependabot[bot] (2)
  • ademait (2)
Top Labels
Issue Labels
enhancement (8) question (2)
Pull Request Labels
dependencies (2)

Dependencies

extractor/requirements.txt pypi
  • PyDriller ==2.1
  • Requests ==2.28.1
  • clean-text ==0.6.0
  • huggingface-hub ==0.19.2
  • mysql-connector-python ==8.0.29
  • python-dateutil ==2.8.2
  • pytz ==2022.1