https://github.com/bigscience-workshop/data_tooling

Tools for managing datasets for governance and training.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bigscience-workshop
  • License: apache-2.0
  • Language: HTML
  • Default Branch: master
  • Homepage:
  • Size: 218 MB
Statistics
  • Stars: 83
  • Watchers: 16
  • Forks: 46
  • Open Issues: 141
  • Releases: 0
Created about 5 years ago · Last pushed 11 months ago
Metadata Files
  • Readme
  • License

README.md

Data Tooling and Governance

Tools for managing datasets for governance and training large language models.

Issues we aim to address

  • How do we automatically curate data to create datasets that are performant and comply with BigScience ethical values?
  • How do we remediate a dataset for personally identifiable information without degrading performance?
  • How should we store and serve the dataset?
  • How do we store and serve metadata in datasets?
  • How do we address contestation of data?
  • How do we prove legal compliance in the use of datasets?
  • How do we prevent dissemination of the data beyond approved uses?
  • How do we keep trusted data secure?

Format to distribute data samples

The current consensus is to use JSON Lines (jsonl).
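As an illustration of the jsonl convention (one JSON object per line, encoded in UTF-8), here is a minimal sketch using only the Python standard library. The function names `write_jsonl` and `read_jsonl` and the `text` field are hypothetical and not taken from this repository.

```python
import io
import json
from typing import Iterable, Iterator


def write_jsonl(samples: Iterable[dict], stream) -> None:
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    for sample in samples:
        stream.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_jsonl(stream) -> Iterator[dict]:
    """Yield one dict per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory round trip; a real file would be opened with encoding="utf-8".
buffer = io.StringIO()
write_jsonl([{"text": "à Stuttgart"}, {"text": "hello"}], buffer)
buffer.seek(0)
samples = list(read_jsonl(buffer))
```

Because each line is an independent JSON document, a large file can be streamed sample by sample without loading it entirely into memory.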

Metadata guideline

To keep things as simple as possible, the proposed metadata guideline is a flat key/value format in which the format of each value can be constrained. The goal is not to cover all possible metadata exhaustively but to be pragmatic and align on what is already being recorded.

Metadata subjects

For now, 3 "objects of interest" are foreseen in the project to which metadata could be applied:

  • data sources
  • data set
  • data sample (or document)

Key general format

Plain text, in lowercase and without any punctuation, to ease automated parsing. Spaces should also be avoided and, if really necessary, replaced by the underscore character '_'.
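A minimal sketch of how this key convention could be checked and enforced. The exact character set is an assumption: the guideline does not say whether digits are allowed, so this sketch accepts only lowercase ASCII letters with single underscores as separators. The names `is_valid_key` and `normalize_key` are hypothetical.

```python
import re

# Assumed reading of the guideline: lowercase ASCII letters, with '_' allowed
# only as a separator (no leading, trailing, or doubled underscores).
KEY_PATTERN = re.compile(r"[a-z]+(?:_[a-z]+)*")


def is_valid_key(key: str) -> bool:
    """Return True if the key already follows the assumed convention."""
    return KEY_PATTERN.fullmatch(key) is not None


def normalize_key(raw: str) -> str:
    """Lowercase, turn whitespace runs into '_', and drop any other character."""
    key = re.sub(r"\s+", "_", raw.strip().lower())
    key = re.sub(r"[^a-z_]", "", key)
    return re.sub(r"_+", "_", key).strip("_")
```

Under these assumptions, `is_valid_key` accepts keys such as `mainlanguage` and `original_content`, and `normalize_key` can be used to clean up free-form labels before they are stored.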

Value general format

The value format will vary for each metadata key, with an open standard used as reference whenever applicable. All text should be encoded in UTF-8 to support any language script.

Some proposed value formats:

| Type of value | Value |
|---------------|-------|
| language | ISO 639-3 => 3-letter codes with the most coverage (for reference, see the Wikipedia list of ISO 639-3 codes) ex: afr for Afrikaans |
| | Note: IETF BCP 47 was proposed as an alternative in the data sourcing group; a list of standard values would be necessary to be practical (like the Wikipedia list for ISO 639-3) |
| timestamp | ISO 8601 normalized to UTC time zone ex: 2021-07-06T15:47:46+00:00 |
| URL | Full-length URL including the scheme (e.g. http/ftp...) ex: https://en.wikipedia.org/wiki/URL |
| text | free text encoded in UTF-8 |
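The proposed value formats could be checked with standard-library helpers along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the language check only verifies the shape of the code (three lowercase letters) and does not consult the ISO 639-3 registry, and timestamps without an explicit offset are rejected rather than guessed.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit


def looks_like_iso639_3(code: str) -> bool:
    """Shape check only: three lowercase ASCII letters."""
    return re.fullmatch(r"[a-z]{3}", code) is not None


def to_utc_timestamp(value: str) -> str:
    """Parse an ISO 8601 timestamp with an offset and re-express it in UTC."""
    if value.endswith("Z"):  # fromisoformat rejects a trailing 'Z' before Python 3.11
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a time zone offset")
    return parsed.astimezone(timezone.utc).isoformat()


def is_full_url(value: str) -> bool:
    """Require both a scheme (http, ftp, ...) and a network location."""
    parts = urlsplit(value)
    return bool(parts.scheme) and bool(parts.netloc)
```

For example, `to_utc_timestamp("2021-07-06T17:47:46+02:00")` yields the UTC form `2021-07-06T15:47:46+00:00` shown in the table.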

Annotation of text content

For data samples, it is foreseen that information such as named entities might be extracted from the original text content and recorded using position references into that original content.
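Since this part of the guideline is not yet specified, the following is only one possible reading of "position reference": each annotation stores character offsets (start inclusive, end exclusive) into the untouched original text. The record fields `start`, `end`, and `label`, and the function `annotate`, are hypothetical.

```python
from typing import Dict, List

text = "je crois il est parti à Stuttgart ou bien à London"


def annotate(text: str, surface: str, label: str) -> Dict[str, object]:
    """Return an annotation record pointing at the first occurrence of `surface`."""
    start = text.find(surface)
    if start < 0:
        raise ValueError(f"{surface!r} not found in text")
    return {"start": start, "end": start + len(surface), "label": label}


annotations: List[Dict[str, object]] = [
    annotate(text, "Stuttgart", "location"),
    annotate(text, "London", "location"),
]

# The original text is never modified; each annotation is resolved by slicing.
resolved = [text[a["start"]:a["end"]] for a in annotations]
```

Keeping annotations separate from the text in this way means several annotation layers can refer to the same sample without altering its content or its hash.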

TO BE CONTINUED

Data source metadata format

TO BE CONTINUED

Dataset metadata format

TO BE CONTINUED

Data sample metadata format

| Key | Value |
|-----|-------|
| hash | unique fixed-length value that could be used as identifier - proposal to use MurmurHash 128 bits |
| mainlanguage | ISO 639-3 => 3-letter codes with the most coverage (for reference, see the Wikipedia list of ISO 639-3 codes) ex: afr for Afrikaans |
| otherlanguages | list of ISO 639-3 codes of all languages possibly found in the sample, in no specific order |
| collectiontimestamp | timestamp of original collection of the data (e.g. for web crawl) if precisely known; otherwise the default would be the timestamp of dataset creation |
| publicationtimestamp | timestamp of the publication of the data (first time a web page has been online or last edit, radio/tv shows publication...) |
| original_content | free text |

TO BE CONTINUED

ex (json):

["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], "2021-07-06T19:06:02Z", null, "je crois il est parti à Stuttgart ou bien à London"]
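The example above can be read back into named fields as sketched below. The positional layout (values in the same order as the keys in the table) is inferred from the example rather than stated in the guideline, and `SAMPLE_KEYS` and `parse_sample` are hypothetical names.

```python
import json

# Key order taken from the data sample table; the positional layout is an assumption.
SAMPLE_KEYS = [
    "hash",
    "mainlanguage",
    "otherlanguages",
    "collectiontimestamp",
    "publicationtimestamp",
    "original_content",
]

line = (
    '["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], '
    '"2021-07-06T19:06:02Z", null, '
    '"je crois il est parti à Stuttgart ou bien à London"]'
)


def parse_sample(raw: str) -> dict:
    """Map a positional JSON array onto the metadata keys."""
    values = json.loads(raw)
    if len(values) != len(SAMPLE_KEYS):
        raise ValueError(f"expected {len(SAMPLE_KEYS)} values, got {len(values)}")
    return dict(zip(SAMPLE_KEYS, values))


sample = parse_sample(line)
```

In the parsed record the hash is 32 hexadecimal digits, which is consistent with the proposed 128-bit hash, and the unknown publication timestamp comes back as `None`.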

Owner

  • Name: BigScience Workshop
  • Login: bigscience-workshop
  • Kind: organization
  • Email: bigscience-contact@googlegroups.com

Research workshop on large language models - The Summer of Language Models 21

GitHub Events

Total
  • Watch event: 6
  • Push event: 9
Last Year
  • Watch event: 6
  • Push event: 9

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 90
  • Total pull requests: 97
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 9
  • Total pull request authors: 30
  • Average comments per issue: 3.89
  • Average comments per pull request: 1.13
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 10
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • albertvillanova (60)
  • huu4ontocord (20)
  • olinguyen (2)
  • ggdupont (2)
  • cccntu (2)
  • JoeyOhman (1)
  • asoroa (1)
  • mavela (1)
  • yuvalkirstain (1)
Pull Request Authors
  • huu4ontocord (21)
  • SaulLu (12)
  • pre-commit-ci[bot] (10)
  • olinguyen (9)
  • edugp (4)
  • clancyoftheoverflow (3)
  • Luvata (3)
  • majauhar (3)
  • ianyu93 (3)
  • ChenghaoMou (3)
  • Skylion007 (2)
  • asoroa (2)
  • jtboing (2)
  • ggdupont (2)
  • paulovn (2)
Top Labels
Issue Labels
  • data catalog (60)
  • language modeling script (34)
  • tooling (11)
  • need custodian permission (6)
  • need data sourcing feedback (6)
  • corpus (5)
  • duplicate (5)
  • wontfix (4)
  • good first issue (4)
  • metadata (3)
  • data format (3)
  • help wanted (2)
  • documentation (1)
  • evaluation (1)
  • filter (1)
  • tokenizer (1)
Pull Request Labels
  • tooling (1)

Dependencies

cc_pseudo_crawl/python_scripts/extract_text/requirements.txt pypi
  • git *
cc_pseudo_crawl/python_scripts/requirements.txt pypi
  • boto3 *
  • bs4 *
  • datasets *
  • pyathena *
  • surt *
  • tldextract *
  • warcio *
index_search/requirements.txt pypi
  • datasets bigscience_datatooling
  • elasticsearch ==7.10.1
  • iso-639 ==0.4.5
  • ray *
  • simplejson *
kenlm_training/setup.py pypi
  • beautifulsoup4 >=4.7.1
  • datasets ==1.16.1
  • fasttext >=0.9.1
  • func_argparse >=1.1.1
  • kenlm *
  • pandas >=0.23.4
  • psutil >=5.6.3
  • requests >=2.22.0
  • sacremoses *
  • sentencepiece >=0.1.82
  • submitit >=1.0.0
  • typing_extensions *
perplexity_lenses/poetry.lock pypi
  • 142 dependencies
perplexity_lenses/pyproject.toml pypi
  • pytest ^5.2 develop
  • black ^21.10b0
  • bokeh 2.2.2
  • datasets 1.14.0
  • embedding-lenses 0.9.0
  • flake8 ^4.0.1
  • huggingface-hub 0.0.19
  • kenlm *
  • numba ^0.54.1
  • numpy 1.20.0
  • python >=3.7,<3.10
  • scikit-learn 0.24.2
  • sentence-transformers 2.0.0
  • streamlit 1.1.0
  • transformers 4.11.3
  • typer ^0.4.0
  • umap-learn ^0.5.2
  • watchdog 2.1.3
perplexity_lenses/requirements.txt pypi
  • bokeh ==2.2.2
  • embedding-lenses ==0.9.0
  • huggingface-hub ==0.0.19
  • numpy ==1.20.0
  • sentence-transformers ==2.0.0
  • streamlit ==1.1.0
  • transformers ==4.11.3
  • typer ==0.4.0
  • umap-learn ==0.5.2
  • watchdog ==2.1.3
pii-manager/requirements.txt pypi
  • python-stdnum >=1.17,<2.0
  • regex >=2021.11.10
poetry.lock pypi
  • 126 dependencies
pyproject.toml pypi
  • black ^21.7b0 develop
  • flake8 ^3.8.4 develop
  • isort ^5.6.4 develop
  • jupyterlab ^3.0.16 develop
  • pdbpp ^0.10.2 develop
  • pytest ^6.2.4 develop
  • PyYAML ^6.0
  • datasets ^1.12.1
  • fsspec ^2021.11.0
  • kenlm *
  • nltk ^3.6.5
  • python ^3.7.10
  • regex ^2021.11.10
  • scikit-learn ^1.0.1
  • simhash-py ^0.4.0
  • tqdm ^4.62.3
  • transformers ^4.12.3
  • typer ^0.4.0
requirements.txt pypi
  • dataset >=1.5.0
  • datasets >=1.8.0
  • fasttext >=0.9.2
  • fsspec *
  • ftfy *
  • indexed_gzip >=1.6.1
  • langid >=1.1.6
  • nltk *
  • scikit-learn *
  • sentencepiece *
  • sqlalchemy >=1.4.20
  • transformers *
  • wordfreq *
tokenizer/python_script/requirements.txt pypi
  • datasets >=1.18.0
  • pyarrow >=6.0.0
.github/workflows/add-issue-to-project.yml actions
  • tibdex/github-app-token 36464acb844fc53b9b8b2401da68844f6b05ebb0 composite
.github/workflows/pii-manager.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
index_search/docker-compose.yml docker
  • docker.elastic.co/elasticsearch/elasticsearch 7.13.2
  • docker.elastic.co/kibana/kibana 7.13.2