https://github.com/bigscience-workshop/data_tooling
Tools for managing datasets for governance and training.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.9%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 83
- Watchers: 16
- Forks: 46
- Open Issues: 141
- Releases: 0
Metadata Files
README.md
Data Tooling and Governance
Tools for managing datasets for governance and training large language models.
Issues we aim to address
- How do we automatically curate data to create datasets that are performant and comply with BigScience ethical values?
- How do we remediate a dataset for personally identifiable information without degrading performance?
- How should we store and serve the dataset?
- How do we store and serve meta-data in datasets?
- How do we address contestation of data?
- How do we prove legal compliance in the use of datasets?
- How do we prevent dissemination of the data beyond approved uses?
- How do we keep trusted data secure?
Format to distribute data samples
Current consensus is to use jsonl.
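As a minimal sketch of what distributing samples as jsonl could look like, the snippet below writes one JSON document per line and reads it back. The helper names `write_jsonl` and `read_jsonl` and the `text` field are illustrative assumptions, not part of this repository.

```python
import io
import json


def write_jsonl(samples, stream):
    """Write each sample as one JSON document per line."""
    for sample in samples:
        # ensure_ascii=False keeps non-Latin scripts readable in the output
        stream.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one parsed sample per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Round trip through an in-memory buffer; a file opened with
# encoding="utf-8" could be used in the same way.
samples = [{"text": "Bonjour"}, {"text": "Guten Tag"}]
buffer = io.StringIO()
write_jsonl(samples, buffer)
buffer.seek(0)
restored = list(read_jsonl(buffer))
```

Because each line is an independent JSON document, a file in this format can be streamed, split, or concatenated without parsing it as a whole.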
Metadata guideline
To keep things as simple as possible, the proposed metadata guideline is a flat key/value format in which the format of each value can be constrained. The goal is not to cover all possible metadata exhaustively but to be pragmatic and align on what is already being recorded.
Metadata subjects
For now, three "objects of interest" are foreseen in the project to which metadata could be applied:
- data sources
- data set
- data sample (or document)
Key general format
Plain text, in lowercase, avoiding any punctuation and spaces (replace spaces with the underscore character '_' only if really necessary), to ease automated parsing.
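The key rule above could be checked automatically along the following lines. This is only a sketch: the exact character set (ASCII lowercase letters and digits, with single underscores as separators) is an assumption, and `is_valid_key` and `normalize_key` are hypothetical helper names.

```python
import re

# Assumption: keys are ASCII lowercase letters or digits, optionally
# separated by single underscores. The guideline does not fix this exactly.
KEY_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_key(key: str) -> bool:
    """Return True if the key follows the assumed key format."""
    return KEY_PATTERN.fullmatch(key) is not None


def normalize_key(label: str) -> str:
    """Lowercase a free-form label and collapse other characters into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
```

For instance, `normalize_key("Main Language")` gives `main_language`, which `is_valid_key` accepts, while the raw label `"Main Language"` is rejected.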
Value general format
The value format varies for each metadata key, with an open standard used as the reference whenever applicable. All text should be encoded in UTF-8 to cope with any language script.
Some proposed value formats:
| Type of value | Value |
|---------------|-------|
| language | ISO639-3 => 3 letter codes with the most coverage (for reference, see the wikipedia list of ISO_639-3 codes) ex: afr for Afrikaans |
| | Note: IETF BCP 47 was proposed as an alternative in the data sourcing group; a list of standard values would be necessary to be practical (like the wikipedia list for ISO639-3) |
| timestamp | ISO_8601 normalized to UTC time zone ex: 2021-07-06T15:47:46+00:00 |
| URL | Full length URL including the scheme (eg http/ftp...) ex: https://en.wikipedia.org/wiki/URL |
| text | free text encoded in UTF-8 |
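
The proposed value formats lend themselves to simple automated checks. The sketch below uses only the Python standard library; the function names are hypothetical, and the language check only verifies the three-letter shape, not membership in the ISO 639-3 code list.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit


def is_language_code(value: str) -> bool:
    """Shape check only: three lowercase ASCII letters, e.g. 'afr'."""
    return re.fullmatch(r"[a-z]{3}", value) is not None


def to_utc_timestamp(value: str) -> str:
    """Parse an ISO 8601 timestamp and return it normalized to UTC."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    return parsed.astimezone(timezone.utc).isoformat()


def is_full_url(value: str) -> bool:
    """Require both a scheme (http, ftp, ...) and a host part."""
    parts = urlsplit(value)
    return bool(parts.scheme) and bool(parts.netloc)
```

With these helpers, `to_utc_timestamp("2021-07-06T17:47:46+02:00")` returns `2021-07-06T15:47:46+00:00`, the example value from the table.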
Annotation of text content
For data samples, it is foreseen that information such as named entities might be extracted from the original text content, using position references into the original content.
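One possible shape for such position-based annotations is sketched below. Since this part of the guideline is not yet specified, the field names `start`, `end`, and `label`, the `LOC` label, and the `annotate` helper are all assumptions made for illustration.

```python
text = "je crois il est parti à Stuttgart ou bien à London"


def annotate(text, surface, label):
    """Build an annotation that points at the first occurrence of `surface`."""
    start = text.index(surface)
    # Offsets are character (code point) positions, with an exclusive end
    return {"start": start, "end": start + len(surface), "label": label}


annotations = [
    annotate(text, "Stuttgart", "LOC"),
    annotate(text, "London", "LOC"),
]

# The annotated surface forms can be recovered by slicing the original text
recovered = [text[a["start"]:a["end"]] for a in annotations]
```

Storing offsets rather than copies of the text keeps the annotations tied to the original content, but it also means they become invalid if the content is modified afterwards.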
TO BE CONTINUED
Data source metadata format
TO BE CONTINUED
Dataset metadata format
TO BE CONTINUED
Data sample metadata format
| Key | Value |
|-----|-------|
| hash | unique fixed-length value that could be used as an identifier (proposal: 128-bit murmur hash) |
| mainlanguage | ISO639-3 => 3 letter codes with the most coverage (for reference, see the wikipedia list of ISO_639-3 codes) ex: afr for Afrikaans |
| otherlanguages | list of ISO639-3 codes of all languages possibly found in the sample, in no specific order |
| collectiontimestamp | timestamp of the original collection of the data (e.g. for a web crawl) if precisely known; otherwise the default would be the timestamp of dataset creation |
| publicationtimestamp | timestamp of the publication of the data (first time a web page has been online or last edit, radio/tv show publication...) |
| original_content | free text |
TO BE CONTINUED
ex:
```json
["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], "2021-07-06T19:06:02Z", null, "je crois il est parti à Stuttgart ou bien à London"]
```
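
To relate the example to the flat key/value guideline, the sketch below pairs the example values with the keys of the data sample table. It assumes that the values appear in the same order as the table rows, which the README does not state explicitly; `SAMPLE_KEYS` and `to_record` are hypothetical names.

```python
import json

# Assumption: positional values follow the row order of the key table
SAMPLE_KEYS = [
    "hash",
    "mainlanguage",
    "otherlanguages",
    "collectiontimestamp",
    "publicationtimestamp",
    "original_content",
]

example = json.loads(
    '["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], '
    '"2021-07-06T19:06:02Z", null, '
    '"je crois il est parti à Stuttgart ou bien à London"]'
)


def to_record(values):
    """Pair positional values with the metadata keys."""
    if len(values) != len(SAMPLE_KEYS):
        raise ValueError("expected %d values, got %d" % (len(SAMPLE_KEYS), len(values)))
    return dict(zip(SAMPLE_KEYS, values))


record = to_record(example)
# One record per line, suitable for a jsonl file
line = json.dumps(record, ensure_ascii=False)
```

The JSON `null` for the publication timestamp becomes `None` in Python and is written back as `null`, so an unknown value stays distinguishable from an empty string.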
Owner
- Name: BigScience Workshop
- Login: bigscience-workshop
- Kind: organization
- Email: bigscience-contact@googlegroups.com
- Website: https://bigscience.huggingface.co
- Twitter: BigScienceW
- Repositories: 28
- Profile: https://github.com/bigscience-workshop
Research workshop on large language models - The Summer of Language Models 21
GitHub Events
Total
- Watch event: 6
- Push event: 9
Last Year
- Watch event: 6
- Push event: 9
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 90
- Total pull requests: 97
- Average time to close issues: about 2 months
- Average time to close pull requests: 3 days
- Total issue authors: 9
- Total pull request authors: 30
- Average comments per issue: 3.89
- Average comments per pull request: 1.13
- Merged pull requests: 81
- Bot issues: 0
- Bot pull requests: 10
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- albertvillanova (60)
- huu4ontocord (20)
- olinguyen (2)
- ggdupont (2)
- cccntu (2)
- JoeyOhman (1)
- asoroa (1)
- mavela (1)
- yuvalkirstain (1)
Pull Request Authors
- huu4ontocord (21)
- SaulLu (12)
- pre-commit-ci[bot] (10)
- olinguyen (9)
- edugp (4)
- clancyoftheoverflow (3)
- Luvata (3)
- majauhar (3)
- ianyu93 (3)
- ChenghaoMou (3)
- Skylion007 (2)
- asoroa (2)
- jtboing (2)
- ggdupont (2)
- paulovn (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- git *
- boto3 *
- bs4 *
- datasets *
- pyathena *
- surt *
- tldextract *
- warcio *
- datasets bigscience_datatooling
- elasticsearch ==7.10.1
- iso-639 ==0.4.5
- ray *
- simplejson *
- beautifulsoup4 >=4.7.1
- datasets ==1.16.1
- fasttext >=0.9.1
- func_argparse >=1.1.1
- kenlm *
- pandas >=0.23.4
- psutil >=5.6.3
- requests >=2.22.0
- sacremoses *
- sentencepiece >=0.1.82
- submitit >=1.0.0
- typing_extensions *
- 142 dependencies
- pytest ^5.2 develop
- black ^21.10b0
- bokeh 2.2.2
- datasets 1.14.0
- embedding-lenses 0.9.0
- flake8 ^4.0.1
- huggingface-hub 0.0.19
- kenlm *
- numba ^0.54.1
- numpy 1.20.0
- python >=3.7,<3.10
- scikit-learn 0.24.2
- sentence-transformers 2.0.0
- streamlit 1.1.0
- transformers 4.11.3
- typer ^0.4.0
- umap-learn ^0.5.2
- watchdog 2.1.3
- bokeh ==2.2.2
- embedding-lenses ==0.9.0
- huggingface-hub ==0.0.19
- numpy ==1.20.0
- sentence-transformers ==2.0.0
- streamlit ==1.1.0
- transformers ==4.11.3
- typer ==0.4.0
- umap-learn ==0.5.2
- watchdog ==2.1.3
- python-stdnum >=1.17,<2.0
- regex >=2021.11.10
- 126 dependencies
- black ^21.7b0 develop
- flake8 ^3.8.4 develop
- isort ^5.6.4 develop
- jupyterlab ^3.0.16 develop
- pdbpp ^0.10.2 develop
- pytest ^6.2.4 develop
- PyYAML ^6.0
- datasets ^1.12.1
- fsspec ^2021.11.0
- kenlm *
- nltk ^3.6.5
- python ^3.7.10
- regex ^2021.11.10
- scikit-learn ^1.0.1
- simhash-py ^0.4.0
- tqdm ^4.62.3
- transformers ^4.12.3
- typer ^0.4.0
- dataset >=1.5.0
- datasets >=1.8.0
- fasttext >=0.9.2
- fsspec *
- ftfy *
- indexed_gzip >=1.6.1
- langid >=1.1.6
- nltk *
- scikit-learn *
- sentencepiece *
- sqlalchemy >=1.4.20
- transformers *
- wordfreq *
- datasets >=1.18.0
- pyarrow >=6.0.0
- tibdex/github-app-token 36464acb844fc53b9b8b2401da68844f6b05ebb0 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite
- docker.elastic.co/elasticsearch/elasticsearch 7.13.2
- docker.elastic.co/kibana/kibana 7.13.2