https://github.com/bigscience-workshop/data_tooling

Tools for managing datasets for governance and training.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
✓ .zenodo.json file: found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bigscience-workshop
License: apache-2.0
Language: HTML
Default Branch: master
Homepage:
Size: 218 MB

Statistics

Stars: 83
Watchers: 16
Forks: 46
Open Issues: 141
Releases: 0

Created about 5 years ago · Last pushed 11 months ago

Metadata Files

Readme License

Data Tooling and Governance

Tools for managing datasets for governance and training large language models.

Issues we aim to address

How do we automatically curate data to create datasets that are performant and comply with BigScience ethical values?
How do we remediate a dataset for personally identifiable information without degrading performance?
How should we store and serve the dataset?
How do we store and serve metadata in datasets?
How do we address contestation of data?
How do we prove legal compliance in the use of datasets?
How do we prevent dissemination of the data beyond approved uses?
How do we keep trusted data secure?

Format to distribute data samples

Current consensus is to use jsonl.
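
As a minimal sketch of what distributing samples as jsonl could look like, assuming one JSON object per line and UTF-8 text: the helper names and the `text` field below are illustrative only and are not defined by the project.

```python
import io
import json


def write_jsonl(samples, stream):
    """Write each sample as one JSON document per line."""
    for sample in samples:
        # ensure_ascii=False keeps non-Latin scripts readable in the output.
        stream.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one parsed JSON document per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Round trip through an in-memory buffer standing in for a file.
buffer = io.StringIO()
write_jsonl([{"text": "bonjour"}, {"text": "hello"}], buffer)
buffer.seek(0)
samples = list(read_jsonl(buffer))
```

Because each line is an independent document, a file in this format can be streamed, split, or concatenated without parsing it as a whole.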

Metadata guideline

To keep things as simple as possible, the proposed metadata guideline is a flat key/value format in which the format of each value can be constrained. The goal is not to cover every possible kind of metadata, but to align pragmatically on what is already being recorded.

Metadata subjects

For now, three "objects of interest" are foreseen in the project to which metadata could be applied:

data sources
data set
data sample (or document)

Key general format

Plain text, in lowercase and without punctuation or spaces (spaces to be replaced by the underscore character '_' if really necessary), to ease automated parsing.
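
The key rule can be checked mechanically. The sketch below is one possible reading of it, assuming keys are limited to lowercase ASCII letters and digits with single underscores between words; `is_valid_key` and `normalize_key` are hypothetical helpers, not part of the repository.

```python
import re

# Assumption: "no punctuation" is read as lowercase ASCII letters and digits,
# optionally separated by single underscores.
KEY_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_valid_key(key: str) -> bool:
    """Return True when the key already follows the guideline."""
    return KEY_PATTERN.fullmatch(key) is not None


def normalize_key(raw: str) -> str:
    """Lowercase, replace runs of other characters with '_', trim the ends."""
    return re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_")
```

For example, `normalize_key("Collection Timestamp")` returns `"collection_timestamp"`, which then passes `is_valid_key`.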

Value general format

The value format will vary for each metadata key, with an open standard used as the reference whenever applicable. All text should be encoded in UTF-8 to cope with any language script.

Some proposed value formats:

| Type of value | Value |
|---------------|-------|
| language | ISO 639-3 => 3 letter codes with the most coverage (for reference, see the Wikipedia list of ISO 639-3 codes) ex: afr for Afrikaans |
| | Note: IETF BCP 47 was proposed as an alternative in the data sourcing group; a list of standard values would be necessary to be practical (like the Wikipedia list for ISO 639-3) |
| timestamp | ISO 8601 normalized to UTC time zone ex: 2021-07-06T15:47:46+00:00 |
| URL | Full length URL including the scheme (e.g. http/ftp...) ex: https://en.wikipedia.org/wiki/URL |
| text | free text encoded in UTF-8 |
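
The value formats above lend themselves to small validators. This is a sketch using only the Python standard library; the function names are hypothetical, and the language check verifies only the three-letter shape, not membership in the ISO 639-3 code list.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlparse


def is_language_code(value: str) -> bool:
    # Shape check only: three lowercase ASCII letters. It does not confirm
    # that the code is actually assigned in ISO 639-3.
    return re.fullmatch(r"[a-z]{3}", value) is not None


def to_utc_timestamp(value: str) -> str:
    """Parse an ISO 8601 timestamp with an offset and re-emit it in UTC."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so it is rewritten as an explicit offset first.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp needs an explicit UTC offset")
    return parsed.astimezone(timezone.utc).isoformat()


def is_full_url(value: str) -> bool:
    # Requires both a scheme and a host; scheme-less strings are rejected.
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)
```

With these helpers, `to_utc_timestamp("2021-07-06T17:47:46+02:00")` returns `"2021-07-06T15:47:46+00:00"`, matching the timestamp example in the table.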

Annotation of text content

For data samples, it is foreseen that information such as named entities might be extracted from the original text content, using position references into the original content.
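
Since this part of the guideline is not yet specified, the following is only one way position references could work, assuming character offsets with an exclusive end. The `Annotation` class, its fields, and the `LOC` label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character (inclusive)
    end: int    # offset just past the last character (exclusive)
    label: str  # hypothetical annotation type, e.g. "LOC"


def annotate(text: str, surface: str, label: str) -> Annotation:
    """Record the span of the first occurrence of `surface` in `text`."""
    start = text.index(surface)  # raises ValueError when absent
    return Annotation(start, start + len(surface), label)


def resolve(text: str, annotation: Annotation) -> str:
    """Recover the annotated substring from the original content."""
    return text[annotation.start:annotation.end]


content = "je crois il est parti à Stuttgart ou bien à London"
entities = [
    annotate(content, "Stuttgart", "LOC"),
    annotate(content, "London", "LOC"),
]
```

Storing offsets rather than copies of the text keeps the original content untouched and lets several annotation layers refer to the same sample.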

TO BE CONTINUED

Data source metadata format

TO BE CONTINUED

Dataset metadata format

TO BE CONTINUED

Data sample metadata format

| Key | Value |
|-----|-------|
| hash | unique fixed-length value that could be used as an identifier; the proposal is to use a 128-bit murmur hash |
| mainlanguage | ISO 639-3 => 3 letter codes with the most coverage (for reference, see the Wikipedia list of ISO 639-3 codes) ex: afr for Afrikaans |
| otherlanguages | list of ISO 639-3 codes of all languages possibly found in the sample, in no specific order |
| collectiontimestamp | timestamp of the original collection of the data (e.g. for a web crawl) if precisely known; otherwise the default would be the timestamp of dataset creation |
| publicationtimestamp | timestamp of the publication of the data (first time a web page was online or its last edit, radio/TV show publication...) |
| original_content | free text |

TO BE CONTINUED

ex:

```json
["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], "2021-07-06T19:06:02Z", null, "je crois il est parti à Stuttgart ou bien à London"]
```
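
The example is a positional array, so a reader has to pair each value with its key. A minimal sketch of that pairing follows, assuming the values appear in the same order as the rows of the data sample table; the key spellings are copied from that table, and `sample_from_row` is a hypothetical helper.

```python
import json

# Key order and spellings taken from the data sample table.
SAMPLE_KEYS = (
    "hash",
    "mainlanguage",
    "otherlanguages",
    "collectiontimestamp",
    "publicationtimestamp",
    "original_content",
)


def sample_from_row(row):
    """Pair a positional row with the documented keys."""
    if len(row) != len(SAMPLE_KEYS):
        raise ValueError(f"expected {len(SAMPLE_KEYS)} values, got {len(row)}")
    return dict(zip(SAMPLE_KEYS, row))


line = (
    '["89faeee174d2ddbc2b761207efbc8464", "fra", ["eng", "deu"], '
    '"2021-07-06T19:06:02Z", null, '
    '"je crois il est parti à Stuttgart ou bien à London"]'
)
sample = sample_from_row(json.loads(line))
```

The JSON `null` in the fifth position becomes `None`, meaning the publication timestamp is unknown for this sample.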

Owner

Name: BigScience Workshop
Login: bigscience-workshop
Kind: organization
Email: bigscience-contact@googlegroups.com

Website: https://bigscience.huggingface.co
Twitter: BigScienceW
Repositories: 28
Profile: https://github.com/bigscience-workshop

Research workshop on large language models - The Summer of Language Models 21

GitHub Events

Total

Watch event: 6
Push event: 9

Last Year

Watch event: 6
Push event: 9

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 90
Total pull requests: 97
Average time to close issues: about 2 months
Average time to close pull requests: 3 days
Total issue authors: 9
Total pull request authors: 30
Average comments per issue: 3.89
Average comments per pull request: 1.13
Merged pull requests: 81
Bot issues: 0
Bot pull requests: 10

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

albertvillanova (60)
huu4ontocord (20)
olinguyen (2)
ggdupont (2)
cccntu (2)
JoeyOhman (1)
asoroa (1)
mavela (1)
yuvalkirstain (1)

Pull Request Authors

huu4ontocord (21)
SaulLu (12)
pre-commit-ci[bot] (10)
olinguyen (9)
edugp (4)
clancyoftheoverflow (3)
Luvata (3)
majauhar (3)
ianyu93 (3)
ChenghaoMou (3)
Skylion007 (2)
asoroa (2)
jtboing (2)
ggdupont (2)
paulovn (2)

Top Labels

Issue Labels

data catalog (60)
language modeling script (34)
tooling (11)
need custodian permission (6)
need data sourcing feedback (6)
corpus (5)
duplicate (5)
wontfix (4)
good first issue (4)
metadata (3)
data format (3)
help wanted (2)
documentation (1)
evaluation (1)
filter (1)
tokenizer (1)

Pull Request Labels

tooling (1)

Dependencies

cc_pseudo_crawl/python_scripts/extract_text/requirements.txt pypi

git *

cc_pseudo_crawl/python_scripts/requirements.txt pypi

boto3 *
bs4 *
datasets *
pyathena *
surt *
tldextract *
warcio *

index_search/requirements.txt pypi

datasets bigscience_datatooling
elasticsearch ==7.10.1
iso-639 ==0.4.5
ray *
simplejson *

kenlm_training/setup.py pypi

beautifulsoup4 >=4.7.1
datasets ==1.16.1
fasttext >=0.9.1
func_argparse >=1.1.1
kenlm *
pandas >=0.23.4
psutil >=5.6.3
requests >=2.22.0
sacremoses *
sentencepiece >=0.1.82
submitit >=1.0.0
typing_extensions *

perplexity_lenses/poetry.lock pypi

142 dependencies

perplexity_lenses/pyproject.toml pypi

pytest ^5.2 develop
black ^21.10b0
bokeh 2.2.2
datasets 1.14.0
embedding-lenses 0.9.0
flake8 ^4.0.1
huggingface-hub 0.0.19
kenlm *
numba ^0.54.1
numpy 1.20.0
python >=3.7,<3.10
scikit-learn 0.24.2
sentence-transformers 2.0.0
streamlit 1.1.0
transformers 4.11.3
typer ^0.4.0
umap-learn ^0.5.2
watchdog 2.1.3

perplexity_lenses/requirements.txt pypi

bokeh ==2.2.2
embedding-lenses ==0.9.0
huggingface-hub ==0.0.19
numpy ==1.20.0
sentence-transformers ==2.0.0
streamlit ==1.1.0
transformers ==4.11.3
typer ==0.4.0
umap-learn ==0.5.2
watchdog ==2.1.3

pii-manager/requirements.txt pypi

python-stdnum >=1.17,<2.0
regex >=2021.11.10

poetry.lock pypi

126 dependencies

pyproject.toml pypi

black ^21.7b0 develop
flake8 ^3.8.4 develop
isort ^5.6.4 develop
jupyterlab ^3.0.16 develop
pdbpp ^0.10.2 develop
pytest ^6.2.4 develop
PyYAML ^6.0
datasets ^1.12.1
fsspec ^2021.11.0
kenlm *
nltk ^3.6.5
python ^3.7.10
regex ^2021.11.10
scikit-learn ^1.0.1
simhash-py ^0.4.0
tqdm ^4.62.3
transformers ^4.12.3
typer ^0.4.0

requirements.txt pypi

dataset >=1.5.0
datasets >=1.8.0
fasttext >=0.9.2
fsspec *
ftfy *
indexed_gzip >=1.6.1
langid >=1.1.6
nltk *
scikit-learn *
sentencepiece *
sqlalchemy >=1.4.20
transformers *
wordfreq *

tokenizer/python_script/requirements.txt pypi

datasets >=1.18.0
pyarrow >=6.0.0

.github/workflows/add-issue-to-project.yml actions

tibdex/github-app-token 36464acb844fc53b9b8b2401da68844f6b05ebb0 composite

.github/workflows/pii-manager.yml actions

actions/checkout v2 composite
actions/setup-python v1 composite

index_search/docker-compose.yml docker

docker.elastic.co/elasticsearch/elasticsearch 7.13.2
docker.elastic.co/kibana/kibana 7.13.2
