https://github.com/bigscience-workshop/metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.2%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: bigscience-workshop
License: apache-2.0
Language: Python
Default Branch: master
Size: 927 KB

Statistics

Stars: 31
Watchers: 18
Forks: 11
Open Issues: 38
Releases: 0

Created almost 5 years ago · Last pushed about 3 years ago

Metadata Files

Readme License

BigScience Modeling Metadata

This repository contains code for including metadata such as URLs, timestamps, website descriptions and HTML tags during language model pretraining. The purpose is to explore, solely from a modeling perspective, how to make good use of metadata to improve various aspects of the model (such as its zero-shot text generation abilities). This repository is not intended for general contributions to metadata that are not concerned with modeling.

Usage

```sh
accelerate launch --fp16 train.py max_train_steps=100 eval_num_per_epoch=1 data_config.per_device_eval_batch_size=4
```

Get Help

```sh
python bsmetadata/train.py [-h] [--help]
```

Metadata Format

This script expects metadata to be in JSON lines (.jsonl) format. Each JSON line is required to have the following fields:

text: The actual input text.
metadata: A list of metadata associated with the given text.

The script supports two different kinds of metadata: global metadata, which applies to the whole text, and local metadata, which applies only to parts of it.
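As an illustration of the layout described above, the following minimal sketch reads JSON lines and separates global from local metadata. The helper names `iter_examples` and `split_by_type` are hypothetical and are not part of this repository; the sample record and its URL are made up for the example.

```python
import json
from typing import Iterator


def iter_examples(lines) -> Iterator[dict]:
    """Yield one example per non-empty JSON line, checking the two required fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        for field in ("text", "metadata"):
            if field not in example:
                raise ValueError(f"line {line_no}: missing required field {field!r}")
        yield example


def split_by_type(example: dict):
    """Separate an example's metadata list into global and local entries."""
    global_md = [m for m in example["metadata"] if m["type"] == "global"]
    local_md = [m for m in example["metadata"] if m["type"] == "local"]
    return global_md, local_md


# Usage with an in-memory list of lines; an open .jsonl file handle works the same way.
sample = [
    '{"text": "Hello world", "metadata": '
    '[{"key": "url", "type": "global", "value": "https://example.com"}]}'
]
examples = list(iter_examples(sample))
global_md, local_md = split_by_type(examples[0])
```

Iterating line by line keeps memory use flat for large files, which is the usual reason to prefer JSON lines over a single JSON array.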

Global Metadata

Global metadata is required to have the following fields:

key: A unique key to identify this kind of metadata (e.g., url or timestamp).
type: This must be set to global.
value: The actual value associated with this metadata instance (e.g., an actual URL or timestamp).

Local Metadata

Local metadata is required to have the following fields:

key: A unique key to identify this kind of metadata (e.g., entity or html).
type: This must be set to local.
char_start_idx: The index of the first character in text that is associated with this metadata instance.
char_end_idx: The index of the first character in text that is not associated with this metadata instance.
value: The actual value associated with this metadata instance (e.g., an entity name or HTML tag).

There are also two optional fields, which must be either set or not set for all metadata sharing the same key value:

relative_start_pos: The position at which the start metadata is placed at a given character index.
relative_end_pos: The position at which the end metadata is placed at a given character index.

For a given key value, relative_start_pos and relative_end_pos share a common counter.
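The field requirements for local metadata can be checked mechanically. The sketch below is one possible validator, not code from this repository: the function name `check_local_metadata` is hypothetical, and it assumes that an empty span (char_start_idx equal to char_end_idx) is acceptable, which the text above does not state.

```python
from collections import defaultdict

REQUIRED_LOCAL_FIELDS = ("key", "type", "char_start_idx", "char_end_idx", "value")
OPTIONAL_POS_FIELDS = ("relative_start_pos", "relative_end_pos")


def check_local_metadata(text, metadata):
    """Return a list of human-readable problems found in the local metadata entries."""
    problems = []
    # (key, optional field) -> set of booleans recording whether the field was present.
    usage = defaultdict(set)
    for i, md in enumerate(metadata):
        if md.get("type") != "local":
            continue
        missing = [f for f in REQUIRED_LOCAL_FIELDS if f not in md]
        if missing:
            problems.append(f"entry {i}: missing required fields {missing}")
            continue
        start, end = md["char_start_idx"], md["char_end_idx"]
        # char_end_idx is exclusive, so the covered span is text[start:end].
        # Assumption: start == end (an empty span) is allowed.
        if not 0 <= start <= end <= len(text):
            problems.append(f"entry {i}: span [{start}, {end}) does not fit the text")
        for field in OPTIONAL_POS_FIELDS:
            usage[(md["key"], field)].add(field in md)
    # Each optional field must be set on all entries sharing a key, or on none.
    for (key, field), seen in usage.items():
        if len(seen) > 1:
            problems.append(f"key {key!r}: {field} is set on some entries but not on others")
    return problems


text = "The risk with that is Yafai."
metadata = [
    {"key": "entity", "type": "local", "char_start_idx": 22, "char_end_idx": 27, "value": "Galal Yafai"},
]
problems = check_local_metadata(text, metadata)
```

Because char_end_idx is exclusive, the pair maps directly onto a Python slice, so `text[22:27]` is the tagged span here.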

Example

Assume that the following text is extracted from https://www.bbc.com/sport/live/olympics/50974152, an article published on 2018-12-10T13:45:00.000Z (lines 2-3 show the output of an entity tagger applied to this text).

```html
<body><div><p>It was a brilliant first round. You have to break down the Cuban's rhythm you can't let them get into rhythm. The risk with that is <a>Yafai</a> has got to go him.</p>\n</div></body>
                                                                                                                                                     ^^^^^
                                                                                                                                                     Entity: Galal Yafai
```

This text would be represented as the following input example with two global metadata instances (url and timestamp) and five local metadata instances (1 entity and 4 html). Note that this entire input should be in a single line in the actual dataset.

```javascript
{
  "text": "It was a brilliant first round. You have to break down the Cuban's rhythm you can't let them get into rhythm. The risk with that is Yafai has got to go him.\n",
  "metadata": [
    {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
    {"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
    {"key": "entity", "type": "local", "char_start_idx": 132, "char_end_idx": 137, "value": "Galal Yafai"},
    {"key": "html", "type": "local", "char_start_idx": 132, "relative_start_pos": 0, "char_end_idx": 137, "relative_end_pos": 0, "value": "a", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 2, "char_end_idx": 156, "relative_end_pos": 0, "value": "p", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 1, "char_end_idx": 157, "relative_end_pos": 0, "value": "div", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 0, "char_end_idx": 157, "relative_end_pos": 1, "value": "body", "html_attrs": {"attrs": [], "values": []}}
  ]
}
```
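The character indices in this example can be checked directly, and the single-line requirement is met by any standard JSON serializer. The following sketch uses a shortened copy of the example with only the url and entity entries, and relies solely on the Python standard library.

```python
import json

text = (
    "It was a brilliant first round. You have to break down the Cuban's rhythm "
    "you can't let them get into rhythm. The risk with that is Yafai has got to go him.\n"
)
example = {
    "text": text,
    "metadata": [
        {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
        {"key": "entity", "type": "local", "char_start_idx": 132, "char_end_idx": 137, "value": "Galal Yafai"},
    ],
}

# char_end_idx is exclusive, so the two indices slice out the tagged span.
entity = example["metadata"][1]
span = text[entity["char_start_idx"]:entity["char_end_idx"]]

# json.dumps escapes the newline inside "text", so the record stays on one line.
line = json.dumps(example)
```

Here `span` is the surface string "Yafai", while the entity's value field holds the resolved name "Galal Yafai".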

The table below shows how the relative_start_pos (start) and relative_end_pos (end) fields work on some example HTML tags.

| Input | start () | end() | start () | end() |
| - | - | - | - | - |
|  ...  | 0 | 1 | 2 | 0 |
|  ...  | 0 | 1 | 1 | 0 |
|  ...  | 0 | 0 | 1 | 2 |
|  ...  | 0 | 2 | 0 | 1 |
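One way to read the relative position fields is as a sort order for the tags that share a character index. The sketch below re-inserts html metadata into the text under that reading. It is an interpretation of the field descriptions above, not the repository's implementation; the function name `render_html` is hypothetical, and html_attrs is ignored for brevity.

```python
from collections import defaultdict


def render_html(text: str, metadata: list) -> str:
    """Re-insert local html metadata into text as opening and closing tags."""
    # events[char_idx] collects (relative_pos, tag_string) pairs for that index.
    events = defaultdict(list)
    for md in metadata:
        if md.get("type") != "local" or md.get("key") != "html":
            continue
        events[md["char_start_idx"]].append((md["relative_start_pos"], f"<{md['value']}>"))
        events[md["char_end_idx"]].append((md["relative_end_pos"], f"</{md['value']}>"))

    pieces = []
    for idx in range(len(text) + 1):
        # Start and end tags share one counter per index, so a single sort orders both.
        for _, tag in sorted(events.get(idx, [])):
            pieces.append(tag)
        if idx < len(text):
            pieces.append(text[idx])
    return "".join(pieces)


# A shortened text with the same tag nesting and relative positions as the example above.
text = "Hi Bob!\n"
metadata = [
    {"key": "html", "type": "local", "char_start_idx": 3, "relative_start_pos": 0,
     "char_end_idx": 6, "relative_end_pos": 0, "value": "a"},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 2,
     "char_end_idx": 7, "relative_end_pos": 0, "value": "p"},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 1,
     "char_end_idx": 8, "relative_end_pos": 0, "value": "div"},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 0,
     "char_end_idx": 8, "relative_end_pos": 1, "value": "body"},
]
rendered = render_html(text, metadata)
```

Three tags open at index 0, and their relative_start_pos values 0, 1 and 2 fix the nesting as body, div, p. At index 8, relative_end_pos closes div before body.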

Pre-processing Metadata:

Entity Tags.

Pre-requisite steps for preprocessing Entity Tags

Run preprocessing_scripts/download_entity_processing_files.sh <"location of the folder to save the files"> to download all the required files. The download is approximately 20GB in total; after extracting the zipped folders, around 60GB of disk space is required.

Contribute 🧠

After installing the development dependencies, you first need to install the package in editable mode. To do so, run the following in a bash shell at the root of the repository:

```sh
pip install -e .
```

You can then execute the tests by running:

```sh
python -m pytest .
```

To keep the code style unified, we use a set of formatting tools. Before you commit or open a PR, please run:

```sh
make style && make quality
```

Owner

Name: BigScience Workshop
Login: bigscience-workshop
Kind: organization
Email: bigscience-contact@googlegroups.com

Website: https://bigscience.huggingface.co
Twitter: BigScienceW
Repositories: 28
Profile: https://github.com/bigscience-workshop

Research workshop on large language models - The Summer of Language Models 21

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 51
Total pull requests: 130
Average time to close issues: 3 months
Average time to close pull requests: 24 days
Total issue authors: 6
Total pull request authors: 11
Average comments per issue: 0.88
Average comments per pull request: 0.81
Merged pull requests: 104
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

norakassner (19)
SaulLu (13)
tianjianjiang (12)
manandey (2)
cccntu (1)
shanyas10 (1)

Pull Request Authors

SaulLu (33)
tianjianjiang (24)
cccntu (22)
manandey (20)
shanyas10 (8)
jordiclive (7)
timoschick (5)
Muennighoff (3)
ppommer (3)
chkla (2)
masoudjs (1)

Top Labels

Issue Labels

enhancement (8)
wontfix (6)
#dataset (3)
duplicate (3)
question (1)
documentation (1)
bug (1)
Epic (1)
#paragraph_extraction (1)

Pull Request Labels

enhancement (9)
bug (5)
#dataset (3)
#paragraph_extraction (1)

Dependencies

.github/workflows/code_quality.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/test.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

poetry.lock pypi

126 dependencies

requirements-dev.txt pypi

black >=22.3.0 development
flake8 >=3.8.3 development
isort >=5.5.4 development
pytest ==6.2.4 development

requirements.txt pypi

accelerate >=0.4.0,<1
datasets >=1.18.4
deepspeed >=0.6.1
gensim >=3.8.3,<4
htmlmin ==0.1.12
hydra_core >=1.1,<1.2
loguru >=0.6.0
lxml ==4.6.5
nltk ==3.6.7
numpy <1.22
numpy >=1.22
pandas <1.4
pandas >=1.4
pyarrow >=7.0.0,<8
tokenizers *
transformers >=4.22,<5
wandb >=0.10.32,<1

requirements_resolved_with_extras_and_dev.txt pypi

accelerate ==0.13.2 development
aiohttp ==3.8.3 development
aiosignal ==1.2.0 development
antlr4-python3-runtime ==4.8 development
anyascii ==0.3.1 development
async-timeout ==4.0.2 development
asynctest ==0.13.0 development
atomicwrites ==1.4.1 development
attrs ==22.1.0 development
black ==22.10.0 development
bpemb ==0.3.4 development
certifi ==2022.9.24 development
charset-normalizer ==2.1.1 development
click ==8.1.3 development
cloudpickle ==2.2.0 development
colorama ==0.4.6 development
cycler ==0.11.0 development
cython ==0.29.14 development
datasets ==2.6.1 development
deepspeed ==0.7.4 development
deprecated ==1.2.13 development
dill ==0.3.5.1 development
docker-pycreds ==0.4.0 development
filelock ==3.8.0 development
flair ==0.5.1 development
flake8 ==5.0.4 development
fonttools ==4.38.0 development
frozenlist ==1.3.1 development
fsspec ==2022.10.0 development
future ==0.18.2 development
gensim ==3.8.3 development
gitdb ==4.0.9 development
gitpython ==3.1.29 development
hjson ==3.1.0 development
htmlmin ==0.1.12 development
huggingface-hub ==0.10.1 development
hydra-core ==1.1.2 development
hyperopt ==0.2.7 development
idna ==3.4 development
importlib-metadata ==4.2.0 development
importlib-resources ==5.2.3 development
iniconfig ==1.1.1 development
isort ==5.10.1 development
jieba ==0.42.1 development
joblib ==1.2.0 development
kiwisolver ==1.4.4 development
konoha ==5.3.0 development
langdetect ==1.0.9 development
lmdb ==1.3.0 development
loguru ==0.6.0 development
lxml ==4.6.5 development
marisa-trie ==0.7.8 development
matplotlib ==3.5.3 development
mccabe ==0.7.0 development
mpld3 ==0.3 development
multidict ==6.0.2 development
multiprocess ==0.70.13 development
mwparserfromhell ==0.6.4 development
mypy-extensions ==0.4.3 development
networkx ==2.6.3 development
ninja ==1.10.2.4 development
nltk ==3.6.7 development
numpy ==1.21.6 development
numpy ==1.22.4 development
omegaconf ==2.1.2 development
overrides ==3.1.0 development
packaging ==21.3 development
pandas ==1.3.5 development
pandas ==1.5.1 development
pathspec ==0.10.1 development
pathtools ==0.1.2 development
pillow ==9.2.0 development
platformdirs ==2.5.2 development
pluggy ==0.13.1 development
promise ==2.3 development
protobuf ==4.21.9 development
psutil ==5.9.3 development
py ==1.11.0 development
py-cpuinfo ==9.0.0 development
py4j ==0.10.9.7 development
pyarrow ==7.0.0 development
pycodestyle ==2.9.1 development
pydantic ==1.10.2 development
pyflakes ==2.5.0 development
pyparsing ==3.0.9 development
pytest ==6.2.4 development
python-dateutil ==2.8.2 development
pytz ==2022.5 development
pyyaml ==6.0 development
regex ==2022.9.13 development
requests ==2.28.1 development
responses ==0.18.0 development
scikit-learn ==1.0.2 development
scipy ==1.7.3 development
segtok ==1.5.11 development
sentencepiece ==0.1.97 development
sentry-sdk ==1.10.1 development
setproctitle ==1.3.2 development
setuptools ==62.6.0 development
setuptools-scm ==6.4.2 development
shortuuid ==1.0.9 development
six ==1.16.0 development
smart-open ==6.2.0 development
smmap ==5.0.0 development
sqlitedict ==2.0.0 development
tabulate ==0.9.0 development
threadpoolctl ==3.1.0 development
tokenizers ==0.13.1 development
toml ==0.10.2 development
tomli ==2.0.1 development
torch ==1.9.0 development
tqdm ==4.64.1 development
transformers ==4.23.1 development
typed-ast ==1.5.4 development
typing-extensions ==4.4.0 development
urllib3 ==1.26.12 development
wandb ==0.13.4 development
wheel ==0.37.1 development
wikipedia2vec ==1.0.5 development
win32-setctime ==1.1.0 development
wrapt ==1.14.1 development
xxhash ==3.1.0 development
yarl ==1.8.1 development
zipp ==3.10.0 development

pyproject.toml pypi

setup.py pypi
