https://github.com/bigscience-workshop/metadata
Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bigscience-workshop
- License: apache-2.0
- Language: Python
- Default Branch: master
- Size: 927 KB
Statistics
- Stars: 31
- Watchers: 18
- Forks: 11
- Open Issues: 38
- Releases: 0
Metadata Files
README.md
BigScience Modeling Metadata
This repository contains code for including metadata such as URLs, timestamps, website descriptions and HTML tags during language model pretraining. The purpose is to explore, solely from a modeling perspective, how to make good use of metadata to improve various aspects of the model (such as its zero-shot text generation abilities). This repository is not intended for general contributions to metadata that are not concerned with modeling.
Usage
```sh
accelerate launch --fp16 train.py max_train_steps=100 eval_num_per_epoch=1 data_config.per_device_eval_batch_size=4
```
Get Help
```sh
python bsmetadata/train.py [-h] [--help]
```
Metadata Format
This script expects metadata to be in JSON lines (.jsonl) format. Each JSON line is required to have the following fields:
- `text`: The actual input text.
- `metadata`: A list of metadata associated with the given text.
The script supports two different kinds of metadata: global metadata, which applies to the whole text, and local metadata, which applies only to parts of it.
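As a minimal sketch of the record layout described above, the following Python snippet serializes one example to a single JSON line and parses it back. The helper names `to_jsonl_line` and `from_jsonl_line` and the example URL are hypothetical and are not part of this repository.

```python
import json


def to_jsonl_line(text, metadata):
    """Serialize one example to a single line (hypothetical helper)."""
    record = {"text": text, "metadata": metadata}
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)


def from_jsonl_line(line):
    """Parse one line and check that both required fields are present."""
    record = json.loads(line)
    missing = {"text", "metadata"} - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return record


line = to_jsonl_line(
    "Hello world.\n",
    [{"key": "url", "type": "global", "value": "https://example.com"}],
)
record = from_jsonl_line(line)
```

Because newlines in `text` are escaped during serialization, a multi-line document still occupies exactly one line of the `.jsonl` file.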
Global Metadata
Global metadata is required to have the following fields:
- `key`: A unique key to identify this kind of metadata (e.g., `url` or `timestamp`).
- `type`: This must be set to `global`.
- `value`: The actual value associated with this metadata instance (e.g., an actual URL or timestamp).
Local Metadata
Local metadata is required to have the following fields:
- `key`: A unique key to identify this kind of metadata (e.g., `entity` or `html`).
- `type`: This must be set to `local`.
- `char_start_idx`: The index of the first character in `text` that is associated with this metadata instance.
- `char_end_idx`: The index of the first character in `text` that is not associated with this metadata instance.
- `value`: The actual value associated with this metadata instance (e.g., an entity name or HTML tag).
Local metadata also has two optional fields. For all metadata sharing the same `key`, they must either be set on every instance or omitted on every instance:
- `relative_start_pos`: The position of this instance's start marker relative to the other markers placed at the same character index.
- `relative_end_pos`: The position of this instance's end marker relative to the other markers placed at the same character index.
For a given `key`, `relative_start_pos` and `relative_end_pos` share a single counter.
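The field requirements for global and local metadata can be expressed as a small validator. The sketch below is illustrative only: `check_entry` and `check_optional_keys` are hypothetical helpers, not functions from this repository, and the second one assumes that "set or not set for all metadata sharing the same key" means every local entry with a given `key` carries the same subset of optional fields.

```python
GLOBAL_FIELDS = {"key", "type", "value"}
LOCAL_FIELDS = GLOBAL_FIELDS | {"char_start_idx", "char_end_idx"}
OPTIONAL_LOCAL_FIELDS = {"relative_start_pos", "relative_end_pos"}


def check_entry(entry, text):
    """Return a list of problems found in one metadata entry (hypothetical helper)."""
    kind = entry.get("type")
    if kind == "global":
        required = GLOBAL_FIELDS
    elif kind == "local":
        required = LOCAL_FIELDS
    else:
        return [f"unknown type: {kind!r}"]

    problems = []
    missing = required - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    elif kind == "local":
        start, end = entry["char_start_idx"], entry["char_end_idx"]
        # char_end_idx is exclusive, so it may be equal to len(text).
        if not 0 <= start <= end <= len(text):
            problems.append(f"span [{start}, {end}) is outside the text")
    return problems


def check_optional_keys(metadata):
    """Check that local entries sharing a key agree on which optional fields are set."""
    problems = []
    seen = {}
    for entry in metadata:
        if entry.get("type") != "local":
            continue
        present = frozenset(OPTIONAL_LOCAL_FIELDS & entry.keys())
        first = seen.setdefault(entry.get("key"), present)
        if present != first:
            problems.append(f"inconsistent optional fields for key {entry.get('key')!r}")
    return problems
```

Extra fields such as `html_attrs` are deliberately ignored, so entries that carry additional information still pass the check.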
Example
Assume that the following text is extracted from https://www.bbc.com/sport/live/olympics/50974152, which is an article that was published on 2018-12-10T13:45:00.000Z (lines 2 and 3 of the snippet show the output of an entity tagger applied to this text).
```html
<body><div><p>It was a brilliant first round. You have to break down the Cuban's rhythm you can't let them get into rhythm. The risk with that is <a>Yafai</a> has got to go him.</p>\n</div></body>
                                                                                                                                                     ^^^^^
                                                                                                                                                     Entity: Galal Yafai
```
This text would be represented as the following input example with two global metadata instances (url and timestamp) and five local metadata instances (1 entity and 4 html). Note that this entire input should be in a single line in the actual dataset.
```javascript
{
  "text": "It was a brilliant first round. You have to break down the Cuban's rhythm you can't let them get into rhythm. The risk with that is Yafai has got to go him.\n",
  "metadata": [
    {"key": "url", "type": "global", "value": "https://www.bbc.com/sport/live/olympics/50974152"},
    {"key": "timestamp", "type": "global", "value": "2018-12-10T13:45:00.000Z"},
    {"key": "entity", "type": "local", "char_start_idx": 132, "char_end_idx": 137, "value": "Galal Yafai"},
    {"key": "html", "type": "local", "char_start_idx": 132, "relative_start_pos": 0, "char_end_idx": 137, "relative_end_pos": 0, "value": "a", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 2, "char_end_idx": 156, "relative_end_pos": 0, "value": "p", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 1, "char_end_idx": 157, "relative_end_pos": 0, "value": "div", "html_attrs": {"attrs": [], "values": []}},
    {"key": "html", "type": "local", "char_start_idx": 0, "relative_start_pos": 0, "char_end_idx": 157, "relative_end_pos": 1, "value": "body", "html_attrs": {"attrs": [], "values": []}}
  ]
}
```
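To make the character offsets concrete, here is a small Python check, written for illustration only, that slices the example text with the `char_start_idx` and `char_end_idx` values given above. The helper `covered_text` is a hypothetical name, not a function from this repository.

```python
text = (
    "It was a brilliant first round. You have to break down the Cuban's rhythm "
    "you can't let them get into rhythm. The risk with that is Yafai has got to go him.\n"
)

entity = {"key": "entity", "type": "local", "char_start_idx": 132, "char_end_idx": 137, "value": "Galal Yafai"}
paragraph = {"key": "html", "type": "local", "char_start_idx": 0, "char_end_idx": 156, "value": "p"}


def covered_text(text, entry):
    """Return the slice of text a local metadata entry refers to (end index is exclusive)."""
    return text[entry["char_start_idx"]:entry["char_end_idx"]]


print(covered_text(text, entity))                          # Yafai
print(covered_text(text, paragraph) == text.rstrip("\n"))  # True
print(len(text))                                           # 157
```

The `p` element ends at index 156, just before the trailing newline, while `div` and `body` end at 157 because they also enclose the newline.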
The table below shows how the relative_start_pos (start) and relative_end_pos (end) fields work for several HTML tag examples.
| Input | start (`<i>`) | end (`<i>`) | start (`<b>`) | end (`<b>`) |
| - | - | - | - | - |
| `<i></i><b> ... </b>` | 0 | 1 | 2 | 0 |
| `<i><b> ... </b></i>` | 0 | 1 | 1 | 0 |
| `<i> ... </i><b></b>` | 0 | 0 | 1 | 2 |
| `<i> ... <b></b></i>` | 0 | 2 | 0 | 1 |
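One way to read the table is that all markers placed at the same character index are emitted in increasing order of their relative position, with start and end markers sharing one counter. The sketch below reconstructs the tagged string under that reading. It is an illustration, not the repository's implementation: `render_html_markers` is a hypothetical function, and it assumes that an empty element has equal `char_start_idx` and `char_end_idx` and that ` ... ` in the table stands for ordinary text.

```python
def render_html_markers(text, metadata):
    """Insert <tag> and </tag> markers into text (illustrative sketch only)."""
    markers = {}  # character index -> list of (relative position, marker string)
    for entry in metadata:
        tag = entry["value"]
        markers.setdefault(entry["char_start_idx"], []).append((entry["relative_start_pos"], f"<{tag}>"))
        markers.setdefault(entry["char_end_idx"], []).append((entry["relative_end_pos"], f"</{tag}>"))

    pieces = []
    for idx in range(len(text) + 1):
        # Start and end markers at one index share a counter, so a single sort orders them.
        pieces.extend(marker for _, marker in sorted(markers.get(idx, [])))
        if idx < len(text):
            pieces.append(text[idx])
    return "".join(pieces)


# First row of the table: <i> is empty and sits before <b>, which wraps the text.
first_row = [
    {"value": "i", "char_start_idx": 0, "relative_start_pos": 0, "char_end_idx": 0, "relative_end_pos": 1},
    {"value": "b", "char_start_idx": 0, "relative_start_pos": 2, "char_end_idx": 5, "relative_end_pos": 0},
]
print(render_html_markers(" ... ", first_row))  # <i></i><b> ... </b>
```

Without the relative positions, the three markers at index 0 in this row could not be ordered, which is why the fields are needed whenever several markers of the same key share a character index.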
Pre-processing Metadata:
Entity Tags.
Pre-requisite steps for preprocessing Entity Tags
- Run `preprocessing_scripts/download_entity_processing_files.sh <"location of the folder to save the files">` to download all the required files. The downloads total approximately 20 GB, and about 60 GB of space is required after extracting the zipped folders.
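Since the extracted files need roughly 60 GB, it may be worth checking the free space in the target folder before starting the download. The following is a minimal sketch using only the Python standard library; `has_enough_space` is a hypothetical helper and the threshold is simply the approximate figure quoted above.

```python
import shutil

REQUIRED_GB = 60  # approximate space needed after extraction, as quoted above


def has_enough_space(folder, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding folder has at least required_gb free."""
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= required_gb * 1024**3


if not has_enough_space("."):
    print(f"Less than {REQUIRED_GB} GB free in the current folder.")
```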
Contribute 🧠
After installing the development dependencies, install the package in editable mode by running the following command in a bash shell at the root of the repository:
```sh
pip install -e .
```
You can then run the tests with:
```sh
python -m pytest .
```
To keep the code style consistent, we have set up some formatting tools. Before you commit or open a PR, please run:
```sh
make style && make quality
```
Owner
- Name: BigScience Workshop
- Login: bigscience-workshop
- Kind: organization
- Email: bigscience-contact@googlegroups.com
- Website: https://bigscience.huggingface.co
- Twitter: BigScienceW
- Repositories: 28
- Profile: https://github.com/bigscience-workshop
Research workshop on large language models - The Summer of Language Models 21
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 51
- Total pull requests: 130
- Average time to close issues: 3 months
- Average time to close pull requests: 24 days
- Total issue authors: 6
- Total pull request authors: 11
- Average comments per issue: 0.88
- Average comments per pull request: 0.81
- Merged pull requests: 104
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- norakassner (19)
- SaulLu (13)
- tianjianjiang (12)
- manandey (2)
- cccntu (1)
- shanyas10 (1)
Pull Request Authors
- SaulLu (33)
- tianjianjiang (24)
- cccntu (22)
- manandey (20)
- shanyas10 (8)
- jordiclive (7)
- timoschick (5)
- Muennighoff (3)
- ppommer (3)
- chkla (2)
- masoudjs (1)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- 126 dependencies
- black >=22.3.0 development
- flake8 >=3.8.3 development
- isort >=5.5.4 development
- pytest ==6.2.4 development
- accelerate >=0.4.0,<1
- datasets >=1.18.4
- deepspeed >=0.6.1
- gensim >=3.8.3,<4
- htmlmin ==0.1.12
- hydra_core >=1.1,<1.2
- loguru >=0.6.0
- lxml ==4.6.5
- nltk ==3.6.7
- numpy <1.22
- numpy >=1.22
- pandas <1.4
- pandas >=1.4
- pyarrow >=7.0.0,<8
- tokenizers *
- transformers >=4.22,<5
- wandb >=0.10.32,<1
- accelerate ==0.13.2 development
- aiohttp ==3.8.3 development
- aiosignal ==1.2.0 development
- antlr4-python3-runtime ==4.8 development
- anyascii ==0.3.1 development
- async-timeout ==4.0.2 development
- asynctest ==0.13.0 development
- atomicwrites ==1.4.1 development
- attrs ==22.1.0 development
- black ==22.10.0 development
- bpemb ==0.3.4 development
- certifi ==2022.9.24 development
- charset-normalizer ==2.1.1 development
- click ==8.1.3 development
- cloudpickle ==2.2.0 development
- colorama ==0.4.6 development
- cycler ==0.11.0 development
- cython ==0.29.14 development
- datasets ==2.6.1 development
- deepspeed ==0.7.4 development
- deprecated ==1.2.13 development
- dill ==0.3.5.1 development
- docker-pycreds ==0.4.0 development
- filelock ==3.8.0 development
- flair ==0.5.1 development
- flake8 ==5.0.4 development
- fonttools ==4.38.0 development
- frozenlist ==1.3.1 development
- fsspec ==2022.10.0 development
- future ==0.18.2 development
- gensim ==3.8.3 development
- gitdb ==4.0.9 development
- gitpython ==3.1.29 development
- hjson ==3.1.0 development
- htmlmin ==0.1.12 development
- huggingface-hub ==0.10.1 development
- hydra-core ==1.1.2 development
- hyperopt ==0.2.7 development
- idna ==3.4 development
- importlib-metadata ==4.2.0 development
- importlib-resources ==5.2.3 development
- iniconfig ==1.1.1 development
- isort ==5.10.1 development
- jieba ==0.42.1 development
- joblib ==1.2.0 development
- kiwisolver ==1.4.4 development
- konoha ==5.3.0 development
- langdetect ==1.0.9 development
- lmdb ==1.3.0 development
- loguru ==0.6.0 development
- lxml ==4.6.5 development
- marisa-trie ==0.7.8 development
- matplotlib ==3.5.3 development
- mccabe ==0.7.0 development
- mpld3 ==0.3 development
- multidict ==6.0.2 development
- multiprocess ==0.70.13 development
- mwparserfromhell ==0.6.4 development
- mypy-extensions ==0.4.3 development
- networkx ==2.6.3 development
- ninja ==1.10.2.4 development
- nltk ==3.6.7 development
- numpy ==1.21.6 development
- numpy ==1.22.4 development
- omegaconf ==2.1.2 development
- overrides ==3.1.0 development
- packaging ==21.3 development
- pandas ==1.3.5 development
- pandas ==1.5.1 development
- pathspec ==0.10.1 development
- pathtools ==0.1.2 development
- pillow ==9.2.0 development
- platformdirs ==2.5.2 development
- pluggy ==0.13.1 development
- promise ==2.3 development
- protobuf ==4.21.9 development
- psutil ==5.9.3 development
- py ==1.11.0 development
- py-cpuinfo ==9.0.0 development
- py4j ==0.10.9.7 development
- pyarrow ==7.0.0 development
- pycodestyle ==2.9.1 development
- pydantic ==1.10.2 development
- pyflakes ==2.5.0 development
- pyparsing ==3.0.9 development
- pytest ==6.2.4 development
- python-dateutil ==2.8.2 development
- pytz ==2022.5 development
- pyyaml ==6.0 development
- regex ==2022.9.13 development
- requests ==2.28.1 development
- responses ==0.18.0 development
- scikit-learn ==1.0.2 development
- scipy ==1.7.3 development
- segtok ==1.5.11 development
- sentencepiece ==0.1.97 development
- sentry-sdk ==1.10.1 development
- setproctitle ==1.3.2 development
- setuptools ==62.6.0 development
- setuptools-scm ==6.4.2 development
- shortuuid ==1.0.9 development
- six ==1.16.0 development
- smart-open ==6.2.0 development
- smmap ==5.0.0 development
- sqlitedict ==2.0.0 development
- tabulate ==0.9.0 development
- threadpoolctl ==3.1.0 development
- tokenizers ==0.13.1 development
- toml ==0.10.2 development
- tomli ==2.0.1 development
- torch ==1.9.0 development
- tqdm ==4.64.1 development
- transformers ==4.23.1 development
- typed-ast ==1.5.4 development
- typing-extensions ==4.4.0 development
- urllib3 ==1.26.12 development
- wandb ==0.13.4 development
- wheel ==0.37.1 development
- wikipedia2vec ==1.0.5 development
- win32-setctime ==1.1.0 development
- wrapt ==1.14.1 development
- xxhash ==3.1.0 development
- yarl ==1.8.1 development
- zipp ==3.10.0 development