https://github.com/bigscience-workshop/lam

Libraries, Archives and Museums (LAM)

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Libraries, Archives and Museums (LAM)

Basic Info

Host: GitHub
Owner: bigscience-workshop
License: apache-2.0
Default Branch: main
Size: 43.9 KB

Statistics

Stars: 82
Watchers: 28
Forks: 7
Open Issues: 34
Releases: 0

Created about 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme License

BigLAM (Libraries, Archives and Museums)

🤗 Hugging Face x 🌸 BigScience initiative to create an open source, community resource of LAM datasets.

BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions who collaborate on various projects within the natural language processing (NLP) space to broaden the accessibility of language datasets while working on challenging scientific questions around training language models.

We are running a datasets hackathon focused on making data from Libraries, Archives, and Museums (LAMs) with potential machine learning applications accessible via the Hugging Face Hub. You might also know this field as 'GLAM': galleries, libraries, archives and museums.

We are doing this to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine learning datasets more closely reflect the richness of human culture.

Goals

We aim to enable easy discovery and programmatic access to these datasets using Hugging Face's 🤗 Datasets Hub. As part of this, we want to:

Identify datasets that would be useful to have more easily accessible
Make these datasets available via the Datasets Hub
Document these datasets
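
As a rough illustration of what programmatic access could look like once a dataset is on the Hub, here is a minimal sketch using the `load_dataset` function from the 🤗 `datasets` library. The organization id `biglam` and the dataset name `example_dataset` are placeholders assumed for illustration, not confirmed repository names, and `hub_repo_id` and `load_lam_dataset` are hypothetical helpers rather than part of any library.

```python
from typing import Any


def hub_repo_id(organization: str, dataset_name: str) -> str:
    """Join an organization and a dataset name into a Hub repository id."""
    org = organization.strip().strip("/")
    name = dataset_name.strip().strip("/")
    if not org or not name:
        raise ValueError("both organization and dataset name are required")
    return f"{org}/{name}"


def load_lam_dataset(organization: str, dataset_name: str, split: str = "train") -> Any:
    """Stream one split of a dataset from the Hugging Face Hub.

    Needs the `datasets` package and network access, so the import is
    deferred until the function is actually called.
    """
    from datasets import load_dataset

    # streaming=True iterates over records without downloading the full dataset first
    return load_dataset(hub_repo_id(organization, dataset_name), split=split, streaming=True)


if __name__ == "__main__":
    # "biglam" and "example_dataset" are placeholders, not confirmed repository names.
    dataset = load_lam_dataset("biglam", "example_dataset")
    for record in dataset.take(3):
        print(record)
```

Streaming is used here only so that a first look at a large collection does not require a full download; dropping `streaming=True` would fetch and cache the whole split instead.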

Why are we doing this?

Some of the reasons we think that this effort is important:

There is a growing interest in using machine learning with LAM materials[^ai4lam]. The limited availability of datasets is one of the barriers to this effort. We want to make existing datasets more discoverable and easily accessible[^cordell]. Making datasets suitable for machine learning more easily discoverable will help reduce this barrier.
LAMs hold interesting data that we believe is currently underutilized by the broader machine learning ecosystem.
LAMs have the potential to play a positive role in making the development, sharing, and preservation of machine learning datasets more responsible (see Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning). We want this hackathon to help develop practices that we believe can positively impact the machine learning ecosystem.

Training (large) historic language models

There is a growing interest in using language models with historical texts.[^histlms] Although we are not only focused on collecting datasets for this purpose, we hope that some of the materials we gather as part of this sprint will be helpful in efforts to train language models on historic text data.

How can I contribute?

There are a few ways to contribute to the hackathon:

✨ Suggesting datasets that might be of interest (see the Wiki for guidance on the kinds of data we're interested in)
🤗 Making those datasets available via the Hugging Face Hub
🤳🏾 Inviting institutions with open datasets to join the hackathon
📝 Documenting datasets by adding additional metadata and working on the data cards for those datasets.

Joining the hackathon

To join the hackathon, start by introducing yourself on our GitHub discussion board: https://github.com/bigscience-workshop/lam/discussions/19

Once you have said hi on the discussion board, you should request to join the BigLAM Hugging Face organization.

For guidance, please check out the Wiki.

If you have questions:

first, check out the FAQs
if you don't find the answer in the FAQs, please ask on the discussions board

Dates

Initially, we plan to run the hackathon until ~~August 19th 2022~~ the end of October 2022.

[^ai4lam]: See, for example, https://sites.google.com/view/ai4lam
[^cordell]: R. Cordell, ‘Machine Learning + Libraries’, LC Labs. Accessed: Mar. 28, 2021. [Online]. Available: https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf, p. 34
[^histlms]: Schweter, S., März, L., Schmid, K., & Çano, E. (2022). hmBERT: Historical Multilingual Language Models for Named Entity Recognition. ArXiv, abs/2205.15575; Manjavacas, E., & Fonteyn, L. (2022). Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities.

Owner

Name: BigScience Workshop
Login: bigscience-workshop
Kind: organization
Email: bigscience-contact@googlegroups.com

Website: https://bigscience.huggingface.co
Twitter: BigScienceW
Repositories: 28
Profile: https://github.com/bigscience-workshop

Research workshop on large language models - The Summer of Language Models 21

GitHub Events

Total

Watch event: 3

Last Year

Watch event: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 129
Total pull requests: 25
Average time to close issues: about 2 months
Average time to close pull requests: about 2 hours
Total issue authors: 11
Total pull request authors: 4
Average comments per issue: 4.16
Average comments per pull request: 0.56
Merged pull requests: 25
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

davanstrien (43)
albertvillanova (12)
cakiki (3)
clancyoftheoverflow (3)
ymaurer (2)
nabsiddiqui (2)
giganttheo (1)
adikeinan (1)
thebooort (1)
versae (1)
Skorkmaz88 (1)

Pull Request Authors

davanstrien (6)
albertvillanova (6)
stefan-it (1)
mialondon (1)

Top Labels

Issue Labels

dataset (44)
maintenance (15)
candidate-dataset (5)
ready for review (4)
good first issue (4)
documentation (3)

Pull Request Labels

documentation (1)

Dependencies

.github/workflows/add-issue-to-project.yml actions

tibdex/github-app-token 36464acb844fc53b9b8b2401da68844f6b05ebb0 composite
