https://github.com/bigscience-workshop/lam
Libraries, Archives and Museums (LAM)
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links
  Links to: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity
  Low similarity (10.6%) to scientific vocabulary
Repository
Libraries, Archives and Museums (LAM)
Basic Info
- Host: GitHub
- Owner: bigscience-workshop
- License: apache-2.0
- Default Branch: main
- Size: 43.9 KB
Statistics
- Stars: 82
- Watchers: 28
- Forks: 7
- Open Issues: 34
- Releases: 0
Metadata Files
README.md
BigLAM (Libraries, Archives and Museums)
🤗 Hugging Face x 🌸 BigScience initiative to create an open source, community resource of LAM datasets.
BigScience 🌸 is an open scientific collaboration of nearly 600 researchers from 50 countries and 250 institutions who collaborate on various projects within the natural language processing (NLP) space to broaden the accessibility of language datasets while working on challenging scientific questions around training language models.
We are running a datasets hackathon focused on taking data from Libraries, Archives, and Museums (LAMs) that has potential machine learning applications and making it accessible via the Hugging Face Hub. You might also know this field as 'GLAM': galleries, libraries, archives and museums.
We are doing this to help make these datasets more discoverable, open them up to new audiences, and help ensure that machine learning datasets more closely reflect the richness of human culture.
Goals
We aim to enable easy discovery and programmatic access to these datasets using Hugging Face's 🤗 Datasets Hub. As part of this, we want to:
- Identify datasets that would be useful to have more easily accessible
- Make these datasets available via the Datasets Hub
- Document these datasets
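As a rough illustration of what "programmatic access" could look like once a dataset is on the Hub, here is a minimal sketch using the 🤗 `datasets` library. The organization name `biglam` and the dataset name `example-dataset` are assumptions made for illustration (the README only refers to the "BigLAM Hugging Face organization"), and `hub_repo_id` and `preview` are hypothetical helpers, not part of any library.

```python
from itertools import islice
from typing import Any, Dict, Iterator

# Assumption: the Hub organization name. The README only calls it "BigLAM".
HUB_ORG = "biglam"


def hub_repo_id(dataset_name: str, org: str = HUB_ORG) -> str:
    """Build the "<org>/<name>" identifier the Hub uses for a dataset."""
    name = dataset_name.strip().strip("/")
    if not name or "/" in name:
        raise ValueError(f"expected a bare dataset name, got {dataset_name!r}")
    return f"{org}/{name}"


def preview(dataset_name: str, n: int = 3, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Stream the first n records of a Hub dataset without downloading all of it."""
    # Imported lazily so hub_repo_id works even if `datasets` is not installed.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(hub_repo_id(dataset_name), split=split, streaming=True)
    yield from islice(stream, n)
```

Calling `for record in preview("example-dataset"): print(record)` would need network access and a dataset that actually exists under that name. Streaming is used here so that a quick look at a large collection does not require a full download.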
Why are we doing this?
Some of the reasons we think that this effort is important:
- There is a growing interest in using Machine Learning with LAM materials[^ai4lam]. The availability of datasets is one of the barriers to this effort. We want to make existing datasets more discoverable and easily accessible[^cordell]. Making datasets suitable for machine learning more easily discoverable will help reduce this barrier.
- LAMs hold interesting data that we believe is currently underutilized by the broader machine learning ecosystem.
- LAMs have the potential to play a positive role in making the development, sharing, and preservation of machine learning datasets more responsible (see Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning). We want this hackathon to help develop practices we believe can positively impact the machine learning ecosystem.
Training (large) historic language models
There is a growing interest in using language models with historical texts.[^histlms] Although we are not only focused on collecting datasets for this purpose, we hope that some of the materials we gather as part of this sprint will be helpful in efforts to train language models on historic text data.
How can I contribute?
There are a few ways to contribute to the hackathon:
- ✨ Suggesting datasets that might be of interest (see the Wiki for guidance on the kinds of data we're interested in)
- 🤗 Making those datasets available via the Hugging Face Hub
- 🤳🏾 Inviting institutions with open datasets to join the hackathon
- 📝 Documenting datasets by adding additional metadata and working on the data cards for those datasets.
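To make the documentation step more concrete: dataset cards on the Hub are `README.md` files that begin with a YAML metadata header. The sketch below renders a minimal header from a few fields. The helper name `card_front_matter` and the example values are hypothetical, and the choice of fields (`pretty_name`, `license`, `language`, `tags`) is only a small subset picked for illustration; the Wiki and the Hub documentation remain the reference for what a card should contain.

```python
import json
from typing import List


def card_front_matter(pretty_name: str, license_id: str,
                      languages: List[str], tags: List[str]) -> str:
    """Render a minimal YAML metadata header for a dataset card (README.md)."""

    def quote(value: str) -> str:
        # Double-quote every scalar so values such as "no" (Norwegian)
        # are not misread by a YAML parser as booleans.
        return json.dumps(value, ensure_ascii=False)

    lines = ["---",
             f"pretty_name: {quote(pretty_name)}",
             f"license: {quote(license_id)}"]
    for key, values in (("language", languages), ("tags", tags)):
        if values:  # skip empty lists rather than emitting a null field
            lines.append(f"{key}:")
            lines.extend(f"- {quote(value)}" for value in values)
    lines.append("---")
    return "\n".join(lines) + "\n"
```

For example, `print(card_front_matter("Example Collection", "cc0-1.0", ["en"], ["lam"]))` produces a header that can be pasted at the top of a card before the prose description is written.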
Joining the hackathon
To join the hackathon, start by introducing yourself on our GitHub discussion board https://github.com/bigscience-workshop/lam/discussions/19.
Once you have said hi on the discussion board, you should request to join the BigLAM Hugging Face organization.
For guidance, please check out the Wiki.
If you have questions:
- first, check out the FAQs
- if you don't find the answer in the FAQs, please ask on the discussions board
Dates
Initially we plan to run the hackathon until ~~August 19th 2022.~~ the end of October 2022.
[^ai4lam]: See, for example, https://sites.google.com/view/ai4lam
[^cordell]: R. Cordell, ‘Machine Learning + Libraries’, LC Labs. Accessed: Mar. 28, 2021. [Online]. Available: https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf, p.34
[^histlms]: Schweter, S., März, L., Schmid, K., & Çano, E. (2022). hmBERT: Historical Multilingual Language Models for Named Entity Recognition. ArXiv, abs/2205.15575; Manjavacas, E., & Fonteyn, L. (2022). Adapting vs. Pre-training Language Models for Historical Languages. Journal of Data Mining & Digital Humanities.
Owner
- Name: BigScience Workshop
- Login: bigscience-workshop
- Kind: organization
- Email: bigscience-contact@googlegroups.com
- Website: https://bigscience.huggingface.co
- Twitter: BigScienceW
- Repositories: 28
- Profile: https://github.com/bigscience-workshop
Research workshop on large language models - The Summer of Language Models 21
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 129
- Total pull requests: 25
- Average time to close issues: about 2 months
- Average time to close pull requests: about 2 hours
- Total issue authors: 11
- Total pull request authors: 4
- Average comments per issue: 4.16
- Average comments per pull request: 0.56
- Merged pull requests: 25
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- davanstrien (43)
- albertvillanova (12)
- cakiki (3)
- clancyoftheoverflow (3)
- ymaurer (2)
- nabsiddiqui (2)
- giganttheo (1)
- adikeinan (1)
- thebooort (1)
- versae (1)
- Skorkmaz88 (1)
Pull Request Authors
- davanstrien (6)
- albertvillanova (6)
- stefan-it (1)
- mialondon (1)
Dependencies
- tibdex/github-app-token 36464acb844fc53b9b8b2401da68844f6b05ebb0 composite