snorkel

A system for quickly generating training data with weak supervision

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to: acm.org
✓ Committers with academic emails: 13 of 83 committers (15.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (16.5%) to scientific vocabulary

Keywords

ai data-augmentation data-science data-slicing labeling machine-learning python snorkel training-data weak-supervision

Keywords from Contributors

tokenizer named-entity-recognition distributed text-classification cython entity-linking spacy cryptography jax transformer

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: snorkel-team
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://snorkel.org
Size: 279 MB

Statistics

Stars: 5,880
Watchers: 165
Forks: 857
Open Issues: 16
Releases: 16

Created over 10 years ago · Last pushed about 2 years ago

Metadata Files

Readme, Changelog, Contributing, License

README.md

Programmatically Build and Manage Training Data

Announcement

The Snorkel team is now focusing their efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel—you can check it out here or join us in building it!

The Snorkel project started at Stanford in 2015 with a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed. Given this premise, we set out to explore the radical idea that you could bring mathematical and systems structure to the messy and often entirely manual process of training data creation and management, starting by empowering users to programmatically label, build, and manage training data.

To say that the Snorkel project succeeded and expanded beyond what we had ever expected would be an understatement. The basic goals of a research repo like Snorkel are to provide a minimum viable framework for testing and validating hypotheses. Four years later, we’ve been fortunate to do not just this, but to develop and deploy early versions of Snorkel in partnership with some of the world’s leading organizations like Google, Intel, Stanford Medicine, and many more; author over sixty peer-reviewed publications on our findings around Snorkel and related innovations in weak supervision modeling, data augmentation, multi-task learning, and more; be included in courses at top-tier universities; support production deployments in systems that you’ve likely used in the last few hours; and work with an amazing community of researchers and practitioners from industry, medicine, government, academia, and beyond.

However, we increasingly realized, from conversations with users in weekly office hours, workshops, online discussions, and industry partners, that the Snorkel project was just the very first step. The ideas behind Snorkel change not just how you label training data, but so much of the entire lifecycle and pipeline of building, deploying, and managing ML: how users inject their knowledge; how models are constructed, trained, inspected, versioned, and monitored; how entire pipelines are developed iteratively; and how the full set of stakeholders in any ML deployment, from subject matter experts to ML engineers, are incorporated into the process.

Over the last year, we have been building the platform to support this broader vision: Snorkel Flow, an end-to-end machine learning platform for developing and deploying AI applications. Snorkel Flow incorporates many of the concepts of the Snorkel project with a range of newer techniques around weak supervision modeling, data augmentation, multi-task learning, data slicing and structuring, monitoring and analysis, and more. These integrate into a whole that is greater than the sum of its parts, and that we believe makes ML truly faster, more flexible, and more practical than ever before.

Moving forward, we will be focusing our efforts on Snorkel Flow. We are extremely grateful to all of you who have contributed to the Snorkel project, and are excited for you to check out our next chapter here.

Getting Started

The quickest way to familiarize yourself with the Snorkel library is to walk through the Get Started page on the Snorkel website, followed by the full-length tutorials in the Snorkel tutorials repository. These tutorials demonstrate a variety of tasks, domains, labeling techniques, and integrations that can serve as templates as you apply Snorkel to your own applications.
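To give a flavor of programmatic labeling before working through the tutorials, here is a minimal sketch in plain Python. It does not use Snorkel's API: the label constants, the three labeling functions, and the `majority_vote` helper are all hypothetical names invented for illustration, and a simple majority vote stands in for the modeling techniques Snorkel uses to combine noisy labels.

```python
from collections import Counter

# Label constants (hypothetical names for this sketch, not Snorkel's API).
ABSTAIN, HAM, SPAM = -1, 0, 1


def lf_contains_link(text: str) -> int:
    """Vote SPAM if the text contains a URL, otherwise abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Vote SPAM if the text mentions a prize, otherwise abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    """Vote HAM for very short messages, otherwise abstain."""
    return HAM if len(text.split()) < 5 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def apply_lfs(texts):
    """Return one row of votes per text, one column per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain when nobody votes or on a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Claim your prize now at http://example.com today",
    "see you soon",
    "The quarterly report is attached for your review",
]
label_matrix = apply_lfs(texts)
weak_labels = [majority_vote(row) for row in label_matrix]
print(weak_labels)
```

Each labeling function encodes one heuristic and may abstain, so no example has to be labeled by hand. The third text receives no votes and stays unlabeled, which illustrates why coverage matters when writing heuristics like these.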

Installation

Snorkel requires Python 3.11 or later. To install Snorkel, we recommend using pip:

```bash
pip install snorkel
```

or conda:

```bash
conda install snorkel -c conda-forge
```

For information on installing from source and contributing to Snorkel, see our contributing guidelines.
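To check an environment against the Python requirement stated above, and to see whether a `snorkel` distribution is already installed, a small standard-library script along these lines may help. This is a sketch, not part of Snorkel: the helper names `python_is_supported` and `installed_version` are made up for illustration.

```python
import sys
from importlib import metadata

# Minimum interpreter version, taken from the requirement stated above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "snorkel"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python OK:", python_is_supported())
print("snorkel version:", installed_version())
```

If the second line prints `None`, the package is not installed in the active environment.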

Details on installing with conda

The following example commands give some more color on installing with `conda`. These commands assume that your `conda` installation is Python 3.11, and that you want to use a virtual environment called `snorkel-env`.

```bash
# [OPTIONAL] Create and activate a virtual environment called "snorkel-env"
conda create --yes -n snorkel-env python=3.11
conda activate snorkel-env

# We specify PyTorch here to ensure compatibility, but it may not be necessary.
conda install pytorch==1.1.0 -c pytorch
conda install snorkel==0.9.0 -c conda-forge
```

A quick note for Windows users

If you're using Windows, we highly recommend using Docker (you can find an example in our [tutorials repo](https://github.com/snorkel-team/snorkel-tutorials/blob/master/Dockerfile)) or the [Linux subsystem](https://docs.microsoft.com/en-us/windows/wsl/faq). We've done limited testing on Windows, so if you want to contribute instructions or improvements, feel free to open a PR!

Discussion

Issues

We use GitHub Issues for posting bugs and feature requests — anything code-related. Just make sure you search for related issues first and use our Issues templates. We may ask for contributions if a prompt fix doesn't fit into the immediate roadmap of the core development team.

Contributions

We welcome contributions from the Snorkel community! This is likely the fastest way to get a change you'd like to see into the library.

Small contributions can be made directly in a pull request (PR). If you would like to contribute a larger feature, we recommend first creating an issue with a proposed design for discussion. For ideas about what to work on, we've labeled specific issues as help wanted.

To set up a development environment for contributing back to Snorkel, see our contributing guidelines. All PRs must pass the continuous integration tests and receive approval from a member of the Snorkel development team before they will be merged.

Community Forum

For broader Q&A, discussions about using Snorkel, tutorial requests, etc., use the Snorkel community forum hosted on Spectrum. We hope this will be a venue for you to interact with other Snorkel users — please don't be shy about posting!

Announcements

To stay up-to-date on Snorkel-related announcements (e.g. version releases, upcoming workshops), subscribe to the Snorkel mailing list. We promise to respect your inboxes — communication will be sparse!

Owner

Name: Snorkel Team
Login: snorkel-team
Kind: organization
Email: hello@snorkel.org

Website: https://snorkel.org
Repositories: 3
Profile: https://github.com/snorkel-team

GitHub Events

Total

Watch event: 129
Fork event: 8

Last Year

Watch event: 129
Fork event: 8

Committers

Last synced: about 3 years ago

All Time

Total Commits: 2,238
Total Committers: 83
Avg Commits per committer: 26.964
Development Distribution Score (DDS): 0.69
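The two derived figures above can be reproduced from the raw counts on this page. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits. The page does not spell out that definition, but it is consistent with the listed values. The function name is hypothetical.

```python
def development_distribution_score(top_commits: int, total_commits: int) -> float:
    """One minus the top committer's share of all commits (assumed definition)."""
    if total_commits <= 0:
        raise ValueError("total_commits must be positive")
    return 1 - top_commits / total_commits


# Figures from this page: 2,238 total commits, 83 committers,
# and 694 commits by the top committer (see Top Committers).
total_commits, total_committers, top_commits = 2238, 83, 694

avg_commits = round(total_commits / total_committers, 3)
dds = round(development_distribution_score(top_commits, total_commits), 2)
print(avg_commits)  # average commits per committer
print(dds)          # Development Distribution Score
```

Under this definition, a score near 0 means one person wrote almost every commit, while a score near 1 means the work is spread across many committers.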

Past Year

Commits: 10
Committers: 5
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.6

Top Committers

Name	Email	Commits
Alex Ratner	a**r@g**m	694
Henry Ehrenberg	h**g@o**m	507
Stephen Bach	s**h@g**m	421
Braden Hancock	h**n@g**m	126
Vincent Chen	v**n@u**m	56
Bryan He	b**e@s**u	50
Jaeho Shin	n**j@c**u	38
Jason Alan Fries	j**s@s**u	37
Jason Alan Fries	f**n@g**m	28
Paroma Varma	v**a@g**m	27
Catalin Voss	c**n@c**u	23
jasontlam	j**m@l**a	22
Vincent Chen	v**n@y**m	19
thodrek	t**k@s**u	18
Daniel Himmelstein	d**n@g**m	15
senwu	s**u@s**u	13
Manas Joglekar	b**a@g**m	11
David Nicholson	d**9@g**m	7
Felix Sonntag	f**g@o**m	7
Hang Yao	h**o@h**m	7
Humza Iqbal	h**9@g**m	7
Peter M. Landwehr	p**h@c**u	5
Páidí Creed	p**d@g**m	5
Shawn Roberts	r**3@g**m	5
Xiao Ling	x**g@l**o	5
regoldman	r**n@g**m	4
rsmith49	r**h@s**i	4
Namit Chaturvedi	n**v@n**z	4
Luke Hsiao	l**o@u**m	3
dependabot[bot]	4**]@u**m	3
and 53 more...

Committer Domains (Top 20 + Academic)

stanford.edu: 7
cs.stanford.edu: 2
snorkel.ai: 2
live.ca: 1
cs.cmu.edu: 1
lattice.io: 1
nchaturv-mn1.linkedin.biz: 1
hammerlab.org: 1
hal.hitachi.com: 1
umich.edu: 1
apache.org: 1
hotmail.co.uk: 1
edx.org: 1
cs.wisc.edu: 1
cmu.edu: 1
mail.com: 1
caspian.co.uk: 1
globalbitlabs.com: 1
elsevier.com: 1
exascale.info: 1

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 60
Total pull requests: 56
Average time to close issues: 4 months
Average time to close pull requests: 2 months
Total issue authors: 50
Total pull request authors: 26
Average comments per issue: 4.22
Average comments per pull request: 2.07
Merged pull requests: 40
Bot issues: 0
Bot pull requests: 3

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

rjurney (4)
yinxiangshi (3)
hardianlawi (3)
moscow25 (2)
anjani-dhrangadhariya (2)
pratikchhapolika (2)
meidata (1)
datduong (1)
dmracek (1)
ndenStanford (1)
nsankar (1)
gionanide (1)
e-hossam96 (1)
JayThibs (1)
arfeen93 (1)

Pull Request Authors

humzaiqbal (10)
bhancock8 (9)
rsmith49 (6)
zexuan-zhou (3)
dependabot[bot] (3)
fpoms (3)
asottile (2)
kamelCased (2)
anerirana (2)
marekmodry (2)
hardianlawi (2)
jaiwiwjwjwisn (2)
minhtuev (1)
run3134 (1)
zehua99 (1)

Top Labels

Issue Labels

no-issue-activity (23), feature request (10), no-stale (6), help wanted (5), Q&A (3), snorkel-extraction (1), installation (1)

Pull Request Labels

no-pr-activity (12), no-stale (4), dependencies (3)

Packages

Total packages: 3
Total downloads:
- pypi 37,236 last-month

Total dependent packages: 9
(may contain duplicates)
Total dependent repositories: 69
(may contain duplicates)
Total versions: 36
Total maintainers: 5

pypi.org: snorkel

A system for quickly generating training data with weak supervision

Homepage: https://github.com/snorkel-team/snorkel
Documentation: https://snorkel.readthedocs.io/
License: Apache License 2.0
Latest release: 0.10.0
published over 2 years ago

Versions: 11
Dependent Packages: 8
Dependent Repositories: 65
Downloads: 37,236 Last month
Docker Downloads: 0

Rankings

Stargazers count: 0.4%

Average: 1.4%

Forks count: 1.5%

Dependent packages count: 1.6%

Downloads: 1.7%

Dependent repos count: 1.8%

Maintainers (5)

ajratner, henryre, stephenbach, bhancock8, snorkel

Last synced: 10 months ago

proxy.golang.org: github.com/snorkel-team/snorkel

Documentation: https://pkg.go.dev/github.com/snorkel-team/snorkel#section-documentation
License: apache-2.0
Latest release: v0.10.0
published over 2 years ago

Versions: 15
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 7.0%

Average: 8.2%

Dependent repos count: 9.3%

Last synced: 11 months ago

conda-forge.org: snorkel

Snorkel is a system for programmatically building and managing training datasets to rapidly and flexibly fuel machine learning models. Users write programmatic operations to label, transform, and structure training datasets for machine learning, without needing to hand label any training data. Snorkel then uses modern, theoretically-grounded modeling techniques to clean and integrate the resulting training data.

Homepage: https://snorkel.org
License: Apache-2.0
Latest release: 0.9.9
published almost 4 years ago

Versions: 10
Dependent Packages: 1
Dependent Repositories: 4

Rankings

Stargazers count: 4.7%

Forks count: 5.6%

Average: 13.8%

Dependent repos count: 16.1%

Dependent packages count: 29.0%

Last synced: 11 months ago

Dependencies

.github/workflows/stale.yml actions

actions/stale v1 composite

docs/requirements-doc.txt pypi

sphinx ==2.1.2
sphinx_autodoc_typehints ==1.7.0
sphinx_rtd_theme ==0.4.3

requirements-pyspark.txt pypi

pyspark ==3.2.2

requirements.txt pypi

black >=22.3
blis >=0.3.0
dask >=2020.12.0
dill >=0.3.0
distributed >=2020.12.0
flake8 >=3.7.0
isort >=4.3.0
munkres >=1.0.6
mypy ==0.760
networkx >=2.2
numpy >=1.16.5
pandas >=1.0.0
protobuf >=3.19.5
pydocstyle >=4.0.0
pytest >=5.0.0,<6.0.0
pytest-cov >=2.7.0
pytest-doctestplus >=0.3.0
scikit-learn >=0.20.2
scipy >=1.2.0
spacy >=2.1.0
tensorboard >=2.9.1
torch >=1.2.0
tox >=3.13.0
tqdm >=4.33.0

setup.py pypi

munkres >=1.0.6
networkx >=2.2
numpy >=1.16.5
pandas >=1.0.0
scikit-learn >=0.20.2
scipy >=1.2.0
tensorboard >=2.9.1
torch >=1.2.0
tqdm >=4.33.0