what_are_embeddings

A deep dive into embeddings starting from fundamentals

https://github.com/veekaybee/what_are_embeddings

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.9%) to scientific vocabulary

Keywords

embeddings machine-learning machine-learning-algorithms nlp-machine-learning
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 1,029
  • Watchers: 11
  • Forks: 82
  • Open Issues: 0
  • Releases: 7
Topics
embeddings machine-learning machine-learning-algorithms nlp-machine-learning
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

What are embeddings?

This repository contains the generated LaTeX document, website, and complementary notebook code for "What are Embeddings".

DOI: 10.5281/zenodo.8015029

Abstract

Over the past decade, embeddings (numerical representations of non-tabular machine learning features used as input to deep learning models) have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications have exploded, creating approaches specifically tailored to scale has become increasingly important.

Google's Word2Vec paper made an important step in moving from simple statistical representations to the semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods, has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.
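To make the contrast in the abstract concrete, here is a minimal sketch of two of the traditional representations it names, one-hot encoding and TF-IDF, in plain Python. The toy corpus and the helper names (`one_hot`, `tf_idf`, `cosine`) are invented for illustration and are not taken from the repository's notebooks; the IDF term uses the unsmoothed `log(N / df)` variant, which is only one of several common definitions.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the bird flies over the house",
    "the bird sings",
    "the dog barks at the bird",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def one_hot(word):
    """Vector with a single 1.0 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]


def tf_idf(doc_tokens):
    """One weight per vocabulary word: term frequency * log(N / document frequency)."""
    counts = Counter(doc_tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc_tokens)
        df = sum(1 for d in tokenized if w in d)
        vec.append(tf * math.log(n_docs / df))
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Distinct one-hot vectors never overlap, so every pair of words looks equally unrelated.
print(cosine(one_hot("bird"), one_hot("dog")))

# TF-IDF weights words by how distinctive they are for a document.
doc_vectors = [tf_idf(d) for d in tokenized]
```

In this sketch, words that occur in every document ("the", "bird") get a TF-IDF weight of zero, while a word unique to one document ("sings") gets a positive weight. Neither representation captures that "bird" and "dog" are semantically closer to each other than to "house", which is the kind of gap learned embeddings are meant to address.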

Running

The LaTeX document is written in Overleaf and deployed to GitHub, where it's compiled via Actions. The site is likewise generated via Actions from the site branch. The notebooks are flying fast and free and not under any kind of CI whatsoever.

Contributing

If you have any changes that you'd like to make to the document, including clarifications or typo fixes, you'll need to build the LaTeX artifact. I use GitHub to track issues and feature requests, as well as to accept pull requests. Pull requests are the best way to propose changes to the codebase:

  1. Fork the repo and create your branch from main.
  2. Make your changes in your fork.
  3. Make sure that your LaTeX document compiles. The GH action that triggers the PDF is set to run on PR into main.
  4. Ensure that the document compiles to a PDF correctly and inspect the output.
  5. Make sure your code lints.
  6. Issue that pull request!

Citing

@software{Boykis_What_are_embeddings_2023,
  author = {Boykis, Vicki},
  doi = {10.5281/zenodo.8015029},
  month = jun,
  title = {{What are embeddings?}},
  url = {https://github.com/veekaybee/what_are_embeddings},
  version = {1.0.1},
  year = {2023}
}

Owner

  • Name: Vicki Boykis
  • Login: veekaybee
  • Kind: user
  • Location: Philadelphia, PA

Recsys, Engineering, LLMs, IR, ML

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Boykis"
  given-names: "Vicki"
title: "What are embeddings?"
version: 1.0.1
doi: 10.5281/zenodo.8015029 
date-released: 2023-06-08
url: "https://github.com/veekaybee/what_are_embeddings"
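
The CITATION.cff fields line up one-to-one with the BibTeX entry in the Citing section. The sketch below shows that mapping in Python. The function name `cff_to_bibtex` and the key-building rule (family name, title with punctuation removed, year) are assumptions inferred from the entry itself, not a documented converter, and the fields are hard-coded in a dict rather than parsed from the file.

```python
import re

# Fields copied by hand from the CITATION.cff above (not parsed from the file).
cff = {
    "family-names": "Boykis",
    "given-names": "Vicki",
    "title": "What are embeddings?",
    "version": "1.0.1",
    "doi": "10.5281/zenodo.8015029",
    "date-released": "2023-06-08",
    "url": "https://github.com/veekaybee/what_are_embeddings",
}

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def cff_to_bibtex(meta):
    """Build an @software BibTeX entry from a flat dict of CFF fields (hypothetical helper)."""
    year, month, _day = meta["date-released"].split("-")
    # Assumed key rule: drop punctuation from the title, join words with underscores.
    slug = re.sub(r"[^A-Za-z0-9 ]", "", meta["title"]).replace(" ", "_")
    key = f"{meta['family-names']}_{slug}_{year}"
    fields = [
        ("author", "{" + meta["family-names"] + ", " + meta["given-names"] + "}"),
        ("doi", "{" + meta["doi"] + "}"),
        ("month", MONTHS[int(month) - 1]),
        ("title", "{{" + meta["title"] + "}}"),
        ("url", "{" + meta["url"] + "}"),
        ("version", "{" + meta["version"] + "}"),
        ("year", "{" + year + "}"),
    ]
    body = ",\n".join(f"  {name} = {value}" for name, value in fields)
    return "@software{" + key + ",\n" + body + "\n}"


print(cff_to_bibtex(cff))
```

Under these assumptions the `month` and `year` fields both come from `date-released`, which is why the BibTeX entry carries no separate release date.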

GitHub Events

Total
  • Create event: 4
  • Release event: 1
  • Issues event: 2
  • Watch event: 90
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 6
  • Pull request event: 3
  • Fork event: 10
Last Year
  • Create event: 4
  • Release event: 1
  • Issues event: 2
  • Watch event: 90
  • Delete event: 3
  • Issue comment event: 3
  • Push event: 6
  • Pull request event: 3
  • Fork event: 10

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 204
  • Total Committers: 18
  • Avg Commits per committer: 11.333
  • Development Distribution Score (DDS): 0.108
Past Year
  • Commits: 3
  • Committers: 2
  • Avg Commits per committer: 1.5
  • Development Distribution Score (DDS): 0.333
Top Committers
Name                          Email            Commits
Vicki Boykis                  v****s@g****m    182
Ankush Chander                a****r@g****m    2
Barry McCardel                b****3@g****m    2
Rohan Alexander               R****r           2
Emlyn Corrin                  e****n@g****m    2
Arik Friedman                 a****n@g****m    2
Alan Gerber                   a****r           1
Andrew Schechtman-Rook        r****6@g****m    1
Benjamin Dumke-von der Ehe    m****l@b****e    1
Daniel David Leybzon          d****n@g****m    1
GraceUnderFiero               M****r@g****m    1
Johann Sebastian Schicho      6****o           1
Krishan Bhasin                8****n           1
Moshe Kaplan                  m****n@g****m    1
Pietro Peterlongo             p****o@g****m    1
bernie gray                   b****3           1
michal-mmm                    8****m           1
tbayer                        t****r           1
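
The summary figures in the Committers section can be reproduced from the per-committer counts in the table. The sketch below assumes the Development Distribution Score is one minus the top committer's share of all commits; that definition is an assumption that happens to match the numbers shown here, not something stated on this page. The `summarize` helper is hypothetical, and the past-year split of 2 and 1 is inferred from "3 commits, 2 committers" rather than listed anywhere.

```python
# Commit counts per committer, copied from the Top Committers table.
all_time = [182, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Inferred: the only way to split 3 commits across 2 committers.
past_year = [2, 1]


def summarize(commits_per_committer):
    """Return (total commits, committers, average per committer, DDS)."""
    total = sum(commits_per_committer)
    committers = len(commits_per_committer)
    avg = total / committers
    # Assumed definition: DDS = 1 - (top committer's commits / total commits).
    dds = 1 - max(commits_per_committer) / total
    return total, committers, round(avg, 3), round(dds, 3)


print(summarize(all_time))   # (204, 18, 11.333, 0.108)
print(summarize(past_year))  # (3, 2, 1.5, 0.333)
```

A score near 0 means one person wrote almost everything, which fits the all-time value of 0.108: 182 of the 204 commits come from a single committer.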

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 14
  • Total pull requests: 33
  • Average time to close issues: 17 days
  • Average time to close pull requests: 13 days
  • Total issue authors: 14
  • Total pull request authors: 22
  • Average comments per issue: 0.93
  • Average comments per pull request: 0.52
  • Merged pull requests: 30
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: 9 days
  • Average time to close pull requests: about 7 hours
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • angeek (1)
  • ubernion (1)
  • abetusk (1)
  • atacan-ykteknoloji (1)
  • tbayer (1)
  • AndrewRook (1)
  • jul-carras (1)
  • EdwinTh (1)
  • ivoras (1)
  • domesticmouse (1)
  • krisrjohnson (1)
  • aresreact (1)
Pull Request Authors
  • veekaybee (7)
  • RohanAlexander (2)
  • zack-overflow (2)
  • barrald (2)
  • mikepqr (2)
  • weedge (2)
  • balpha (1)
  • moshekaplan (1)
  • emlyn (1)
  • tbayer (1)
  • bfgray3 (1)
  • AndrewRook (1)
  • GraceUnderFiero (1)
  • KrishanBhasin (1)
  • ivoras (1)

Dependencies

.github/workflows/main.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • xu-cheng/latex-action v2 composite
.github/workflows/static.yml actions
  • actions/checkout v3 composite
  • actions/configure-pages v3 composite
  • actions/deploy-pages v2 composite
  • actions/upload-pages-artifact v1 composite
notebooks/requirements.txt pypi
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.2
  • Pillow ==9.5.0
  • PyYAML ==6.0
  • Pygments ==2.15.1
  • QtPy ==2.3.1
  • Send2Trash ==1.8.2
  • anyio ==3.6.2
  • appnope ==0.1.3
  • argon2-cffi ==21.3.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.2.3
  • arxiv ==1.4.7
  • asttokens ==2.2.1
  • attrs ==23.1.0
  • backcall ==0.2.0
  • beautifulsoup4 ==4.12.2
  • bleach ==6.0.0
  • cachetools ==5.3.0
  • certifi ==2023.5.7
  • cffi ==1.15.1
  • chardet ==3.0.4
  • comm ==0.1.3
  • contourpy ==1.0.7
  • cycler ==0.11.0
  • debugpy ==1.6.7
  • decorator ==5.1.1
  • defusedxml ==0.7.1
  • entrypoints ==0.4
  • executing ==1.2.0
  • fastjsonschema ==2.17.1
  • feedparser ==6.0.10
  • fonttools ==4.39.4
  • fqdn ==1.5.1
  • idna ==2.8
  • ipykernel ==6.23.1
  • ipython ==8.13.2
  • ipython-genutils ==0.2.0
  • ipywidgets ==8.0.6
  • isoduration ==20.11.0
  • jedi ==0.18.2
  • joblib ==1.2.0
  • jsonpointer ==2.3
  • jsonschema ==4.17.3
  • jupyter ==1.0.0
  • jupyter-console ==6.6.3
  • jupyter-events ==0.6.3
  • jupyter_client ==8.2.0
  • jupyter_core ==5.3.0
  • jupyter_server ==2.5.0
  • jupyter_server_terminals ==0.4.4
  • jupyterlab-pygments ==0.2.2
  • jupyterlab-widgets ==3.0.7
  • kiwisolver ==1.4.4
  • matplotlib ==3.7.1
  • matplotlib-inline ==0.1.6
  • mistune ==2.0.5
  • nbclassic ==1.0.0
  • nbclient ==0.7.4
  • nbconvert ==7.4.0
  • nbformat ==5.8.0
  • nest-asyncio ==1.5.6
  • notebook ==6.5.4
  • notebook_shim ==0.2.3
  • numpy ==1.24.3
  • packaging ==23.1
  • pandas ==2.0.1
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • pexpect ==4.8.0
  • pickleshare ==0.7.5
  • platformdirs ==3.5.1
  • portpicker ==1.2.0
  • prometheus-client ==0.16.0
  • prompt-toolkit ==3.0.38
  • psutil ==5.9.5
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.2
  • pyasn1 ==0.5.0
  • pyasn1-modules ==0.3.0
  • pycparser ==2.21
  • pyparsing ==3.0.9
  • pyrsistent ==0.19.3
  • python-dateutil ==2.8.2
  • python-json-logger ==2.0.7
  • pytz ==2023.3
  • pyzmq ==25.0.2
  • qtconsole ==5.4.3
  • requests ==2.21.0
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rsa ==4.9
  • scikit-learn ==1.2.2
  • scipy ==1.10.1
  • sgmllib3k ==1.0.0
  • simplegeneric ==0.8.1
  • six ==1.12.0
  • sniffio ==1.3.0
  • soupsieve ==2.4.1
  • stack-data ==0.6.2
  • terminado ==0.13.3
  • threadpoolctl ==3.1.0
  • tinycss2 ==1.2.1
  • tornado ==6.3.2
  • tqdm ==4.65.0
  • traitlets ==5.9.0
  • tzdata ==2023.3
  • uri-template ==1.2.0
  • urllib3 ==1.24.3
  • wcwidth ==0.2.6
  • webcolors ==1.13
  • webencodings ==0.5.1
  • websocket-client ==1.5.2
  • widgetsnbextension ==4.0.7