ocaml-semsearch-jsoo
OCaml + js_of_ocaml + SBERT + TensorFlow.js
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.1%) to scientific vocabulary
Keywords
ml
ocaml
semantic-search
Last synced: 6 months ago
Repository
Basic Info
Statistics
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Topics
ml
ocaml
semantic-search
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme
License
Citation
README.org
#+title: ocaml-semsearch-jsoo
OCaml + [[https://github.com/ocsigen/js_of_ocaml][js_of_ocaml]] + [[https://www.sbert.net/][SBERT]] + [[https://www.tensorflow.org/js/][TensorFlow.js]].
This project converts an SBERT model from PyTorch to TensorFlow and then to TensorFlow.js, and loads that model from OCaml code compiled to JavaScript. The test code runs under Node.js, but the same code can run in the browser as well (with the pure-JS, WASM, WebGL, or WebGPU TensorFlow.js backend).
Why? If you need semantic text embeddings in a =js_of_ocaml= project, this is one of the easiest ways to get them.
If you're interested in running your code natively and don't want anything to do with JavaScript, this is not for you.
This package internally uses =require()= (for TensorFlow.js, the BERT tokenizer...), so make sure that the dependencies described in =package.json= are available at runtime, including any optional dependencies corresponding to your desired TF.js backend.
* System requirements
Versions indicated are what I used, not hard requirements.
- opam: 2.1.5
- yarn: 1.22.19
- Python: 3.11.3
* Supported TF.js backends
=cpu=, =tensorflow= and =wasm= were tested. Others should work as well, but may require changes.
* Usage
** Set up
#+begin_src bash
# Export a SBERT model to TensorFlow.js
./export_model.sh
# Install Node dependencies
yarn install --ignore-optional
yarn add -O @tensorflow/tfjs-backend-wasm
# Set up opam switch
opam switch create . ocaml.4.14.1 --no-install --yes
# Install dev dependencies
opam install ocaml-lsp-server merlin --yes
# Install main dependencies
cd semsearch-jsoo
opam install . --deps-only --yes --with-test
#+end_src
** Try it out
#+begin_src bash
# in semsearch-jsoo
dune test
#+end_src
* Caveats
- The embedding dimension cannot be configured at runtime for now.
- The tokenizer is fixed: it relies on [[https://www.npmjs.com/package/bert-tokenizer][bert-tokenizer (npm)]].
- The package hasn't been tested beyond ASCII text.
- The vector search is linear.
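To make the last caveat concrete, here is a minimal sketch of what a linear vector search looks like in plain OCaml. It is not this package's actual code: the names =dot=, =cosine= and =top_k=, the use of cosine similarity as the score, and the toy 3-dimensional vectors are all assumptions made for illustration.

#+begin_src ocaml
(* Illustrative sketch only: names and scoring function are assumptions,
   not this package's API. *)

(* Dot product of two vectors of equal length. *)
let dot (a : float array) (b : float array) : float =
  if Array.length a <> Array.length b then
    invalid_arg "dot: dimension mismatch";
  let acc = ref 0.0 in
  Array.iteri (fun i x -> acc := !acc +. (x *. b.(i))) a;
  !acc

(* Cosine similarity; returns 0.0 when either vector has zero norm. *)
let cosine (a : float array) (b : float array) : float =
  let na = sqrt (dot a a) and nb = sqrt (dot b b) in
  if na = 0.0 || nb = 0.0 then 0.0 else dot a b /. (na *. nb)

(* Linear scan: score every entry, sort by descending score, keep the
   first k. The cost grows linearly with the number of entries. *)
let top_k ~(k : int) (query : float array)
    (corpus : (string * float array) list) : (string * float) list =
  corpus
  |> List.map (fun (label, v) -> (label, cosine query v))
  |> List.sort (fun (_, s1) (_, s2) -> compare s2 s1)
  |> List.filteri (fun i _ -> i < k)

let () =
  (* Toy "embeddings"; real SBERT vectors have hundreds of dimensions. *)
  let corpus =
    [ ("cat", [| 1.0; 0.0; 0.0 |]);
      ("dog", [| 0.9; 0.1; 0.0 |]);
      ("car", [| 0.0; 0.0; 1.0 |]) ]
  in
  top_k ~k:2 [| 1.0; 0.05; 0.0 |] corpus
  |> List.iter (fun (label, score) -> Printf.printf "%s %.3f\n" label score)
#+end_src

Every query touches every stored vector, which is simple and exact, and is the trade-off that an approximate nearest-neighbor index would change.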
* Possible / future improvements
- Generally cleaning up the code
- =dune-project= needs proper dependency specifications.
- JS bindings:
  - =brr=?
  - [[https://github.com/LexiFi/gen_js_api][LexiFi/gen_js_api]]?
- Documentation
- =Makefile=
- Allow loading models from IndexedDB
- Pure-OCaml BERT tokenizer: would drop an unnecessary JS dependency.
- Bind to [[https://github.com/huggingface/candle][huggingface/candle]], which can target WASM, and has support for BERT.
  This would remove the need for the ugly Torch -> TF -> TFJS conversion.
  It also has the benefit of enabling portability to native code, though that would require binding twice (once through FFI, once through JSOO).
- Approximate Nearest Neighbor (ANN) search algorithms instead of a linear search. Note that in practice, an optimized, vectorized linear search is _very_ fast (4-16ms for a few thousand entries). ANN search is only necessary when scaling up dramatically or under tight real-time constraints.
* Acknowledgements
Thanks to Philipp Schmid for [[https://www.philschmid.de/tensorflow-sentence-transformers][his article on converting Sentence Transformers to TensorFlow]].
* Citation
If you've used this software in a scientific publication, please cite it as follows:
#+begin_src bibtex
@software{BERREBY_ocaml-semsearch-jsoo_2023,
author = {BERREBY, Yohaï-Eliel},
month = aug,
title = {{ocaml-semsearch-jsoo}},
url = {https://github.com/yberreby/ocaml-semsearch-jsoo},
version = {0.0.1},
year = {2023}
}
#+end_src
Owner
- Name: Yohaï-Eliel Berreby
- Login: yberreby
- Kind: user
- Location: Montréal, Canada
- Company: McGill University
- Website: https://www.linkedin.com/in/yberreby/
- Twitter: yberreby
- Repositories: 46
- Profile: https://github.com/yberreby
Researcher @ McGill
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "BERREBY"
    given-names: "Yohaï-Eliel"
    orcid: "https://orcid.org/0009-0001-7091-9093"
title: "ocaml-semsearch-jsoo"
version: 0.0.1
date-released: 2023-08-10
url: "https://github.com/yberreby/ocaml-semsearch-jsoo"
Dependencies
package.json
npm
- @tensorflow/tfjs ^4.10.0
- @tensorflow/tfjs-core 4.10.0
- bert-tokenizer ^1.1.8
yarn.lock
npm
- 120 dependencies
to-tfjs/requirements-frozen.txt
pypi
- Automat ==22.10.0
- Babel ==2.12.1
- Beaker ==1.12.0
- Brotli ==1.0.9
- CacheControl ==0.13.1
- Cython ==3.0.0
- Flask ==2.2.5
- Flask-Cors ==4.0.0
- Jinja2 ==3.1.2
- Mako ==1.2.4
- Markdown ==3.4.4
- MarkupSafe ==2.1.3
- Pillow ==10.0.0
- PyAudio ==0.2.13
- PyGObject ==3.44.1
- PyNaCl ==1.4.0
- PyQt6 ==6.5.2
- PyQt6-WebEngine ==6.5.0
- PyQt6-sip ==13.5.2
- PySocks ==1.7.1
- PyYAML ==6.0.1
- Pygments ==2.15.1
- Reflector ==2023.6.28.0.36.1
- SecretStorage ==3.3.3
- Send2Trash ==1.8.2
- Sphinx ==7.1.2
- TBB ==0.2
- Twisted ==22.10.0
- Werkzeug ==2.3.2
- aiosqlite ==0.19.0
- alabaster ==0.7.13
- anki ==2.1.65
- anyio ==3.7.1
- anytree ==2.8.0
- apipkg ==3.0.1
- appdirs ==1.4.4
- apsw ==3.42.0.0
- aqt ==2.1.65
- argon2-cffi ==21.3.0
- argon2-cffi-bindings ==21.2.0
- arrow ==1.2.3
- asciidoc ==10.2.0
- asn1crypto ==1.5.1
- asttokens ==2.2.1
- async-generator ==1.10
- async-lru ==2.0.4
- async-timeout ==4.0.2
- attrs ==22.2.0
- autobahn ==23.6.2
- autocommand ==2.2.2
- backcall ==0.2.0
- beautifulsoup4 ==4.12.2
- black ==23.7.0
- bleach ==6.0.0
- brotlicffi ==1.0.9.2
- btrfsutil ==6.3.3
- build ==0.10.0
- cachy ==0.3.0
- certifi ==2023.7.22
- cffi ==1.15.1
- chardet ==5.2.0
- cleo ==2.0.1
- click ==8.1.6
- clikit ==0.6.2
- colorama ==0.4.6
- comm ==0.1.4
- commonmark ==0.9.1
- configobj ==5.0.8
- constantly ==15.1.0
- contextlib2 ==21.6.0
- coverage ==6.5.0
- crashtest ==0.4.1
- cryptography ==41.0.2
- css-parser ==1.0.9
- cssselect ==1.2.0
- dbus-python ==1.3.2
- debugpy ==1.6.7
- decorator ==5.1.1
- defusedxml ==0.7.1
- deprecation ==2.1.0
- distlib ==0.3.6
- distro ==1.8.0
- dnspython ==2.3.0
- docutils ==0.20.1
- dtrx ==8.5.3
- dulwich ==0.21.5
- entrypoints ==0.4
- exceptiongroup ==1.1.2
- executing ==1.2.0
- fastjsonschema ==2.18.0
- faust-cchardet ==2.1.18
- feedparser ==6.0.10
- fido2 ==1.1.2
- filelock ==3.12.2
- fqdn ==1.5.1
- future ==0.18.3
- greenlet ==2.0.2
- hkdf ==0.0.3
- html2text ==2020.1.16
- html5-parser ==0.4.11
- html5lib ==1.1
- httplib2 ==0.22.0
- humanize ==4.7.0
- hyperlink ==21.0.0
- hypothesis ==6.82.0
- idna ==3.4
- ifaddr ==0.2.0
- imagesize ==1.4.1
- importlib-metadata ==5.0.0
- incremental ==22.10.0
- inflate64 ==0.3.1
- inflect ==7.0.0
- iniconfig ==2.0.0
- installer ==0.7.0
- ipykernel ==6.25.0
- ipython ==8.14.0
- ipython-genutils ==0.2.0
- isoduration ==20.11.0
- itsdangerous ==2.1.2
- jaraco.classes ==3.3.0
- jaraco.context ==4.3.0
- jaraco.functools ==3.8.0
- jaraco.text ==3.11.1
- jedi ==0.18.2
- jeepney ==0.8.0
- json5 ==0.9.14
- jsonpointer ==2.4
- jsonschema ==4.18.4
- jsonschema-specifications ==2023.7.1
- jupyter-console ==6.6.3
- jupyter-events ==0.7.0
- jupyter-ydoc ==1.0.2
- jupyter_client ==8.3.0
- jupyter_core ==5.3.1
- jupyter_packaging ==0.12.3
- jupyter_server ==2.7.0
- jupyter_server_fileid ==0.9.0
- jupyter_server_terminals ==0.4.4
- jupyter_server_ydoc ==0.8.0
- jupyterlab ==4.0.4
- jupyterlab-pygments ==0.2.2
- jupyterlab_server ==2.24.0
- jupytext ==1.15.0
- keyring ==24.2.0
- lark ==1.1.5
- lensfun ==0.3.4
- libtorrent ==2.0.9
- lit ==15.0.7.dev0
- lockfile ==0.12.2
- louis ==3.26.0
- lxml ==4.9.2
- magic-wormhole ==0.12.0
- mallard-ducktype ==1.0.2
- markdown-it-py ==2.2.0
- matplotlib-inline ==0.1.6
- mdit-py-plugins ==0.3.5
- mdurl ==0.1.2
- mechanize ==0.4.8
- meson ==1.2.0
- mistune ==2.0.5
- more-itertools ==10.0.0
- msgpack ==1.0.5
- multivolumefile ==0.2.3
- mypy ==1.3.0
- mypy-extensions ==1.0.0
- nbclassic ==1.0.0
- nbclient ==0.8.0
- nbconvert ==7.7.3
- nbformat ==5.9.2
- nest-asyncio ==1.5.7
- netifaces ==0.11.0
- netsnmp-python ==1.0a1
- nftables ==0.1
- notebook ==7.0.2
- notebook_shim ==0.2.3
- nspektr ==0.4.0
- numpy ==1.25.1
- odfpy ==1.4.2
- ordered-set ==4.1.0
- orjson ==3.9.2
- overrides ==7.3.1
- packaging ==23.1
- pandocfilters ==1.5.0
- parso ==0.8.3
- pastel ==0.2.1
- pathspec ==0.11.2
- pdftotext ==2.2.2
- perf ==0.1
- pexpect ==4.8.0
- pickleshare ==0.7.5
- pkginfo ==1.9.6
- platformdirs ==3.9.1
- pluggy ==1.1.0
- ply ==3.11
- poetry ==1.5.1
- poetry-core ==1.6.1
- poetry-plugin-export ==1.3.0
- prometheus-client ==0.17.0
- prompt-toolkit ==3.0.39
- protobuf ==4.23.4
- psutil ==5.9.5
- ptyprocess ==0.7.0
- pulsemixer ==1.5.1
- pure-eval ==0.2.2
- py ==1.11.1.dev0
- py7zr ==0.20.5
- pyOpenSSL ==23.2.0
- pyaml ==23.5.9
- pyasn1 ==0.4.8
- pyasn1-modules ==0.2.8
- pybcj ==1.0.1
- pycairo ==1.24.0
- pychm ==0.8.6
- pycparser ==2.21
- pycryptodome ==3.18.0
- pycryptodomex ==3.12.0
- pydantic ==1.10.9
- pyenchant ==3.2.2
- pylev ==1.4.0
- pynvim ==0.4.3
- pyparsing ==3.0.9
- pyppmd ==1.0.0
- pyproject_hooks ==1.0.0
- pyrsistent ==0.19.3
- pyscard ==2.0.7
- pyserial ==3.5
- pyte ==0.8.1
- pytest ==7.4.0
- python-dateutil ==2.8.2
- python-json-logger ==2.0.7
- python-magic ==0.4.27
- python-sane ==2.9.1
- pytz ==2023.3
- pyxdg ==0.28
- pyzmq ==25.1.0
- pyzstd ==0.15.7
- ranger-fm ==1.9.3
- rapidfuzz ==3.1.2
- referencing ==0.30.0
- regex ==2023.6.3
- reportlab ==3.6.12
- requests ==2.28.2
- requests-toolbelt ==1.0.0
- requests-unixsocket ==0.3.0
- resolvelib ==1.0.1
- retrying ==1.3.3
- rfc3339-validator ==0.1.4
- rfc3986-validator ==0.1.1
- rfc3987 ==1.3.8
- rich ==13.5.2
- rpds-py ==0.9.2
- ruamel.yaml ==0.17.22
- ruamel.yaml.clib ==0.2.7
- s3cmd ==2.3.0
- scikit-build ==0.17.1
- scour ==0.38.2
- service-identity ==23.1.0
- setproctitle ==1.3.2
- sgmllib3k ==1.0.0
- shellingham ==1.5.0.post1
- six ==1.16.0
- sniffio ==1.3.0
- snowballstemmer ==2.2.0
- sortedcontainers ==2.4.0
- soupsieve ==2.4.1
- spake2 ==0.8
- speedtest-cli ==2.1.3
- sphinxcontrib-applehelp ==1.0.4
- sphinxcontrib-devhelp ==1.0.2
- sphinxcontrib-htmlhelp ==2.0.1
- sphinxcontrib-jsmath ==1.0.1
- sphinxcontrib-qthelp ==1.0.3
- sphinxcontrib-serializinghtml ==1.1.5
- stack-data ==0.6.2
- tenacity ==8.2.3.dev0
- terminado ==0.17.1
- terminator ==2.1.3
- testpath ==0.6.0
- texttable ==1.6.7
- thefuck ==3.32
- tinycss2 ==1.2.1
- toml ==0.10.2
- tomli ==2.0.1
- tomlkit ==0.11.8
- tornado ==6.2
- tqdm ==4.65.0
- traitlets ==5.9.0
- trash-cli ==0.23.2.13.2
- trove-classifiers ==2023.7.8
- txaio ==23.1.1
- txtorcon ==23.5.0
- typing_extensions ==4.7.1
- uc-micro-py ==1.0.2
- ufw ==0.36.2
- unrardll ==0.1.7
- uri-template ==1.3.0
- urllib3 ==1.26.15
- validate ==5.0.8
- validate-pyproject ==0.13.post1.dev0
- virtualenv ==20.21.0
- waitress ==2.1.2
- wcwidth ==0.2.6
- webcolors ==1.13
- webencodings ==0.5.1
- websocket-client ==1.6.1
- wsaccel ==0.6.4
- xonsh ==0.14.0
- y-py ==0.6.0
- youtube-dl ==2021.12.17
- ypy-websocket ==0.12.1
- yt-dlp ==2023.7.6
- yubikey-manager ==5.1.1
- zeroconf ==0.63.0
- zipp ==3.16.1
- zope.interface ==6.0
- zstandard ==0.21.0
to-tfjs/requirements.txt
pypi
- numpy *
- sentence-transformers *
- tensorflowjs *
- transformers *