https://github.com/chris-santiago/bookmarks-topics
Using unsupervised learning and language modeling to cluster and reorganize web bookmarks.
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
bookmarks-topics
This project is a continuation of the stale bookmarks_clustering project. It has been updated to use newer embedding and generative models, mostly via the BERTopic library.
Usage
Prerequisites
- This project uses Task to run and manage tasks, so you'll need to first install that on your machine.
- This project uses OpenAI's API. You'll need an API key from OpenAI; place it in a `.env` file within this project's root directory. The key should be `OPENAI_KEY` and the value is your API key. For example:
  ```toml
  OPENAI_KEY=sk-proj-_mySuperSecretOpenAIkey
  ```
- Export your bookmarks to an HTML file. Note: this project used Google Chrome bookmarks.
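To make the API-key prerequisite above concrete, here is a minimal sketch of how a script could read `OPENAI_KEY` from a `.env` file using only the standard library. The helper names `parse_env` and `load_openai_key` are hypothetical and are not taken from this project, which may load the key differently.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_openai_key(env_path=".env", environ=None):
    """Prefer an already-exported OPENAI_KEY, then fall back to the .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_KEY")
    path = Path(env_path)
    if not key and path.is_file():
        key = parse_env(path.read_text(encoding="utf-8")).get("OPENAI_KEY")
    if not key:
        raise RuntimeError("OPENAI_KEY not found in the environment or .env file")
    return key
```

Calling `load_openai_key()` from the project root would then return the key, or raise a clear error if the prerequisite was skipped.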
Setup
Clone this repo and install the project and dependencies:
```bash
git clone https://github.com/chris-santiago/bookmarks-topics.git
cd bookmarks-topics
conda env create -f environment.yaml
pip install .
```
Quick Start
Once you've completed the prerequisites and set up the project environment, you can run the entire pipeline using the command:
```bash
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html"
```
This will parse your bookmarks file and fetch content from all the bookmarked URLs, before running the clustering algorithm. You may not want to organize ALL of your bookmarks, but rather a subset. In this case, you can pass a comma-separated list of specific folders:
```bash
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html" "bookmarks.folders=[My first folder,My second folder]"
```
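To illustrate what restricting a run to specific folders involves, here is a rough sketch that parses a Netscape-style bookmarks export with the standard library and keeps only bookmarks nested under the named folders. The `BookmarkParser` class, the `bookmarks_in_folders` helper, and the sample file are all hypothetical; this is not the project's own parser.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect bookmarks plus the folder path each one sits in."""

    def __init__(self):
        super().__init__()
        self.stack = []             # one entry per open <DL>; None for the root list
        self.pending_folder = None  # name from the last <H3>, waiting for its <DL>
        self.bookmarks = []
        self._capture = None        # "h3" or "a" while inside that tag
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self._capture = tag
            self._text = []
            self._href = dict(attrs).get("href") if tag == "a" else None
        elif tag == "dl":
            self.stack.append(self.pending_folder)
            self.pending_folder = None

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl":
            if self.stack:
                self.stack.pop()
            return
        if tag != self._capture:
            return
        text = "".join(self._text).strip()
        if tag == "h3":
            self.pending_folder = text
        elif self._href:
            path = [name for name in self.stack if name]
            self.bookmarks.append({"folders": path, "title": text, "url": self._href})
        self._capture = None


def bookmarks_in_folders(html_text, folders=None):
    """Return all bookmarks, or only those under one of the given folder names."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    if not folders:
        return parser.bookmarks
    wanted = set(folders)
    return [b for b in parser.bookmarks if wanted.intersection(b["folders"])]


# Hypothetical export with two folders.
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3>My first folder</H3>
<DL><p>
<DT><A HREF="https://example.com/a">Page A</A>
</DL><p>
<DT><H3>Other</H3>
<DL><p>
<DT><A HREF="https://example.com/b">Page B</A>
</DL><p>
</DL><p>
"""

selected = bookmarks_in_folders(SAMPLE, ["My first folder"])
```

Matching on any ancestor folder is a design choice of this sketch, so bookmarks in subfolders of a selected folder are kept as well.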
Once complete, your re-organized bookmarks are placed in a newly created `outputs/topics/` directory within this project's root directory. That directory is organized by date and time; find the folder that corresponds to your most recent run and import the `new_bookmarks.html` file back into your browser. You can also view a breakdown of bookmarks and topics in the `bookmarks_topics.json` file within that same directory.
Note: If you haven't added `task` to your PATH, you can replace that command with `./bin/task`.
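Finding the most recent run by hand can get tedious after several runs. The sketch below picks the newest `new_bookmarks.html` under a given topics output directory by file modification time. The `latest_output` helper is hypothetical, and relying on modification time rather than on the date-and-time folder names is an assumption of this sketch.

```python
from pathlib import Path


def latest_output(root, filename="new_bookmarks.html"):
    """Return the most recently modified `filename` under `root`, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    # Search every run folder below the root, however deeply it is nested.
    candidates = list(root.rglob(filename))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Pass the topics output directory described above as `root`; the returned path is the file to import into your browser.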
Example Output
HTML
```html
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3>JavaScript D3.js</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/questions/32205507/moving-the-axes-in-d3-js">javascript - Moving the axes in d3.js - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/25158688/d3-csv-accessor-function-for-loop">javascript - D3.csv accessor function for loop - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/33482812/javascript-take-every-nth-element-of-array">Javascript: take every nth Element of Array - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/23227991/how-to-add-in-zero-values-into-a-time-series-in-d3-js-javascript">How to add in zero values into a time series in d3.js / JavaScript - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/1187518/how-to-get-the-difference-between-two-arrays-in-javascript">How to get the difference between two arrays in JavaScript? - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/16179021/d3-js-specify-text-for-x-axis">javascript - d3.js Specify text for x-axis - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/43646573/d3-get-attributes-from-element/43646752">javascript - D3 get attributes from element - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/28572015/how-to-select-unique-values-in-d3-js-from-data/28572315">javascript - How to select unique values in d3.js from data - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/10644778/targeting-nested-elements-with-css">html - Targeting nested elements with CSS - Stack Overflow</A>
<DT><A HREF="https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference/5044#5044">MathJax basic tutorial and quick reference - Mathematics Meta Stack Exchange</A>
<DT><A HREF="https://stackoverflow.com/questions/46945784/how-to-debug-javascript-in-visual-studio-code-with-live-server-running">How to Debug JavaScript in Visual Studio Code with live-server Running - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/52788743/intellij-error-java-release-version-10-not-supported/54963753">jetbrains ide - IntelliJ: Error: java: release version 10 not supported - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/20197961/reversed-y-axis-d3">javascript - reversed Y-axis D3 - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/49281258/plot-multiple-lines-in-a-for-loop-in-d3">d3.js - Plot multiple lines in a for loop in d3 - Stack Overflow</A>
</DL><p>
```
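The excerpt above follows the Netscape bookmark layout, with one `<H3>` folder per topic. As a sketch of how such a file could be produced from topic assignments, the hypothetical `render_bookmarks` function below turns a mapping of topic to `(title, url)` pairs into that layout. It is an illustration only and not necessarily how this project writes its output.

```python
from html import escape


def render_bookmarks(groups):
    """Render {topic: [(title, url), ...]} as a Netscape bookmark file."""
    lines = [
        "<!DOCTYPE NETSCAPE-Bookmark-file-1>",
        '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">',
        "<TITLE>Bookmarks</TITLE>",
        "<H1>Bookmarks</H1>",
        "<DL><p>",
    ]
    for topic, items in groups.items():
        # Each topic becomes a folder heading followed by its own list.
        lines.append(f"<DT><H3>{escape(topic)}</H3>")
        lines.append("<DL><p>")
        for title, url in items:
            lines.append(f'<DT><A HREF="{escape(url, quote=True)}">{escape(title)}</A>')
        lines.append("</DL><p>")
    lines.append("</DL><p>")
    return "\n".join(lines) + "\n"


page = render_bookmarks(
    {
        "JavaScript D3.js": [
            (
                "javascript - reversed Y-axis D3 - Stack Overflow",
                "https://stackoverflow.com/questions/20197961/reversed-y-axis-d3",
            )
        ]
    }
)
```

Escaping titles and URLs keeps characters such as `&` or `<` in page titles from corrupting the file a browser will import.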
JSON
```json
[
  {
    "url": "https://appliedcausalinference.github.io/aci_book",
    "title": "Applied Causal Inference",
    "topic": "Bayesian Causal Inference"
  },
  {
    "url": "https://astral.sh/blog/u",
    "title": "uv: Python packaging in Rust",
    "topic": "Python Development Tools"
  },
  {
    "url": "https://bayesiancomputationbook.com/markdown/chp_01.htm",
    "title": "1. Bayesian Inference \u2014 Bayesian Modeling and Computation in Python",
    "topic": "Bayesian Causal Inference"
  }
]
```
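Because the JSON output is a flat list of records, a few lines of Python are enough to summarize it by topic. The sketch below reuses the three records shown above; the `group_by_topic` helper is hypothetical and is only meant to show one way of inspecting the file.

```python
import json
from collections import defaultdict

# Raw string so the JSON escape \u2014 reaches json.loads unchanged.
SAMPLE = r"""
[
  {"url": "https://appliedcausalinference.github.io/aci_book",
   "title": "Applied Causal Inference",
   "topic": "Bayesian Causal Inference"},
  {"url": "https://astral.sh/blog/u",
   "title": "uv: Python packaging in Rust",
   "topic": "Python Development Tools"},
  {"url": "https://bayesiancomputationbook.com/markdown/chp_01.htm",
   "title": "1. Bayesian Inference \u2014 Bayesian Modeling and Computation in Python",
   "topic": "Bayesian Causal Inference"}
]
"""


def group_by_topic(records):
    """Map each topic to the titles assigned to it, preserving input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record["topic"]].append(record["title"])
    return dict(groups)


groups = group_by_topic(json.loads(SAMPLE))
# Largest topics first.
for topic, titles in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(f"{topic}: {len(titles)} bookmark(s)")
```

In practice you would replace `SAMPLE` with the contents of the `bookmarks_topics.json` file from your run directory.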
Tinkering
This project is configured using Hydra, and the current configs are found in the conf directory. You can modify behavior by editing these configs directly or by overriding them on the command line.
| Config | Use                                                        | Path                    |
|--------|------------------------------------------------------------|-------------------------|
| Main   | Main configuration file. Use this to tune the topic model. | conf/config.yaml        |
| Prompt | Configure LLM prompts.                                     | conf/prompt/*           |
| Paths  | Configure your local paths.                                | conf/paths/default.yaml |
| Hydra  | Configure Hydra.                                           | conf/hydra/default.yaml |
CLI Override
You can override much of the configuration directly from the command line by passing Hydra overrides after -- in the command. For example:
```bash
task cluster-bookmarks -- "topics.topic_model.top_n_words=5"
```
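For intuition about what a dotted override such as `topics.topic_model.top_n_words=5` does, the sketch below applies one `key=value` override to a nested dictionary. It is a simplified illustration using only the standard library, not Hydra's implementation: Hydra's real override grammar is richer (lists, booleans, appending and deleting keys). The `apply_override` helper and the starting value of 10 are hypothetical.

```python
import ast
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    dotted, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    try:
        value = ast.literal_eval(raw)  # numbers and quoted strings
    except (ValueError, SyntaxError):
        value = raw                    # anything else is kept as a plain string
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted.split(".")
    for name in parents:
        # Walk down the nested mapping, creating levels that do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = value
    return result


# Hypothetical starting config; the real defaults live in conf/config.yaml.
base = {"topics": {"topic_model": {"top_n_words": 10}}}
updated = apply_override(base, "topics.topic_model.top_n_words=5")
```

The original `base` mapping is left untouched, which mirrors the idea that command-line overrides change a single run rather than the config files on disk.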
Tasks
You can also run individual tasks, each of which executes the corresponding Python module. This is useful when you are tuning the topic model (task: topics) and want to avoid re-fetching and re-parsing HTML from your bookmarked URLs.
```bash
task: Available tasks for this project:
* bookmarks:          Read bookmarks file
* check-config:       Check Hydra configuration
* cluster-bookmarks:  Run entire bookmarks clustering pipeline.
* fetch-html:         Get bookmarks raw html
* lint:               Check source code for errors (will run before tasks)
* parse-html:         Parse bookmarks raw html
* topics:             Get topics
```
Owner
- Name: Chris Santiago
- Login: chris-santiago
- Kind: user
- Repositories: 64
- Profile: https://github.com/chris-santiago
GitHub Events
Total
- Watch event: 1
- Delete event: 1
- Push event: 11
- Pull request event: 2
- Create event: 4
Last Year
- Watch event: 1
- Delete event: 1
- Push event: 11
- Pull request event: 2
- Create event: 4
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- chris-santiago (2)
Dependencies
- hydra-core ==1.3.2
- openai-1.54.4 *
- polars ==1.13.1
- ruff ==0.7.3