https://github.com/capjamesg/nanosearch
Build a search engine from a website sitemap.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: capjamesg
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://jamesg.blog/2024/05/29/nanosearch/
- Size: 9.77 KB
Statistics
- Stars: 11
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
nanosearch
Nanosearch is an in-memory search engine designed for small websites (fewer than 10,000 URLs).
With Nanosearch, you can build a search engine in a few lines of code.
Nanosearch supports the BM25 and TF/IDF algorithms.
Nanosearch also computes a link graph and uses the number of inlinks to a page as a ranking factor. This helps order results when several pages match a query's keywords equally well.
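To make the inlink idea concrete, here is a minimal sketch of how an inlink count could be folded into a keyword score. It is not Nanosearch's actual implementation: the function names `count_inlinks` and `rerank`, the log-damped boost, and the sample link graph are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def count_inlinks(link_graph):
    """Count how many distinct pages link to each URL (self-links ignored)."""
    inlinks = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                inlinks[target] += 1
    return inlinks


def rerank(keyword_scores, inlinks):
    """Order URLs by keyword score multiplied by a log-damped inlink boost.

    The boost formula is an assumption for this sketch, not Nanosearch's.
    """
    combined = {
        url: score * (1 + math.log1p(inlinks[url]))
        for url, score in keyword_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical site: each key links to the pages in its list.
link_graph = {
    "/": ["/coffee", "/about"],
    "/about": ["/coffee"],
    "/coffee": ["/"],
    "/tea": ["/coffee"],
}
inlinks = count_inlinks(link_graph)

# Three pages that tie on keyword relevance; inlinks break the tie.
keyword_scores = {"/coffee": 1.0, "/tea": 1.0, "/about": 1.0}
ranking = rerank(keyword_scores, inlinks)
print(ranking)
```

With equal keyword scores, the page with the most inlinks ranks first and the page nothing links to ranks last. The logarithm keeps a heavily linked page from swamping keyword relevance entirely.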
Installation
```bash
pip install nanosearch
```
Quickstart
Build a Search Engine from a Sitemap
```python
from nanosearch import NanoSearchBM25

# Index every URL in the sitemap; keep only the part of each title before "|".
engine = NanoSearchBM25().from_sitemap(
    "https://jamesg.blog/sitemap.xml",
    title_transforms=[lambda x: x.split("|")[0]],
)
results = engine.search("coffee")

print(results)
```
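The title transform in the example above (`lambda x: x.split("|")[0]`) keeps only the text before the first `|` in a page title. The sketch below shows the effect on a made-up title. The helper `apply_transforms`, the extra `str.strip` step, and the sample titles are assumptions for illustration; how Nanosearch applies transforms internally may differ.

```python
from typing import Callable, List


def apply_transforms(title: str, transforms: List[Callable[[str], str]]) -> str:
    """Apply each transform to the title in order (hypothetical helper)."""
    for transform in transforms:
        title = transform(title)
    return title


transforms = [
    lambda x: x.split("|")[0],  # drop the site-name suffix after "|"
    str.strip,                  # remove the whitespace left behind
]

cleaned = apply_transforms("Coffee brewing guide | Example Blog", transforms)
print(cleaned)
```

A title with no `|` passes through unchanged, since `split` then returns the whole string as its first element.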
Build a Search Engine from a List of URLs
```python
from nanosearch import NanoSearchBM25

urls = [
    "https://jamesg.blog/",
    "https://jamesg.blog/coffee",
]

engine = NanoSearchBM25().from_urls(urls)
results = engine.search("coffee")

print(results)
```
Save an Index to Disk
You can save an index to disk and load it later with:
```python
# Write the index to disk.
engine.to_nanosearch_json("index.json")

# Later, rebuild the engine from the saved file.
engine = NanoSearchBM25().from_nanosearch_json("index.json")
```
Supported Algorithms
Nanosearch supports the following search algorithms:
- TF/IDF
- BM25
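For readers unfamiliar with the two algorithms, the sketch below scores a toy corpus with textbook TF/IDF and Okapi BM25. It is a from-scratch illustration, not Nanosearch's code: the function names, the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_score(query, doc, docs):
    """Sum of raw term frequency times log inverse document frequency."""
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df:
            score += counts[term] * math.log(len(docs) / df)
    return score


def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 with a non-negative IDF; k1 and b are common defaults."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = counts[term]
        # Term frequency saturates, and longer documents are penalized.
        norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


docs = [
    tokenize("coffee coffee coffee coffee beans"),
    tokenize("coffee brewing guide"),
    tokenize("tea brewing guide"),
]
query = tokenize("coffee")

tfidf = [tf_idf_score(query, d, docs) for d in docs]
bm25 = [bm25_score(query, d, docs) for d in docs]
print(tfidf)
print(bm25)
```

In this toy corpus the first document repeats "coffee" four times. TF/IDF scores it exactly four times higher than the second document, while BM25 dampens the repetition, so the gap between the two is much smaller.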
License
This project is licensed under an MIT license.
Owner
- Name: James
- Login: capjamesg
- Kind: user
- Location: Scotland
- Company: @Roboflow
- Website: jamesg.blog
- Repositories: 320
- Profile: https://github.com/capjamesg
from words, wonder.
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- beautifulsoup4 *
- getsitemap *
- numpy *
- rank-bm25 *
- requests *
- scikit-learn *