https://github.com/capjamesg/nanosearch

Build a search engine from a website sitemap.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.0%) to scientific vocabulary

Keywords

search-engine web-search
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

nanosearch

Nanosearch is an in-memory search engine designed for small (< 10,000 URL) websites.

With Nanosearch, you can build a search engine in a few lines of code.

Nanosearch supports the BM25 and TF/IDF algorithms.
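
To give a feel for what BM25 ranking computes, here is a minimal, dependency-free sketch. It is not Nanosearch's implementation; the function name `bm25_scores`, the whitespace tokenization, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Illustrative only: `docs` must be a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "how to brew coffee at home".split(),
    "a guide to tea".split(),
    "coffee coffee and more coffee".split(),
]
scores = bm25_scores("coffee".split(), docs)
```

In this toy corpus the third document scores highest because it mentions the query term most often, and the second scores zero because it never mentions it.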

Nanosearch also computes a link graph and uses the number of inlinks to a page as a ranking factor. This helps order results when several pages are relevant to a query by keyword alone.
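
The idea of inlinks as a ranking factor can be sketched as follows. How Nanosearch actually combines inlink counts with keyword scores is not documented here, so this example assumes the simplest possible scheme: inlinks break ties between pages with equal keyword scores. The helpers `count_inlinks` and `rank`, and the sample URLs, are hypothetical.

```python
from collections import Counter


def count_inlinks(link_graph):
    """Count how many distinct pages link to each URL.

    `link_graph` maps a page URL to the list of URLs it links out to.
    """
    inlinks = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                inlinks[target] += 1
    return inlinks


def rank(keyword_scores, inlinks):
    """Order matching URLs by keyword score, breaking ties by inlink count.

    Assumption: tie-breaking is only one way to use inlinks as a factor.
    """
    matches = [url for url, score in keyword_scores.items() if score > 0]
    return sorted(
        matches,
        key=lambda url: (keyword_scores[url], inlinks[url]),
        reverse=True,
    )


link_graph = {
    "/": ["/coffee", "/about"],
    "/about": ["/coffee"],
    "/coffee": ["/coffee-gear", "/coffee"],
}
keyword_scores = {"/coffee": 1.2, "/coffee-gear": 1.2, "/about": 0.0}

inlinks = count_inlinks(link_graph)
ranked = rank(keyword_scores, inlinks)
```

Here `/coffee` and `/coffee-gear` match the query equally well, but `/coffee` has two inlinks against one, so it is listed first.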

Installation

    pip install nanosearch

Quickstart

Build a Search Engine from a Sitemap

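```python
from nanosearch import NanoSearchBM25

# Build a BM25 index from the pages listed in the sitemap.
engine = NanoSearchBM25().from_sitemap(
    "https://jamesg.blog/sitemap.xml",
    # Keep only the part of each page title before the first "|".
    title_transforms=[lambda x: x.split("|")[0]],
)
results = engine.search("coffee")

print(results)
```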
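
The list of title transforms passed in the example above is a list of callables. A plausible reading, stated here as an assumption rather than confirmed library behavior, is that each callable is applied to a page title in order. The helper `apply_title_transforms` and the sample title below are hypothetical and only illustrate that idea.

```python
def apply_title_transforms(title, transforms):
    """Run `title` through each callable in order and return the result."""
    for transform in transforms:
        title = transform(title)
    return title


transforms = [
    lambda x: x.split("|")[0],  # keep the text before the first "|"
    str.strip,                  # drop leftover surrounding whitespace
]
cleaned = apply_title_transforms(
    "Coffee brewing guide | James' Coffee Blog", transforms
)
print(cleaned)
```

Stripping a site-name suffix this way keeps result titles short and avoids indexing the same boilerplate words for every page.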

Build a Search Engine from a List of URLs

```python
from nanosearch import NanoSearchBM25

urls = [
    "https://jamesg.blog/",
    "https://jamesg.blog/coffee",
]

engine = NanoSearchBM25().from_urls(urls)
results = engine.search("coffee")

print(results)
```

Save an Index to Disk

You can save an index to disk and load it later with:

```python
# Write the in-memory index to a JSON file.
engine.to_nanosearch_json("index.json")

# Later, rebuild the engine from the saved file.
engine = NanoSearchBM25().from_nanosearch_json("index.json")
```

Supported Algorithms

Nanosearch supports the following search algorithms:

  • TF/IDF
  • BM25
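
For comparison with BM25, a minimal TF/IDF scorer might look like the sketch below. It is not taken from Nanosearch; the function name `tfidf_scores`, the length-normalized term frequency, and the `log(N / df) + 1` IDF variant are assumptions, and other weighting schemes are equally common.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each tokenized document by summed TF-IDF over the query terms."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term]:
                idf = math.log(n_docs / df[term]) + 1
                # Term frequency is normalized by document length.
                score += (tf[term] / len(doc)) * idf
        scores.append(score)
    return scores


docs = [
    "how to brew coffee at home".split(),
    "a guide to tea".split(),
    "coffee coffee and more coffee".split(),
]
tfidf = tfidf_scores(["coffee"], docs)
```

Unlike BM25, this score grows linearly with term frequency, so a page that repeats a word many times keeps gaining score instead of levelling off.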

License

This project is licensed under an MIT license.

Owner

  • Name: James
  • Login: capjamesg
  • Kind: user
  • Location: Scotland
  • Company: @Roboflow

from words, wonder.

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

setup.py pypi
  • beautifulsoup4 *
  • getsitemap *
  • numpy *
  • rank-bm25 *
  • requests *
  • scikit-learn *