https://github.com/damslabumbc/llm-scraper

LLM Web Scraper

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

LLM Web Scraper

Basic Info
  • Host: GitHub
  • Owner: DAMSlabUMBC
  • Language: Python
  • Default Branch: main
  • Size: 302 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed 12 months ago
Metadata Files
  • Readme
  • Contributing

README.md

LLM Scraper 🕺

A dynamic web-scraper that uses LLMs to extract and analyze web contents.

How does it work?

We use Beautiful Soup to parse HTML content and send each unique element to an LLM for content analysis. In the final step, we use the results from the LLM to generate triplets, which are then input into a knowledge graph.
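The parse, deduplicate, analyze, and triplet steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses the standard library's `html.parser` in place of Beautiful Soup so that it runs without dependencies, and `fake_analyze` is a hypothetical stand-in for the LLM call. The names `TextElementCollector`, `extract_triplets`, and `BLOCK_TAGS` are assumptions introduced for this example.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple

Triplet = Tuple[str, str, str]

# Tags whose text counts as one "element" in this sketch (an assumption).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}


class TextElementCollector(HTMLParser):
    """Collect the text of each block-level element, skipping duplicates."""

    def __init__(self) -> None:
        super().__init__()
        self._current_tag: Optional[str] = None
        self._buffer: List[str] = []
        self._seen = set()
        self.elements: List[Tuple[str, str]] = []  # (tag, text) in document order

    def handle_starttag(self, tag, attrs):
        # Only the outermost block tag opens a new element.
        if tag in BLOCK_TAGS and self._current_tag is None:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Normalize whitespace so near-identical elements deduplicate.
            text = " ".join("".join(self._buffer).split())
            if text and text not in self._seen:
                self._seen.add(text)
                self.elements.append((tag, text))
            self._current_tag = None


def fake_analyze(text: str) -> List[Triplet]:
    """Hypothetical stand-in for the LLM: treat text as 'subject verb object'."""
    words = text.rstrip(".").split()
    if len(words) < 3:
        return []
    return [(words[0], words[1], " ".join(words[2:]))]


def extract_triplets(html: str, analyze: Callable[[str], List[Triplet]]) -> List[Triplet]:
    """Parse HTML, then pass each unique element's text to the analyzer."""
    collector = TextElementCollector()
    collector.feed(html)
    collector.close()
    triplets: List[Triplet] = []
    for _tag, text in collector.elements:
        triplets.extend(analyze(text))
    return triplets


SAMPLE_HTML = (
    "<h1>Camera records video</h1>"
    "<p>Camera records video</p>"
    "<p>Hub  stores   logs.</p>"
)
print(extract_triplets(SAMPLE_HTML, fake_analyze))
# [('Camera', 'records', 'video'), ('Hub', 'stores', 'logs')]
```

Passing the analyzer in as a function keeps the parsing step independent of any particular model, so a real LLM client could replace `fake_analyze` without touching the rest of the sketch.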

| Feature                  | Model            |
| ------------------------ | ---------------- |
| WebDriver                | Selenium ✅       |
| HTML Parser              | Beautiful Soup ✅ |
| Knowledge Graph          | OpenAI ✅         |
| Analyze Text             | OpenAI ✅         |
| Analyze Images           | BLIP ✅           |
| Analyze Videos           | OpenAI ✅         |
| Analyze Audio Recordings | OpenAI ✅         |
| Analyze Code Snippets    | OpenAI ✅         |

ArangoDB

After retrieving triplets, we generate a knowledge graph. To generate the knowledge graph, you must have ArangoDB installed on your system and a root account set up. ArangoDB is available for Ubuntu, Docker, Debian, and other platforms, as well as for Windows and macOS. During installation, you will be prompted to set up your root account.

Finally, before running the knowledge graph code, change the username and password in the database connection to match your ArangoDB account. The connection is defined in KG.py:

db = client.db("IoT-KG", username="root", password="yourPassword")
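As an alternative to editing the password directly in KG.py, the credentials could be read from environment variables. The sketch below is one possible approach and not part of the repository: the variable names `ARANGO_PASSWORD`, `ARANGO_USER`, `ARANGO_DB`, and `ARANGO_HOST`, and the helpers `arango_settings` and `connect`, are hypothetical. It also assumes that `client` in the line above is a python-arango `ArangoClient`, and that the server listens on ArangoDB's default port 8529.

```python
import os
from typing import Dict, Mapping, Optional


def arango_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the keyword arguments for client.db() from environment variables.

    The variable names are hypothetical; the defaults mirror the README line.
    """
    env = os.environ if env is None else env
    password = env.get("ARANGO_PASSWORD")
    if not password:
        raise RuntimeError("Set ARANGO_PASSWORD before running the KG code")
    return {
        "name": env.get("ARANGO_DB", "IoT-KG"),
        "username": env.get("ARANGO_USER", "root"),
        "password": password,
    }


def connect(env: Optional[Mapping[str, str]] = None):
    """Open the database handle (assumes the python-arango driver is installed)."""
    # Imported lazily so arango_settings() works without the driver.
    from arango import ArangoClient

    env = os.environ if env is None else env
    client = ArangoClient(hosts=env.get("ARANGO_HOST", "http://localhost:8529"))
    return client.db(**arango_settings(env))
```

With this in place, `db = connect()` would replace the hard-coded line, and the password stays out of version control. Since python-dotenv is already listed in server/requirements.txt, calling `load_dotenv()` before `connect()` would let these variables live in a local `.env` file.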

Owner

  • Name: DAMS lab
  • Login: DAMSlabUMBC
  • Kind: organization

DAta Management & Semantics lab at UMBC

GitHub Events

Total
  • Watch event: 4
  • Delete event: 13
  • Issue comment event: 2
  • Member event: 2
  • Push event: 176
  • Pull request event: 66
  • Fork event: 2
  • Create event: 19
Last Year
  • Watch event: 4
  • Delete event: 13
  • Issue comment event: 2
  • Member event: 2
  • Push event: 176
  • Pull request event: 66
  • Fork event: 2
  • Create event: 19

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 32
  • Average time to close issues: N/A
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 0
  • Total pull request authors: 4
  • Average comments per issue: 0
  • Average comments per pull request: 0.06
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 0
  • Pull requests: 32
  • Average time to close issues: N/A
  • Average time to close pull requests: about 18 hours
  • Issue authors: 0
  • Pull request authors: 4
  • Average comments per issue: 0
  • Average comments per pull request: 0.06
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 5
Top Authors
Issue Authors
Pull Request Authors
  • haimgia2 (15)
  • KentTDang (14)
  • dependabot[bot] (5)
  • Niraj-Dhakall (4)
  • BimaPDev (1)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (5)
  • python (1)

Dependencies

server/requirements.txt pypi
  • Pillow ==11.0.0
  • Requests ==2.32.3
  • beautifulsoup4 ==4.12.3
  • openai ==1.52.0
  • python-dotenv ==1.0.1
  • torch ==2.5.0
  • transformers ==4.45.2
server/scripts/setup.py pypi