https://github.com/damslabumbc/llm-scraper
LLM Web Scraper
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.7%) to scientific vocabulary
Repository
LLM Web Scraper
Basic Info
- Host: GitHub
- Owner: DAMSlabUMBC
- Language: Python
- Default Branch: main
- Size: 302 MB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
LLM Scraper 🕺
A dynamic web-scraper that uses LLMs to extract and analyze web contents.
How does it work?
We use Beautiful Soup to parse HTML content and send each unique element to an LLM for content analysis. In the final step, we use the results from the LLM to generate triplets, which are then input into a knowledge graph.
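The flow described above (parse, deduplicate elements, analyze each one, collect triplets) can be sketched roughly as follows. This is a minimal illustration and not the repository's code: it swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra packages, and `fake_analyze` is a hypothetical stand-in for the real LLM call. The class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import Callable, List, Set, Tuple

Triplet = Tuple[str, str, str]
Element = Tuple[str, str]  # (tag name, normalized text)


class UniqueTextCollector(HTMLParser):
    """Collect each distinct (tag, text) pair once, in document order."""

    SKIP = {"script", "style"}            # text we never want to analyze
    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self) -> None:
        super().__init__()
        self._stack: List[str] = []
        self._seen: Set[Element] = set()
        self.elements: List[Element] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or not self._stack or self._stack[-1] in self.SKIP:
            return
        key = (self._stack[-1], text)
        if key not in self._seen:          # send each unique element only once
            self._seen.add(key)
            self.elements.append(key)


def extract_triplets(
    html: str, analyze: Callable[[str, str], List[Triplet]]
) -> List[Triplet]:
    """Run `analyze` on every unique element and gather the triplets."""
    collector = UniqueTextCollector()
    collector.feed(html)
    collector.close()
    triplets: List[Triplet] = []
    for tag, text in collector.elements:
        triplets.extend(analyze(tag, text))
    return triplets


def fake_analyze(tag: str, text: str) -> List[Triplet]:
    # Hypothetical placeholder: a real analyzer would call an LLM here.
    return [("page", f"has_{tag}", text)]


html = "<html><body><h1>Smart Bulb</h1><p>Uses Wi-Fi</p><p>Uses Wi-Fi</p></body></html>"
triplets = extract_triplets(html, fake_analyze)
```

Passing the analyzer in as a callable keeps the deduplication step testable without network access; the duplicated `<p>` above is analyzed only once.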
| Feature                  | Model             |
| ------------------------ | ----------------- |
| WebDriver                | Selenium ✅       |
| HTML Parser              | Beautiful Soup ✅ |
| Knowledge Graph          | OpenAI ✅         |
| Analyze Text             | OpenAI ✅         |
| Analyze Images           | BLIP ✅           |
| Analyze Videos           | OpenAI ✅         |
| Analyze Audio Recordings | OpenAI ✅         |
| Analyze Code Snippets    | OpenAI ✅         |
ArangoDB
After retrieving the triplets, we generate a knowledge graph. To do so, you must have ArangoDB installed on your system with a root account set up. ArangoDB is available for Ubuntu, Docker, Debian, and other platforms, as well as Windows and macOS. During installation, you will be prompted to set up your root account.
Finally, before running the KG code, change the username and password in the database connection to match your ArangoDB account. The connection is defined in KG.py:
db = client.db("IoT-KG", username="root", password="yourPassword")
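As an alternative to editing the password directly in KG.py, the credentials could be read from environment variables. The sketch below is only a suggestion under several assumptions: that `client` is a python-arango `ArangoClient` (the call signature above matches it, but python-arango is not in the listed dependencies), that the server runs on ArangoDB's default port 8529, and that the variable names `ARANGO_USERNAME` and `ARANGO_PASSWORD` are acceptable; they are hypothetical and do not come from the repository.

```python
import os
from typing import Dict, Mapping, Optional


def arango_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the username/password keyword arguments for the connection.

    ARANGO_USERNAME and ARANGO_PASSWORD are hypothetical variable names.
    A missing password raises KeyError instead of silently using a default.
    """
    env = os.environ if env is None else env
    return {
        "username": env.get("ARANGO_USERNAME", "root"),
        "password": env["ARANGO_PASSWORD"],
    }


def connect(db_name: str = "IoT-KG", hosts: str = "http://localhost:8529"):
    """Open the database handle (assumes the python-arango package is installed)."""
    from arango import ArangoClient  # imported lazily; assumed dependency

    client = ArangoClient(hosts=hosts)
    return client.db(db_name, **arango_credentials())
```

With `ARANGO_PASSWORD` exported in the shell, or loaded from a `.env` file via python-dotenv (which is among the listed dependencies), `db = connect()` would replace the hard-coded line.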
Owner
- Name: DAMS lab
- Login: DAMSlabUMBC
- Kind: organization
- Repositories: 2
- Profile: https://github.com/DAMSlabUMBC
DAta Management & Semantics lab at UMBC
GitHub Events
Total
- Watch event: 4
- Delete event: 13
- Issue comment event: 2
- Member event: 2
- Push event: 176
- Pull request event: 66
- Fork event: 2
- Create event: 19
Last Year
- Watch event: 4
- Delete event: 13
- Issue comment event: 2
- Member event: 2
- Push event: 176
- Pull request event: 66
- Fork event: 2
- Create event: 19
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 32
- Average time to close issues: N/A
- Average time to close pull requests: about 18 hours
- Total issue authors: 0
- Total pull request authors: 4
- Average comments per issue: 0
- Average comments per pull request: 0.06
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 5
Past Year
- Issues: 0
- Pull requests: 32
- Average time to close issues: N/A
- Average time to close pull requests: about 18 hours
- Issue authors: 0
- Pull request authors: 4
- Average comments per issue: 0
- Average comments per pull request: 0.06
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 5
Top Authors
Issue Authors
Pull Request Authors
- haimgia2 (15)
- KentTDang (14)
- dependabot[bot] (5)
- Niraj-Dhakall (4)
- BimaPDev (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- Pillow ==11.0.0
- Requests ==2.32.3
- beautifulsoup4 ==4.12.3
- openai ==1.52.0
- python-dotenv ==1.0.1
- torch ==2.5.0
- transformers ==4.45.2