durghotona-gpt
Durghotona GPT is a web-scraping and LLM-based application that automatically collects and processes accident news and generates an Excel file, eliminating manual data collection.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: ieee.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Durghotona GPT
Durghotona GPT is a web scraping and LLM-based application that can automatically collect and process accident news to generate an Excel file.
Introduction
Welcome to Durghotona GPT's GitHub repository. This project generates an accident dataset fully automatically. The program first visits a newspaper website specified by the user, collects accident news using Selenium, and then processes the collected news through a user-selected LLM to produce an accident dataset. Currently, the application can scrape news from three Bangladeshi newspapers: Prothom Alo, The Daily Star, and Dhaka Tribune. The user can choose from three LLMs: GPT-4, GPT-3.5, and Llama3. The resulting dataset can be useful for building machine learning models and for policy decision-making.
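The README does not show the dataset's actual schema, so as a rough illustration of the final step of the pipeline (structured records written out as a tabular file), the sketch below uses Python's standard library with hypothetical field names; the project's real columns may differ.

```python
import csv
import os
import tempfile

# Hypothetical accident records, as an LLM might extract them from
# scraped news articles (column names are illustrative only).
records = [
    {"date": "2024-05-01", "location": "Dhaka", "vehicles": "bus, truck", "casualties": 2},
    {"date": "2024-05-03", "location": "Chattogram", "vehicles": "motorcycle", "casualties": 1},
]

def write_dataset(records, path):
    """Write extracted accident records to a CSV file, one row per record."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

out_path = os.path.join(tempfile.gettempdir(), "accident_dataset.csv")
write_dataset(records, out_path)
```

The same records could equally be written to an Excel file (as the project does) with a library such as pandas; CSV is used here only to keep the sketch dependency-free.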
This project was built as part of the author's thesis work. The work was accepted at the 27th International Conference on Computer and Information Technology (ICCIT), and the paper is available on IEEE Xplore. If you find this work useful, please cite our paper: link
How to Use
There are two ways to use this program:
- Using the app as a web application
- Using the app locally on the user's PC
The explanation of each method is given below:
Using it as a Web Application
The authors built this work as a Streamlit application and hosted it on Hugging Face Spaces. This method is strongly recommended if you want to see a demo of the work.
Steps:
- Go to this link. If you face any loading issues, connect to a VPN and try again. You will see an interface like this:
- Choose a newspaper from which you want to collect news.
- Choose an LLM to process the news.
- Click the 'Generate Dataset' button.
- Wait a few minutes; the process takes some time depending on your input.
- Once the process is finished, you will see the generated dataset and can download it as a CSV file.
Disclaimer: Since this is only a demo, it has some limitations: it does not allow users to collect news from the "Dhaka Tribune" newspaper or to use GPT-3.5, and, due to the costs involved in using the LLMs, a maximum of 20 news reports can be processed. If you want to avoid these limitations, follow the second method.
Using it Locally on the User's PC
This method is for those who want to run the application on their own PC and avoid the limitations of the first method. The steps are given below. Note that we are using Anaconda for this purpose:
- Download this repository and unzip it into a folder.
- To use this app locally on your PC, you will need two API keys: one from OpenAI and another from Groq. You will need to add these keys to the downloaded files as follows:
i) Open the "LLMautomationGPT" Python file. On line 17, paste your OpenAI API key inside the quotation marks as shown below. Then save and close the file:
ii) Now open the "LLMautomationGPT35" Python file. Go to line 15 and paste your OpenAI API key as shown. Save and close the file.
iii) Now open the "LLMautomationGroq" Python file. Go to line 17 and paste your Groq API key as shown. Save and close the file.
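The steps above paste the keys directly into the source files, which works but makes it easy to accidentally commit a secret. Since `python-dotenv` is already listed in `requirements.txt`, an alternative (a sketch of a common pattern, not what the repository currently does) is to read the keys from environment variables instead:

```python
import os

def load_api_key(var_name):
    """Return an API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# The three LLMautomation scripts could then fetch their keys like this
# (variable names here are conventional, not mandated by the project):
# openai_key = load_api_key("OPENAI_API_KEY")
# groq_key = load_api_key("GROQ_API_KEY")
```

With `python-dotenv`, the variables can also be kept in a local `.env` file (excluded from version control) and loaded at startup with `load_dotenv()`.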
- Next, open Anaconda Prompt and navigate to the folder where you have kept the downloaded files.
- Create a virtual environment using Anaconda. The virtual environment must use Python 3.12.3. Note that during this process, Anaconda might ask for permission several times; type "y" and press Enter in these cases. In this example, we name our virtual environment `Accident_env` and pin the Python version to 3.12.3. To create the virtual environment, run the following command in Anaconda Prompt:

```
conda create -n Accident_env python=3.12.3 anaconda
```

- After creating the virtual environment, activate it and install the dependencies listed in the `requirements.txt` file using pip:

```
conda activate Accident_env
pip install -r requirements.txt
```

If the installation is successful, a window similar to the image below should appear.

- Once the installation is completed, type the following command in Anaconda Prompt:

```
streamlit run app.py
```

If everything is successful, a browser window will open up containing the web app, as shown below:
Now you are done! You can collect accident data to your heart's content 😉
Owner
- Login: Thamed-Chowdhury
- Kind: user
- Repositories: 1
- Profile: https://github.com/Thamed-Chowdhury
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Automated Accident Dataset Generator
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: MD Thamed Bin Zaman Chowdhury
    family-names: Chowdhury
    name-particle: Thamed
    email: zamanthamed@gmail.com
    affiliation: 'Department of Civil Engineering, BUET'
keywords:
  - web scraping
  - large language models
  - automation
  - road accident
  - newspaper analysis
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Dependencies
- langchain_community ==0.2.6
- langchain_core ==0.2.10
- langchain_groq ==0.1.5
- langchain_openai ==0.1.10
- lxml ==5.2.2
- lxml_html_clean ==0.1.1
- newspaper3k ==0.2.8
- pandas ==2.2.2
- python-dotenv ==1.0.1
- selenium ==4.22.0
- setuptools ==70.0.0
- streamlit ==1.35.0
- streamlit-lottie ==0.0.5
- webdriver-manager ==4.0.1