durghotona-gpt
Durghotona GPT is a web-scraping and LLM-based application that automatically collects and processes accident news and generates an Excel file, eliminating manual data collection.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: ieee.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.6%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Durghotona GPT
Durghotona GPT is a web scraping and LLM-based application that can automatically collect and process accident news to generate an Excel file.
Introduction
Welcome to Durghotona GPT's GitHub repository. This project generates an accident dataset fully automatically. The program first visits a newspaper website specified by the user, collects accident news using Selenium, and then processes the collected news through a user-selected LLM to produce an accident dataset. Currently, the application can scrape news from three Bangladeshi newspapers: Prothom Alo, The Daily Star, and Dhaka Tribune. The user can choose from three LLMs: GPT-4, GPT-3.5, and Llama3. The resulting dataset can be useful for building machine learning models and for policy decision-making.
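The README does not show the dataset's actual schema, so as a rough illustration of the final step of the pipeline (structured records written out as a tabular file), the sketch below uses Python's standard library with hypothetical field names; the project's real columns may differ.

```python
import csv
import os
import tempfile

# Hypothetical accident records, as an LLM might extract them from
# scraped news articles (column names are illustrative only).
records = [
    {"date": "2024-05-01", "location": "Dhaka", "vehicles": "bus, truck", "casualties": 2},
    {"date": "2024-05-03", "location": "Chattogram", "vehicles": "motorcycle", "casualties": 1},
]

def write_dataset(records, path):
    """Write extracted accident records to a CSV file, one row per record."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

out_path = os.path.join(tempfile.gettempdir(), "accident_dataset.csv")
write_dataset(records, out_path)
```

The same records could equally be written to an Excel file (as the project does) with a library such as pandas; CSV is used here only to keep the sketch dependency-free.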
This project was built as part of the author's thesis work. The work was accepted at the 27th International Conference on Computer and Information Technology (ICCIT), and the paper is available on IEEE Xplore. If you find this work useful, please cite our paper: link
How to Use
There are two ways to use this program:
- Using the app as a web application
- Using the app locally on the user's PC
The explanation of each method is given below:
Using it as a Web Application
The authors built this work as a Streamlit application and hosted it on Hugging Face Spaces. This method is strongly recommended if you want to see a demo of the work.
Steps:
- Go to this link. If you face any loading issues, connect to a VPN and try again. You will see an interface like this:
- Choose a newspaper from which you want to collect news.
- Choose an LLM to process the news.
- Click the 'Generate Dataset' button.
- Wait a few minutes; the process takes some time depending on your input.
- Once the process is finished, you will see the generated dataset and can download it as a CSV file.
Disclaimer: Since this is only a demo, it has some limitations: it does not allow users to collect news from the "Dhaka Tribune" newspaper or to use GPT-3.5, and, due to the costs involved in using the LLMs, a maximum of 20 news reports can be processed. If you want to avoid these limitations, follow the second method.
Using it Locally on the User's PC
This method is for those who want to run the application on their own PC and avoid the limitations of the first method. The steps are given below. Note that we are using Anaconda for this purpose:
- Download this repository and unzip it into a folder.
- To use this app locally on your PC, you will need two API keys: one from OpenAI and another from Groq. You will need to add these keys to the downloaded files as follows:
i) Open the "LLMautomationGPT" Python file. On line 17, paste your OpenAI API key inside the quotation marks as shown below. Then save and close the file:
ii) Now open the "LLMautomationGPT35" Python file. Go to line 15 and paste your OpenAI API key as shown. Save and close the file.
iii) Now open the "LLMautomationGroq" Python file. Go to line 17 and paste your Groq API key as shown. Save and close the file.
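The steps above paste the keys directly into the source files, which works but makes it easy to accidentally commit a secret. Since `python-dotenv` is already listed in `requirements.txt`, an alternative (a sketch of a common pattern, not what the repository currently does) is to read the keys from environment variables instead:

```python
import os

def load_api_key(var_name):
    """Return an API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# The three LLMautomation scripts could then fetch their keys like this
# (variable names here are conventional, not mandated by the project):
# openai_key = load_api_key("OPENAI_API_KEY")
# groq_key = load_api_key("GROQ_API_KEY")
```

With `python-dotenv`, the variables can also be kept in a local `.env` file (excluded from version control) and loaded at startup with `load_dotenv()`.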
- Next, open Anaconda Prompt and navigate to the folder where you have kept the downloaded files.
- Create a virtual environment using Anaconda. The virtual environment must use Python 3.12.3. Note that during this process, Anaconda might ask for permission several times; type "y" and press Enter in these cases. In this example, we name our virtual environment `Accident_env` and pin the Python version to 3.12.3. To create the virtual environment, run the following command in Anaconda Prompt:

```
conda create -n Accident_env python=3.12.3 anaconda
```

- After creating the virtual environment, activate it and install the dependencies listed in the `requirements.txt` file using pip:

```
conda activate Accident_env
pip install -r requirements.txt
```

If the installation is successful, a window similar to the image below should appear.

- Once the installation is completed, type the following command in Anaconda Prompt:

```
streamlit run app.py
```

If everything is successful, a browser window will open up containing the web app, as shown below:
Now you are done! You can collect accident data to your heart's content 😉
Owner
- Login: Thamed-Chowdhury
- Kind: user
- Repositories: 1
- Profile: https://github.com/Thamed-Chowdhury
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Automated Accident Dataset Generator
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: MD Thamed Bin Zaman Chowdhury
    family-names: Chowdhury
    name-particle: Thamed
    email: zamanthamed@gmail.com
    affiliation: 'Department of Civil Engineering, BUET'
keywords:
  - web scraping
  - large language models
  - automation
  - road accident
  - newspaper analysis
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Dependencies
- langchain_community ==0.2.6
- langchain_core ==0.2.10
- langchain_groq ==0.1.5
- langchain_openai ==0.1.10
- lxml ==5.2.2
- lxml_html_clean ==0.1.1
- newspaper3k ==0.2.8
- pandas ==2.2.2
- python-dotenv ==1.0.1
- selenium ==4.22.0
- setuptools ==70.0.0
- streamlit ==1.35.0
- streamlit-lottie ==0.0.5
- webdriver-manager ==4.0.1