crawlgpt
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 7.5%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: gh18l
- License: mit
- Language: Python
- Default Branch: main
- Size: 18.4 MB
Statistics
- Stars: 203
- Watchers: 6
- Forks: 33
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
CrawlGPT
⚡ A fully automated web crawler that crawls any information you want from the Internet with GPT-3.5. Built with 🦜️🔗LangChain👍👍⚡
Simple Demo
https://github.com/gh18l/CrawlGPT/assets/16774158/593fc8f3-71d1-45fb-aca1-0bfc6a92a996
What can it do?
- Fully automated web crawling that simulates, as closely as possible, the way a human searches for data.
- Automatically collects all specified details on a given theme, across the entire internet or within a given web domain.
- Automatically searches the internet for answers to fill in missing details while crawling.
- ✍️👇A simple example👇✍️
- Input:
  - The theme you want to crawl:
    Cases of mergers and acquisitions of fast food industry enterprises in America after 2010
  - 0-th specific detail: When the merger occurred
  - 1-th specific detail: Acquirer
  - 2-th specific detail: Acquired party
  - 3-th specific detail: The CEO of acquirer
  - 4-th specific detail: The CEO of acquired party
  - (Optional) Limited web domain: ["nytimes.com", "cnn.com"]
- Output: a JSON object containing all specified details about the theme. The `details` list has length N. The format of the output is:

```json
{
  "events_num": N,
  "details": [
    {
      "When the merger occurred": "<answer>",
      "Acquirer": "<answer>",
      "Acquired party": "<answer>",
      "The CEO of acquirer": "<answer>",
      "The CEO of acquired party": "<answer>",
      "source_url": "<url>"
    },
    ...
  ]
}
```
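The aggregation into this schema can be sketched in plain Python. `build_output` is a hypothetical helper name used for illustration, not the pipeline's actual function:

```python
import json

def build_output(extracted_events):
    # Wrap the per-page extraction results in the output schema above:
    # events_num is simply the length of the details list.
    # (build_output is a hypothetical name; pipeline.py may structure this differently.)
    return {"events_num": len(extracted_events), "details": extracted_events}

events = [{
    "When the merger occurred": "<answer>",
    "Acquirer": "<answer>",
    "Acquired party": "<answer>",
    "The CEO of acquirer": "<answer>",
    "The CEO of acquired party": "<answer>",
    "source_url": "<url>",
}]
print(json.dumps(build_output(events), indent=2))
```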
Why does a web crawler need GPT?
- GPT can extract the necessary information by directly understanding the content of each webpage, rather than relying on hand-written crawling rules.
- GPT can connect to the internet to verify the accuracy of crawl results or to supplement missing information.
How does it work?
- Think up suitable Google search queries for the theme with GPT-3.5.
- Run a Google search for each query, across the entire Internet or within the given web domains (if any).
- Browse every resulting website.
- Extract the specified details of the theme from each website's content with GPT-3.5.
- Similar to Auto-GPT, independently search the Internet for missing details, based on the LangChain implementations of MRKL and ReAct.
- Encapsulate all results into a JSON object.
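The steps above can be sketched as a simple loop. The three helpers are stubs standing in for the GPT-3.5 and Serper calls made by the real LangChain-based pipeline; their names and signatures are assumptions, not the project's actual API:

```python
def generate_queries(theme, query_num):
    # Real version: ask GPT-3.5 for suitable Google queries; stubbed here.
    return [f"{theme} ({i})" for i in range(query_num)]

def google_search(query, domains=None):
    # Real version: call the Serper API, optionally restricted to `domains`; stubbed here.
    return ["https://example.com/article"]

def extract_details(url, detail_list):
    # Real version: have GPT-3.5 read the page and pull out each detail,
    # falling back to MRKL/ReAct-style searches for anything missing; stubbed here.
    return {detail: "<answer>" for detail in detail_list} | {"source_url": url}

def crawl(theme, detail_list, query_num=2, domains=None):
    # For each generated query, search, browse every result, and extract details.
    events = []
    for query in generate_queries(theme, query_num):
        for url in google_search(query, domains):
            events.append(extract_details(url, detail_list))
    return events
```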
Quick Install
- OPENAI_API_KEY: You must have an OpenAI API key; set `os.environ["OPENAI_API_KEY"]` in `pipeline.py`.
- SERPER_API_KEY: To search for correct, real-time information, you need a Google Serper API key. Registration takes only a short time and gives you 1,000 free queries every month. Set `os.environ["SERPER_API_KEY"]` in `pipeline.py`.
- Hyperparameters:
  - QUERY_NUM: the number of Google searches with different queries. Default: 2.
  - QUERY_RESULTS_NUM: the number of results returned per search. Default: 4.
  - THEME: the theme of the web crawl.
  - DETAIL_LIST: the specific details of the web crawler theme.
  - (Optional) URL_DOMAIN_LIST: the allowed web domains or URL prefixes.
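Put together, the keys and hyperparameters might look like this near the top of `pipeline.py`. This is a sketch using the README's example theme; the exact placement of these variables in the file is an assumption:

```python
import os

# API keys (see above); replace the placeholders with your own keys.
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["SERPER_API_KEY"] = "your-serper-key"

# Hyperparameters, filled in with the README's example theme and details.
QUERY_NUM = 2          # Google searches with different queries
QUERY_RESULTS_NUM = 4  # results returned per search
THEME = "Cases of mergers and acquisitions of fast food industry enterprises in America after 2010"
DETAIL_LIST = [
    "When the merger occurred",
    "Acquirer",
    "Acquired party",
    "The CEO of acquirer",
    "The CEO of acquired party",
]
URL_DOMAIN_LIST = ["nytimes.com", "cnn.com"]  # optional
```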
- Install Python 3.11.
- Install the necessary dependencies: `pip install -r requirements.txt`
- Run it: `python pipeline.py > output.txt`
- Read the results from `final_dict.json`.
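Reading the results back is then just a JSON load. A minimal sketch, assuming `final_dict.json` has the `{"events_num": ..., "details": [...]}` shape shown earlier:

```python
import json

def load_results(path="final_dict.json"):
    # Load the crawler's output; assumes the schema shown earlier in the README.
    with open(path) as f:
        return json.load(f)

# Example usage:
# results = load_results()
# print(results["events_num"], "events crawled")
# for event in results["details"]:
#     print(event["source_url"])
```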
TODO
- [x] Support crawling within a given list of web domains.
- [ ] The LangChain implementations of MRKL and ReAct carry a risk of divergent output; that is, the content of the response may exceed our limits.
- [ ] Automatically write research reports based on the crawl results.
- [ ] GPT consumes a huge number of tokens while browsing webpages 😢. Reduce the consumption.
- [ ] Browse PDF files linked from websites.
- [ ] Make the entire pipeline registration-free (except for OpenAI).
Contact me and communication
I am currently working as an AI engineer at Alibaba in Beijing, China. I believe communication can eliminate information gaps.
I am interested in LLM applications such as conversational search, AI agents, and external data enhancement. You are welcome to reach out via email (hanxyz1818@gmail.com) or WeChat (if you have one).
I am also preparing to build an actual product. My recent (still immature) ideas are about LLM + crypto and LLM + code. If you are interested in these too, or have other ideas, you are also welcome to contact me.
Owner
- Name: Han
- Login: gh18l
- Kind: user
- Repositories: 1
- Profile: https://github.com/gh18l
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chase"
    given-names: "Harrison"
title: "LangChain"
date-released: 2022-10-17
url: "https://github.com/hwchase17/langchain"
```
GitHub Events
Total
- Watch event: 7
- Fork event: 1
Last Year
- Watch event: 7
- Fork event: 1
Dependencies
- autodoc_pydantic ==1.8.0
- myst_nb *
- myst_parser *
- nbsphinx ==0.8.9
- pydata-sphinx-theme ==0.13.1
- sphinx ==4.5.0
- sphinx-autobuild ==2021.3.14
- sphinx-panels *
- sphinx-typlog-theme ==0.8.0
- sphinx_book_theme *
- sphinx_copybutton *
- sphinx_rtd_theme ==1.0.0
- toml *
- bs4 *
- langchain *
- openai *
- tiktoken *