wikiverify
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (5.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: jsteng19
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Size: 33.2 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
wikiverify
This project explores using LLMs to verify that a cited source supports a passage of text in tertiary-source research material such as Wikipedia articles. Many Wikipedia articles cite reliable sources, but do not include the supporting passage of text directly in their reference. Finding the direct supporting material can be tedious when the source is a book, academic paper, or even a news article.
The core idea of the project is to use embedding to vectorize chunks of text, then use similarity search to find chunks relevant to the supported text in question. Feeding this context to a generative model (GPT) can yield a concise supporting passage of text or determine that there are no directly supporting passages.
Using recursive chunking, even sources containing millions of tokens can be efficiently searched .
Roadmap:
- Use a similarity score cutoff to determine chunks in context
- Explore different semantic chunking methods for large sources Adverse testing: test related but unsupported claims
Requirements
langchain v0.0.354 \ openai v1.6.1 \ pinecone-client v3.1.0 \ tiktoken v0.5.2
Owner
- Name: Jake Stenger
- Login: jsteng19
- Kind: user
- Location: UCSB
- Repositories: 2
- Profile: https://github.com/jsteng19
Citation (citation-checker.ipynb)
{
"cells": [
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"!pip install -qU \\\n",
" langchain==0.0.354 \\\n",
" openai==1.6.1 \\\n",
" datasets==2.10.1 \\\n",
" pinecone-client==3.1.0 \\\n",
" tiktoken==0.5.2"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: python-dotenv in c:\\users\\jakes\\appdata\\local\\programs\\python\\python39\\lib\\site-packages (1.0.1)\n"
]
}
],
"source": [
"!pip install python-dotenv"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = os.getenv(\"OPENAI_API_KEY\")\n",
"chat = ChatOpenAI(\n",
" openai_api_key=os.environ[\"OPENAI_API_KEY\"],\n",
" model='gpt-3.5-turbo'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"def parse_file(file_path):\n",
" with open(file_path, 'r', encoding='utf-8') as file:\n",
" content = file.read()\n",
" return content\n",
"\n",
"# Adding the project path to the relative filepath\n",
"project_path = os.getcwd() # Get the current working directory\n",
"file_path = os.path.join(project_path, 'data', 'An overview of the last 10 years of genetically engineered crop safety research.txt')\n",
"text = parse_file(file_path)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Constant-size chunking:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\jakes\\AppData\\Local\\Programs\\Python\\Python39\\lib\\site-packages\\scipy\\__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3)\n",
" warnings.warn(f\"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of \"\n"
]
}
],
"source": [
"from langchain.text_splitter import NLTKTextSplitter\n",
"text_splitter = NLTKTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
"chunks = text_splitter.split_text(text)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from pinecone import Pinecone\n",
"\n",
"api_key = os.getenv(\"PINECONE_API_KEY\")\n",
"pc = Pinecone(api_key=api_key)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"from pinecone import ServerlessSpec\n",
"\n",
"spec = ServerlessSpec(\n",
" cloud=\"aws\", region=\"us-east-1\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dimension': 1536,\n",
" 'index_fullness': 0.0,\n",
" 'namespaces': {'': {'vector_count': 79}},\n",
" 'total_vector_count': 79}"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import time\n",
"\n",
"index_name = 'citation-checker'\n",
"existing_indexes = [\n",
" index_info[\"name\"] for index_info in pc.list_indexes()\n",
"]\n",
"\n",
"# check if index already exists (it shouldn't if this is first time)\n",
"if index_name not in existing_indexes:\n",
" # if does not exist, create index\n",
" pc.create_index(\n",
" index_name,\n",
" dimension=1536, # dimensionality of ada 002\n",
" metric='dotproduct',\n",
" spec=spec\n",
" )\n",
" # wait for index to be initialized\n",
" while not pc.describe_index(index_name).status['ready']:\n",
" time.sleep(1)\n",
"\n",
"# connect to index\n",
"index = pc.Index(index_name)\n",
"time.sleep(1)\n",
"# view index stats\n",
"index.describe_index_stats()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\jakes\\AppData\\Local\\Programs\\Python\\Python39\\lib\\site-packages\\langchain_core\\_api\\deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.embeddings.openai.OpenAIEmbeddings` was deprecated in langchain-community 0.1.0 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.\n",
" warn_deprecated(\n"
]
}
],
"source": [
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"\n",
"embed_model = OpenAIEmbeddings(model=\"text-embedding-ada-002\")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"79"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(chunks)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(79, 1536)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = embed_model.embed_documents(chunks)\n",
"len(res), len(res[0])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dimension': 1536,\n",
" 'index_fullness': 0.0,\n",
" 'namespaces': {'': {'vector_count': 79}},\n",
" 'total_vector_count': 79}"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ids = [str(i) for i in list(range(len(res)))]\n",
"metadata = [{'chunk': s, 'index': i} for s, i in zip(chunks, ids)]\n",
"index.upsert(vectors = zip(ids, res, metadata))\n",
"index.describe_index_stats()\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\jakes\\AppData\\Local\\Programs\\Python\\Python39\\lib\\site-packages\\langchain_core\\_api\\deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.vectorstores.pinecone.Pinecone` was deprecated in langchain-community 0.0.18 and will be removed in 0.2.0. An updated version of the class exists in the langchain-pinecone package and should be used instead. To use it run `pip install -U langchain-pinecone` and import as `from langchain_pinecone import Pinecone`.\n",
" warn_deprecated(\n",
"c:\\Users\\jakes\\AppData\\Local\\Programs\\Python\\Python39\\lib\\site-packages\\langchain_community\\vectorstores\\pinecone.py:68: UserWarning: Passing in `embedding` as a Callable is deprecated. Please pass in an Embeddings object instead.\n",
" warnings.warn(\n"
]
}
],
"source": [
"from langchain.vectorstores import Pinecone\n",
"\n",
"# initialize the vector store object\n",
"vectorstore = Pinecone(\n",
" index, embed_model.embed_query, \"chunk\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content='GE food/feed consumption\\nThe scientific records grouped under this topic are numerous\\nand constitute 40.5% of the GE food&feed literature, clearly\\nindicating the importance of the human health issues.\\n\\nThe\\ndistribution over the year is uniform, but a peak was observed\\nin 2008, probably due to the scientific fervors that followed\\nthe publication of experimental studies conducted by the\\nprivate companies after 2006 (Table 1; Figure 2).\\n\\nAccording\\nto the literature, the concerns about GE food/feed consumption that emerge from the scientific and social debates can be\\nsummarized as follows: safety of the inserted transgenic DNA\\nand the transcribed RNA, safety of the protein(s) encoded by\\nthe transgene(s) and safety of the intended and unintended\\nchange of crop composition (Dona & Arvanitoyannis, 2009;\\nParrot et al., 2010).\\n\\nSafety of the inserted transgenic DNA and the transcribed\\nRNA\\nDNA.', metadata={'index': '27'}),\n",
" Document(page_content='We have reviewed the scientific literature on GE crop safety\\nduring the last 10 years, built a classified and manageable list of scientific papers, and analyzed\\nthe distribution and composition of the published literature.\\n\\nWe selected original research\\npapers, reviews, relevant opinions and reports addressing all the major issues that emerged\\nin the debate on GE crops, trying to catch the scientific consensus that has matured since GE\\nplants became widely cultivated worldwide.\\n\\nThe scientific research conducted so far has not\\ndetected any significant hazards directly connected with the use of GE crops; however, the\\ndebate is still intense.\\n\\nAn improvement in the efficacy of scientific communication could have\\na significant impact on the future of agricultural GE.\\n\\nOur collection of scientific records is\\navailable to researchers, communicators and teachers at all levels to help create an informed,\\nbalanced public perception on the important issue of GE use in agriculture.', metadata={'index': '1'}),\n",
" Document(page_content='Conclusions\\nThe technology to produce GE plants is celebrating its 30th\\nanniversary.\\n\\nIt has brought about a dramatic increase in\\nscientific production over the years leading to high impact\\nfindings either in basic research (such as RNAi-mediated\\ngene silencing) and applied research (GE crops), but the\\nadoption of GE plants in the agricultural system has raised\\nissues about environmental and food/feed safety.\\n\\nWe have reviewed the scientific literature on GE crop\\nsafety for the last 10 years that catches the scientific\\nconsensus matured since GE plants became widely cultivated\\nworldwide, and we can conclude that the scientific research\\nconducted so far has not detected any significant hazard\\ndirectly connected with the use of GM crops.\\n\\nThe analysis of\\nthe record list shows that the Biodiversity topic dominated,\\nfollowed by Traceability and GE food/feed consumption,\\nwhich contributed equally in terms of the number of records\\n(Table 1; Figure 3).', metadata={'index': '50'})]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"vectorstore.similarity_search(query, k=3)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"def context_prompt(quote, context):\n",
" prompt = f\"\"\"Determine if the below quote is supported by the below contents. If it is supported, directly quote the supporting content from the context. \n",
"\n",
" Contexts:\n",
" {context}\n",
"\n",
" Quote: {quote}\"\"\"\n",
" return prompt"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Determine if the below quote is supported by the below contents. If it is supported, directly quote the supporting content from the context. \\n\\n Contexts:\\n GE food/feed consumption\\nThe scientific records grouped under this topic are numerous\\nand constitute 40.5% of the GE food&feed literature, clearly\\nindicating the importance of the human health issues.\\n\\nThe\\ndistribution over the year is uniform, but a peak was observed\\nin 2008, probably due to the scientific fervors that followed\\nthe publication of experimental studies conducted by the\\nprivate companies after 2006 (Table 1; Figure 2).\\n\\nAccording\\nto the literature, the concerns about GE food/feed consumption that emerge from the scientific and social debates can be\\nsummarized as follows: safety of the inserted transgenic DNA\\nand the transcribed RNA, safety of the protein(s) encoded by\\nthe transgene(s) and safety of the intended and unintended\\nchange of crop composition (Dona & Arvanitoyannis, 2009;\\nParrot et al., 2010).\\n\\nSafety of the inserted transgenic DNA and the transcribed\\nRNA\\nDNA.\\nWe have reviewed the scientific literature on GE crop safety\\nduring the last 10 years, built a classified and manageable list of scientific papers, and analyzed\\nthe distribution and composition of the published literature.\\n\\nWe selected original research\\npapers, reviews, relevant opinions and reports addressing all the major issues that emerged\\nin the debate on GE crops, trying to catch the scientific consensus that has matured since GE\\nplants became widely cultivated worldwide.\\n\\nThe scientific research conducted so far has not\\ndetected any significant hazards directly connected with the use of GE crops; however, the\\ndebate is still intense.\\n\\nAn improvement in the efficacy of scientific communication could have\\na significant impact on the future of agricultural GE.\\n\\nOur collection of scientific records is\\navailable to researchers, communicators and teachers at all levels to help create an informed,\\nbalanced public perception on the important issue of GE use in agriculture.\\nConclusions\\nThe technology to produce GE plants is celebrating its 30th\\nanniversary.\\n\\nIt has brought about a dramatic increase in\\nscientific production over the years leading to high impact\\nfindings either in basic research (such as RNAi-mediated\\ngene silencing) and applied research (GE crops), but the\\nadoption of GE plants in the agricultural system has raised\\nissues about environmental and food/feed safety.\\n\\nWe have reviewed the scientific literature on GE crop\\nsafety for the last 10 years that catches the scientific\\nconsensus matured since GE plants became widely cultivated\\nworldwide, and we can conclude that the scientific research\\nconducted so far has not detected any significant hazard\\ndirectly connected with the use of GM crops.\\n\\nThe analysis of\\nthe record list shows that the Biodiversity topic dominated,\\nfollowed by Traceability and GE food/feed consumption,\\nwhich contributed equally in terms of the number of records\\n(Table 1; Figure 3).\\nThe\\nexperimental data collected so far on authorized GE crops can\\nbe summarized as follows:\\n(a) there is no scientific evidence of toxic or allergenic\\neffects;\\n(b) some concern has been raised against GE corn MON\\n810, MON863 and NK603 (de Vendoˆmois et al., 2009;\\nSe´ralini et al., 2007, 2012), but these experimental results\\nhave been deemed of no significance (EFSA 2007, 2012;\\nHoullier, 2012; Parrot & Chassy, 2009);\\n(c) only two cases are known about the potential allergenicity of transgenic proteins, the verified case of the brazilnut storage protein in soybean, which has not been\\nmarketed (Nordlee et al., 1996) and the not verified case\\nof maize Starlink (Siruguri et al., 2004);\\n(d) during the digestion process the proteins generally\\nundergo degradation that leads to the loss of activity\\n(Delaney et al., 2008);\\n(e) even though there are examples of some ingested proteins\\nthat are absorbed in minute quantities in an essentially\\nintact form (e.g.\\neu/en/press/news/gmo070628.htm [last accessed 1 Aug 2013].\\n\\nEFSA.\\n\\n(2008).\\n\\nSafety and nutritional assessment of GM plants and\\nderived food and feed: the role of animal feeding trials.\\n\\nFood Chem\\nToxicol, 46, S2–70.\\n\\nEFSA.\\n\\n(2011).\\n\\nGuidance for risk assessment of food and feed from\\ngenetically modified plants.\\n\\nEFSA J, 9, 2150(1–37).\\n\\nEFSA.\\n\\n(2012).\\n\\nEFSA publishes initial review on GM maize and\\nherbicide study.\\n\\nAvailable from: http://www.efsa.europa.eu/en/press/\\nnews/121004.htm [last accessed 1 Aug 2013].\\n\\nEuropean Commission.\\n\\n(2006).\\n\\nNew case studies on the coexistence\\nof GM and non-GM crops in European agriculture.\\n\\nAvailable from:\\nhttp://ftp.jrc.es/EURdoc/eur22102en.pdf [last accessed 1 Aug 2013].\\n\\nEuropean Commission.\\n\\n(2010).\\n\\nA decade of EU-funded GMO research.\\n\\nAvailable from: http://ec.europa.eu/research/biosociety/pdf/a_decade_\\nof_eu-funded_gmo_research.pdf [last accessed 1 Aug 2013].\\n\\nFerradini N, Nicolia A, Capomaccio S, et al.\\n\\n(2011).\\n\\n Quote: There is a scientific consensus[338][339][340][341] that currently available food derived from GM crops poses no greater risk to human health than conventional food,[342][343][344][345][346] but that each GM food needs to be tested on a case-by-case basis before introduction.'"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"quote = \"There is a scientific consensus[338][339][340][341] that currently available food derived from GM crops poses no greater risk to human health than conventional food,[342][343][344][345][346] but that each GM food needs to be tested on a case-by-case basis before introduction.\"\n",
"\n",
"results = vectorstore.similarity_search(quote, k=5)\n",
"source_knowledge = \"\\n\".join([x.page_content for x in results])\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The quote is supported by the content, specifically by the statement: \"the scientific research conducted so far has not detected any significant hazards directly connected with the use of GE crops\" and \"there is no scientific evidence of toxic or allergenic effects\" in relation to GM crops. This indicates that there is a consensus that currently available food derived from GM crops does not pose a greater risk to human health than conventional food, but that each GM food should be tested on a case-by-case basis.\n"
]
}
],
"source": [
"from langchain.schema import (\n",
" SystemMessage,\n",
" HumanMessage,\n",
" AIMessage\n",
")\n",
"prompt = HumanMessage(\n",
" content=context_prompt(quote, source_knowledge)\n",
")\n",
"# add to messages\n",
"messages = [prompt]\n",
"\n",
"res = chat(messages)\n",
"\n",
"print(res.content)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}