all_citations_assignment_beyondchats
https://github.com/consolebuddy/all_citations_assignment_beyondchats
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.3%) to scientific vocabulary
Last synced: 10 months ago
Repository
Basic Info
- Host: GitHub
- Owner: consolebuddy
- Language: Python
- Default Branch: main
- Size: 830 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
README.txt
README:

Setup Instructions:
- Clone the repository to your local machine.
- Ensure you have Python installed (version 3.8 or higher; the pinned torch 2.3.0 and transformers 4.41.0 releases do not support older versions).
- Install the required dependencies by running 'pip install -r requirements.txt'.
- Ensure an active internet connection for fetching data from the API.

Execution Instructions:
- Open a terminal or command prompt.
- Navigate to the directory containing the cloned repository.

To get all citations corresponding to each response:
* Run the main.py script using the command 'python3 main.py'.
* The program will fetch data from the specified API, find citations for each object, and store them in a JSON file named 'all_citations.json' in the same directory.

To run the UI using Streamlit:
* Run the app.py file using the command 'streamlit run app.py'.
* This will show the matching citations for each response in a table format.

Result / Output:
- After running main.py, a JSON output file will be created, as shown in 'output_images/output_json'.
- After running app.py, the UI will be shown in the browser, as shown in 'output_images/output_streamlit'.

Note: The code runs fine on my PC, but you may still encounter errors when executing it. The most likely cause is a library installation problem, since the project has many dependencies. Please check Stack Overflow to resolve such issues. If the error persists, please let me know.
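Since the note above attributes most failures to library installation, a small pre-flight check can narrow down what is missing before running main.py or app.py. The following is only a sketch and is not part of the repository: the function names are hypothetical, the module list is inferred from requirements.txt plus the 'streamlit' command and the 'en_core_web_sm' spaCy model that CitationFinder.py loads, and the default minimum Python version of 3.8 is an assumption based on the pinned torch and transformers releases.

```python
import importlib.util
import sys

# Assumption: import names inferred from requirements.txt, the README's
# 'streamlit run app.py' command, and spacy.load("en_core_web_sm").
REQUIRED_MODULES = [
    "nltk", "regex", "requests", "spacy", "torch", "transformers",
    "streamlit", "en_core_web_sm",
]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=(3, 8)):
    """Check the running interpreter against a (major, minor) minimum."""
    return sys.version_info[:2] >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python is older than the assumed minimum of 3.8")
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found")
```

If 'en_core_web_sm' is reported as missing, spaCy's documented way to install it is 'python -m spacy download en_core_web_sm'.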
Owner
- Login: consolebuddy
- Kind: user
- Repositories: 1
- Profile: https://github.com/consolebuddy
Citation (CitationFinder.py)
import torch
import spacy
import nltk
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import regex as re


class CitationFinder:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.nlp = spacy.load("en_core_web_sm")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        # Fetch the NLTK data needed by word_tokenize and the stop-word list.
        nltk.download('punkt')
        nltk.download('stopwords')
        self.stop_words = set(stopwords.words('english'))

    def remove_special_characters(self, text):
        # Keep only ASCII letters, digits, and spaces.
        pattern = r'[^a-zA-Z0-9 ]'
        cleaned_text = re.sub(pattern, '', text)
        return cleaned_text

    def remove_links(self, text):
        url_pattern = r'https?://\S+|www\.\S+'
        cleaned_text = re.sub(url_pattern, '', text)
        return cleaned_text

    def lemmatize_words(self, words):
        doc = self.nlp(' '.join(words))
        lemmatized_words = [token.lemma_ for token in doc]
        return lemmatized_words

    def preprocess_text(self, text):
        # text = self.remove_links(text)
        text = self.remove_special_characters(text)
        text = ' '.join(text.split())  # collapse repeated whitespace
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word.lower() not in self.stop_words]
        text = ' '.join(tokens)
        return text

    def embed_text(self, text):
        # Truncate so inputs longer than the model's limit are cut off instead of failing.
        tokens = self.tokenizer(text, return_tensors='pt', truncation=True)
        with torch.no_grad():  # inference only, so no gradients are needed
            outputs = self.model(**tokens)
        embeddings = outputs.last_hidden_state[:, 0, :]  # Get the embeddings of [CLS] token
        return embeddings

    def cosine_similarity(self, embedding1, embedding2):
        return torch.nn.functional.cosine_similarity(embedding1, embedding2).item()

    def get_similarity(self, text1, text2):
        embedding1 = self.embed_text(text1)
        embedding2 = self.embed_text(text2)
        return self.cosine_similarity(embedding1, embedding2)

    def extract_keywords(self, response):
        doc = self.nlp(response)
        keywords = [token.text.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"]]
        keywords = self.lemmatize_words(keywords)
        # Remove duplicates while preserving the original order.
        seen = set()
        keywords = [x for x in keywords if not (x in seen or seen.add(x))]
        return keywords

    def clean_citations(self, citations):
        # Deduplicate by id, then by case-insensitive context, then drop the context field.
        citations = list({item['id']: item for item in citations}.values())
        citations = list({item['context'].lower(): item for item in citations}.values())
        citations = [{"id": citation["id"], "link": citation["link"]} for citation in citations]
        return citations

    def find_citations(self, response, sources, thres=0.8):
        citations = []
        for r in response.split('.'):
            r = self.preprocess_text(r)
            if not r:
                continue  # an empty fragment has no keywords and can never match
            keywords = self.extract_keywords(r)
            for source in sources:
                raw_context = source["context"]
                if isinstance(raw_context, list):
                    raw_context = ' '.join(raw_context)
                context = self.preprocess_text(raw_context)
                similarity_score = self.get_similarity(r, context)
                if any(keyword in context.lower() for keyword in keywords) and similarity_score > thres:
                    # Store the joined string so clean_citations can call .lower() on it.
                    citations.append({"id": source["id"], "context": raw_context, "link": source["link"]})
        return self.clean_citations(citations)
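The deduplication in clean_citations relies on dictionary key overwriting, which has a consequence that is easy to miss: when two sources share the same context text (ignoring case), the later one replaces the earlier one. The standalone sketch below reproduces the same three steps as a plain function so the behavior can be checked without loading any model. The sample records and example.com links are made up for illustration.

```python
def clean_citations(citations):
    # Step 1: one entry per id (a later duplicate overwrites an earlier one).
    by_id = {item["id"]: item for item in citations}
    # Step 2: one entry per case-insensitive context string.
    by_context = {item["context"].lower(): item for item in by_id.values()}
    # Step 3: keep only the fields that go into the output.
    return [{"id": c["id"], "link": c["link"]} for c in by_context.values()]


# Hypothetical sample data: ids 1 and 2 carry the same context, differing only in case.
sample = [
    {"id": 1, "context": "Pricing starts at $10", "link": "https://example.com/a"},
    {"id": 1, "context": "Pricing starts at $10", "link": "https://example.com/a"},
    {"id": 2, "context": "pricing starts at $10", "link": "https://example.com/b"},
    {"id": 3, "context": "Refunds take 5 days", "link": "https://example.com/c"},
]

result = clean_citations(sample)
print(result)
# [{'id': 2, 'link': 'https://example.com/b'}, {'id': 3, 'link': 'https://example.com/c'}]
```

Here id 1 disappears from the output because id 2 has the same context text. If the first match should win instead, one option would be to skip keys that are already present rather than overwrite them.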
Dependencies
requirements.txt
pypi
- nltk ==3.8.1
- regex ==2024.5.15
- requests ==2.31.0
- spacy ==3.7.4
- torch ==2.3.0
- transformers ==4.41.0