all_citations_assignment_beyondchats

https://github.com/consolebuddy/all_citations_assignment_beyondchats

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: consolebuddy
Language: Python
Default Branch: main
Size: 830 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme Citation

README.txt

README:


Setup Instructions:


Clone the repository to your local machine.
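Ensure you have Python installed (version 3.8 or higher; the pinned torch 2.3.0 and transformers 4.41.0 releases do not support older versions).
Install the required dependencies by running ‘pip install -r requirements.txt’.
Download the spaCy English model that CitationFinder.py loads by running ‘python -m spacy download en_core_web_sm’.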
Ensure an active internet connection for fetching data from the API.


Execution Instructions:


Open a terminal or command prompt.
Navigate to the directory containing the cloned repository.


To get all citations corresponding to each response:
* Run the main.py script using the command ‘python3 main.py’.
* The program fetches data from the specified API, finds citations for each object, and stores them in a JSON file named ‘all_citations.json’ in the same directory.
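
The flow described above can be sketched roughly as follows. This is only an illustration, not the actual main.py: the API call is left out because the README does not give the URL, the item keys "response" and "sources" and the shape of the output records are assumptions, and build_all_citations and save_json are hypothetical helper names. The find_citations argument stands for any callable with the same signature as CitationFinder.find_citations.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List

Citation = Dict[str, Any]
Finder = Callable[[str, List[Dict[str, Any]]], List[Citation]]


def build_all_citations(items: List[Dict[str, Any]], find_citations: Finder) -> List[Dict[str, Any]]:
    """Pair every response with the citations found for it.

    Assumes each item looks like {"response": str, "sources": [...]};
    these key names are a guess, not taken from the repository.
    """
    results = []
    for item in items:
        results.append({
            "response": item["response"],
            "citations": find_citations(item["response"], item["sources"]),
        })
    return results


def save_json(results: List[Dict[str, Any]], path: str = "all_citations.json") -> None:
    """Write the results as pretty-printed UTF-8 JSON."""
    text = json.dumps(results, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")
```

In the real script the callable would presumably be CitationFinder().find_citations, and items would come from the API response.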


To run the UI using Streamlit:
* Run the app.py file using the command ‘streamlit run app.py’.
* This shows the matching citations for each response in a table.

Result / Output:
- After running main.py, a JSON output file is created, as shown in 'output_images/output_json'.
- After running app.py, the UI opens in the browser, as shown in 'output_images/output_streamlit'.


Note:


The code itself runs fine on my PC, so any errors you encounter are most likely caused by a library installation problem, since the project needs many dependencies. Please check Stack Overflow to resolve them. If an error still persists, please let me know.
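
When an installation problem is suspected, a quick first step is to compare the installed package versions with the pins in requirements.txt. The sketch below does that with the standard library only (importlib.metadata, Python 3.8+). The PINS table copies the six pinned versions from requirements.txt; check_pins and mismatches are hypothetical helper names. Streamlit and the spaCy model en_core_web_sm are not in that file even though app.py and CitationFinder.py rely on them, so they may need to be installed separately.

```python
from importlib import metadata  # available from Python 3.8
from typing import Dict, Optional, Tuple

# Versions pinned in requirements.txt.
PINS = {
    "nltk": "3.8.1",
    "regex": "2024.5.15",
    "requests": "2.31.0",
    "spacy": "3.7.4",
    "torch": "2.3.0",
    "transformers": "4.41.0",
}

Report = Dict[str, Tuple[str, Optional[str]]]


def check_pins(pins: Dict[str, str]) -> Report:
    """Map each package to (wanted version, installed version or None)."""
    report: Report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed)
    return report


def mismatches(report: Report) -> Report:
    """Keep only the packages whose installed version differs from the pin."""
    return {name: pair for name, pair in report.items() if pair[0] != pair[1]}


if __name__ == "__main__":
    for name, (wanted, installed) in mismatches(check_pins(PINS)).items():
        print(f"{name}: wanted {wanted}, found {installed or 'not installed'}")
```

An empty output means every pinned package is present at the expected version, which points the search for the error elsewhere.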

Owner

Login: consolebuddy
Kind: user

Repositories: 1
Profile: https://github.com/consolebuddy

Citation (CitationFinder.py)

import torch
import spacy
import nltk
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import regex as re

# NLTK data is downloaded in CitationFinder.__init__, not at import time.

class CitationFinder:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.nlp = spacy.load("en_core_web_sm")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
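        # Fetch the NLTK tokenizer and stop-word data (skipped if already present).
        nltk.download('punkt', quiet=True)
        nltk.download('stopwords', quiet=True)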
        self.stop_words = set(stopwords.words('english'))

    def remove_special_characters(self, text):
        pattern = r'[^a-zA-Z0-9 ]'
        cleaned_text = re.sub(pattern, '', text)
        return cleaned_text

    def remove_links(self, text):
        url_pattern = r'https?://\S+|www\.\S+'
        cleaned_text = re.sub(url_pattern, '', text)
        return cleaned_text

    def lemmatize_words(self, words):
        doc = self.nlp(' '.join(words))
        lemmatized_words = [token.lemma_ for token in doc]
        return lemmatized_words

    def preprocess_text(self, text):
        # text = self.remove_links(text)
        text = self.remove_special_characters(text)
        text = ' '.join(text.split())
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word.lower() not in self.stop_words]
        text = ' '.join(tokens)
        return text

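    def embed_text(self, text):
        # Truncate to the model's maximum length so long contexts do not overflow it.
        tokens = self.tokenizer(text, return_tensors='pt', truncation=True)
        with torch.no_grad():  # inference only, no gradients needed
            outputs = self.model(**tokens)
        # First-token ([CLS]) embedding. Note: sentence-transformers models such as
        # all-MiniLM-L6-v2 are normally used with mean pooling instead.
        embeddings = outputs.last_hidden_state[:, 0, :]
        return embeddings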

    def cosine_similarity(self, embedding1, embedding2):
        return torch.nn.functional.cosine_similarity(embedding1, embedding2).item()

    def get_similarity(self, text1, text2):
        embedding1 = self.embed_text(text1)
        embedding2 = self.embed_text(text2)
        return self.cosine_similarity(embedding1, embedding2)

    def extract_keywords(self, response):
        doc = self.nlp(response)
        keywords = [token.text.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"]]
        keywords = self.lemmatize_words(keywords)
        # Drop duplicates while keeping first-seen order.
        keywords = list(dict.fromkeys(keywords))
        return keywords

    def clean_citations(self, citations):
        citations = list({item['id']: item for item in citations}.values())
        citations = list({item['context'].lower(): item for item in citations}.values())
        citations = [{"id": citation["id"], "link": citation["link"]} for citation in citations]
        return citations


    def find_citations(self, response, sources, thres=0.8):
        citations = []
        for r in response.split('.'):
            r = self.preprocess_text(r)
            keywords = self.extract_keywords(r)
            if not keywords:
                continue  # empty fragment or no nouns: nothing can match

            for source in sources:
                context = source["context"]
                if isinstance(context, list):
                    context = ' '.join(context)
                clean_context = self.preprocess_text(context)

                # Cheap keyword check first; compute embeddings only when it passes.
                if not any(keyword in clean_context.lower() for keyword in keywords):
                    continue
                if self.get_similarity(r, clean_context) > thres:
                    # Store the joined string so clean_citations can lowercase it.
                    citations.append({"id": source["id"], "context": context, "link": source["link"]})

        return self.clean_citations(citations)
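
The decision rule in find_citations and the deduplication in clean_citations can be tried out without loading any model. The sketch below is a simplified stand-in, not the class itself: it uses plain Python lists as toy embedding vectors, and cosine, is_cited and dedupe are hypothetical names. It assumes, as the class does, that a source is cited only when it shares a keyword with the sentence and the cosine similarity exceeds the threshold.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_cited(keywords: List[str], context: str,
             sentence_vec: Sequence[float], context_vec: Sequence[float],
             thres: float = 0.8) -> bool:
    """Both conditions must hold: a shared keyword and similarity above thres."""
    has_keyword = any(k in context.lower() for k in keywords)
    return has_keyword and cosine(sentence_vec, context_vec) > thres


def dedupe(citations: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Same idea as clean_citations: unique by id, then by lowercased context.

    A dict keeps the position of the first key but the value of the last
    assignment, so for equal contexts the later citation wins.
    """
    by_id = {c["id"]: c for c in citations}
    by_context = {c["context"].lower(): c for c in by_id.values()}
    return [{"id": c["id"], "link": c["link"]} for c in by_context.values()]
```

For example, dedupe([{"id": "s1", "context": "A", "link": "x"}, {"id": "s2", "context": "a", "link": "y"}]) returns [{"id": "s2", "link": "y"}]: the two contexts differ only in case, so only the later entry survives.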

Dependencies

requirements.txt pypi

nltk ==3.8.1
regex ==2024.5.15
requests ==2.31.0
spacy ==3.7.4
torch ==2.3.0
transformers ==4.41.0
