Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 10 months ago · JSON representation ·

Repository

Basic Info
  • Host: GitHub
  • Owner: consolebuddy
  • Language: Python
  • Default Branch: main
  • Size: 830 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.txt

README:


Setup Instructions:


Clone the repository to your local machine.
Ensure you have Python installed (version 3.6 or higher).
Install the required dependencies by running pip install -r requirements.txt.
Ensure an active internet connection for fetching data from the API.


Execution Instructions:


Open a terminal or command prompt.
Navigate to the directory containing the cloned repository.


To get all citations corresponding to each response:
* Run the main.py script using the command ‘python3 main.py’
* The program will fetch data from the specified API, find citations for each object, and store it to a json file name ‘all_citations.json’ in the same directory


To Run UI using Streamlit:
* Run app.py file using command ‘streamlit run app.py’
* This will show matching citation for each response in a table format.

Result / Output :
- After running main.py file, a json output file will be created as shown in 'output_images/output_json'
- After running app.py, UI will shown on browser as shown in 'output_images/output_streamlit'


Note:


Code is running completely fine on my PC but There may be chances that you will encounter errors while executing the code. This may be because of a library installation error. Please check on StackOverFlow and resolve the issues. If still error persist please let me know. Because code in iteself is completely fine. Since it requires a lot of dependencies that why there may be some kind of error take place.

Owner

  • Login: consolebuddy
  • Kind: user

Citation (CitationFinder.py)

import torch
import spacy
import nltk
from transformers import AutoModel, AutoTokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import regex as re

nltk.download('punkt')

class CitationFinder:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.nlp = spacy.load("en_core_web_sm")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        nltk.download('punkt')
        nltk.download('stopwords')
        self.stop_words = set(stopwords.words('english'))

    def remove_special_characters(self, text):
        pattern = r'[^a-zA-Z0-9 ]'
        cleaned_text = re.sub(pattern, '', text)
        return cleaned_text

    def remove_links(self, text):
        url_pattern = r'https?://\S+|www\.\S+'
        cleaned_text = re.sub(url_pattern, '', text)
        return cleaned_text

    def lemmatize_words(self, words):
        doc = self.nlp(' '.join(words))
        lemmatized_words = [token.lemma_ for token in doc]
        return lemmatized_words

    def preprocess_text(self, text):
        # text = self.remove_links(text)
        text = self.remove_special_characters(text)
        text = ' '.join(text.split())
        tokens = word_tokenize(text)
        tokens = [word for word in tokens if word.lower() not in self.stop_words]
        text = ' '.join(tokens)
        return text

    def embed_text(self, text):
        tokens = self.tokenizer(text, return_tensors='pt')
        outputs = self.model(**tokens)
        embeddings = outputs.last_hidden_state[:, 0, :]  # Get the embeddings of [CLS] token
        return embeddings

    def cosine_similarity(self, embedding1, embedding2):
        return torch.nn.functional.cosine_similarity(embedding1, embedding2).item()

    def get_similarity(self, text1, text2):
        embedding1 = self.embed_text(text1)
        embedding2 = self.embed_text(text2)
        return self.cosine_similarity(embedding1, embedding2)

    def extract_keywords(self, response):
        doc = self.nlp(response)
        keywords = [token.text.lower() for token in doc if token.pos_ in ["NOUN", "PROPN"]]
        keywords = self.lemmatize_words(keywords)
        seen = set()
        keywords = [x for x in keywords if not (x in seen or seen.add(x))]
        return keywords

    def clean_citations(self, citations):
        citations = list({item['id']: item for item in citations}.values())
        citations = list({item['context'].lower(): item for item in citations}.values())
        citations = [{"id": citation["id"], "link": citation["link"]} for citation in citations]
        return citations


    def find_citations(self, response, sources, thres=0.8):
        citations = []
        for r in response.split('.'):
            r = self.preprocess_text(r)
            keywords = self.extract_keywords(r)

            for source in sources:
                context = source["context"]
                if isinstance(context, list):
                    context = ' '.join(context)
                context = self.preprocess_text(context)
                similarity_score = self.get_similarity(r, context)

                if any(keyword in context.lower() for keyword in keywords) and similarity_score > thres:
                    citations.append({"id": source["id"], "context": source["context"], "link": source["link"]})

        return self.clean_citations(citations)

GitHub Events

Total
Last Year

Dependencies

requirements.txt pypi
  • nltk ==3.8.1
  • regex ==2024.5.15
  • requests ==2.31.0
  • spacy ==3.7.4
  • torch ==2.3.0
  • transformers ==4.41.0