citationtool

https://github.com/ivanp1994/citationtool

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
✓
DOI references
Found 8 DOI reference(s) in README
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.5%) to scientific vocabulary

Last synced: 6 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: ivanp1994
Language: Python
Default Branch: main
Size: 583 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme Citation

A single script that uses DOI to manually insert APA style references. To use it, have a Word document that has a reference in the form of [LR:{doi}] where {doi} is a valid doi. For example, the following sentence:

For example, according to [LR: https://doi.org/10.1016/j.cmet.2024.07.004] which is a single citation!

Links to a paper. The exact paper, formatted in APA style is:

Zhang, Y., Wang, X., Lin, J., Liu, J., Wang, K., Nie, Q., Ye, C., Sun, L., Ma, Y., Qu, R., Mao, Y., Zhang, X., Lu, H., Xia, P., Zhao, D., Wang, G., Zhang, Z., Fu, W., Jiang, C., & Pang, Y. (2024). A microbial metabolite inhibits the HIF-2α-ceramide pathway to mediate the beneficial effects of time-restricted feeding on MASH. Cell Metabolism, 36(8), 1823-1838.e6. https://doi.org/10.1016/j.cmet.2024.07.004

Since it has a ton of authors, we'd quote it as "Zhang et al., 2024" so the above sentence is:

For example, according to (Zhang et al., 2024) which is a single citation!

Multiple references can be chained together, for example:

Now I will have multiple citations in one [LR: https://doi.org/10.1016/j.cell.2024.07.014; https://doi.org/10.1038/s41586-024-07754-w].

Becomes:

Now I will have multiple citations in one (Mihlan et al., 2024; Roje et al., 2024).

Usage

python citationtool.py testdoc/test.docx testdoc/test2.docx

Output

Every provided file will have a paired file with suffix "_proc" generated. A file "Literature.docx" will be generated in directory of the last provided file.

Requirements

Python version I used this is 3.8.12. Because of f-strings, I don't really recommend using anything prior.

Requirements are outlined in requirements.txt. But they are: tqdm==4.62.3 python-docx==0.8.11 requests==2.27.1

Why not make an installable module

Because it's a ~300 line of code using relatively common packages, avoid unnecessary dependencies and simply copy citationtool.py script.

Bad DOIs

Sometimes you can badly format a DOI and it wont show. The script includes a STDOUT that nicely shows you what dois are bad, highlighting them in orange in a sentence. The example output is: Alt text

Yeah, it happens, not often, but it's usually a sign that you C/P doi incorrectly. Also, output prints single quotes weirdly, but it doesn't impact the rest of the files. Everything is copied, excluding comments.

Owner

Login: ivanp1994
Kind: user

Repositories: 1
Profile: https://github.com/ivanp1994

Citation (citationtool.py)

# -*- coding: utf-8 -*-
"""
Created on Wed Aug  7 10:00:53 2024

@author: Ivan
"""
import os
import re
import argparse
from functools import lru_cache
from typing import Dict, List, Optional, Tuple
import requests


from tqdm import tqdm
from docx import Document
from docx.text.paragraph import Paragraph


# %% GLOBALS
LREF_PATTERN = re.compile(r'\[LR[^\]]*\]')

CN_BASE_URL = "https://doi.org"
HEADERS = {"User-Agent": "python-requests/" + requests.__version__,
           "X-USER-AGENT": "python-requests/" + requests.__version__,
           "Accept": "text/x-bibliography; style = apa; locale = en-Us"}


# %% STARTING FUNCTIONS

def add_suffix(file_path: str, suffix: str) -> str:
    """
    Add a suffix to the filename in the given file path.
    """
    directory, filename = os.path.split(file_path)
    name, ext = os.path.splitext(filename)

    new_filename = f"{name}{suffix}{ext}"

    # Construct the new file path
    new_file_path = os.path.join(directory, new_filename)

    return new_file_path


def extract_references(docx_path: str,
                       pattern: Optional[re.Pattern] = None) -> Dict[str, List[str]]:
    """
    Extracts all the references
    found in a document path that match the given pattern. The given
    pattern is [LR:doi1;doi2]. A pattern can be changed by providing optional
    pattern argument or modifying LREF_PATTERN global variable.
    What is returned is a dictionary of intext pattern and List of strings
    """
    if pattern is None:
        pattern = LREF_PATTERN

    doc = Document(docx_path)

    doi_list = list()
    for para in doc.paragraphs:
        # Find all matches in the paragraph text
        matches = LREF_PATTERN.findall(para.text)
        doi_list.extend(matches)

    reference_dict = {k: k.strip("]").strip(
        "[LR:").split(";") for k in doi_list}
    return reference_dict


def green_text(text: str) -> str:
    """Return a string formatted with green color for terminal output."""
    green_color_code = "\033[92m"
    reset_color_code = "\033[0m"
    return f"{green_color_code}{text}{reset_color_code}"

# %% FUNCTIONS FOR RETRIEVING DOI


@lru_cache(maxsize=256)
def fetch_request(url: str) -> Optional[str]:
    """
    Interface to the request
    """
    r = requests.get(url, headers=HEADERS, allow_redirects=True,
                     )  # pylint: disable=C0103
    if not r.ok:
        return None
    r.encoding = "UTF-8"
    return r.text


def fetch_doi(doi_list: List[str],
              ) -> Dict[str, Optional[str]]:
    """
    Fetches an APA style format of the doi
    """
    doi_dictionary = dict()
    for ids in tqdm(doi_list, "Fetching dois"):
        doi_dictionary[ids] = fetch_request(CN_BASE_URL + "/" + ids.strip())
    return doi_dictionary

# %% FUNCTIONS FOR FORMATTING CITATIONS


def intext_cit(literature_citation: str) -> str:
    """
    Creates an APA style intext citation from a given string

    """
    authors, year = literature_citation.split("(")[:2]
    year = year.split(")")[0]

    author_count = authors.count(".,") + authors.count("&")

    first_author = authors.split(",")[0].strip()
    if author_count == 2:
        second_author = " & " + \
            authors.split("&")[1].split(",")[0].strip()+", "
    # ., counts all authors and & counts last author
    # when there is one author this can be only 0
    elif author_count == 0:
        first_author = first_author + ","
        second_author = " "
    else:
        second_author = " et al., "

    intext_citation = first_author + second_author + year
    return intext_citation


def generate_replacer(lrtext_doi: Dict[str, List[str]],
                      doi_intext: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """
    "lrtext_doi" is how references are in the raw file, e.g.
    '[LR:  https://doi.org/10.1016/j.cmet.2024.07.004 ]' and the value is without LR
    "doi_intext" is how a citation is written during text, e.g.
    '(Zhang et al., 2024)'

    This returns two dictionaries - good replacer and bad replacer
    """
    good_replacer = dict()
    bad_replacer = dict()
    for intext, doi_list in lrtext_doi.items():
        replacement = "("
        one_citation = len(doi_list) == 1

        if one_citation:
            joiner = ")"
            end = ""
        else:
            joiner = "; "
            end = ")"

        for doi in doi_list:
            intext_doi = doi_intext.get(doi, None)
            if intext_doi is None:
                intext_doi = "BAD_DOI:"+doi
            replacement = replacement + intext_doi + joiner

        replacement = replacement.rstrip("; ") + end

        if "BAD_DOI" in replacement:
            bad_replacer[intext] = replacement
        else:
            good_replacer[intext] = replacement
    return good_replacer, bad_replacer

# %% BAD PATTERN RECOGNIZAL


def highlight_text(text: str, pattern: str, color_code: str) -> str:
    """Highlight occurrences of `pattern` in `text` with the specified `color_code`."""
    parts = text.split(pattern)
    highlighted_text = ''

    for i in range(len(parts) - 1):
        highlighted_text += parts[i]

        highlighted_text += f"{color_code}{pattern}\033[0m"

    highlighted_text += parts[-1]

    return highlighted_text


def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences using basic punctuation rules."""

    sentence_endings = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
    sentences = sentence_endings.split(text)
    # edge cases
    return [sentence.strip() for sentence in sentences if sentence.strip()]


def recognize_bad_dois(doc_path: str, patterns: List[str]) -> List[str]:
    """
    Recognizes bad doi and returns sentences containing them,
    with highlighting them
    """
    doc = Document(doc_path)

    # I think orange is nice
    highlight_color = "\033[38;5;208m"
    bad_sentences = list()

    for para in doc.paragraphs:
        text = para.text
        sentences = split_into_sentences(text)
        for sentence in sentences:
            if any(pattern in sentence for pattern in patterns):
                highlighted_sentence = sentence
                for pattern in patterns:
                    if pattern in highlighted_sentence:
                        highlighted_sentence = highlight_text(
                            highlighted_sentence, pattern, highlight_color)
                        bad_sentences.append(highlighted_sentence)
    return bad_sentences

# %% END SAVING


def replace_text_in_runs(paragraph: Paragraph,
                         replacements: Dict[str, str]) -> None:
    """
    Replaces a pattern in text
    Doesnt save comments
    """
    for key, value in replacements.items():
        if key in paragraph.text:
            full_text = ''.join(run.text for run in paragraph.runs)
            full_text = full_text.replace(key, value)
            start_index = 0
            for run in paragraph.runs:
                run_length = len(run.text)
                run.text = full_text[start_index:start_index + run_length]
                start_index += run_length


def save_document(doc_path: str, doc_out: str, replacement: Dict[str, str]) -> None:
    """
    Replaces a pattern in doc and then saves it as the output
    """
    doc = Document(doc_path)
    for paragraph in doc.paragraphs:
        replace_text_in_runs(paragraph, replacement)
    doc.save(doc_out)


def save_literature_to_docx(literature_entries: List[str], doc_path: str):
    """
    saves literature to document provided
    """
    doc = Document()
    doc.add_heading('Literature References', level=1)
    for entry in literature_entries:
        doc.add_paragraph(entry.strip())  # Strip trailing newline characters
    doc.save(doc_path)
# %%  MAIN


def fix_citations(doc_paths: List[str],
                  pattern: Optional[re.Pattern] = None) -> None:
    """
    Main function that fixes citations.
    For every file in doc_path, an additional file with suffix '_proc'
    is created.
    The entire literature in APA format is found at the location of the
    last file as "Literature.docx"

    The pattern for doi recognizal is outlined in LREF_PATTERN global
    and it follows [LR:*] scheme with multiple citations for same sentence
    possible [LR:*;*]

    Optionally, the pattern can be changed via "pattern"


    """
    end_literature = set()
    if len(doc_paths) == 0:
        raise ValueError("No documents provided")
    for path in doc_paths:
        file_name = os.path.basename(path)
        result_name = add_suffix(path, "_proc")
        lrtext_doi = extract_references(
            path, pattern)  # converts it to lrtext_doi

        # there are all citations in this file
        merged_list = list({item for sublist in lrtext_doi.values()
                           for item in sublist})
        print("Done extracting from file ", green_text(file_name))
        doi_citation = fetch_doi(merged_list)  # <- the time consuming part

        print("Done fetching literature, the following bad dois are present in the file:")

        doi_intext = {k: intext_cit(v)
                      for k, v in doi_citation.items() if v is not None}
        end_literature = end_literature.union(
            {v for v in doi_citation.values() if v})
        good_replacer, bad_replacer = generate_replacer(lrtext_doi, doi_intext)
        bad_sentences = recognize_bad_dois(path, list(bad_replacer.keys()))
        for sentence in bad_sentences:
            print("\t", sentence)
        # replacing loging
        save_document(path, result_name, good_replacer)
        print("Saved file found at ", green_text(path))
        print("\n")

    # literature saving loging
    end_literature = sorted(list(end_literature))
    _directory, _ = os.path.split(path)  # pylint: disable=W0631
    _literature = os.path.join(_directory, "Literature.docx")
    save_literature_to_docx(end_literature, _literature)
    print("Saved literature to ", green_text(_literature))


# %%
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Filling out references using their DOI")
    parser.add_argument("files", nargs='+', )
    files = vars(parser.parse_args())["files"]
    fix_citations(files)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science