citations-app
Science Score: 18.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: marierosok
- Language: Jupyter Notebook
- Default Branch: main
- Size: 1.13 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Citation-finder
Citation-finder finds in-text citations in a text corpus using regular expressions. The expressions find citations that follow the author name-publication year format of the APA, Harvard and Chicago B styles.
Streamlit app
For a quick and easy way to use citation-finder, check out the Streamlit app! Search literature by subject to create a corpus and find the in-text citations.
How to use
Citation-finder takes a corpus created with the dhlab package from The National Library of Norway.
Install and import dhlab, and create a corpus.
```
# Install first, from a shell: pip install -U dhlab
import dhlab as dh
corpus = dh.Corpus(doctype='digibok')
```
Import citation-finder and call the function on the corpus.
```
import citation_finder as cf
cf.citation_finder(corpus, yearspan=(1900,1965), limit=500)
```
The optional arguments are `yearspan` and `limit`. The function searches for concordances using a range of four-digit numbers that represent publication years. `yearspan` sets the first and last year of that range and defaults to 1000 through the current year. `limit` is the maximum number of concordances to retrieve and defaults to 4000.
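To make the year range concrete, here is a minimal sketch of how a `yearspan` tuple could be expanded into four-digit search terms. The helper `year_terms` is hypothetical and only mirrors the documented default; it is not taken from citation-finder, and the package may build its concordance query differently.

```python
import datetime


def year_terms(yearspan=None):
    """Return four-digit year strings for an inclusive (start, end) span.

    Hypothetical helper, not part of citation-finder. It only mirrors the
    documented default of 1000 through the current year.
    """
    if yearspan is None:
        yearspan = (1000, datetime.date.today().year)
    start, end = yearspan
    if start > end:
        raise ValueError("yearspan start must not exceed end")
    if start < 1000 or end > 9999:
        raise ValueError("years must have exactly four digits")
    return [str(year) for year in range(start, end + 1)]


terms = year_terms((1900, 1965))
print(len(terms), terms[0], terms[-1])  # prints: 66 1900 1965
```

A narrow span keeps the number of search terms small, which is one reason to set `yearspan` when the corpus covers a known period.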
The function returns a Pandas DataFrame with the individual citation matches and their associated URN from the dhlab corpus.
What will match
Citation-finder matches citations where both the author name and the publication year are inside parentheses (e.g. (Smith, 1991)), as well as citations where the author name is outside the parentheses and the publication year is inside them (e.g. Smith (1991)).
To distinguish citation-like strings from other text, the regular expressions require at least one word beginning with an upper-case letter (the author name), a four-digit number (the publication year), and parentheses (or semicolons, which can also surround citations when several are listed in a row).
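The two basic forms can be illustrated with a pair of simplified patterns. This is a sketch under stated assumptions, not the expressions citation-finder actually uses: the names `AUTHOR`, `YEAR`, `PARENTHETICAL` and `NARRATIVE` are made up for illustration, and semicolon-separated lists are left out.

```python
import re

# Assumed building blocks (not citation-finder's own patterns).
AUTHOR = r"[A-ZÆØÅ][\w-]+"  # one word starting with an upper-case letter
YEAR = r"\d{4}"             # a four-digit publication year

# (Smith, 1991): author name and year both inside parentheses.
PARENTHETICAL = re.compile(rf"\({AUTHOR},? {YEAR}\)")
# Smith (1991): author name outside, year inside parentheses.
NARRATIVE = re.compile(rf"{AUTHOR} \({YEAR}\)")

text = "As shown earlier (Smith, 1991), and later by Jones (1995), the effect holds."
print(PARENTHETICAL.findall(text))  # prints: ['(Smith, 1991)']
print(NARRATIVE.findall(text))      # prints: ['Jones (1995)']
```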
Additionally, the patterns allow for several optional elements:
* multiple authors can be listed
  * (Lee, Singh and Smith, 1991)
  * Lee, Singh and Smith (1991)
* author names can include initials
  * (P. W. Smith, 1991)
  * P. W. Smith (1991)
* author names can be followed by "et al." or "m.fl." in Norwegian
  * (Smith et al., 1991)
  * Smith et al. (1991)
* publication year can be followed by a page reference
  * (Smith, 1991, p. 123-125)
  * Smith (1991, p. 123-125)
* publication year can be followed by a single letter to differentiate multiple works by the same author in the same year
  * (Smith, 1991a)
  * Smith (1991a)
* the author name inside parentheses can be preceded by other text
  * (see for instance Smith, 1991)
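The optional elements listed above can be layered onto a basic author-year pattern. The following is a rough sketch of one way to do that, assuming made-up fragment names (`INITIALS`, `NAME`, `ET_AL`, `AUTHORS`, `YEAR`, `PAGES`) and assumed connectives ("and", "og", "&"). It is not the set of expressions shipped with citation-finder.

```python
import re

# All fragments below are illustrative assumptions.
INITIALS = r"(?:[A-Z]\. ?)*"                    # P. W.
NAME = rf"{INITIALS}[A-ZÆØÅ][\w-]+"             # P. W. Smith
ET_AL = r"(?: et al\.| m\.fl\.)?"               # Smith et al. / Smith m.fl.
AUTHORS = rf"{NAME}(?:, {NAME})*(?: (?:and|og|&) {NAME})?{ET_AL}"
YEAR = r"\d{4}[a-z]?"                           # 1991 or 1991a
PAGES = r"(?:, (?:p|pp|s)\. ?\d+(?:-\d+)?)?"    # , p. 123-125

# Optional leading text such as "see for instance " is allowed inside the parentheses.
PARENTHETICAL = re.compile(rf"\((?:[^()\d]*? )?{AUTHORS},? {YEAR}{PAGES}\)")
NARRATIVE = re.compile(rf"{AUTHORS} \({YEAR}{PAGES}\)")

examples = [
    "(Lee, Singh and Smith, 1991)",
    "Lee, Singh and Smith (1991)",
    "(P. W. Smith, 1991)",
    "P. W. Smith (1991)",
    "(Smith et al., 1991)",
    "Smith et al. (1991)",
    "(Smith, 1991, p. 123-125)",
    "Smith (1991, p. 123-125)",
    "(Smith, 1991a)",
    "Smith (1991a)",
    "(see for instance Smith, 1991)",
]

for example in examples:
    pattern = PARENTHETICAL if example.startswith("(") else NARRATIVE
    # Every example from the list above is expected to print True.
    print(example, bool(pattern.fullmatch(example)))
```

Keeping each optional element in its own named fragment makes it easier to see which part of the pattern is responsible when a string matches unexpectedly.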
Since the regular expressions simply search for patterns in raw text, citation-finder will return all the matching strings regardless of whether they are true citations or not, and will not return citations that do not match the pattern.
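Both failure modes are easy to reproduce with a simplified pattern. The sketch below uses an assumed, stripped-down expression and invented sample sentences. It only shows why results should be reviewed by hand, not how citation-finder behaves on real data.

```python
import re

# Simplified assumption: a capitalised word followed by a year in parentheses.
NARRATIVE = re.compile(r"[A-ZÆØÅ][\w-]+ \(\d{4}\)")

samples = {
    "true citation": "Smith (1991) argues otherwise.",
    # A place name followed by a year looks exactly like a citation.
    "false positive": "The company moved to Bergen (1991) and expanded.",
    # Numeric citation styles carry no author name or year, so nothing matches.
    "missed citation": "This was shown earlier [12].",
}

for label, sentence in samples.items():
    print(label, NARRATIVE.findall(sentence))
# prints:
# true citation ['Smith (1991)']
# false positive ['Bergen (1991)']
# missed citation []
```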
Owner
- Login: marierosok
- Kind: user
- Repositories: 1
- Profile: https://github.com/marierosok
Citation (citation_app.py)
import dhlab as dh
import streamlit as st
import pandas as pd
import datetime
import citation_finder as cf
st.set_page_config(layout="wide")
st.title('Citation-finder')
# Corpus section ("Lag korpus" = "Create corpus"): subject search term and maximum number of works.
st.subheader('Lag korpus')
subject=st.text_input('Søk på temaord', 'Søk', help='Skriv ønsket søkeord for tema. Du kan bruke * som wildcard før eller etter søkeordet, f.eks. "språk*"')
corp_limit=st.number_input('Antall verk', value=500, help='Sett maksgrense for antall verk i korpuset')
# Concordance options ("Konkordansevalg"): year span and maximum number of concordances.
st.subheader('Konkordansevalg')
st.markdown('Citation-finder henter tekst fra konkordanser. Velg årsspennet som skal brukes for å søke etter konkordanser og hvor mange konkordanser som skal hentes ut.')
curr_year=datetime.datetime.today().year
# "Fra år" = "From year", "Til år" = "To year"; defaults are 1000 and the current year.
from_year=st.number_input('Fra år', value=1000, help='Velg startår for årsspennet som skal brukes i konkordansesøket')
to_year=st.number_input('Til år', value=curr_year, help='Velg sluttår for årsspennet som skal brukes i konkordansesøket')
yearspan=(from_year,to_year)
conc_limit=st.number_input('Antall konkordanser', value=4000, help='Sett maksgrense for antall konkordanser som skal hentes ut')
# Echo the chosen settings back to the page.
st.write(subject, corp_limit, from_year, to_year, conc_limit)
# Build the corpus from the subject search, then look for citations in it.
corpus = dh.Corpus(doctype='digibok', subject=subject, limit=corp_limit)
df = cf.citation_finder(corpus, yearspan=yearspan, conc_limit=conc_limit)
df.columns = ["urn","citation"]
# Join each citation match with the corpus metadata for its URN.
res = df.merge(corpus.frame, left_on='urn', right_on='urn')['urn title authors year citation'.split()]
st.markdown('Corpus + citations')
st.dataframe(res)
Dependencies
- dhlab *
- pandas *
- streamlit *
- asttokens 2.4.1
- beautifulsoup4 4.12.3
- certifi 2024.2.2
- charset-normalizer 3.3.2
- colorama 0.4.6
- contourpy 1.2.1
- cycler 0.12.1
- decorator 5.1.1
- dhlab 2.32.0
- et-xmlfile 1.1.0
- exceptiongroup 1.2.0
- executing 2.0.1
- fonttools 4.51.0
- idna 3.6
- ipython 8.23.0
- jedi 0.19.1
- kiwisolver 1.4.5
- matplotlib 3.8.4
- matplotlib-inline 0.1.6
- networkx 3.3
- numpy 1.26.4
- openpyxl 3.1.2
- packaging 24.0
- pandas 2.2.1
- parso 0.8.4
- pexpect 4.9.0
- pillow 10.3.0
- prompt-toolkit 3.0.43
- ptyprocess 0.7.0
- pure-eval 0.2.2
- pygments 2.17.2
- pyparsing 3.1.2
- python-dateutil 2.9.0.post0
- python-louvain 0.16
- pytz 2024.1
- requests 2.31.0
- scipy 1.13.0
- seaborn 0.13.2
- six 1.16.0
- soupsieve 2.5
- stack-data 0.6.3
- traitlets 5.14.2
- typing-extensions 4.11.0
- tzdata 2024.1
- urllib3 2.2.1
- wcwidth 0.2.13
- dhlab ^2.30.0
- pandas ^2.1.3
- python ^3.10