autosum

Summarize Publications Automatically

https://github.com/recite/autosum

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to: scholar.google
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Keywords

arxiv citation google-scholar

Last synced: 8 months ago

Repository

Basic Info

Host: GitHub
Owner: recite
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 13.5 MB

Statistics

Stars: 37
Watchers: 2
Forks: 10
Open Issues: 0
Releases: 0

Topics

arxiv citation google-scholar

Created over 10 years ago · Last pushed about 3 years ago

Metadata Files

Readme Funding License Citation

AutoSum: Summarize Publications Automatically

The tool exploits the labor scholars have already expended in summarizing articles. It scrapes the words next to citations across all openly available research that cites a publication and collates the output. The result is a useful summary, along with data in a format that makes potential miscitations easy to discover.

- Get the Data: Scrapes all openly accessible research citing a particular publication, using links provided by Google Scholar. Note: Google monitors scraping on Google Scholar.
- Parse the Data: Iterates through a directory of articles citing a particular research article and, using regular expressions, picks up sentences near a citation.
- Example from Social Science

Get the Data

To search Google Scholar for openly accessible PDFs that cite the original research article, use scholar.py.

Input: URL to Google Scholar Page of an article.
What the script does:
- Goes to 'Cited By..'
- Downloads a user-specified number of publicly available papers (PDFs only for now) that cite the paper to a user-specified directory.
- Creates a CSV that tracks basic characteristics of each downloaded paper: title, URL, author names, journal, etc. It also records the relative path to each downloaded file.
Sample output
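
To make the shape of the tracking file concrete, here is a minimal sketch of writing such a CSV with Python's standard `csv` module. It is not the code in scholar.py: the column names in `FIELDS`, the helper `write_tracking_csv`, and the choice to store the PDF path relative to the CSV's directory are all assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical column names; the real script may use different ones.
FIELDS = ["title", "url", "authors", "journal", "pdf_path"]


def write_tracking_csv(records, csv_path):
    """Write one row per downloaded paper.

    Assumption: the PDF path is stored relative to the directory
    that contains the CSV file.
    """
    base_dir = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for record in records:
            row = dict(record)
            row["pdf_path"] = os.path.relpath(
                os.path.abspath(row["pdf_path"]), base_dir
            )
            writer.writerow(row)


# Small demonstration inside a temporary directory (made-up record).
with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "output.csv")
    records = [
        {
            "title": "An example citing paper",
            "url": "https://example.org/paper.pdf",
            "authors": "A. Author; B. Author",
            "journal": "Example Journal",
            "pdf_path": os.path.join(workdir, "pdfs", "paper.pdf"),
        }
    ]
    write_tracking_csv(records, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(rows[0]["pdf_path"])
```

Storing a relative path keeps the CSV usable if the output directory is moved together with the downloaded files.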

Usage

```
usage: scholar.py [-h] [-u USER] [-p PASSWORD] [-a AUTHOR] [-d DIR]
                  [-o OUTPUT] [-n N_CITES] [-v] [--version]
                  keyword [keyword ...]

positional arguments:
  keyword               Keyword to be searched

optional arguments:
  -h, --help            show this help message and exit
  -u USER, --user USER  Google account e-mail
  -p PASSWORD, --password PASSWORD
                        Google account password
  -a AUTHOR, --author AUTHOR
                        Author to be filtered
  -d DIR, --dir DIR     Output directory for PDF files
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -n NCITES, --n-cites NCITES
                        Number of cites to be download
  -v, --verbose
  --version             show program's version number and exit
```

Example
python scholar.py -v -d pdfs -o output.csv -n 100 -a "A Einstein" \
    "Can quantum-mechanical description of physical reality be considered complete?"

Parse the Data

To scrape the text next to the relevant citations within the pdfs, use autosumpdf.py:

The script iterates through the pdfs using the csv generated above.
Using citation information or a custom regular expression, it extracts the matching text and writes it to the same CSV. If multiple regular expressions match, all matches are concatenated, separated by a line break.
Sample output
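
The matching-and-concatenating step can be sketched roughly as below. This is an illustration under assumptions, not the script's code: the helpers `collect_matches` and `annotate_rows` and the column name `citing_text` are hypothetical, and the text is assumed to have been extracted from each PDF already (the repository lists pdfminer.six among its dependencies for that purpose).

```python
import re


def collect_matches(text, patterns):
    """Return every snippet matched by any pattern, one snippet per line."""
    snippets = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            snippets.append(match.group(0).strip())
    return "\n".join(snippets)


def annotate_rows(rows, texts, patterns, column="citing_text"):
    """Store the matched snippets on each CSV row (a dict).

    `column` is a hypothetical column name chosen for this sketch.
    """
    for row, text in zip(rows, texts):
        row[column] = collect_matches(text, patterns)
    return rows


# Two simple patterns: an author-year citation and a numbered citation.
patterns = [r"[^.]*\(Einstein, 1935\)", r"[^.]*\[12\]"]
text = "Entanglement is odd (Einstein, 1935). Others agree [12]. Unrelated sentence."

rows = annotate_rows([{"title": "Paper A"}], [text], patterns)
print(rows[0]["citing_text"])
```

Each pattern contributes its own matches, so a paper that cites the article in several styles ends up with all of its citing passages in a single cell.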

```
usage: searchpdf.py [-h] [-i INPUT] [-o OUTPUT] [-v] [--version]
                    regex [regex ...]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        CSV input filename
  -o OUTPUT, --output OUTPUT
                        CSV output filename
  -t TXTDIR, --text TXTDIR
                        extract to specific directory
  -f, --force           force extract text file if exists
  -v, --verbose
  -a1 AUTHOR1, --author-1-lastname AUTHOR1
                        1st author of citation
  -a2 AUTHOR2, --author-2-lastname AUTHOR2
                        2nd author of citation
  -y YEAR, --year YEAR  Year of publication
  --version             show program's version number and exit
  -r REGEX, --regex REGEX
                        specify custom regex to filter citations.
```

Example
python searchpdf.py -v -i output.csv -o search-output.csv -r "\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])"

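The custom regular expression (-r switch) matches text that starts after a period and a whitespace character: a stretch of 5 to 100 characters, an optional opening bracket or parenthesis, the author name "Einstein", 2 to 30 further characters, and a number followed by a closing bracket or parenthesis.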
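
To see what the custom pattern from the example above captures, it can be tried on a short string with Python's `re` module. The sample sentences and the helper name `find_citing_text` are made up for this sketch; only the pattern itself comes from the example.

```python
import re

# The custom pattern from the example command.
CITATION_RE = re.compile(r"\.\s(.{5,100}[\[\(]?Einstein.{2,30}\d+[\]\)])")


def find_citing_text(text):
    """Return the captured snippets that end in a bracketed citation."""
    return CITATION_RE.findall(text)


sample = (
    "Quantum theory has puzzled many. "
    "The description may be incomplete (Einstein et al., 1935). "
    "Later work disagreed."
)
snippets = find_citing_text(sample)
print(snippets)
```

Because the pattern requires a preceding period and whitespace, a citation in the very first sentence of a text would not be captured, and because `.` does not match newlines by default, a citing sentence that is split across lines would also be missed.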

Depending on the command line arguments (-a1, -a2, -y), the following citation patterns are used automatically to find matching sentences:

* Author1LastName Year
* Author1LastName et al.
* Author1LastName et al. Year
* Author1LastName et al., Year
* Author1LastName and Author2LastName
* Author1LastName and Author2LastName Year
* Author1LastName, and Author2LastName Year
* Author1LastName and Author2LastName, Year
* Author1LastName & Author2LastName Year
* Author1LastName & Author2LastName, Year
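
One way such patterns could be generated from the arguments is sketched below. The function `build_citation_patterns` is hypothetical, and the selection rule (a pattern is used only when every argument it mentions was supplied) is an assumption; the script itself may choose patterns differently.

```python
import re

# Templates mirroring the citation styles listed above.
TEMPLATES = [
    "{a1} {y}",
    "{a1} et al.",
    "{a1} et al. {y}",
    "{a1} et al., {y}",
    "{a1} and {a2}",
    "{a1} and {a2} {y}",
    "{a1}, and {a2} {y}",
    "{a1} and {a2}, {y}",
    "{a1} & {a2} {y}",
    "{a1} & {a2}, {y}",
]


def build_citation_patterns(author1, author2=None, year=None):
    """Return escaped regex strings for the styles the arguments allow.

    Assumption: a template is skipped when it needs an argument
    (second author or year) that was not provided.
    """
    patterns = []
    for template in TEMPLATES:
        if "{a2}" in template and not author2:
            continue
        if "{y}" in template and not year:
            continue
        literal = template.format(a1=author1, a2=author2, y=year)
        patterns.append(re.escape(literal))
    return patterns


patterns = build_citation_patterns("Iyengar", year="2012")
sentence = "Polarization is increasingly affective (Iyengar et al., 2012)."
matched = [p for p in patterns if re.search(p, sentence)]
print(len(patterns), len(matched))
```

Escaping the filled-in template with `re.escape` keeps characters such as the period in "et al." from being read as regex metacharacters.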

Example from Social Science

What to search for?

- Example with Google Scholar
  Download 500 articles from Google Scholar:
  python scholar.py -v -d pdfs -o iyengar-output.csv -n 500 -a "S Iyengar" "Is anyone responsible?: How television frames political issues."

Searching in the Test Data

- Sample input data
- Use autosumpdf.py to filter citations to Iyengar et al. 2012:
  python autosumpdf.py -v -i testdata.csv -o search-testdata-new.csv -a1 "Iyengar" -y "2012"

Miscitations

Social scientists hold that few truths are self-evident. But some truths become obvious to all social scientists after some years of experience, including: a) peer review is a mess, b) faculty hiring is idiosyncratic, and c) research is often miscited. Here we quantify the last of these.

License

Released under the MIT License

Owner

Name: re-cite
Login: recite
Kind: organization

Repositories: 2
Profile: https://github.com/recite

Learning from citations and helping people cite better.

Citation (Citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Sood"
  given-names: "Gaurav"
title: "Get Weather Data"
date-released: 2016
url: "https://github.com/recite/autosum/"

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Dependencies

scripts/requirements.txt pypi

beautifulsoup4 *
ftfy *
future *
lxml *
pdfminer.six >=20160202
pdfquery *
