google-scholar-scraper

https://github.com/tmu-research-project-2020/google-scholar-scraper

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (1.3%) to scientific vocabulary

Keywords

google-scholar scraping scraping-website

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: tmu-research-project-2020
License: mit
Language: Python
Default Branch: main
Homepage: https://gs-visualizer-production.herokuapp.com/
Size: 731 KB

Statistics

Stars: 7
Watchers: 1
Forks: 2
Open Issues: 0
Releases: 0

Topics

google-scholar scraping scraping-website

Created over 5 years ago · Last pushed over 5 years ago

Metadata Files

Readme License Citation

README.md

google-scholar-scraper

A tool for extracting paper information from Google Scholar.
* citations_trend.py: fetches information on the legend and buzz papers for a keyword
* conf_scrape.py: fetches 100 papers for a conference
* scraping_utils.py: utilities needed to scrape Google Scholar

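1. How scraping works

Request the URL.
- (Note: sending a large number of requests will get you banned.)
- (Results are listed 10 per page, roughly in order of importance.)
Parse the HTML with BeautifulSoup and extract the paper information.

Paper information that can be retrieved: title, URL, authors, publication year, citation count, paper ID, snippet, and number of citations per year.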
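
As a rough sketch of the request side described above, the following builds the result-page URLs for one query at ten results per page and picks a pause to insert between requests. The endpoint, the `q` and `start` query parameters, and the helper names `page_urls` and `polite_delay` are assumptions made for illustration; the repository's own `make_url` in scraping_utils.py is not shown here and may work differently.

```python
import random
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"  # assumed endpoint
RESULTS_PER_PAGE = 10  # the README states 10 results per page


def page_urls(keyword, max_results=100):
    """Return one URL per result page, covering max_results results."""
    urls = []
    for start in range(0, max_results, RESULTS_PER_PAGE):
        # "q" and "start" are assumed parameter names (hypothetical here).
        query = urlencode({"q": keyword, "start": start})
        urls.append(f"{BASE_URL}?{query}")
    return urls


def polite_delay(min_seconds=5.0, max_seconds=15.0):
    """Pick a random wait, in seconds, to insert between two requests."""
    return random.uniform(min_seconds, max_seconds)


urls = page_urls("machine translation", max_results=30)
for url in urls:
    # A real scraper would fetch the page here and then call
    # time.sleep(polite_delay()) before the next request.
    print(url)
```

The delay bounds are arbitrary placeholders; no particular interval is known to avoid a ban.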

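2. Compound keyword search

A plain keyword search tends to return well-known papers from older years.

→ When searching, we want to specify not only keywords related to the field but also the publication (conference) name and the year the paper was published.

The search was modified to support compound conditions (keyword, publication name, publication year):
1. Enter the keyword, publication name, and publication year
1. Request the URL and retrieve the top 100 results
1. Extract the paper information (title, authors, publication year, citation count, url, snippet)
1. Write the results to csv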
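
The steps above could be sketched roughly as follows. This is not the repository's implementation: the query parameters `as_publication`, `as_ylo`, and `as_yhi`, and the helper names `build_search_url` and `write_papers_csv`, are assumptions for illustration, and the single paper record is made-up placeholder data rather than a scraped result.

```python
import csv
import io
from urllib.parse import urlencode

# Column order follows the fields listed in the README.
FIELDS = ["title", "authors", "year", "citations", "url", "snippet"]


def build_search_url(keyword, publication=None, year=None, start=0):
    """Build a search URL from keyword, publication name and year."""
    params = {"q": keyword, "start": start}
    if publication:
        params["as_publication"] = publication  # assumed parameter name
    if year:
        # Restrict to a single year by setting both bounds (assumed names).
        params["as_ylo"] = year
        params["as_yhi"] = year
    return "https://scholar.google.com/scholar?" + urlencode(params)


def write_papers_csv(papers, fileobj):
    """Write a list of paper dicts to an open text file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(papers)


url = build_search_url("parsing", publication="ACL", year=2018)

# Placeholder record standing in for one scraped result.
placeholder_papers = [
    {
        "title": "Example title",
        "authors": "A. Author",
        "year": 2018,
        "citations": 0,
        "url": "https://example.com/paper",
        "snippet": "Example snippet",
    }
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_papers_csv(placeholder_papers, buffer)
print(url)
print(buffer.getvalue())
```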

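3. Papers citing a given paper, and the citation count trend

Get the number of citations of the target paper for each year, from its publication year through 2021.
- This is done by collecting the publication years of the citing papers.

This is used to visualize the citation trends of the legend and buzz papers described later.

Steps for obtaining the citation trend:
1. Enter the author, keyword, publication name, and publication year
1. Request the URL, display the top 10 results, and select the target paper
1. Search for the top 100 papers that cite the selected paper
1. Extract the paper information and the per-year citation counts, and write them to csv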
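
To make the per-year counting concrete, here is a minimal sketch that turns the publication years of the citing papers into one row per year. The function name `cite_years` and its handling of missing or out-of-range years are assumptions; the repository's `year_list_to_cite_years` is not shown here and may behave differently.

```python
from collections import Counter


def cite_years(citing_years, start_year, end_year=2021):
    """Count citing papers per year between start_year and end_year.

    Entries that are None (year not found) or outside the range are
    skipped. Years with no citing paper are reported with a count of 0.
    """
    counts = Counter(
        int(year)
        for year in citing_years
        if year is not None and start_year <= int(year) <= end_year
    )
    return [[year, counts.get(year, 0)] for year in range(start_year, end_year + 1)]


# Publication years of citing papers, as they might come out of a scrape.
rows = cite_years([2019, 2020, 2020, None, 2021, 2017], start_year=2018)
print(rows)  # [[2018, 0], [2019, 1], [2020, 2], [2021, 1]]
```

Because only the top 100 citing papers are retrieved, counts produced this way are a sample of the citations rather than the full total for heavily cited papers.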

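4. Visualizing a field's legend and buzz papers

For newcomers to a field, we suggest two kinds of papers worth reading:
- Legend paper: a paper that is important regardless of its era
- Buzz paper: a recently published paper that is attracting attention

Each was defined as follows, and the paper information and citation trends were scraped:
- Legend paper: a paper that ranks high in a plain keyword search
- Buzz paper: a paper that ranks high in a compound search on keyword and publication year

The collected information was saved to csv and visualized in a web app [link]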

Citation (citations_trend.py)

from scraping_utils import *
import csv


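def write_paper_csv(keyword, paper, label):
    """Append one selected paper, tagged with its keyword and label, to the shared CSV."""
    path = "data/cite_paper.csv"
    data = [keyword, label]
    data.extend(list(paper.values()))
    # newline="" lets the csv module control line endings (avoids blank rows on Windows)
    with open(path, "a", newline="") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(data)

def write_years_csv(paper_id, years):
    """Write the per-year citation rows for one paper to its own CSV file."""
    path = "data/cite_years/" + paper_id + ".csv"
    with open(path, "w", newline="") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(years)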

keyword = input("Keyword?: ")
url_leg = make_url(keyword=keyword, conf=None, author=None, year=None)
url_buz = make_url(keyword=keyword, conf=None, author=None, year="2018")
print("Please select LEGEND paper")
leg_paper = grep_candidate_papers(url_leg)

write_paper_csv(keyword, leg_paper, "legend")

url_cite_leg = make_url(keyword=None, conf=None, author=None, year=None, paper_id=leg_paper["paper_id"])
(
    titles_leg,
    urls_leg,
    writers_leg,
    years_leg,
    ci_num_leg,
    p_ids_leg,
    snippets_leg,
) = scraping_papers(url_cite_leg)
cite_year_leg = year_list_to_cite_years(years_leg, int(leg_paper['year']))
write_years_csv(leg_paper['paper_id'],cite_year_leg)

print("Please select BUZZ paper")
buz_paper = grep_candidate_papers(url_buz)

write_paper_csv(keyword, buz_paper, "buzz")

url_cite_buz = make_url(keyword=None, conf=None, author=None, year=None, paper_id=buz_paper["paper_id"])
(
    titles_buz,
    urls_buz,
    writers_buz,
    years_buz,
    ci_num_buz,
    p_ids_buz,
    snippets_buz,
) = scraping_papers(url_cite_buz)
cite_year_buz = year_list_to_cite_years(years_buz, int(buz_paper['year']))
write_years_csv(buz_paper['paper_id'], cite_year_buz)

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 6
Total pull requests: 26
Average time to close issues: 15 days
Average time to close pull requests: 1 day
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.33
Average comments per pull request: 0.12
Merged pull requests: 24
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
