citation-crawler

Asynchronous high-concurrency citation crawler, use with caution!

https://github.com/yindaheng98/citation-crawler

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: yindaheng98
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 466 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

citation-crawler

Asynchronous high-concurrency citation crawler, use with caution!

Currently only Semantic Scholar is supported; both references and citations can be crawled from it.

Papers are crawled from the citation network by breadth-first search (BFS) and connected into an undirected graph. Each node is a paper, and each edge is a citation relationship.
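As an illustration of the traversal described above, here is a minimal sketch of a depth-limited breadth-first search over a citation network. The `TOY_REFERENCES` mapping, the `bfs_citation_graph` function and its `limit` parameter are hypothetical names made up for this example. The real crawler fetches data from Semantic Scholar asynchronously and is not necessarily structured like this.

```python
from collections import deque

# Hypothetical toy data: paperId -> paperIds that it references.
TOY_REFERENCES = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def bfs_citation_graph(start_ids, references, limit):
    """Explore `references` breadth-first from `start_ids`.

    Only papers at depth < `limit` are expanded. Returns (nodes, edges),
    where each edge is a (paper, reference) pair.
    """
    depth = {pid: 0 for pid in start_ids}
    queue = deque(start_ids)
    edges = set()
    while queue:
        paper = queue.popleft()
        if depth[paper] >= limit:
            continue  # do not expand papers at the depth limit
        for ref in references.get(paper, []):
            edges.add((paper, ref))
            if ref not in depth:  # first visit: record depth and enqueue
                depth[ref] = depth[paper] + 1
                queue.append(ref)
    return set(depth), edges


nodes, edges = bfs_citation_graph(["A"], TOY_REFERENCES, limit=2)
print(sorted(nodes))
print(sorted(edges))
```

With `limit=2`, paper `D` is reached but not expanded, so `E` never enters the graph.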

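The Neo4J output is compatible with dblp-crawler: identical papers are recognized automatically so that no duplicate nodes are created, and authors already recorded in the database are matched, so papers newly added to the database are linked to their authors.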

Install

```sh
pip install citation-crawler
```

Usage

```sh
python -m citation_crawler -h
usage: __main__.py [-h] [-y YEAR] [-l LIMIT] -k KEYWORD [-p PID] [-a AID] {networkx,neo4j} ...

positional arguments:
  {networkx,neo4j}      sub-command help
    networkx            Write results to a json file.
    neo4j               Write result to neo4j database

optional arguments:
  -h, --help            show this help message and exit
  -y YEAR, --year YEAR  Only crawl the paper after the specified year.
  -l LIMIT, --limit LIMIT
                        Limitation of BFS depth.
  -k KEYWORD, --keyword KEYWORD
                        Specify keyword rules.
  -p PID, --pid PID     Specified a list of paperId to start crawling.
  -a AID, --aid AID     Specified a list of authorId to start crawling.
```

```sh
python -m citation_crawler networkx -h
usage: __main__.py networkx [-h] --dest DEST

optional arguments:
  -h, --help   show this help message and exit
  --dest DEST  Path to write results.
```

```sh
python -m citation_crawler neo4j -h
usage: __main__.py neo4j [-h] [--auth AUTH] --uri URI

optional arguments:
  -h, --help           show this help message and exit
  --username USERNAME  Auth username to neo4j database.
  --password PASSWORD  Auth password to neo4j database.
  --uri URI            URI to neo4j database.
```
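To make the shape of this command line easier to see, here is a minimal `argparse` sketch that is consistent with the help output above. It is not the project's actual parser: `build_parser`, the `int` types, the use of `action="append"` for repeatable options and the `dest="output"` name are all assumptions, and the Neo4J authentication options are left out.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options (not the real one)."""
    parser = argparse.ArgumentParser(prog="citation_crawler")
    parser.add_argument("-y", "--year", type=int,
                        help="Only crawl papers after this year.")
    parser.add_argument("-l", "--limit", type=int,
                        help="Limit of BFS depth.")
    # Assumed repeatable: each -k / -p / -a adds one item to a list.
    parser.add_argument("-k", "--keyword", action="append", required=True,
                        help="Keyword rule.")
    parser.add_argument("-p", "--pid", action="append", default=[],
                        help="paperId to start crawling from.")
    parser.add_argument("-a", "--aid", action="append", default=[],
                        help="authorId to start crawling from.")

    sub = parser.add_subparsers(dest="output", required=True)
    networkx = sub.add_parser("networkx", help="Write results to a json file.")
    networkx.add_argument("--dest", required=True, help="Path to write results.")
    neo4j = sub.add_parser("neo4j", help="Write result to neo4j database")
    neo4j.add_argument("--uri", required=True, help="URI to neo4j database.")
    return parser


args = build_parser().parse_args([
    "-k", "video", "-k", "edge",
    "-p", "27d5dc70280c8628f181a7f8881912025f808256",
    "-a", "1681457",
    "networkx", "--dest", "summary.json",
])
print(args.keyword, args.pid, args.aid, args.output, args.dest)
```

Note that the global options (`-k`, `-p`, `-a`, ...) come before the sub-command, and the sub-command's own options (`--dest`, `--uri`) come after it.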

Config environment variables

CITATION_CRAWLER_MAX_CACHE_DAYS_AUTHORS:
- number of days to keep the cache of a paper's authors page (used to get the authors of a published paper)
- default: -1 (cache forever, since the authors of a paper are unlikely to change)
CITATION_CRAWLER_MAX_CACHE_DAYS_REFERENCES:
- number of days to keep the cache of a references page (used to get the references of a published paper)
- default: -1 (cache forever, since the references of a paper are unlikely to change)
CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS:
- number of days to keep the cache of a citations page (used to get the citations of a published paper)
- default: 7 (the citations of a paper may change frequently)
CITATION_CRAWLER_MAX_CACHE_DAYS_PAPER:
- number of days to keep the cache of a paper detail page (used to get the details of a paper)
- default: -1 (cache forever, since the details of a published paper are unlikely to change)
CITATION_CRAWLER_MAX_CACHE_DAYS_INIT_AUTHOR:
- number of days to keep the cache of an author page (used to initialize papers from an author specified by -a)
- default: 7 (an author may publish frequently)
HTTP_PROXY:
- set it to http://your_user:your_password@your_proxy_url:your_proxy_port if you want to use a proxy
HTTP_TIMEOUT:
- timeout for each HTTP request, in seconds
HTTP_CONCORRENT:
- number of concurrent HTTP requests
- default: 8
HTTP_HEADERS:
- headers for HTTP requests
- default: None
HTTP_SLEEP:
- time to sleep after each request, in seconds
- default: 0
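The cache settings above follow one convention: a non-negative value is a maximum age in days, and -1 means "cache forever". The sketch below shows one way such a setting could be read and applied. The helpers `max_cache_days` and `is_cache_fresh` are hypothetical and are not taken from the crawler's source.

```python
import os
from datetime import datetime, timedelta


def max_cache_days(name, default):
    """Read an integer number of days from the environment (hypothetical helper)."""
    return int(os.environ.get(name, default))


def is_cache_fresh(saved_at, max_days, now=None):
    """Return True if a cache entry saved at `saved_at` may still be used.

    A negative `max_days` is treated as "cache forever".
    """
    if max_days < 0:
        return True
    now = now or datetime.now()
    return now - saved_at <= timedelta(days=max_days)


# Defaults taken from the list above; real values come from the environment.
citations_days = max_cache_days("CITATION_CRAWLER_MAX_CACHE_DAYS_CITATIONS", 7)
references_days = max_cache_days("CITATION_CRAWLER_MAX_CACHE_DAYS_REFERENCES", -1)

saved = datetime(2024, 1, 1)
now = datetime(2024, 1, 10)          # the entry is 9 days old
print(is_cache_fresh(saved, 7, now))   # a 7-day limit has expired
print(is_cache_fresh(saved, -1, now))  # -1 never expires
```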

Write to a JSON file

For example, to write to summary.json:

```sh
python -m citation_crawler -k video -k edge -p 27d5dc70280c8628f181a7f8881912025f808256 -a 1681457 networkx --dest summary.json
```

JSON format

```json
{
  "nodes": {
    "<paperId of a paper in Semantic Scholar>": {
      "paperId": "<paperId of this paper in Semantic Scholar>",
      "dblp_key": "<dblp id of this paper>",
      "title": "<title of this paper>",
      "year": "int <publish year of this paper>",
      "doi": "<doi of this paper>",
      "authors": [
        {
          "authorId": "<authorId of this person in Semantic Scholar>",
          "name": "<name of this person>",
          "dblp_name": [
            "<disambiguation name of this person in dblp>",
            "<disambiguation name of this person in dblp>",
            "<disambiguation name of this person in dblp>",
            "......"
          ]
        },
        { ...... },
        { ...... },
        ......
      ]
    },
    "<paperId of a paper in Semantic Scholar>": { ...... },
    "<paperId of a paper in Semantic Scholar>": { ...... },
    ......
  },
  "edges": [
    [
      "<paperId of a paper in Semantic Scholar>",
      "<paperId of a reference in the above paper>"
    ],
    [ ...... ],
    [ ...... ],
    ......
  ]
}
```
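As a small illustration of how a file in the format above can be consumed, the sketch below parses a tiny hand-written document with the standard `json` module and counts how often each paper is cited. The sample ids, titles and names are placeholders, not real crawler output. In practice you would call `json.load` on the file passed to `--dest`.

```python
import json
from collections import defaultdict

# Hand-written placeholder document following the documented format.
SAMPLE = """
{
  "nodes": {
    "p1": {"paperId": "p1", "title": "Paper one", "year": 2020,
           "authors": [{"authorId": "a1", "name": "Alice"}]},
    "p2": {"paperId": "p2", "title": "Paper two", "year": 2018,
           "authors": [{"authorId": "a2", "name": "Bob"}]},
    "p3": {"paperId": "p3", "title": "Paper three", "year": 2015,
           "authors": []}
  },
  "edges": [["p1", "p2"], ["p1", "p3"], ["p2", "p3"]]
}
"""

summary = json.loads(SAMPLE)

# Each edge is [citing paperId, referenced paperId].
references = defaultdict(list)
cited_by_count = defaultdict(int)
for citing, cited in summary["edges"]:
    references[citing].append(cited)
    cited_by_count[cited] += 1

for paper_id, paper in summary["nodes"].items():
    print(paper["title"], paper["year"], "cited by", cited_by_count[paper_id])
```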

Write to a Neo4J database

```sh
docker pull neo4j
docker run --rm -it -p 7474:7474 -p 7687:7687 -v $(pwd)/save/neo4j:/data -e NEO4J_AUTH=none neo4j
```

For example, to write to neo4j://localhost:7687:

```sh
python -m citation_crawler -k video -k edge -p 27d5dc70280c8628f181a7f8881912025f808256 -a 1681457 neo4j --uri neo4j://localhost:7687
```

Tips

Without indexes, Neo4J queries will be very slow, so create some indexes before you start:

```cql
CREATE TEXT INDEX publication_title_hash_text_index FOR (p:Publication) ON (p.title_hash);
CREATE INDEX publication_title_hash_index FOR (p:Publication) ON (p.title_hash);
CREATE INDEX publication_dblp_key_index FOR (p:Publication) ON (p.dblp_key);
CREATE INDEX publication_paper_id_index FOR (p:Publication) ON (p.paperId);
CREATE INDEX person_author_id_index FOR (p:Person) ON (p.authorId);
CREATE INDEX person_dblp_pid_index FOR (p:Person) ON (p.dblp_pid);
```

Get initial paper list or author list from a Neo4J database

```sh
python -m citation_crawler -k video -k edge -p "importlib.import_module('citation_crawler.init').papers_in_neo4j('neo4j://localhost:7687')" neo4j --uri neo4j://localhost:7687
```

```sh
python -m citation_crawler -k video -k edge -a "importlib.import_module('citation_crawler.init').authors_in_neo4j('neo4j://localhost:7687')" neo4j --uri neo4j://localhost:7687
```

`importlib.import_module` is flexible: you can use it to import your own variables in the same way.
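The commands above pass a Python expression where a plain id would normally go. The sketch below shows one way a command-line tool could support this: values starting with `importlib.` are evaluated with `importlib` in scope, and anything else is kept as a literal id. The `resolve_argument` helper and this detection rule are assumptions made for illustration. How the crawler itself handles these arguments is not confirmed here.

```python
import importlib


def resolve_argument(value):
    """Hypothetical helper: expand a CLI value into a list of ids.

    Values starting with 'importlib.' are evaluated as Python expressions;
    anything else is treated as a single literal id.
    """
    if value.startswith("importlib."):
        # eval runs arbitrary code: only use it on input you trust.
        result = eval(value, {"importlib": importlib})
        return list(result)
    return [value]


print(resolve_argument("1681457"))
# A standard-library module stands in for your own module here.
print(resolve_argument("importlib.import_module('string').digits[:3]"))
```

Because the expression is evaluated as code, this pattern is only appropriate for command lines you write yourself.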

Owner

Name: Howard Yin
Login: yindaheng98
Kind: user
Location: Nanjing, China
Company: Southeastern University

Website: https://yindaheng98.github.io/
Repositories: 25
Profile: https://github.com/yindaheng98

Just a college student, majoring in live video, super-resolution and edge computing.

Citation (citation_crawler/__init__.py)

```python
from .items import Author, Paper
from .graph import Crawler, Summarizer
```

Packages

Total packages: 1
Total downloads:
- pypi 65 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 56
Total maintainers: 1

pypi.org: citation-crawler

Homepage: https://github.com/yindaheng98/citation-crawler
Documentation: https://citation-crawler.readthedocs.io/
License: MIT License
Latest release: 2.10.3
published almost 2 years ago

Versions: 56
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 65 Last month

Rankings

Dependent packages count: 10.0%

Average: 38.8%

Dependent repos count: 67.6%

Maintainers (1)

yindaheng98

Last synced: 11 months ago
