cs-insights-crawler

This repository implements interaction with DBLP, information extraction and pre-processing of papers, and a client for storing data in the cs-insights-backend.

https://github.com/jpwahle/cs-insights-crawler

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.3%) to scientific vocabulary

Keywords

crawler dblp dblp-dataset nlp semanticscholar

Repository

Basic Info
Statistics
  • Stars: 7
  • Watchers: 3
  • Forks: 0
  • Open Issues: 8
  • Releases: 0
Topics
crawler dblp dblp-dataset nlp semanticscholar
Created almost 5 years ago · Last pushed about 3 years ago
Metadata Files
Readme License Citation

README.md



This is the official crawler implementation for the D3 Dataset, written in almost pure Python. The crawler is also used for the cs-insights project.
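
To give a feel for the kind of information extraction a DBLP crawler performs, here is a minimal sketch that reads publication records from DBLP-style XML. It is not the project's actual code: the sample record, the `extract_records` helper, and the output field names are hypothetical, and the sketch uses the standard library's `xml.etree.ElementTree` for self-containment. The full DBLP dump also relies on a DTD for character entities, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of a DBLP XML record; not real data.
SAMPLE = """<dblp>
  <inproceedings key="conf/example/Doe22" mdate="2022-01-01">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An Example Paper Title.</title>
    <year>2022</year>
    <booktitle>EXAMPLE</booktitle>
  </inproceedings>
</dblp>"""


def extract_records(xml_text):
    """Return one dict per publication element found under the root."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root:
        year_text = element.findtext("year")
        records.append(
            {
                "key": element.get("key"),
                "type": element.tag,
                # DBLP titles usually end with a period; drop it for clean storage.
                "title": (element.findtext("title") or "").strip().rstrip("."),
                "authors": [author.text for author in element.findall("author")],
                "year": int(year_text) if year_text else None,
            }
        )
    return records


records = extract_records(SAMPLE)
print(records[0]["title"], records[0]["authors"], records[0]["year"])
```

Each dictionary could then be enriched and passed on to a storage client; how the crawler itself does this is best checked in the source code.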

Starting with version 1.0.2, this project uses semantic versioning and supports SemanticScholar. For more information about the supported features, see the releases.

Installation & Setup

First install the package manager poetry:

    pip install poetry

Then run:

    poetry install

To start the crawling process, run:

    poetry run cli main --s2_use_papers --s2_use_abstracts --s2_filter_dblp

For help run:

    poetry run cli main --help
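
If you want to launch the crawler from a script rather than by hand, the command above can be assembled programmatically. The following is a minimal sketch: the `build_crawl_command` helper and its parameter names are hypothetical, and only the three flags shown above are taken from this README. Whether the crawler behaves sensibly when a flag is omitted is an assumption to verify with `--help`.

```python
import shlex


def build_crawl_command(use_papers=True, use_abstracts=True, filter_dblp=True):
    """Assemble the argument list for the crawl command shown in this README."""
    command = ["poetry", "run", "cli", "main"]
    if use_papers:
        command.append("--s2_use_papers")
    if use_abstracts:
        command.append("--s2_use_abstracts")
    if filter_dblp:
        command.append("--s2_filter_dblp")
    return command


command = build_crawl_command()
print(shlex.join(command))
# To actually start crawling from the repository root, one could run:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the command as a list avoids shell quoting issues; `shlex.join` is used only to print a copy-pasteable version.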

Code quality and tests

To maintain a consistent and well-tested repository, we use unit tests, linting, and type checking with GitHub Actions. We use pytest for testing, pylint for linting, and pyright for typing. Every time code is pushed to our repository, these checks run and must meet certain requirements before you can merge the code into our master branch.

Whenever you create a pull request against the default branch, GitHub Actions creates a CI job that runs the unit tests and linting.

To run all checks that are executed during CI locally, run:

    poetry run poe alltest

Contributing

Fork the repo, make changes and send a PR. We'll review it together!

Commit messages should follow Angular's conventions.
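
As a rough illustration of that convention, a commit header has the form `<type>(<scope>): <subject>`, where the scope is optional. The sketch below checks a header against this shape. The `is_valid_header` helper is hypothetical and not part of this repository, and the list of allowed types is an assumption based on Angular's contribution guidelines, so check it against the current guidelines before relying on it.

```python
import re

# Assumed set of commit types from Angular's contribution guidelines.
ALLOWED_TYPES = ("build", "ci", "docs", "feat", "fix", "perf", "refactor", "test")

# <type>(<scope>): <subject>, with the parenthesized scope being optional.
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()]+)\))?: (?P<subject>.+)$"
)


def is_valid_header(header):
    """Return True if the first line of a commit message matches the convention."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        return False
    return match.group("type") in ALLOWED_TYPES


for header in (
    "feat(crawler): add retry for failed requests",
    "fix: handle empty abstracts",
    "Added some stuff",
):
    print(header, "->", is_valid_header(header))
```

This checks only the header line; the body and footer rules of the convention are not covered here.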

License

This project is licensed under the terms of the MIT license. For more information, please see the LICENSE file.

Citation

If you use this repository, or use our tool for analysis, please cite our work:

@inproceedings{Wahle2022c,
  title     = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
  author    = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
  year      = {2022},
  month     = {July},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  publisher = {European Language Resources Association},
  address   = {Marseille, France},
  doi       = {}
}

Also make sure to cite the following papers if you use SemanticScholar data:

@inproceedings{ammar-etal-2018-construction,
  title     = "Construction of the Literature Graph in Semantic Scholar",
  author    = "Ammar, Waleed and Groeneveld, Dirk and Bhagavatula, Chandra and Beltagy, Iz",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
  month     = jun,
  year      = "2018",
  address   = "New Orleans - Louisiana",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/N18-3011",
  doi       = "10.18653/v1/N18-3011",
  pages     = "84--91"
}

@inproceedings{lo-wang-2020-s2orc,
  title     = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
  author    = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month     = jul,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/2020.acl-main.447",
  doi       = "10.18653/v1/2020.acl-main.447",
  pages     = "4969--4983"
}

Owner

  • Name: Jan Philip Wahle
  • Login: jpwahle
  • Kind: user
  • Location: Göttingen
  • Company: @gipplab

👨🏼‍💻 Computer Science Researcher | 📍Göttingen, Germany

Citation (CITATION.bib)

@inproceedings{Wahle2022c,
  title     = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
  author    = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
  year      = {2022},
  month     = {July},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  publisher = {European Language Resources Association},
  address   = {Marseille, France},
  doi       = {}
}

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/main.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • snok/install-poetry v1 composite
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v2 composite
  • ad-m/github-push-action master composite
  • mathieudutour/github-tag-action v6.0 composite
  • ncipollo/release-action v1.10.0 composite
  • snok/install-poetry v1 composite
Dockerfile docker
  • python 3.8 build
poetry.lock pypi
  • black 22.8.0 develop
  • cfgv 3.3.1 develop
  • distlib 0.3.6 develop
  • filelock 3.8.0 develop
  • flake8 4.0.1 develop
  • flake8-annotations 2.9.1 develop
  • flake8-black 0.2.5 develop
  • flake8-docstrings 1.6.0 develop
  • flake8-isort 4.2.0 develop
  • identify 2.5.4 develop
  • importlib-metadata 4.12.0 develop
  • iniconfig 1.1.1 develop
  • isort 5.10.1 develop
  • lxml-stubs 0.3.1 develop
  • mako 1.2.2 develop
  • markdown 3.4.1 develop
  • markupsafe 2.1.1 develop
  • mccabe 0.6.1 develop
  • mypy 0.910 develop
  • mypy-extensions 0.4.3 develop
  • nodeenv 1.7.0 develop
  • packaging 21.3 develop
  • pandas-stubs 1.4.4.220906 develop
  • pastel 0.2.1 develop
  • pathspec 0.10.1 develop
  • pdoc3 0.10.0 develop
  • pep8-naming 0.13.2 develop
  • platformdirs 2.5.2 develop
  • pluggy 1.0.0 develop
  • poethepoet 0.11.0 develop
  • pre-commit 2.20.0 develop
  • py 1.11.0 develop
  • pycodestyle 2.8.0 develop
  • pydocstyle 6.1.1 develop
  • pyflakes 2.4.0 develop
  • pyparsing 3.0.9 develop
  • pytest 7.1.3 develop
  • pytest-mock 3.8.2 develop
  • pyyaml 6.0 develop
  • snowballstemmer 2.2.0 develop
  • toml 0.10.2 develop
  • tomli 1.2.3 develop
  • tqdm-stubs 0.2.1 develop
  • types-appdirs 1.4.3 develop
  • types-beautifulsoup4 4.11.6 develop
  • types-click 7.1.8 develop
  • types-pytz 2022.2.1.0 develop
  • types-requests 2.28.9 develop
  • types-urllib3 1.26.23 develop
  • typing-extensions 4.3.0 develop
  • virtualenv 20.16.4 develop
  • zipp 3.8.1 develop
  • appdirs 1.4.4
  • attrs 22.1.0
  • beautifulsoup4 4.11.1
  • certifi 2022.6.15
  • charset-normalizer 2.1.1
  • click 8.1.3
  • colorama 0.4.5
  • idna 3.3
  • jsonlines 3.1.0
  • lxml 4.9.1
  • numpy 1.23.2
  • pandas 1.4.4
  • python-dateutil 2.8.2
  • pytz 2022.2.1
  • requests 2.28.1
  • six 1.16.0
  • soupsieve 2.3.2.post1
  • tqdm 4.64.1
  • urllib3 1.26.12
  • xmltodict 0.12.0
pyproject.toml pypi
  • black ^22.1.0 develop
  • flake8 ^4.0.1 develop
  • flake8-annotations ^2.7.0 develop
  • flake8-black ^0.2.3 develop
  • flake8-docstrings ^1.6.0 develop
  • flake8-isort ^4.1.1 develop
  • isort ^5.9.3 develop
  • lxml-stubs ^0.3.0 develop
  • mypy ^0.910 develop
  • pandas-stubs ^1.4.4 develop
  • pdoc3 ^0.10.0 develop
  • pep8-naming ^0.13.2 develop
  • poethepoet ^0.11.0 develop
  • pre-commit ^2.15.0 develop
  • pytest-mock ^3.6.1 develop
  • tqdm-stubs ^0.2.1 develop
  • types-appdirs ^1.4.1 develop
  • types-beautifulsoup4 ^4.10.5 develop
  • types-click ^7.1.8 develop
  • types-requests ^2.26.0 develop
  • appdirs ^1.4.4
  • beautifulsoup4 ^4.10.0
  • click ^8.0.3
  • jsonlines ^3.1.0
  • lxml ^4.6.4
  • pandas ^1.4.4
  • python >=3.8,<3.11
  • requests ^2.26.0
  • tqdm ^4.62.3
  • xmltodict ^0.12.0