cs-insights-crawler
This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client that stores data in the cs-insights-backend.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (17.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: jpwahle
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://aclanthology.org/2022.lrec-1.283.pdf
- Size: 8.67 MB
Statistics
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 8
- Releases: 0
Metadata Files
README.md
This is the official crawler implementation for the D3 dataset, written almost entirely in Python. The crawler is also used for the cs-insights project.
Starting from version 1.0.2, this project uses semantic versioning and supports Semantic Scholar. For more information about the supported features, see the releases.
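As a minimal sketch of what the semantic versioning statement means in practice, the snippet below parses `MAJOR.MINOR.PATCH` strings and checks whether a release is at or after 1.0.2. The names `parse_version` and `uses_semver` are hypothetical helpers for illustration and are not part of this package.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A plain MAJOR.MINOR.PATCH version; tuples compare field by field."""

    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH', tolerating a leading 'v' as in git tags."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
    return Version(major, minor, patch)


# The README names 1.0.2 as the first release that follows semantic versioning.
FIRST_SEMVER_RELEASE = parse_version("1.0.2")


def uses_semver(release: str) -> bool:
    """Return True if the release is 1.0.2 or newer."""
    return parse_version(release) >= FIRST_SEMVER_RELEASE
```

Comparing parsed tuples rather than raw strings matters because, as strings, `"1.10.0"` would sort before `"1.9.3"`.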
Installation & Setup
First install the package manager poetry:
```console
pip install poetry
```
Then run:
```console
poetry install
```
To start the crawling process, run:
```console
poetry run cli main --s2_use_papers --s2_use_abstracts --s2_filter_dblp
```
For help run:
```console
poetry run cli main --help
```
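For scripted or scheduled runs, the commands above could be wrapped in a small Python helper. This is only a sketch: `build_crawl_command` and `run_crawl` are hypothetical names, the flags are the ones shown in this README, and the helper assumes `poetry` is on the `PATH` and that it is called from the repository root.

```python
import subprocess
from typing import List, Sequence


def build_crawl_command(flags: Sequence[str]) -> List[str]:
    """Build the argument list for `poetry run cli main` with the given flags."""
    return ["poetry", "run", "cli", "main", *[f"--{flag}" for flag in flags]]


def run_crawl(flags: Sequence[str], dry_run: bool = True) -> List[str]:
    """Return the command; execute it only when dry_run is False."""
    command = build_crawl_command(flags)
    if not dry_run:
        # check=True raises CalledProcessError if the crawler exits non-zero.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run: prints the command without starting a crawl.
    print(run_crawl(["s2_use_papers", "s2_use_abstracts", "s2_filter_dblp"]))
```

Passing the command as a list instead of a single shell string avoids quoting problems and does not require `shell=True`.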
Code quality and tests
To maintain a consistent and well-tested repository, we use unit tests, linting, and type checking with GitHub Actions. We use pytest for testing, pylint for linting, and pyright for typing. Every time code is pushed to the repository, these checks run and must pass before the code can be merged into the default branch.
Whenever you create a pull request against the default branch, GitHub actions will create a CI job executing unit tests and linting.
To run all tests that are tested during CI locally, run:
```console
poetry run poe alltest
```
Contributing
Fork the repo, make changes and send a PR. We'll review it together!
Commit messages should follow Angular's conventions.
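To check a commit message before pushing, a header check along the following lines may help. It is a sketch under assumptions: the type list mirrors Angular's contributing guidelines as commonly documented, this repository may accept additional types, and `is_angular_header` is a hypothetical helper rather than tooling shipped with the project.

```python
import re

# Commit types listed in Angular's commit message guidelines (assumed here).
ANGULAR_TYPES = ("build", "ci", "docs", "feat", "fix", "perf", "refactor", "test")

# Header shape: <type>(<optional scope>): <summary>
HEADER_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ANGULAR_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<summary>\S.*)$"
)


def is_angular_header(message: str) -> bool:
    """Check only the first line of a commit message against the header shape."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    return HEADER_PATTERN.match(header) is not None


if __name__ == "__main__":
    for example in ("feat(cli): add dblp filter", "Updated stuff"):
        print(example, "->", is_angular_header(example))
```

Only the header line is validated; the body and footer conventions are left out to keep the sketch short.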
License
This project is licensed under the terms of the Apache 2.0 license. For more information, please see the LICENSE file.
Citation
If you use this repository, or use our tool for analysis, please cite our work:
```bib
@inproceedings{Wahle2022c,
    title = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
    author = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
    year = {2022},
    month = {July},
    booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
    publisher = {European Language Resources Association},
    address = {Marseille, France},
    doi = {},
}
```
If you use Semantic Scholar data, please also cite the following papers:
```bib
@inproceedings{ammar-etal-2018-construction,
    title = "Construction of the Literature Graph in Semantic Scholar",
    author = "Ammar, Waleed and
      Groeneveld, Dirk and
      Bhagavatula, Chandra and
      Beltagy, Iz",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans - Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-3011",
    doi = "10.18653/v1/N18-3011",
    pages = "84--91",
}
```
```bib
@inproceedings{lo-wang-2020-s2orc,
    title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
    author = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.447",
    doi = "10.18653/v1/2020.acl-main.447",
    pages = "4969--4983"
}
```
Owner
- Name: Jan Philip Wahle
- Login: jpwahle
- Kind: user
- Location: Göttingen
- Company: @gipplab
- Website: https://jpwahle.com
- Twitter: jpwahle
- Repositories: 20
- Profile: https://github.com/jpwahle
👨🏼💻 Computer Science Researcher | 📍Göttingen, Germany
Citation (CITATION.bib)
@inproceedings{Wahle2022c,
title = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
author = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
year = {2022},
month = {July},
booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
address = {Marseille, France},
doi = {}
}
Dependencies
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/autobuild v2 composite
- github/codeql-action/init v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- snok/install-poetry v1 composite
- actions/checkout v3 composite
- actions/setup-python v2 composite
- ad-m/github-push-action master composite
- mathieudutour/github-tag-action v6.0 composite
- ncipollo/release-action v1.10.0 composite
- snok/install-poetry v1 composite
- python 3.8 build
- black 22.8.0 develop
- cfgv 3.3.1 develop
- distlib 0.3.6 develop
- filelock 3.8.0 develop
- flake8 4.0.1 develop
- flake8-annotations 2.9.1 develop
- flake8-black 0.2.5 develop
- flake8-docstrings 1.6.0 develop
- flake8-isort 4.2.0 develop
- identify 2.5.4 develop
- importlib-metadata 4.12.0 develop
- iniconfig 1.1.1 develop
- isort 5.10.1 develop
- lxml-stubs 0.3.1 develop
- mako 1.2.2 develop
- markdown 3.4.1 develop
- markupsafe 2.1.1 develop
- mccabe 0.6.1 develop
- mypy 0.910 develop
- mypy-extensions 0.4.3 develop
- nodeenv 1.7.0 develop
- packaging 21.3 develop
- pandas-stubs 1.4.4.220906 develop
- pastel 0.2.1 develop
- pathspec 0.10.1 develop
- pdoc3 0.10.0 develop
- pep8-naming 0.13.2 develop
- platformdirs 2.5.2 develop
- pluggy 1.0.0 develop
- poethepoet 0.11.0 develop
- pre-commit 2.20.0 develop
- py 1.11.0 develop
- pycodestyle 2.8.0 develop
- pydocstyle 6.1.1 develop
- pyflakes 2.4.0 develop
- pyparsing 3.0.9 develop
- pytest 7.1.3 develop
- pytest-mock 3.8.2 develop
- pyyaml 6.0 develop
- snowballstemmer 2.2.0 develop
- toml 0.10.2 develop
- tomli 1.2.3 develop
- tqdm-stubs 0.2.1 develop
- types-appdirs 1.4.3 develop
- types-beautifulsoup4 4.11.6 develop
- types-click 7.1.8 develop
- types-pytz 2022.2.1.0 develop
- types-requests 2.28.9 develop
- types-urllib3 1.26.23 develop
- typing-extensions 4.3.0 develop
- virtualenv 20.16.4 develop
- zipp 3.8.1 develop
- appdirs 1.4.4
- attrs 22.1.0
- beautifulsoup4 4.11.1
- certifi 2022.6.15
- charset-normalizer 2.1.1
- click 8.1.3
- colorama 0.4.5
- idna 3.3
- jsonlines 3.1.0
- lxml 4.9.1
- numpy 1.23.2
- pandas 1.4.4
- python-dateutil 2.8.2
- pytz 2022.2.1
- requests 2.28.1
- six 1.16.0
- soupsieve 2.3.2.post1
- tqdm 4.64.1
- urllib3 1.26.12
- xmltodict 0.12.0
- black ^22.1.0 develop
- flake8 ^4.0.1 develop
- flake8-annotations ^2.7.0 develop
- flake8-black ^0.2.3 develop
- flake8-docstrings ^1.6.0 develop
- flake8-isort ^4.1.1 develop
- isort ^5.9.3 develop
- lxml-stubs ^0.3.0 develop
- mypy ^0.910 develop
- pandas-stubs ^1.4.4 develop
- pdoc3 ^0.10.0 develop
- pep8-naming ^0.13.2 develop
- poethepoet ^0.11.0 develop
- pre-commit ^2.15.0 develop
- pytest-mock ^3.6.1 develop
- tqdm-stubs ^0.2.1 develop
- types-appdirs ^1.4.1 develop
- types-beautifulsoup4 ^4.10.5 develop
- types-click ^7.1.8 develop
- types-requests ^2.26.0 develop
- appdirs ^1.4.4
- beautifulsoup4 ^4.10.0
- click ^8.0.3
- jsonlines ^3.1.0
- lxml ^4.6.4
- pandas ^1.4.4
- python >=3.8,<3.11
- requests ^2.26.0
- tqdm ^4.62.3
- xmltodict ^0.12.0
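Several entries above use caret constraints such as `^1.4.4` or `^0.910`. The sketch below shows how such a constraint expands into a version range, following Poetry's documented caret rule as understood here: the leftmost non-zero component may not change. `caret_bounds` and `satisfies` are hypothetical helpers that handle purely numeric versions only, so a version such as `2.3.2.post1` is out of scope.

```python
from typing import Tuple


def caret_bounds(constraint: str) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Expand a caret constraint into (inclusive lower, exclusive upper) bounds."""
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    lower = parts + [0] * (3 - len(parts))
    # Bump the leftmost non-zero component; if every given component is zero,
    # bump the last component that was written out.
    bump = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = lower[:bump] + [lower[bump] + 1] + [0] * (2 - bump)
    return tuple(lower), tuple(upper)


def satisfies(version: str, constraint: str) -> bool:
    """Check whether a numeric version falls inside the caret range."""
    lower, upper = caret_bounds(constraint)
    candidate = [int(p) for p in version.split(".")]
    candidate += [0] * (3 - len(candidate))  # pad short versions like "0.910"
    return lower <= tuple(candidate) < upper


if __name__ == "__main__":
    for spec in ("^1.4.4", "^0.2.3", "^0.910"):
        print(spec, "->", caret_bounds(spec))
```

For example, the locked `pandas 1.4.4` falls inside the range that `^1.4.4` expands to, while any `2.x` release would be rejected.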