findpapers

Findpapers: A tool for helping researchers who are looking for related works

https://github.com/jonatasgrosman/findpapers

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ieee.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.5%) to scientific vocabulary

Keywords

academic academic-publishing acm arxiv bibtex biorxiv crawler ieee medrxiv paper papers pubmed research scientific-papers scientific-publications scientific-publishing scopus scraper systematic-mapping systematic-review
Last synced: 6 months ago

Repository

Findpapers: A tool for helping researchers who are looking for related works

Basic Info
  • Host: GitHub
  • Owner: jonatasgrosman
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 5.06 MB
Statistics
  • Stars: 277
  • Watchers: 3
  • Forks: 38
  • Open Issues: 7
  • Releases: 1
Topics
academic academic-publishing acm arxiv bibtex biorxiv crawler ieee medrxiv paper papers pubmed research scientific-papers scientific-publications scientific-publishing scopus scraper systematic-mapping systematic-review
Created over 5 years ago · Last pushed about 2 years ago
Metadata Files
Readme Contributing Funding License Citation

README.md

Findpapers


Findpapers is an application that helps researchers who are looking for references for their work. The application will perform searches in several databases (currently ACM, arXiv, bioRxiv, IEEE, medRxiv, PubMed, and Scopus) from a user-defined search query.

In summary, this tool will help you to perform the process below:

Workflow

Requirements

  • Python 3.7+

Installation

```console
$ pip install findpapers
```

You can check your Findpapers version by running:

```console
$ findpapers version
```

If you have an old version of the tool and want to upgrade it, run:

```console
$ pip install findpapers --upgrade
```

How to use it?

All application actions are command-line based. The available commands are:

  • findpapers search: Search for papers metadata using a query. This search will be made by matching the query with the paper's title, abstract, and keywords.

  • findpapers refine: Refine the search results by selecting/classifying the papers

  • findpapers download: Download full-text papers using the search results

  • findpapers bibtex: Generate a BibTeX file from the search results

You can control the commands' logging verbosity with the -v (or --verbose) argument.

In the following sections, we will show how to use the Findpapers commands. All the commands also have a --help argument that displays a summary of their usage, e.g., findpapers search --help.

Search query construction

First of all, we need to know how to build search queries. Queries must follow these rules:

  • All query terms must be non-empty and enclosed in square brackets. E.g., [term a]

  • The query can contain boolean operators, but they must be uppercase. The allowed operators are AND, OR, and NOT. E.g., [term a] AND [term b]

  • All the operators must have at least one whitespace before and after them (tabs or newlines can be valid too). E.g., [term a] OR [term b] OR [term c]

  • The NOT operator must always be preceded by an AND operator. E.g., [term a] AND NOT [term b]

  • A subquery needs to be enclosed by parentheses. E.g., [term a] AND ([term b] OR [term c])

  • The composition of terms is only allowed through boolean operators. Queries like "[term a] [term b]" are invalid

We still have a few more rules that are only applicable on bioRxiv and medRxiv databases:

  • On subqueries with parentheses, only 1-level grouping is supported, i.e., queries with 2-level grouping like [term a] OR (([term b] OR [term c]) AND [term d]) are considered invalid

  • Only "OR" connectors are allowed between parentheses, i.e., queries like ([term a] OR [term b]) AND ([term c] OR [term d]) are considered invalid

  • Only "OR" and "AND" connectors are allowed, i.e., queries like [term a] AND NOT [term b] are considered invalid

  • Mixed connectors are not allowed on queries (or subqueries when parentheses are used), i.e., queries like [term a] OR [term b] AND [term b] are considered invalid. But queries like [term a] OR [term b] OR [term b] are considered valid

You can also use wildcards in the query. Use a question mark (?) to replace exactly one character, and an asterisk (*) to replace zero or more characters:

  • [son?] will match song, sons, ... (but won't match son)

  • [son*] will match son, song, sons, sonic, songwriting, ...

There are some rules that you'll need to follow when using wildcards:

  • Cannot be used at the start of a search term;
  • A minimum of 3 characters preceding the asterisk wildcard (*) is required;
  • The asterisk wildcard (*) can only be used at the end of a search term;
  • Can be used only in single terms;
  • Only one wildcard can be included in a search term.

Note: The bioRxiv and medRxiv databases don't support any wildcards, and the IEEE and PubMed databases only support the "*" wildcard.
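As a quick illustration, the wildcard rules above can be checked for a single term with a few lines of Python. This is a sketch for the reader's benefit, not Findpapers' actual validation code:

```python
def wildcard_ok(term: str) -> bool:
    """Check the wildcard placement rules for one search term (illustrative)."""
    wildcards = term.count("*") + term.count("?")
    if wildcards == 0:
        return True            # no wildcard, nothing to check
    if wildcards > 1:
        return False           # only one wildcard per term
    if " " in term:
        return False           # wildcards only in single terms
    if term[0] in "*?":
        return False           # not at the start of a term
    if "*" in term:
        # asterisk only at the end, with at least 3 preceding characters
        return term.endswith("*") and len(term) >= 4
    return True

print(wildcard_ok("son*"))   # True
print(wildcard_ok("?erm"))   # False
```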

Let's see some examples of valid and invalid queries:

| Query | Valid? |
| ------------- | ------------- |
| [term a] | Yes |
| ([term a] OR [term b]) | Yes |
| [term a] OR [term b] | Yes |
| [term a] AND [term b] | Yes |
| [term a] AND NOT ([term b] OR [term c]) | Yes |
| [term a] OR ([term b] AND ([term*] OR [t?rm])) | Yes |
| [term a]OR[term b] | **No** (no whitespace between terms and boolean operator) |
| ([term a] OR [term b] | **No** (missing parentheses) |
| [term a] or [term b] | **No** (lowercase boolean operator) |
| term a OR [term b] | **No** (missing square brackets) |
| [term a] [term b] | **No** (missing boolean operator) |
| [term a] XOR [term b] | **No** (invalid boolean operator) |
| [term a] OR NOT [term b] | **No** (NOT operator must be preceded by AND) |
| [] AND [term b] | **No** (empty term) |
| [some term*] | **No** (wildcards can be used only in single terms) |
| [?erm] | **No** (wildcards cannot be used at the start of a search term) |
| [te*] | **No** (a minimum of 3 characters preceding the asterisk wildcard is required) |
| [ter*s] | **No** (the asterisk wildcard can only be used at the end of a search term) |
| [t?rm?] | **No** (only one wildcard can be included in a search term) |
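A simplified validator for some of these rules can be sketched in Python. This is illustrative only; Findpapers has its own, more complete query parser:

```python
import re

# Uppercase boolean operators must be surrounded by whitespace
OPERATOR = re.compile(r"\s(AND NOT|AND|OR)\s")

def is_valid_query(query: str) -> bool:
    """Check a few of the query rules above (simplified, illustrative only)."""
    if query.count("(") != query.count(")"):
        return False                       # unbalanced parentheses
    # Drop parentheses for this simplified check, then split on operators
    flat = " " + query.replace("(", " ").replace(")", " ") + " "
    parts = OPERATOR.split(flat)
    # Even-indexed parts are the term slots; each must be a single
    # non-empty bracketed term like [term a]
    return all(
        re.fullmatch(r"\[[^\[\]]+\]", part.strip()) for part in parts[::2]
    )

print(is_valid_query("[term a] AND NOT ([term b] OR [term c])"))  # True
print(is_valid_query("[term a] OR NOT [term b]"))                 # False
```

Note that a bare NOT, a lowercase operator, or a missing bracket all leave an unbracketed fragment in a term slot, which is why a single regex check per slot covers several of the rules at once.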

Basic example (TL;DR)

  • Searching for papers:

```console
$ findpapers search /some/path/search.json -q "[happiness] AND ([joy] OR [peace of mind]) AND NOT [stressful]"
```

  • Refining search results:

```console
$ findpapers refine /some/path/search.json
```

  • Downloading full-text from selected papers:

```console
$ findpapers download /some/path/search.json /some/path/papers/ -s
```

  • Generating BibTeX file from selected papers:

```console
$ findpapers bibtex /some/path/search.json /some/path/mybib.bib -s
```

Advanced example

This advanced usage documentation can be a bit boring to read (and write), so I think it's better to go for a storytelling approach here.

Let's take a look at Dr. McCartney's research. He's a computer scientist interested in AI and music, so he created a search query to collect papers that can help with his research and exported this query to an environment variable.

```console
$ export QUERY="([artificial intelligence] OR [AI] OR [machine learning] OR [ML] OR [deep learning] OR [DL]) AND ([music] OR [s?ng])"
```

Dr. McCartney wants to test his query first, so he decides to collect only 20 papers to check whether the query suits his research. (Findpapers results are sorted by publication date in descending order. Note that some papers only have their publication year defined; in those cases, Findpapers sets the publication date to January 1st of that year.)

```console
$ findpapers search /some/path/search_paul.json --query "$QUERY" --limit 20
```

But after taking a look at the results contained in the search_paul.json file, he notices two problems:

  • Only one database was used to collect all 20 papers
  • Some collected papers were about drums, but he doesn't like drums or drummers
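The first problem can also be checked programmatically by inspecting the result file. The schema used below (a top-level `papers` list where each entry records the `databases` it was found in) is an assumption based on this walkthrough; inspect your own search.json before relying on it:

```python
import json
from collections import Counter

def papers_per_database(search: dict) -> Counter:
    """Count how many collected papers each database contributed."""
    counts = Counter()
    for paper in search.get("papers", []):
        counts.update(paper.get("databases", []))
    return counts

# In practice: search = json.load(open("/some/path/search_paul.json"))
sample = {
    "papers": [
        {"title": "A", "databases": ["arXiv"]},
        {"title": "B", "databases": ["arXiv"]},
        {"title": "C", "databases": ["arXiv", "PubMed"]},
    ]
}
print(papers_per_database(sample))  # Counter({'arXiv': 3, 'PubMed': 1})
```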

So he decides to solve these problems by:

  • Reformulating his query, and placing it inside a file to make his life easier (note that in a text file, you can split your query into multiple lines; it is much more comfortable to read, right?).

/some/path/query.txt

```
([artificial intelligence] OR [AI] OR [machine learning] OR [ML] OR [deep learning] OR [DL])

AND

([music] OR [s?ng])

AND NOT

[drum*]
```
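Whitespace, including newlines, is valid around the operators, so a multi-line query file is equivalent to its single-line form. A quick way to see the collapsed query (the text below mirrors the example file; in practice you would read it from disk):

```python
# Simulating the contents of /some/path/query.txt; in practice use
# open("/some/path/query.txt").read()
query_text = """([artificial intelligence] OR [AI] OR [machine learning] OR [ML] OR [deep learning] OR [DL])

AND

([music] OR [s?ng])

AND NOT

[drum*]"""

# Collapse all runs of whitespace (including newlines) into single spaces
one_line = " ".join(query_text.split())
print(one_line)
```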

  • Performing the search limiting the number of papers that can be collected by each database.

```console
$ findpapers search /some/path/search_paul.json --query-file /some/path/query.txt --limit-db 4
```

Now his query returned the papers he wanted, but he realized one thing: no papers were collected from the Scopus or IEEE databases. Then he noticed that he needed to pass his Scopus and IEEE API access keys when calling the search command. So he went to https://dev.elsevier.com and https://developer.ieee.org, generated the access keys, and used them in the search.

```console
$ export IEEE_TOKEN=SOME_SUPER_SECRET_TOKEN

$ export SCOPUS_TOKEN=SOME_SUPER_SECRET_TOKEN

$ findpapers search /some/path/search_paul.json --query-file /some/path/query.txt --limit-db 4 --token-ieee "$IEEE_TOKEN" --token-scopus "$SCOPUS_TOKEN"
```

Now everything is working as he expected, so it's time to do the final papers search. He defines that he wants to collect only works published between 2000 and 2020, decides that he only wants papers collected from ACM, IEEE, and Scopus, and that he only wants papers published in a journal or in conference proceedings. (Tip: the available publication types in Findpapers are journal, conference proceedings, book, and other. When a publication does not fit any of the other types it is classified as "other", e.g., magazines, newsletters, unpublished manuscripts.)

```console
$ findpapers search /some/path/search_paul.json --query-file /some/path/query.txt --token-ieee "$IEEE_TOKEN" --token-scopus "$SCOPUS_TOKEN" --since 2000-01-01 --until 2020-12-31 --databases "acm,ieee,scopus" --publication-types "journal,conference proceedings"
```

The searching process took a long time, but after many cups of coffee, Dr. McCartney finally has a good list of papers with the potential to help in his research. All the information collected is in the search_paul.json file. He could access this file now and manually filter which works are most interesting for him, but he prefers to use the Findpapers refine command for this.

First, he wants to filter the papers looking only at their basic information.

```console
$ findpapers refine /some/path/search_paul.json
```

refine-01

After completing the first filtering round, he decides to do a second pass over the selected papers, now looking at each paper's extra info (citations, DOI, publication name, etc.) and abstract. He also chooses to classify the papers during this further filtering (tip: he'll need to use the spacebar for category selection). To help in this process, he also decides to highlight some keywords contained in the abstract.

```console
$ export CATEGORIES_CONTRIBUTION="Contribution:Metric,Tool,Model,Method"

$ export CATEGORIES_RESEARCH_TYPE="Research Type:Validation Research,Solution Proposal,Philosophical,Opinion,Experience,Other"

$ export HIGHLIGHTS="propose, achiev, accuracy, method, metric, result, limitation, state of the art"

$ findpapers refine /some/path/search_paul.json --selected --abstract --extra-info --categories "$CATEGORIES_CONTRIBUTION" --categories "$CATEGORIES_RESEARCH_TYPE" --highlights "$HIGHLIGHTS"
```
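The `Facet:Option1,Option2,...` format used by the --categories argument can be unpacked with a small helper. This is an illustrative sketch, not Findpapers' own parsing code:

```python
def parse_category_spec(spec: str) -> tuple[str, list[str]]:
    """Split a 'Facet:Option1,Option2' categories string (illustrative)."""
    facet, _, options = spec.partition(":")
    return facet.strip(), [opt.strip() for opt in options.split(",")]

facet, options = parse_category_spec("Contribution:Metric,Tool,Model,Method")
print(facet)    # Contribution
print(options)  # ['Metric', 'Tool', 'Model', 'Method']
```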

refine-02

An interesting feature worth highlighting is that the tool automatically prevents paper duplication, merging the information when the same paper is found in different databases. You can see this in the image above, where Findpapers found the same work in the IEEE and Scopus databases (see the "Paper found in" value) and merged the paper's information into a single record.

Another interesting piece of extra information given by Findpapers (based on Beall's List) is whether a collected paper was published by a potentially predatory publisher (see the "Publication is potentially predatory" value). That is a really useful feature, because there is a lot of scientific misinformation out there.

Now that Dr. McCartney has selected all the papers he wanted, he wants to see all of them.

```console
$ findpapers refine /some/path/search_paul.json --selected --abstract --extra-info --list
```

He wants to see all the removed papers too.

```console
$ findpapers refine /some/path/search_paul.json --removed --abstract --extra-info --list
```

Then, he decides to download the full-text from all the selected papers which have a "Model" or "Tool" as a contribution.

```console
$ findpapers download /some/path/search_paul.json /some/path/papers --selected --categories "Contribution:Tool,Model"
```

He also wants to generate the BibTeX file from these papers.

```console
$ findpapers bibtex /some/path/search_paul.json /some/path/mybib.bib --selected --categories "Contribution:Tool,Model"
```

But when he compared the papers' data in the /some/path/mybib.bib and PDF files in the /some/path/papers folder, he noticed that many papers had not been downloaded.

So when he opened the /some/path/papers/download.log file, he could see the URLs of all papers that weren't downloaded correctly. After accessing these links, he noticed that some of them weren't downloaded due to limitations of Findpapers (currently, the tool uses a set of heuristics to perform the download that may not work in all cases). However, the vast majority of papers weren't downloaded because they were behind a paywall. Dr. McCartney has access to these papers when he's connected to the network of the university where he works, but unfortunately, he is at home right now.

But he discovers two things that could save him from this mess. First, the university provides a proxy for tunneling requests. Second, Findpapers accepts the configuration of a proxy URL. And of course, he'll use this feature (see the --proxy argument in the command below).

```console
$ findpapers download /some/path/search_paul.json /some/path/papers --selected --categories "Contribution:Tool,Model" --proxy "https://mccartney:super_secret_pass@liverpool.ac.uk:1234"
```
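A proxy URL of this form embeds the credentials, host, and port; Python's standard library can sanity-check one before you pass it to --proxy (the credentials and host below are the fictional ones from the story):

```python
from urllib.parse import urlsplit

proxy = urlsplit("https://mccartney:super_secret_pass@liverpool.ac.uk:1234")

print(proxy.username)  # mccartney
print(proxy.hostname)  # liverpool.ac.uk
print(proxy.port)      # 1234
```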

Now the vast majority of the papers he has access to have been downloaded correctly.

Finally, he decides to download the full text of all the selected works (regardless of their categorization) and generate their BibTeX file too. And, as he is very happy with the results, he also wants to include a Findpapers entry in the BibTeX file so he can cite the tool in his work.

```console
$ findpapers download /some/path/search_paul.json /some/path/papers --selected --proxy "https://mccartney:super_secret_pass@liverpool.ac.uk:1234"

$ findpapers bibtex /some/path/search_paul.json /some/path/mybib.bib --selected --findpapers
```

That's all, folks! We have reached the end of our journey. I hope Dr. McCartney can continue his research and publish his work without any major problems now. You can use findpapers in a more scriptable way too. Check out the search_paul.py file to see how you can do that.

As you can see, all the information collected and enriched by Findpapers is placed in a single JSON file. From this file, it is possible to create interesting visualizations of the collected data ...

charts

... So, use your imagination! (The samples/charts.py script made the visualization above).

The story above covers all the commands available in Findpapers. I know this documentation is unconventional, but I haven't had time to write a more formal version. You can help us improve it; take a look at the next section to see how.

Want to help?

See the contribution guidelines if you'd like to contribute to the Findpapers project.

You don't even need to know how to code to contribute to the project. Even improving our documentation is a valuable contribution.

If this project has been useful for you, please share it with your friends. This project could be helpful for them too.

If you like this project and want to motivate the maintainers, give us a :star:. This kind of recognition will make us very happy with the work that we've done with :heart:

You can also sponsor me :heart_eyes:

Citation

If you want to cite the tool, you can use:

```bibtex
@misc{grosman2020findpapers,
    title = {{Findpapers: A tool for helping researchers who are looking for related works}},
    author = {Grosman, Jonatas},
    howpublished = {\url{https://github.com/jonatasgrosman/findpapers}},
    year = {2020}
}
```

Owner

  • Name: Jonatas Grosman
  • Login: jonatasgrosman
  • Kind: user
  • Location: Brazil
  • Company: Pontifical Catholic University of Rio de Janeiro

PhD in Computer Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Grosman
  given-names: Jonatas
title: "Findpapers: A tool for helping researchers who are looking for related works"
date-released: 2020
url: "https://github.com/jonatasgrosman/findpapers"
preferred-citation:
  type: generic
  authors:
  - family-names: Grosman
    given-names: Jonatas
  title: "Findpapers: A tool for helping researchers who are looking for related works"
  year: 2020
  url: "https://github.com/jonatasgrosman/findpapers"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 59
  • Issue comment event: 6
  • Pull request event: 1
  • Fork event: 3
Last Year
  • Issues event: 2
  • Watch event: 59
  • Issue comment event: 6
  • Pull request event: 1
  • Fork event: 3

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 204
  • Total Committers: 2
  • Avg Commits per committer: 102.0
  • Development Distribution Score (DDS): 0.078
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jonatas Grosman j****n@g****m 188
Jonatas Grosman g****s@g****m 16

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 9
  • Total pull requests: 1
  • Average time to close issues: 24 days
  • Average time to close pull requests: N/A
  • Total issue authors: 7
  • Total pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 1.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • iannunes (3)
  • VladimirShitov (1)
  • mikecuoco (1)
  • youkq95 (1)
  • Erickfeio (1)
  • pdecazes (1)
  • ZeroCommits (1)
Pull Request Authors
  • ZeroCommits (1)
  • denisstrizhkin (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 336 last month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 24
  • Total maintainers: 1
pypi.org: findpapers

Findpapers is an application that helps researchers who are looking for references for their work.

  • Versions: 24
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 336 Last month
Rankings
Stargazers count: 5.7%
Forks count: 8.3%
Dependent packages count: 9.8%
Average: 14.3%
Dependent repos count: 21.8%
Downloads: 25.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • alabaster 0.7.12 develop
  • atomicwrites 1.4.0 develop
  • attrs 20.2.0 develop
  • babel 2.8.0 develop
  • coverage 5.3 develop
  • docutils 0.16 develop
  • imagesize 1.2.0 develop
  • jinja2 2.11.2 develop
  • markupsafe 1.1.1 develop
  • more-itertools 8.5.0 develop
  • packaging 20.4 develop
  • pluggy 0.13.1 develop
  • py 1.9.0 develop
  • pygments 2.7.1 develop
  • pyparsing 2.4.7 develop
  • pytest 5.4.3 develop
  • pytest-cov 2.10.1 develop
  • pytest-randomly 3.4.1 develop
  • pytz 2020.1 develop
  • snowballstemmer 2.0.0 develop
  • sphinx 3.2.1 develop
  • sphinxcontrib-applehelp 1.0.2 develop
  • sphinxcontrib-devhelp 1.0.2 develop
  • sphinxcontrib-htmlhelp 1.0.3 develop
  • sphinxcontrib-jsmath 1.0.1 develop
  • sphinxcontrib-qthelp 1.0.3 develop
  • sphinxcontrib-serializinghtml 1.1.4 develop
  • toml 0.10.1 develop
  • ansicon 1.89.0
  • blessed 1.17.6
  • certifi 2020.6.20
  • chardet 3.0.4
  • click 7.1.2
  • colorama 0.4.3
  • edlib 1.3.8.post1
  • idna 2.10
  • importlib-metadata 1.7.0
  • inquirer 2.7.0
  • jinxed 1.0.1
  • lxml 4.5.2
  • python-editor 1.0.4
  • readchar 2.0.1
  • requests 2.24.0
  • six 1.15.0
  • typer 0.3.2
  • urllib3 1.25.10
  • wcwidth 0.2.5
  • xmltodict 0.12.0
  • zipp 3.1.0
pyproject.toml pypi
  • Sphinx ^3.2.1 develop
  • coverage ^5.2.1 develop
  • pytest ^5.2 develop
  • pytest-cov ^2.10.1 develop
  • pytest-randomly ^3.4.1 develop
  • colorama ^0.4.3
  • edlib ^1.3.8
  • importlib-metadata ^1.0
  • inquirer ^2.7.0
  • lxml ^4.5.2
  • python ^3.7
  • requests ^2.24.0
  • typer ^0.3.2
  • xmltodict ^0.12.0
.github/workflows/publish.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/test.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • codecov/codecov-action v1 composite