linklittogbifid
A script to link publications citing GBIF downloads to specimens
https://github.com/agentschapplantentuinmeise/linklittogbifid
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.1%) to scientific vocabulary
Repository
A script to link publications citing GBIF downloads to specimens
Basic Info
- Host: GitHub
- Owner: AgentschapPlantentuinMeise
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 1.05 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Description of the linkLitToGbifId.ipynb
Script: Collect, Filter, and Process Cited Scientific Data
1. Objective
To systematically collect, filter, and process cited scientific data from the Global Biodiversity Information Facility (GBIF) literature API. The focus is on obtaining literature that has cited GBIF data and ensuring only relevant and peer-reviewed sources are included.
2. Data Sources
- GBIF API: The GBIF literature API provides access to a database of biodiversity-related literature that cites GBIF. This includes journals, working papers, books, and book sections.
3. Python Libraries and Storage
- Python Libraries:
  - requests: For making HTTP requests to the GBIF API.
  - json: For parsing and writing JSON data.
  - tqdm: For providing a progress bar during data retrieval.
  - os: For managing file and directory operations.
  - zipfile: For extracting and processing ZIP files.
  - csv: For handling CSV files.
  - sys: For adjusting system settings to handle large CSV files.
- Storage:
  - Local storage on the D: drive to manage large data files, including downloaded ZIP files, output CSVs, and error logs.
4. Procedure
4.1 Define the API Endpoint and Parameters
- API Endpoint: https://api.gbif.org/v1/literature/search
- Parameters:
  - contentType: Filters to "literature".
  - literatureType: Includes "JOURNAL", "WORKINGPAPER", "BOOK", and "BOOKSECTION".
  - relevance: Filters literature that is "GBIF_CITED".
  - peerReview: Ensures only peer-reviewed literature is included.
  - limit: Sets the number of records per request to 10.
  - offset: Starts the search from the first record.
4.2 Data Collection
- Fetch Initial Data:
- Make an initial API request to determine the total number of available records.
- If the initial request fails or the count is unavailable, terminate the process.
- Iterative Data Retrieval:
- Continuously request data from the API in batches of 10 records.
- Filter records that contain the gbifDownloadKey, indicating that the literature has associated GBIF data downloads.
- Update the offset parameter to fetch the next batch until all data is retrieved.
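The retrieval loop described above can be sketched as follows. This is a minimal illustration of the pagination and filtering logic, not the notebook's actual code; the function names are illustrative:

```python
import requests

API_URL = "https://api.gbif.org/v1/literature/search"
BASE_PARAMS = {
    "contentType": "literature",
    "literatureType": ["JOURNAL", "WORKINGPAPER", "BOOK", "BOOKSECTION"],
    "relevance": "GBIF_CITED",
    "peerReview": "true",
    "limit": 10,
}

def filter_page(page):
    """Keep only records that carry an associated GBIF download key."""
    return [r for r in page.get("results", []) if r.get("gbifDownloadKey")]

def fetch_filtered_entries():
    """Walk the API in batches of 10, yielding records with a gbifDownloadKey."""
    offset = 0
    while True:
        resp = requests.get(API_URL, params={**BASE_PARAMS, "offset": offset})
        resp.raise_for_status()  # terminate if a request fails
        page = resp.json()
        yield from filter_page(page)
        offset += BASE_PARAMS["limit"]
        if offset >= page.get("count", 0):
            break
```

Separating the pure `filter_page` step from the network loop makes the filtering logic easy to test without contacting the API.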
4.3 Data Storage
- Save the filtered records with gbifDownloadKey to a JSON file named filtered_gbif_entries.json for further processing.
4.4 Increase CSV Field Size Limit
- Increase the field size limit for CSV processing to handle large entries, setting it to the maximum allowable size.
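Raising the limit to the maximum allowable size can be done as sketched below; the back-off loop is a common workaround because `csv.field_size_limit(sys.maxsize)` overflows on some platforms:

```python
import csv
import sys

# Raise the CSV field size limit to the largest value the platform accepts.
# csv.field_size_limit(sys.maxsize) raises OverflowError on some builds,
# so back off until a value is accepted.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = limit // 10
```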
4.5 Loading and Saving Processed Records
Because the files are so large, it is likely that the process will be interrupted and will have to be restarted; hence the need for a skip file.
1. Load Processed DOIs: Load the previously processed DOIs from a skip file to avoid reprocessing.
2. Save Processed DOIs: Append each processed DOI to the skip file to keep track of completed entries.
3. Load Downloaded Keys: Load previously downloaded keys to avoid duplicate downloads.
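A skip file of this kind can be sketched as below; the file name and helper names are assumptions, not the notebook's actual identifiers:

```python
import os

SKIP_FILE = "processed_dois.txt"  # hypothetical name; the notebook may differ

def load_processed_dois(path=SKIP_FILE):
    """Return the set of DOIs already handled in a previous run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def mark_processed(doi, path=SKIP_FILE):
    """Append a finished DOI so a restart can skip it."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(doi + "\n")
```

Appending one DOI per line after each entry completes means a restart loses at most the entry that was in flight when the interruption happened.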
4.6 Data Download and Processing
- Directory Management:
- Ensure that all necessary directories for storing downloads, logs, and outputs exist.
- Download Data:
  - Download the data files associated with each gbifDownloadKey and save them as ZIP files in the specified directory.
- Unzip and Extract:
  - Unzip downloaded files and check for the presence of relevant data (e.g., occurrence.txt or CSV files).
- Filter and Save Relevant Data:
- Filter records for preserved specimens and append them to the output CSV file.
- Include columns such as gbifID, year, countryCode, gbifDownloadKey, and doi.
- Error Handling:
- Log errors encountered during download or processing to an error log file for review.
4.7 Data Cleanup
- After successful extraction and processing, delete the downloaded ZIP files and extracted contents to conserve storage space.
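The preserved-specimen filter in step 4.6 could look like the sketch below. The Darwin Core column names (gbifID, basisOfRecord, etc.) follow the convention used in GBIF occurrence downloads; the exact filter condition in the notebook may differ:

```python
import csv

WANTED = ["gbifID", "year", "countryCode"]

def filter_preserved_specimens(occurrence_path, out_writer, download_key, doi):
    """Append preserved-specimen rows from a GBIF occurrence.txt to the output CSV.

    occurrence.txt is tab-separated; each kept row gets the download key and
    DOI appended so the specimen record can be traced back to the citing paper.
    """
    with open(occurrence_path, encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        for row in reader:
            if row.get("basisOfRecord") == "PRESERVED_SPECIMEN":
                out_writer.writerow(
                    [row.get(c, "") for c in WANTED] + [download_key, doi]
                )
```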
5. Outputs
- Filtered Entries File: filtered_gbif_entries.json containing all relevant entries with gbifDownloadKey.
- Output CSV File: output_data.csv with filtered and processed data of preserved specimens.
- Error Log File: error_log.txt documenting any errors encountered during processing.
Description of the GBIFLitTopicsAnalysis.ipynb
Script: Topic Analysis and Network Visualization of GBIF Literature
1. Objective
To analyze and visualize the thematic topics present in the literature that reference specimens in GBIF. It uses the GBIF Literature API to extract topics associated with each Digital Object Identifier (DOI) and constructs a network graph representing the co-occurrence of these topics.
2. Materials and Tools
Python Libraries:
- requests: For querying the GBIF Literature API.
- pandas: For handling and processing CSV data.
- networkx: For constructing and analyzing the topic co-occurrence network.
- matplotlib: For visualizing the topic network graph.
- tqdm.notebook: For providing progress bars during data processing.
Data Input:
- CSV File (allDOIs.csv): A CSV file containing a list of DOIs that reference GBIF data. This file is used as the source for querying the literature API. allDOIs.csv is created from output_data.csv using awk: awk -F',' '!seen[$5]++ { print $5 }' filename.csv > unique_dois.txt
3. Procedure
3.1 Data Acquisition
- Querying the GBIF Literature API:
- A function query_gbif_literature(doi) is defined to fetch literature data from the GBIF API using a given DOI.
- The function returns the JSON response if the request is successful; otherwise, it prints an error message.
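A minimal version of such a function might look like this; the timeout value and error handling are assumptions, and the query is split in two so the parameter construction can be checked without network access:

```python
import requests

API_URL = "https://api.gbif.org/v1/literature/search"

def build_query(doi):
    """Request URL and parameters for one DOI (split out for offline testing)."""
    return API_URL, {"doi": doi}

def query_gbif_literature(doi):
    """Return the parsed JSON response for a DOI, or None on failure."""
    url, params = build_query(doi)
    resp = requests.get(url, params=params, timeout=30)
    if resp.status_code != 200:
        print(f"Error {resp.status_code} while querying DOI {doi}")
        return None
    return resp.json()
```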
3.2 Data Extraction and Processing
Extract Topics:
- The function extract_topics(data) processes the API response to extract thematic topics associated with each literature entry.
- Topics are extracted if they exist in the results; otherwise, a message is printed indicating the absence of topics.
Read DOIs from CSV:
- The script reads the allDOIs.csv file using pandas to obtain a list of DOIs for further processing.
Topic Count and Co-occurrence Analysis:
- The script iterates through each DOI, querying the GBIF API and extracting topics.
- It maintains two dictionaries:
  - topic_counts: Tracks the frequency of each unique topic.
  - topic_cooccurrences: Tracks how often pairs of topics co-occur within the same literature entry.
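The two tallies can be sketched with `collections.Counter`; the dictionary names match the description above, while the helper function is illustrative:

```python
from collections import Counter
from itertools import combinations

topic_counts = Counter()
topic_cooccurrences = Counter()

def tally_topics(topics):
    """Update both tallies for one literature entry's topic list."""
    topic_counts.update(topics)
    # Sort each pair so ("A", "B") and ("B", "A") count as the same edge,
    # and deduplicate so repeated topics in one entry count the pair once.
    for pair in combinations(sorted(set(topics)), 2):
        topic_cooccurrences[pair] += 1
```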
3.3 Network Graph Construction
Create Network Graph:
- A network graph G is created using the networkx library.
  - Nodes: Represent unique topics. The size of each node is proportional to the count of the topic in the literature.
  - Edges: Represent co-occurrences between topics. The weight of each edge corresponds to the number of times the topics co-occurred.
Add Nodes and Edges:
- Nodes are added to the graph with attributes such as size and count.
- Edges are added between topics based on their co-occurrence frequency.
3.4 Network Graph Export
- The constructed topic network is saved in the GraphML format (topic_network.graphml). This format allows for further analysis and visualization using various graph tools.
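The graph construction and GraphML export might be sketched as follows; the size-scaling factor and the sample topic data are illustrative, not values from the notebook:

```python
import networkx as nx

def build_topic_graph(topic_counts, topic_cooccurrences):
    """One node per topic (sized by its count), one weighted edge per co-occurring pair."""
    G = nx.Graph()
    for topic, count in topic_counts.items():
        G.add_node(topic, count=count, size=count * 100)  # size scaling is illustrative
    for (a, b), weight in topic_cooccurrences.items():
        G.add_edge(a, b, weight=weight)
    return G

# Tiny illustrative input, not real topic counts
G = build_topic_graph(
    {"ECOLOGY": 5, "TAXONOMY": 3},
    {("ECOLOGY", "TAXONOMY"): 2},
)
nx.write_graphml(G, "topic_network.graphml")
```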
3.5 Network Visualization (Optional)
Visualize Network:
- The graph is visualized using matplotlib.
- The size of each node in the visualization is scaled according to its count attribute.
- Nodes are colored, and edges are drawn with varying thickness based on their weights.
Plot Configuration:
- A spring layout is used to arrange the nodes for better visualization.
- Node size, font size, and colors are customized for readability.
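A rendering along these lines could be sketched as below; the colors, scaling factors, and output file name are assumptions, and the Agg backend is used here only so the sketch also runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import networkx as nx

def draw_topic_network(G, out_path="topic_network.png"):
    """Render the topic graph: node sizes from counts, edge widths from weights."""
    pos = nx.spring_layout(G, seed=42)  # spring layout; fixed seed for repeatability
    sizes = [G.nodes[n].get("count", 1) * 300 for n in G]
    widths = [G[u][v].get("weight", 1) for u, v in G.edges]
    nx.draw_networkx_nodes(G, pos, node_size=sizes, node_color="lightblue")
    nx.draw_networkx_edges(G, pos, width=widths)
    nx.draw_networkx_labels(G, pos, font_size=8)
    plt.axis("off")
    plt.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close()
```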
4. Outputs
- Network Graph File: topic_network.graphml containing the constructed topic network with nodes and edges.
- Visual Plot: A visual representation of the topic network is optionally displayed, showing the structure and connections between topics.
Owner
- Name: Botanic Garden Meise
- Login: AgentschapPlantentuinMeise
- Kind: organization
- Email: quentin.groom@plantentuinmeise.be
- Location: Meise, Belgium
- Website: http://www.plantentuinmeise.be/
- Repositories: 11
- Profile: https://github.com/AgentschapPlantentuinMeise
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: linkLitToGbifId
message: >-
If you use this software, please cite it using the
metadata from this file.
type: software
authors:
- given-names: Quentin
family-names: Groom
email: quentin.groom@plantentuinmeise.be
affiliation: Meise Botanic Garden
orcid: 'https://orcid.org/0000-0002-0596-5376'
abstract: >-
A workflow for collecting, processing, and analysing
scientific literature data that cites data mobilised by the
Global Biodiversity Information Facility. The initial
script systematically queries the GBIF literature API to
retrieve and filter relevant, peer-reviewed publications,
storing the results for further analysis. It then
processes each dataset through iterative downloading and
extraction processes, ensuring only pertinent information,
such as preserved specimen records, is retained.
The subsequent script focuses on topic analysis by
extracting thematic topics associated with each
publication and constructing a network graph to visualize
their co-occurrences. This network analysis highlights the
interrelationships between research themes within the
biodiversity literature, offering valuable insights into
research trends and the thematic structure of studies that
utilize GBIF data.
keywords:
- GBIF
- provenance
- specimens
- citation
- use of specimen data
license: MIT
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3