dataverse-metadata-crawler

A Python CLI tool for bulk extracting and exporting metadata from Dataverse repositories' collections to JSON and CSV formats.

https://github.com/scholarsportal/dataverse-metadata-crawler

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Keywords

borealis dataverse python scholars-portal
Last synced: 6 months ago

Repository

A Python CLI tool for bulk extracting and exporting metadata from Dataverse repositories' collections to JSON and CSV formats.

Basic Info
Statistics
  • Stars: 4
  • Watchers: 3
  • Forks: 1
  • Open Issues: 4
  • Releases: 7
Topics
borealis dataverse python scholars-portal
Created about 1 year ago · Last pushed 9 months ago
Metadata Files
Readme · License · Citation

README.md

Project Status: Active – The project has reached a stable, usable state and is being actively developed. · License: MIT · Dataverse · Code Style: Black · Binder

Dataverse Metadata Crawler

*Screen capture of the CLI tool*

📜Description

A Python CLI tool for extracting and exporting metadata from Dataverse repositories. It supports bulk extraction of dataverses, datasets, and data file metadata from any chosen level of dataverse collection (an entire Dataverse repository/sub-Dataverse), with flexible export options to JSON and CSV formats.
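Under the hood, a crawl like this drives the Dataverse native API. The sketch below is only an illustration of those documented endpoints, not this tool's internal code; it uses httpx (one of the project's dependencies), and the base URL, API key, and collection alias are placeholders:

```python
# Illustrative sketch of the Dataverse native API endpoints a crawl like this
# automates -- not this tool's internal code. Base URL, key, and alias are
# placeholders.
import httpx

BASE_URL = "https://demo.borealisdata.ca"     # target repository (placeholder)
API_KEY = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"  # optional for public metadata

headers = {"X-Dataverse-key": API_KEY}

# List the datasets and child dataverses directly under a collection
resp = httpx.get(f"{BASE_URL}/api/dataverses/demo/contents", headers=headers)
resp.raise_for_status()

for item in resp.json()["data"]:
    if item["type"] == "dataset":
        # Fetch the latest version's metadata for each dataset
        version = httpx.get(
            f"{BASE_URL}/api/datasets/{item['id']}/versions/:latest",
            headers=headers,
        )
        print(item.get("persistentUrl"), version.status_code)
```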

✨Features

  1. Bulk metadata extraction from Dataverse repositories at any chosen level of collection (top level or selected collection)
  2. JSON & CSV file export options

☁️ Installation (Cloud - Slower)

Click the Binder badge to launch the crawler directly in your web browser; no Git or Python installation required!

⚙️Installation (Local - Better performance)

📦Prerequisites

  1. Git
  2. Python 3.10+

Setup steps

  1. Clone the repository

     ```sh
     git clone https://github.com/scholarsportal/dataverse-metadata-crawler.git
     ```

  2. Change to the project directory

     ```sh
     cd ./dataverse-metadata-crawler
     ```

  3. Create an environment file (.env)

     ```sh
     touch .env   # For Unix/macOS
     nano .env    # or vim .env, or your preferred editor
     ```

     OR

     ```powershell
     New-Item .env -Type File   # For Windows (PowerShell)
     notepad .env
     ```

  4. Configure the environment (.env) file using the text editor of your choice (a sketch for verifying these values follows the setup steps).

     ```sh
     # .env file
     BASE_URL = "TARGET_REPO_URL"   # Base URL of the repository; e.g., "https://demo.borealisdata.ca/"
     API_KEY = "YOUR_API_KEY"       # Found in your Dataverse account settings. Can also be specified in the CLI interface using the -a flag.
     ```

     Your `.env` file should look like this:

     ```sh
     BASE_URL = "https://demo.borealisdata.ca/"
     API_KEY = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"
     ```

  5. Set up a virtual environment (recommended)

     ```sh
     python3 -m venv .venv
     source .venv/bin/activate   # For Unix/macOS
     ```

     OR

     ```powershell
     .venv\Scripts\activate   # For Windows
     ```

  6. Install dependencies

     ```sh
     pip install -r requirements.txt
     ```
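To double-check step 4 before running the crawler, the values can be loaded the same way the project's python-dotenv dependency does. A minimal sketch, assuming the BASE_URL and API_KEY names shown above:

```python
# Minimal .env sanity check; run from the project root.
# Assumes the BASE_URL and API_KEY names from step 4.
import os

from dotenv import load_dotenv  # provided by the python-dotenv dependency

load_dotenv()  # reads .env from the current working directory

print("BASE_URL:", os.getenv("BASE_URL"))
print("API_KEY set:", bool(os.getenv("API_KEY")))  # avoid echoing the token
```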

🛠️Usage

Basic Command

```sh
python3 dvmeta/main.py [-a AUTH] [-l] [-d] [-p] [-f] [-e] [-s] -c COLLECTION_ALIAS -v VERSION
```

Required arguments:

| Option | Short | Type | Description | Default |
|--------------------|-------|------|-------------|-----------------|
| --collection_alias | -c | TEXT | The alias of the collection to crawl. See the guide here to learn how to look up the collection alias. [required] | None |
| --version | -v | TEXT | The dataset version to crawl. Options include:<br>draft - The draft version, if any<br>latest - Either a draft (if one exists) or the latest published version<br>latest-published - The latest published version<br>x.y - A specific version<br>[required] | None (required) |

Optional arguments:

| Option | Short | Type | Description | Default |
|-------------------|--------|------|-------------|-----------------------|
| --auth | -a | TEXT | Authentication token to access the Dataverse repository. | None |
| --log / --no-log | -l | | Output a log file. Use --no-log to disable logging. | log (unless --no-log) |
| --dvdfds_metadata | -d | | Output a JSON file containing metadata of Dataverses, Datasets, and Data Files. | |
| --permission | -p | | Output a JSON file that stores permission metadata for all Datasets in the repository. | |
| --emptydv | -e | | Output a JSON file that stores all Dataverses which do NOT contain Datasets (though they might have child Dataverses which have Datasets). | |
| --failed | -f | | Output a JSON file of Dataverses/Datasets that failed to be crawled. | |
| --spreadsheet | -s | | Output a CSV file of the metadata of Datasets. See the spreadsheet column explanation notes. | |
| --debug-log | -debug | | Enable debug logging. This creates a debug log file in the log files directory. | |
| --help | | | Show the help message. | |

Examples

```sh
# Export the metadata of the latest version of datasets under collection 'demo' to JSON
python3 dvmeta/main.py -c demo -v latest -d

# Export the metadata of version 1.0 of all datasets under collection 'demo' to JSON and CSV
python3 dvmeta/main.py -c demo -v 1.0 -d -s

# Export the metadata and permission metadata of version 1.0 of all datasets under collection 'demo'
# to JSON and CSV, with the API token specified in the CLI interface
python3 dvmeta/main.py -c demo -v 1.0 -d -s -p -a xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
```

📂Output Structure

| File | Description |
|-------------------------------------------|-------------|
| ds_metadata_yyyymmdd-HHMMSS.json | Dataset representations & data file metadata in JSON format. |
| empty_dv_yyyymmdd-HHMMSS.json | The IDs of empty dataverse(s) in list format. |
| failed_metadata_uris_yyyymmdd-HHMMSS.json | The URIs (URLs) of datasets that failed to be downloaded. |
| permission_dict_yyyymmdd-HHMMSS.json | The permission metadata of datasets with their dataset IDs. |
| pid_dict_yyyymmdd-HHMMSS.json | Datasets' basic info with hierarchical information dictionary. Only exported if the -p (permission) flag is used without the -d (metadata) flag. |
| pid_dict_dd_yyyymmdd-HHMMSS.json | The hierarchical information of deaccessioned/draft datasets. |
| ds_metadata_yyyymmdd-HHMMSS.csv | Datasets and their data files' metadata in CSV format. |
| log_yyyymmdd-HHMMSS.txt | Summary of the crawling work. |

```sh
exported_files/
├── json_files/
│   ├── ds_metadata_yyyymmdd-HHMMSS.json          # With -d flag enabled
│   ├── empty_dv_yyyymmdd-HHMMSS.json             # With -e flag enabled
│   ├── failed_metadata_uris_yyyymmdd-HHMMSS.json # With -f flag enabled
│   ├── permission_dict_yyyymmdd-HHMMSS.json      # With only -p flag enabled
│   ├── pid_dict_yyyymmdd-HHMMSS.json             # With only -p flag enabled
│   └── pid_dict_dd_yyyymmdd-HHMMSS.json          # Hierarchical information of deaccessioned/draft datasets
├── csv_files/
│   └── ds_metadata_yyyymmdd-HHMMSS.csv           # With -s flag enabled
└── logs_files/
    ├── log_yyyymmdd-HHMMSS.txt                   # Exported by default, unless --no-log is specified
    └── debug.log                                 # Exported by using the -debug flag
```
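Because the filenames are timestamped, downstream scripts usually just pick the newest export. A small sketch, assuming the default exported_files/ layout above and using pandas (already in the project's dependency list):

```python
# Load the most recent exports for downstream analysis.
# Assumes the default exported_files/ layout shown above.
import json
from pathlib import Path

import pandas as pd

def latest(pattern: str) -> Path:
    # yyyymmdd-HHMMSS timestamps sort chronologically as plain strings
    return max(Path("exported_files").glob(pattern), key=lambda p: p.name)

df = pd.read_csv(latest("csv_files/ds_metadata_*.csv"))
print(df.shape)

with open(latest("json_files/ds_metadata_*.json"), encoding="utf-8") as fh:
    ds_metadata = json.load(fh)
print(type(ds_metadata))
```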

⚠️Disclaimer

> [!WARNING]
> To retrieve data about unpublished datasets or information that is not publicly available (e.g., collaborators/permissions), you will need the necessary access rights. Please note that any publication or use of non-publicly available data may require review by a Research Ethics Board.

✅Tests

No tests have been written yet. Contributions welcome!

💻Development

  1. Dependency management: uv - Use uv to manage dependencies and reflect changes in the pyproject.toml file.
  2. Linter: ruff - Follow the linting rules outlined in the pyproject.toml file.

🙌Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

📄License

MIT

🆘Support

  • Create an issue in the GitHub repository

📚Citation

If you use this software in your work, please cite it using the following metadata.

APA: Lui, L. H. (2025). Dataverse Metadata Crawler (Version 0.1.6) [Computer software]. https://github.com/scholarsportal/dataverse-metadata-crawler

BibTeX:

```bibtex
@software{Lui_Dataverse_Metadata_Crawler_2025,
  author = {Lui, Lok Hei},
  month = {June},
  title = {Dataverse Metadata Crawler},
  url = {https://github.com/scholarsportal/dataverse-metadata-crawler},
  version = {0.1.6},
  year = {2025}
}
```

✍️Authors

Ken Lui - Data Curation Specialist, Map and Data Library, University of Toronto - kenlh.lui@utoronto.ca

Owner

  • Name: Scholars Portal
  • Login: scholarsportal
  • Kind: organization
  • Email: help@scholarsportal.info
  • Location: Toronto

A service of the Ontario Council of University Libraries

Citation (CITATION.cff)

```yaml
cff-version: 0.1.6
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lui"
  given-names: "Lok Hei"
  orcid: "https://orcid.org/0000-0001-5077-1530"
title: "Dataverse Metadata Crawler"
version: 0.1.6
date-released: 2025-06-09
url: "https://github.com/scholarsportal/dataverse-metadata-crawler"
```

GitHub Events

Total
  • Create event: 25
  • Issues event: 7
  • Release event: 16
  • Watch event: 4
  • Delete event: 15
  • Issue comment event: 1
  • Member event: 1
  • Push event: 86
  • Pull request event: 44
  • Gollum event: 9
  • Fork event: 2
Last Year
  • Identical to the all-time totals above (the repository is about a year old).

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 23
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 1 minute
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.2
  • Average comments per pull request: 0.0
  • Merged pull requests: 22
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Identical to the all-time figures above.
Top Authors
Issue Authors
  • kenlhlui (5)
Pull Request Authors
  • kenlhlui (21)
  • dependabot[bot] (1)
Top Labels
Issue Labels
enhancement (3)
Pull Request Labels
dependencies (1) python (1)

Dependencies

.github/workflows/poetry-export_dependencies.yml actions
  • abatilo/actions-poetry v4 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
poetry.lock pypi
  • anyio 4.8.0
  • appnope 0.1.4
  • asttokens 3.0.0
  • asyncio 3.4.3
  • attrs 24.3.0
  • certifi 2024.12.14
  • cffi 1.17.1
  • click 8.1.8
  • colorama 0.4.6
  • comm 0.2.2
  • debugpy 1.8.12
  • decorator 5.1.1
  • exceptiongroup 1.2.2
  • executing 2.1.0
  • h11 0.14.0
  • httpcore 1.0.7
  • httpx 0.27.2
  • idna 3.10
  • ipykernel 6.29.5
  • ipython 8.31.0
  • jedi 0.19.2
  • jinja2 3.1.5
  • jmespath 1.0.1
  • json2 0.9.0
  • jsonschema 4.23.0
  • jsonschema-specifications 2024.10.1
  • jupyter-client 8.6.3
  • jupyter-core 5.7.2
  • markdown-it-py 3.0.0
  • markupsafe 3.0.2
  • matplotlib-inline 0.1.7
  • mdurl 0.1.2
  • nest-asyncio 1.6.0
  • numpy 2.2.1
  • orjson 3.10.14
  • packaging 24.2
  • pandas 2.2.3
  • parso 0.8.4
  • pexpect 4.9.0
  • platformdirs 4.3.6
  • prompt-toolkit 3.0.48
  • psutil 6.1.1
  • ptyprocess 0.7.0
  • pure-eval 0.2.3
  • pycparser 2.22
  • pydataverse 0.3.4
  • pygments 2.19.1
  • python-dateutil 2.9.0.post0
  • python-dotenv 1.0.1
  • pytz 2024.2
  • pywin32 308
  • pyzmq 26.2.0
  • referencing 0.36.0
  • rich 13.9.4
  • rich-argparse 1.6.0
  • rpds-py 0.22.3
  • shellingham 1.5.4
  • six 1.17.0
  • sniffio 1.3.1
  • stack-data 0.6.3
  • tornado 6.4.2
  • traitlets 5.14.3
  • typer 0.13.1
  • typing-extensions 4.12.2
  • tzdata 2024.2
  • wcwidth 0.2.13
pyproject.toml pypi
  • ipykernel ^6.29.5 develop
  • asyncio ^3.4.3
  • httpx ^0.27.2
  • ipykernel ^6.29.5
  • jinja2 ^3.1.4
  • jmespath ^1.0.1
  • json2 ^0.9.0
  • numpy ^2.2.1
  • orjson ^3.10.14
  • pandas ^2.2.3
  • pydataverse ^0.3.4
  • python ^3.10
  • python-dotenv ^1.0.1
  • rich-argparse ^1.6.0
  • typer ^0.13.1
requirements.txt pypi
  • anyio ==4.8.0
  • appnope ==0.1.4
  • asttokens ==3.0.0
  • asyncio ==3.4.3
  • attrs ==24.3.0
  • certifi ==2024.12.14
  • cffi ==1.17.1
  • click ==8.1.8
  • colorama ==0.4.6
  • comm ==0.2.2
  • debugpy ==1.8.12
  • decorator ==5.1.1
  • exceptiongroup ==1.2.2
  • executing ==2.1.0
  • h11 ==0.14.0
  • httpcore ==1.0.7
  • httpx ==0.27.2
  • idna ==3.10
  • ipykernel ==6.29.5
  • ipython ==8.31.0
  • jedi ==0.19.2
  • jinja2 ==3.1.5
  • jmespath ==1.0.1
  • json2 ==0.9.0
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • jupyter-client ==8.6.3
  • jupyter-core ==5.7.2
  • markdown-it-py ==3.0.0
  • markupsafe ==3.0.2
  • matplotlib-inline ==0.1.7
  • mdurl ==0.1.2
  • nest-asyncio ==1.6.0
  • numpy ==2.2.1
  • orjson ==3.10.14
  • packaging ==24.2
  • pandas ==2.2.3
  • parso ==0.8.4
  • pexpect ==4.9.0
  • platformdirs ==4.3.6
  • prompt-toolkit ==3.0.48
  • psutil ==6.1.1
  • ptyprocess ==0.7.0
  • pure-eval ==0.2.3
  • pycparser ==2.22
  • pydataverse ==0.3.4
  • pygments ==2.19.1
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • pytz ==2024.2
  • pywin32 ==308
  • pyzmq ==26.2.0
  • referencing ==0.36.0
  • rich ==13.9.4
  • rich-argparse ==1.6.0
  • rpds-py ==0.22.3
  • shellingham ==1.5.4
  • six ==1.17.0
  • sniffio ==1.3.1
  • stack-data ==0.6.3
  • tornado ==6.4.2
  • traitlets ==5.14.3
  • typer ==0.13.1
  • typing-extensions ==4.12.2
  • tzdata ==2024.2
  • wcwidth ==0.2.13
.github/workflows/jekyll-gh-pages.yml actions
  • actions/checkout v4 composite
  • actions/configure-pages v5 composite
  • actions/deploy-pages v4 composite
  • actions/jekyll-build-pages v1 composite
  • actions/upload-pages-artifact v3 composite