unknown-data-framework
https://github.com/unknowndataproject/unknown-data-framework
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 3
- Forks: 2
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
UnknownData Framework
The goal of the UnknownData Framework is to scrape the internet to find metadata for scientifically relevant data publications. To do this, the framework performs a targeted crawl, extracts dataset mentions, detects coreferences, and analyses the results.
This framework is a prototype and was developed as part of the UnknownData project.
Requirements
- Docker. The simplest way to install Docker is to install Docker Desktop.
Structure
Each software component has its own folder, which contains its source code and its Dockerfile. The Dockerfile defines how the code of the software component is compiled and executed.
Docker Compose combines the individual components and defines the order in which they are executed and the folders they share.
The following folders are mounted in each docker container and can be used to share data between the containers. [REPO] is the local path of this repository.
| HOST | DOCKER CONTAINER |
| ---------------------------|-----------------------|
| [REPO]/data/crawler/ | /data/crawler/ |
| [REPO]/data/mentions/ | /data/mentions/ |
| [REPO]/data/coreference/ | /data/coreference/ |
| [REPO]/data/export/ | /data/export/ |
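The host side of these mounts can be prepared before the first run. The following is a minimal sketch, assuming the four folders from the table are not already present in your checkout; `init_shared_dirs` is a hypothetical helper and not part of the framework. Docker will often create missing host folders itself, but they may then be owned by root on Linux, so creating them up front can be convenient.

```bash
# Hypothetical helper (not part of the framework): create the shared
# host folders listed in the table above under a repository checkout.
init_shared_dirs() {
  local repo="${1:-.}"   # [REPO]; defaults to the current directory
  local name
  for name in crawler mentions coreference export; do
    # mkdir -p is a no-op if the folder already exists
    mkdir -p "$repo/data/$name" || return 1
  done
}
```

From the project root, calling `init_shared_dirs .` would create `data/crawler/`, `data/mentions/`, `data/coreference/`, and `data/export/`.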
How to Run
The following sections explain how to run the software components using Docker Compose. Each command uses the --build flag so that the Docker images are rebuilt and the containers use the latest available source code.
Whole Pipeline
If you want to run the whole pipeline, use the following command from within the project root folder:
```bash
docker compose up --build
```
This command runs all components in the order crawler -> mentions-web -> coreference -> export. Each component starts only after the previous one has finished successfully.
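The same ordering can be approximated by hand, which may help when you want to see exactly which stage fails. The sketch below is an illustration and not how the framework itself is wired: `run_pipeline_stepwise` and the `COMPOSE` override are hypothetical names, and it assumes the Compose service names match the component names listed above. It relies on `docker compose run` returning the exit status of the container it ran.

```bash
# Hypothetical helper: run the pipeline stages one at a time, in the
# order given above, and stop at the first stage that fails.
# Set COMPOSE=echo to preview the commands without starting containers.
run_pipeline_stepwise() {
  local compose="${COMPOSE:-docker compose}"
  local component
  # Rebuild all images first so each stage uses the latest source code.
  $compose build || return 1
  for component in crawler mentions-web coreference export; do
    echo ">>> running $component"
    # --no-deps: do not start upstream services; --rm: remove the
    # container once the stage has finished.
    $compose run --rm --no-deps "$component" || {
      echo "stage $component failed" >&2
      return 1
    }
  done
}
```

Running `COMPOSE=echo run_pipeline_stepwise` prints the commands instead of executing them, which is a cheap way to check the stage order.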
Only One Component
If you only want to run one component of the pipeline, use the following command from within the project root folder and replace [COMPONENT] with the desired component name (crawler, mentions-web, mentions-pdf, coreference, export). The --no-deps flag ensures that the upstream containers the chosen component depends on are not started first; only the chosen component is run.
```bash
docker compose up --build --no-deps [COMPONENT]
```
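If you run single components often, a small wrapper can catch mistyped names before Docker is invoked. This is a sketch under the assumption that the five component names listed above are the complete set; `run_component` and the `COMPOSE` override are hypothetical and not part of the framework.

```bash
# Hypothetical wrapper: run exactly one component, rejecting names that
# are not in the list of components given above.
run_component() {
  local compose="${COMPOSE:-docker compose}"
  local component="${1:-}"
  case "$component" in
    crawler|mentions-web|mentions-pdf|coreference|export)
      # Same command as above, with [COMPONENT] filled in.
      $compose up --build --no-deps "$component"
      ;;
    *)
      echo "unknown component: '$component'" >&2
      return 2
      ;;
  esac
}
```

For example, `run_component export` runs only the export step, while `run_component exprot` fails immediately with an error message.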
Development
For each software component there is one folder and one Dockerfile that defines a Docker image.
Feel free to adapt the folder structure and the Dockerfile of your software component as needed.
Please adapt the README.md file of your software component to provide light documentation of your component.
Owner
- Name: unknowndataproject
- Login: unknowndataproject
- Kind: organization
- Repositories: 2
- Profile: https://github.com/unknowndataproject
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: unknown data framework
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Benedikt Maria
    family-names: Beckermann
    affiliation: Schloss Dagstuhl - Leibniz Center for Informatics
    orcid: 'https://orcid.org/0009-0008-3920-6109'
    email: benedikt.beckermann@dagstuhl.de
  - given-names: Lu
    family-names: Gan
    affiliation: GESIS - Leibniz Institute for the Social Sciences
    email: Lu.Gan@gesis.org
  - given-names: Sebastian
    family-names: Tiesler
    affiliation: Humboldt-Universität zu Berlin
    email: sebastian.tiesler@hu-berlin.de
  - given-names: Yousef
    family-names: Younes
    affiliation: GESIS - Leibniz-Institut for the Social Sciences
    orcid: 'https://orcid.org/0000-0003-1271-3633'
    email: yousef.younes@gesis.org
repository-code: >-
  https://github.com/unknowndataproject/unknown-data-framework
license: CC0-1.0
commit: 6792afc359f806c6153b9cdeabf936a9ef47ea8b
date-released: '2024-09-12'
GitHub Events
Total
- Release event: 1
- Push event: 6
- Pull request event: 1
- Fork event: 1
- Create event: 1
Last Year
- Release event: 1
- Push event: 6
- Pull request event: 1
- Fork event: 1
- Create event: 1
Dependencies
- ubuntu latest build
- ubuntu latest build
- ubuntu latest build
- ubuntu latest build
- ubuntu 22.04 build
- ubuntu 22.04 build
- pandas ==1.5.0
- spacy ==3.7.2
- lxml ==5.1.0
- requests *
- PyYAML ==6.0
- beautifulsoup4 ==4.11.2
- fastwarc *
- numpy ==1.22.2
- resiliparse *
- torch ==1.11.0
- tqdm ==4.62.3
- transformers ==4.17.0
- warcio ==1.7.4