datadoc-analyzer

A tool for analyzing the documentation of scientific datasets

https://github.com/som-research/datadoc-analyzer

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: wiley.com, nature.com, acm.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization som-research has institutional domain (som-research.uoc.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: SOM-Research
  • License: cc-by-sa-4.0
  • Language: Python
  • Default Branch: main
  • Size: 42.8 MB
Statistics
  • Stars: 4
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License Code of conduct Citation Governance

README.md

DataDoc Analyzer

DataDoc Analyzer extracts, in a structured manner, the information that the ML community's general guidelines on dataset documentation ask for, working from a dataset's scientific documentation. Use it to study and analyze scientific data published in peer-reviewed journals such as Nature's Scientific Data and Data-in-Brief.

:vhs: Take a look at our short video presenting the tool! :vhs: There is also an example of a study that uses DataDocAnalyzer to extract data from data papers.

A complete list of data journals suitable for analysis with this tool is also available. You can test the web UI in the HuggingFace Space, and the API using our Docker image.

Installation

The tool comes with two interfaces: a web app built with Gradio, intended for testing the tool's capabilities and analyzing a single document (you can try it in the HuggingFace Space), and an API built with FastAPI, suited for integration into any ML pipeline.

To use this tool, you need Python 3.10, git, and pip installed on your system. Then run:

```
git clone https://github.com/SOM-Research/DataDoc-Analyzer.git datadoc

# Enter the created folder
cd datadoc

# Install dependencies (preferably in a virtual environment)
pip install -r requirements.txt
```
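Before installing, it can help to confirm that the prerequisites named above are present. The following is a minimal sketch, not part of the project: the helper names `python_matches` and `missing_tools` are hypothetical, and the check assumes that "python3.10" means the 3.10 minor series.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 10)        # version named in the README
REQUIRED_TOOLS = ("git", "pip")  # command-line tools named in the README


def python_matches(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True when the interpreter's major.minor equals `required`."""
    return tuple(version_info[:2]) == tuple(required)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `python_matches()` and `missing_tools()` with no arguments inspects the current interpreter and PATH; an empty list from `missing_tools()` means both tools were found.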

Run the web UI:

python3 app.py

Run the API:

uvicorn api:app

Run the API using the docker image:

First, install Docker on your system. Then:

docker pull joangi/datadoc_analyzer

docker run --name apidataset -p 80:80 joangi/datadoc_analyzer

docker exec apidataset apt -y install default-jre

The API will be running on localhost at port 80 (you can change the port in the command above).
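To confirm that something is actually listening on the published port, a plain TCP connection attempt is enough. This is a minimal sketch using only the standard library; the function name `is_listening` is hypothetical, and the default port 80 matches the `-p 80:80` mapping above.

```python
import socket


def is_listening(host="127.0.0.1", port=80, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If you mapped the container to a different host port, pass it explicitly, for example `is_listening(port=8080)`. A `True` result only shows that the port accepts connections, not that the API itself is healthy.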

Usage

Web UI

To use this tool, you need to provide your own API key from OpenAI.

Once set, you can upload a PDF from one of the scientific journals suited for this tool[^1]. Keep in mind that the tool analyzes data papers; other journal publications, such as meta-analyses or full papers, may not work adequately.

Finally, click on get insights in any tab, and you will get the results together with the completeness report.

[^1]: Some journals that publish data papers: Nature's Scientific Data, Data-in-Brief, Geoscience Data Journal, etc. A complete list of data journals suitable for analysis with this tool is also available.

### API

The API mirrors the behavior of the web UI tabs and, in addition, provides an endpoint that retrieves all the dimensions at once. The API's Swagger documentation, which can be tested in place, is published alongside the API. When started with `uvicorn api:app`, the server listens on port 8000 by default (if that port is not in use by another application on your system), and the documentation is available at http://127.0.0.1:8000/docs

![Api showcase](./assets/apigif.gif)
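Since the endpoint names are not listed here, one way to discover them programmatically is to read the OpenAPI schema behind the Swagger page. The sketch below assumes the project keeps FastAPI's default schema location, `/openapi.json`; the function names `list_endpoints` and `fetch_schema` are hypothetical and not part of the tool.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(base_url="http://127.0.0.1:8000"):
    """Download the schema from /openapi.json (FastAPI's default, assumed here)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)
```

With the server running, `list_endpoints(fetch_schema())` returns the available routes; pass `base_url="http://127.0.0.1:80"` when using the Docker image as described earlier.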

Background research

The tool was presented at the 32nd ACM International Conference on Information and Knowledge Management in October 2023 (see the tool's publication). You can also check the short video presenting the tool.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

The CC BY-SA license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.

Owner

  • Name: SOM Research Lab
  • Login: SOM-Research
  • Kind: organization
  • Email: rclariso@uoc.edu
  • Location: Barcelona

Dependencies

Dockerfile docker
  • python 3.10.11-bullseye build
requirements.txt pypi
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.2
  • Pillow ==9.5.0
  • PyYAML ==6.0
  • Pygments ==2.15.1
  • SQLAlchemy ==2.0.15
  • Wand ==0.6.11
  • aiofiles ==23.1.0
  • aiohttp ==3.8.4
  • aiosignal ==1.3.1
  • altair ==5.0.1
  • anyio ==3.7.0
  • async-timeout ==4.0.2
  • asyncio ==3.4.3
  • attrs ==23.1.0
  • beautifulsoup4 ==4.12.2
  • blis ==0.7.9
  • catalogue ==2.0.8
  • certifi ==2023.5.7
  • cffi ==1.15.1
  • charset-normalizer ==3.1.0
  • click ==8.1.3
  • confection ==0.0.4
  • contourpy ==1.0.7
  • cryptography ==41.0.0
  • cycler ==0.11.0
  • cymem ==2.0.7
  • dataclasses-json ==0.5.7
  • distro ==1.8.0
  • exceptiongroup ==1.1.1
  • faiss-cpu ==1.7.4
  • fastapi ==0.95.2
  • ffmpy ==0.3.0
  • filelock ==3.12.0
  • fonttools ==4.39.4
  • frozenlist ==1.3.3
  • fsspec ==2023.5.0
  • gradio ==3.32.0
  • gradio_client ==0.2.5
  • h11 ==0.14.0
  • httpcore ==0.17.2
  • httptools ==0.5.0
  • httpx ==0.24.1
  • huggingface-hub ==0.14.1
  • idna ==3.4
  • jsonschema ==4.17.3
  • kiwisolver ==1.4.4
  • langchain ==0.0.186
  • langcodes ==3.3.0
  • linkify-it-py ==2.0.2
  • lxml ==4.9.2
  • markdown-it-py ==2.2.0
  • marshmallow ==3.19.0
  • marshmallow-enum ==1.5.1
  • matplotlib ==3.7.1
  • mdit-py-plugins ==0.3.3
  • mdurl ==0.1.2
  • mmda ==0.4.8
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • murmurhash ==1.0.9
  • mypy-extensions ==1.0.0
  • ncls ==0.0.66
  • necessary ==0.4.2
  • networkx ==3.1
  • numexpr ==2.8.4
  • numpy ==1.24.3
  • openai ==0.27.7
  • openapi-schema-pydantic ==1.2.4
  • orjson ==3.8.14
  • packaging ==23.1
  • pandas ==1.5.3
  • pathy ==0.10.1
  • pdf2image ==1.16.3
  • pdfminer.six ==20220524
  • pdfplumber ==0.7.4
  • preshed ==3.0.8
  • pycparser ==2.21
  • pydantic ==1.10.8
  • pydub ==0.25.1
  • pyparsing ==3.0.9
  • pyphen ==0.14.0
  • pyrsistent ==0.19.3
  • python-dateutil ==2.8.2
  • python-dotenv ==1.0.0
  • python-multipart ==0.0.6
  • pytz ==2023.3
  • regex ==2023.5.5
  • requests ==2.31.0
  • requirements-parser ==0.5.0
  • semantic-version ==2.10.0
  • six ==1.16.0
  • smart-open ==6.3.0
  • sniffio ==1.3.0
  • soupsieve ==2.4.1
  • spacy ==3.5.3
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.4
  • srsly ==2.4.6
  • starlette ==0.27.0
  • sympy ==1.12
  • tabula-py ==2.7.0
  • tenacity ==8.2.2
  • textstat ==0.7.3
  • thinc ==8.1.10
  • tiktoken ==0.4.0
  • tokenizers ==0.13.3
  • toolz ==0.12.0
  • torch ==2.0.1
  • tqdm ==4.65.0
  • transformers ==4.29.2
  • typer ==0.7.0
  • types-setuptools ==67.8.0.0
  • typing-inspect ==0.9.0
  • typing_extensions ==4.6.2
  • uc-micro-py ==1.0.2
  • urllib3 ==2.0.2
  • uvicorn ==0.22.0
  • uvloop ==0.17.0
  • wasabi ==1.1.1
  • watchfiles ==0.19.0
  • websockets ==11.0.3
  • yarl ==1.9.2