datadoc-analyzer
A tool for analyzing the documentation of scientific datasets
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to wiley.com, nature.com, acm.org, zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: Organization som-research has institutional domain (som-research.uoc.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: SOM-Research
- License: cc-by-sa-4.0
- Language: Python
- Default Branch: main
- Size: 42.8 MB
Statistics
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DataDoc Analyzer
DataDoc Analyzer extracts, in a structured manner, the information that the ML community's general guidelines on dataset documentation ask for, taking it from a dataset's scientific documentation. Use it to study and analyze scientific data published in peer-reviewed journals such as Nature's Scientific Data and Data-in-Brief.
:vhs: Take a look at our short video presenting the tool! :vhs: Here is also an example of a study that uses DataDocAnalyzer to extract the data from data papers.
Here is a complete list of data journals suitable to be analyzed with this tool. You can test the web UI of the tool in the following HuggingFace Space, and the API using our Docker image.
Installation
The tool comes with two interfaces: a web app built with Gradio, intended for testing the tool's capabilities and analyzing a single document (you can try it in the HuggingFace Space), and an API built with FastAPI, suited to integration into any ML pipeline.
To use this tool, you need python3.10, git, and pip installed on your system. Then:
```
git clone https://github.com/SOM-Research/DataDoc-Analyzer.git datadoc
# Enter the created folder
cd datadoc
# Install dependencies (preferably in a virtual environment)
pip install -r requirements.txt
```
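Before installing, it can help to confirm that the prerequisites named above are actually available. The following is a minimal sketch, not part of the repository: the function name `missing_prerequisites` is hypothetical, and treating Python 3.10 as a minimum version (rather than an exact requirement) is an assumption.

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 10), tools=("git", "pip")):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted} or newer expected, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems
```

Calling `missing_prerequisites()` and printing the returned list gives a quick overview of what still needs to be installed.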
Run the web UI:
python3 app.py
Run the API:
uvicorn api:app
Run the API using the docker image:
First, install Docker on your system. Then:
docker pull joangi/datadoc_analyzer
docker run --name apidataset -p 80:80 joangi/datadoc_analyzer
docker exec apidataset apt -y install default-jre
The API will then be running on localhost at port 80. (To use a different port, change the host side of the -p mapping in the docker run command above.)
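To confirm that the container is answering, a small reachability check can be useful. The sketch below is an illustration only: the helper names `docs_url` and `api_is_up` are hypothetical, and it assumes the Docker image exposes the Swagger page at the same `/docs` path described in the Usage section.

```python
import urllib.request


def docs_url(host="127.0.0.1", port=80):
    """Build the URL of the Swagger page assumed to be served by the API."""
    return f"http://{host}:{port}/docs"


def api_is_up(host="127.0.0.1", port=80, timeout=3.0):
    """Return True if the documentation page answers with HTTP 200."""
    try:
        with urllib.request.urlopen(docs_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        return False
```

If the container was started with a different mapping, for example `-p 8080:80`, pass `port=8080`.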
Usage
Web UI
To use this tool, you need to provide your own API key from OpenAI.
Once it is set, you can upload a PDF from one of the scientific journals suited to this tool[^1]. Keep in mind that the tool analyzes data papers. Other journal publications, such as meta-analyses or full papers, may not work adequately.
Finally, click on get insights in any tab to obtain the results together with the completeness report.
[^1]: Some journals that publish data papers: Nature's Scientific Data, Data-in-Brief, Geoscience Data Journal, among others. Here is a complete list of data journals suitable to be analyzed with this tool.
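Since the web UI expects a PDF, a quick local check before uploading can catch files that were saved with the wrong extension. This is a minimal sketch under the assumption that checking the leading `%PDF-` signature is sufficient; the helper name `looks_like_pdf` is hypothetical, and the check cannot tell whether the document is actually a data paper.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with Path(path).open("rb") as handle:
        return handle.read(5) == b"%PDF-"
```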

API
The API mirrors the behavior of the tabs of the web UI and, in addition, provides an endpoint that retrieves all the dimensions at once. The API's Swagger documentation, which can be tested in place, is published alongside the API. When launched with uvicorn, the server starts on port 8000 by default (provided the port is not occupied by another app on your system), and the documentation is available at http://127.0.0.1:8000/docs
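The endpoint names are not listed here, so rather than guessing them, one option is to read them from the machine-readable schema that FastAPI serves by default at `/openapi.json`. The sketch below assumes that default has not been changed in this project; the helper names `list_endpoints` and `fetch_schema` are hypothetical.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    pairs = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary"
            if method.lower() in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs)


def fetch_schema(base_url="http://127.0.0.1:8000"):
    """Download the schema from a running server (assumed default path)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `list_endpoints(fetch_schema())` returns the available routes, which can then be called from a pipeline with any HTTP client.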
Background research
The tool was presented at the 32nd ACM International Conference on Information and Knowledge Management (CIKM) in October 2023 (tool's publication). You can also check this short video presenting the tool.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
The CC BY-SA license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
Owner
- Name: SOM Research Lab
- Login: SOM-Research
- Kind: organization
- Email: rclariso@uoc.edu
- Location: Barcelona
- Website: http://som-research.uoc.edu
- Repositories: 54
- Profile: https://github.com/SOM-Research
Dependencies
- python 3.10.11-bullseye build
- Jinja2 ==3.1.2
- MarkupSafe ==2.1.2
- Pillow ==9.5.0
- PyYAML ==6.0
- Pygments ==2.15.1
- SQLAlchemy ==2.0.15
- Wand ==0.6.11
- aiofiles ==23.1.0
- aiohttp ==3.8.4
- aiosignal ==1.3.1
- altair ==5.0.1
- anyio ==3.7.0
- async-timeout ==4.0.2
- asyncio ==3.4.3
- attrs ==23.1.0
- beautifulsoup4 ==4.12.2
- blis ==0.7.9
- catalogue ==2.0.8
- certifi ==2023.5.7
- cffi ==1.15.1
- charset-normalizer ==3.1.0
- click ==8.1.3
- confection ==0.0.4
- contourpy ==1.0.7
- cryptography ==41.0.0
- cycler ==0.11.0
- cymem ==2.0.7
- dataclasses-json ==0.5.7
- distro ==1.8.0
- exceptiongroup ==1.1.1
- faiss-cpu ==1.7.4
- fastapi ==0.95.2
- ffmpy ==0.3.0
- filelock ==3.12.0
- fonttools ==4.39.4
- frozenlist ==1.3.3
- fsspec ==2023.5.0
- gradio ==3.32.0
- gradio_client ==0.2.5
- h11 ==0.14.0
- httpcore ==0.17.2
- httptools ==0.5.0
- httpx ==0.24.1
- huggingface-hub ==0.14.1
- idna ==3.4
- jsonschema ==4.17.3
- kiwisolver ==1.4.4
- langchain ==0.0.186
- langcodes ==3.3.0
- linkify-it-py ==2.0.2
- lxml ==4.9.2
- markdown-it-py ==2.2.0
- marshmallow ==3.19.0
- marshmallow-enum ==1.5.1
- matplotlib ==3.7.1
- mdit-py-plugins ==0.3.3
- mdurl ==0.1.2
- mmda ==0.4.8
- mpmath ==1.3.0
- multidict ==6.0.4
- murmurhash ==1.0.9
- mypy-extensions ==1.0.0
- ncls ==0.0.66
- necessary ==0.4.2
- networkx ==3.1
- numexpr ==2.8.4
- numpy ==1.24.3
- openai ==0.27.7
- openapi-schema-pydantic ==1.2.4
- orjson ==3.8.14
- packaging ==23.1
- pandas ==1.5.3
- pathy ==0.10.1
- pdf2image ==1.16.3
- pdfminer.six ==20220524
- pdfplumber ==0.7.4
- preshed ==3.0.8
- pycparser ==2.21
- pydantic ==1.10.8
- pydub ==0.25.1
- pyparsing ==3.0.9
- pyphen ==0.14.0
- pyrsistent ==0.19.3
- python-dateutil ==2.8.2
- python-dotenv ==1.0.0
- python-multipart ==0.0.6
- pytz ==2023.3
- regex ==2023.5.5
- requests ==2.31.0
- requirements-parser ==0.5.0
- semantic-version ==2.10.0
- six ==1.16.0
- smart-open ==6.3.0
- sniffio ==1.3.0
- soupsieve ==2.4.1
- spacy ==3.5.3
- spacy-legacy ==3.0.12
- spacy-loggers ==1.0.4
- srsly ==2.4.6
- starlette ==0.27.0
- sympy ==1.12
- tabula-py ==2.7.0
- tenacity ==8.2.2
- textstat ==0.7.3
- thinc ==8.1.10
- tiktoken ==0.4.0
- tokenizers ==0.13.3
- toolz ==0.12.0
- torch ==2.0.1
- tqdm ==4.65.0
- transformers ==4.29.2
- typer ==0.7.0
- types-setuptools ==67.8.0.0
- typing-inspect ==0.9.0
- typing_extensions ==4.6.2
- uc-micro-py ==1.0.2
- urllib3 ==2.0.2
- uvicorn ==0.22.0
- uvloop ==0.17.0
- wasabi ==1.1.1
- watchfiles ==0.19.0
- websockets ==11.0.3
- yarl ==1.9.2
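Because every dependency above is pinned to an exact version, it can be useful to check whether an existing environment matches the pins. The following is a rough sketch using the standard library's `importlib.metadata`; the helper names `parse_pin` and `mismatched_pins` are hypothetical, and the parser only handles the simple `name ==version` form shown in this list.

```python
from importlib import metadata


def parse_pin(line):
    """Split a 'name ==version' pin into (name, version)."""
    cleaned = line.strip().lstrip("-").strip()  # tolerate a leading list dash
    name, _, version = cleaned.partition("==")
    return name.strip(), version.strip()


def mismatched_pins(lines):
    """Return (name, pinned, installed) for each pin the environment does not satisfy."""
    mismatches = []
    for line in lines:
        name, pinned = parse_pin(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches
```

For example, `mismatched_pins(["fastapi ==0.95.2", "gradio ==3.32.0"])` returns an empty list only when both packages are installed at exactly those versions.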
