https://github.com/alan-turing-institute/csv_wrangling

Repository for reproducibility of the CSV file project


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: scholar.google, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

csv csv-files csv-parsing reproducibility reproducible-paper reproducible-research reproducible-science
Last synced: 5 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: alan-turing-institute
  • License: MIT
  • Language: TeX
  • Default Branch: master
  • Size: 6.2 MB
Statistics
  • Stars: 28
  • Watchers: 5
  • Forks: 6
  • Open Issues: 0
  • Releases: 0
Topics
csv csv-files csv-parsing reproducibility reproducible-paper reproducible-research reproducible-science
Created over 7 years ago · Last pushed about 4 years ago
Metadata Files
Readme License

README.md

CSV Wrangling


This is the repository for reproducing the experiments in the paper:

Wrangling Messy CSV files by Detecting Row and Type Patterns (PDF)

by G.J.J. van den Burg, A. Nazabal and C. Sutton.

For an implementation of the method developed in the paper, see the CleverCSV repository.
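The core task the paper addresses is detecting a messy CSV file's dialect (delimiter, quote character, and so on). As a point of reference, the Python standard library ships a heuristic detector, `csv.Sniffer`, which CleverCSV is designed to outperform on real-world files. A minimal sketch of the baseline approach, using a made-up semicolon-delimited sample:

```python
import csv

# A small sample with a non-default dialect: semicolon-delimited, quoted fields.
sample = 'name;"value";comment\n"alpha";1;"first, row"\n"beta";2;"second; row"\n'

# csv.Sniffer is the standard-library heuristic that the paper's method
# (implemented in CleverCSV) aims to improve upon for messy files.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)   # detected delimiter
print(dialect.quotechar)   # detected quote character

# Parse the sample with the detected dialect.
rows = list(csv.reader(sample.splitlines(), dialect))
print(rows[1])
```

On clean inputs like this one the stdlib sniffer succeeds; the paper's experiments measure how often such heuristics break down on messy files found in the wild.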

If you use this paper or this code in your own work, please cite the paper, for instance using the following BibTeX entry:

```bibtex
@article{van2019wrangling,
  title   = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
  author  = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
  journal = {Data Mining and Knowledge Discovery},
  year    = {2019},
  volume  = {33},
  number  = {6},
  pages   = {1799--1820},
  issn    = {1573-756X},
  doi     = {10.1007/s10618-019-00646-y},
}
```

Introduction

Our experiments are made reproducible through the use of GNU Make. You can either set up your local environment with the necessary dependencies as described under Requirements, or use the Dockerfile included in the repository.

There are two ways to reproduce our results. The first only reproduces the figures, tables, and constants in the paper from the raw detection results, while the second runs the detection methods as well.

  1. You can reproduce the figures, tables, and constants from the raw experimental results included in this repository. This will not re-run all the experiments but will regenerate the output used in the paper. The command for this is:

```bash
$ make output
```

  2. You can fully reproduce our experiments by downloading the data and rerunning the detection methods on all the files. This might take a while depending on the speed of your machine and the number of cores available. Total wall-clock computation time for a single core is estimated at 11 days. The following commands will do all of this.

```bash
$ make clean    # remove existing output files, except human-annotated ones
$ make data     # download the data
$ make results  # run all the detectors and generate the result files
```

If you'd like to use multiple cores, you can replace the last command with:

```bash
$ make -j X results
```

where X is the desired number of cores.

Data

There are two datasets that are used in the experiments. Because we don't own the rights to all these files, we can't package these files and make them available in a single download. We can however provide URLs to the files and add a download script, which is what we do here. The data can be downloaded with:

```bash
$ make data
```

If you wish to change the download location of the data, please edit the DATA_DIR variable in the Makefile.

Note: We are aware that some of the files may change or become unavailable in the future. This is an unfortunate side-effect of using publicly available data in this way. The data downloader skips files that are unavailable or that have changed. Note that this may affect the exact reproducibility of the results.
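The exact skip logic lives in the repository's download script; the sketch below only illustrates the general idea of detecting changed files by comparing a checksum of the downloaded bytes against a recorded value. The file name, dictionary, and function here are hypothetical, not taken from the repo:

```python
import hashlib

# Hypothetical checksums recorded when the corpus was assembled; the repo's
# actual downloader keeps its own metadata.
EXPECTED_MD5 = {
    "file_001.csv": "9e107d9d372bb6826bd81d3542a419d6",
}

def is_unchanged(name: str, content: bytes) -> bool:
    """Return True only if the downloaded bytes match the recorded checksum."""
    expected = EXPECTED_MD5.get(name)
    if expected is None:
        return False  # no recorded checksum: treat as changed and skip
    return hashlib.md5(content).hexdigest() == expected

# The MD5 above is the well-known digest of this sentence.
print(is_unchanged("file_001.csv", b"The quick brown fox jumps over the lazy dog"))
print(is_unchanged("file_001.csv", b"modified content"))
```

Files failing such a check would be excluded from the run, which is why results may differ slightly from those in the paper.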

The above downloads the "test" set that was used for the evaluation in the paper. For the "working set" that was used to develop our algorithm, run make dev-data.

If the above datasets are insufficient, the complete original data sets are available on request for research purposes. Contact gertjanvandenburg at gmail dot com.

Requirements

Below are the requirements for reproducing the experiments if you're not using Docker. Note that at the moment only Linux-based systems are supported. macOS will probably work, but has not been tested.

  • Python 3.x with the packages in the requirements.txt file. These can be installed with: `pip install --user -r requirements.txt`.

  • R with the external packages installed through: `install.packages(c('devtools', 'rjson', 'data.tree', 'RecordLinkage', 'readr', 'tibble'))`.

  • A working LaTeX installation (at least texlive-latex-extra and texlive-pictures) is needed for creating the figures, as well as a working latexmk installation.

Instructions

To clone this repository and all its submodules do:

```bash
$ git clone --recurse-submodules https://github.com/alan-turing-institute/CSV_Wrangling
```

Then install the requirements as listed above and run the make command of your choice.

License

With the exception of the submodule in scripts/detection/lib/hypoparsr, this code is licensed under the MIT license. See the LICENSE file for more details.

Owner

  • Name: The Alan Turing Institute
  • Login: alan-turing-institute
  • Kind: organization
  • Email: info@turing.ac.uk

The UK's national institute for data science and artificial intelligence.

GitHub Events

Total
  • Issues event: 1
Last Year
  • Issues event: 1

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 11 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mhauru (1)
Pull Request Authors
  • GjjvdBurg (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • chardet *
  • dominate *
  • libtmux *
  • matplotlib *
  • numpy *
  • pandas *
  • regex *
  • requests *
  • scipy *
  • sklearn *
  • tabulate *
  • tqdm *