Visions

Visions: An Open-Source Library for Semantic Data - Published in JOSS (2020)

https://github.com/dylan-profiler/visions

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system

Keywords from Contributors

energy-systems meshes cryptocurrencies blackhole gravitational-lenses bioinformatics simulations bayesian-statistics graph-generation hydrology

Scientific Fields

Sociology Social Sciences - 87% confidence
Last synced: 4 months ago

Repository

Type System for Data Analysis in Python

Basic Info
Statistics
  • Stars: 213
  • Watchers: 6
  • Forks: 19
  • Open Issues: 18
  • Releases: 17
Topics
data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system
Created about 6 years ago · Last pushed 11 months ago
Metadata Files
Readme License

README.md


And these visions of data types, they kept us up past the dawn.

The Semantic Data Library

Visions provides a set of tools for defining and using semantic data types.

  • [x] Semantic type detection & inference on sequence data.

  • [x] Automated data processing

  • [x] Completely customizable. Visions makes it easy to build and modify semantic data types for domain specific purposes

  • [x] Out of the box support for multiple backend implementations including pandas, spark, numpy, and python

  • [x] A robust set of default types and typesets covering the most common use cases.

Check out the complete documentation here.

Installation

Source code is available on GitHub, and binary installers are available via pip.

```
# Pip
pip install visions
```

Complete installation instructions (including extras) are available in the docs.

Quick Start Guide

If you want to start playing immediately, check out the examples folder. Otherwise, let's get some data:

```python
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df.head(2)
```

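| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |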

The most important abstraction in visions is the Type, which represents a semantic notion about data. You have access to a range of well-tested types like Integer, Float, and Files covering the most common software development use cases. Types can be bundled together into typesets. Behind the scenes, visions builds a traversable graph for any collection of types.

```python
from visions import types, typesets

# CompleteSet is one of the builtin typesets (StandardSet is the basic one)
typeset = typesets.CompleteSet()
typeset.plot_graph()
```

Note: Plots require pygraphviz to be installed.

Because of the special relationship between types, these graphs can be used to detect the type of your data or to infer a more appropriate one.

```python
# Detection looks like this
typeset.detect_type(df)

# While inference looks like this
typeset.infer_type(df)

# Inference works well even if we monkey with the data, say by converting everything to strings
typeset.infer_type(df.astype(str))
# {
#     'PassengerId': Integer,
#     'Survived': Integer,
#     'Pclass': Integer,
#     'Name': String,
#     'Sex': String,
#     'Age': Float,
#     'SibSp': Integer,
#     'Parch': Integer,
#     'Ticket': String,
#     'Fare': Float,
#     'Cabin': String,
#     'Embarked': String
# }
```
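To make the difference between detection and inference concrete, here is a minimal, standard-library-only sketch of the idea. It is not how visions is implemented: the `CONTAINS` and `EDGES` tables and both functions are hypothetical stand-ins for a typeset's graph, limited to three toy types. Detection asks which type the values already belong to; inference then follows graph edges for as long as a lossless conversion is possible.

```python
from typing import Any, Callable, Dict, List, Tuple

Values = List[Any]


def _parses_as_float(values: Values) -> bool:
    """True when every (string) value can be parsed as a float."""
    try:
        [float(v) for v in values]
        return True
    except ValueError:
        return False


# Hypothetical membership tests: do all values already belong to the type?
CONTAINS: Dict[str, Callable[[Values], bool]] = {
    "Integer": lambda vs: all(isinstance(v, int) and not isinstance(v, bool) for v in vs),
    "Float": lambda vs: all(isinstance(v, float) for v in vs),
    "String": lambda vs: all(isinstance(v, str) for v in vs),
}

# Hypothetical graph edges: source -> [(target, can_convert, convert)]
EDGES: Dict[str, List[Tuple[str, Callable[[Values], bool], Callable[[Values], Values]]]] = {
    "String": [("Float", _parses_as_float, lambda vs: [float(v) for v in vs])],
    "Float": [
        ("Integer", lambda vs: all(v.is_integer() for v in vs), lambda vs: [int(v) for v in vs])
    ],
}


def detect_type(values: Values) -> str:
    """Detection: report the type the values currently have."""
    for name, contains in CONTAINS.items():
        if contains(values):
            return name
    return "Generic"


def infer_type(values: Values) -> str:
    """Inference: start at the detected type and walk edges while conversion is possible."""
    current = detect_type(values)
    moved = True
    while moved:
        moved = False
        for target, can_convert, convert in EDGES.get(current, []):
            if can_convert(values):
                values, current, moved = convert(values), target, True
                break
    return current


print(detect_type(["1", "2", "3"]))  # the values are strings as stored
print(infer_type(["1", "2", "3"]))   # but they can be walked down to Integer
```

In this toy graph, a column of numeric strings is detected as String but inferred as Integer, which mirrors the `df.astype(str)` example above.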

Visions solves many of the most common problems of working with tabular data. For example, sequences of integers are still recognized as integers whether they have trailing decimal 0's from being cast to float, missing values, or something else altogether. Much of this cleaning is performed automatically, providing nicely cleaned and processed data as well.

    cleaned_df = typeset.cast_to_inferred(df)
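As a rough illustration of the kind of cleaning described above, the sketch below recovers integers from floats that carry trailing decimal zeros while keeping missing values missing. The helper `cast_to_integers` is hypothetical and written against plain Python lists; it is not part of the visions API and makes no claim about how `cast_to_inferred` works internally.

```python
import math
from typing import Any, List, Optional, Sequence


def cast_to_integers(values: Sequence[Any]) -> Optional[List[Optional[int]]]:
    """Return integers (None for missing values) if every present value is
    integral; return None when the sequence is not integer-like."""
    result: List[Optional[int]] = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            result.append(None)  # keep missing values as missing
        elif isinstance(v, bool):
            return None  # booleans are deliberately not treated as integers here
        elif isinstance(v, int):
            result.append(v)
        elif isinstance(v, float) and v.is_integer():
            result.append(int(v))  # 3.0 -> 3
        else:
            return None  # a fractional or non-numeric value: not an integer column
    return result


print(cast_to_integers([1.0, 2.0, float("nan"), 4.0]))
print(cast_to_integers([1.5, 2.0]))
```

The first call yields integers with the missing entry preserved, and the second returns `None` because `1.5` cannot be represented as an integer without losing information.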

This is only a small taste of everything visions can do, including building your own domain-specific types and typesets, so please check out the API documentation or the examples/ directory for more info!

Supported frameworks

Thanks to its dispatch-based implementation, Visions is able to exploit framework-specific capabilities offered by libraries like pandas and spark. Currently it works with the following backends by default.

  • Pandas (feature complete)
  • Numpy (boolean, complex, date time, float, integer, string, time deltas, objects)
  • Spark (boolean, categorical, date, date time, float, integer, numeric, object, string)
  • Python (string, float, integer, date time, time delta, boolean, categorical, object, complex - other datatypes are untested)

If you're using pandas, it will also take advantage of parallelization tools like swifter if available.

It also offers a simple annotation-based API for registering new implementations as needed. For example, if you wished to extend the categorical data type to include a Dask-specific implementation, you might do something like

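```python
from visions.types.categorical import Categorical
from pandas.api import types as pdt
import dask.dataframe  # import the submodule so dask.dataframe.Series resolves


@Categorical.contains_op.register
def categorical_contains(series: dask.dataframe.Series, state: dict) -> bool:
    # A Dask series belongs to Categorical when its dtype is categorical
    return pdt.is_categorical_dtype(series.dtype)
```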
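For readers unfamiliar with dispatch-based registration, the pattern can be sketched with the standard library alone. The example below uses `functools.singledispatch` as a stand-in for the dispatch machinery; `integer_contains` and its two registered backends are hypothetical names, and the real `contains_op` also receives a `state` argument that is omitted here for brevity.

```python
from functools import singledispatch
from typing import Any


@singledispatch
def integer_contains(sequence: Any) -> bool:
    """Fallback used when no backend is registered for this container type."""
    raise NotImplementedError(f"No Integer backend for {type(sequence).__name__}")


@integer_contains.register
def _integer_contains_list(sequence: list) -> bool:
    # Plain-Python backend: check every element (bool is excluded on purpose)
    return all(isinstance(v, int) and not isinstance(v, bool) for v in sequence)


@integer_contains.register
def _integer_contains_range(sequence: range) -> bool:
    # A backend can exploit what it knows: a range only ever holds integers
    return True


print(integer_contains([1, 2, 3]))
print(integer_contains(range(3)))
```

The implementation is chosen from the type of the first argument, so supporting a new container only requires registering one more function. Nothing in the existing backends has to change.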

Contributing and support

Contributions to visions are welcome. For more information, please visit the community contributions page and join us on Slack. The GitHub issues tracker is used for reporting bugs, feature requests and support questions.

Also, please check out some of the other companies and packages using visions including:

If you're currently using visions or would like to be featured here, please let us know.

Acknowledgements

This package is part of the dylan-profiler project. The package is a core component of pandas-profiling. More information can be found here. This work was partially supported by SIDN Fonds.

Owner

  • Name: dylan-profiler
  • Login: dylan-profiler
  • Kind: organization

DYLAN: Tools for effective data analysis

JOSS Publication

Visions: An Open-Source Library for Semantic Data
Published
April 13, 2020
Volume 5, Issue 48, Page 2145
Authors
Simon Brugman ORCID
Radboud University
Ian Eaves ORCID
Independent
Editor
Matthew Sottile ORCID
Tags
data types data workflows data integration machine learning

GitHub Events

Total
  • Create event: 7
  • Release event: 2
  • Issues event: 3
  • Watch event: 6
  • Delete event: 5
  • Issue comment event: 2
  • Push event: 24
  • Pull request review event: 3
  • Pull request review comment event: 4
  • Pull request event: 21
Last Year
  • Create event: 7
  • Release event: 2
  • Issues event: 3
  • Watch event: 6
  • Delete event: 5
  • Issue comment event: 2
  • Push event: 24
  • Pull request review event: 3
  • Pull request review comment event: 4
  • Pull request event: 21

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 906
  • Total Committers: 12
  • Avg Commits per committer: 75.5
  • Development Distribution Score (DDS): 0.371
Past Year
  • Commits: 10
  • Committers: 2
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.1
Top Committers
Name Email Commits
simon_graphkite s****n@g****m 570
Ian Eaves i****s@g****m 254
GitHub Action a****n@g****m 72
dependabot-preview[bot] 2****] 2
lgtm-com[bot] 4****] 1
dependabot[bot] 4****] 1
Gustavo Camargo g****1 1
Erik Cederstrand e****k@c****k 1
Dan Houghton d****n@g****m 1
Charles-Meldhine Madi Mnemoi 6****6 1
Arfon Smith a****n 1
Aarni Koskela a****x@i****i 1
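The all-time figures above can be checked with a few lines of arithmetic. This sketch assumes the Development Distribution Score is defined as one minus the top committer's share of all commits. That definition is not stated on this page, but it reproduces the reported value from the commit counts in the table.

```python
from typing import List


def development_distribution_score(commit_counts: List[int]) -> float:
    """Assumed definition: 1 - (commits by the top committer / total commits)."""
    return 1 - max(commit_counts) / sum(commit_counts)


# Commit counts copied from the Top Committers table
all_time = [570, 254, 72, 2, 1, 1, 1, 1, 1, 1, 1, 1]

total_commits = sum(all_time)
average = total_commits / len(all_time)
dds = round(development_distribution_score(all_time), 3)

print(total_commits, len(all_time), average, dds)
```

A score near 0 means one person wrote almost everything, while a score near 1 means commits are spread evenly across many committers.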
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 31
  • Total pull requests: 98
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 18
  • Total pull request authors: 10
  • Average comments per issue: 1.58
  • Average comments per pull request: 0.96
  • Merged pull requests: 85
  • Bot issues: 1
  • Bot pull requests: 7
Past Year
  • Issues: 1
  • Pull requests: 20
  • Average time to close issues: about 22 hours
  • Average time to close pull requests: 12 minutes
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.1
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sbrugman (8)
  • ieaves (6)
  • majidaldo (2)
  • cstabnick (1)
  • dependabot-preview[bot] (1)
  • seshurajup (1)
  • lahwaacz (1)
  • irvinktang (1)
  • fkiraly (1)
  • nv-rliu (1)
  • PraJaL55 (1)
  • sterlinm (1)
  • hubutui (1)
  • ttpro1995 (1)
  • cmnemoi (1)
Pull Request Authors
  • sbrugman (47)
  • ieaves (42)
  • dependabot[bot] (3)
  • lgtm-com[bot] (2)
  • dependabot-preview[bot] (2)
  • akx (1)
  • cmnemoi (1)
  • ecederstrand (1)
  • dah33 (1)
  • gcamargo1 (1)
Top Labels
Issue Labels
enhancement (13) bug (13) good first issue (1) wontfix (1)
Pull Request Labels
dependencies (5)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 1,072,345 last-month
  • Total docker downloads: 511,361
  • Total dependent packages: 10
    (may contain duplicates)
  • Total dependent repositories: 748
    (may contain duplicates)
  • Total versions: 50
  • Total maintainers: 2
pypi.org: visions

Visions

  • Versions: 30
  • Dependent Packages: 6
  • Dependent Repositories: 722
  • Downloads: 1,072,345 Last month
  • Docker Downloads: 511,361
Rankings
Downloads: 0.3%
Dependent repos count: 0.5%
Docker downloads count: 0.9%
Dependent packages count: 1.6%
Average: 2.8%
Stargazers count: 5.0%
Forks count: 8.4%
Maintainers (2)
Last synced: 4 months ago
conda-forge.org: visions
  • Versions: 12
  • Dependent Packages: 2
  • Dependent Repositories: 13
Rankings
Dependent repos count: 9.8%
Dependent packages count: 19.6%
Average: 23.4%
Stargazers count: 27.6%
Forks count: 36.7%
Last synced: 4 months ago
anaconda.org: visions

Visions provides an extensible suite of tools to support common data analysis operations including type inference on unknown data, casting data types and automated data summarization.

  • Versions: 8
  • Dependent Packages: 2
  • Dependent Repositories: 13
Rankings
Dependent packages count: 20.4%
Average: 35.8%
Dependent repos count: 36.0%
Stargazers count: 39.7%
Forks count: 47.1%
Last synced: 5 months ago

Dependencies

requirements.txt pypi
  • attrs >=19.3.0
  • multimethod >=1.4
  • networkx >=2.4
  • numpy *
  • pandas >=0.25.3
  • tangled_up_in_unicode >=0.0.4
requirements_dev.txt pypi
  • IPython * development
  • Sphinx-copybutton * development
  • black >=20.8b1 development
  • isort >=5.0.9 development
  • mypy >=0.770 development
  • nbsphinx * development
  • recommonmark >=0.6.0 development
  • setuptools >=46.1.3 development
  • sphinx-autodoc-typehints >=1.10.3 development
  • sphinx_rtd_theme >=0.4.3 development
  • wheel >=0.34.2 development
requirements_test.txt pypi
  • Pillow * test
  • big_o >=0.10.1 test
  • black >=19.10b0 test
  • check-manifest >=0.41 test
  • imagehash * test
  • isort >=5.0.9 test
  • matplotlib * test
  • mypy >=0.800 test
  • pandas * test
  • pre-commit * test
  • pyarrow >=1.0.1 test
  • pydot * test
  • pyspark * test
  • pytest >=5.2.0 test
  • pytest-spark >=0.6.0 test
  • shapely * test
  • twine >=3.1.1 test
.github/workflows/ci.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • ad-m/github-push-action master composite
.github/workflows/pypi.yml actions
  • actions/cache v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish master composite
.github/workflows/tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
requirements_spark.txt pypi
setup.py pypi