Missingno

Missingno: a missing data visualization suite - Published in JOSS (2018)

https://github.com/residentmario/missingno

Keywords

data-analysis data-visualization missing-data pandas python

Last synced: 6 months ago

Repository

Missing data visualization module for Python.

Basic Info

Host: GitHub
Owner: ResidentMario
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 10.4 MB

Statistics

Stars: 4,130
Watchers: 75
Forks: 528
Open Issues: 15
Releases: 6

Topics

data-analysis data-visualization missing-data pandas python

Created almost 10 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Contributing License

missingno

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

quickstart

This quickstart uses a sample of the NYPD Motor Vehicle Collisions Dataset.

```python
import pandas as pd
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")
```

`matrix`

The msno.matrix nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.

```python
import missingno as msno
%matplotlib inline
msno.matrix(collisions.sample(250))
```

At a glance, date, time, the distribution of injuries, and the contributing factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.

The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
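As a rough illustration of what the sparkline summarizes, the per-row completeness can be computed directly with pandas. This is a minimal sketch on a made-up frame (the column names `x`, `y`, `z` are placeholders), not the library's own plotting code:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; NaN marks a missing value.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [1.0, np.nan, np.nan, 4.0],
    "z": [1.0, np.nan, 3.0, 4.0],
})

# Number of filled (non-null) values in each row.
filled_per_row = df.notnull().sum(axis=1)

# The extremes that a completeness sparkline would call out.
most_complete = int(filled_per_row.max())
least_complete = int(filled_per_row.min())

print(filled_per_row.tolist(), most_complete, least_complete)
```

Here the second row is entirely empty and the first and last rows are entirely filled, so those are the extremes a row-wise summary would highlight.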

This visualization will comfortably accommodate up to 50 labelled variables. Past that range, labels begin to overlap or become unreadable, and by default large displays omit them.

If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:

```python
import numpy as np

null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')
```

`bar`

msno.bar is a simple visualization of nullity by column:

```python
msno.bar(collisions.sample(1000))
```

You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.
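The numbers behind a per-column nullity chart are easy to reproduce with pandas if you want them as a table rather than a plot. The sketch below uses a made-up frame (the column names are illustrative only) and computes the count and fraction of filled values per column:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; NaN/None marks a missing value.
df = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
    "street": ["BROADWAY", None, "5 AVENUE", None],
    "zip": [np.nan, np.nan, 10003.0, np.nan],
})

# Count of non-null entries per column (the height of each bar).
counts = df.notnull().sum()

# Fraction of non-null entries per column (the same bars, rescaled to 0..1).
fractions = df.notnull().mean()

print(counts.to_dict())
print(fractions.to_dict())
```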

`heatmap`

The missingno correlation heatmap measures nullity correlation, that is, how strongly the presence or absence of one variable affects the presence of another:

```python
msno.heatmap(collisions)
```

In this example, it seems that reports which are filed with an OFF STREET NAME variable are less likely to have complete geographic data.

Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).

The exact algorithm used is:

```python
import numpy as np

# df is a pandas.DataFrame instance
df = df.iloc[:, [i for i, n in enumerate(np.var(df.isnull(), axis='rows')) if n > 0]]
corr_mat = df.isnull().corr()
```

Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization—in this case for instance the datetime and injury number columns, which are completely filled, are not included.
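To make the range of values concrete, here is a small worked example on a made-up frame (the column names `a`, `b`, `c`, and `full` are placeholders). It is a sketch of the same idea, using an integer cast before taking the variance, and shows why a column with constant nullity has to be dropped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],   # missing exactly when "a" is missing
    "c": [np.nan, 2.0, np.nan, 4.0],   # present exactly when "a" is missing
    "full": [1.0, 2.0, 3.0, 4.0],      # never missing
})

nullity = df.isnull()

# Pairwise correlation of the boolean "is missing" indicators.
corr = nullity.corr()

# Columns whose nullity never changes have zero variance, so their
# correlation is undefined (NaN); keep only the columns that vary.
kept = df.loc[:, nullity.astype(int).var() > 0]

print(corr.round(2))
print(list(kept.columns))
```

In this toy case `a` and `b` correlate at 1, `a` and `c` at -1, and every entry involving `full` is NaN, which is why only `a`, `b`, and `c` survive the filter.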

Entries marked <1 or >-1 have a correlation that is close to being exactly negative or positive, but is still not quite perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between VEHICLE TYPE CODE 3 and CONTRIBUTING FACTOR VEHICLE 3 is <1, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.
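One way to find the records behind a near-perfect correlation is to compare the nullity of the two columns row by row. The following is a minimal sketch on a made-up frame; the column names `factor` and `vehicle` are stand-ins for whichever pair you are investigating:

```python
import pandas as pd

# Toy frame with hypothetical columns; None marks a missing value.
df = pd.DataFrame({
    "factor": ["speeding", None, "fatigue", None, "glare"],
    "vehicle": ["sedan", None, None, None, "taxi"],
})

# True where exactly one of the two columns is missing.
mismatch = df["factor"].isnull() != df["vehicle"].isnull()

# The rows that need a closer look.
suspect_rows = df[mismatch]

print(suspect_rows)
```

Only the third row (index 2) has one value without the other, so it is the only row returned.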

The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power is limited when it comes to larger relationships and it has no particular support for extremely large datasets.

`dendrogram`

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:

```python
msno.dendrogram(collisions)
```

The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

The exact algorithm used is:

```python
from scipy.cluster import hierarchy
import numpy as np

# df is a pandas.DataFrame instance
# method is a scipy linkage method name, e.g. 'average'
x = np.transpose(df.isnull().astype(int).values)
z = hierarchy.linkage(x, method)
```

To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually do, or ought to, match each other in nullity (for example, as CONTRIBUTING FACTOR VEHICLE 2 and VEHICLE TYPE CODE 2 ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.
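The link heights can be checked by hand on a small example. The sketch below runs scipy's linkage on a made-up frame, assuming `method="average"` and scipy's default Euclidean metric (both are assumptions here, and the column names are placeholders). Under that metric, the distance between two columns is the square root of the number of rows where their nullity differs:

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

nan = np.nan
df = pd.DataFrame({
    "a": [1, nan, 3, nan, 5, 6, 7, 8],
    "b": [1, nan, 3, nan, 5, 6, 7, 8],          # same nullity as "a"
    "c": [1, nan, 3, nan, nan, nan, nan, nan],  # differs from "a" in 4 rows
})

# One row per variable, one 0/1 nullity flag per record.
x = np.transpose(df.isnull().astype(int).values)

# Each row of z is (cluster i, cluster j, distance, size of merged cluster).
z = hierarchy.linkage(x, method="average")

print(z)
```

The first merge joins `a` and `b` at height 0, since their nullity is identical. The second merge joins that pair with `c` at height 2, the square root of the 4 mismatched rows.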

As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However, the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.

configuration

For more advanced configuration details for your plots, refer to the CONFIGURATION.md file in this repository.

contributing

For thoughts on features or bug reports see Issues. If you're interested in contributing to this library, see details on doing so in the CONTRIBUTING.md file in this repository. If doing so, keep in mind that missingno is currently in a maintenance state, so while bugfixes are welcome, I am unlikely to review or land any new major library features.

Owner

Name: Aleksey Bilogur
Login: ResidentMario
Kind: user
Location: Brooklyn, NY

Website: https://www.residentmar.io/
Twitter: ResidentMario
Repositories: 201
Profile: https://github.com/ResidentMario

Building machine learning tooling at @reddit. Previously: @spellml, @quiltdata, @recursecenter, @Kaggle, @MODA-NYC.

JOSS Publication

Missingno: a missing data visualization suite

Published

February 27, 2018

DOI

10.21105/joss.00547

Volume 3, Issue 22, Page 547

Authors

Aleksey Bilogur

Independent

Editor

Kevin M. Moerman

Papers & Mentions

Total mentions: 4

Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction

DOI: 10.7717/peerj.281
OpenAlex ID: https://openalex.org/W2016603972
Published: March 2014

Last synced: 4 months ago

Advanced methods for missing values imputation based on similarity learning

DOI: 10.7717/peerj-cs.619
OpenAlex ID: https://openalex.org/W3187035773
Published: July 2021

Last synced: 4 months ago

Extracting Value from Industrial Alarms and Events: A Data-Driven Approach Based on Exploratory Data Analysis

DOI: 10.3390/s19122772
OpenAlex ID: https://openalex.org/W2953061042
Published: June 2019

Last synced: 4 months ago

Applying Machine Learning Technologies Based on Historical Activity Features for Multi-Resident Activity Recognition

DOI: 10.3390/s21072520
OpenAlex ID: https://openalex.org/W3144717505
Published: April 2021

Last synced: 4 months ago

GitHub Events

Total

Watch event: 205
Issue comment event: 1
Pull request event: 3
Fork event: 12

Last Year

Watch event: 205
Issue comment event: 1
Pull request event: 3
Fork event: 12

Committers

Last synced: 7 months ago

All Time

Total Commits: 177
Total Committers: 18
Avg Commits per committer: 9.833
Development Distribution Score (DDS): 0.147

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
ResidentMario	a**r@g**m	151
Kevin Moore	k**n@q**o	5
vagrant	s**i@g**m	4
VolkrB	V**u@g**m	3
bradprin@live.com	B**d@B**l	1
Armando Fandango	a****o	1
Ben Everson	b**n@s**m	1
Clovis I	5****0	1
Edison J Abahurire	2****e	1
Harry Mavroforakis	h****r	1
Johnnes SV Cruz	3****s	1
Max Mahlke	1****e	1
Richard Leyshon	4****n	1
Sergiu	3****1	1
Simon Brugman	s****n	1
Tim Gates	t**s@i**m	1
Todd	t**8@g**m	1
Zhian N. Kamvar	z**r@g**m	1

Committer Domains (Top 20 + Academic)

iress.com: 1
svds.com: 1
quiltdata.io: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 80
Total pull requests: 31
Average time to close issues: 5 months
Average time to close pull requests: about 2 months
Total issue authors: 70
Total pull request authors: 16
Average comments per issue: 1.86
Average comments per pull request: 0.65
Merged pull requests: 25
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

rhiever (4)
lodowner-zz (3)
armando-fandango (2)
aguinaldoabbj (2)
ResidentMario (2)
clevilll (2)
sbrugman (2)
zach-navina (1)
javed110 (1)
pedugnat (1)
brobr (1)
datatalking (1)
ggutierrezc (1)
enarroied (1)
dataoculus (1)

Pull Request Authors

ResidentMario (15)
anopsy (2)
Ahmed-Jabrane (2)
TomerAntman (1)
joeunsupervised (1)
maxmahlke (1)
SultanOrazbayev (1)
sbrugman (1)
r-leyshon (1)
edison12a (1)
toddrme2178 (1)
WillMc93 (1)
timgates42 (1)
chacreton190 (1)
sergiuser1 (1)

Top Labels

Issue Labels

feature request (9)
enhancement (4)
bug (2)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- pypi 176,005 last-month
Total docker downloads: 234,410

Total dependent packages: 37
(may contain duplicates)
Total dependent repositories: 2,010
(may contain duplicates)
Total versions: 32
Total maintainers: 1

pypi.org: missingno

Missing data visualization module for Python.

Homepage: https://github.com/ResidentMario/missingno
Documentation: https://missingno.readthedocs.io/
License: MIT License
Latest release: 0.5.2
published almost 3 years ago

Versions: 26
Dependent Packages: 32
Dependent Repositories: 1,920
Downloads: 176,005 Last month
Docker Downloads: 234,410

Rankings

Dependent repos count: 0.3%

Dependent packages count: 0.4%

Downloads: 0.5%

Docker downloads count: 0.9%

Average: 1.0%

Stargazers count: 1.3%

Forks count: 2.5%

Maintainers (1)

ResidentMario

Last synced: about 1 year ago

conda-forge.org: missingno

Homepage: https://github.com/ResidentMario/missingno
License: MIT
Latest release: 0.4.2
published over 6 years ago

Versions: 4
Dependent Packages: 4
Dependent Repositories: 45

Rankings

Dependent repos count: 5.3%

Stargazers count: 6.3%

Forks count: 7.7%

Average: 8.0%

Dependent packages count: 12.5%

Last synced: 6 months ago

anaconda.org: missingno

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset.

Homepage: https://github.com/ResidentMario/missingno
License: MIT
Latest release: 0.5.2
published 7 months ago

Versions: 2
Dependent Packages: 1
Dependent Repositories: 45

Rankings

Stargazers count: 14.0%

Forks count: 15.7%

Average: 19.1%

Dependent packages count: 21.6%

Dependent repos count: 25.1%

Last synced: 6 months ago

Missingno

Science Score: 93.0%
