https://github.com/multimeric/PandasSchema

A validation library for Pandas data frames using user-friendly schemas

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.9%) to scientific vocabulary

Keywords

data-science pandas schema validation

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: multimeric
License: gpl-3.0
Language: Python
Default Branch: master
Homepage: https://multimeric.github.io/PandasSchema/
Size: 767 KB

Statistics

Stars: 192
Watchers: 5
Forks: 36
Open Issues: 38
Releases: 4

Archived

Topics

data-science pandas schema validation

Created over 9 years ago · Last pushed over 3 years ago

Metadata Files

Readme License

README.rst

PandasSchema
************

For the full documentation, refer to the `GitHub Pages website
<https://multimeric.github.io/PandasSchema/>`_.

======================================================================

PandasSchema is a module for validating tabulated data, such as CSV
(comma-separated values) and TSV (tab-separated values) files.
It uses the data analysis library Pandas to do so quickly and
efficiently.

For example, say your code expects a CSV that looks a bit like this:

.. code:: default

   Given Name,Family Name,Age,Sex,Customer ID
   Gerald,Hampton,82,Male,2582GABK
   Yuuwa,Miyake,27,Male,7951WVLW
   Edyta,Majewska,50,Female,7758NSID

To ensure that the data in your CSV is in the correct format, define
a schema and validate the data against it:

.. code:: python

   from io import StringIO

   import pandas as pd
   from pandas_schema import Column, Schema
   from pandas_schema.validation import (
       InListValidation,
       InRangeValidation,
       LeadingWhitespaceValidation,
       MatchesPatternValidation,
       TrailingWhitespaceValidation,
   )

   # One Column per CSV column, each with the validations its values must pass
   schema = Schema([
       Column('Given Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
       Column('Family Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
       Column('Age', [InRangeValidation(0, 120)]),
       Column('Sex', [InListValidation(['Male', 'Female', 'Other'])]),
       Column('Customer ID', [MatchesPatternValidation(r'\d{4}[A-Z]{4}')])
   ])

   # Sample data with deliberate errors in every row
   test_data = pd.read_csv(StringIO('''Given Name,Family Name,Age,Sex,Customer ID
   Gerald ,Hampton,82,Male,2582GABK
   Yuuwa,Miyake,270,male,7951WVLW
   Edyta,Majewska ,50,Female,775ANSID
   '''))

   # Validate the data frame against the schema and collect the failures
   errors = schema.validate(test_data)

   for error in errors:
       print(error)

PandasSchema would then output:

.. code:: text

   {row: 0, column: "Given Name"}: "Gerald " contains trailing whitespace
   {row: 1, column: "Age"}: "270" was not in the range [0, 120)
   {row: 1, column: "Sex"}: "male" is not in the list of legal options (Male, Female, Other)
   {row: 2, column: "Family Name"}: "Majewska " contains trailing whitespace
   {row: 2, column: "Customer ID"}: "775ANSID" does not match the pattern "\d{4}[A-Z]{4}"
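
To make the five findings above easier to follow, here is a minimal sketch of the same checks written with only the Python standard library. It is not how PandasSchema is implemented: the ``find_problems`` helper and its message strings are hypothetical, the half-open range is taken from the ``[0, 120)`` wording in the output above, and the use of ``re.search`` (rather than an anchored match) is an assumption.

```python
import csv
import re
from io import StringIO

CSV_TEXT = """Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
"""

ALLOWED_SEX = ("Male", "Female", "Other")
ID_PATTERN = r"\d{4}[A-Z]{4}"


def find_problems(csv_text):
    """Hypothetical helper: return (row, column, message) for each failed check."""
    findings = []
    for index, row in enumerate(csv.DictReader(StringIO(csv_text))):
        # Whitespace checks on both name columns
        for column in ("Given Name", "Family Name"):
            value = row[column]
            if value != value.lstrip():
                findings.append((index, column, "contains leading whitespace"))
            if value != value.rstrip():
                findings.append((index, column, "contains trailing whitespace"))
        # Half-open range: 0 is allowed, 120 is not
        if not 0 <= float(row["Age"]) < 120:
            findings.append((index, "Age", "was not in the range [0, 120)"))
        # Membership check is case-sensitive, so "male" fails
        if row["Sex"] not in ALLOWED_SEX:
            findings.append((index, "Sex", "is not in the list of legal options"))
        # Assumption: the pattern may match anywhere in the value
        if re.search(ID_PATTERN, row["Customer ID"]) is None:
            findings.append((index, "Customer ID", "does not match the pattern"))
    return findings


findings = find_problems(CSV_TEXT)
for index, column, message in findings:
    print(f'row {index}, column "{column}": {message}')
```

The sketch reports the same rows and columns as the output above, which shows why each value was rejected: trailing spaces in two names, an age outside the range, a lower-case ``male``, and a customer ID with only three leading digits.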

Owner

Name: Michael Milton
Login: multimeric
Kind: user
Location: Australia

Twitter: multimeric
Repositories: 280
Profile: https://github.com/multimeric

GitHub Events

Total

Watch event: 5
Issue comment event: 1
Pull request event: 3
Fork event: 1

Last Year

Watch event: 5
Issue comment event: 1
Pull request event: 3
Fork event: 1

Committers

Last synced: over 1 year ago

All Time

Total Commits: 89
Total Committers: 9
Avg Commits per committer: 9.889
Development Distribution Score (DDS): 0.348

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Michael Milton	t**t@g**m	58
Diego Quintana	d**v@g**m	11
Pranshu Aggarwal	w**u@g**m	8
Dave Cavaletto	d**o@h**m	7
MonsieurWave	t**e@g**m	1
Fasih Ahmad Fakhri	f**i@g**m	1
Andrew Kemm	a**m@w**m	1
David Farrington	i**o@d**k	1
Cristian Narvaez	c**z@g**m	1
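
The All Time averages above can be reproduced from the commit counts in this table. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits; the page does not state the formula, but that definition is consistent with the figures shown here. The function name is hypothetical.

```python
# Commit counts copied from the Top Committers table
commits = [58, 11, 8, 7, 1, 1, 1, 1, 1]


def development_distribution_score(commit_counts):
    """Assumed definition: 1 - (commits by top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        # No commits in the period, as in the Past Year figures
        return 0.0
    return 1 - max(commit_counts) / total


total_commits = sum(commits)
average = total_commits / len(commits)
dds = development_distribution_score(commits)

print(total_commits, round(average, 3), round(dds, 3))
```

Under this definition a score near 0 means one person wrote almost every commit, and a higher score means the work is spread across more committers.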

Committer Domains (Top 20 + Academic)

globant.com: 1, davidfarrington.co.uk: 1, wspdigital.com: 1, healthmine.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 47
Total pull requests: 28
Average time to close issues: 6 months
Average time to close pull requests: about 1 year
Total issue authors: 35
Total pull request authors: 22
Average comments per issue: 2.43
Average comments per pull request: 3.39
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.33
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Abhisek1994Roy (5)
diegoquintanav (4)
pranshuag9 (3)
Calosha (2)
caddac (2)
chrispijo (2)
mckev-amazon (1)
dmargol1 (1)
cnarvaa (1)
wolces (1)
courserachateau (1)
erlenddalen (1)
dtamez (1)
Ghost---Shadow (1)
christopherhastings (1)

Pull Request Authors

multimeric (3)
diegoquintanav (2)
AmirAl-Jumaily (2)
caddac (2)
kenahoo (2)
oshribr (1)
RoyalTS (1)
lguntde (1)
cnarvaa (1)
fasih (1)
JulianKlug (1)
chrispijo (1)
ybubnov (1)
pranshuag9 (1)
Maarten-vd-Sande (1)

Top Labels

Issue Labels

bug (8), enhancement (7), good-first-issue (3), help wanted (2)

Pull Request Labels

waiting for reply (1)

Dependencies

requirements.txt pypi

sphinx *
sphinx-autodoc-annotation *

setup.py pypi

numpy *
packaging *
pandas >=0.19
