omigo-arjun

Data Analytics Library for Python

https://github.com/crowdstrike/omigo-data-analytics

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.1%) to scientific vocabulary

Keywords

analytics aws data-analysis data-science data-visualization machine-learning pandas python statistics tsv
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CrowdStrike
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 10.9 MB
Statistics
  • Stars: 16
  • Watchers: 3
  • Forks: 4
  • Open Issues: 0
  • Releases: 87
Created over 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Security

README.md

Data Analytics Library for Python

Summary

  • Python library for end-to-end data analytics, from reading through transformation and analysis to visualization.
  • Support for reading and writing multiple data formats from the local machine or S3.
  • Functional-programming-style APIs.
  • Advanced APIs for joins, aggregation, sampling, and time series processing.
  • Schema evolution.
  • Visualization APIs that provide a simple interface to matplotlib, seaborn, and other popular libraries.

Primary Use Cases

  • The data exploration phase, when it is not yet clear what to look for or what might work.
  • Wide datasets with hundreds or thousands of columns.
  • Workloads that involve complex business logic.

Run through Docker

Build image (first time only)

docker build -t omigo-data-analytics -f deploy/Dockerfile .

Run docker image

docker run --rm -p 8888:8888 -it -v $PWD:/code omigo-data-analytics

Install Instructions

There are three packages: core, extensions, and hydra.

The core package is built on core Python with minimal external dependencies to keep it stable. The extensions package contains libraries for advanced functionality such as visualization and can have many dependencies. The hydra package contains experimental code for distributed execution and is not used at the moment.

pip3 install omigo-core omigo-ext omigo-hydra --upgrade

APIs are provided for creating new extension packages for custom needs and plugging them into the existing code easily (see extend-class).

Usage

Note: Some working examples are in the Jupyter example-notebooks directory. Here is a simple example to run from the command line.

Read data from the local filesystem. An S3 path or a web URL can also be used.

``` python3
from omigo_core import dataframe
from omigo_hydra import hydra

x = hydra.read("data/iris.tsv.gz")

# other possible options
x = hydra.read("s3://bucket/path_to_file/data.tsv.gz")
```

Print basic stats like the number of rows

print(x.num_rows())
150

Export to a pandas DataFrame for nicer display, or to use any of the pandas APIs.

```
x.to_pandas_df(10)

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
```

Example of filtering rows on specific column values and selecting specific columns

```
y = x \
    .eq_str("class", "Iris-setosa") \
    .gt_float("sepal_width", 3.1) \
    .select(["sepal_width", "sepal_length"])

y.show(5)

sepal_width  sepal_length
3.5          5.1
3.2          4.7
3.6          5.0
```
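For readers who think in plain Python, the filter-and-select chain above can be approximated with the standard library alone. This is only a sketch of the semantics, assuming the two filters mean string equality and a strict numeric greater-than, and that the selection keeps the listed columns in order. It does not use omigo, the helper name `filter_and_select` is hypothetical, and the sample rows are copied from the ten-row display shown earlier.

```python
import csv
import io

# The ten iris rows displayed earlier, as tab-separated text.
SAMPLE_TSV = (
    "sepal_length\tsepal_width\tpetal_length\tpetal_width\tclass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
    "4.7\t3.2\t1.3\t0.2\tIris-setosa\n"
    "4.6\t3.1\t1.5\t0.2\tIris-setosa\n"
    "5.0\t3.6\t1.4\t0.2\tIris-setosa\n"
    "5.4\t3.9\t1.7\t0.4\tIris-setosa\n"
    "4.6\t3.4\t1.4\t0.3\tIris-setosa\n"
    "5.0\t3.4\t1.5\t0.2\tIris-setosa\n"
    "4.4\t2.9\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.1\t1.5\t0.1\tIris-setosa\n"
)


def filter_and_select(tsv_text, class_value, min_sepal_width, columns):
    """Keep rows of the given class whose sepal_width is strictly above the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        if row["class"] == class_value and float(row["sepal_width"]) > min_sepal_width:
            # Build the output row in the requested column order.
            selected.append({name: row[name] for name in columns})
    return selected


rows = filter_and_select(SAMPLE_TSV, "Iris-setosa", 3.1, ["sepal_width", "sepal_length"])
for row in rows[:5]:
    print(row["sepal_width"], row["sepal_length"])
```

Values stay as strings and are converted to float only for the comparison, which mirrors how a TSV-backed table would typically treat its cells.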

Import the graph extension package for creating charts

```
from omigo_ext import graph_ext

x.extend_class(graph_ext.VisualDF).histogram("sepal_length", "class", yfigsize = 8)
```

[Figure: iris sepal_width histogram]

Some of the more advanced graphs are also available

```
x.extend_class(graph_ext.VisualDF).pairplot(["sepal_length", "sepal_width"], kind = "kde", diag_kind = "auto")
```

[Figure: iris sepal_width pairplot]

The TSV file can be saved to the local file system or S3

```
hydra.write(y, "output.tsv.gz")
```
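To make the `.tsv.gz` format concrete, here is a minimal standard-library sketch of writing and reading back a gzip-compressed, tab-separated file. It only illustrates the file format; it is not how the library implements its read and write calls, and the helper names `write_tsv_gz` and `read_tsv_gz` are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, header, rows):
    """Write a header and rows as gzip-compressed, tab-separated text."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file and return (header, rows)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        return header, list(reader)


header = ["sepal_width", "sepal_length"]
rows = [["3.5", "5.1"], ["3.2", "4.7"], ["3.6", "5.0"]]

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.tsv.gz")
    write_tsv_gz(path, header, rows)
    loaded_header, loaded_rows = read_tsv_gz(path)

print(loaded_header, len(loaded_rows))
```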

Extensions

Many extensions are available to add advanced functionality.

1. Graphics and Visualization

This extension provides visualization APIs such as linechart and barchart.

2. Read from Web Services

This extension provides APIs to call an external web service for every row in the data. All web service parameters, including the URL, query parameters, headers, and payload, can be templatized and mapped to individual columns. The extension supports multithreading.
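The idea of mapping columns into request parameters can be sketched as follows. This is an illustration only, not the extension's API: the `{column}` placeholder syntax, the `render_request` helper, and the example host and columns are all assumptions, and no request is actually sent.

```python
def render_request(row, url_template, query_template=None, header_template=None):
    """Fill {column} placeholders in each template using one row of data."""
    query = {key: value.format_map(row) for key, value in (query_template or {}).items()}
    headers = {key: value.format_map(row) for key, value in (header_template or {}).items()}
    return {"url": url_template.format_map(row), "query": query, "headers": headers}


# Hypothetical rows; each key plays the role of a column name.
rows = [
    {"host": "api.example.com", "user_id": "17", "token": "abc"},
    {"host": "api.example.com", "user_id": "42", "token": "xyz"},
]

requests_to_send = [
    render_request(
        row,
        url_template="https://{host}/users/{user_id}",
        query_template={"verbose": "true", "id": "{user_id}"},
        header_template={"Authorization": "Bearer {token}"},
    )
    for row in rows
]

for request in requests_to_send:
    print(request["url"], request["query"], request["headers"])
```

Rendering requests as plain dictionaries first keeps the templating step separate from the network call, so it can be checked without contacting any service.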

3. Multi Threading

This extension provides a simple wrapper for calling different APIs within a thread pool. It is usually used inside other extensions.
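A thread-pool wrapper of this general kind can be sketched with `concurrent.futures` from the standard library. The names `run_in_pool` and `fake_lookup` are hypothetical and do not reflect the extension's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_pool(func, inputs, num_threads=4):
    """Apply func to every input on a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(func, inputs))


def fake_lookup(user_id):
    # Stand-in for an I/O-bound call such as a web-service request.
    return {"user_id": user_id, "status": "ok"}


results = run_in_pool(fake_lookup, ["17", "42", "99"], num_threads=2)
print(results)
```

Threads suit I/O-bound work such as web-service calls; CPU-bound functions would see little benefit from this pattern in CPython.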

4. Kafka Reader

This extension reads data from Kafka and returns it as a TSV object. Many custom parameters are provided to simplify parsing of the data.

5. Pandas

This extension is a placeholder for wrapping any interesting pandas APIs, such as reading a parquet file (local or S3).

6. ETL

This extension provides APIs to read data that is stored in some ETL format. It is useful for reading time series data stored in a partitioned manner.
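The text does not describe the partition layout, so the following is only a sketch under an assumed layout of one directory per day named `dt=YYYYMMDD`. The helper `partition_paths`, the bucket name, and the naming pattern are all hypothetical. It shows how a time range maps to the set of partitions a reader would need to visit.

```python
from datetime import date, timedelta


def partition_paths(base_path, start, end, pattern="dt=%Y%m%d"):
    """List one partition directory per day in [start, end], both inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    num_days = (end - start).days + 1
    return [
        f"{base_path}/{(start + timedelta(days=offset)).strftime(pattern)}"
        for offset in range(num_days)
    ]


# Assumed daily layout under a made-up bucket; 2024 is a leap year,
# so the range includes February 29.
paths = partition_paths("s3://bucket/events", date(2024, 2, 27), date(2024, 3, 1))
for path in paths:
    print(path)
```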

Documentation

  • README: Good starting point to get a basic overview of the library.
  • API Documentation: Detailed API docs with simple examples to illustrate the usage.
  • example-notebooks: Working examples to show different use cases.

Notes from the author

  • This library is built for simplicity, functionality, and robustness. Good engineering practices are being adopted gradually.
  • More examples with real-life use cases are in progress. Feel free to reach out with any questions.
  • This project is in an active research phase and is not meant to be deployed in production.

Owner

  • Name: CrowdStrike
  • Login: CrowdStrike
  • Kind: organization
  • Email: github@crowdstrike.com
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
title: 'Omigo Data Analytics'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: 
    family-names: CrowdStrike
  - given-names: Amit
    family-names: Jaiswal
    affiliation: CrowdStrike
    email: amit.jaiswal@gmail.com
repository-code: 'https://github.com/CrowdStrike/omigo-data-analytics'
repository-artifact: 'https://pypi.org/project/omigo-core/'
abstract: >-
  A data analytics library for Python.
keywords:
  - crowdstrike
  - tsv
  - pandas
  - jupyter
  - jupyter-notebook
  - python
  - python3
  - machine-learning
  - ml
license: MIT

GitHub Events

Total
  • Release event: 3
  • Watch event: 2
  • Push event: 34
  • Pull request review event: 1
  • Pull request event: 6
  • Create event: 5
Last Year
  • Release event: 3
  • Watch event: 2
  • Push event: 34
  • Pull request review event: 1
  • Pull request event: 6
  • Create event: 5

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 1,040
  • Total Committers: 4
  • Avg Commits per committer: 260.0
  • Development Distribution Score (DDS): 0.294
Past Year
  • Commits: 122
  • Committers: 2
  • Avg Commits per committer: 61.0
  • Development Distribution Score (DDS): 0.049
Top Committers
Name Email Commits
Amit Jaiswal a****l@c****m 734
Amit Jaiswal a****l@g****m 292
Joshua Hiller j****r@c****m 13
Shawn Wells s****n@s****o 1
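The all-time figures above can be reproduced from the per-committer counts. The page does not state how the Development Distribution Score is computed; the sketch below assumes it is one minus the top committer's share of all commits, which matches the listed value of 0.294. Note that the two "Amit Jaiswal" entries are counted as separate committers because they use different email addresses.

```python
# Commit counts from the Top Committers table, one entry per email identity.
commits_by_committer = [734, 292, 13, 1]

total_commits = sum(commits_by_committer)
average_commits = total_commits / len(commits_by_committer)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits_by_committer) / total_commits

print(total_commits, average_commits, round(dds, 3))
```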

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 92
  • Average time to close issues: N/A
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 0
  • Total pull request authors: 4
  • Average comments per issue: 0
  • Average comments per pull request: 0.49
  • Merged pull requests: 90
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • amitjaiswal (68)
  • jshcodes (8)
  • oceanhacktitude (1)
  • shawndwells (1)
Top Labels
Issue Labels
Pull Request Labels
enhancement (1)

Packages

  • Total packages: 6
  • Total downloads:
    • pypi 417 last-month
  • Total dependent packages: 11
    (may contain duplicates)
  • Total dependent repositories: 2
    (may contain duplicates)
  • Total versions: 276
  • Total maintainers: 1
pypi.org: omigo-core

Data Analytics Library for Python

  • Versions: 60
  • Dependent Packages: 5
  • Dependent Repositories: 1
  • Downloads: 148 Last month
Rankings
Dependent packages count: 1.6%
Average: 14.1%
Forks count: 14.2%
Stargazers count: 15.6%
Downloads: 17.5%
Dependent repos count: 21.5%
Maintainers (1)
Last synced: 6 months ago
pypi.org: omigo-ext

Extensions for omigo_core package

  • Versions: 60
  • Dependent Packages: 4
  • Dependent Repositories: 1
  • Downloads: 53 Last month
Rankings
Dependent packages count: 1.9%
Average: 14.2%
Forks count: 14.2%
Stargazers count: 15.6%
Downloads: 17.8%
Dependent repos count: 21.5%
Maintainers (1)
Last synced: 6 months ago
pypi.org: omigo-matel

Data Analytics Library for Python

  • Versions: 39
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 58 Last month
Rankings
Dependent packages count: 7.3%
Forks count: 14.5%
Stargazers count: 16.2%
Average: 19.8%
Dependent repos count: 41.4%
Maintainers (1)
Last synced: 6 months ago
pypi.org: omigo-hydra

Data Analytics Library for Python

  • Versions: 39
  • Dependent Packages: 2
  • Dependent Repositories: 0
  • Downloads: 45 Last month
Rankings
Dependent packages count: 7.3%
Forks count: 14.5%
Stargazers count: 16.2%
Average: 19.8%
Dependent repos count: 41.4%
Maintainers (1)
Last synced: 6 months ago
pypi.org: omigo-arjun

Data Analytics Library for Python

  • Versions: 39
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 51 Last month
Rankings
Dependent packages count: 7.3%
Forks count: 14.5%
Stargazers count: 16.2%
Average: 19.8%
Dependent repos count: 41.4%
Maintainers (1)
Last synced: 6 months ago
pypi.org: omigo-lokki

Data Analytics Library for Python

  • Versions: 39
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 62 Last month
Rankings
Dependent packages count: 7.3%
Forks count: 14.5%
Stargazers count: 16.2%
Average: 19.8%
Dependent repos count: 41.4%
Maintainers (1)
Last synced: 6 months ago

Dependencies

python-packages/core/requirements.txt pypi
  • boto3 *
  • pandas *
  • statistics *
  • urllib3 *
.github/workflows/bandit.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/codeql.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
.github/workflows/flake8.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/pylint.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
deploy/Dockerfile docker
  • ubuntu 20.04 build
python-packages/arjun/pyproject.toml pypi
python-packages/arjun/requirements.txt pypi
  • boto3 *
  • pandas *
  • statistics *
  • urllib3 *
python-packages/core/pyproject.toml pypi
python-packages/extensions/pyproject.toml pypi
python-packages/hydra/pyproject.toml pypi
python-packages/hydra/requirements.txt pypi
  • boto3 *
  • pandas *
  • statistics *
  • urllib3 *
python-packages/lokki/pyproject.toml pypi
python-packages/lokki/requirements.txt pypi
  • boto3 *
  • pandas *
  • statistics *
  • urllib3 *
python-packages/matel/pyproject.toml pypi
python-packages/matel/requirements.txt pypi
  • boto3 *
  • pandas *
  • statistics *
  • urllib3 *