https://github.com/kyleprotho/analysistoolbox
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.5%, to scientific vocabulary)
Repository
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Basic Info
- Host: GitHub
- Owner: KyleProtho
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Homepage: https://kyleprotho.github.io/AnalysisToolBox/
- Size: 2.32 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
Analysis Tool Box
Description
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Getting Started
To install the package, run the following command:
```bash
pip install analysistoolbox
```
Visualizations are created using the matplotlib and seaborn libraries. While you can select whichever seaborn style you'd like, the following Seaborn style tends to produce the best-looking plots:
```python
import seaborn as sns

sns.set(
    style="white",
    font="Arial",
    context="paper"
)
```
Table of Contents / Usage
There are many modules in the analysistoolbox package, each with its own functions. The following is a list of the modules:
- Calculus
- Data Collection
- Data Processing
- AddDateNumberColumns
- AddLeadingZeros
- AddRowCountColumn
- AddTPeriodColumn
- AddTukeyOutlierColumn
- CleanTextColumns
- ConductAnomalyDetection
- ConductEntityMatching
- ConvertOddsToProbability
- CountMissingDataByGroup
- CreateBinnedColumn
- CreateDataOverview
- CreateRandomSampleGroups
- CreateRareCategoryColumn
- CreateStratifiedRandomSampleGroups
- ImputeMissingValuesUsingNearestNeighbors
- VerifyGranularity
- Descriptive Analytics
- File Management
- Hypothesis Testing
- ChiSquareTestOfIndependence
- ChiSquareTestOfIndependenceFromTable
- ConductCoxProportionalHazardRegression
- ConductLinearRegressionAnalysis
- ConductLogisticRegressionAnalysis
- OneSampleTTest
- OneWayANOVA
- TTestOfMeanFromStats
- TTestOfProportionFromStats
- TTestOfTwoMeansFromStats
- TwoSampleTTestOfIndependence
- TwoSampleTTestPaired
- Linear Algebra
- LLM
- Predictive Analytics
- Prescriptive Analytics
- Probability
- Simulations
- CreateMetalogDistribution
- CreateMetalogDistributionFromPercentiles
- CreateSIPDataframe
- CreateSLURPDistribution
- SimulateCountOfSuccesses
- SimulateCountOutcome
- SimulateCountUntilFirstSuccess
- SimulateNormallyDistributedOutcome
- SimulateTDistributedOutcome
- SimulateTimeBetweenEvents
- SimulateTimeUntilNEvents
- Statistics
- Visualizations
- Plot100PercentStackedBarChart
- PlotBarChart
- PlotBoxWhiskerByGroup
- PlotBulletChart
- PlotCard
- PlotClusteredBarChart
- PlotContingencyHeatmap
- PlotCorrelationMatrix
- PlotDensityByGroup
- PlotDotPlot
- PlotHeatmap
- PlotOverlappingAreaChart
- PlotRiskTolerance
- PlotScatterplot
- PlotSingleVariableCountPlot
- PlotSingleVariableHistogram
- PlotTimeSeries
- RenderTableOne
Calculus
FindDerivative
The FindDerivative function calculates the derivative of a given function. It uses the sympy library, a Python library for symbolic mathematics, to perform the differentiation. The function also has the capability to print the original function and its derivative, return the derivative function, and plot both the original function and its derivative.
```python
# Load the FindDerivative function from the Calculus submodule
from analysistoolbox.calculus import FindDerivative
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function
f_of_x = x**3 + 2*x**2 + 3*x + 4

# Use the FindDerivative function
FindDerivative(
    f_of_x,
    print_functions=True,
    return_derivative_function=True,
    plot_functions=True
)
```
FindLimitOfFunction
The FindLimitOfFunction function finds the limit of a function at a specific point and optionally plots the function and its tangent line at that point. The script uses the matplotlib and numpy libraries for plotting and numerical operations, respectively.
```python
# Import the necessary libraries
from analysistoolbox.calculus import FindLimitOfFunction
import numpy as np
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function (use sympy.sin, since x is a symbolic variable)
f_of_x = sympy.sin(x) / x

# Use the FindLimitOfFunction function
FindLimitOfFunction(
    f_of_x,
    point=0,
    step=0.01,
    plot_function=True,
    x_minimum=-10,
    x_maximum=10,
    n=1000,
    tangent_line_window=1
)
```
FindMinimumSquareLoss
The FindMinimumSquareLoss function calculates the minimum square loss between observed and predicted values. This function is often used in machine learning and statistics to measure the average squared difference between the actual and predicted outcomes.
```python
# Import the necessary libraries
from analysistoolbox.calculus import FindMinimumSquareLoss

# Define observed and predicted values
observed_values = [1, 2, 3, 4, 5]
predicted_values = [1.1, 1.9, 3.2, 3.7, 5.1]

# Use the FindMinimumSquareLoss function
minimum_square_loss = FindMinimumSquareLoss(
    observed_values,
    predicted_values,
    show_plot=True
)

# Print the minimum square loss
print(f"The minimum square loss is: {minimum_square_loss}")
```
PlotFunction
The PlotFunction function plots a mathematical function of x. It takes a lambda function as input and allows for customization of the plot.
```python
# Import the necessary libraries
from analysistoolbox.calculus import PlotFunction
import sympy

# Set x as a symbolic variable
x = sympy.symbols('x')

# Define the function to plot
f_of_x = lambda x: x**2

# Plot the function with default settings
PlotFunction(f_of_x)
```
Data Collection
ExtractTextFromPDF
The ExtractTextFromPDF function extracts text from a PDF file, cleans it, then saves it to a text file.
```python
# Import the function
from analysistoolbox.data_collection import ExtractTextFromPDF

# Call the function
ExtractTextFromPDF(
    filepath_to_pdf="/path/to/your/input.pdf",
    filepath_for_exported_text="/path/to/your/output.txt",
    start_page=1,
    end_page=None
)
```
FetchPDFFromURL
The FetchPDFFromURL function downloads a PDF file from a URL and saves it to a specified location.
```python
# Import the function
from analysistoolbox.data_collection import FetchPDFFromURL

# Call the function to download the PDF
FetchPDFFromURL(
    url="https://example.com/sample.pdf",
    filename="C:/folder/sample.pdf"
)
```
FetchUSShapefile
The FetchUSShapefile function fetches a geographical shapefile from the TIGER database of the U.S. Census Bureau.
```python
# Import the function
from analysistoolbox.data_collection import FetchUSShapefile

# Fetch the shapefile for the census tracts in Allegheny County, Pennsylvania, for the 2021 census year
shapefile = FetchUSShapefile(
    state='PA',
    county='Allegheny',
    geography='tract',
    census_year=2021
)

# Print the first few rows of the shapefile
print(shapefile.head())
```
FetchWebsiteText
The FetchWebsiteText function fetches the text from a website and saves it to a text file.
```python
# Import the function
from analysistoolbox.data_collection import FetchWebsiteText

# Call the function
text = FetchWebsiteText(
    url="https://www.example.com",
    browserless_api_key="your_browserless_api_key"
)

# Print the fetched text
print(text)
```
GetCompanyFilings
The GetCompanyFilings function fetches company filings from the SEC EDGAR database. It returns a list of filings for a given company CIK (Central Index Key) and filing type.
```python
# Import the function
from analysistoolbox.data_collection import GetCompanyFilings

# Call the function to get company filings for 'Online Dating' companies in 2024
results = GetCompanyFilings(
    search_keywords="Online Dating",
    start_date="2024-01-01",
    end_date="2024-12-31",
    filing_type="all",
)

# Print the results
print(results)
```
GetGoogleSearchResults
The GetGoogleSearchResults function fetches Google search results for a given query using the Serper API.
```python
# Import the function
from analysistoolbox.data_collection import GetGoogleSearchResults

# Call the function with the query
# Make sure to replace 'your_serper_api_key' with your actual Serper API key
results = GetGoogleSearchResults(
    query="Python programming",
    serper_api_key='your_serper_api_key',
    number_of_results=5,
    apply_autocorrect=True,
    display_results=True
)

# Print the results
print(results)
```
GetZipFile
The GetZipFile function downloads a zip file from a URL and saves it to a specified folder. It can also unzip the file and print the contents of the zip file.
```python
# Import the function
from analysistoolbox.data_collection import GetZipFile

# Call the function
GetZipFile(
    url="http://example.com/file.zip",
    path_to_save_folder="/path/to/save/folder"
)
```
Data Processing
AddDateNumberColumns
The AddDateNumberColumns function adds columns for the year, month, quarter, week, day of the month, and day of the week to a dataframe.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddDateNumberColumns
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {'Date': [datetime(2020, 1, 1), datetime(2020, 2, 1), datetime(2020, 3, 1), datetime(2020, 4, 1)]}
df = pd.DataFrame(data)

# Use the function on the sample dataframe
df = AddDateNumberColumns(
    dataframe=df,
    date_column_name='Date'
)

# Print the updated dataframe
print(df)
```
AddLeadingZeros
The AddLeadingZeros function adds leading zeros to a column. If fixed_length is not specified, the length of the longest string in the column is used as the fixed length. If add_as_new_column is set to True, the new column is added to the dataframe. Otherwise, the original column is updated.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddLeadingZeros
import pandas as pd

# Create a sample dataframe
data = {'ID': [1, 23, 456, 7890]}
df = pd.DataFrame(data)

# Use the AddLeadingZeros function
df = AddLeadingZeros(
    dataframe=df,
    column_name='ID',
    add_as_new_column=True
)

# Print updated dataframe
print(df)
```
AddRowCountColumn
The AddRowCountColumn function adds a column to a dataframe that contains the row number for each row, based on a group (or groups) of columns. The function can also sort the dataframe by a column or columns before adding the row count column.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddRowCountColumn
import pandas as pd

# Create a sample dataframe
data = {
    'Payment Method': ['Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card'],
    'Transaction Value': [100, 200, 300, 400, 500, 600, 700, 800],
    'Transaction Order': [1, 2, 3, 4, 5, 6, 7, 8]
}
df = pd.DataFrame(data)

# Call the function
df_updated = AddRowCountColumn(
    dataframe=df,
    list_of_grouping_variables=['Payment Method'],
    list_of_order_columns=['Transaction Order'],
    list_of_ascending_order_args=[True]
)

# Print the updated dataframe
print(df_updated)
```
AddTPeriodColumn
The AddTPeriodColumn function adds a T-period column to a dataframe. The T-period column is the number of intervals (e.g., days or weeks) since the earliest date in the dataframe.
```python
# Import necessary libraries
from analysistoolbox.data_processing import AddTPeriodColumn
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {
    'date': pd.date_range(start='1/1/2020', end='1/10/2020'),
    'value': range(1, 11)
}
df = pd.DataFrame(data)

# Use the function
df_updated = AddTPeriodColumn(
    dataframe=df,
    date_column_name='date',
    t_period_interval='days'
)

# Print the updated dataframe
print(df_updated)
```
AddTukeyOutlierColumn
The AddTukeyOutlierColumn function adds a column to a dataframe that indicates whether a value is an outlier. The function uses the Tukey method to identify outliers.
```python
# Import necessary libraries
from analysistoolbox.data_processing import AddTukeyOutlierColumn
import pandas as pd

# Create a sample dataframe
data = pd.DataFrame({'values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]})

# Use the function
df_updated = AddTukeyOutlierColumn(
    dataframe=data,
    value_column_name='values',
    tukey_boundary_multiplier=1.5,
    plot_tukey_outliers=True
)

# Print the updated dataframe
print(df_updated)
```
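For reference, the Tukey fences themselves are simple to compute with plain pandas; this sketch is independent of the package and flags values beyond 1.5 × IQR from the quartiles:
```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])

# Tukey's method: flag values beyond 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[is_outlier])  # 20 is flagged as an outlier
```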
CleanTextColumns
The CleanTextColumns function cleans string-type columns in a pandas DataFrame by removing all leading and trailing spaces.
```python
# Import necessary libraries
from analysistoolbox.data_processing import CleanTextColumns
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [' hello', 'world ', ' python '],
    'B': [1, 2, 3],
})

# Clean the dataframe
df_clean = CleanTextColumns(df)
```
ConductAnomalyDetection
The ConductAnomalyDetection function performs anomaly detection on a given dataset using the z-score method.
```python
# Import necessary libraries
from analysistoolbox.data_processing import ConductAnomalyDetection
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 1000],
    'B': [4, 5, 6, 2000],
})

# Conduct anomaly detection
df_anomaly_detected = ConductAnomalyDetection(
    dataframe=df,
    list_of_columns_to_analyze=['A', 'B']
)

# Print the updated dataframe
print(df_anomaly_detected)
```
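The underlying z-score idea can be reproduced with plain pandas; this is a sketch of the general technique, not the package's exact implementation (note that with only a few rows, an extreme value inflates the mean and standard deviation, so thresholds behave differently than on large data):
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 1000], 'B': [4, 5, 6, 2000]})

# Compute each value's z-score within its column
z_scores = (df - df.mean()) / df.std()

# Flag values whose absolute z-score exceeds a chosen threshold (often 2 or 3)
print(z_scores.abs() > 2)
```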
ConductEntityMatching
The ConductEntityMatching function performs entity matching between two dataframes using various fuzzy matching algorithms.
```python
# Import necessary packages
from analysistoolbox.data_processing import ConductEntityMatching
import pandas as pd

# Create two dataframes
dataframe_1 = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
dataframe_2 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Name': ['Jon Doe', 'Jane Smyth', 'Robert Johnson'],
    'City': ['NYC', 'LA', 'Chicago']
})

# Conduct entity matching
matched_entities = ConductEntityMatching(
    dataframe_1=dataframe_1,
    dataframe_1_primary_key='ID',
    dataframe_2=dataframe_2,
    dataframe_2_primary_key='ID',
    levenshtein_distance_filter=3,
    match_score_threshold=80,
    columns_to_compare=['Name', 'City'],
    match_methods=['Partial Token Set Ratio', 'Weighted Ratio']
)
```
ConvertOddsToProbability
The ConvertOddsToProbability function converts odds to probability in a new column.
```python
# Import necessary packages
from analysistoolbox.data_processing import ConvertOddsToProbability
import numpy as np
import pandas as pd

# Create a sample dataframe
data = {
    'Team': ['Team1', 'Team2', 'Team3', 'Team4'],
    'Odds': [2.5, 1.5, 3.0, np.nan]
}
df = pd.DataFrame(data)

# Print the original dataframe
print("Original DataFrame:")
print(df)

# Use the function to convert odds to probability
df = ConvertOddsToProbability(
    dataframe=df,
    odds_column='Odds'
)
```
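The standard conversion is probability = odds / (1 + odds), so odds of 2.5 correspond to a probability of about 0.714. A self-contained pandas equivalent (a sketch, independent of the package):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Odds': [2.5, 1.5, 3.0, np.nan]})

# probability = odds / (1 + odds); NaN odds stay NaN
df['Probability'] = df['Odds'] / (1 + df['Odds'])
print(df)
```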
CountMissingDataByGroup
The CountMissingDataByGroup function counts the number of records with missing data in a Pandas dataframe, grouped by specified columns.
```python
# Import necessary packages
from analysistoolbox.data_processing import CountMissingDataByGroup
import pandas as pd
import numpy as np

# Create a sample dataframe with some missing values
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, np.nan, 4, 5, np.nan],
    'Value2': [np.nan, 8, 9, 10, np.nan, 12]
}
df = pd.DataFrame(data)

# Use the function to count missing data by group
CountMissingDataByGroup(
    dataframe=df,
    list_of_grouping_columns=['Group']
)
```
CreateBinnedColumn
The CreateBinnedColumn function creates a new column in a Pandas dataframe based on a numeric variable. Binning is a process of transforming continuous numerical variables into discrete categorical 'bins'.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateBinnedColumn
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, 3, 4, 5, 6],
    'Value2': [7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Use the function to create a binned column
df_binned = CreateBinnedColumn(
    dataframe=df,
    numeric_column_name='Value1',
    number_of_bins=3,
    binning_strategy='uniform'
)
```
CreateDataOverview
The CreateDataOverview function creates an overview of a Pandas dataframe, including the data type, missing count, missing percentage, and summary statistics for each variable in the DataFrame.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateDataOverview
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Column1': [1, 2, 3, np.nan, 5, 6],
    'Column2': ['a', 'b', 'c', 'd', np.nan, 'f'],
    'Column3': [7.1, 8.2, 9.3, 10.4, np.nan, 12.5]
}
df = pd.DataFrame(data)

# Use the function to create an overview of the dataframe
CreateDataOverview(
    dataframe=df,
    plot_missingness=True
)
```
CreateRandomSampleGroups
The CreateRandomSampleGroups function takes a pandas DataFrame, shuffles its rows, assigns each row to one of n groups, and returns the updated DataFrame with an additional column indicating the group number.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateRandomSampleGroups
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 31, 35, 19, 45],
    'Score': [85, 95, 78, 81, 92]
}
df = pd.DataFrame(data)

# Use the function
grouped_df = CreateRandomSampleGroups(
    dataframe=df,
    number_of_groups=2,
    random_seed=123
)
```
CreateRareCategoryColumn
The CreateRareCategoryColumn function creates a new column in a Pandas dataframe that indicates whether a categorical variable value is rare. A rare category is a category that occurs less than a specified percentage of the time.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateRareCategoryColumn
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
updated_df = CreateRareCategoryColumn(
    dataframe=df,
    categorical_column_name='Name',
    rare_category_label='Rare',
    rare_category_threshold=0.05,
    new_column_suffix='(rare category)'
)
```
CreateStratifiedRandomSampleGroups
The CreateStratifiedRandomSampleGroups function performs stratified random sampling on a pandas DataFrame. Stratified random sampling is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateStratifiedRandomSampleGroups
import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
stratified_df = CreateStratifiedRandomSampleGroups(
    dataframe=df,
    number_of_groups=2,
    list_categorical_column_names=['Name'],
    random_seed=42
)
```
ImputeMissingValuesUsingNearestNeighbors
The ImputeMissingValuesUsingNearestNeighbors function imputes missing values in a dataframe using the nearest neighbors method. For each sample with missing values, it finds the n_neighbors nearest neighbors in the training set and imputes the missing values using the mean value of these neighbors.
```python
# Import necessary packages
from analysistoolbox.data_processing import ImputeMissingValuesUsingNearestNeighbors
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, np.nan]
}
df = pd.DataFrame(data)

# Use the function
imputed_df = ImputeMissingValuesUsingNearestNeighbors(
    dataframe=df,
    list_of_numeric_columns_to_impute=['A', 'B', 'C', 'D'],
    number_of_neighbors=2,
    averaging_method='uniform'
)
```
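scikit-learn's KNNImputer provides the same nearest-neighbors imputation directly; here is a minimal sketch for comparison (not necessarily the package's exact implementation):
```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
})

# Impute each missing value from the 2 nearest rows, equally weighted
imputer = KNNImputer(n_neighbors=2, weights='uniform')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```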
VerifyGranularity
The VerifyGranularity function checks the granularity of a given dataframe based on a list of key columns. Granularity in this context refers to the level of detail or summarization in a set of data.
```python
# Import necessary packages
from analysistoolbox.data_processing import VerifyGranularity
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
VerifyGranularity(
    dataframe=df,
    list_of_key_columns=['Name', 'Age'],
    set_key_as_index=True,
    print_as_markdown=False
)
```
Descriptive Analytics
ConductManifoldLearning
The ConductManifoldLearning function performs manifold learning on a given dataframe and returns a new dataframe with the original columns and the new manifold learning components. Manifold learning is a type of unsupervised learning that is used to reduce the dimensionality of the data.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductManifoldLearning
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Use the function
new_df = ConductManifoldLearning(
    dataframe=iris_df,
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_components=2,
    random_seed=42,
    show_component_summary_plots=True,
    sns_color_palette='Set2',
    summary_plot_size=(10, 10)
)
```
ConductPrincipalComponentAnalysis
The ConductPrincipalComponentAnalysis function performs Principal Component Analysis (PCA) on a given dataframe. PCA is a technique used in machine learning to reduce the dimensionality of data while retaining as much information as possible.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductPrincipalComponentAnalysis
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the function
result = ConductPrincipalComponentAnalysis(
    dataframe=iris_df,
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_components=2
)
```
ConductPropensityScoreMatching
Conducts propensity score matching to create balanced treatment and control groups for causal inference analysis.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductPropensityScoreMatching
import pandas as pd

# Create matched groups based on age, education, and experience
matched_df = ConductPropensityScoreMatching(
    dataframe=df,
    subject_id_column_name='employee_id',
    list_of_column_names_to_base_matching=['age', 'education', 'years_experience'],
    grouping_column_name='received_training',
    control_group_name='No',
    max_matches_per_subject=1,
    balance_groups=True,
    propensity_score_column_name="PSScore",
    matched_id_column_name="MatchedEmployeeID",
    random_seed=412
)
```
CreateAssociationRules
The CreateAssociationRules function creates association rules from a given dataframe. Association rules are widely used in market basket analysis, where the goal is to find associations and/or correlations among a set of items.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateAssociationRules
import pandas as pd

# Assuming you have a dataframe 'df' with 'TransactionID' and 'Item' columns
result = CreateAssociationRules(
    dataframe=df,
    transaction_id_column='TransactionID',
    items_column='Item',
    support_threshold=0.01,
    confidence_threshold=0.2,
    plot_lift=True,
    plot_title='Association Rules',
    plot_size=(10, 7)
)
```
CreateGaussianMixtureClusters
The CreateGaussianMixtureClusters function creates Gaussian mixture clusters from a given dataframe. Gaussian mixture models are a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe, and also calculates the probability of each data point belonging to each cluster.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateGaussianMixtureClusters
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

# Call the CreateGaussianMixtureClusters function
df_clustered = CreateGaussianMixtureClusters(
    dataframe=df,
    list_of_numeric_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Gaussian Mixture Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    sns_color_palette='Set2',
    summary_plot_size=(15, 15),
    random_seed=123,
    maximum_iterations=200
)
```
CreateHierarchicalClusters
The CreateHierarchicalClusters function creates hierarchical clusters from a given dataframe. Hierarchical clustering is a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateHierarchicalClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateHierarchicalClusters function
df_clustered = CreateHierarchicalClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Hierarchical Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)
```
CreateKMeansClusters
The CreateKMeansClusters function performs K-Means clustering on a given dataset and returns the dataset with an additional column indicating the cluster each record belongs to.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateKMeansClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateKMeansClusters function
df_clustered = CreateKMeansClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='K-Means Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)
```
GenerateEDAWithLIDA
The GenerateEDAWithLIDA function uses the LIDA package from Microsoft to generate exploratory data analysis (EDA) goals.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import GenerateEDAWithLIDA
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the GenerateEDAWithLIDA function
df_summary = GenerateEDAWithLIDA(
    dataframe=df,
    llm_api_key="your_llm_api_key_here",
    llm_provider="openai",
    llm_model="gpt-3.5-turbo",
    visualization_library="seaborn",
    goal_temperature=0.50,
    code_generation_temperature=0.05,
    data_summary_method="llm",
    number_of_samples_to_show_in_summary=5,
    return_data_fields_summary=True,
    number_of_goals_to_generate=5,
    plot_recommended_visualization=True,
    show_code_for_recommended_visualization=True
)
```
File Management
ImportDataFromFolder
The ImportDataFromFolder function imports all CSV and Excel files from a specified folder and combines them into a single DataFrame. It ensures that column names match across all files if specified.
```python
# Import necessary packages
from analysistoolbox.file_management import ImportDataFromFolder

# Specify the folder path
folder_path = "path/to/your/folder"

# Call the ImportDataFromFolder function
combined_df = ImportDataFromFolder(
    folder_path=folder_path,
    force_column_names_to_match=True
)
```
CreateFileTree
The CreateFileTree function recursively walks a directory tree and prints a diagram of all the subdirectories and files.
```python
# Import necessary packages
from analysistoolbox.file_management import CreateFileTree

# Specify the directory path
directory_path = "path/to/your/directory"

# Call the CreateFileTree function
CreateFileTree(
    path=directory_path,
    indent_spaces=2
)
```
CreateCopyOfPDF
The CreateCopyOfPDF function creates a copy of a PDF file, with options to specify the start and end pages.
```python
# Import necessary packages
from analysistoolbox.file_management import CreateCopyOfPDF

# Specify the input and output file paths
input_pdf = "path/to/input.pdf"
output_pdf = "path/to/output.pdf"

# Call the CreateCopyOfPDF function
CreateCopyOfPDF(
    input_file=input_pdf,
    output_file=output_pdf,
    start_page=1,
    end_page=5
)
```
ConvertWordDocsToPDF
The ConvertWordDocsToPDF function converts all Word documents in a specified folder to PDF format.
```python
# Import necessary packages
from analysistoolbox.file_management import ConvertWordDocsToPDF

# Specify the folder paths
word_folder = "path/to/word/documents"
pdf_folder = "path/to/save/pdf/documents"

# Call the ConvertWordDocsToPDF function
ConvertWordDocsToPDF(
    word_folder_path=word_folder,
    pdf_folder_path=pdf_folder,
    open_each_doc=False
)
```
Hypothesis Testing
ChiSquareTestOfIndependence
The ChiSquareTestOfIndependence function performs a chi-square test of independence to determine if there is a significant relationship between two categorical variables.
```python
# Import necessary packages
from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependence
import pandas as pd

# Create sample data
data = {
    'Education': ['High School', 'College', 'High School', 'Graduate', 'College'],
    'Employment': ['Employed', 'Unemployed', 'Employed', 'Employed', 'Unemployed']
}
df = pd.DataFrame(data)

# Conduct chi-square test
ChiSquareTestOfIndependence(
    dataframe=df,
    first_categorical_column='Education',
    second_categorical_column='Employment',
    plot_contingency_table=True
)
```
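For reference, the same test is available in SciPy via `chi2_contingency`; this sketch is independent of the package:
```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'Education': ['High School', 'College', 'High School', 'Graduate', 'College'],
    'Employment': ['Employed', 'Unemployed', 'Employed', 'Employed', 'Unemployed']
})

# Build the contingency table, then run the chi-square test
table = pd.crosstab(df['Education'], df['Employment'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```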
ChiSquareTestOfIndependenceFromTable
The ChiSquareTestOfIndependenceFromTable function performs a chi-square test using a pre-computed contingency table.
```python
# Import necessary packages
from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependenceFromTable
import pandas as pd

# Create contingency table
contingency_table = pd.DataFrame({
    'Online': [100, 150],
    'In-Store': [200, 175]
}, index=['Male', 'Female'])

# Conduct chi-square test
ChiSquareTestOfIndependenceFromTable(
    contingency_table=contingency_table,
    plot_contingency_table=True
)
```
ConductCoxProportionalHazardRegression
The ConductCoxProportionalHazardRegression function performs survival analysis using Cox Proportional Hazard regression.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductCoxProportionalHazardRegression

# Conduct Cox regression
model = ConductCoxProportionalHazardRegression(
    dataframe=df,
    outcome_column='event',
    duration_column='time',
    list_of_predictor_columns=['age', 'sex', 'treatment'],
    plot_survival_curve=True
)
```
ConductLinearRegressionAnalysis
The ConductLinearRegressionAnalysis function performs linear regression analysis with optional plotting.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductLinearRegressionAnalysis

# Conduct linear regression
results = ConductLinearRegressionAnalysis(
    dataframe=df,
    outcome_column='sales',
    list_of_predictor_columns=['advertising', 'price'],
    plot_regression_diagnostic=True
)
```
ConductLogisticRegressionAnalysis
The ConductLogisticRegressionAnalysis function performs logistic regression for binary outcomes.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductLogisticRegressionAnalysis

# Conduct logistic regression
results = ConductLogisticRegressionAnalysis(
    dataframe=df,
    outcome_column='purchased',
    list_of_predictor_columns=['age', 'income'],
    plot_regression_diagnostic=True
)
```
OneSampleTTest
The OneSampleTTest function performs a one-sample t-test to compare a sample mean to a hypothesized population mean.
```python
# Import the function
from analysistoolbox.hypothesis_testing import OneSampleTTest

# Conduct one-sample t-test
OneSampleTTest(
    dataframe=df,
    outcome_column='score',
    hypothesized_mean=70,
    alternative_hypothesis='two-sided',
    confidence_interval=0.95
)
```
OneWayANOVA
The OneWayANOVA function performs a one-way analysis of variance to compare means across multiple groups.
```python
# Import the function
from analysistoolbox.hypothesis_testing import OneWayANOVA

# Conduct one-way ANOVA
OneWayANOVA(
    dataframe=df,
    outcome_column='performance',
    grouping_column='treatment_group',
    plot_sample_distributions=True
)
```
TTestOfMeanFromStats
The TTestOfMeanFromStats function performs a t-test using summary statistics rather than raw data.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfMeanFromStats

# Conduct t-test from statistics
TTestOfMeanFromStats(
    sample_mean=75,
    sample_size=30,
    sample_standard_deviation=10,
    hypothesized_mean=70,
    alternative_hypothesis='greater'
)
```
TTestOfProportionFromStats
The TTestOfProportionFromStats function tests a sample proportion against a hypothesized value.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfProportionFromStats

# Test proportion from statistics
TTestOfProportionFromStats(
    sample_proportion=0.65,        # 65% proportion
    sample_size=200,               # 200 survey responses
    hypothesized_proportion=0.50,
    alternative_hypothesis='two-sided'
)
```
TTestOfTwoMeansFromStats
The TTestOfTwoMeansFromStats function compares two means using summary statistics.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfTwoMeansFromStats

# Compare two means from statistics
TTestOfTwoMeansFromStats(
    first_sample_mean=75,
    first_sample_size=30,
    first_sample_standard_deviation=10,
    second_sample_mean=70,
    second_sample_size=30,
    second_sample_standard_deviation=12
)
```
TwoSampleTTestOfIndependence
The TwoSampleTTestOfIndependence function performs an independent samples t-test to compare means between two groups.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TwoSampleTTestOfIndependence

# Conduct independent samples t-test
TwoSampleTTestOfIndependence(
    dataframe=df,
    outcome_column='score',
    grouping_column='group',
    alternative_hypothesis='two-sided',
    homogeneity_of_variance=True
)
```
TwoSampleTTestPaired
The TwoSampleTTestPaired function performs a paired samples t-test for before-after comparisons.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TwoSampleTTestPaired

# Conduct paired samples t-test
TwoSampleTTestPaired(
    dataframe=df,
    first_outcome_column='pre_score',
    second_outcome_column='post_score',
    alternative_hypothesis='greater'
)
```
Linear Algebra
CalculateEigenvalues
The CalculateEigenvalues function calculates and visualizes the eigenvalues and eigenvectors of a matrix.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import CalculateEigenvalues
import numpy as np

# Create a 2x2 matrix
matrix = np.array([
    [4, -2],
    [1, 1]
])

# Calculate eigenvalues and eigenvectors
CalculateEigenvalues(
    matrix=matrix,
    plot_eigenvectors=True,
    plot_transformation=True
)
```
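The computation itself is a one-liner in NumPy; a sketch independent of the package:
```python
import numpy as np

matrix = np.array([[4, -2], [1, 1]])

# Each column of eigenvectors corresponds to the eigenvalue at the same index
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print(eigenvalues)   # eigenvalues 3 and 2 (order may vary)
print(eigenvectors)
```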
ConvertMatrixToRowEchelonForm
The ConvertMatrixToRowEchelonForm function converts a matrix to row echelon form using Gaussian elimination.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import ConvertMatrixToRowEchelonForm
import numpy as np

# Create a matrix
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Convert to row echelon form
row_echelon = ConvertMatrixToRowEchelonForm(
    matrix=matrix,
    show_pivot_columns=True
)
```
ConvertSystemOfEquationsToMatrix
The ConvertSystemOfEquationsToMatrix function converts a system of linear equations to matrix form.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import ConvertSystemOfEquationsToMatrix
import numpy as np

# Define system of equations:
#   2x + 3y = 8
#   4x -  y = 1
coefficients = np.array([
    [2, 3],
    [4, -1]
])
constants = np.array([8, 1])

# Convert to matrix form
matrix = ConvertSystemOfEquationsToMatrix(
    coefficients=coefficients,
    constants=constants,
    show_determinant=True
)
```
PlotVectors
The PlotVectors function visualizes vectors in 2D or 3D space.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import PlotVectors
import numpy as np

# Define vectors
vectors = [
    [3, 2],    # First vector
    [-1, 4],   # Second vector
    [2, -3]    # Third vector
]

# Plot vectors
PlotVectors(
    list_of_vectors=vectors,
    origin=[0, 0],
    plot_sum=True,
    grid=True
)
```
SolveSystemOfEquations
The SolveSystemOfEquations function solves a system of linear equations and optionally visualizes the solution.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import SolveSystemOfEquations
import numpy as np

# Define system of equations:
#   2x +  y = 5
#    x - 3y = -1
coefficients = np.array([
    [2, 1],
    [1, -3]
])
constants = np.array([5, -1])

# Solve the system
solution = SolveSystemOfEquations(
    coefficients=coefficients,
    constants=constants,
    show_plot=True,
    plot_boundary=10
)
```
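NumPy solves the same system directly; a sketch independent of the package:
```python
import numpy as np

coefficients = np.array([[2, 1], [1, -3]])
constants = np.array([5, -1])

# Solve Ax = b; here x = 2, y = 1
solution = np.linalg.solve(coefficients, constants)
print(solution)  # [2. 1.]
```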
VisualizeMatrixAsLinearTransformation
The VisualizeMatrixAsLinearTransformation function visualizes how a matrix transforms space as a linear transformation.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import VisualizeMatrixAsLinearTransformation
import numpy as np

# Define transformation matrix
transformation_matrix = np.array([
    [2, -1],
    [1, 1]
])

# Visualize the transformation
VisualizeMatrixAsLinearTransformation(
    transformation_matrix=transformation_matrix,
    plot_grid=True,
    plot_unit_vectors=True,
    animation_frames=30
)
```
LLM
SendPromptToAnthropic
The SendPromptToAnthropic function sends a prompt to Anthropic's Claude API using LangChain. It supports template-based prompting and requires an Anthropic API key.
```python
# Import the function
from analysistoolbox.llm import SendPromptToAnthropic

# Define your prompt template with variables in curly braces
prompt_template = "Given the text: {text}\nSummarize the main points in bullet form."

# Create a dictionary with your input variables
user_input = {
    "text": "Your text to analyze goes here..."
}

# Send the prompt to Claude
response = SendPromptToAnthropic(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    anthropic_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="claude-3-opus-20240229",
    maximum_tokens=1000
)
print(response)
```
SendPromptToChatGPT
The SendPromptToChatGPT function sends a prompt to OpenAI's ChatGPT API using LangChain. It supports template-based prompting and requires an OpenAI API key.
```python
# Import the function
from analysistoolbox.llm import SendPromptToChatGPT

# Define your prompt template with variables in curly braces
prompt_template = "Analyze the following data: {data}\nProvide key insights."

# Create a dictionary with your input variables
user_input = {
    "data": "Your data to analyze goes here..."
}

# Send the prompt to ChatGPT
response = SendPromptToChatGPT(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    openai_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="gpt-4o-mini",
    maximum_tokens=1000
)
print(response)
```
Predictive Analytics
CreateARIMAModel
Builds an ARIMA (Autoregressive Integrated Moving Average) model for time series forecasting.
```python
# Import necessary packages
from analysistoolbox.predictive_analytics import CreateARIMAModel
import pandas as pd

# Create time series forecast
forecast = CreateARIMAModel(
    dataframe=df,
    time_column='date',
    value_column='sales',
    forecast_periods=12
)
```
CreateBoostedTreeModel
Creates a gradient boosted tree model for classification or regression tasks, offering high performance and feature importance analysis.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateBoostedTreeModel

# Train a boosted tree classifier
model = CreateBoostedTreeModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'tenure', 'satisfaction'],
    is_outcome_categorical=True,
    plot_model_test_performance=True
)
```
CreateDecisionTreeModel
Builds an interpretable decision tree for classification or regression, with visualization options.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateDecisionTreeModel

# Create a decision tree for predicting house prices
model = CreateDecisionTreeModel(
    dataframe=df,
    outcome_variable='price',
    list_of_predictor_variables=['sqft', 'bedrooms', 'location'],
    is_outcome_categorical=False,
    maximum_depth=5
)
```
CreateLinearRegressionModel
Fits a linear regression model with optional scaling and comprehensive performance visualization.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateLinearRegressionModel

# Predict sales based on advertising spend
model = CreateLinearRegressionModel(
    dataframe=df,
    outcome_variable='sales',
    list_of_predictor_variables=['tv_ads', 'radio_ads', 'newspaper_ads'],
    scale_variables=True,
    plot_model_test_performance=True
)
```
CreateLogisticRegressionModel
Implements logistic regression for binary classification tasks with regularization options.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateLogisticRegressionModel

# Predict customer churn probability
model = CreateLogisticRegressionModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'complaints', 'satisfaction'],
    scale_predictor_variables=True,
    show_classification_plot=True
)
```
CreateNeuralNetwork_SingleOutcome
Builds and trains a neural network for single-outcome prediction tasks, with customizable architecture.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateNeuralNetwork_SingleOutcome

# Create a neural network for image classification
model = CreateNeuralNetwork_SingleOutcome(
    dataframe=df,
    outcome_variable='label',
    list_of_predictor_variables=feature_columns,
    number_of_hidden_layers=3,
    is_outcome_categorical=True,
    plot_loss=True
)
```
Prescriptive Analytics
The prescriptive analytics module provides tools for making data-driven recommendations and decisions:
ConductLinearOptimization
Conducts linear optimization to find the optimal input values for a given output variable, with optional constraints.
```python
# Import necessary packages
import pandas as pd
from analysistoolbox.prescriptive_analytics.ConductLinearOptimization import ConductLinearOptimization

# Sample data
data = pd.DataFrame({
    'input1': [1, 2, 3, 4, 5],
    'input2': [2, 4, 6, 8, 10],
    'output': [10, 20, 30, 40, 50]
})

# Define constraints (optional)
constraints = {
    'input1': (0, 10),    # input1 between 0 and 10
    'input2': (None, 15)  # input2 maximum 15, no minimum
}

# Run optimization
results = ConductLinearOptimization(
    dataframe=data,
    output_variable='output',
    list_of_input_variables=['input1', 'input2'],
    optimization_type='maximize',
    input_constraints=constraints
)
```
CreateContentBasedRecommender
Builds a content-based recommendation system using neural networks to learn user and item embeddings.
```python
# Import necessary packages
from analysistoolbox.prescriptive_analytics import CreateContentBasedRecommender
import pandas as pd

# Create a movie recommendation system
recommender = CreateContentBasedRecommender(
    dataframe=movie_ratings_df,
    outcome_variable='rating',
    user_list_of_predictor_variables=['age', 'gender', 'occupation'],
    item_list_of_predictor_variables=['genre', 'year', 'director', 'budget'],
    user_number_of_hidden_layers=2,
    item_number_of_hidden_layers=2,
    number_of_recommendations=5,
    scale_variables=True,
    plot_loss=True
)
```
Probability
The probability module provides tools for working with probability distributions and statistical models:
ProbabilityOfAtLeastOne
Calculates and visualizes the probability of at least one event occurring in a series of independent trials.
```python
# Import the function
from analysistoolbox.probability import ProbabilityOfAtLeastOne

# Calculate probability of at least one defect in 10 products,
# given a 5% defect rate per product
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.05,
    number_of_events=10,
    format_as_percent=True,
    show_plot=True,
    risk_tolerance=0.20  # Highlight 20% risk threshold
)

# Calculate probability of at least one successful sale,
# given 30 customer interactions with a 15% success rate
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.15,
    number_of_events=30,
    format_as_percent=True,
    show_plot=True,
    title_for_plot="Sales Success Probability",
    subtitle_for_plot="Probability of at least one sale in 30 customer interactions"
)
```
Simulations
The simulations module provides a comprehensive set of tools for statistical simulations and probability distributions:
CreateMetalogDistribution
Creates a flexible metalog distribution from data, useful for modeling complex probability distributions.
```python
# Import the function
from analysistoolbox.simulations import CreateMetalogDistribution

# Create a metalog distribution from historical data
distribution = CreateMetalogDistribution(
    dataframe=df,
    variable='sales',
    lower_bound=0,
    number_of_samples=10000,
    plot_metalog_distribution=True
)
```
CreateMetalogDistributionFromPercentiles
Builds a metalog distribution from known percentile values.
```python
# Import the function
from analysistoolbox.simulations import CreateMetalogDistributionFromPercentiles

# Create distribution from percentiles
distribution = CreateMetalogDistributionFromPercentiles(
    list_of_values=[10, 20, 30, 50],
    list_of_percentiles=[0.1, 0.25, 0.75, 0.9],
    lower_bound=0,
    show_distribution_plot=True
)
```
CreateSIPDataframe
Generates a SIP (Stochastic Information Packet) dataframe of indexed percentiles for uncertainty analysis.
```python
# Import the function
from analysistoolbox.simulations import CreateSIPDataframe

# Create SIP dataframe for risk analysis
sip_df = CreateSIPDataframe(
    number_of_percentiles=10,
    number_of_trials=1000
)
```
CreateSLURPDistribution
Creates a SIP with relationships preserved (SLURP) based on a linear regression model's prediction interval.
```python
# Import the function
from analysistoolbox.simulations import CreateSLURPDistribution

# Create a SLURP distribution from a linear regression model
slurp_dist = CreateSLURPDistribution(
    linear_regression_model=model,            # statsmodels regression model
    list_of_prediction_values=[x1, x2, ...],  # values for predictors
    number_of_trials=10000,                   # number of samples to generate
    prediction_interval=0.95,                 # confidence level for prediction interval
    lower_bound=None,                         # optional lower bound constraint
    upper_bound=None                          # optional upper bound constraint
)
```
SimulateCountOfSuccesses
Simulates binomial outcomes (number of successes in fixed trials).
```python
# Import the function
from analysistoolbox.simulations import SimulateCountOfSuccesses

# Simulate customer conversion rates
results = SimulateCountOfSuccesses(
    probability_of_success=0.15,
    sample_size_per_trial=100,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
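Equivalent draws can be generated with NumPy's binomial sampler; a sketch independent of the package:
```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 10,000 trials of 100 attempts each, with a 15% success probability
successes = rng.binomial(n=100, p=0.15, size=10_000)
print(successes.mean())  # close to the expected count of 15
```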
SimulateCountOutcome
Simulates Poisson-distributed count data.
```python
# Import the function
from analysistoolbox.simulations import SimulateCountOutcome

# Simulate daily customer arrivals
arrivals = SimulateCountOutcome(
    expected_count=25,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateCountUntilFirstSuccess
Simulates geometric distributions (trials until first success).
```python
# Import the function
from analysistoolbox.simulations import SimulateCountUntilFirstSuccess

# Simulate number of attempts until success
attempts = SimulateCountUntilFirstSuccess(
    probability_of_success=0.2,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateNormallyDistributedOutcome
Generates normally distributed random variables.
```python
# Import the function
from analysistoolbox.simulations import SimulateNormallyDistributedOutcome

# Simulate product weights
weights = SimulateNormallyDistributedOutcome(
    mean=100,
    standard_deviation=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTDistributedOutcome
Generates Student's t-distributed random variables.
```python
# Import the function
from analysistoolbox.simulations import SimulateTDistributedOutcome

# Simulate with heavy-tailed distribution
values = SimulateTDistributedOutcome(
    degrees_of_freedom=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTimeBetweenEvents
Simulates exponentially distributed inter-arrival times.
```python
# Import the function
from analysistoolbox.simulations import SimulateTimeBetweenEvents

# Simulate time between customer arrivals
times = SimulateTimeBetweenEvents(
    average_time_between_events=30,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTimeUntilNEvents
Simulates Erlang-distributed waiting times.
```python
# Import the function
from analysistoolbox.simulations import SimulateTimeUntilNEvents

# Simulate time until 5 events occur
wait_time = SimulateTimeUntilNEvents(
    average_time_between_events=10,
    number_of_events=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
Statistics
The statistics module provides essential tools for statistical inference and estimation:
CalculateConfidenceIntervalOfMean
Calculates confidence intervals for population means, automatically handling both large (z-distribution) and small (t-distribution) sample sizes.
```python
# Import the function
from analysistoolbox.statistics import CalculateConfidenceIntervalOfMean

# Calculate 95% confidence interval for average customer spending
ci_results = CalculateConfidenceIntervalOfMean(
    sample_mean=45.2,
    sample_standard_deviation=12.5,
    sample_size=100,
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Average Spending ($)"
)
```
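For comparison, the same interval can be computed from summary statistics with SciPy; a sketch independent of the package:
```python
import math
from scipy import stats

sample_mean, sample_sd, n = 45.2, 12.5, 100

# t-based 95% confidence interval for the mean
standard_error = sample_sd / math.sqrt(n)
lower, upper = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```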
CalculateConfidenceIntervalOfProportion
Calculates confidence intervals for population proportions, with automatic selection of the appropriate distribution based on sample size.
```python
# Import the function
from analysistoolbox.statistics import CalculateConfidenceIntervalOfProportion

# Calculate 95% confidence interval for customer satisfaction rate
ci_results = CalculateConfidenceIntervalOfProportion(
    sample_proportion=0.78,     # 78% satisfaction rate
    sample_size=200,            # 200 survey responses
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Satisfaction Rate"
)
```
Visualizations
The visualizations module provides a comprehensive set of tools for creating publication-quality statistical plots and charts:
Plot100PercentStackedBarChart
Creates a 100% stacked bar chart for comparing proportional compositions across categories.
```python
# Import the function
from analysistoolbox.visualizations import Plot100PercentStackedBarChart

# Create a stacked bar chart showing customer segments by region
chart = Plot100PercentStackedBarChart(
    dataframe=df,
    categorical_column_name='Region',
    value_column_name='Customers',
    grouping_column_name='Segment'
)
```
PlotBarChart
Creates a customizable bar chart with options for highlighting specific categories.
```python
# Import the function
from analysistoolbox.visualizations import PlotBarChart

# Create a bar chart of sales by product
chart = PlotBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    top_n_to_highlight=3,
    highlight_color="#b0170c"
)
```
PlotBoxWhiskerByGroup
Creates box-and-whisker plots for comparing distributions across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotBoxWhiskerByGroup

# Compare salary distributions across departments
plot = PlotBoxWhiskerByGroup(
    dataframe=df,
    value_column_name='Salary',
    grouping_column_name='Department'
)
```
PlotBulletChart
Creates bullet charts for comparing actual values against targets with optional range bands.
```python
# Import the function
from analysistoolbox.visualizations import PlotBulletChart

# Create bullet chart comparing actual vs target sales
chart = PlotBulletChart(
    dataframe=df,
    value_column_name='ActualSales',
    grouping_column_name='Region',
    target_value_column_name='TargetSales',
    list_of_limit_columns=['MinSales', 'MaxSales']
)
```
PlotCard
Creates a simple card-style visualization with a value and an optional value label.
```python
# Import the function
from analysistoolbox.visualizations import PlotCard

# Create a simple KPI card
card = PlotCard(
    value=125000,                   # main value to display
    value_label="Monthly Revenue",  # optional label
    value_font_size=30,             # size of the main value
    value_label_font_size=14,       # size of the label
    figure_size=(3, 2)              # dimensions of the card
)
```
PlotClusteredBarChart
Creates grouped bar charts for comparing multiple categories across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotClusteredBarChart

# Create clustered bar chart of sales by product and region
chart = PlotClusteredBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    grouping_column_name='Region'
)
```
PlotContingencyHeatmap
Creates a heatmap visualization of contingency tables.
```python
# Import the function
from analysistoolbox.visualizations import PlotContingencyHeatmap

# Create heatmap of customer segments vs purchase categories
heatmap = PlotContingencyHeatmap(
    dataframe=df,
    categorical_column_name_1='CustomerSegment',
    categorical_column_name_2='PurchaseCategory',
    normalize_by="columns"
)
```
PlotCorrelationMatrix
Creates correlation matrix visualizations with optional scatter plots.
```python
# Import the function
from analysistoolbox.visualizations import PlotCorrelationMatrix

# Create correlation matrix of numeric variables
matrix = PlotCorrelationMatrix(
    dataframe=df,
    list_of_value_column_names=['Age', 'Income', 'Spending'],
    show_as_pairplot=True
)
```
PlotDensityByGroup
Creates density plots for comparing distributions across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotDensityByGroup

# Compare age distributions across customer segments
plot = PlotDensityByGroup(
    dataframe=df,
    value_column_name='Age',
    grouping_column_name='Customer_Segment'
)
```
PlotDotPlot
Creates dot plots with optional connecting lines between groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotDotPlot

# Compare before/after measurements
plot = PlotDotPlot(
    dataframe=df,
    categorical_column_name='Metric',
    value_column_name='Value',
    group_column_name='TimePeriod',
    connect_dots=True
)
```
PlotHeatmap
Creates customizable heatmaps for visualizing two-dimensional data.
```python
# Import the function
from analysistoolbox.visualizations import PlotHeatmap

# Create heatmap of customer activity by hour and day
heatmap = PlotHeatmap(
    dataframe=df,
    x_axis_column_name='Hour',
    y_axis_column_name='Day',
    value_column_name='Activity',
    color_palette="RdYlGn"
)
```
PlotOverlappingAreaChart
Creates stacked or overlapping area charts for time series data.
```python
# Import the function
from analysistoolbox.visualizations import PlotOverlappingAreaChart

# Show product sales trends over time
chart = PlotOverlappingAreaChart(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    variable_column_name='Product'
)
```
PlotRiskTolerance
Creates specialized plots for risk analysis and tolerance visualization.
```python
# Import the function
from analysistoolbox.visualizations import PlotRiskTolerance

# Visualize risk tolerance levels
plot = PlotRiskTolerance(
    dataframe=df,
    value_column_name='RiskScore',
    tolerance_level_column_name='Tolerance'
)
```
PlotScatterplot
Creates scatter plots with optional trend lines and grouping.
```python
# Import the function
from analysistoolbox.visualizations import PlotScatterplot

# Create scatter plot of age vs income
plot = PlotScatterplot(
    dataframe=df,
    x_axis_column_name='Age',
    y_axis_column_name='Income',
    color_by_column_name='Education'
)
```
PlotSingleVariableCountPlot
Creates count plots for categorical variables.
```python
# Import the function
from analysistoolbox.visualizations import PlotSingleVariableCountPlot

# Show distribution of customer types
plot = PlotSingleVariableCountPlot(
    dataframe=df,
    categorical_column_name='CustomerType',
    top_n_to_highlight=2
)
```
PlotSingleVariableHistogram
Creates histograms for continuous variables.
```python
# Import the function
from analysistoolbox.visualizations import PlotSingleVariableHistogram

# Create histogram of transaction amounts
plot = PlotSingleVariableHistogram(
    dataframe=df,
    value_column_name='TransactionAmount',
    show_mean=True,
    show_median=True
)
```
PlotTimeSeries
Creates time series plots with optional grouping and marker sizes.
```python
# Import the function
from analysistoolbox.visualizations import PlotTimeSeries

# Plot monthly sales with grouping
plot = PlotTimeSeries(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    grouping_column_name='Region',     # optional grouping
    marker_size_column_name='Volume',  # optional markers
    line_color='#3269a8',
    figure_size=(8, 5)
)
```
RenderTableOne
Creates publication-ready summary statistics tables comparing variables across groups.
```python
# Import the function
from analysistoolbox.visualizations import RenderTableOne

# Create summary statistics table comparing age, education by department
table = RenderTableOne(
    dataframe=df,
    value_column_name='Age',            # outcome variable
    grouping_column_name='Department',  # grouping variable
    list_of_row_variables=['Education', 'Experience'],  # variables to compare
    table_format='html',                # output format
    show_p_value=True                   # include statistical tests
)
```
Contributions
Contributions to the analysistoolbox package are welcome! Please submit a pull request with your changes.
License
The analysistoolbox package is licensed under the GNU General Public License v3.0 (GPL-3.0). Read more about the license at https://www.gnu.org/licenses/gpl-3.0.html.
Owner
- Name: Kyle Protho
- Login: KyleProtho
- Kind: user
- Location: Pittsburgh, PA
- Website: https://www.linkedin.com/in/kyleprotho
- Twitter: kyle_protho
- Repositories: 1
- Profile: https://github.com/KyleProtho
Passionate about UX Design, intelligence, and analytics
GitHub Events
Total
- Push event: 24
Last Year
- Push event: 24
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kyle Protho | O****o@g****m | 334 |
| Kyle Protho | o****o@g****m | 160 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 7
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- KyleProtho (3)
Pull Request Authors
- KyleProtho (7)
Packages
- Total packages: 1
- Total downloads: 1,816 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 86
- Total maintainers: 1
pypi.org: analysistoolbox
A collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
- Homepage: https://github.com/KyleProtho/AnalysisToolBox/tree/master/analysistoolbox
- Documentation: https://analysistoolbox.readthedocs.io/
- License: MIT
- Latest release: 3.0.0 (published 6 months ago)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- Jinja2 ==3.1.2
- Levenshtein *
- PyPDF2 *
- beautifulsoup4 *
- folium *
- fuzzywuzzy *
- geopandas *
- lida *
- lifelines *
- mapclassify *
- matplotlib *
- mlxtend *
- numpy ==1.24.5
- openai *
- pandas *
- pygris *
- pymetalog *
- python-dotenv *
- pywin32 *
- requests *
- scikit-learn *
- scipy *
- seaborn *
- statsmodels *
- sympy *
- tableone *
- tensorflow *
- xgboost *
- yellowbrick *