https://github.com/kyleprotho/analysistoolbox
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.5%, to scientific vocabulary)
Repository
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Basic Info
- Host: GitHub
- Owner: KyleProtho
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Homepage: https://kyleprotho.github.io/AnalysisToolBox/
- Size: 2.32 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
Analysis Tool Box
Description
Analysis Tool Box (i.e. "analysistoolbox") is a collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
Getting Started
To install the package, run the following command:
```bash
pip install analysistoolbox
```
Visualizations are created using the matplotlib and seaborn libraries. While you can select whichever seaborn style you'd like, the following Seaborn style tends to produce the best-looking plots:
```python
import seaborn as sns

sns.set(
    style="white",
    font="Arial",
    context="paper"
)
```
Table of Contents / Usage
There are many modules in the analysistoolbox package, each with its own functions. The following is a list of the modules:
- Calculus
- Data Collection
- Data Processing
- AddDateNumberColumns
- AddLeadingZeros
- AddRowCountColumn
- AddTPeriodColumn
- AddTukeyOutlierColumn
- CleanTextColumns
- ConductAnomalyDetection
- ConductEntityMatching
- ConvertOddsToProbability
- CountMissingDataByGroup
- CreateBinnedColumn
- CreateDataOverview
- CreateRandomSampleGroups
- CreateRareCategoryColumn
- CreateStratifiedRandomSampleGroups
- ImputeMissingValuesUsingNearestNeighbors
- VerifyGranularity
- Descriptive Analytics
- File Management
- Hypothesis Testing
- ChiSquareTestOfIndependence
- ChiSquareTestOfIndependenceFromTable
- ConductCoxProportionalHazardRegression
- ConductLinearRegressionAnalysis
- ConductLogisticRegressionAnalysis
- OneSampleTTest
- OneWayANOVA
- TTestOfMeanFromStats
- TTestOfProportionFromStats
- TTestOfTwoMeansFromStats
- TwoSampleTTestOfIndependence
- TwoSampleTTestPaired
- Linear Algebra
- LLM
- Predictive Analytics
- Prescriptive Analytics
- Probability
- Simulations
- CreateMetalogDistribution
- CreateMetalogDistributionFromPercentiles
- CreateSIPDataframe
- CreateSLURPDistribution
- SimulateCountOfSuccesses
- SimulateCountOutcome
- SimulateCountUntilFirstSuccess
- SimulateNormallyDistributedOutcome
- SimulateTDistributedOutcome
- SimulateTimeBetweenEvents
- SimulateTimeUntilNEvents
- Statistics
- Visualizations
- Plot100PercentStackedBarChart
- PlotBarChart
- PlotBoxWhiskerByGroup
- PlotBulletChart
- PlotCard
- PlotClusteredBarChart
- PlotContingencyHeatmap
- PlotCorrelationMatrix
- PlotDensityByGroup
- PlotDotPlot
- PlotHeatmap
- PlotOverlappingAreaChart
- PlotRiskTolerance
- PlotScatterplot
- PlotSingleVariableCountPlot
- PlotSingleVariableHistogram
- PlotTimeSeries
- RenderTableOne
Calculus
FindDerivative
The FindDerivative function calculates the derivative of a given function. It uses the sympy library, a Python library for symbolic mathematics, to perform the differentiation. The function also has the capability to print the original function and its derivative, return the derivative function, and plot both the original function and its derivative.
```python
# Load the FindDerivative function from the Calculus submodule
from analysistoolbox.calculus import FindDerivative
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function
f_of_x = x**3 + 2*x**2 + 3*x + 4

# Use the FindDerivative function
FindDerivative(
    f_of_x,
    print_functions=True,
    return_derivative_function=True,
    plot_functions=True
)
```
FindLimitOfFunction
The FindLimitOfFunction function finds the limit of a function at a specific point and optionally plots the function and its tangent line at that point. The script uses the matplotlib and numpy libraries for plotting and numerical operations, respectively.
```python
# Import the necessary libraries
from analysistoolbox.calculus import FindLimitOfFunction
import numpy as np
import sympy

# Define a symbolic variable
x = sympy.symbols('x')

# Define a function (use sympy.sin, since x is a symbolic variable)
f_of_x = sympy.sin(x) / x

# Use the FindLimitOfFunction function
FindLimitOfFunction(
    f_of_x,
    point=0,
    step=0.01,
    plot_function=True,
    x_minimum=-10,
    x_maximum=10,
    n=1000,
    tangent_line_window=1
)
```
FindMinimumSquareLoss
The FindMinimumSquareLoss function calculates the minimum square loss between observed and predicted values. This function is often used in machine learning and statistics to measure the average squared difference between the actual and predicted outcomes.
```python
# Import the necessary libraries
from analysistoolbox.calculus import FindMinimumSquareLoss

# Define observed and predicted values
observed_values = [1, 2, 3, 4, 5]
predicted_values = [1.1, 1.9, 3.2, 3.7, 5.1]

# Use the FindMinimumSquareLoss function
minimum_square_loss = FindMinimumSquareLoss(
    observed_values,
    predicted_values,
    show_plot=True
)

# Print the minimum square loss
print(f"The minimum square loss is: {minimum_square_loss}")
```
PlotFunction
The PlotFunction function plots a mathematical function of x. It takes a lambda function as input and allows for customization of the plot.
```python
# Import the necessary libraries
from analysistoolbox.calculus import PlotFunction
import sympy

# Set x as a symbolic variable
x = sympy.symbols('x')

# Define the function to plot
f_of_x = lambda x: x**2

# Plot the function with default settings
PlotFunction(f_of_x)
```
Data Collection
ExtractTextFromPDF
The ExtractTextFromPDF function extracts text from a PDF file, cleans it, then saves it to a text file.
```python
# Import the function
from analysistoolbox.data_collection import ExtractTextFromPDF

# Call the function
ExtractTextFromPDF(
    filepath_to_pdf="/path/to/your/input.pdf",
    filepath_for_exported_text="/path/to/your/output.txt",
    start_page=1,
    end_page=None
)
```
FetchPDFFromURL
The FetchPDFFromURL function downloads a PDF file from a URL and saves it to a specified location.
```python
# Import the function
from analysistoolbox.data_collection import FetchPDFFromURL

# Call the function to download the PDF
FetchPDFFromURL(
    url="https://example.com/sample.pdf",
    filename="C:/folder/sample.pdf"
)
```
FetchUSShapefile
The FetchUSShapefile function fetches a geographical shapefile from the TIGER database of the U.S. Census Bureau.
```python
# Import the function
from analysistoolbox.data_collection import FetchUSShapefile

# Fetch the shapefile for the census tracts in Allegheny County, Pennsylvania, for the 2021 census year
shapefile = FetchUSShapefile(
    state='PA',
    county='Allegheny',
    geography='tract',
    census_year=2021
)

# Print the first few rows of the shapefile
print(shapefile.head())
```
FetchWebsiteText
The FetchWebsiteText function fetches the text from a website and saves it to a text file.
```python
# Import the function
from analysistoolbox.data_collection import FetchWebsiteText

# Call the function
text = FetchWebsiteText(
    url="https://www.example.com",
    browserless_api_key="your_browserless_api_key"
)

# Print the fetched text
print(text)
```
GetCompanyFilings
The GetCompanyFilings function fetches company filings from the SEC EDGAR database. It returns a list of filings for a given company CIK (Central Index Key) and filing type.
```python
# Import the function
from analysistoolbox.data_collection import GetCompanyFilings

# Call the function to get company filings for 'Online Dating' companies in 2024
results = GetCompanyFilings(
    search_keywords="Online Dating",
    start_date="2024-01-01",
    end_date="2024-12-31",
    filing_type="all",
)

# Print the results
print(results)
```
GetGoogleSearchResults
The GetGoogleSearchResults function fetches Google search results for a given query using the Serper API.
```python
# Import the function
from analysistoolbox.data_collection import GetGoogleSearchResults

# Call the function with the query
# Make sure to replace 'your_serper_api_key' with your actual Serper API key
results = GetGoogleSearchResults(
    query="Python programming",
    serper_api_key='your_serper_api_key',
    number_of_results=5,
    apply_autocorrect=True,
    display_results=True
)

# Print the results
print(results)
```
GetZipFile
The GetZipFile function downloads a zip file from a URL and saves it to a specified folder. It can also unzip the file and print the contents of the zip file.
```python
# Import the function
from analysistoolbox.data_collection import GetZipFile

# Call the function
GetZipFile(
    url="http://example.com/file.zip",
    path_to_save_folder="/path/to/save/folder"
)
```
Data Processing
AddDateNumberColumns
The AddDateNumberColumns function adds columns for the year, month, quarter, week, day of the month, and day of the week to a dataframe.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddDateNumberColumns
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {'Date': [datetime(2020, 1, 1), datetime(2020, 2, 1), datetime(2020, 3, 1), datetime(2020, 4, 1)]}
df = pd.DataFrame(data)

# Use the function on the sample dataframe
df = AddDateNumberColumns(
    dataframe=df,
    date_column_name='Date'
)

# Print the updated dataframe
print(df)
```
AddLeadingZeros
The AddLeadingZeros function adds leading zeros to a column. If fixed_length is not specified, the length of the longest string in the column is used as the fixed length. If add_as_new_column is set to True, the new column is added to the dataframe. Otherwise, the original column is updated.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddLeadingZeros
import pandas as pd

# Create a sample dataframe
data = {'ID': [1, 23, 456, 7890]}
df = pd.DataFrame(data)

# Use the AddLeadingZeros function
df = AddLeadingZeros(
    dataframe=df,
    column_name='ID',
    add_as_new_column=True
)

# Print updated dataframe
print(df)
```
AddRowCountColumn
The AddRowCountColumn function adds a column to a dataframe that contains the row number for each row, based on a group (or groups) of columns. The function can also sort the dataframe by a column or columns before adding the row count column.
```python
# Import necessary packages
from analysistoolbox.data_processing import AddRowCountColumn
import pandas as pd

# Create a sample dataframe
data = {
    'Payment Method': ['Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card', 'Check', 'Credit Card'],
    'Transaction Value': [100, 200, 300, 400, 500, 600, 700, 800],
    'Transaction Order': [1, 2, 3, 4, 5, 6, 7, 8]
}
df = pd.DataFrame(data)

# Call the function
df_updated = AddRowCountColumn(
    dataframe=df,
    list_of_grouping_variables=['Payment Method'],
    list_of_order_columns=['Transaction Order'],
    list_of_ascending_order_args=[True]
)

# Print the updated dataframe
print(df_updated)
```
AddTPeriodColumn
The AddTPeriodColumn function adds a T-period column to a dataframe. The T-period column is the number of intervals (e.g., days or weeks) since the earliest date in the dataframe.
```python
# Import necessary libraries
from analysistoolbox.data_processing import AddTPeriodColumn
from datetime import datetime
import pandas as pd

# Create a sample dataframe
data = {
    'date': pd.date_range(start='1/1/2020', end='1/10/2020'),
    'value': range(1, 11)
}
df = pd.DataFrame(data)

# Use the function
df_updated = AddTPeriodColumn(
    dataframe=df,
    date_column_name='date',
    t_period_interval='days'
)

# Print the updated dataframe
print(df_updated)
```
AddTukeyOutlierColumn
The AddTukeyOutlierColumn function adds a column to a dataframe that indicates whether a value is an outlier. The function uses the Tukey method to identify outliers.
```python
# Import necessary libraries
from analysistoolbox.data_processing import AddTukeyOutlierColumn
import pandas as pd

# Create a sample dataframe
data = pd.DataFrame({'values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]})

# Use the function
df_updated = AddTukeyOutlierColumn(
    dataframe=data,
    value_column_name='values',
    tukey_boundary_multiplier=1.5,
    plot_tukey_outliers=True
)

# Print the updated dataframe
print(df_updated)
```
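For reference, the Tukey fences themselves are simple to compute with plain pandas; this sketch is independent of the package and flags values beyond 1.5 × IQR from the quartiles:
```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])

# Tukey's method: flag values beyond 1.5 * IQR outside the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[is_outlier])  # 20 is flagged as an outlier
```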
CleanTextColumns
The CleanTextColumns function cleans string-type columns in a pandas DataFrame by removing all leading and trailing spaces.
```python
# Import necessary libraries
from analysistoolbox.data_processing import CleanTextColumns
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [' hello', 'world ', ' python '],
    'B': [1, 2, 3],
})

# Clean the dataframe
df_clean = CleanTextColumns(df)
```
ConductAnomalyDetection
The ConductAnomalyDetection function performs anomaly detection on a given dataset using the z-score method.
```python
# Import necessary libraries
from analysistoolbox.data_processing import ConductAnomalyDetection
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({
    'A': [1, 2, 3, 1000],
    'B': [4, 5, 6, 2000],
})

# Conduct anomaly detection
df_anomaly_detected = ConductAnomalyDetection(
    dataframe=df,
    list_of_columns_to_analyze=['A', 'B']
)

# Print the updated dataframe
print(df_anomaly_detected)
```
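The underlying z-score idea can be reproduced with plain pandas; this is a sketch of the general technique, not the package's exact implementation (note that with only a few rows, an extreme value inflates the mean and standard deviation, so thresholds behave differently than on large data):
```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 1000], 'B': [4, 5, 6, 2000]})

# Compute each value's z-score within its column
z_scores = (df - df.mean()) / df.std()

# Flag values whose absolute z-score exceeds a chosen threshold (often 2 or 3)
print(z_scores.abs() > 2)
```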
ConductEntityMatching
The ConductEntityMatching function performs entity matching between two dataframes using various fuzzy matching algorithms.
```python
# Import necessary packages
from analysistoolbox.data_processing import ConductEntityMatching
import pandas as pd

# Create two dataframes
dataframe_1 = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
dataframe_2 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Name': ['Jon Doe', 'Jane Smyth', 'Robert Johnson'],
    'City': ['NYC', 'LA', 'Chicago']
})

# Conduct entity matching
matched_entities = ConductEntityMatching(
    dataframe_1=dataframe_1,
    dataframe_1_primary_key='ID',
    dataframe_2=dataframe_2,
    dataframe_2_primary_key='ID',
    levenshtein_distance_filter=3,
    match_score_threshold=80,
    columns_to_compare=['Name', 'City'],
    match_methods=['Partial Token Set Ratio', 'Weighted Ratio']
)
```
ConvertOddsToProbability
The ConvertOddsToProbability function converts odds to probability in a new column.
```python
# Import necessary packages
from analysistoolbox.data_processing import ConvertOddsToProbability
import numpy as np
import pandas as pd

# Create a sample dataframe
data = {
    'Team': ['Team1', 'Team2', 'Team3', 'Team4'],
    'Odds': [2.5, 1.5, 3.0, np.nan]
}
df = pd.DataFrame(data)

# Print the original dataframe
print("Original DataFrame:")
print(df)

# Use the function to convert odds to probability
df = ConvertOddsToProbability(
    dataframe=df,
    odds_column='Odds'
)
```
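The standard conversion is probability = odds / (1 + odds), so odds of 2.5 correspond to a probability of about 0.714. A self-contained pandas equivalent (a sketch, independent of the package):
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Odds': [2.5, 1.5, 3.0, np.nan]})

# probability = odds / (1 + odds); NaN odds stay NaN
df['Probability'] = df['Odds'] / (1 + df['Odds'])
print(df)
```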
CountMissingDataByGroup
The CountMissingDataByGroup function counts the number of records with missing data in a Pandas dataframe, grouped by specified columns.
```python
# Import necessary packages
from analysistoolbox.data_processing import CountMissingDataByGroup
import pandas as pd
import numpy as np

# Create a sample dataframe with some missing values
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, np.nan, 4, 5, np.nan],
    'Value2': [np.nan, 8, 9, 10, np.nan, 12]
}
df = pd.DataFrame(data)

# Use the function to count missing data by group
CountMissingDataByGroup(
    dataframe=df,
    list_of_grouping_columns=['Group']
)
```
CreateBinnedColumn
The CreateBinnedColumn function creates a new column in a Pandas dataframe based on a numeric variable. Binning is a process of transforming continuous numerical variables into discrete categorical 'bins'.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateBinnedColumn
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Group': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value1': [1, 2, 3, 4, 5, 6],
    'Value2': [7, 8, 9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Use the function to create a binned column
df_binned = CreateBinnedColumn(
    dataframe=df,
    numeric_column_name='Value1',
    number_of_bins=3,
    binning_strategy='uniform'
)
```
CreateDataOverview
The CreateDataOverview function creates an overview of a Pandas dataframe, including the data type, missing count, missing percentage, and summary statistics for each variable in the DataFrame.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateDataOverview
import pandas as pd
import numpy as np

# Create a sample dataframe
data = {
    'Column1': [1, 2, 3, np.nan, 5, 6],
    'Column2': ['a', 'b', 'c', 'd', np.nan, 'f'],
    'Column3': [7.1, 8.2, 9.3, 10.4, np.nan, 12.5]
}
df = pd.DataFrame(data)

# Use the function to create an overview of the dataframe
CreateDataOverview(
    dataframe=df,
    plot_missingness=True
)
```
CreateRandomSampleGroups
The CreateRandomSampleGroups function takes a pandas DataFrame, shuffles its rows, assigns each row to one of n groups, and returns the updated DataFrame with an additional column indicating the group number.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateRandomSampleGroups
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 31, 35, 19, 45],
    'Score': [85, 95, 78, 81, 92]
}
df = pd.DataFrame(data)

# Use the function
grouped_df = CreateRandomSampleGroups(
    dataframe=df,
    number_of_groups=2,
    random_seed=123
)
```
CreateRareCategoryColumn
The CreateRareCategoryColumn function creates a new column in a Pandas dataframe that indicates whether a categorical variable value is rare. A rare category is a category that occurs less than a specified percentage of the time.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateRareCategoryColumn
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
updated_df = CreateRareCategoryColumn(
    dataframe=df,
    categorical_column_name='Name',
    rare_category_label='Rare',
    rare_category_threshold=0.05,
    new_column_suffix='(rare category)'
)
```
CreateStratifiedRandomSampleGroups
The CreateStratifiedRandomSampleGroups function performs stratified random sampling on a pandas DataFrame. Stratified random sampling is a method of sampling that involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members' shared attributes or characteristics.
```python
# Import necessary packages
from analysistoolbox.data_processing import CreateStratifiedRandomSampleGroups
import numpy as np
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
stratified_df = CreateStratifiedRandomSampleGroups(
    dataframe=df,
    number_of_groups=2,
    list_categorical_column_names=['Name'],
    random_seed=42
)
```
ImputeMissingValuesUsingNearestNeighbors
The ImputeMissingValuesUsingNearestNeighbors function imputes missing values in a dataframe using the nearest neighbors method. For each sample with missing values, it finds the n_neighbors nearest neighbors in the training set and imputes the missing values using the mean value of these neighbors.
```python
# Import necessary packages
from analysistoolbox.data_processing import ImputeMissingValuesUsingNearestNeighbors
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, np.nan]
}
df = pd.DataFrame(data)

# Use the function
imputed_df = ImputeMissingValuesUsingNearestNeighbors(
    dataframe=df,
    list_of_numeric_columns_to_impute=['A', 'B', 'C', 'D'],
    number_of_neighbors=2,
    averaging_method='uniform'
)
```
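scikit-learn's KNNImputer provides the same nearest-neighbors imputation directly; here is a minimal sketch for comparison (not necessarily the package's exact implementation):
```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
})

# Impute each missing value from the 2 nearest rows, equally weighted
imputer = KNNImputer(n_neighbors=2, weights='uniform')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```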
VerifyGranularity
The VerifyGranularity function checks the granularity of a given dataframe based on a list of key columns. Granularity in this context refers to the level of detail or summarization in a set of data.
```python
# Import necessary packages
from analysistoolbox.data_processing import VerifyGranularity
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Alice', 'Bob', 'Alice'],
    'Age': [25, 31, 35, 19, 45, 23, 30, 24],
    'Score': [85, 95, 78, 81, 92, 88, 90, 86]
}
df = pd.DataFrame(data)

# Use the function
VerifyGranularity(
    dataframe=df,
    list_of_key_columns=['Name', 'Age'],
    set_key_as_index=True,
    print_as_markdown=False
)
```
Descriptive Analytics
ConductManifoldLearning
The ConductManifoldLearning function performs manifold learning on a given dataframe and returns a new dataframe with the original columns and the new manifold learning components. Manifold learning is a type of unsupervised learning that is used to reduce the dimensionality of the data.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductManifoldLearning
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Use the function
new_df = ConductManifoldLearning(
    dataframe=iris_df,
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_components=2,
    random_seed=42,
    show_component_summary_plots=True,
    sns_color_palette='Set2',
    summary_plot_size=(10, 10)
)
```
ConductPrincipalComponentAnalysis
The ConductPrincipalComponentAnalysis function performs Principal Component Analysis (PCA) on a given dataframe. PCA is a technique used in machine learning to reduce the dimensionality of data while retaining as much information as possible.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductPrincipalComponentAnalysis
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the function
result = ConductPrincipalComponentAnalysis(
    dataframe=iris_df,
    list_of_numeric_columns=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_components=2
)
```
ConductPropensityScoreMatching
Conducts propensity score matching to create balanced treatment and control groups for causal inference analysis.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import ConductPropensityScoreMatching
import pandas as pd

# Create matched groups based on age, education, and experience
matched_df = ConductPropensityScoreMatching(
    dataframe=df,
    subject_id_column_name='employee_id',
    list_of_column_names_to_base_matching=['age', 'education', 'years_experience'],
    grouping_column_name='received_training',
    control_group_name='No',
    max_matches_per_subject=1,
    balance_groups=True,
    propensity_score_column_name="PSScore",
    matched_id_column_name="MatchedEmployeeID",
    random_seed=412
)
```
CreateAssociationRules
The CreateAssociationRules function creates association rules from a given dataframe. Association rules are widely used in market basket analysis, where the goal is to find associations and/or correlations among a set of items.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateAssociationRules
import pandas as pd

# Assuming you have a dataframe 'df' with 'TransactionID' and 'Item' columns
result = CreateAssociationRules(
    dataframe=df,
    transaction_id_column='TransactionID',
    items_column='Item',
    support_threshold=0.01,
    confidence_threshold=0.2,
    plot_lift=True,
    plot_title='Association Rules',
    plot_size=(10, 7)
)
```
CreateGaussianMixtureClusters
The CreateGaussianMixtureClusters function creates Gaussian mixture clusters from a given dataframe. Gaussian mixture models are a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe, and also calculates the probability of each data point belonging to each cluster.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateGaussianMixtureClusters
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Convert the iris dataset to a pandas dataframe
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])

# Call the CreateGaussianMixtureClusters function
df_clustered = CreateGaussianMixtureClusters(
    dataframe=df,
    list_of_numeric_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Gaussian Mixture Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    sns_color_palette='Set2',
    summary_plot_size=(15, 15),
    random_seed=123,
    maximum_iterations=200
)
```
CreateHierarchicalClusters
The CreateHierarchicalClusters function creates hierarchical clusters from a given dataframe. Hierarchical clustering is a type of unsupervised learning that is used to find clusters in data. It adds the resulting clusters as a new column in the dataframe.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateHierarchicalClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateHierarchicalClusters function
df_clustered = CreateHierarchicalClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='Hierarchical Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)
```
CreateKMeansClusters
The CreateKMeansClusters function performs K-Means clustering on a given dataset and returns the dataset with an additional column indicating the cluster each record belongs to.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import CreateKMeansClusters
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the CreateKMeansClusters function
df_clustered = CreateKMeansClusters(
    dataframe=df,
    list_of_value_columns_for_clustering=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
    number_of_clusters=3,
    column_name_for_clusters='K-Means Cluster',
    scale_predictor_variables=True,
    show_cluster_summary_plots=True,
    color_palette='Set2',
    summary_plot_size=(6, 4),
    random_seed=412,
    maximum_iterations=300
)
```
GenerateEDAWithLIDA
The GenerateEDAWithLIDA function uses the LIDA package from Microsoft to generate exploratory data analysis (EDA) goals.
```python
# Import necessary packages
from analysistoolbox.descriptive_analytics import GenerateEDAWithLIDA
import pandas as pd
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Call the GenerateEDAWithLIDA function
df_summary = GenerateEDAWithLIDA(
    dataframe=df,
    llm_api_key="your_llm_api_key_here",
    llm_provider="openai",
    llm_model="gpt-3.5-turbo",
    visualization_library="seaborn",
    goal_temperature=0.50,
    code_generation_temperature=0.05,
    data_summary_method="llm",
    number_of_samples_to_show_in_summary=5,
    return_data_fields_summary=True,
    number_of_goals_to_generate=5,
    plot_recommended_visualization=True,
    show_code_for_recommended_visualization=True
)
```
File Management
ImportDataFromFolder
The ImportDataFromFolder function imports all CSV and Excel files from a specified folder and combines them into a single DataFrame. It ensures that column names match across all files if specified.
```python
# Import necessary packages
from analysistoolbox.file_management import ImportDataFromFolder

# Specify the folder path
folder_path = "path/to/your/folder"

# Call the ImportDataFromFolder function
combined_df = ImportDataFromFolder(
    folder_path=folder_path,
    force_column_names_to_match=True
)
```
CreateFileTree
The CreateFileTree function recursively walks a directory tree and prints a diagram of all the subdirectories and files.
```python
# Import necessary packages
from analysistoolbox.file_management import CreateFileTree

# Specify the directory path
directory_path = "path/to/your/directory"

# Call the CreateFileTree function
CreateFileTree(
    path=directory_path,
    indent_spaces=2
)
```
CreateCopyOfPDF
The CreateCopyOfPDF function creates a copy of a PDF file, with options to specify the start and end pages.
```python
# Import necessary packages
from analysistoolbox.file_management import CreateCopyOfPDF

# Specify the input and output file paths
input_pdf = "path/to/input.pdf"
output_pdf = "path/to/output.pdf"

# Call the CreateCopyOfPDF function
CreateCopyOfPDF(
    input_file=input_pdf,
    output_file=output_pdf,
    start_page=1,
    end_page=5
)
```
ConvertWordDocsToPDF
The ConvertWordDocsToPDF function converts all Word documents in a specified folder to PDF format.
```python
# Import necessary packages
from analysistoolbox.file_management import ConvertWordDocsToPDF

# Specify the folder paths
word_folder = "path/to/word/documents"
pdf_folder = "path/to/save/pdf/documents"

# Call the ConvertWordDocsToPDF function
ConvertWordDocsToPDF(
    word_folder_path=word_folder,
    pdf_folder_path=pdf_folder,
    open_each_doc=False
)
```
Hypothesis Testing
ChiSquareTestOfIndependence
The ChiSquareTestOfIndependence function performs a chi-square test of independence to determine if there is a significant relationship between two categorical variables.
```python
# Import necessary packages
from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependence
import pandas as pd

# Create sample data
data = {
    'Education': ['High School', 'College', 'High School', 'Graduate', 'College'],
    'Employment': ['Employed', 'Unemployed', 'Employed', 'Employed', 'Unemployed']
}
df = pd.DataFrame(data)

# Conduct chi-square test
ChiSquareTestOfIndependence(
    dataframe=df,
    first_categorical_column='Education',
    second_categorical_column='Employment',
    plot_contingency_table=True
)
```
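For reference, the same test is available in SciPy via `chi2_contingency`; this sketch is independent of the package:
```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    'Education': ['High School', 'College', 'High School', 'Graduate', 'College'],
    'Employment': ['Employed', 'Unemployed', 'Employed', 'Employed', 'Unemployed']
})

# Build the contingency table, then run the chi-square test
table = pd.crosstab(df['Education'], df['Employment'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```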
ChiSquareTestOfIndependenceFromTable
The ChiSquareTestOfIndependenceFromTable function performs a chi-square test using a pre-computed contingency table.
```python
# Import necessary packages
from analysistoolbox.hypothesis_testing import ChiSquareTestOfIndependenceFromTable
import pandas as pd

# Create contingency table
contingency_table = pd.DataFrame({
    'Online': [100, 150],
    'In-Store': [200, 175]
}, index=['Male', 'Female'])

# Conduct chi-square test
ChiSquareTestOfIndependenceFromTable(
    contingency_table=contingency_table,
    plot_contingency_table=True
)
```
ConductCoxProportionalHazardRegression
The ConductCoxProportionalHazardRegression function performs survival analysis using Cox Proportional Hazard regression.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductCoxProportionalHazardRegression

# Conduct Cox regression
model = ConductCoxProportionalHazardRegression(
    dataframe=df,
    outcome_column='event',
    duration_column='time',
    list_of_predictor_columns=['age', 'sex', 'treatment'],
    plot_survival_curve=True
)
```
ConductLinearRegressionAnalysis
The ConductLinearRegressionAnalysis function performs linear regression analysis with optional plotting.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductLinearRegressionAnalysis

# Conduct linear regression
results = ConductLinearRegressionAnalysis(
    dataframe=df,
    outcome_column='sales',
    list_of_predictor_columns=['advertising', 'price'],
    plot_regression_diagnostic=True
)
```
ConductLogisticRegressionAnalysis
The ConductLogisticRegressionAnalysis function performs logistic regression for binary outcomes.
```python
# Import the function
from analysistoolbox.hypothesis_testing import ConductLogisticRegressionAnalysis

# Conduct logistic regression
results = ConductLogisticRegressionAnalysis(
    dataframe=df,
    outcome_column='purchased',
    list_of_predictor_columns=['age', 'income'],
    plot_regression_diagnostic=True
)
```
OneSampleTTest
The OneSampleTTest function performs a one-sample t-test to compare a sample mean to a hypothesized population mean.
```python
# Import the function
from analysistoolbox.hypothesis_testing import OneSampleTTest

# Conduct one-sample t-test
OneSampleTTest(
    dataframe=df,
    outcome_column='score',
    hypothesized_mean=70,
    alternative_hypothesis='two-sided',
    confidence_interval=0.95
)
```
OneWayANOVA
The OneWayANOVA function performs a one-way analysis of variance to compare means across multiple groups.
```python
# Import the function
from analysistoolbox.hypothesis_testing import OneWayANOVA

# Conduct one-way ANOVA
OneWayANOVA(
    dataframe=df,
    outcome_column='performance',
    grouping_column='treatment_group',
    plot_sample_distributions=True
)
```
TTestOfMeanFromStats
The TTestOfMeanFromStats function performs a t-test using summary statistics rather than raw data.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfMeanFromStats

# Conduct t-test from statistics
TTestOfMeanFromStats(
    sample_mean=75,
    sample_size=30,
    sample_standard_deviation=10,
    hypothesized_mean=70,
    alternative_hypothesis='greater'
)
```
TTestOfProportionFromStats
The TTestOfProportionFromStats function tests a sample proportion against a hypothesized value.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfProportionFromStats

# Test proportion from statistics
TTestOfProportionFromStats(
    sample_proportion=0.65,        # 65% proportion
    sample_size=200,               # 200 survey responses
    hypothesized_proportion=0.50,
    alternative_hypothesis='two-sided'
)
```
TTestOfTwoMeansFromStats
The TTestOfTwoMeansFromStats function compares two means using summary statistics.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TTestOfTwoMeansFromStats

# Compare two means from statistics
TTestOfTwoMeansFromStats(
    first_sample_mean=75,
    first_sample_size=30,
    first_sample_standard_deviation=10,
    second_sample_mean=70,
    second_sample_size=30,
    second_sample_standard_deviation=12
)
```
TwoSampleTTestOfIndependence
The TwoSampleTTestOfIndependence function performs an independent samples t-test to compare means between two groups.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TwoSampleTTestOfIndependence

# Conduct independent samples t-test
TwoSampleTTestOfIndependence(
    dataframe=df,
    outcome_column='score',
    grouping_column='group',
    alternative_hypothesis='two-sided',
    homogeneity_of_variance=True
)
```
TwoSampleTTestPaired
The TwoSampleTTestPaired function performs a paired samples t-test for before-after comparisons.
```python
# Import the function
from analysistoolbox.hypothesis_testing import TwoSampleTTestPaired

# Conduct paired samples t-test
TwoSampleTTestPaired(
    dataframe=df,
    first_outcome_column='pre_score',
    second_outcome_column='post_score',
    alternative_hypothesis='greater'
)
```
Linear Algebra
CalculateEigenvalues
The CalculateEigenvalues function calculates and visualizes the eigenvalues and eigenvectors of a matrix.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import CalculateEigenvalues
import numpy as np

# Create a 2x2 matrix
matrix = np.array([
    [4, -2],
    [1, 1]
])

# Calculate eigenvalues and eigenvectors
CalculateEigenvalues(
    matrix=matrix,
    plot_eigenvectors=True,
    plot_transformation=True
)
```
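The computation itself is a one-liner in NumPy; a sketch independent of the package:
```python
import numpy as np

matrix = np.array([[4, -2], [1, 1]])

# Each column of eigenvectors corresponds to the eigenvalue at the same index
eigenvalues, eigenvectors = np.linalg.eig(matrix)
print(eigenvalues)   # eigenvalues 3 and 2 (order may vary)
print(eigenvectors)
```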
ConvertMatrixToRowEchelonForm
The ConvertMatrixToRowEchelonForm function converts a matrix to row echelon form using Gaussian elimination.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import ConvertMatrixToRowEchelonForm
import numpy as np

# Create a matrix
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Convert to row echelon form
row_echelon = ConvertMatrixToRowEchelonForm(
    matrix=matrix,
    show_pivot_columns=True
)
```
ConvertSystemOfEquationsToMatrix
The ConvertSystemOfEquationsToMatrix function converts a system of linear equations to matrix form.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import ConvertSystemOfEquationsToMatrix
import numpy as np

# Define system of equations:
#   2x + 3y = 8
#   4x -  y = 1
coefficients = np.array([
    [2, 3],
    [4, -1]
])
constants = np.array([8, 1])

# Convert to matrix form
matrix = ConvertSystemOfEquationsToMatrix(
    coefficients=coefficients,
    constants=constants,
    show_determinant=True
)
```
PlotVectors
The PlotVectors function visualizes vectors in 2D or 3D space.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import PlotVectors
import numpy as np

# Define vectors
vectors = [
    [3, 2],    # First vector
    [-1, 4],   # Second vector
    [2, -3]    # Third vector
]

# Plot vectors
PlotVectors(
    list_of_vectors=vectors,
    origin=[0, 0],
    plot_sum=True,
    grid=True
)
```
SolveSystemOfEquations
The SolveSystemOfEquations function solves a system of linear equations and optionally visualizes the solution.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import SolveSystemOfEquations
import numpy as np

# Define system of equations:
#   2x +  y = 5
#    x - 3y = -1
coefficients = np.array([
    [2, 1],
    [1, -3]
])
constants = np.array([5, -1])

# Solve the system
solution = SolveSystemOfEquations(
    coefficients=coefficients,
    constants=constants,
    show_plot=True,
    plot_boundary=10
)
```
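NumPy solves the same system directly; a sketch independent of the package:
```python
import numpy as np

coefficients = np.array([[2, 1], [1, -3]])
constants = np.array([5, -1])

# Solve Ax = b; here x = 2, y = 1
solution = np.linalg.solve(coefficients, constants)
print(solution)  # [2. 1.]
```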
VisualizeMatrixAsLinearTransformation
The VisualizeMatrixAsLinearTransformation function visualizes how a matrix transforms space as a linear transformation.
```python
# Import necessary packages
from analysistoolbox.linear_algebra import VisualizeMatrixAsLinearTransformation
import numpy as np

# Define transformation matrix
transformation_matrix = np.array([
    [2, -1],
    [1, 1]
])

# Visualize the transformation
VisualizeMatrixAsLinearTransformation(
    transformation_matrix=transformation_matrix,
    plot_grid=True,
    plot_unit_vectors=True,
    animation_frames=30
)
```
LLM
SendPromptToAnthropic
The SendPromptToAnthropic function sends a prompt to Anthropic's Claude API using LangChain. It supports template-based prompting and requires an Anthropic API key.
```python
# Import the function
from analysistoolbox.llm import SendPromptToAnthropic

# Define your prompt template with variables in curly braces
prompt_template = "Given the text: {text}\nSummarize the main points in bullet form."

# Create a dictionary with your input variables
user_input = {
    "text": "Your text to analyze goes here..."
}

# Send the prompt to Claude
response = SendPromptToAnthropic(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    anthropic_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="claude-3-opus-20240229",
    maximum_tokens=1000
)
print(response)
```
SendPromptToChatGPT
The SendPromptToChatGPT function sends a prompt to OpenAI's ChatGPT API using LangChain. It supports template-based prompting and requires an OpenAI API key.
```python
# Import the function
from analysistoolbox.llm import SendPromptToChatGPT

# Define your prompt template with variables in curly braces
prompt_template = "Analyze the following data: {data}\nProvide key insights."

# Create a dictionary with your input variables
user_input = {
    "data": "Your data to analyze goes here..."
}

# Send the prompt to ChatGPT
response = SendPromptToChatGPT(
    prompt_template=prompt_template,
    user_input=user_input,
    system_message="You are a helpful assistant.",
    openai_api_key="your-api-key-here",
    temperature=0.0,
    chat_model_name="gpt-4o-mini",
    maximum_tokens=1000
)
print(response)
```
Predictive Analytics
CreateARIMAModel
Builds an ARIMA (Autoregressive Integrated Moving Average) model for time series forecasting.
```python
# Import necessary packages
from analysistoolbox.predictive_analytics import CreateARIMAModel
import pandas as pd

# Create time series forecast
forecast = CreateARIMAModel(
    dataframe=df,
    time_column='date',
    value_column='sales',
    forecast_periods=12
)
```
CreateBoostedTreeModel
Creates a gradient boosted tree model for classification or regression tasks, offering high performance and feature importance analysis.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateBoostedTreeModel

# Train a boosted tree classifier
model = CreateBoostedTreeModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'tenure', 'satisfaction'],
    is_outcome_categorical=True,
    plot_model_test_performance=True
)
```
CreateDecisionTreeModel
Builds an interpretable decision tree for classification or regression, with visualization options.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateDecisionTreeModel

# Create a decision tree for predicting house prices
model = CreateDecisionTreeModel(
    dataframe=df,
    outcome_variable='price',
    list_of_predictor_variables=['sqft', 'bedrooms', 'location'],
    is_outcome_categorical=False,
    maximum_depth=5
)
```
CreateLinearRegressionModel
Fits a linear regression model with optional scaling and comprehensive performance visualization.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateLinearRegressionModel

# Predict sales based on advertising spend
model = CreateLinearRegressionModel(
    dataframe=df,
    outcome_variable='sales',
    list_of_predictor_variables=['tv_ads', 'radio_ads', 'newspaper_ads'],
    scale_variables=True,
    plot_model_test_performance=True
)
```
CreateLogisticRegressionModel
Implements logistic regression for binary classification tasks with regularization options.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateLogisticRegressionModel

# Predict customer churn probability
model = CreateLogisticRegressionModel(
    dataframe=df,
    outcome_variable='churn',
    list_of_predictor_variables=['usage', 'complaints', 'satisfaction'],
    scale_predictor_variables=True,
    show_classification_plot=True
)
```
CreateNeuralNetwork_SingleOutcome
Builds and trains a neural network for single-outcome prediction tasks, with customizable architecture.
```python
# Import the function
from analysistoolbox.predictive_analytics import CreateNeuralNetwork_SingleOutcome

# Create a neural network for image classification
model = CreateNeuralNetwork_SingleOutcome(
    dataframe=df,
    outcome_variable='label',
    list_of_predictor_variables=feature_columns,
    number_of_hidden_layers=3,
    is_outcome_categorical=True,
    plot_loss=True
)
```
Prescriptive Analytics
The prescriptive analytics module provides tools for making data-driven recommendations and decisions:
ConductLinearOptimization
Conducts linear optimization to find the optimal input values for a given output variable, with optional constraints.
```python
# Import necessary packages
import pandas as pd
from analysistoolbox.prescriptive_analytics.ConductLinearOptimization import ConductLinearOptimization

# Sample data
data = pd.DataFrame({
    'input1': [1, 2, 3, 4, 5],
    'input2': [2, 4, 6, 8, 10],
    'output': [10, 20, 30, 40, 50]
})

# Define constraints (optional)
constraints = {
    'input1': (0, 10),    # input1 between 0 and 10
    'input2': (None, 15)  # input2 maximum 15, no minimum
}

# Run optimization
results = ConductLinearOptimization(
    dataframe=data,
    output_variable='output',
    list_of_input_variables=['input1', 'input2'],
    optimization_type='maximize',
    input_constraints=constraints
)
```
CreateContentBasedRecommender
Builds a content-based recommendation system using neural networks to learn user and item embeddings.
```python
# Import necessary packages
from analysistoolbox.prescriptive_analytics import CreateContentBasedRecommender
import pandas as pd

# Create a movie recommendation system
recommender = CreateContentBasedRecommender(
    dataframe=movie_ratings_df,
    outcome_variable='rating',
    user_list_of_predictor_variables=['age', 'gender', 'occupation'],
    item_list_of_predictor_variables=['genre', 'year', 'director', 'budget'],
    user_number_of_hidden_layers=2,
    item_number_of_hidden_layers=2,
    number_of_recommendations=5,
    scale_variables=True,
    plot_loss=True
)
```
Probability
The probability module provides tools for working with probability distributions and statistical models:
ProbabilityOfAtLeastOne
Calculates and visualizes the probability of at least one event occurring in a series of independent trials.
```python
# Import the function
from analysistoolbox.probability import ProbabilityOfAtLeastOne

# Calculate probability of at least one defect in 10 products,
# given a 5% defect rate per product
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.05,
    number_of_events=10,
    format_as_percent=True,
    show_plot=True,
    risk_tolerance=0.20  # Highlight 20% risk threshold
)

# Calculate probability of at least one successful sale,
# given 30 customer interactions with a 15% success rate
prob = ProbabilityOfAtLeastOne(
    probability_of_event=0.15,
    number_of_events=30,
    format_as_percent=True,
    show_plot=True,
    title_for_plot="Sales Success Probability",
    subtitle_for_plot="Probability of at least one sale in 30 customer interactions"
)
```
Simulations
The simulations module provides a comprehensive set of tools for statistical simulations and probability distributions:
CreateMetalogDistribution
Creates a flexible metalog distribution from data, useful for modeling complex probability distributions.
```python
# Import the function
from analysistoolbox.simulations import CreateMetalogDistribution

# Create a metalog distribution from historical data
distribution = CreateMetalogDistribution(
    dataframe=df,
    variable='sales',
    lower_bound=0,
    number_of_samples=10000,
    plot_metalog_distribution=True
)
```
CreateMetalogDistributionFromPercentiles
Builds a metalog distribution from known percentile values.
```python
# Import the function
from analysistoolbox.simulations import CreateMetalogDistributionFromPercentiles

# Create distribution from percentiles
distribution = CreateMetalogDistributionFromPercentiles(
    list_of_values=[10, 20, 30, 50],
    list_of_percentiles=[0.1, 0.25, 0.75, 0.9],
    lower_bound=0,
    show_distribution_plot=True
)
```
CreateSIPDataframe
Generates a SIP (Stochastic Information Packet) dataframe of indexed percentiles for uncertainty analysis.
```python
# Import the function
from analysistoolbox.simulations import CreateSIPDataframe

# Create SIP dataframe for risk analysis
sip_df = CreateSIPDataframe(
    number_of_percentiles=10,
    number_of_trials=1000
)
```
CreateSLURPDistribution
Creates a SIP with relationships preserved (SLURP) based on a linear regression model's prediction interval.
```python
# Import the function
from analysistoolbox.simulations import CreateSLURPDistribution

# Create a SLURP distribution from a linear regression model
slurp_dist = CreateSLURPDistribution(
    linear_regression_model=model,            # statsmodels regression model
    list_of_prediction_values=[x1, x2, ...],  # values for predictors
    number_of_trials=10000,                   # number of samples to generate
    prediction_interval=0.95,                 # confidence level for prediction interval
    lower_bound=None,                         # optional lower bound constraint
    upper_bound=None                          # optional upper bound constraint
)
```
SimulateCountOfSuccesses
Simulates binomial outcomes (number of successes in fixed trials).
```python
# Import the function
from analysistoolbox.simulations import SimulateCountOfSuccesses

# Simulate customer conversion rates
results = SimulateCountOfSuccesses(
    probability_of_success=0.15,
    sample_size_per_trial=100,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
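Equivalent draws can be generated with NumPy's binomial sampler; a sketch independent of the package:
```python
import numpy as np

rng = np.random.default_rng(seed=42)

# 10,000 trials of 100 attempts each, with a 15% success probability
successes = rng.binomial(n=100, p=0.15, size=10_000)
print(successes.mean())  # close to the expected count of 15
```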
SimulateCountOutcome
Simulates Poisson-distributed count data.
```python
# Import the function
from analysistoolbox.simulations import SimulateCountOutcome

# Simulate daily customer arrivals
arrivals = SimulateCountOutcome(
    expected_count=25,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateCountUntilFirstSuccess
Simulates geometric distributions (trials until first success).
```python
# Import the function
from analysistoolbox.simulations import SimulateCountUntilFirstSuccess

# Simulate number of attempts until success
attempts = SimulateCountUntilFirstSuccess(
    probability_of_success=0.2,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateNormallyDistributedOutcome
Generates normally distributed random variables.
```python
# Import the function
from analysistoolbox.simulations import SimulateNormallyDistributedOutcome

# Simulate product weights
weights = SimulateNormallyDistributedOutcome(
    mean=100,
    standard_deviation=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTDistributedOutcome
Generates Student's t-distributed random variables.
```python
# Import the function
from analysistoolbox.simulations import SimulateTDistributedOutcome

# Simulate with heavy-tailed distribution
values = SimulateTDistributedOutcome(
    degrees_of_freedom=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTimeBetweenEvents
Simulates exponentially distributed inter-arrival times.
```python
# Import the function
from analysistoolbox.simulations import SimulateTimeBetweenEvents

# Simulate time between customer arrivals
times = SimulateTimeBetweenEvents(
    average_time_between_events=30,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
SimulateTimeUntilNEvents
Simulates Erlang-distributed waiting times.
```python
# Import the function
from analysistoolbox.simulations import SimulateTimeUntilNEvents

# Simulate time until 5 events occur
wait_time = SimulateTimeUntilNEvents(
    average_time_between_events=10,
    number_of_events=5,
    number_of_trials=10000,
    plot_simulation_results=True
)
```
Statistics
The statistics module provides essential tools for statistical inference and estimation:
CalculateConfidenceIntervalOfMean
Calculates confidence intervals for population means, automatically handling both large (z-distribution) and small (t-distribution) sample sizes.
```python
# Import the function
from analysistoolbox.statistics import CalculateConfidenceIntervalOfMean

# Calculate 95% confidence interval for average customer spending
ci_results = CalculateConfidenceIntervalOfMean(
    sample_mean=45.2,
    sample_standard_deviation=12.5,
    sample_size=100,
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Average Spending ($)"
)
```
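For comparison, the same interval can be computed from summary statistics with SciPy; a sketch independent of the package:
```python
import math
from scipy import stats

sample_mean, sample_sd, n = 45.2, 12.5, 100

# t-based 95% confidence interval for the mean
standard_error = sample_sd / math.sqrt(n)
lower, upper = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```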
CalculateConfidenceIntervalOfProportion
Calculates confidence intervals for population proportions, with automatic selection of the appropriate distribution based on sample size.
```python
# Import the function
from analysistoolbox.statistics import CalculateConfidenceIntervalOfProportion

# Calculate 95% confidence interval for customer satisfaction rate
ci_results = CalculateConfidenceIntervalOfProportion(
    sample_proportion=0.78,     # 78% satisfaction rate
    sample_size=200,            # 200 survey responses
    confidence_interval=0.95,
    plot_sample_distribution=True,
    value_name="Satisfaction Rate"
)
```
Visualizations
The visualizations module provides a comprehensive set of tools for creating publication-quality statistical plots and charts:
Plot100PercentStackedBarChart
Creates a 100% stacked bar chart for comparing proportional compositions across categories.
```python
# Import the function
from analysistoolbox.visualizations import Plot100PercentStackedBarChart

# Create a stacked bar chart showing customer segments by region
chart = Plot100PercentStackedBarChart(
    dataframe=df,
    categorical_column_name='Region',
    value_column_name='Customers',
    grouping_column_name='Segment'
)
```
PlotBarChart
Creates a customizable bar chart with options for highlighting specific categories.
```python
# Import the function
from analysistoolbox.visualizations import PlotBarChart

# Create a bar chart of sales by product
chart = PlotBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    top_n_to_highlight=3,
    highlight_color="#b0170c"
)
```
PlotBoxWhiskerByGroup
Creates box-and-whisker plots for comparing distributions across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotBoxWhiskerByGroup

# Compare salary distributions across departments
plot = PlotBoxWhiskerByGroup(
    dataframe=df,
    value_column_name='Salary',
    grouping_column_name='Department'
)
```
PlotBulletChart
Creates bullet charts for comparing actual values against targets with optional range bands.
```python
# Import the function
from analysistoolbox.visualizations import PlotBulletChart

# Create bullet chart comparing actual vs target sales
chart = PlotBulletChart(
    dataframe=df,
    value_column_name='ActualSales',
    grouping_column_name='Region',
    target_value_column_name='TargetSales',
    list_of_limit_columns=['MinSales', 'MaxSales']
)
```
PlotCard
Creates a simple card-style visualization with a value and an optional value label.
```python
# Import the function
from analysistoolbox.visualizations import PlotCard

# Create a simple KPI card
card = PlotCard(
    value=125000,                   # main value to display
    value_label="Monthly Revenue",  # optional label
    value_font_size=30,             # size of the main value
    value_label_font_size=14,       # size of the label
    figure_size=(3, 2)              # dimensions of the card
)
```
PlotClusteredBarChart
Creates grouped bar charts for comparing multiple categories across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotClusteredBarChart

# Create clustered bar chart of sales by product and region
chart = PlotClusteredBarChart(
    dataframe=df,
    categorical_column_name='Product',
    value_column_name='Sales',
    grouping_column_name='Region'
)
```
PlotContingencyHeatmap
Creates a heatmap visualization of contingency tables.
```python
# Import the function
from analysistoolbox.visualizations import PlotContingencyHeatmap

# Create heatmap of customer segments vs purchase categories
heatmap = PlotContingencyHeatmap(
    dataframe=df,
    categorical_column_name_1='CustomerSegment',
    categorical_column_name_2='PurchaseCategory',
    normalize_by="columns"
)
```
PlotCorrelationMatrix
Creates correlation matrix visualizations with optional scatter plots.
```python
# Import the function
from analysistoolbox.visualizations import PlotCorrelationMatrix

# Create correlation matrix of numeric variables
matrix = PlotCorrelationMatrix(
    dataframe=df,
    list_of_value_column_names=['Age', 'Income', 'Spending'],
    show_as_pairplot=True
)
```
PlotDensityByGroup
Creates density plots for comparing distributions across groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotDensityByGroup

# Compare age distributions across customer segments
plot = PlotDensityByGroup(
    dataframe=df,
    value_column_name='Age',
    grouping_column_name='Customer_Segment'
)
```
PlotDotPlot
Creates dot plots with optional connecting lines between groups.
```python
# Import the function
from analysistoolbox.visualizations import PlotDotPlot

# Compare before/after measurements
plot = PlotDotPlot(
    dataframe=df,
    categorical_column_name='Metric',
    value_column_name='Value',
    group_column_name='TimePeriod',
    connect_dots=True
)
```
PlotHeatmap
Creates customizable heatmaps for visualizing two-dimensional data.
```python
# Import the function
from analysistoolbox.visualizations import PlotHeatmap

# Create heatmap of customer activity by hour and day
heatmap = PlotHeatmap(
    dataframe=df,
    x_axis_column_name='Hour',
    y_axis_column_name='Day',
    value_column_name='Activity',
    color_palette="RdYlGn"
)
```
PlotOverlappingAreaChart
Creates stacked or overlapping area charts for time series data.
```python
# Import the function
from analysistoolbox.visualizations import PlotOverlappingAreaChart

# Show product sales trends over time
chart = PlotOverlappingAreaChart(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    variable_column_name='Product'
)
```
PlotRiskTolerance
Creates specialized plots for risk analysis and tolerance visualization.
```python
# Import the function
from analysistoolbox.visualizations import PlotRiskTolerance

# Visualize risk tolerance levels
plot = PlotRiskTolerance(
    dataframe=df,
    value_column_name='RiskScore',
    tolerance_level_column_name='Tolerance'
)
```
PlotScatterplot
Creates scatter plots with optional trend lines and grouping.
```python
# Import the function
from analysistoolbox.visualizations import PlotScatterplot

# Create scatter plot of age vs income
plot = PlotScatterplot(
    dataframe=df,
    x_axis_column_name='Age',
    y_axis_column_name='Income',
    color_by_column_name='Education'
)
```
PlotSingleVariableCountPlot
Creates count plots for categorical variables.
```python
# Import the function
from analysistoolbox.visualizations import PlotSingleVariableCountPlot

# Show distribution of customer types
plot = PlotSingleVariableCountPlot(
    dataframe=df,
    categorical_column_name='CustomerType',
    top_n_to_highlight=2
)
```
PlotSingleVariableHistogram
Creates histograms for continuous variables.
```python
# Import the function
from analysistoolbox.visualizations import PlotSingleVariableHistogram

# Create histogram of transaction amounts
plot = PlotSingleVariableHistogram(
    dataframe=df,
    value_column_name='TransactionAmount',
    show_mean=True,
    show_median=True
)
```
PlotTimeSeries
Creates time series plots with optional grouping and marker sizes.
```python
# Import the function
from analysistoolbox.visualizations import PlotTimeSeries

# Plot monthly sales with grouping
plot = PlotTimeSeries(
    dataframe=df,
    time_column_name='Date',
    value_column_name='Sales',
    grouping_column_name='Region',     # optional grouping
    marker_size_column_name='Volume',  # optional markers
    line_color='#3269a8',
    figure_size=(8, 5)
)
```
RenderTableOne
Creates publication-ready summary statistics tables comparing variables across groups.
```python
# Import the function
from analysistoolbox.visualizations import RenderTableOne

# Create summary statistics table comparing age, education by department
table = RenderTableOne(
    dataframe=df,
    value_column_name='Age',            # outcome variable
    grouping_column_name='Department',  # grouping variable
    list_of_row_variables=['Education', 'Experience'],  # variables to compare
    table_format='html',                # output format
    show_p_value=True                   # include statistical tests
)
```
Contributions
Contributions to the analysistoolbox package are welcome! Please submit a pull request with your changes.
License
The analysistoolbox package is licensed under the GNU General Public License v3.0 (GPL-3.0). Read more about the license at https://www.gnu.org/licenses/gpl-3.0.html.
Owner
- Name: Kyle Protho
- Login: KyleProtho
- Kind: user
- Location: Pittsburgh, PA
- Website: https://www.linkedin.com/in/kyleprotho
- Twitter: kyle_protho
- Repositories: 1
- Profile: https://github.com/KyleProtho
Passionate about UX Design, intelligence, and analytics
GitHub Events
Total
- Push event: 24
Last Year
- Push event: 24
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kyle Protho | O****o@g****m | 334 |
| Kyle Protho | o****o@g****m | 160 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 7
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- KyleProtho (3)
Pull Request Authors
- KyleProtho (7)
Packages
- Total packages: 1
- Total downloads: 1,816 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 86
- Total maintainers: 1
pypi.org: analysistoolbox
A collection of tools in Python for data collection and processing, statistics, analytics, and intelligence analysis.
- Homepage: https://github.com/KyleProtho/AnalysisToolBox/tree/master/analysistoolbox
- Documentation: https://analysistoolbox.readthedocs.io/
- License: MIT
- Latest release: 3.0.0 (published 6 months ago)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- Jinja2 ==3.1.2
- Levenshtein *
- PyPDF2 *
- beautifulsoup4 *
- folium *
- fuzzywuzzy *
- geopandas *
- lida *
- lifelines *
- mapclassify *
- matplotlib *
- mlxtend *
- numpy ==1.24.5
- openai *
- pandas *
- pygris *
- pymetalog *
- python-dotenv *
- pywin32 *
- requests *
- scikit-learn *
- scipy *
- seaborn *
- statsmodels *
- sympy *
- tableone *
- tensorflow *
- xgboost *
- yellowbrick *