practicalstats-pucsp-2024

Statistical Measures in Python - Age and Salary Analysis

https://github.com/fabianacampanari/practicalstats-pucsp-2024

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

excel hipoteses-tests hypothesis-testing linear-regression math matplotlib numpy pandas probabilistic-data-structures probability-distribution probability-statistics pythob3 scikit-learn scipy scipy-stats seaborn statisctics statsmodels

Keywords from Contributors

mesh sequences interactive hacking network-simulation
Last synced: 6 months ago

Repository

Statistical Measures in Python - Age and Salary Analysis

Basic Info
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 3
  • Releases: 0
Topics
excel hipoteses-tests hypothesis-testing linear-regression math matplotlib numpy pandas probabilistic-data-structures probability-distribution probability-statistics pythob3 scikit-learn scipy scipy-stats seaborn statisctics statsmodels
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme Funding License Code of conduct Citation Security Support

README.md

[🇧🇷 Português] [🇺🇸 English]



✍🏻 Practical Statistics and Probability in Python and Excel

Data Science and Artificial Intelligence Program - PUC-SP - 2nd Semester/2024






🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix 🖤

https://github.com/user-attachments/assets/867282f7-2962-4957-a080-12fd42151ebd

📺 For better resolution, watch the video on YouTube.



Statistics and Probability

This repository, created by Fabiana 🚀 Campanari in the 2nd semester of 2024, consolidates the materials and code developed for the Statistics and Probability course within the Data Science and Artificial Intelligence program at PUC-SP, under the guidance of Professor Eric Bacconi Gonçalves. It is designed to support hands-on learning through exercises, scripts, and datasets.

Repository Contents:


  • Python Scripts: This section includes scripts for a wide range of statistical analyses, covering key topics such as distributions, population and sample concepts, and hypothesis testing. Calculations include:
    • Central Tendency: Mean, median, and mode
    • Dispersion: Standard deviation, variance, and range
    • Positional Measures: Percentiles and quartiles (Q1, Q2, Q3)
    • Distribution Shape: Skewness and kurtosis
    • Confidence Intervals for estimating population parameters
    • Correlation and Covariance for bivariate analysis


  • Practical Exercises: Available in Python and Excel, these exercises provide hands-on practice in calculating statistical measures and applying concepts, including:
    • Analysis of Variance (ANOVA): Comparing means across multiple groups
  • Hypothesis Testing: Null hypothesis (H₀) tests, such as t-tests (one-sample, independent, and paired), ANOVA, and chi-square tests
    • Regression Analysis: Linear regression models for predictive analysis
    • Probability: Exercises covering probability distributions, expected value, and variance


  • Statistical Tests: This section includes implementations of statistical tests tailored to analyze variables like age and salary, categorized by region and educational level. Each test includes the process of setting up and testing the null hypothesis (H₀) for statistical significance.


  • Support Materials: Supplementary documentation on probability, relevant datasets, and homework assignments to reinforce key concepts.
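The statistical measures listed under Python Scripts above can be sketched end-to-end on a small synthetic sample. The values below are made up for illustration only; the repository's scripts apply the same calls to the course dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical salary sample (synthetic values, illustration only)
x = np.array([4.0, 7.6, 3.2, 6.0, 5.5, 9.8, 2.9, 6.7, 5.1, 8.4])

# Central tendency
print("mean:", np.mean(x), "median:", np.median(x))

# Dispersion: sample standard deviation, variance, and range
print("std:", x.std(ddof=1), "var:", x.var(ddof=1), "range:", x.max() - x.min())

# Positional measures: quartiles Q1, Q2, Q3
print("quartiles:", np.percentile(x, [25, 50, 75]))

# Distribution shape
print("skewness:", stats.skew(x), "kurtosis:", stats.kurtosis(x))

# 95% confidence interval for the mean, based on the t distribution
ci = stats.t.interval(0.95, df=len(x) - 1, loc=x.mean(), scale=stats.sem(x))
print("95% CI for the mean:", ci)
```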



Study Topics:

This repository provides a comprehensive foundation in core topics, including Descriptive Statistics, Probability Distributions, Population and Sample, Hypothesis Testing I and II (featuring null hypothesis (H₀) testing), and Regression Analysis. It serves as a practical tool for building statistical and analytical skills.

Statistical Measures Analysis in Python:

This repository contains Python scripts for descriptive statistical analysis of employee salary and age data, including analyses for the entire dataset as well as subgrouped by education level and region.

Features:

  • Descriptive statistics: Mean, Median, Mode, Variance, Standard Deviation, Coefficient of Variation (CV), and Amplitude (Range).
  • Grouped analysis: The same statistics calculated by grouping the data by region and education level.
  • Designed for students: Easy-to-follow code with comments and explanations for each step.


Dataset:

The dataset used in this analysis contains employee details, including their age, salary, region of origin (reg_proc), and education level (grau_instrucao). Click here to get the Dataset
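If the Excel file is not at hand, a small synthetic frame with the same column names lets the scripts below run. The column names (idade, salario, reg_proc, grau_instrucao, estado_civil) are taken from the scripts in this README; the values and category labels are random placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2903)
n = 36

# Hypothetical stand-in for the course dataset (values are random,
# for illustration only; only the column names match the scripts)
df = pd.DataFrame({
    'idade': rng.integers(20, 60, size=n),
    'salario': rng.normal(11.0, 4.0, size=n).round(2),
    'reg_proc': rng.choice(['capital', 'interior', 'outra'], size=n),
    'grau_instrucao': rng.choice(['fundamental', 'medio', 'superior'], size=n),
    'estado_civil': rng.choice(['s', 'c'], size=n),
})

print(df.head())
# df.to_excel('dados_sinteticos.xlsx', index=False)  # optional; needs openpyxl
```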


Getting Started:

To run this script, ensure you have the following:

  • Python 3 installed.

  • Necessary libraries (pandas) installed.

  • An Excel file containing the dataset in the appropriate format.

Codes:

1. Statistical Measures

👇 Copy code

```python
# Import the necessary library
import pandas as pd

# Define the file path to the dataset
file_path = 'add_your_dataset_path_here'

# Load the dataset into a DataFrame
df = pd.read_excel(file_path)

# Display the first few rows of the dataset
print(df.head())

# --- Descriptive Statistics for 'SALARIO' (salary) ---

# Generate descriptive statistics for the 'salario' column
print("Descriptive statistics for 'SALARIO':")
print(df.salario.describe())

# Calculate the range (amplitude) of the 'salario' column
ampl_salario = df['salario'].max() - df['salario'].min()
print("\nAmplitude of 'SALARIO':", ampl_salario)

# Calculate the mode of the 'salario' column
moda_salario = df.salario.mode()[0]
print("\nMode of 'SALARIO':", moda_salario)

# Calculate the variance of the 'salario' column
var_salario = df.salario.var()
print("\nVariance of 'SALARIO':", var_salario)

# Calculate the coefficient of variation (CV) for 'salario'
cv_salario = df.salario.std() / df.salario.mean()
print("\nCoefficient of variation (CV) of 'SALARIO':", cv_salario)

# --- Descriptive Statistics for 'SALARIO' by 'grau_instrucao' (education level) ---

# Descriptive statistics grouped by education level
print("\nDescriptive statistics for 'SALARIO' grouped by 'grau_instrucao':")
print(df.groupby('grau_instrucao')['salario'].describe())

# Range (amplitude) of 'salario' by 'grau_instrucao'
ampl_salario_grau = (df.groupby('grau_instrucao')['salario'].max()
                     - df.groupby('grau_instrucao')['salario'].min())
print("\nAmplitude of 'SALARIO' by 'grau_instrucao':")
print(ampl_salario_grau)

# Mode of 'salario' by 'grau_instrucao'
moda_salario_grau = df.groupby('grau_instrucao')['salario'].agg(lambda x: pd.Series.mode(x)[0])
print("\nMode of 'SALARIO' by 'grau_instrucao':")
print(moda_salario_grau)

# Variance of 'salario' by 'grau_instrucao'
var_salario_grau = df.groupby('grau_instrucao')['salario'].var()
print("\nVariance of 'SALARIO' by 'grau_instrucao':")
print(var_salario_grau)

# Coefficient of variation (CV) of 'salario' by 'grau_instrucao'
cv_salario_grau = (df.groupby('grau_instrucao')['salario'].std()
                   / df.groupby('grau_instrucao')['salario'].mean())
print("\nCoefficient of variation (CV) of 'SALARIO' by 'grau_instrucao':")
print(cv_salario_grau)

# Summary of key descriptive statistics
print("\nSummary for 'SALARIO' as a whole:")
print(f"\nAmplitude: {ampl_salario}")
print(f"\nMode: {moda_salario}")
print(f"\nVariance: {var_salario}")
print(f"\nCoefficient of variation (CV): {cv_salario}")

print("\nSummary by 'grau_instrucao':")
print(f"\nAmplitude by 'grau_instrucao': \n{ampl_salario_grau}")
print(f"\nMode by 'grau_instrucao': \n{moda_salario_grau}")
print(f"\nVariance by 'grau_instrucao': \n{var_salario_grau}")
print(f"\nCoefficient of variation (CV) by 'grau_instrucao': \n{cv_salario_grau}")
```


2. Sample Selection

👇 Copy code

```python
# Import pandas and numpy libraries
import pandas as pd
import numpy as np

# Define the file path
file_path = 'add_your_dataset_path_here'

# Read the Excel file into a DataFrame
df = pd.read_excel(file_path)

# --- Sample Selection ---

# Simple random sample without replacement with 20 elements
sample = df.sample(20, replace=False)
print(sample)

# Simple random sample without replacement with 20 elements (fixing the random seed)
sample = df.sample(n=20, replace=False, random_state=2903)
print(sample)

# Check the classes of the variable and their proportions
perc_est_civ = df["estado_civil"].value_counts(normalize=True)
print(perc_est_civ)

# Equal stratified sample by marital status
sample_strat_equal = df.groupby(['estado_civil'], group_keys=False).apply(
    lambda x: x.sample(n=10, replace=False, random_state=2903))
print(sample_strat_equal)

# Define the desired total sample size
N = 20

# Proportional stratified sample by marital status
sample_strat_prop = df.groupby(['estado_civil'], group_keys=False).apply(
    lambda x: x.sample(int(np.rint(N * len(x) / len(df))), random_state=2903)  # proportional allocation
).sample(frac=1, random_state=2903).reset_index(drop=True)  # shuffle the sample
print(sample_strat_prop)

# Check the proportion of each marital status in the sample
perc_est_civ = sample_strat_prop["estado_civil"].value_counts(normalize=True)
print(perc_est_civ)

# Equal stratified sample by region of origin
sample_strat_equal = df.groupby(['reg_proc'], group_keys=False).apply(
    lambda x: x.sample(n=10, replace=False, random_state=2903))
print(sample_strat_equal)

# Equal stratified sample by education level
sample_strat_equal = df.groupby(['grau_instrucao'], group_keys=False).apply(
    lambda x: x.sample(n=10, replace=False, random_state=2903))
print(sample_strat_equal)

# Equal stratified sample by region of origin and education level
sample_strat_equal = df.groupby(['reg_proc', 'grau_instrucao'], group_keys=False).apply(
    lambda x: x.sample(n=10, replace=False, random_state=2903))
print(sample_strat_equal)

# Equal stratified sample by education level and region of origin,
# capping each stratum at its available size
sample_strat_equal = df.groupby(['grau_instrucao', 'reg_proc'], group_keys=False).apply(
    lambda x: x.sample(n=min(len(x), 10), replace=False, random_state=2903)).reset_index(drop=True)
print(sample_strat_equal)

# Save the stratified sample to a new Excel file
output_path = 'path_to_save_sample/stratified_sample.xlsx'
sample_strat_equal.to_excel(output_path, index=False)

print(f"The stratified sample has been saved to {output_path}")
```
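The allocation inside the proportional stratified sample above rounds N·(stratum size)/(population size) per stratum. With hypothetical stratum sizes (made up for illustration) the arithmetic looks like this:

```python
import numpy as np

N = 20                         # desired total sample size
strata = {'s': 16, 'c': 20}    # hypothetical stratum sizes
total = sum(strata.values())   # population size (36 here)

# Same rounding rule used in the stratified sampling code above
alloc = {k: int(np.rint(N * v / total)) for k, v in strata.items()}
print(alloc)  # {'s': 9, 'c': 11}
```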


3. One-Sample t-Test

👇 Copy code

```python
# Exercise 3 – Test the hypothesis that the salary is equal to 12.
# What is your conclusion?

# Import libraries
import pandas as pd
import scipy.stats as stats

# File path
file_path = 'add_your_dataset_path_here'

# Load the file into Python
df = pd.read_excel(file_path)
print(df.head())

# Bring only the age variable to perform the test
base_age = df['idade']

# Execute the t-test: H0: mean age = 32 vs. H1: mean age != 32
result_t_test = stats.ttest_1samp(base_age, 32)
p_value = result_t_test.pvalue
alpha = 0.05

if p_value < alpha:
    print("We reject the null hypothesis (H0).")
else:
    print("We do not reject the null hypothesis (H0).")

# Remember the mean
print(f"Mean age: {df.idade.mean()}")

# Execute the t-test: H0: mean age = 34 vs. H1: mean age != 34
result_t_test = stats.ttest_1samp(base_age, 34)
p_value = result_t_test.pvalue

# Decision based on p-value
if p_value < alpha:
    print("We reject the null hypothesis (H0).")
else:
    print("We do not reject the null hypothesis (H0).")

# With a p-value of 0.045 (< 0.05) we reject H0: there is statistically
# significant evidence that the average age differs from 34.

# Execute the t-test: H0: mean age = 35 vs. H1: mean age != 35
result_t_test = stats.ttest_1samp(base_age, 35)
p_value = result_t_test.pvalue

# Decision based on p-value
if p_value < alpha:
    print("We reject the null hypothesis (H0).")
else:
    print("We do not reject the null hypothesis (H0).")

# With a p-value of 0.2234 (> 0.05) we do not reject H0: there is not
# enough evidence that the average age differs from 35.

# --- Answering question 3: test the hypothesis that the salary equals 12 ---

# Select the variable of interest (salary)
salaries = df['salario']

# Execute the t-test: H0: mean salary = 12 vs. H1: mean salary != 12
result_t_test = stats.ttest_1samp(salaries, 12)

# Get the p-value from the test result
p_value = result_t_test.pvalue

# Define the conclusion based on the p-value and the significance level
if p_value < alpha:
    conclusion = "We reject the null hypothesis (H0)."
else:
    conclusion = "We do not reject the null hypothesis (H0)."

# Display the test result and conclusion
print(f"t-test statistic: {result_t_test.statistic}")
print(f"p-value: {result_t_test.pvalue}")
print(conclusion)
```

Analysis and Conclusion

In the hypothesis test performed, we tested the null hypothesis (H0) that the average salary of employees is equal to 12 against the alternative hypothesis (H1) that the average salary is different from 12.

The results of the t-test were as follows:

- t-test statistic: -4.500727991746298

- p-value: 8.755117588192511e-06

With a p-value of approximately 8.76e-06, well below the significance level of 0.05, we reject the null hypothesis (H0) and accept the alternative hypothesis (H1): there is statistically significant evidence that the average salary of employees is not equal to 12.
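As a sanity check, the one-sample t statistic can be reproduced from its definition, t = (x̄ − μ₀) / (s / √n). A quick sketch on synthetic ages (made up for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic ages, for illustration only
ages = np.array([28, 31, 35, 40, 33, 29, 37, 44, 30, 36])
mu0 = 32  # hypothesized population mean

# Manual computation of the t statistic
n = len(ages)
t_manual = (ages.mean() - mu0) / (ages.std(ddof=1) / np.sqrt(n))

# scipy's version of the same test
t_scipy, p_scipy = stats.ttest_1samp(ages, mu0)
print(t_manual, t_scipy)  # the two statistics agree
```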

4. Two-Sample t-Test

👇 Copy code

To test the hypothesis that income is equal for the two marital statuses (single and married), we use the t-test for independent samples:

  1. Import the necessary libraries (pandas and scipy.stats).
  2. Load the data from the Excel file.
  3. Extract the income columns for single and married individuals.
  4. Perform the t-test for independent samples.
  5. Interpret the result based on the p-value.

```python
# Install scipy and pandas if necessary (in a notebook):
# %pip install scipy pandas

import pandas as pd
from scipy import stats

# Load the data from the Excel file
file_path = 'add_your_dataset_path_here'
df = pd.read_excel(file_path)

# Visualize the first rows of the DataFrame
print(df.head())

# Check the mean salary by marital status group
print(df.groupby(['estado_civil'])['salario'].describe())

# Extract income columns for single and married individuals
single_income = df[df['estado_civil'] == 's']['salario']
married_income = df[df['estado_civil'] == 'c']['salario']

# Perform Welch's t-test for independent samples (unequal variances)
t_stat, p_value = stats.ttest_ind(married_income, single_income, equal_var=False)

# Display the results of the t-test
print("Results of the t-Test:")
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Conclusion: We reject the null hypothesis. The incomes are different for the two marital statuses.")
else:
    print("Conclusion: We do not reject the null hypothesis. The incomes are equal for the two marital statuses.")
```

Results of the t-Test:

  • t-statistic: 4.567472731259726
  • p-value: 6.527014259249644e-06

The p-value is extremely small (6.53e-06), far below the usual significance level of 0.05, indicating a significant difference in income between the two marital statuses.

Conclusion: We reject the null hypothesis. The incomes are different for the two marital statuses (single and married).
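The equal_var=False used above selects Welch's variant of the test, which does not assume the two groups share a variance. With unequal group sizes and spreads, the two variants give noticeably different results (synthetic data, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(10.0, 1.0, size=30)   # tight group
b = rng.normal(11.0, 5.0, size=15)   # smaller, widely spread group

t_student, p_student = stats.ttest_ind(a, b, equal_var=True)   # pooled-variance t-test
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)      # Welch's t-test
print(p_student, p_welch)  # the p-values differ
```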


5. One-Way ANOVA

👇 Copy code

```python
# Import necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# File path
file_path = 'add_your_dataset_path_here'

# Load the data into Python
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print(df.head())

# Check the average salary by education level
print("Average salary by education level:")
print(df.groupby(['grau_instrucao'])['salario'].describe())

# Create a model to compare salary by education level
model = ols('salario ~ grau_instrucao', data=df).fit()

# Perform ANOVA
anova_result = sm.stats.anova_lm(model)
print("ANOVA Results:")
print(anova_result)

# Interpret the results
alpha = 0.05
p_value = anova_result['PR(>F)'].iloc[0]  # p-value for grau_instrucao (first row)
if p_value < alpha:
    conclusion_anova = "There is a significant difference in salaries among different education levels."
else:
    conclusion_anova = "There is no significant difference in salaries among different education levels."
print(f"Conclusion from ANOVA: {conclusion_anova}")

# Post hoc Tukey test to evaluate specific differences
tukey = pairwise_tukeyhsd(endog=df.salario, groups=df.grau_instrucao)
print("Post Hoc Tukey Test Results:")
print(tukey.summary())

# Interpret Tukey results
print("Interpreting Tukey's test results:")
for result in tukey.summary().data[1:]:  # skip the header row
    group1, group2, meandiff, p_adj, lower, upper, reject = result
    if reject:
        print(f"Significant difference between {group1} and {group2}: "
              f"mean difference = {meandiff:.4f}, p-adj = {p_adj:.4f}")
    else:
        print(f"No significant difference between {group1} and {group2}: "
              f"mean difference = {meandiff:.4f}, p-adj = {p_adj:.4f}")

# Overall conclusion
print("Overall Conclusion:")
print("The results indicate that salary is significantly affected by education level, "
      "with higher education corresponding to higher salaries.")
```


6. Two-Way ANOVA

👇 Copy code

```python
# Import necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# File path
file_path = 'add_your_dataset_path_here'

# Load the data into Python
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print("Initial DataFrame:")
print(df.head())

# Check the average salary by education level and marital status
print("\nAverage salary by education level and marital status:")
print(df.groupby(['grau_instrucao', 'estado_civil'])['salario'].describe())

# Generate the model to compare salary by education level and marital status
model = ols('salario ~ grau_instrucao + estado_civil', data=df).fit()

# Apply ANOVA
anova_result = sm.stats.anova_lm(model)
print("\nANOVA Results:")
print(anova_result)

# Interpret the results
alpha = 0.05
p_value_instrucao = anova_result['PR(>F)']['grau_instrucao']
p_value_civil = anova_result['PR(>F)']['estado_civil']

if p_value_instrucao < alpha:
    conclusion_instrucao = "There is a significant difference in salaries among different education levels."
else:
    conclusion_instrucao = "There is no significant difference in salaries among different education levels."

if p_value_civil < alpha:
    conclusion_civil = "There is a significant difference in salaries among different marital statuses."
else:
    conclusion_civil = "There is no significant difference in salaries among different marital statuses."

print(f"\nConclusion from ANOVA for Education Level: {conclusion_instrucao}")
print(f"Conclusion from ANOVA for Marital Status: {conclusion_civil}")

# Post hoc Tukey test to evaluate specific differences for marital status
print("\nPost Hoc Tukey Test Results for Marital Status:")
tukey_estado_civil = pairwise_tukeyhsd(endog=df.salario, groups=df.estado_civil)
print(tukey_estado_civil.summary())

# Post hoc Tukey test to evaluate specific differences for education level
print("\nPost Hoc Tukey Test Results for Education Level:")
tukey_instrucao = pairwise_tukeyhsd(endog=df.salario, groups=df.grau_instrucao)
print(tukey_instrucao.summary())

# Overall conclusion
print("\nOverall Conclusion:")
print("The results indicate that salary is significantly affected by both education level and marital status.")
```


7. Two-way ANOVA with Interaction

👇 Copy code

```python
# Import necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# File path
file_path = 'add_your_dataset_path_here'

# Load the data into Python
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print("Initial DataFrame:")
print(df.head())

# Check the average salary by education level and marital status
print("\nAverage salary by education level and marital status:")
print(df.groupby(['grau_instrucao', 'estado_civil'])['salario'].describe())

# Generate the model comparing salary by education level and marital status,
# including their interaction
model = ols('salario ~ grau_instrucao * estado_civil', data=df).fit()

# Apply ANOVA
anova = sm.stats.anova_lm(model)
print("\nANOVA Results:")
print(anova)

# Interpret the results
alpha = 0.05
p_value_instrucao = anova['PR(>F)']['grau_instrucao']
p_value_civil = anova['PR(>F)']['estado_civil']
p_value_interaction = anova['PR(>F)']['grau_instrucao:estado_civil']

# Conclusions from ANOVA
conclusions = {
    "Grau de Instrução": "significant" if p_value_instrucao < alpha else "not significant",
    "Estado Civil": "significant" if p_value_civil < alpha else "not significant",
    "Interação": "significant" if p_value_interaction < alpha else "not significant",
}

for factor, result in conclusions.items():
    print(f"Conclusion from ANOVA for {factor}: There is a {result} effect.")

# Post hoc Tukey test for marital status
print("\nPost Hoc Tukey Test Results for Marital Status:")
tukey_estado_civil = pairwise_tukeyhsd(endog=df.salario, groups=df.estado_civil)
print(tukey_estado_civil.summary())

# Post hoc Tukey test for education level
print("\nPost Hoc Tukey Test Results for Education Level:")
tukey_instrucao = pairwise_tukeyhsd(endog=df.salario, groups=df.grau_instrucao)
print(tukey_instrucao.summary())

# Overall conclusion
print("\nOverall Conclusion:")
print("The results indicate that salary is significantly affected by both education level and marital status,")
print("and there is also a significant interaction between these two factors.")
```


8. Chi-Square Test for One Variable

👇 Copy code

```python
# Import necessary libraries
import pandas as pd
import scipy.stats as stats

# File path
file_path = 'add_your_dataset_path_here'

# Load the data into a DataFrame
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print("Initial DataFrame:")
print(df.head())

# Frequency table for the variable 'reg_proc'
freq_reg_proc = df['reg_proc'].value_counts()
print("\nFrequency of the 'reg_proc' variable:")
print(freq_reg_proc)

# Perform the chi-square goodness-of-fit test
# (H0: the regions of origin occur with equal frequency)
chi2_stat, p_val = stats.chisquare(freq_reg_proc)
print("\nChi-Square Test Results:")
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"p-value: {p_val}")

# Interpretation and conclusion
alpha = 0.05  # significance level
if p_val < alpha:
    print("Reject the null hypothesis. The regions of origin do not occur with equal frequency in the sample.")
else:
    print("Fail to reject the null hypothesis. The regions of origin occur with equal frequency in the sample.")
```
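By default stats.chisquare, as used above, compares the observed counts against equal expected frequencies; a different expectation can be supplied through f_exp. The counts and proportions below are made up for illustration:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 12, 6])                       # hypothetical region counts
expected = np.array([0.5, 0.3, 0.2]) * observed.sum()  # assumed proportions -> [18, 10.8, 7.2]

# Goodness-of-fit test against the assumed (non-uniform) proportions
chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)  # small chi2, large p -> no evidence against the assumed proportions
```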


9. Chi-Square Test for Independence of Two Variables

👇 Copy code

```python
# Import necessary libraries
import pandas as pd
import scipy.stats as stats

# File path
file_path = 'add_your_dataset_path_here'

# Load the data into a DataFrame
df = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print("Initial DataFrame:")
print(df.head())

# Create a contingency table for 'grau_instrucao' and 'reg_proc'
contingency_table = pd.crosstab(df['grau_instrucao'], df['reg_proc'])
print("\nContingency Table:")
print(contingency_table)

# Perform the chi-square test of independence
chi2_stat, p_val, dof, expected = stats.chi2_contingency(contingency_table)
print("\nResults of the Chi-Square Test of Independence:")
print(f"Chi-Square Statistic: {chi2_stat}")
print(f"p-value: {p_val}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Interpretation and conclusion
alpha = 0.05  # significance level
if p_val < alpha:
    print("\nConclusion: Reject the null hypothesis. The distribution of education levels varies according to the region of origin.")
else:
    print("\nConclusion: Fail to reject the null hypothesis. The distribution of education levels does not vary according to the region of origin.")
```



Back to Top

Copyright 2024 Fabiana Campanari. Code released under the MIT license.

Owner

  • Name: Fabiana ⚡️ Campanari
  • Login: FabianaCampanari
  • Kind: user
  • Location: Brazil 🇧🇷
  • Company: @Mindful-AI-Assistants | @Quantum-Software-Development

🇶 AI/ML Dev · Data Scientist (Humanistic AI) · Software & Design · Psych Grad · Quantum Mindset · 🕸️ Seeker of the Unknown 𝚿

GitHub Events

Total
  • Issues event: 9
  • Watch event: 2
  • Delete event: 135
  • Push event: 202
  • Pull request event: 269
  • Create event: 135
Last Year
  • Issues event: 9
  • Watch event: 2
  • Delete event: 135
  • Push event: 202
  • Pull request event: 269
  • Create event: 135

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 257
  • Total Committers: 2
  • Avg Commits per committer: 128.5
  • Development Distribution Score (DDS): 0.004
Past Year
  • Commits: 257
  • Committers: 2
  • Avg Commits per committer: 128.5
  • Development Distribution Score (DDS): 0.004
Top Committers
Name Email Commits
Fabiana 🚀 Campanari f****i@g****m 256
dependabot[bot] 4****] 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 8
  • Total pull requests: 295
  • Average time to close issues: 4 minutes
  • Average time to close pull requests: about 3 hours
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.01
  • Merged pull requests: 292
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 8
  • Pull requests: 295
  • Average time to close issues: 4 minutes
  • Average time to close pull requests: about 3 hours
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.01
  • Merged pull requests: 292
  • Bot issues: 0
  • Bot pull requests: 4
Top Authors
Issue Authors
  • FabianaCampanari (9)
Pull Request Authors
  • FabianaCampanari (330)
  • dependabot[bot] (3)
  • imgbot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (3) github_actions (2)

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v4 composite
  • github/codeql-action/analyze v3 composite
  • github/codeql-action/init v3 composite
.github/workflows/publish-python-package.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/pytest_test.yml actions
.github/workflows/python-package.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
requirements.txt pypi
  • matplotlib ==3.4.3
  • numpy ==1.22.0
  • pandas ==1.3.3
  • scikit-learn ==1.5.0
  • scipy ==1.7.1
  • seaborn ==0.11.2
  • statsmodels ==0.13.1