basic_statistics

Repository for teaching basics of statistics for machine learning

https://github.com/neelsoumya/basic_statistics

Keywords

confidence-intervals data-science datascience datascience-machinelearning datasciencebasics lecture lecture-notes lectures lectures-slides machine-learning precision recall rob-tibshirani statistics teaching teaching-materials teaching-statistics teaching-tools topics

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: neelsoumya
Language: R
Default Branch: master
Homepage: https://sites.google.com/site/neelsoumya/research-resources/basic-statistics
Size: 10.2 MB

Statistics

Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 15

Created almost 6 years ago · Last pushed 6 months ago

Metadata Files

Readme Citation

README.md

basic_statistics

This is a repository for teaching the basics of statistics for data science and machine learning. It is intended for use in an introductory data science class.

This material can also be used by working professionals or lay people who want to learn the basics of data science, statistics and machine learning.

Type 1 errors, Type 2 errors and p value
- https://youtu.be/Hdbbx7DIweQ
- Shiny app to explain p-value using coin toss
- p-value = probability of observing data at least as extreme as the data actually observed, if the null hypothesis is true
- https://sb2333medschl.shinyapps.io/pvalueexplanationshiny/
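The coin-toss idea can be sketched in a few lines of R. This is a minimal illustration, assuming a hypothetical experiment of 100 tosses that gave 60 heads (these numbers are made up and are not taken from the Shiny app). It computes the two-sided p-value under the null hypothesis of a fair coin with `binom.test`, then reproduces it by hand as the probability of a result at least as extreme as the one observed.

```r
# Hypothetical data: 60 heads in 100 tosses (illustrative numbers only)
n_tosses <- 100
n_heads <- 60

# Exact binomial test of the null hypothesis P(heads) = 0.5
coin_test <- binom.test(n_heads, n_tosses, p = 0.5, alternative = "two.sided")
p_value <- coin_test$p.value

# By hand: P(X >= 60) under the null, doubled because a fair coin is
# symmetric, so 40 or fewer heads is just as extreme as 60 or more.
# pbinom(..., lower.tail = FALSE) gives P(X > q), hence q = n_heads - 1.
p_manual <- 2 * pbinom(n_heads - 1, n_tosses, prob = 0.5, lower.tail = FALSE)

print(p_value)
print(p_manual)
```

The two numbers agree. A p-value just above 0.05 here means that 60 or more heads (or 40 or fewer) would turn up in roughly 1 in 18 such experiments even with a perfectly fair coin.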
Power and Type 2 error
- https://www.youtube.com/watch?v=6_Cuz0QqRWc
- https://www.khanacademy.org/math/ap-statistics/tests-significance-ap
p value
- https://www.youtube.com/watch?v=5Z9OIYA8He8
- https://www.youtube.com/watch?v=yzQHONabWhs&list=PLOg0ngHtcqbPTlZzRHA2ocQZqB1D_qZ5V&index=10
q value and false discovery rate
- https://www.youtube.com/watch?v=S268k-DWRrE
- https://www.youtube.com/watch?v=K8LQSvtjcEo
- CONCEPT: look at the distribution of p-values. The q-value of a test is the expected fraction of false positives among all tests called significant at that test's threshold (that is, all tests with a p-value at or below it).
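A small simulation can make the false discovery rate concrete. The sketch below assumes 900 tests where the null is true and 100 tests with a real effect; the counts and the beta distribution used to generate small p-values are illustrative choices. It uses Benjamini-Hochberg adjusted p-values from base R's `p.adjust`, which are a common stand-in for q-values. Storey's q-value additionally estimates the proportion of true nulls and is not computed here.

```r
set.seed(1)  # reproducible simulation

# Simulated p-values (illustrative): nulls are uniform, real effects pile up near 0
n_null <- 900
n_alt <- 100
p_values <- c(runif(n_null), rbeta(n_alt, shape1 = 0.1, shape2 = 5))
is_null <- c(rep(TRUE, n_null), rep(FALSE, n_alt))

# Distribution of p-values in bins of width 0.1:
# a flat background from the nulls plus a spike in the first bin
p_hist <- table(cut(p_values, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE))
print(p_hist)

# Benjamini-Hochberg adjusted p-values
q_values <- p.adjust(p_values, method = "BH")

naive_hits <- p_values <= 0.05  # threshold on raw p-values
fdr_hits <- q_values <= 0.05    # threshold on adjusted p-values

# Realised fraction of false positives among the tests called significant
false_fraction <- function(hits) {
  if (sum(hits) == 0) return(0)
  sum(hits & is_null) / sum(hits)
}

cat("raw p <= 0.05:", sum(naive_hits), "hits, false fraction",
    round(false_fraction(naive_hits), 3), "\n")
cat("BH  q <= 0.05:", sum(fdr_hits), "hits, false fraction",
    round(false_fraction(fdr_hits), 3), "\n")
```

Because the truth is known in a simulation, the realised fraction of false positives can be compared directly between the two thresholds.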
Power calculation
- https://youtu.be/6_Cuz0QqRWc
- power.t.test(n = NULL, power = .95, sd = 5, alternative = "two.sided", sig.level = 0.001, delta = 0.1)
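The `power.t.test` call above leaves `n = NULL`, so R solves for the sample size per group. As a hedged sketch with different, purely illustrative parameters (a difference of 1 unit, a standard deviation of 2, 5% significance and 80% power), the same function can be run in both directions: first solving for `n`, then fixing `n` and solving for power. The Type 2 error rate is one minus the power.

```r
# Solve for the sample size per group (n = NULL means "compute n")
res_n <- power.t.test(n = NULL, delta = 1, sd = 2, sig.level = 0.05,
                      power = 0.8, type = "two.sample",
                      alternative = "two.sided")
n_per_group <- ceiling(res_n$n)  # round up to whole participants

# Fix n and solve for power instead (power = NULL means "compute power")
res_power <- power.t.test(n = n_per_group, delta = 1, sd = 2, sig.level = 0.05,
                          power = NULL, type = "two.sample",
                          alternative = "two.sided")

# Type 2 error rate: probability of missing a real effect of size delta
type2_error <- 1 - res_power$power

print(n_per_group)
print(res_power$power)
print(type2_error)
```

Exactly one of `n`, `delta`, `sd`, `power` and `sig.level` must be `NULL`; that is the quantity the function solves for.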
Bias variance tradeoff
- https://www.youtube.com/watch?v=VaN1RUDuioQ&list=PLOg0ngHtcqbPTlZzRHA2ocQZqB1D_qZ5V&index=5
- http://scott.fortmann-roe.com/docs/BiasVariance.html
- VERY GOOD picture explanation
  - https://github.com/neelsoumya/basicstatistics/blob/master/biasvariance.png
- My lecture on the bias variance tradeoff
  - https://www.youtube.com/watch?v=4_la9-Ehvmo
Cross validation
- https://github.com/neelsoumya/basicstatistics/blob/master/Capturecrossvalidation.PNG
- https://github.com/neelsoumya/basicstatistics/blob/master/Capturecrossvalidation_split.PNG
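The splitting shown in the pictures can be sketched in base R. This is a minimal illustration of k-fold cross validation, assuming the built-in `mtcars` data, a linear model of `mpg` on `wt` and `hp`, and k = 5; none of these choices come from the repository. Each fold is held out once, the model is fitted on the remaining rows, and the error is measured on the held-out rows only.

```r
set.seed(42)  # reproducible fold assignment

k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of nearly equal size
fold_id <- sample(rep(1:k, length.out = n))

fold_rmse <- numeric(k)
for (fold in 1:k) {
  training <- mtcars[fold_id != fold, ]
  held_out <- mtcars[fold_id == fold, ]

  # Fit on the training folds only
  fit <- lm(mpg ~ wt + hp, data = training)

  # Evaluate on the fold the model has not seen
  pred <- predict(fit, newdata = held_out)
  fold_rmse[fold] <- sqrt(mean((held_out$mpg - pred)^2))
}

# Cross-validated estimate of prediction error
cv_rmse <- mean(fold_rmse)
print(fold_rmse)
print(cv_rmse)
```

The spread of `fold_rmse` across folds gives a rough sense of how much the error estimate depends on which rows happen to be held out.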
Confidence intervals
- How many standard deviations from the mean must you go to capture 95% of the scores? For a normal distribution, about 1.96.
- Computing a 95% confidence interval for the mean: mean +/- 1.96 * std / sqrt(number of samples)
  
  ci_alpha <- 0.05
  qnorm(ci_alpha / 2)        # lower critical value, about -1.96
  qnorm(1 - (ci_alpha / 2))  # upper critical value, about 1.96
  
  For a normal distribution, about 95% of the probability mass lies within 2 standard deviations of the mean (see video below)
  
  https://www.youtube.com/watch?v=hlM7zdf7zwU
  
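  bootstrapped confidence intervals using confint(x, method = 'boot') (the method = 'boot' option belongs to confint() for mixed models fitted with lme4; for a glm object, confint() computes profile-likelihood intervals)
  
  # simulate a data frame with two numeric and two categorical columns
  d <- data.frame(w = rnorm(100), x = rnorm(100),
                  y = factor(sample(LETTERS[1:2], 100, replace = TRUE)),
                  z = factor(sample(LETTERS[3:4], 100, replace = TRUE)))
  # do GLM on this new data frame (the response must be a factor or 0/1 for family = binomial)
  fm2 <- glm(y ~ w + x + z, data = d, family = binomial)
  confint(object = fm2, method = 'boot')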
  
  empirical 95% confidence interval in R, from a vector of AUC values (list_auc)
  lb = quantile(list_auc, 0.025)
  ub = quantile(list_auc, 0.975)
  mean_auc = mean(list_auc)
  
  the same empirical 95% confidence interval in Python (NumPy)
  lb = np.percentile(list_auc, 2.5)
  ub = np.percentile(list_auc, 97.5)
- meaning of confidence intervals
  - SUMMARY: if you repeat the experiment 100 times and compute a 95% confidence interval each time, about 95 of those intervals will contain the true value of the mean. This does not mean that, with 95% probability, the true mean lies in the one interval you computed.
- another explanation of confidence intervals by ISLR people (Rob Tibshirani)
  - https://www.youtube.com/watch?v=7TgVOK75EY&list=PLOg0ngHtcqbPTlZzRHA2ocQZqB1DqZ5V&index=8
  - https://www.coursera.org/learn/epidemiology/lecture/hzpDZ/confidence-intervals
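The formula, the empirical percentile interval and the "repeat the experiment" interpretation can be tied together in one short simulation. The sketch below assumes simulated normal data with a known true mean of 10 and standard deviation of 3, samples of size 50, 2000 bootstrap resamples and 1000 repeated experiments; all of these are illustrative choices.

```r
set.seed(123)  # reproducible simulation

# Simulated sample (illustrative): the true mean is known to be 10
true_mean <- 10
x <- rnorm(50, mean = true_mean, sd = 3)

# 1. Normal-approximation 95% CI: mean +/- 1.96 * sd / sqrt(n)
z <- qnorm(1 - 0.05 / 2)
se <- sd(x) / sqrt(length(x))
ci_normal <- c(mean(x) - z * se, mean(x) + z * se)

# 2. Bootstrap percentile CI: resample with replacement, recompute the mean,
#    then take the 2.5% and 97.5% quantiles of the resampled means
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
ci_boot <- quantile(boot_means, c(0.025, 0.975))

# 3. Interpretation: repeat the whole experiment many times and count how
#    often the interval from each repeat contains the true mean
covers <- replicate(1000, {
  s <- rnorm(50, mean = true_mean, sd = 3)
  half_width <- z * sd(s) / sqrt(length(s))
  abs(mean(s) - true_mean) <= half_width
})
coverage <- mean(covers)

print(ci_normal)
print(ci_boot)
print(coverage)
```

The coverage should land close to 0.95. It is the procedure that succeeds about 95% of the time; any single interval either contains the true mean or it does not.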
Precision and recall
- https://en.wikipedia.org/wiki/Precision_and_recall
- https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
- VERY GOOD pictures of precision, recall, confusion matrix, false positive, true positive, sensitivity and specificity
  - https://github.com/neelsoumya/basic_statistics/blob/master/Screen%20Shot%202020-07-16%20at%2011.12.44%20AM.png
  - https://github.com/neelsoumya/basicstatistics/blob/master/800px-Sensitivityand_specificity.svg.png
  - https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/
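The quantities in those pictures can be computed directly from a pair of label vectors. The following is a minimal sketch on ten made-up observations (the `actual` and `predicted` vectors are hypothetical), counting the four cells of the confusion matrix and deriving precision, recall (sensitivity) and specificity from them.

```r
# Hypothetical labels: 1 = positive, 0 = negative
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# The four cells of the confusion matrix
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

# Precision: of everything predicted positive, how much really is positive?
precision <- tp / (tp + fp)

# Recall (sensitivity): of all real positives, how many were found?
recall <- tp / (tp + fn)

# Specificity: of all real negatives, how many were correctly rejected?
specificity <- tn / (tn + fp)

# The same counts as a cross-tabulation
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)
print(c(precision = precision, recall = recall, specificity = specificity))
```

Precision and recall share the same numerator and differ only in the denominator: false positives for precision, false negatives for recall.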
Explanation of AUC (area under curve)
- https://github.com/neelsoumya/basicstatistics/blob/master/aucexplanation.png
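One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below, on eight made-up scores (the `labels` and `scores` vectors are hypothetical), computes it that way by comparing every positive-negative pair, and then again with the equivalent rank-based formula. No ROC package is assumed.

```r
# Hypothetical classifier output: 1 = positive, 0 = negative
labels <- c(1, 1, 1, 0, 1, 0, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Pairwise definition: fraction of (positive, negative) pairs in which the
# positive case scores higher; ties count as half
comparisons <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_pairs <- mean(comparisons)

# Equivalent rank-based formula (Mann-Whitney U statistic scaled to [0, 1])
r <- rank(scores)
n_pos <- length(pos)
n_neg <- length(neg)
auc_rank <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc_pairs)
print(auc_rank)
```

Here 15 of the 16 positive-negative pairs are ordered correctly, so both calculations give 0.9375. An AUC of 0.5 corresponds to random ordering and 1 to perfect separation.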
Linear models and interaction effects (by ISLR authors Trevor Hastie and Rob Tibshirani)
- https://www.youtube.com/watch?v=IFzVxLv0TKQ&list=PL5-da3qGB5IBSSCPANhTgrw82ws7w_or9&index=5
- Woes of interpreting regression coefficients
  - https://youtu.be/yzQHONabWhs?t=498
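A short example shows why coefficients become harder to read once an interaction is present. This sketch assumes the built-in `mtcars` data and a model of `mpg` on `wt` and `hp`; the data set and variables are illustrative choices, not taken from the lectures. With the interaction term included, the coefficient of `wt` is the slope only when `hp` is zero, and the slope at any other `hp` has to be computed from two coefficients.

```r
# Additive model: the effect of wt is assumed to be the same at every hp
fit_additive <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction model: wt * hp expands to wt + hp + wt:hp
fit_interaction <- lm(mpg ~ wt * hp, data = mtcars)
coefs <- coef(fit_interaction)
print(coefs)

# Slope of mpg with respect to wt at a given horsepower:
# d(mpg)/d(wt) = beta_wt + beta_wt:hp * hp
slope_wt_at <- function(hp) unname(coefs["wt"] + coefs["wt:hp"] * hp)
slope_low <- slope_wt_at(100)
slope_high <- slope_wt_at(200)
print(c(hp_100 = slope_low, hp_200 = slope_high))

# F-test comparing the two nested models: does the interaction help?
comparison <- anova(fit_additive, fit_interaction)
print(comparison)
```

Reading `coefs["wt"]` on its own as "the effect of weight" would be misleading here, because no car in the data has zero horsepower.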
ANOVA
- https://cambiotraining.github.io/stats-mixed-effects-models/materials/06-significance-and-model-comparison.html
- Code
  - https://github.com/neelsoumya/basicstatistics/blob/master/anovabasic.R
  - https://github.com/neelsoumya/basicstatistics/blob/master/anovapoliteness.R
Mixed effects models
- https://github.com/neelsoumya/basicstatistics/blob/master/mixedeffects_basics.Rmd

Owner

Name: Soumya Banerjee
Login: neelsoumya
Kind: user
Location: Cambridge, UK
Company: University of Cambridge

Website: https://sites.google.com/site/neelsoumya/
Repositories: 249
Profile: https://github.com/neelsoumya

My research interests are in complex systems data science, machine learning, computational biology, computational immunology and computational immunogenomics.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Banerjee"
  given-names: "Soumya"
  orcid: "https://orcid.org/0000-0001-7748-9885"
title: "basic_statistics"
version: 1.0.0
doi: 10.5281/zenodo.4743435 
date-released: 2021-09-05
url: "https://github.com/neelsoumya/basic_statistics"

GitHub Events

Total

Push event: 4

Last Year

Push event: 4

Committers

Last synced: 8 months ago

All Time

Total Commits: 105
Total Committers: 2
Avg Commits per committer: 52.5
Development Distribution Score (DDS): 0.01

Past Year

Commits: 3
Committers: 2
Avg Commits per committer: 1.5
Development Distribution Score (DDS): 0.333

Top Committers

Name	Email	Commits
Soumya Banerjee	n**a@g**m	104
soumyabanerjee	s**e@s**l	1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 44.0%
