https://github.com/amalan-constat/needs4bigdata

R package implementing subsampling methods to find informative samples from big data

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to researchgate.net
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.6%) to scientific vocabulary

Keywords

big-data cran experimental-design subsampling

Last synced: 5 months ago

Repository

R package implementing subsampling methods to find informative samples from big data

Basic Info

Host: GitHub
Owner: Amalan-ConStat
License: other
Language: R
Default Branch: main
Homepage: https://amalan-constat.github.io/NeEDS4BigData/
Size: 86.2 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

big-data cran experimental-design subsampling

Created almost 2 years ago · Last pushed 9 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment = "#>",collapse = TRUE, fig.retina=2, fig.path = "man/figures/",
                      out.width = "100%")
library(badger)
```

# NeEDS4BigData 



`r badge_cran_release("NeEDS4BigData")`
`r badge_cran_checks("NeEDS4BigData")`
`r badge_runiverse()`

`r badge_cran_download("NeEDS4BigData", "grand-total", "green")`
`r badge_cran_download("NeEDS4BigData", "last-month", "green")`
`r badge_cran_download("NeEDS4BigData", "last-week", "green")`

`r badge_repostatus("Active")`
`r badge_lifecycle("stable")`
[![GitHub issues](https://img.shields.io/github/issues/Amalan-ConStat/NeEDS4BigData.svg?style=popout)](https://github.com/Amalan-ConStat/NeEDS4BigData/issues)

[![codecov](https://codecov.io/gh/Amalan-ConStat/NeEDS4BigData/graph/badge.svg?token=UHFWYFPDSI)](https://codecov.io/gh/Amalan-ConStat/NeEDS4BigData)
`r badge_codefactor("Amalan-ConStat/NeEDS4BigData")`
`r badge_code_size("Amalan-ConStat/NeEDS4BigData")`

[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)
`r badge_doi("10.1007/s00362-023-01446-9", "green")`


_The R package "NeEDS4BigData" provides approaches to implement subsampling methods to analyse big data._

### What is “NeEDS4BigData” an abbreviation for?

*Ne*w *E*xperimental *D*esign based *S*ubsampling methods *for Big Data*.

### How to engage with "NeEDS4BigData" for the first time?

```{r NeEDS4BigData from GitHub or CRAN,eval=FALSE}
## Installing the package from GitHub
devtools::install_github("Amalan-ConStat/NeEDS4BigData")

## Installing the package from CRAN
install.packages("NeEDS4BigData")
```

### Subsampling Methods

1. A- and L-optimality based subsampling for GLMs.
2. A-optimality based subsampling for Gaussian Linear Models.
3. Leverage sampling for GLMs.
4. Local case control sampling for logistic regression.
5. A-optimality based subsampling under measurement constraints for GLMs.
6. Model robust subsampling method for GLMs.
7. Subsampling method for GLMs when the model is potentially misspecified.
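
To make the first item concrete, here is a minimal base-R sketch of two-step A- and L-optimality subsampling for logistic regression, in the form commonly given in the optimal subsampling literature. It does not call any NeEDS4BigData function and is not the package's implementation; the simulated data, the sizes `r0` and `r`, and the helper `fit_weighted()` are assumptions made for illustration only.

```r
# Base-R sketch of two-step A- and L-optimality subsampling for logistic
# regression. Simulated data; all names here are illustrative assumptions.
set.seed(1)
n  <- 20000   # "big data" size (kept small for the demo)
r0 <- 500     # pilot subsample size
r  <- 1000    # second-stage subsample size

X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept + 2 covariates
beta_true <- c(-1, 0.5, 0.5)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

# Step 1: a uniform pilot subsample gives a rough estimate of beta.
idx0  <- sample.int(n, r0)
beta0 <- glm.fit(X[idx0, ], y[idx0], family = binomial())$coefficients

# Step 2: subsampling probabilities computed from the pilot fit.
p_hat     <- plogis(drop(X %*% beta0))
w         <- p_hat * (1 - p_hat)
M_inv     <- solve(crossprod(X * w, X) / n)     # inverse information matrix
resid_abs <- abs(y - p_hat)

prob_A <- resid_abs * sqrt(rowSums((X %*% M_inv)^2))  # A-optimality
prob_A <- prob_A / sum(prob_A)
prob_L <- resid_abs * sqrt(rowSums(X^2))              # L-optimality
prob_L <- prob_L / sum(prob_L)

# Step 3: draw with replacement and fit with inverse-probability weights.
fit_weighted <- function(prob) {
  idx <- sample.int(n, r, replace = TRUE, prob = prob)
  wt  <- 1 / prob[idx]
  # quasibinomial gives the same point estimates as binomial but does not
  # warn about non-integer weights.
  glm.fit(X[idx, ], y[idx], weights = wt / mean(wt),
          family = quasibinomial())$coefficients
}

round(rbind(true  = beta_true,
            A_opt = fit_weighted(prob_A),
            L_opt = fit_weighted(prob_L)), 3)
```

The sketch keeps only the second-stage subsample in the final fit; published two-step procedures often pool the pilot and second-stage subsamples, which is omitted here for brevity.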

These seven methods are described in the following articles, grouped under these topics:

1. Introduction - explains the need for subsampling methods.
2. Model based subsampling
3. Model robust and misspecification
4. Benchmarking Functions

For topic 2, we assume that the main effects model can describe the data.
For topic 3, we first consider the case where several models can describe the big data, and then assume that the given main effects model is misspecified.
Under the conditions of topics 2 and 3, we explore subsampling for four given big data sets.
Further, to explore computation time, topic 4 reports simulations for the scenarios in topics 2 and 3, comparing our subsampling functions against full data modelling.
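
The kind of comparison made in the benchmarking topic can be sketched in base R as follows. This is only an illustration with simulated data and a uniform random subsample; it does not reproduce the package's benchmarking functions or its simulation settings, and the object names (`big_data`, `fit_full`, `fit_sub`) are hypothetical.

```r
# Time a full-data logistic regression against a fit on a uniform subsample.
set.seed(42)
n  <- 100000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 - 0.4 * x2))
big_data <- data.frame(y, x1, x2)

# Full-data model.
time_full <- system.time(
  fit_full <- glm(y ~ x1 + x2, family = binomial(), data = big_data)
)

# Subsample model: draw r rows without replacement, then fit.
r <- 2000
time_sub <- system.time({
  idx     <- sample.int(n, r)
  fit_sub <- glm(y ~ x1 + x2, family = binomial(), data = big_data[idx, ])
})

# Compare estimates and elapsed time (seconds); timings vary by machine.
rbind(full = coef(fit_full), subsample = coef(fit_sub))
c(full = time_full[["elapsed"]], subsample = time_sub[["elapsed"]])
```

A single run like this is noisy; repeating both fits many times and averaging gives a more reliable picture of the trade-off between speed and estimation accuracy.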

#### Thank You

[![Twitter](https://img.shields.io/twitter/url/https/github.com/Amalan-ConStat/NeEDS4BigData.svg?style=social)](https://twitter.com/intent/tweet?text=Wow:&url=https%3A%2F%2Fgithub.com%2FAmalan-ConStat%2FNeEDS4BigData)

[ ![](https://img.shields.io/badge/LinkedIn-Amalan%20Mahendran-black.svg?style=flat) ]( https://www.linkedin.com/in/amalan-mahendran-72b86b37/)
[ ![](https://img.shields.io/badge/Research%20Gate-Amalan%20Mahendran-black.svg?style=flat) ]( https://www.researchgate.net/profile/Amalan_Mahendran )

Owner

Name: M. Amalan
Login: Amalan-ConStat
Kind: user
Location: Kandy, Sri Lanka and Brisbane, Australia
Company: QUT

Website: https://amalan-con-stat.netlify.com/
Twitter: Amalan_Con_Stat
Repositories: 5
Profile: https://github.com/Amalan-ConStat

I am a statistician with experience in R statistical programming. Interests include R packages, R Markdown reports, R Shiny apps and #TidyTuesday.

GitHub Events

Total

Watch event: 1
Push event: 16

Last Year

Watch event: 1
Push event: 16

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
Rdpack * imports
Rfast * imports
dplyr * imports
foreach * imports
gam * imports
ggh4x * imports
ggplot2 * imports
matrixStats * imports
psych * imports
rlang * imports
stats * imports
tidyr * imports
doParallel * suggests
ggpubr * suggests
kableExtra * suggests
knitr * suggests
parallel * suggests
rmarkdown * suggests
spelling * suggests
testthat >= 3.0.0 suggests
