Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.8%) to scientific vocabulary
Repository
Interesting datasets for use in teaching Statistics
Basic Info
- Host: GitHub
- Owner: pmean
- Language: R
- Default Branch: main
- Size: 2.6 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
title: README file for data repository
data
This repository, data, stores various datasets that I use in my talks and my classes. It replaces an earlier repository, datasets. It does not include any datasets that I use in my private consulting practice. Those datasets are assumed to be owned by the client and would only be stored in a private repository.
I will eventually consolidate all the datasets that I have stored in various other repositories here. Some of the repositories that I need to move data from include
For many of the datasets in this repository, the copyright is not clearly stated anywhere that I could find. Use by individuals for educational purposes is probably acceptable under the Fair Use provisions of copyright law.
If you own the copyright to any of these datasets and wish to clarify the conditions under which they can be used, please contact me. If I am not allowed to store your data on my repository, I would be glad to remove your dataset.
Here are the datasets in this repository so far:
absorbent-paper : Foot progression angle (FPA) is typically measured using an expensive pressure platform, but this dataset examines a simpler approach using water, absorbent paper, and a goniometer. Thirteen children were assessed both on the left foot and the right foot. This was repeated two weeks later. In this study, all the assessments were negative, representing in-toeing gait.
aids-cases : This dataset shows the yearly number of AIDS cases from 1982 to 1988 in two provinces of Australia.
airline-bumping : This dataset shows the number of times passengers were "bumped" from an airline flight. Bumping occurs when an airline overbooks reservations for a flight. Normally, the airline will ask for volunteers who will agree to take a later flight in exchange for cash or travel vouchers. But if no one volunteers, airlines will involuntarily bump passengers from the flight.
albuquerque-housing : From the original source (no longer available): A random sample of records of resales of homes from Feb 15 to Apr 30, 1993 from the files maintained by the Albuquerque Board of Realtors. This type of data is collected by multiple listing agencies in many cities and is used by realtors as an information base.
art-malpresentations : Assisted reproductive technology (ART) has helped many infertile couples, but there are concerns about the risks during pregnancy and child birth. This paper found 11 studies that examined whether the risk of malpresentation (e.g., breech births) was higher in ART pregnancies compared to natural conception (NC).
back-pain-runners : The study recruited runners with lower back pain and found matching subjects among runners without lower back pain and among a sedentary group of adults.
bacterial-cultures : Five strains of Staphylococcus aureus were cultured under a variety of laboratory conditions. The goal is to see what conditions are optimal for growth.
balance-measures-tall : See balance1-data-dictionary.yaml
balance-measures-wide : The study examined balance for subjects on two different floor conditions (normal or foam) and with three different levels of sight (eyes open, eyes closed, or closed dome). Each of these conditions was replicated twice. The outcome measure was an ordinal variable on a scale from 1 to 4.
body-dimensions : Various body dimensions and body fat measurements
breast-feeding-preterm : This data comes from a research study done at Children's Mercy Hospital and St. Luke's Medical Center. This was a study of breast feeding in pre-term infants. Infants were randomized into either a treatment group (NG tube) or a control group (Bottle). Infants in the NG tube group were fed in the hospital via their nasogastral tube when the mother was not available for breast feeding. Infants in the bottle group received bottles when the mothers were not available. Both groups were monitored for six months after discharge from the hospital.
burger-calories : Fat, sodium, and calories for various fast food hamburgers.
caries-detection : These data represent the diagnostic performance of artificial intelligence models for caries detection in bitewing radiographs. Five studies were identified for a systematic overview.
carotid-plaque : The details on this dataset are a bit vague because the article associated with this data is hidden behind a paywall. Ultrasonography was used to identify carotid plaque and/or cerebrovascular disease. The imaging produced measures of Von Mises strain. The key measure is peak, which appears to be an average or a maximum over four or five cycles. There are separate measurements at systole and diastole.
cholesterol-after-heart-attack : Cholesterol Levels after Heart Attack
cigarette-measurements : Information on tar and nicotine for 11 brands of cigarettes.
collaborative-consumption : From the original source: "The data set presents data collected by online survey with a questionnaire using Likert scale. The survey sample included 184 adults (18+), active and potential users of different sharing services platforms."
cracker-bloat : This data set shows side effects of specially prepared diet crackers.
cracker-fiber : Dietary fiber
digital-citizenship-revised : This is a modification/simplification of the digital-citizenship data set. An aggregate variable, internet_sum, was computed as the sum of the 27 internet variables. The categories for Age were recoded down into three categories. Rows associated with Country=Other and United States were removed. Many other variables were removed.
digital-citizenship : From the original source: This dataset was used in our study that investigated the psychometric properties of the Digital Citizenship Scale (DCS), originally developed by Choi, Glassman and Cristol. 1915 responses were gathered in late 2018 and 1820 responses with valid data from three countries were analysed (Canada, n=817; Australia, n=589; and Slovenia, n=414). The dataset also included Big-Factor-Five personality data as well.
energy : Energy requirements while running, walking, and cycling. Other datasets named similarly are modifications of this data to illustrate various options for data input.
energy00 : Energy requirements while running, walking, and cycling. The datasets energy01, energy02, energy03, and energy04 are modifications of the original data to illustrate various options for data input.
energy01 : Energy requirements while running, walking, and cycling. This dataset and others named similarly are modifications of the original data to illustrate various options for data input. For more information, refer to energy00.yaml
energy02 : Energy requirements while running, walking, and cycling. This dataset and others named similarly are modifications of the original data to illustrate various options for data input. For more information, refer to energy00.yaml
energy03 : Energy requirements while running, walking, and cycling. This dataset and others named similarly are modifications of the original data to illustrate various options for data input. For more information, refer to energy00.yaml
energy04 : Energy requirements while running, walking, and cycling. This dataset and others named similarly are modifications of the original data to illustrate various options for data input. For more information, refer to energy00.yaml
exercise-programs : This dataset is used in a tutorial about interactions. A description from the original source: The dataset consists of data describing the amount of weight loss achieved by 900 participants in a year-long study of 3 different exercise programs, a jogging program, a swimming program, and a reading program which serves as a control activity. Researchers were interested in how the weekly number of hours subjects chose to exercise predicted weight loss.
fasting-turtles : Plasma Protein of Fasting Turtles. Four male and four female turtles had their plasma protein measured while they were well fed and after ten and twenty days of fasting.
fat : This dataset includes two measures of body fat (a quantity that is normally quite difficult to measure) along with some simpler measures of body size that could be used to predict body fat.
fev : Forced Expiratory Volume (FEV) in children. The data was collected in Boston in the 1970s.
fishing : The data was collected by Peter Drew and Matt Seidemann, statistics students at the Queensland University of Technology, in a subject taught by Dr Margaret Mackisack. They ran a simple experiment looking at factors affecting the distance that a fishing line was cast based on the rod, line, and sinker.
fly1 : This dataset provides a simple example of survival and censoring. It provides an intuitive explanation of estimation of survival probabilities.
fly2 : This dataset provides a simple example of survival and censoring. For more details, refer to the fly1 data dictionary.
fly3 : This dataset provides a simple example of survival and censoring. For more details, refer to the fly1 data dictionary.
fruitfly : Does access to mating affect the lifespan of fruitflies? This data shows the longevity of male fruitflies in the presence or absence of female fruitflies to mate with. Male fruitflies were housed with 0, 1, or 8 females. In some groups, the females were pregnant and thus not available for mating. There are two covariates, length of the thorax and percentage of time sleeping, that might also influence longevity.
full-moon-er-admissions : The data give the admission rates to the emergency room of a Virginia mental health clinic before, during and after the 12 full moons from August 1971 to July 1972.
gardasil : This data set shows information about young women who received the Gardasil shot. Of particular interest is the proportion of women who received all three shots.
gingko-memory : Patients were given an over-the-counter medication, gingko, or a placebo to see if this medication could improve memory in elderly adults. The measure of memory is not clearly defined.
glycyrrhizin : This data is from a meta-analysis of the effectiveness of glycyrrhizin, an herbal medicine that is thought to have some anti-allergenic and immune boosting effects. This study examined its use in treating chronic urticaria, a condition characterized by rashes and/or angioedema. The researchers found 24 randomized trials addressing this topic. The outcome was the total efficiency rate, defined as a binary outcome based on the Urticaria Symptom Score Reduction Index.
grace1000 : This dataset illustrates the use of time-varying covariates.
heroin : This dataset shows information about heroin addicts who are treated at two rehabilitation clinics. It is useful for showing (among other things) how to identify and control for the clinic factor which does not meet the assumptions of proportional hazards needed by the standard Cox regression model.
hiv-intervention : This is a longitudinal study of an intervention in adolescents aged 14-18 intended to increase the frequency of condom-protected sex. Subjects were allocated randomly to treatment or control groups. All were evaluated prior to the intervention, immediately after the intervention, and 6 months and 12 months after the intervention. The outcome variable is the logarithm-transformed frequency of condom-protected sex, log(Y+1).
leader : This data set shows how long political leaders of various countries stayed in power and is useful for (among other things) illustrating how to fit competing risk models based on the manner in which each leader lost power.
legionnaires-disease : Fictional data on bacteria counts before and after air conditioning maintenance.
litter-weights : Hypothetical data simulated to illustrate analysis issues associated with random litter effects.
module02-datasets : This is an R binary file that contains three dataframes (bump, fd, and sleep). See more information in the individual data dictionaries.
moon : This data set shows a perceptual experiment where subjects were asked to estimate a size ratio with their head level to the ground and then with their head elevated (in other words, looking upward). Although the objects being compared were the same size, almost all subjects overestimated the relative sizes. The hypothesis to be tested is whether the overestimation is greater with eyes level than with eyes elevated.
postural-sway : Postural sway is a measure of how well patients can balance. The postural sway was measured using a force plate in two groups of subjects, elderly or young. Sway was measured in the forward/back direction and in the side-to-side direction.
psychiatric-discharges : This dataset shows an example of left truncated data for a survival model.
quake : Depression scores measured before and at multiple time points after a major earthquake.
rat-litter : This dataset shows survival times for rats provided an unspecified medical treatment or a placebo. The experiment randomized three pups within an individual litter, one to the treatment group and two to the control group. This data is useful for (among other things) showing how to account for a litter effect in a Cox proportional hazards model.
samara-velocity : From the Ryan et al article: "In autumn, small winged fruit called samara fall off maple trees, spinning as they go. A forest scientist studied the relationship between how fast they fell and their "disk loading" (a quantity based on their size and weight). The samara disk loading is related to the aerodynamics of helicopters."
sharing : From the original source: "The data set presents data collected by online survey with a questionnaire using Likert scale. The survey sample included 184 adults (18+), active and potential users of different sharing services platforms."
singapore-diamond-prices : The data is intended to teach some lessons about regression models. The size of a diamond, as well as some categorical descriptors (color and clarity) are listed for 308 diamonds, along with their sales price.
sleep : This dataset has information about sleep patterns in 62 common mammals, along with other information that might help you understand what influences variations in sleep.
swim-speeds : This experiment was conducted by Kim Horsfall, Sue Hall and Simone Golik, statistics students at the Queensland University of Technology in a subject taught by Dr Margaret Mackisack. The students designed and conducted an experiment to determine the factors affecting the time to swim one lap of a 25m pool.
termites : Resins Rid Termites from Trees
titanic : The Titanic was a large cruise ship, the biggest of its kind in 1912. It was thought to be unsinkable, but when it set sail from England to America in its maiden voyage, it struck an iceberg and sank, killing many of the passengers and crew. You can get fairly good data on the characteristics of passengers who died and compare them to those that survived. The data indicate a strong effect due to age and gender, representing a philosophy of "women and children first" that held during the boarding of life boats.
transplant0 : This data dictionary provides information about a dataset that, for a variety of reasons, comes in four different files. Two of the files (transplant.txt and transplant1.csv) are identical except for formatting and are represented with one row per patient. Another two files (heart.csv and transplant2.csv) are also identical except for formatting and are represented using start-stop coding. For these two data sets, an extra row is needed for patients who switch the covariate pattern in the middle of the study.
transplant1 : Please refer to transplant0.yaml
transplant2 : Please refer to transplant0.yaml
transplant3 : Please refer to transplant0.yaml
two-small-dataframes : This is an R binary file that contains two dataframes (bump, fd). See the individual data dictionaries for more details.
vaccine-willingness : This paper identified seven studies that examined vaccine literacy (VL). Patients in these studies also stated their vaccine preference and were categorized as willing or unwilling.
whas100 : The data represents survival times for a 100 patient subset of data from the Worcester Heart Attack Study. You can find more information about this data set in Chapter 1 of Hosmer, Lemeshow, and May.
whas500 : The data represents survival times for a 500 patient subset of data from the Worcester Heart Attack Study. You can find more information about this data set in Chapter 1 of Hosmer, Lemeshow, and May.
wolf-river-pollution : Ten water samples were taken at three different depths in Wolf River. Two pollutants, Aldrin and HCB, were measured in each sample.
woodard : Information from a Wake County, NC database, taken in 2008. This is a random sample from the entire county.
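The transplant0 entry above mentions two layouts for the same data: one row per patient, and start-stop coding with an extra row for patients whose covariate pattern changes during the study. The following is a minimal sketch of that conversion in base R. The data frame and all of its column names (id, futime, died, wait_time) are hypothetical and are not taken from the transplant files in this repository; refer to transplant0.yaml for the actual variables. The sketch also assumes that any transplant occurs strictly before the end of follow-up.

```r
# Hypothetical one-row-per-patient data. Column names are illustrative
# only and do not come from the transplant files in this repository.
wide <- data.frame(
  id        = c(1, 2, 3),
  futime    = c(50, 80, 30),  # total follow-up time in days
  died      = c(1, 0, 1),     # 1 = died, 0 = censored
  wait_time = c(NA, 20, 10)   # day of transplant; NA = never transplanted
)

# Convert to start-stop coding: one row per interval during which the
# transplant covariate is constant.
to_start_stop <- function(d) {
  rows <- lapply(seq_len(nrow(d)), function(i) {
    p <- d[i, ]
    if (is.na(p$wait_time)) {
      # Covariate never changes: a single interval covers all follow-up.
      data.frame(id = p$id, start = 0, stop = p$futime,
                 transplant = 0, died = p$died)
    } else {
      # Covariate switches at wait_time: split follow-up into two rows.
      # The event can only be recorded at the end of the last interval.
      data.frame(id = p$id,
                 start = c(0, p$wait_time),
                 stop = c(p$wait_time, p$futime),
                 transplant = c(0, 1),
                 died = c(0, p$died))
    }
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

long <- to_start_stop(wide)
print(long)
```

Patients 2 and 3 each contribute two rows, so the three patients become five intervals. Data in this shape can typically be passed to a Cox model through `Surv(start, stop, died)` from the survival package.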
Owner
- Name: Steve Simon
- Login: pmean
- Kind: user
- Location: Leawood, KS
- Company: P.Mean Consulting
- Website: www.pmean.com
- Repositories: 3
- Profile: https://github.com/pmean
Teacher/consultant. I blog about Statistics, research ethics, and evidence based medicine. I also run 5K races, but very slowly.
Citation (CITATION)
Many of these datasets have copyright restrictions that require you to cite the original source if you use them. Refer to the corresponding data dictionaries for details. Please respect any restrictions that the owners of a dataset might require. The data dictionaries, all written by me (Steve Simon), are placed in the public domain. You are free to use any data dictionary without acknowledgement or credit. If you do wish to give credit, however, when you use any data dictionary, it would be appreciated. An example of appropriate credit would be "Thanks to Steve Simon (list the url of this repository) for sharing this material."
GitHub Events
Total
- Push event: 19
Last Year
- Push event: 19