visualization-better-with-r
https://github.com/shyamgupta196/visualization-better-with-r
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: shyamgupta196
- License: mit
- Language: Shell
- Default Branch: main
- Size: 2.74 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
readme.md
Visualize Better with R
Author: Shyam Gupta (ORCID)
Email: shyam.gupta@gesis.org
Affiliation: GESIS Leibniz Institute for the Social Sciences
Date: 2025-08-12
Learning Objective
I was working on a social science research project—trying to decode the relationship between people’s well-being and their social interactions. After spending hours collecting survey responses, I realized that raw numbers alone weren’t telling the entire story.
“I needed a better way to visualize my data to spot trends, patterns, and outliers.”
That is when my journey with data visualization in R began.
By completing this tutorial, you will:
- Understand the strengths and limitations of each visualization type.
- Be able to implement and customize these visualizations in `ggplot2` and related packages.
- Interpret results in the context of social science data exploration.
- Apply best practices to ensure clarity, accessibility, and reproducibility of your plots.
In this tutorial, I’ll walk you through the plots and visual techniques. I’ll share each plot, explaining why you might want to use them, what they reveal, and how to make them.
By the end of this tutorial, you’ll have a handy arsenal of 9 powerful visualization techniques perfect for social science data exploration. Let’s get started!
Target Audience
This tutorial is designed for:
- Social Science Researchers & Graduate Students: Those analyzing survey or observational data who need clear, reproducible visualizations to support hypotheses and publications.
- Data Analysts & Statisticians: Professionals exploring complex datasets in R who want to expand their toolbox with advanced plotting techniques.
- Academic Instructors & Educators: Teachers seeking structured examples to demonstrate best practices in data visualization to social science students.
- Interdisciplinary Teams & Policy Analysts: Practitioners who require accessible, publication-quality figures to communicate insights to stakeholders and decision-makers.
Computational Environment Setup
Ensure you have R (version 3.6.0 or higher) installed on your system.
You can download it from: https://cran.r-project.org/.
Install the required R packages by running:
```{r}
# Core data wrangling & plotting
install.packages("dplyr")
install.packages("tidyr")
install.packages("hrbrthemes")
install.packages("ggplot2")
# Specialized geoms
install.packages("ggbeeswarm")
install.packages("forcats")
# Time series utilities
install.packages("zoo")
# Correlation plotting
install.packages("corrplot")
# Mapping
install.packages("maps")
install.packages("viridis") # for color scales on maps
# (Optional) Color-blind-friendly palettes
install.packages("viridisLite")
```
Duration
- Reading & Setup: 5 minutes
- Code Walkthrough: 20–30 minutes
- Hands-on Practice: 10–15 minutes
Total: 35–50 minutes
Input Data
Generate a toy dataset with six numeric features and a group indicator.
```{r}
# 1. Load necessary libraries
library(dplyr)      # Data manipulation
library(tidyr)      # Data reshaping
library(ggplot2)    # Core plotting package
library(ggbeeswarm) # For geom_beeswarm()
```
```{r}
# 2. Generate synthetic dataset
set.seed(42)
n <- 100
diagnosis <- sample(c("Group A", "Group B"), n, replace = TRUE)

# Group-specific means and correlations
radius_mean <- rnorm(n, mean = ifelse(diagnosis == "Group A", 14, 15), sd = 0.7)
texture_mean <- rnorm(n, mean = ifelse(diagnosis == "Group A", 18, 22), sd = 2)
perimeter_mean <- radius_mean * 6 + rnorm(n, 0, 3)    # correlated with radius
area_mean <- radius_mean^2 * pi + rnorm(n, 0, 30)     # correlated with radius
concavity_mean <- rbeta(n, shape1 = ifelse(diagnosis == "Group A", 2, 4), shape2 = 10)
symmetry_mean <- rnorm(n, mean = ifelse(diagnosis == "Group A", 0.18, 0.22), sd = 0.03)

df <- data.frame(
  radius_mean = radius_mean,
  texture_mean = texture_mean,
  perimeter_mean = perimeter_mean,
  area_mean = area_mean,
  concavity_mean = concavity_mean,
  symmetry_mean = symmetry_mean,
  diagnosis = diagnosis
)
```
```{r}
# 3. Reshape data from wide to long format
# Data before reshaping
head(df)

# pivot_longer(): collapse multiple feature columns into key-value pairs
df_long <- df %>%
  pivot_longer(
    cols = radius_mean:symmetry_mean,
    names_to = "feature",
    values_to = "value"
  )

# Data after reshaping
head(df_long)
```
1. Swarm Plots over Scatter Plots
The Discovery:
My first big “aha” moment came when I had to compare numerical data (like response scores) across different groups (like age brackets). I found that regular scatter plots overlapped points too much—making it hard to see the data distribution.
The Hero (Swarm Plot):
A swarm plot is useful when you want to display the distribution of a numerical variable across different categories without losing individual data points. Unlike simple scatter plots, swarm plots arrange points to avoid overlap, helping you see each observation more clearly.
When to Use:
- You have a moderate number of data points (not too large).
- You want to see each individual data point by category.
- You want a more “transparent” view of the distribution than a box or violin plot alone can provide.
```{r}
# Create a swarm plot
# - aes(x = feature, y = value): map feature names on x-axis and values on y-axis
# - color by diagnosis group

# Swarm plot
swarm_plot <- ggplot(df_long, aes(x = feature, y = value, color = diagnosis)) +
  geom_beeswarm(alpha = 0.8, size = 2) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Swarm Plot: Distribution of Mean Features by Diagnosis Group",
    subtitle = "Swarm plot reveals every individual observation without overlap",
    x = "Feature", y = "Value", color = "Diagnosis Group"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )
print(swarm_plot)

# Scatter plot for comparison
scatter_plot <- ggplot(df_long, aes(x = feature, y = value, color = diagnosis)) +
  geom_point(alpha = 0.8, size = 2, position = position_jitter(width = 0)) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Scatter Plot: Distribution of Mean Features by Diagnosis Group",
    subtitle = "Scatter plot suffers from overplotting: points overlap and hide density",
    x = "Feature", y = "Value", color = "Diagnosis Group"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )
print(scatter_plot)
```

Why choose a swarm plot over a scatter plot
Swarm plots are preferred over scatter plots when visualizing categorical groupings of continuous data because they prevent overplotting: each data point is visible and not hidden behind others. In scatter plots, especially with many overlapping points, it's hard to assess the true distribution or spot clusters and outliers. Swarm plots arrange points to minimize overlap, making the sample size, density, and group differences much clearer at a glance.
Explanation & Interpretation
- `geom_beeswarm()`: Positions points in a compact, non-overlapping arrangement. Useful for moderate sample sizes (<200).
- Transparency (`alpha`): Helps reveal point density when slight overlap occurs.
Interpretation: In the resulting plot, clusters of points indicate where many observations fall. Differences in the vertical spread between Group A and Group B highlight variability in that feature.
Customization Tips
- Point shape & size: use `shape=` and `size=` inside `geom_beeswarm()`.
- Color palettes: integrate `scale_color_brewer(palette = "Set1")` for high contrast, or `scale_color_viridis_d()` for a colorblind-friendly palette.
- Grouping multiple categories: for more than two groups, ensure contrast in color or shape.
Common Pitfalls
- Large datasets: for >500 points, swarm plots can become cluttered; consider violin or box plots instead.
- Uneven group sizes: extremely small groups may appear as lone points; annotate directly if needed.
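The customization tips above can be combined in a short sketch. The data frame `toy`, its column names, and the particular shape and size values are made up for illustration; they are not the tutorial's settings.

```r
library(ggplot2)
library(ggbeeswarm)

# Hypothetical toy data: three groups of 40 scores each
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 40),
  score = c(rnorm(40, 50, 5), rnorm(40, 55, 5), rnorm(40, 60, 5))
)

# shape and size are set inside geom_beeswarm();
# scale_color_viridis_d() supplies a colorblind-friendly discrete palette
swarm_custom <- ggplot(toy, aes(x = group, y = score, color = group)) +
  geom_beeswarm(shape = 17, size = 2, alpha = 0.8) +
  scale_color_viridis_d() +
  theme_minimal(base_size = 14)

print(swarm_custom)
```

With three groups, varying shape as well as color (for example by mapping `shape = group` in `aes()`) is one way to keep groups distinguishable in grayscale print.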
2. Density Plots
The Discovery: Next, I wanted to understand the distribution of responses for certain questions (like “level of trust in institutions”) across two demographic segments. A bar chart wasn’t capturing the shape of each distribution. A histogram might work, but I wanted a smoothed look.
The Hero (Density Plot): A density plot provides a smoothed curve of the distribution. It’s like a refined histogram where you can compare the shapes of different groups without the rough bin edges.
When to Use:
- You want to see the distribution (shape, skew, modality) of a numerical variable.
- You have one or more categorical variables to compare (e.g., Group A vs. Group B).
- A smoother visualization of distribution is more intuitive than a histogram.
```{r}
# Create density plot
density_plot <- ggplot(df, aes(x = area_mean, fill = diagnosis)) +
  geom_density(alpha = 0.5, adjust = 1.2) + # adjust smoothness parameter
  theme_minimal(base_size = 14) +
  labs(
    title = "Density Plot of Area Mean by Diagnosis Group",
    subtitle = "Smoothed distribution of 'area_mean' across groups",
    x = "Area Mean", y = "Density", fill = "Group"
  )

print(density_plot)
```
Explanation & Interpretation
- adjust: controls bandwidth of density estimation (higher = smoother).
- alpha: semi-transparency allows overlapping fills to be distinguishable.
Interpretation: Overlap between curves indicates similar distributions. Divergence reveals differences in modality or skewness.
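To make the effect of `adjust` concrete, here is a minimal base-R sketch using `stats::density()`, which, as far as I understand, is the same kind of kernel estimate that `geom_density()` relies on. The sample `x` is hypothetical.

```r
# Hypothetical sample: 200 draws from a normal distribution
set.seed(42)
x <- rnorm(200, mean = 650, sd = 60)

# density() multiplies its default bandwidth by `adjust`
d_rough  <- density(x, adjust = 0.3)  # narrow bandwidth: wiggly curve
d_smooth <- density(x, adjust = 2)    # wide bandwidth: smooth curve

# Bandwidths actually used by each estimate
c(rough = d_rough$bw, smooth = d_smooth$bw)
```

Plotting both with `plot(d_rough); lines(d_smooth)` shows the trade-off: the narrow bandwidth follows sampling noise, the wide one can flatten real features.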
Customization Tips
- To overlay median lines: add `geom_vline(data = summary_df, aes(xintercept = median), linetype = "dashed")`, where `summary_df` holds one median per group.
- Compare more than two groups: use faceting (`facet_wrap(~diagnosis)`).
- Show rug plots: `geom_rug(alpha = 0.3)` adds tick marks for individual observations.
Common Pitfalls
- Misleading smoothing: an overly small `adjust` can produce spurious bumps; an overly large one can mask real structure.
- Overlapping groups of very different sample sizes: transparency alone may not suffice; consider faceting or scaling.
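The median-line tip refers to a `summary_df` that the tutorial does not define. One possible way to build it is sketched below; the data frame `toy` is a hypothetical stand-in for the tutorial's data, and the use of `aggregate()` is my choice, not necessarily the author's.

```r
library(ggplot2)

# Hypothetical two-group data standing in for the tutorial's data frame
set.seed(42)
toy <- data.frame(
  diagnosis = rep(c("Group A", "Group B"), each = 50),
  area_mean = c(rnorm(50, 615, 40), rnorm(50, 705, 40))
)

# One median per group; rename the value column so aes(xintercept = median) works
summary_df <- aggregate(area_mean ~ diagnosis, data = toy, FUN = median)
names(summary_df)[names(summary_df) == "area_mean"] <- "median"

density_medians <- ggplot(toy, aes(x = area_mean, fill = diagnosis)) +
  geom_density(alpha = 0.5) +
  geom_vline(data = summary_df, aes(xintercept = median), linetype = "dashed") +
  geom_rug(alpha = 0.3)  # tick marks for individual observations

print(density_medians)
```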
3. Box Plots with Jitter
The Discovery: Sometimes, I needed a quick sense of how responses (like “support for policy X”) varied across multiple categories (such as different regions). A box plot shows median, quartiles, and outliers, but I still wanted to see some individual points.
The Hero (Box Plot + Jitter): By adding a jittered layer of points over the box plot, you get summary statistics and a look at each observation. This helps you see if outliers are truly outliers or part of a cluster.
When to Use:
- You have at least one categorical variable and one numeric variable.
- You want summary statistics plus raw data.
- You’re dealing with moderate sample sizes.
```{r boxjitter-setup, message=FALSE}
library(ggplot2)
set.seed(123)
data_box <- data.frame(
  region = rep(c("North", "South", "East", "West"), times = c(150, 150, 30, 70)),
  score = c(rnorm(150, 50, 10), rnorm(150, 55, 8), rnorm(30, 65, 12), rnorm(70, 52, 9))
)
```
```{r boxjitter-plot, fig.width=7, fig.height=5}
# Create boxplot + jitter
box_jitter <- ggplot(data_box, aes(x = region, y = score, fill = region)) +
  geom_boxplot(width = 0.6, outlier.shape = NA) + # remove default outliers
  geom_jitter(width = 0.2, size = 1, alpha = 0.6, color = "black") +
  theme_minimal(base_size = 14) +
  labs(
    title = "Regional Score Distributions with Box Plot + Jitter",
    subtitle = "Combining summary statistics and individual data points",
    x = "Region", y = "Score"
  ) +
  theme(legend.position = "none")

print(box_jitter)
```
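For readers who want to see what the box encodes numerically, the following base-R sketch computes the quartiles and the conventional 1.5 × IQR fences that decide which points a box plot would flag as outliers. The `score` vector is hypothetical, with one extreme value appended on purpose.

```r
# Hypothetical scores with one extreme value appended
set.seed(123)
score <- c(rnorm(30, mean = 65, sd = 12), 130)

# Quartiles that define the box, and the interquartile range
q <- quantile(score, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(score)

# Conventional 1.5 * IQR fences; points beyond them are drawn as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
flagged <- score[score < lower_fence | score > upper_fence]

flagged
```

Because the jitter layer already draws every observation, hiding the box plot's own outlier markers with `outlier.shape = NA` avoids plotting those points twice.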

4. Violin Plot
The Discovery: I wanted to see not just the median and quartiles but the full kernel density shape of my data. A violin plot shows both summary statistics and density.
The Hero (Violin Plot): Violin plots combine box plot and density plot ideas, giving you a wider view of distribution for each category.
When to Use:
- Similar to box plots, but you care about the full distribution shape.
- Comparing multiple groups where you want a holistic view.
```{r violin-setup, message=FALSE}
library(ggplot2)
library(dplyr)
library(forcats)

# Load example dataset
url <- "https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/10_OneNumSevCatSubgroupsSevObs.csv"
data_violin <- read.csv(url) %>%
  mutate(tip_pct = round(tip / total_bill * 100, 1))
```
```{r violin-plot, fig.width=7, fig.height=5}
# Create violin plot
violin_plot <- ggplot(data_violin, aes(x = fct_reorder(day, tip_pct), y = tip_pct, fill = sex)) +
  geom_violin(position = position_dodge(width = 0.9), alpha = 0.7, trim = FALSE) +
  geom_boxplot(width = 0.1, position = position_dodge(width = 0.9), outlier.shape = NA, alpha = 0.5) +
  coord_cartesian(ylim = c(0, 40)) +
  labs(
    title = "Tip Percentage by Day of Week and Gender",
    subtitle = "Violin plots show kernel density; boxplots show quartiles",
    x = "Day of Week", y = "Tip (% of Total Bill)", fill = "Gender"
  ) +
  theme_minimal(base_size = 14)

print(violin_plot)
```

Explanation & Interpretation
- trim = FALSE: Display full tail of density beyond data range.
- geom_boxplot(): Adds quartile summary inside violins.
Interpretation: Wider sections of the violin indicate where tips concentrate. Differences between days/genders highlight behavioral patterns.
Customization Tips
- Adjust bandwidth with `geom_violin(..., adjust = 1.5)`.
- Flip coordinates (`coord_flip()`) for horizontal violins if labels overlap.
Common Pitfalls
- Very small sample sizes: density estimation may be misleading—consider using jitter only.
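A short sketch of these tips, using R's built-in `ToothGrowth` dataset purely as a stand-in (it is not part of the tutorial). It checks group sizes before drawing, then applies a wider bandwidth and flipped coordinates.

```r
library(ggplot2)

# Built-in dataset used only for illustration: tooth length by supplement type
data("ToothGrowth")

# Check group sizes first; very small groups make density estimates unreliable
group_sizes <- table(ToothGrowth$supp)

violin_custom <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  geom_violin(adjust = 1.5, trim = FALSE, alpha = 0.7) +  # smoother density
  geom_boxplot(width = 0.1, outlier.shape = NA) +          # quartile summary
  coord_flip() +                                           # horizontal violins
  theme_minimal(base_size = 14)

print(violin_custom)
```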
5. Bar + Line Plot
The Discovery: When analyzing time-series survey data (e.g., monthly participant counts), I wanted both bar-chart values and a trend line.
The Hero (Bar + Line Combo): Use bars for absolute values and overlay a line to depict the trend over time.
When to Use:
- Showing period values (bars) and overall trend (line) for time-series data.
```{r barline-setup, message=FALSE}
library(ggplot2)
library(zoo)

data("AirPassengers")
df_air <- data.frame(
  Month = as.Date(as.yearmon(time(AirPassengers))),
  Passengers = as.numeric(AirPassengers)
)
```
Now we plot the Bar + Line Combo Plot.
```{r barline-plot, fig.width=7, fig.height=5}
# Create bar + line chart
ggplot(df_air, aes(x = Month)) +
  geom_col(aes(y = Passengers), width = 25, fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = Passengers), size = 1.2, color = "darkred") +
  labs(
    title = "Monthly International Airline Passengers (1949–1960)",
    subtitle = "Bar = monthly count; Line = overall trend",
    x = "Month",
    y = "Passengers (thousands)" # AirPassengers is recorded in thousands
  ) +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "1 year") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Explanation & Interpretation
- geom_col(): Creates bars using data values directly.
- geom_line(): Plots a continuous trend line across points.
- scale_x_date(): Customizes date axis formatting.
Interpretation: Seasonal peaks in summer months emerge clearly via bars, while the red trend line contextualizes year-over-year growth.
Customization Tips
- Add moving average: compute a rolling mean (e.g., `df_air$MA <- rollmean(df_air$Passengers, 12, fill = NA)`) and overlay with `geom_line(aes(y = MA), linetype = "dashed")`.
Common Pitfalls
- Date axis overcrowding: adjust `date_breaks` or rotate labels.
- Inconsistent widths: ensure `width` aligns with date units (days).
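The moving-average tip can be sketched end to end as follows. It rebuilds `df_air` so the block runs on its own; the 12-month window follows the tip above, while the object name `ma_plot` and the styling are illustrative.

```r
library(ggplot2)
library(zoo)

data("AirPassengers")
df_air <- data.frame(
  Month = as.Date(as.yearmon(time(AirPassengers))),
  Passengers = as.numeric(AirPassengers)
)

# Centered 12-month rolling mean; the ends of the series are padded with NA
df_air$MA <- rollmean(df_air$Passengers, k = 12, fill = NA)

ma_plot <- ggplot(df_air, aes(x = Month)) +
  geom_col(aes(y = Passengers), width = 25, fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = MA), linetype = "dashed", color = "darkred", na.rm = TRUE) +
  theme_minimal(base_size = 14)

print(ma_plot)
```

Unlike a line through the raw monthly values, the rolling mean averages out the seasonal cycle, so it reads more directly as a long-run trend.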
6. Correlation Heatmap
The Discovery: I needed an overview of how numerical features related. Rather than a numeric matrix, a correlation heatmap highlights associations at a glance.
The Hero (Correlation Heatmap): A color-scaled matrix showing strength and direction of correlations.
When to Use:
- Dealing with multiple numerical variables.
- Quickly assessing which pairs are highly correlated.
```{r corr-setup, message=FALSE}
library(corrplot)
library(dplyr) # for %>% and select()

# Use df from section 1: numeric columns only
num_cols <- df %>%
  select(radius_mean, texture_mean, perimeter_mean, area_mean, concavity_mean, symmetry_mean)
corr_matrix <- cor(num_cols)
```
```{r corr-plot, fig.width=6, fig.height=6}
corrplot(
  corr_matrix,
  method = "shade",        # use shaded squares
  type = "upper",          # show only upper triangle
  tl.col = "black",        # variable names in black
  addCoef.col = "white",   # add correlation coefficients
  number.cex = 0.7,        # size of coefficients
  tl.srt = 45,             # rotate labels
  title = "Correlation Heatmap of Synthetic Features",
  mar = c(0, 0, 2, 0)      # margin for title
)
```

Explanation & Interpretation
- method: defines tile style (`shade`, `color`, `circle`).
- type = 'upper': hides redundant lower triangle.
- addCoef.col: overlays numeric r-values.
Interpretation: High positive correlations (e.g., perimeter_mean vs. radius_mean) appear as darker tiles; near-zero appear as lighter.
Customization Tips
- Use `corrplot.mixed()` to combine circle and number views.
- Cluster variables with `order = "hclust"` in `corrplot()` for hierarchical-clustering ordering (`hc.order = TRUE` is the corresponding argument in the ggcorrplot package).
Common Pitfalls
- Correlation does not imply causation: always inspect scatter plots for nonlinear patterns.
- Including non-numeric data: ensure you select only numeric columns.
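Both pitfalls can be illustrated with a few lines of base R. The data frame `mixed` below is hypothetical: it contains a character column that must be dropped before `cor()`, and a variable `y` that depends entirely on `x` but not linearly.

```r
# Hypothetical mixed data frame: one numeric column and one character column
mixed <- data.frame(
  x = seq(-3, 3, length.out = 101),
  label = rep(c("a", "b"), length.out = 101)
)
mixed$y <- mixed$x^2  # y depends entirely on x, but not linearly

# Keep only numeric columns before calling cor()
num_only <- mixed[sapply(mixed, is.numeric)]
corr_matrix <- cor(num_only)

# Pearson correlation is essentially zero despite the perfect dependence
round(corr_matrix, 3)
```

A heatmap of this matrix would suggest that `x` and `y` are unrelated, which is why a quick scatter plot of suspicious pairs is worth the extra step.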
7. Scatter Plot with Regression Line
The Discovery: I often wondered if one factor could predict another—like, does “annual income” predict “charitable donations”?
The Hero (Scatter + Regression): Overlay a linear regression line on a scatter plot to gauge trend direction and strength.
When to Use:
- Examining relationship between two numerical variables.
- Getting a quick visual indication of a linear trend.
```{r scatter-setup, message=FALSE}
# Reuse df from section 1
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 14))
```
```{r scatter-plot, fig.width=7, fig.height=5}
scatter_reg <- ggplot(df, aes(x = radius_mean, y = area_mean, color = diagnosis)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, linetype = "dashed") +
  labs(
    title = "Scatter Plot with Linear Regression Line",
    subtitle = "Relationship between radius_mean and area_mean by group",
    x = "Radius Mean", y = "Area Mean", color = "Diagnosis"
  )

print(scatter_reg)
```

Explanation & Interpretation
- geom_smooth(method = 'lm', se = TRUE): adds linear model fit with shaded confidence band.
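Interpretation: The slope of each dashed line shows the direction and steepness of the linear relationship, and the shaded band shows the uncertainty of the fit. Overlapping confidence bands suggest that the groups follow a similar relationship.
Customization Tips
- To fit separate models per group: include `aes(group = diagnosis)` inside `geom_smooth()`. In the plot above this is already implied, because `color = diagnosis` is mapped in the global `aes()`; for a single pooled fit, use `aes(group = 1)` together with a fixed `color`.
- For nonlinear trends: use `method = 'loess'`.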
Common Pitfalls
- Overfitting with LOESS on large samples: consider subsampling points or adjusting the bandwidth (`span`).
- Outliers can disproportionately influence linear fit.
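To read the per-group slopes as numbers rather than off the plot, one option is to fit a separate `lm()` per group. This is a minimal sketch on hypothetical data with known slopes (3 and 5); the column names are illustrative.

```r
# Hypothetical data: two groups with different true slopes (3 and 5)
set.seed(7)
toy <- data.frame(
  group = rep(c("A", "B"), each = 50),
  radius = runif(100, min = 10, max = 20)
)
true_slope <- ifelse(toy$group == "A", 3, 5)
toy$area <- true_slope * toy$radius + rnorm(100, sd = 1)

# Fit one linear model per group and extract the slope on radius
slopes <- sapply(split(toy, toy$group), function(d) {
  unname(coef(lm(area ~ radius, data = d))["radius"])
})

slopes
```

The estimated slopes should land close to the values used to generate the data, which is a useful sanity check before interpreting slopes from real survey data.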
8. Stacked Bar Charts
The Discovery: I wanted to show how responses divided among categories (like political affiliation vs. policy preference). A stacked bar chart does this intuitively.
The Hero (Stacked Bar Chart): Compares total counts and internal composition of each group.
When to Use:
- You have two or more categorical variables.
- You want overall counts and proportional breakdown.
```{r stacked-setup, message=FALSE}
set.seed(123)
df_stack <- data.frame(
  Region = sample(c("North", "South", "East", "West"), 200, replace = TRUE),
  Preference = sample(c("Support", "Oppose", "Neutral"), 200, replace = TRUE)
)
```
```{r stacked-plot, fig.width=7, fig.height=5}
stacked_bar <- ggplot(df_stack, aes(x = Region, fill = Preference)) +
  geom_bar(position = "stack") +
  labs(
    title = "Stacked Bar Chart: Preference by Region",
    subtitle = "Shows both total counts and subgroup composition",
    x = "Region", y = "Count", fill = "Preference"
  ) +
  theme_minimal(base_size = 14)

print(stacked_bar)
```
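When the proportional breakdown matters more than the totals, a 100% stacked variant may be easier to compare across regions. The sketch below tabulates counts and row-wise shares in base R, then draws the same idea with `position = "fill"`; the object names are illustrative.

```r
library(ggplot2)

set.seed(123)
toy <- data.frame(
  Region = sample(c("North", "South", "East", "West"), 200, replace = TRUE),
  Preference = sample(c("Support", "Oppose", "Neutral"), 200, replace = TRUE)
)

# Counts per Region x Preference cell, as drawn by position = "stack"
counts <- table(toy$Region, toy$Preference)

# Row-wise shares: each region sums to 1, as drawn by position = "fill"
shares <- prop.table(counts, margin = 1)

fill_bar <- ggplot(toy, aes(x = Region, fill = Preference)) +
  geom_bar(position = "fill") +
  labs(y = "Share of responses")

print(fill_bar)
```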

9. Maps: Choropleth & Point Map
The Discovery:
I noticed that survey responses and social indicators often have strong geographic patterns—regions with similar characteristics cluster together. Traditional charts couldn’t reveal these spatial relationships. Mapping data onto geographic outlines and points immediately highlights regional hotspots and geographic trends.
The Hero (Choropleth & Point Maps):
- Choropleth Map: Colors fill geographic regions (e.g., states or countries) by aggregated metrics, allowing you to spot regional differences at a glance.
- Point Map: Overlays data points (e.g., cities with event counts) on a geographic background, with point size or color encoding additional variables.
When to Use:
- Choropleth: You have region-level aggregated data (e.g., state-level social index, county-level voting percentages).
- Point Map: You have point-level observations (e.g., survey locations, incidents) and want to visualize spatial distribution or density.
- Combined Use: Compare overall regional patterns (choropleth) with specific event clusters or outliers (points).
Step-by-Step Code for US Choropleth
```{r map-setup, message=FALSE}
library(ggplot2)
library(maps)
library(dplyr)

# Load map data for US states
states_map <- map_data("state")

# Create synthetic social index data per state
state_data <- data.frame(
  region = tolower(state.name),
  social_index = runif(length(state.name), min = 0, max = 100)
)

# Merge map coordinates with social index
us_data <- left_join(states_map, state_data, by = "region")
```
```{r choropleth-plot, fig.width=7, fig.height=5}
choropleth <- ggplot(us_data, aes(x = long, y = lat, group = group, fill = social_index)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Choropleth: US Social Index by State",
    subtitle = "Synthetic data demonstrating thematic mapping",
    fill = "Social Index"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

print(choropleth)
```

Explanation & Interpretation
- coord_fixed(1.3): ensures correct aspect ratio for maps.
- scale_fill_viridis_c(): continuous palette that is perceptually uniform.
Interpretation: With the default direction of the plasma palette, brighter (yellow) states have higher social index values and darker (purple) states have lower ones in this synthetic example; look for regional clusters (e.g., coastal vs. inland patterns).
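A choropleth join silently produces `NA` fills when region names do not match, so it can help to compare the two sets of names before plotting. This is a minimal sketch of such a check; I would expect it to surface the District of Columbia on the map side and Alaska and Hawaii on the data side, but treat that as an assumption to verify on your installation.

```r
library(ggplot2)  # map_data()
library(maps)     # underlying "state" polygons

states_map <- map_data("state")
state_data <- data.frame(region = tolower(state.name), stringsAsFactors = FALSE)

# Regions drawn on the map that have no matching row in state_data
# (these would get NA fill after a left join)
unmatched_map <- setdiff(unique(states_map$region), state_data$region)

# Rows in state_data that never appear on the map
unmatched_data <- setdiff(state_data$region, unique(states_map$region))

unmatched_map
unmatched_data
```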
Step-by-Step Code for World Point Map
```{r pointmap-plot, fig.width=7, fig.height=5}
# Load world map data
world_map <- map_data("world")

# Create synthetic event data for cities
set.seed(456)
cities <- data.frame(
  city = paste("City", 1:30),
  long = runif(30, -180, 180),
  lat = runif(30, -90, 90),
  event_count = sample(1:100, 30, replace = TRUE)
)

point_map <- ggplot() +
  geom_polygon(
    data = world_map,
    aes(x = long, y = lat, group = group),
    fill = "gray90", color = "white"
  ) +
  geom_point(
    data = cities,
    aes(x = long, y = lat, size = event_count),
    alpha = 0.7
  ) +
  coord_fixed(1.3) +
  labs(
    title = "Global Points: Synthetic Social Event Counts",
    subtitle = "Size of point corresponds to event frequency",
    size = "Event Count"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

print(point_map)
```
Note: because this data is randomly generated, the point locations and counts carry no real-world meaning.
Customization Tips
- Use `coord_quickmap()` for faster rendering on large world datasets.
- Spread out overlapping points with `geom_jitter()` on lat/long or use packages like `sf` + `ggspatial`.
Common Pitfalls
- Ignoring map projections: for publication, consider appropriate projections with `coord_map()` or `sf`.
- Overplotting in points map: adjust transparency or bin with heatmaps.
Conclusion
Throughout this tutorial, we've covered:
- Swarm Plots: Reveal individual observations clearly.
- Density Plots: Smooth distributions for group comparison.
- Box + Jitter: Combine summary statistics and raw data.
- Violin Plots: Show full density shapes with internal summaries.
- Bar + Line Combo: Dual view of counts and trends.
- Correlation Heatmap: Visual overview of variable associations.
- Scatter + Regression: Assess linear relationships with confidence bands.
- Stacked Bar Charts: Display totals and compositions of categories.
- Geographic Maps: Thematic and point-based spatial visualizations.
Use these techniques as a foundation. Customize further with color palettes, themes, annotations, and interactivity (e.g., with plotly or shiny). Share reproducible code, include captions explaining insights, and follow accessibility guidelines to make your research inclusive.
References and Reading Material
- South East Asia Analytics – Kaggle Notebook
- Mental Health Viz – Kaggle Notebook
- R for Data Science (Wickham & Grolemund) – A comprehensive guide on data manipulation and visualization in R.
- Data to Viz – data-to-viz.com for insights on selecting the right chart type.
Contact
Questions? Reach out at shyam.gupta@gesis.org
Owner
- Name: shyam gupta
- Login: shyamgupta196
- Kind: user
- Location: indore
- Company: sankhyikii
- Website: sankhyikii.com
- Repositories: 3
- Profile: https://github.com/shyamgupta196
Following my passion of data science❣️❣️
Citation (CITATION.cff)
GitHub Events
Total
- Issues event: 17
- Issue comment event: 12
- Push event: 41
- Create event: 3
Last Year
- Issues event: 17
- Issue comment event: 12
- Push event: 41
- Create event: 3