
Repository

Basic Info
  • Host: GitHub
  • Owner: shyamgupta196
  • License: mit
  • Language: Shell
  • Default Branch: main
  • Size: 2.74 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

readme.md

Visualize Better with R

MethodsHub❤️ Guidelines

Author: Shyam Gupta (ORCID)
Email: shyam.gupta@gesis.org
Affiliation: GESIS Leibniz Institute for the Social Sciences
Date: 2025-08-12

Learning Objective

I was working on a social science research project—trying to decode the relationship between people’s well-being and their social interactions. After spending hours collecting survey responses, I realized that raw numbers alone weren’t telling the entire story.

“I needed a better way to visualize my data to spot trends, patterns, and outliers.”
That is when my journey with data visualization in R began.

By completing this tutorial, you will:

  1. Understand the strengths and limitations of each visualization type.
  2. Be able to implement and customize these visualizations in ggplot2 and related packages.
  3. Interpret results in the context of social science data exploration.
  4. Apply best practices to ensure clarity, accessibility, and reproducibility of your plots.

In this tutorial, I’ll walk you through each plot, explaining why you might want to use it, what it reveals, and how to make it.

By the end of this tutorial, you’ll have a handy arsenal of 9 powerful visualization techniques perfect for social science data exploration. Let’s get started!

Target Audience

This tutorial is designed for:

  • Social Science Researchers & Graduate Students: Those analyzing survey or observational data who need clear, reproducible visualizations to support hypotheses and publications.
  • Data Analysts & Statisticians: Professionals exploring complex datasets in R who want to expand their toolbox with advanced plotting techniques.
  • Academic Instructors & Educators: Teachers seeking structured examples to demonstrate best practices in data visualization to social science students.
  • Interdisciplinary Teams & Policy Analysts: Practitioners who require accessible, publication-quality figures to communicate insights to stakeholders and decision-makers.

Computational Environment Setup

Ensure you have R (version 3.6.0 or higher) installed on your system.
You can download it from: https://cran.r-project.org/.

Install the required R packages by running:

```{r}
# Core data wrangling & plotting
install.packages("dplyr")
install.packages("tidyr")
install.packages("hrbrthemes")
install.packages("ggplot2")

# Specialized geoms
install.packages("ggbeeswarm")
install.packages("forcats")

# Time series utilities
install.packages("zoo")

# Correlation plotting
install.packages("corrplot")

# Mapping
install.packages("maps")
install.packages("viridis") # for color scales on maps

# (Optional) Color-blind-friendly palettes
install.packages("viridisLite")
```

Duration

  • Reading & Setup: 5 minutes
  • Code Walkthrough: 20–30 minutes
  • Hands-on Practice: 10–15 minutes

Total: 35–50 minutes

Input Data

Generate a toy dataset with six numeric features and a group indicator.

```{r}
# 1. Load necessary libraries
library(dplyr)      # Data manipulation
library(tidyr)      # Data reshaping
library(ggplot2)    # Core plotting package
library(ggbeeswarm) # For geom_beeswarm()
```

```{r}
# 2. Generate synthetic dataset
set.seed(42)
n <- 100
diagnosis <- sample(c("Group A", "Group B"), n, replace = TRUE)

# Group-specific means and correlations
radius_mean    <- rnorm(n, mean = ifelse(diagnosis == "Group A", 14, 15), sd = 0.7)
texture_mean   <- rnorm(n, mean = ifelse(diagnosis == "Group A", 18, 22), sd = 2)
perimeter_mean <- radius_mean * 6 + rnorm(n, 0, 3)     # correlated with radius
area_mean      <- radius_mean^2 * pi + rnorm(n, 0, 30) # correlated with radius
concavity_mean <- rbeta(n, shape1 = ifelse(diagnosis == "Group A", 2, 4), shape2 = 10)
symmetry_mean  <- rnorm(n, mean = ifelse(diagnosis == "Group A", 0.18, 0.22), sd = 0.03)

df <- data.frame(
  radius_mean = radius_mean,
  texture_mean = texture_mean,
  perimeter_mean = perimeter_mean,
  area_mean = area_mean,
  concavity_mean = concavity_mean,
  symmetry_mean = symmetry_mean,
  diagnosis = diagnosis
)
```

```{r}
# 3. Reshape data from wide to long format

# Data before reshaping
head(df)

# pivot_longer(): collapse multiple feature columns into key-value pairs
df_long <- df %>%
  pivot_longer(
    cols = radius_mean:symmetry_mean,
    names_to = "feature",
    values_to = "value"
  )

# Data after reshaping
head(df_long)
```

1. Swarm Plots over Scatter Plots

The Discovery:
My first big “aha” moment came when I had to compare numerical data (like response scores) across different groups (like age brackets). I found that regular scatter plots overlapped points too much—making it hard to see the data distribution.

The Hero (Swarm Plot):
A swarm plot is useful when you want to display the distribution of a numerical variable across different categories without losing individual data points. Unlike simple scatter plots, swarm plots arrange points to avoid overlap, helping you see each observation more clearly.

When to Use:
- You have a moderate number of data points (not too large).
- You want to see each individual data point by category.
- You want a more “transparent” view of the distribution than a box or violin plot alone can provide.

```{r}
# 4. Create a swarm plot
# - aes(x = feature, y = value): map feature names to the x-axis and values to the y-axis
# - color by diagnosis group

# Swarm plot
swarm_plot <- ggplot(df_long, aes(x = feature, y = value, color = diagnosis)) +
  geom_beeswarm(alpha = 0.8, size = 2) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Swarm Plot: Distribution of Mean Features by Diagnosis Group",
    subtitle = "Swarm plot reveals every individual observation without overlap",
    x = "Feature",
    y = "Value",
    color = "Diagnosis Group"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )
print(swarm_plot)

# Scatter plot for comparison
scatter_plot <- ggplot(df_long, aes(x = feature, y = value, color = diagnosis)) +
  geom_point(alpha = 0.8, size = 2, position = position_jitter(width = 0)) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Scatter Plot: Distribution of Mean Features by Diagnosis Group",
    subtitle = "Scatter plot suffers from overplotting: points overlap and hide density",
    x = "Feature",
    y = "Value",
    color = "Diagnosis Group"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top"
  )
print(scatter_plot)
```

Why choose a swarm plot over a scatter plot

Swarm plots are preferred over scatter plots when visualizing continuous data across categorical groups because they prevent overplotting: each data point stays visible instead of hiding behind others. In a scatter plot with many overlapping points, it is hard to assess the true distribution or spot clusters and outliers. Swarm plots arrange points to minimize overlap, making the sample size, density, and group differences much clearer at a glance.

Explanation & Interpretation

  • geom_beeswarm(): Positions points in a compact, non-overlapping arrangement. Useful for moderate sample sizes (<200).
  • Transparency (alpha): Helps reveal point density when slight overlap occurs.

Interpretation: In the resulting plot, clusters of points indicate where many observations fall. Differences in the vertical spread between Group A and Group B highlight variability in that feature.

Customization Tips

  • Point shape & size: use shape= and size= inside geom_beeswarm().
  • Color palettes: integrate scale_color_brewer(palette = "Set1") or scale_color_viridis_d() for colorblind-friendly palettes.
  • Grouping multiple categories: for more than two groups, ensure contrast in color or shape.
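The customization tips above can be combined in a single call. The following is a minimal sketch on a small synthetic data frame; `toy`, its column names, and the three group labels are assumptions made for illustration and are not part of the tutorial's dataset.

```r
library(ggplot2)
library(ggbeeswarm)

set.seed(1)
# Hypothetical toy data: three groups with shifted means
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 40),
  value = c(rnorm(40, 10), rnorm(40, 12), rnorm(40, 11))
)

custom_swarm <- ggplot(toy, aes(x = group, y = value, color = group, shape = group)) +
  geom_beeswarm(size = 2.5, alpha = 0.8, cex = 1.5) + # cex adjusts the spacing between points
  scale_color_viridis_d(end = 0.85) +                 # colorblind-friendly discrete palette
  theme_minimal(base_size = 14) +
  labs(x = "Group", y = "Value", color = "Group", shape = "Group")

print(custom_swarm)
```

Mapping both color and shape to the same variable gives redundant encoding, which helps readers who cannot distinguish the colors.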

Common Pitfalls

  • Large datasets: for >500 points, swarm plots can become cluttered; consider violin or box plots instead.
  • Uneven group sizes: extremely small groups may appear as lone points; annotate directly if needed.

2. Density Plots

The Discovery: Next, I wanted to understand the distribution of responses for certain questions (like “level of trust in institutions”) across two demographic segments. A bar chart wasn’t capturing the shape of each distribution. A histogram might work, but I wanted a smoothed look.

The Hero (Density Plot): A density plot provides a smoothed curve of the distribution. It’s like a refined histogram where you can compare the shapes of different groups without the rough bin edges.

When to Use:

  • You want to see the distribution (shape, skew, modality) of a numerical variable.
  • You have one or more categorical variables to compare (e.g., Group A vs. Group B).
  • A smoother visualization of distribution is more intuitive than a histogram.

```{r}
# Create density plot
density_plot <- ggplot(df, aes(x = area_mean, fill = diagnosis)) +
  geom_density(alpha = 0.5, adjust = 1.2) + # adjust smoothness parameter
  theme_minimal(base_size = 14) +
  labs(
    title = "Density Plot of Area Mean by Diagnosis Group",
    subtitle = "Smoothed distribution of 'area_mean' across groups",
    x = "Area Mean",
    y = "Density",
    fill = "Group"
  )

print(density_plot)
```

Explanation & Interpretation

  • adjust: controls bandwidth of density estimation (higher = smoother).
  • alpha: semi-transparency allows overlapping fills to be distinguishable.

Interpretation: Overlap between curves indicates similar distributions. Divergence reveals differences in modality or skewness.

Customization Tips

  • To overlay median lines: add geom_vline(data = summary_df, aes(xintercept = median), linetype = "dashed").
  • Compare more than two groups: use faceting (facet_wrap(~diagnosis)).
  • Show rug plots: geom_rug(alpha = 0.3) adds tick marks for individual observations.
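The median-line tip above refers to a `summary_df` that is not defined anywhere in the tutorial. One way to build it is sketched here, assuming a data frame with a numeric column and a grouping column; `toy` is a hypothetical stand-in for `df`, and the means and spreads are made up.

```r
library(ggplot2)
library(dplyr)

set.seed(42)
# Hypothetical stand-in for the tutorial's df: one numeric column, two groups
toy <- data.frame(
  diagnosis = rep(c("Group A", "Group B"), each = 50),
  area_mean = c(rnorm(50, 615, 60), rnorm(50, 705, 60))
)

# One row per group holding the median to draw
summary_df <- toy %>%
  group_by(diagnosis) %>%
  summarise(median = median(area_mean), .groups = "drop")

density_medians <- ggplot(toy, aes(x = area_mean, fill = diagnosis)) +
  geom_density(alpha = 0.5) +
  geom_vline(
    data = summary_df,
    aes(xintercept = median, color = diagnosis),
    linetype = "dashed"
  ) +
  geom_rug(aes(color = diagnosis), alpha = 0.3) + # tick marks for individual observations
  theme_minimal(base_size = 14)

print(density_medians)
```

Passing `data = summary_df` to `geom_vline()` keeps the summary separate from the raw data, so the same pattern works for means or other quantiles.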

Common Pitfalls

  • Misleading smoothing: overly small adjust can produce spurious bumps; overly large can mask real structure.
  • Overlapping groups of very different sample sizes: transparency alone may not suffice; consider faceting or scaling.

3. Box Plots with Jitter

The Discovery: Sometimes, I needed a quick sense of how responses (like “support for policy X”) varied across multiple categories (such as different regions). A box plot shows median, quartiles, and outliers, but I still wanted to see some individual points.

The Hero (Box Plot + Jitter): By adding a jittered layer of points over the box plot, you get summary statistics and a look at each observation. This helps you see if outliers are truly outliers or part of a cluster.

When to Use:

  • You have at least one categorical variable and one numeric variable.
  • You want summary statistics plus raw data.
  • You’re dealing with moderate sample sizes.

```{r boxjitter-setup, message=FALSE}
library(ggplot2)
set.seed(123)
data_box <- data.frame(
  region = rep(c("North", "South", "East", "West"), times = c(150, 150, 30, 70)),
  score = c(rnorm(150, 50, 10), rnorm(150, 55, 8), rnorm(30, 65, 12), rnorm(70, 52, 9))
)
```

```{r boxjitter-plot, fig.width=7, fig.height=5}
# Create boxplot + jitter
box_jitter <- ggplot(data_box, aes(x = region, y = score, fill = region)) +
  geom_boxplot(width = 0.6, outlier.shape = NA) + # remove default outliers
  geom_jitter(width = 0.2, size = 1, alpha = 0.6, color = "black") +
  theme_minimal(base_size = 14) +
  labs(
    title = "Regional Score Distributions with Box Plot + Jitter",
    subtitle = "Combining summary statistics and individual data points",
    x = "Region",
    y = "Score"
  ) +
  theme(legend.position = "none")

print(box_jitter)
```
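Whether a point is "truly an outlier" can also be checked numerically. The sketch below recomputes the quartiles per region and counts the points beyond the 1.5 × IQR fences, the rule `geom_boxplot()` uses for its whiskers. It regenerates the same synthetic `data_box` so that it runs on its own; `box_stats` and its column names are illustrative.

```r
set.seed(123)
data_box <- data.frame(
  region = rep(c("North", "South", "East", "West"), times = c(150, 150, 30, 70)),
  score = c(rnorm(150, 50, 10), rnorm(150, 55, 8), rnorm(30, 65, 12), rnorm(70, 52, 9))
)

# Quartiles and number of points outside the 1.5 * IQR fences, per region
box_stats <- do.call(rbind, lapply(split(data_box$score, data_box$region), function(s) {
  q <- quantile(s, c(0.25, 0.5, 0.75))
  iqr <- q[[3]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[3]] + 1.5 * iqr
  data.frame(
    n = length(s),
    q1 = q[[1]],
    median = q[[2]],
    q3 = q[[3]],
    n_outside = sum(s < lower | s > upper)
  )
}))

print(box_stats)
```

A small `n_outside` relative to `n` suggests isolated extreme values, while several nearby points beyond a fence may indicate a tail or subgroup worth a closer look in the jitter layer.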

4. Violin Plot

The Discovery: I wanted to see not just the median and quartiles but the full kernel density shape of my data. A violin plot shows both summary statistics and density.

The Hero (Violin Plot): Violin plots combine box plot and density plot ideas, giving you a wider view of distribution for each category.

When to Use:

  • Similar to box plots, but you care about the full distribution shape.
  • Comparing multiple groups where you want a holistic view.

```{r violin-setup, message=FALSE}
library(ggplot2)
library(dplyr)
library(forcats)

# Load example dataset
url <- "https://raw.githubusercontent.com/holtzy/data_to_viz/master/Example_dataset/10_OneNumSevCatSubgroupsSevObs.csv"
data_violin <- read.csv(url) %>%
  mutate(tip_pct = round(tip / total_bill * 100, 1))
```

```{r violin-plot, fig.width=7, fig.height=5}
# Create violin plot
violin_plot <- ggplot(data_violin, aes(x = fct_reorder(day, tip_pct), y = tip_pct, fill = sex)) +
  geom_violin(position = position_dodge(width = 0.9), alpha = 0.7, trim = FALSE) +
  geom_boxplot(width = 0.1, position = position_dodge(width = 0.9), outlier.shape = NA, alpha = 0.5) +
  coord_cartesian(ylim = c(0, 40)) +
  labs(
    title = "Tip Percentage by Day of Week and Gender",
    subtitle = "Violin plots show kernel density; boxplots show quartiles",
    x = "Day of Week",
    y = "Tip (% of Total Bill)",
    fill = "Gender"
  ) +
  theme_minimal(base_size = 14)

print(violin_plot)
```

Explanation & Interpretation

  • trim = FALSE: Display full tail of density beyond data range.
  • geom_boxplot(): Adds quartile summary inside violins.

Interpretation: Wider sections of the violin indicate where tips concentrate. Differences between days/genders highlight behavioral patterns.

Customization Tips

  • Adjust bandwidth with geom_violin(..., adjust = 1.5).
  • Flip coordinates (coord_flip()) for horizontal violins if labels overlap.
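Both tips can be tried without downloading the example dataset. This is a minimal sketch on synthetic, right-skewed data; `toy`, its columns, and the gamma parameters are assumptions chosen only to produce plausible-looking percentages.

```r
library(ggplot2)

set.seed(7)
# Hypothetical data: a skewed numeric variable in three categories
toy <- data.frame(
  day = rep(c("Thu", "Fri", "Sat"), each = 60),
  tip_pct = c(
    rgamma(60, shape = 9, rate = 0.6),
    rgamma(60, shape = 8, rate = 0.5),
    rgamma(60, shape = 10, rate = 0.6)
  )
)

violin_flipped <- ggplot(toy, aes(x = day, y = tip_pct)) +
  geom_violin(adjust = 1.5, fill = "steelblue", alpha = 0.6) + # adjust > 1 gives a smoother outline
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  coord_flip() +                                               # horizontal violins
  theme_minimal(base_size = 14) +
  labs(x = "Day", y = "Tip (% of total bill)")

print(violin_flipped)
```

Re-running with `adjust = 0.5` shows how a smaller bandwidth makes the outline bumpier, which is useful for judging how much of the shape is signal.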

Common Pitfalls

  • Very small sample sizes: density estimation may be misleading—consider using jitter only.

5. Bar + Line Plot

The Discovery: When analyzing time-series survey data (e.g., monthly participant counts), I wanted both bar-chart values and a trend line.

The Hero (Bar + Line Combo): Use bars for absolute values and overlay a line to depict the trend over time.

When to Use:

  • Showing period values (bars) and overall trend (line) for time-series data.

```{r barline-setup, message=FALSE}
library(ggplot2)
library(zoo)

data("AirPassengers")
df_air <- data.frame(
  Month = as.Date(as.yearmon(time(AirPassengers))),
  Passengers = as.numeric(AirPassengers)
)
```

Now we plot the Bar + Line Combo Plot.

```{r barline-plot, fig.width=7, fig.height=5}
# Create bar + line chart
ggplot(df_air, aes(x = Month)) +
  geom_col(aes(y = Passengers), width = 25, fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = Passengers), size = 1.2, color = "darkred") + # on ggplot2 >= 3.4.0, prefer linewidth = 1.2
  labs(
    title = "Monthly International Airline Passengers (1949–1960)",
    subtitle = "Bar = monthly count; Line = overall trend",
    x = "Month",
    y = "Number of Passengers"
  ) +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "1 year") +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Explanation & Interpretation

  • geom_col(): Creates bars using data values directly.
  • geom_line(): Plots a continuous trend line across points.
  • scale_x_date(): Customizes date axis formatting.

Interpretation: Seasonal peaks in summer months emerge clearly via bars, while the red trend line contextualizes year-over-year growth.

Customization Tips

  • Add moving average: compute a rolling mean (e.g., df_air$MA <- rollmean(df_air$Passengers, 12, fill = NA)) and overlay with geom_line(aes(y = MA), linetype = "dashed").
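The moving-average tip can be sketched end to end as follows. It rebuilds `df_air` from the built-in `AirPassengers` series so that it runs on its own; the 12-month window comes from the tip above, and `ma_plot` is an illustrative name.

```r
library(ggplot2)
library(zoo)

data("AirPassengers")
df_air <- data.frame(
  Month = as.Date(as.yearmon(time(AirPassengers))),
  Passengers = as.numeric(AirPassengers)
)

# Centered 12-month rolling mean; positions without a full window are filled with NA
df_air$MA <- rollmean(df_air$Passengers, k = 12, fill = NA)

ma_plot <- ggplot(df_air, aes(x = Month)) +
  geom_col(aes(y = Passengers), fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = MA), color = "darkred", linetype = "dashed", na.rm = TRUE) +
  theme_minimal(base_size = 14) +
  labs(x = "Month", y = "Number of Passengers")

print(ma_plot)
```

Because the window spans a full year, the dashed line averages out the seasonal swings and shows the long-run growth more clearly than a line through the monthly values.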

Common Pitfalls

  • Date axis overcrowding: adjust date_breaks or rotate labels.
  • Inconsistent widths: ensure width aligns with date units (days).

6. Correlation Heatmap

The Discovery: I needed an overview of how numerical features related. Rather than a numeric matrix, a correlation heatmap highlights associations at a glance.

The Hero (Correlation Heatmap): A color-scaled matrix showing strength and direction of correlations.

When to Use:

  • Dealing with multiple numerical variables.
  • Quickly assessing which pairs are highly correlated.

```{r corr-setup, message=FALSE}
library(corrplot)

# Use df from the Input Data section: numeric columns only
num_cols <- df %>%
  select(radius_mean, texture_mean, perimeter_mean, area_mean, concavity_mean, symmetry_mean)
corr_matrix <- cor(num_cols)
```

```{r corr-plot, fig.width=6, fig.height=6}
corrplot(
  corr_matrix,
  method = "shade",        # use shaded squares
  type = "upper",          # show only upper triangle
  tl.col = "black",        # variable names in black
  addCoef.col = "white",   # add correlation coefficients
  number.cex = 0.7,        # size of coefficients
  tl.srt = 45,             # rotate labels
  title = "Correlation Heatmap of Synthetic Features",
  mar = c(0, 0, 2, 0)      # margin for title
)
```

Explanation & Interpretation

  • method: defines tile style (shade, color, circle).
  • type = 'upper': hides redundant lower triangle.
  • addCoef.col: overlays numeric r-values.

Interpretation: High positive correlations (e.g., perimeter_mean vs. radius_mean) appear as darker tiles; near-zero appear as lighter.

Customization Tips

  • Use corrplot.mixed() to combine circle and number views.
  • Cluster variables with order = "hclust" so that correlated variables sit next to each other (hc.order = TRUE is the corresponding argument in ggcorrplot, not corrplot).

Common Pitfalls

  • Correlation does not imply causation, and Pearson's r captures only linear association: always inspect scatter plots for nonlinear patterns.
  • Including non-numeric data: ensure you select only numeric columns.
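To avoid the non-numeric pitfall without listing columns by hand, the numeric columns can be selected by type. This is a minimal sketch on a hypothetical mixed-type data frame (`toy`, with made-up columns `x`, `y`, `z`, and `group`); it also uses the `order = "hclust"` option from the tips above.

```r
library(dplyr)
library(corrplot)

set.seed(42)
# Hypothetical mixed-type data frame: numeric columns plus one character column
toy <- data.frame(
  x = rnorm(100),
  group = sample(c("A", "B"), 100, replace = TRUE)
)
toy$y <- 2 * toy$x + rnorm(100, sd = 0.5) # strongly related to x
toy$z <- rnorm(100)                        # unrelated noise

# Keep numeric columns only, so cor() does not fail on the character column
num_only <- toy %>% select(where(is.numeric))
corr_toy <- cor(num_only)

corrplot(corr_toy, method = "color", order = "hclust", addCoef.col = "black")
```

Selecting by type keeps the code working when columns are added or renamed, at the cost of silently including numeric ID columns, so it is worth checking `names(num_only)` first.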

7. Scatter Plot with Regression Line

The Discovery: I often wondered if one factor could predict another—like, does “annual income” predict “charitable donations”?

The Hero (Scatter + Regression): Overlay a linear regression line on a scatter plot to gauge trend direction and strength.

When to Use:

  • Examining relationship between two numerical variables.
  • Getting a quick visual indication of a linear trend.

```{r scatter-setup, message=FALSE}
# Reuse df from the Input Data section
ggplot2::theme_set(theme_minimal(base_size = 14))
```

```{r scatter-plot, fig.width=7, fig.height=5}
scatter_reg <- ggplot(df, aes(x = radius_mean, y = area_mean, color = diagnosis)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = TRUE, linetype = "dashed") +
  labs(
    title = "Scatter Plot with Linear Regression Line",
    subtitle = "Relationship between radius_mean and area_mean by group",
    x = "Radius Mean",
    y = "Area Mean",
    color = "Diagnosis"
  )

print(scatter_reg)
```

Explanation & Interpretation

  • geom_smooth(method = 'lm', se = TRUE): adds linear model fit with shaded confidence band.

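Interpretation: The sign and steepness of each dashed line show the direction of the linear relationship and how much the y variable changes per unit of x; how tightly the points cluster around the line reflects its strength. Overlapping confidence bands suggest that the group-wise fits are similar.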

Customization Tips

  • Because color = diagnosis is mapped in ggplot(), geom_smooth() already fits a separate model per group; to fit one pooled model instead, add aes(group = 1) inside geom_smooth().
  • For nonlinear trends: use method = 'loess'.
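The difference between per-group and pooled fits is easiest to see side by side. The sketch below draws both on hypothetical data (`toy`, with made-up columns `x` and `y`) and then extracts the pooled slope and R-squared with `lm()`, since the plot alone does not report them.

```r
library(ggplot2)

set.seed(42)
# Hypothetical data: two groups sharing one linear trend
toy <- data.frame(
  diagnosis = rep(c("Group A", "Group B"), each = 50),
  x = c(rnorm(50, 14, 0.7), rnorm(50, 15, 0.7))
)
toy$y <- 40 * toy$x + rnorm(100, sd = 30)

fits_plot <- ggplot(toy, aes(x = x, y = y, color = diagnosis)) +
  geom_point(alpha = 0.7) +
  # One line per group: the color mapping is inherited, so the data are split by diagnosis
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  # One pooled line: group = 1 overrides the split
  geom_smooth(aes(group = 1), method = "lm", formula = y ~ x, se = TRUE,
              color = "black", linetype = "dashed") +
  theme_minimal(base_size = 14)

print(fits_plot)

# Slope and R-squared of the pooled fit, for reporting alongside the plot
pooled <- lm(y ~ x, data = toy)
pooled_slope <- unname(coef(pooled)["x"])
pooled_r2 <- summary(pooled)$r.squared
```

If the per-group lines diverge noticeably from the pooled line, the groups may differ in slope, and a model with an interaction term would be the next thing to examine.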

Common Pitfalls

  • LOESS can overfit when the span is small and is slow on large samples; consider a larger span or fewer points.
  • Outliers can disproportionately influence linear fit.

8. Stacked Bar Charts

The Discovery: I wanted to show how responses divided among categories (like political affiliation vs. policy preference). A stacked bar chart does this intuitively.

The Hero (Stacked Bar Chart): Compares total counts and internal composition of each group.

When to Use:

  • You have two or more categorical variables.
  • You want overall counts and proportional breakdown.

```{r stacked-setup, message=FALSE}
set.seed(123)
df_stack <- data.frame(
  Region = sample(c("North", "South", "East", "West"), 200, replace = TRUE),
  Preference = sample(c("Support", "Oppose", "Neutral"), 200, replace = TRUE)
)
```

```{r stacked-plot, fig.width=7, fig.height=5}
stacked_bar <- ggplot(df_stack, aes(x = Region, fill = Preference)) +
  geom_bar(position = "stack") +
  labs(
    title = "Stacked Bar Chart: Preference by Region",
    subtitle = "Shows both total counts and subgroup composition",
    x = "Region",
    y = "Count",
    fill = "Preference"
  ) +
  theme_minimal(base_size = 14)

print(stacked_bar)
```
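When regions differ in size, raw counts make it hard to compare composition. A possible variant is the 100% stacked bar, sketched here with `position = "fill"`; it regenerates the same synthetic `df_stack` so that it runs on its own, and `prop_table` and `stacked_prop` are illustrative names.

```r
library(ggplot2)

set.seed(123)
df_stack <- data.frame(
  Region = sample(c("North", "South", "East", "West"), 200, replace = TRUE),
  Preference = sample(c("Support", "Oppose", "Neutral"), 200, replace = TRUE)
)

# Row-wise proportions: share of each preference within a region
prop_table <- prop.table(table(df_stack$Region, df_stack$Preference), margin = 1)
print(round(prop_table, 2))

stacked_prop <- ggplot(df_stack, aes(x = Region, fill = Preference)) +
  geom_bar(position = "fill") + # each bar is rescaled to sum to 1
  scale_y_continuous(labels = function(p) paste0(round(p * 100), "%")) +
  labs(x = "Region", y = "Share of responses", fill = "Preference") +
  theme_minimal(base_size = 14)

print(stacked_prop)
```

The trade-off is that the proportional version hides the totals, so it is often worth showing both charts or annotating each bar with its sample size.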

9. Maps: Choropleth & Point Map

The Discovery:
I noticed that survey responses and social indicators often have strong geographic patterns—regions with similar characteristics cluster together. Traditional charts couldn’t reveal these spatial relationships. Mapping data onto geographic outlines and points immediately highlights regional hotspots and geographic trends.

The Hero (Choropleth & Point Maps):
- Choropleth Map: Colors fill geographic regions (e.g., states or countries) by aggregated metrics, allowing you to spot regional differences at a glance.
- Point Map: Overlays data points (e.g., cities with event counts) on a geographic background, with point size or color encoding additional variables.

When to Use:
- Choropleth: You have region-level aggregated data (e.g., state-level social index, county-level voting percentages).
- Point Map: You have point-level observations (e.g., survey locations, incidents) and want to visualize spatial distribution or density.
- Combined Use: Compare overall regional patterns (choropleth) with specific event clusters or outliers (points).

Step-by-Step Code for US Choropleth

```{r map-setup, message=FALSE}
library(ggplot2)
library(maps)
library(dplyr)

# Load map data for US states
states_map <- map_data("state")

# Create synthetic social index data per state
state_data <- data.frame(
  region = tolower(state.name),
  social_index = runif(length(state.name), min = 0, max = 100)
)

# Merge map coordinates with social index
us_data <- left_join(states_map, state_data, by = "region")
```

```{r choropleth-plot, fig.width=7, fig.height=5}
choropleth <- ggplot(us_data, aes(x = long, y = lat, group = group, fill = social_index)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(
    title = "Choropleth: US Social Index by State",
    subtitle = "Synthetic data demonstrating thematic mapping",
    fill = "Social Index"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

print(choropleth)
```

Explanation & Interpretation

  • coord_fixed(1.3): ensures correct aspect ratio for maps.
  • scale_fill_viridis_c(): continuous palette that is perceptually uniform.

Interpretation: With the plasma palette, brighter (yellow) states have higher social index values and darker (purple) states have lower ones in this synthetic example; with real data, look for regional clusters (e.g., coastal vs. inland patterns).

Step-by-Step Code for World Point Map

```{r pointmap-plot, fig.width=7, fig.height=5}
# Load world map data
world_map <- map_data("world")

# Create synthetic event data for cities
set.seed(456)
cities <- data.frame(
  city = paste("City", 1:30),
  long = runif(30, -180, 180),
  lat = runif(30, -90, 90),
  event_count = sample(1:100, 30, replace = TRUE)
)

point_map <- ggplot() +
  geom_polygon(
    data = world_map,
    aes(x = long, y = lat, group = group),
    fill = "gray90", color = "white"
  ) +
  geom_point(
    data = cities,
    aes(x = long, y = lat, size = event_count),
    alpha = 0.7
  ) +
  coord_fixed(1.3) +
  labs(
    title = "Global Points: Synthetic Social Event Counts",
    subtitle = "Size of point corresponds to event frequency",
    size = "Event Count"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank()
  )

print(point_map)
```

Note: Because these points are randomly generated, their locations are not interpretable.

Customization Tips

  • Use coord_quickmap() for faster rendering on large world datasets.
  • Separate overlapping points with geom_jitter() on lat/long or use packages like sf + ggspatial.

Common Pitfalls

  • Ignoring map projections: for publication, consider appropriate projections with coord_map() or sf.
  • Overplotting in points map: adjust transparency or bin with heatmaps.
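For the overplotting pitfall, one option is to bin points into grid cells and color the cells by count. The sketch below uses only ggplot2 and hypothetical uniformly random coordinates (`pts`); the 10-degree cell size is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(456)
# Hypothetical dense point data in longitude/latitude
pts <- data.frame(
  long = runif(5000, -180, 180),
  lat = runif(5000, -90, 90)
)

binned_map <- ggplot(pts, aes(x = long, y = lat)) +
  geom_bin2d(binwidth = c(10, 10)) +        # count points in 10 x 10 degree cells
  scale_fill_viridis_c(option = "plasma") +
  coord_quickmap() +                        # fast approximate map aspect ratio
  labs(fill = "Points per cell") +
  theme_minimal(base_size = 14)

print(binned_map)
```

Equal-angle cells cover less ground area toward the poles, so for publication-quality density maps a projected grid (for example via sf) is likely a better basis.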


Conclusion

Throughout this tutorial, we've covered:

  1. Swarm Plots: Reveal individual observations clearly.
  2. Density Plots: Smooth distributions for group comparison.
  3. Box + Jitter: Combine summary statistics and raw data.
  4. Violin Plots: Show full density shapes with internal summaries.
  5. Bar + Line Combo: Dual view of counts and trends.
  6. Correlation Heatmap: Visual overview of variable associations.
  7. Scatter + Regression: Assess linear relationships with confidence bands.
  8. Stacked Bar Charts: Display totals and compositions of categories.
  9. Geographic Maps: Thematic and point-based spatial visualizations.

Use these techniques as a foundation. Customize further with color palettes, themes, annotations, and interactivity (e.g., with plotly or shiny). Share reproducible code, include captions explaining insights, and follow accessibility guidelines to make your research inclusive.

References and Reading Material

  • South East Asia Analytics: Kaggle Notebook
  • Mental Health Viz: Kaggle Notebook
  • R for Data Science (Wickham & Grolemund) – A comprehensive guide on data manipulation and visualization in R.
  • Data to Viz: data-to-viz.com, for insights on selecting the right chart type.

Contact

Questions? Reach out at shyam.gupta@gesis.org

Owner

  • Name: shyam gupta
  • Login: shyamgupta196
  • Kind: user
  • Location: indore
  • Company: sankhyikii

Following my passion of data science❣️❣️
