https://github.com/erictleung/nlmitc19
:speaker: Twitter analysis of #NLMITC19
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.5%) to scientific vocabulary
Keywords
conference
informatics
nlm
r
rmarkdown
training
twitter
Repository
:speaker: Twitter analysis of #NLMITC19
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Created about 7 years ago
· Last pushed over 6 years ago
Metadata Files
Readme
README.Rmd
---
title: "#NLMITC19 Twitter Analysis"
author: "Eric Leung"
output:
  md_document:
    toc: true
    df_print: "kable"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
```
## Load libraries
```{r load_packages, message=FALSE, warning=FALSE}
library(tidyverse)
library(tidytext)
library(ggrepel)
if (!requireNamespace("rtweet", quietly = TRUE)) install.packages("rtweet")
library(rtweet)
```
## Query data
Below is the code to query Twitter for tweets tagged `#NLMITC19` (or the
variant `#NLMIT19`). I ran this at 2019-06-28 22:50.
```{r query_tweets, eval=FALSE}
rt <- search_tweets("#NLMITC19 OR #NLMIT19", n = 1800, include_rts = FALSE)
saveRDS(rt, "nlmitc19_search.rds")
saveRDS(rt$status_id, "nlmitc19_search-ids.rds")
```
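Rather than rerun the search, the chunk below reads the saved search results if
they exist locally. Otherwise, it downloads the saved status IDs and looks up
the corresponding tweets.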
```{r read_in_data}
ids_file <- "nlmitc19_search-ids.rds"
nlmitc19_file <- "nlmitc19_search.rds"
# Read in the saved search results directly if they exist
if (file.exists(nlmitc19_file)) {
  rt <- readRDS(nlmitc19_file)
} else {
  # Download status IDs file; binary mode keeps the .rds intact on Windows
  download.file(
    "https://github.com/erictleung/NLMITC19/blob/master/data/nlmitc19_search-ids.rds?raw=true",
    ids_file,
    mode = "wb"
  )
  # Read status IDs from downloaded file
  ids <- readRDS(ids_file)
  # Look up data associated with status IDs
  rt <- rtweet::lookup_tweets(ids)
}
```
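The read-or-rebuild logic above is an instance of a general caching pattern. Here is a minimal sketch of it, using a hypothetical helper `read_or_build()` (not part of rtweet or this repository) and a temporary file with made-up IDs in place of the real tweet data. Unlike the chunk above, this helper also writes the rebuilt object back to disk.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise build it, cache it, and return it.
read_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

cache_file <- tempfile(fileext = ".rds")

# First call: no cache yet, so the builder function runs
first <- read_or_build(cache_file, function() c("1001", "1002"))

# Second call: the cache exists, so the builder is never invoked
second <- read_or_build(cache_file, function() stop("should not be called"))

identical(first, second)
```

Passing the expensive step as a function means it is only evaluated when the cache is missing, which is the same reason the chunk above only calls `lookup_tweets()` inside the `else` branch.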
## General tweet prevalence over time
Code modified from [`rstudioconf_tweets`][mk].
[mk]: https://github.com/mkearney/rstudioconf_tweets
```{r tweets_over_time, fig.height=7, fig.width=9}
rt %>%
  ts_plot("30 minutes", color = "transparent") +
  geom_smooth(method = "loess",
              se = FALSE,
              span = 0.05,
              size = 2,
              color = "#0066aa") +
  geom_point(size = 5,
             shape = 21,
             fill = "#ADFF2F99",
             color = "#000000dd") +
  # ggplot2 theme
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444")) +
  # Caption information
  labs(title = "Frequency of tweets about #NLMITC19 over time",
       subtitle = "Twitter status counts aggregated using half-hour intervals",
       caption = "\n\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet",
       x = NULL, y = NULL)
```
This pattern makes sense, considering the conference ran for two days.
## Most prolific tweeters?
```{r most_prolific_tweeter, fig.height=7, fig.width=9}
rt %>%
  group_by(screen_name) %>%
  summarise(tweets = n()) %>%
  ggplot(aes(x = tweets, y = reorder(screen_name, tweets))) +
  geom_point() +
  # Theme styling information
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444")) +
  # Labels
  labs(title = "Top tweeters using\n#NLMITC19 or #NLMIT19",
       x = "Total number of tweets",
       y = "Twitter username",
       caption = "\n\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet")
```
## Relationship between follower count and tweet popularity
Do users with more followers have more popular tweets?
I take the average number of favorites across each user's tweets and normalize
it by dividing by that user's total number of tweets.
```{r follower_vs_favorites, fig.height=7, fig.width=9}
rt %>%
  # Preprocess and count average favorites normalized by number of tweets
  group_by(screen_name) %>%
  mutate(avg_fav = mean(favorite_count)) %>%
  mutate(avg_norm_fav = avg_fav / n()) %>%
  ungroup() %>%
  select(screen_name, avg_fav, avg_norm_fav, followers_count) %>%
  distinct() %>%
  # Offset to not create infinite values when log transforming
  mutate(followers_count = followers_count + 0.001) %>%
  mutate(avg_norm_fav = avg_norm_fav + 0.001) %>%
  # Plot results
  ggplot(aes(x = followers_count, y = avg_norm_fav, label = screen_name)) +
  geom_text_repel() +
  geom_point() +
  # Use log-scale for x-axis and y-axis
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Average normalized number of favorites\nversus user follower count",
       x = "Number of followers",
       y = "Average normalized number of favorites",
       caption = "\nSource: Data gathered via Twitter's standard `search/tweets` API using rtweet") +
  # Theme styling information
  theme_minimal(base_size = 15) +
  theme(axis.text = element_text(colour = "#222222"),
        plot.title = element_text(size = rel(1.7), face = "bold"),
        plot.subtitle = element_text(size = rel(1.3)),
        plot.caption = element_text(colour = "#444444"))
```
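To make the normalization above concrete, here is a small sketch on made-up data. The users `alice` and `bob` and all of their counts are invented for illustration. The sketch uses `summarise()` rather than the `mutate()` plus `distinct()` combination above, but it computes the same two quantities per user.

```r
library(dplyr)

# Made-up tweets: alice tweets four times, bob only once
toy <- tibble(
  screen_name     = c("alice", "alice", "alice", "alice", "bob"),
  favorite_count  = c(2, 4, 6, 8, 5),
  followers_count = c(100, 100, 100, 100, 2000)
)

toy_summary <- toy %>%
  group_by(screen_name) %>%
  summarise(
    tweets          = n(),
    avg_fav         = mean(favorite_count),  # alice: 20 / 4 = 5, bob: 5 / 1 = 5
    avg_norm_fav    = avg_fav / tweets,      # alice: 5 / 4 = 1.25, bob: 5 / 1 = 5
    followers_count = first(followers_count)
  )

toy_summary
```

Both users average five favorites per tweet, but dividing by the tweet count gives the more prolific user the lower normalized value. This is worth keeping in mind when reading the plot.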
## Chatterplot of tweet words
```{r process_for_chatter}
rt_no_stop <- rt %>%
  # Just look at tweet text
  select(text, favorite_count) %>%
  # Remove web links
  mutate(text = str_replace_all(text, "https?[:graph:]+", "")) %>%
  # Remove mentions
  # Rules are that names are alphanumeric and can have underscores.
  # Names can also be preceded with "." or end with some punctuation
  # Twitter:
  # help.twitter.com/en/managing-your-account/twitter-username-rules
  # To avoid emails:
  # stackoverflow.com/questions/4424179/how-to-validate-a-twitter-username-using-regex#comment21201837_4424288
  mutate(text = str_replace_all(text,
                                "\\.?@([:alnum:]|_){1,15}(?![.A-Za-z])[:graph:]?",
                                "")) %>%
  # Tokenize text to just single words
  unnest_tokens(word, text) %>%
  # Remove stop words (e.g., "a", "the", "and", etc)
  anti_join(get_stopwords(), by = "word")
# Get average number of favorites
rt_word_avg_fav <- rt_no_stop %>%
  # Average favorite count
  group_by(word) %>%
  summarize(avg_fav = mean(favorite_count))
# Count number of mentions
rt_counts <- rt_no_stop %>%
  # Create word counts
  count(word, sort = TRUE)
# Filter low counts and join counts and average favorite score
chatter_rt <- rt_counts %>%
  filter(n > 1) %>%
  filter(word != "nlmitc19") %>%
  left_join(rt_word_avg_fav, by = "word")
```
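The mention-removal pattern above is easier to follow on a couple of toy strings. This sketch reuses the same regular expression. The handle and the email address are made up.

```r
library(stringr)

# Same mention pattern as in the chunk above
mention_pattern <- "\\.?@([:alnum:]|_){1,15}(?![.A-Za-z])[:graph:]?"

toy_text <- c(
  "Great talk by @jane_doe today",   # made-up mention
  "Contact me at jane@example.com"   # made-up email address
)

cleaned <- str_replace_all(toy_text, mention_pattern, "")
cleaned
```

The first string should lose its mention. The second should be left untouched, because the negative lookahead `(?![.A-Za-z])` rejects a handle that is directly followed by a letter or a period, as in `@example.com`. The same lookahead also means a mention directly followed by a period would not be removed.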
Code below modified from ["RIP wordclouds, long live CHATTERPLOTS"][wordcloud].
[wordcloud]: https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098
```{r plot_chatter, fig.height=7, fig.width=9}
chatter_rt %>%
  # Add small offset average favorite counts because some are zero and we log
  # transform, which can introduce infinite values
  mutate(avg_fav = avg_fav + 0.001) %>%
  # Gather just top 100 mentions
  top_n(100, wt = n) %>%
  ggplot(aes(x = avg_fav, y = n, label = word)) +
  geom_text_repel(segment.alpha = 0,
                  aes(colour = avg_fav, size = n)) +
  # Set color gradient, log transform & customize legend
  scale_color_gradient(low = "green3", high = "violetred",
                       trans = "log10",
                       guide = guide_colourbar(direction = "horizontal",
                                               title.position = "top")) +
  # Set word size range & turn off legend
  scale_size_continuous(range = c(3, 10),
                        guide = "none") +
  # Use log-scale for x-axis
  scale_x_log10() +
  ggtitle(paste0("Top 100 words from ",
                 nrow(rt),
                 " #NLMITC19 tweets, by frequency"),
          subtitle = "Word frequency (size) ~ Avg number of favorites (color)") +
  labs(y = "Word frequency across all tweets",
       x = "Avg number of favorites in tweets containing word (log scale)",
       colour = "Avg num of favs (log)") +
  # minimal theme & customizations
  theme_minimal() +
  theme(legend.position = c(0.20, 0.99),
        legend.justification = c("right","top"),
        panel.grid.major = element_line(colour = "whitesmoke"))
```
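The small offset added before the log transform matters because some words have an average of zero favorites. A short sketch of the arithmetic, using made-up averages:

```r
# log10() of zero is -Inf, which cannot be placed on a log-scaled axis
log10(0)

# Adding the small offset used above gives a finite value instead
log10(0 + 0.001)

# Made-up averages: every shifted value is finite after the transform
avg_fav <- c(0, 1, 10)
shifted <- log10(avg_fav + 0.001)
all(is.finite(shifted))
```

With an offset of 0.001, words that were never favorited land at about -3 on the log10 scale, well to the left of any word with a non-zero average.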
## Session information
```{r}
sessionInfo()
```
Owner
- Name: Eric Leung
- Login: erictleung
- Kind: user
- Location: New York, NY
- Website: https://erictleung.com
- Repositories: 169
- Profile: https://github.com/erictleung
Data science generalist. Sharing knowledge and optimizing tools for learning and growth. Open-source and open-data advocate. Community learner.
Committers
Top Committers
| Name | Email | Commits |
|---|---|---|
| Eric Leung | e****c@e****m | 8 |
Issues and Pull Requests
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0