Markdown is a lightweight markup language that provides a simple and readable way to write formatted text without using complex HTML or LaTeX. It is designed to make authoring content easy for everyone!
Markdown files use the .md extension.
R Markdown is an extension of Markdown that incorporates R code chunks and allows you to create dynamic documents that integrate text, code, and output (such as tables and plots). R Markdown files use the .Rmd extension.
How it all works:
Illustration by Allison Horst (www.allisonhorst.com)
R Scripts
Great for quick debugging and experimentation
Preferred format if you are archiving your code to GitHub or data repository
More suitable for “production” tasks e.g. automating your data cleaning and processing, custom functions, etc.
Quarto
Great for reports and presentations that showcase your research insights/process, as it integrates code, narrative text, visualizations, and results.
Very handy when you need your report in multiple formats, e.g. in Word and PPT.
To create a Quarto document, go to File > New File > Quarto Document.
Choose HTML as the end result for now, then click Create!
You can change the final result of rendering in the YAML section of your document.
From this point through sessions 4 and 5, we will use Quarto documents for our hands-on work and exercises!
Create a new R code chunk in your Quarto document and add code that loads the CSV into a new tibble called fs_data.
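A minimal sketch of the loading step, assuming the tidyverse is installed. The workshop CSV's file name is not given here, so csv_path is a placeholder: the sketch writes a tiny, made-up CSV to a temporary location purely so that it runs on its own.

```r
library(tidyverse)

# Stand-in file so this sketch is self-contained: a tiny, made-up CSV.
# In the workshop, csv_path would instead point at the provided CSV file.
csv_path <- tempfile(fileext = ".csv")
write_csv(
  tibble(
    pid    = c("A1", "B2", "C3"),
    rank   = c("prof", "assocprof", "asstprof"),
    salary = c(120000, 95000, 80000)
  ),
  csv_path
)

# read_csv() returns a tibble and guesses each column's type
fs_data <- read_csv(csv_path)

# glimpse() is a quick way to check column names and types after loading
glimpse(fs_data)
```

Checking the column types right after loading makes it easier to decide later which variables are discrete and which are continuous.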
ggplot is a plotting package that is included in the tidyverse package.
It works best with data in the long format, i.e., a column for all the dimensions/measures and another column for the value of each dimension/measure.
Charts built with ggplot must include the following:
Data - the dataframe/tibble to visualize.
Aesthetic mappings (aes) - describes which variables are mapped to the x, y axes, alpha (transparency) and other visual aesthetics.
Geometric objects (geom) - describes how values are rendered, e.g. as bars, points (scatterplot), lines, etc.
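The three required pieces can be seen in a small sketch. The toy tibble below is made up for illustration (it only borrows the column names yrs.service and salary); it is not the workshop data.

```r
library(tidyverse)

# Made-up toy data, for illustration only
toy <- tibble(
  yrs.service = c(2, 5, 9, 14, 20, 27),
  salary      = c(78000, 82000, 95000, 101000, 118000, 130000)
)

p <- ggplot(
  data = toy,                                  # 1. data
  mapping = aes(x = yrs.service, y = salary)   # 2. aesthetic mappings
) +
  geom_point()                                 # 3. geometric object

p
```

Swapping geom_point() for another geom changes how the same mapped values are drawn, without touching the data or the aes() call.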
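Our PI has asked us to generate visualizations to address these questions:
Univariate visualizations:
1. What’s the distribution of faculty for each rank?
2. What’s the distribution of salary of our faculty?
Bivariate/multivariate visualizations:
3. Compare the distribution of salary across faculty rank.
4. Visualize and explore the distribution of yrs.service and salary. Don’t forget to label the graph!
5. Are there any outliers in the TEARS score in 2023?
6. Explore the mean for each feedback question (Q1 to Q5). Group the results by rank.
Tip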
A strategy I’d like to recommend: briefly read over the ggplot2 documentation and keep it open in a separate tab. Figure out the type of each variable you need to visualize (discrete or continuous) to quickly identify which visualization would make sense.
What’s the distribution of faculty for each rank?
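Since rank is a discrete variable, a bar chart of counts is one reasonable choice. This is a sketch on made-up toy data, not necessarily the workshop's own solution; with the real data, toy would be replaced by fs_data.

```r
library(tidyverse)

# Made-up toy data: one row per faculty member
toy <- tibble(
  rank = c("prof", "prof", "prof", "assocprof", "assocprof", "asstprof")
)

p <- toy %>%
  ggplot(aes(x = rank)) +
  geom_bar() +   # geom_bar() counts the rows in each rank by default
  labs(x = "Rank", y = "Number of faculty")

p
```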
What’s the distribution of salary of our faculty?
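Because salary is continuous, a histogram is one common way to look at its distribution. The sketch below uses made-up values, and the choice of bins = 5 is an arbitrary assumption that should be tuned for the real data.

```r
library(tidyverse)

# Made-up toy salaries, for illustration only
toy <- tibble(
  salary = c(78000, 81000, 85000, 90000, 96000, 99000, 105000, 120000, 140000)
)

p <- toy %>%
  ggplot(aes(x = salary)) +
  geom_histogram(bins = 5) +   # bins controls how many bars the range is split into
  labs(x = "Salary (in USD)", y = "Number of faculty")

p
```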
Compare the distribution of salary across faculty rank.
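One discrete variable (rank) against one continuous variable (salary) is a natural fit for a boxplot. A sketch on made-up toy data, assuming the same column names as fs_data:

```r
library(tidyverse)

# Made-up toy data: three faculty members per rank
toy <- tibble(
  rank   = rep(c("asstprof", "assocprof", "prof"), each = 3),
  salary = c(78000, 81000, 85000,
             90000, 96000, 99000,
             105000, 120000, 140000)
)

p <- toy %>%
  ggplot(aes(x = rank, y = salary)) +
  geom_boxplot() +   # one box per rank: median, quartiles, whiskers
  labs(x = "Rank", y = "Salary (in USD)")

p
```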
Image source: https://www.leansigmacorporation.com/box-plot-with-minitab/
Visualize and explore the distribution of yrs.service and salary. Do you see any trend or differences across discipline? Don’t forget to label the graph!
We can layer two (or more) geom objects!
Use labs to specify the title, axis labels, subtitles, captions, etc.
# scatterplot with trendline
fs_data %>%
  ggplot(aes(x = yrs.service, y = salary,
             color = discipline,
             shape = discipline)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  labs(x = "Years of Service",
       y = "Salary (in USD)",
       title = "A Distribution of years of service and salary",
       subtitle = "Comparison between disciplines",
       caption = "Salary is a 3-year average from 2020 to 2023 in USD")
The result:
Are there any outliers in the TEARS score in 2023? (hint: you would need the long data format for this to be easier!)
If we take a look at the end result that we want, there are 3 variables/columns (research.2023, teaching.2023, service.2023) that we need to visualize on the x-axis. But as we already know, the x parameter inside ggplot can only accept one column!
This code below will produce a very odd-looking graph which is totally not what we want.
This means we need to “squish” all the columns/variables that we want into a single column, so that we can assign that column to the x-axis. The same goes for the values of each variable; we need them in a single column as well, which we will assign to the y-axis.
This shape is referred to as the “long” data shape.
Step 1: Let’s transform the data shape into a long format and save it to a separate dataframe called tears_data.
tears_data <- fs_data %>%
  select(pid, research.2023, teaching.2023, service.2023) %>%
  pivot_longer(
    cols = c(research.2023, teaching.2023, service.2023),
    names_to = "indicator.year",
    values_to = "score"
  )
print(tears_data)
# A tibble: 1,200 × 3
   pid      indicator.year score
   <chr>    <chr>          <dbl>
 1 5LJVA1YT research.2023   79.3
 2 5LJVA1YT teaching.2023   81.9
 3 5LJVA1YT service.2023    56.5
 4 0K5IFNJ3 research.2023   87.1
 5 0K5IFNJ3 teaching.2023   83.8
 6 0K5IFNJ3 service.2023    80.7
 7 PBTVSOUY research.2023   79.6
 8 PBTVSOUY teaching.2023   76.7
 9 PBTVSOUY service.2023    81.0
10 FJ32GPV5 research.2023   78.9
# ℹ 1,190 more rows
Step 2: Now that the variables that we want are all in a single column called indicator.year, we can visualize this more easily!
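As a sketch of what this step could look like, the long column can go on the x-axis of a boxplot, which draws points beyond 1.5 times the interquartile range as outliers. The scores below are made up, and toy_long stands in for tears_data.

```r
library(tidyverse)

# Made-up wide data with the same column names as the workshop data
toy_scores <- tibble(
  pid           = c("A1", "B2", "C3", "D4", "E5"),
  research.2023 = c(79, 87, 80, 78, 82),
  teaching.2023 = c(82, 84, 77, 80, 79),
  service.2023  = c(56, 81, 81, 80, 82)
)

# Same reshaping as Step 1: three score columns become two long columns
toy_long <- toy_scores %>%
  pivot_longer(
    cols = c(research.2023, teaching.2023, service.2023),
    names_to = "indicator.year",
    values_to = "score"
  )

p <- toy_long %>%
  ggplot(aes(x = indicator.year, y = score)) +
  geom_boxplot() +   # outliers appear as individual points
  labs(x = "Indicator", y = "Score")

p
```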
The result:
Explore the mean for each feedback question (Q1 to Q5). Optional: Group the results by rank.
Step 1: Notice that we need to put multiple variables Q1-Q5 on the x-axis again. We can use the strategy that we used on Task #5 earlier here, which is to squish all the columns that we want into a long data format.
fs_data %>%
  select(pid, Q1:Q5, rank) %>%
  pivot_longer(cols = c("Q1", "Q2", "Q3", "Q4", "Q5"),
               names_to = "question",
               values_to = "response")
# A tibble: 2,000 × 4
   pid      rank  question response
   <chr>    <chr> <chr>       <dbl>
 1 5LJVA1YT prof  Q1              3
 2 5LJVA1YT prof  Q2              3
 3 5LJVA1YT prof  Q3              5
 4 5LJVA1YT prof  Q4              4
 5 5LJVA1YT prof  Q5              4
 6 0K5IFNJ3 prof  Q1              4
 7 0K5IFNJ3 prof  Q2              5
 8 0K5IFNJ3 prof  Q3              4
 9 0K5IFNJ3 prof  Q4              5
10 0K5IFNJ3 prof  Q5              4
# ℹ 1,990 more rows
Step 2: Then, once all the variables Q1-Q5 are in a single column, we can use group_by to group the values in the squished column, and use summarise to calculate the mean for each of these groupings and save it into a mean_score column.
fs_data %>%
  select(pid, Q1:Q5, rank) %>%
  pivot_longer(cols = c("Q1", "Q2", "Q3", "Q4", "Q5"),
               names_to = "question",
               values_to = "response") %>%
  group_by(question, rank) %>%
  summarise(mean_score = mean(response))
# A tibble: 15 × 3
# Groups:   question [5]
   question rank      mean_score
   <chr>    <chr>          <dbl>
 1 Q1       assocprof       4.15
 2 Q1       asstprof        4.40
 3 Q1       prof            4.35
 4 Q2       assocprof       3.74
 5 Q2       asstprof        3.76
 6 Q2       prof            3.71
 7 Q3       assocprof       3.91
 8 Q3       asstprof        3.91
 9 Q3       prof            3.87
10 Q4       assocprof       4.12
11 Q4       asstprof        3.84
12 Q4       prof            3.89
13 Q5       assocprof       3.82
14 Q5       asstprof        4.06
15 Q5       prof            4.01
Step 3: Remember to save all of this wrangling into a new data frame! Let’s call it fs_data_mean.
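Saving the result only requires assigning the whole pipeline to a name. The sketch below runs on a made-up stand-in called toy_fs_data so that it is self-contained; in the workshop, fs_data would take its place. The .groups = "drop" argument is an addition not used in the steps above: it returns an ungrouped tibble.

```r
library(tidyverse)

# Made-up stand-in for fs_data: 4 faculty members, 2 ranks
toy_fs_data <- tibble(
  pid  = c("A1", "B2", "C3", "D4"),
  rank = c("prof", "prof", "asstprof", "asstprof"),
  Q1   = c(3, 5, 4, 4),
  Q2   = c(4, 4, 2, 4),
  Q3   = c(5, 3, 5, 5),
  Q4   = c(4, 2, 3, 5),
  Q5   = c(4, 4, 1, 3)
)

# Reshape to long, then average each question within each rank
fs_data_mean <- toy_fs_data %>%
  select(pid, Q1:Q5, rank) %>%
  pivot_longer(cols = Q1:Q5,
               names_to = "question",
               values_to = "response") %>%
  group_by(question, rank) %>%
  summarise(mean_score = mean(response), .groups = "drop")

print(fs_data_mean)
```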
We should be able to visualize fs_data_mean using ggplot now. To break the graph into smaller plots, we can use facet_wrap (refer to the cheatsheet for more details).
fs_data_mean %>%
  ggplot(aes(x = question, y = mean_score, fill = question)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap( ~ rank)
Explore yrs.since.phd with score.2023. Is there a trend or relationship there, and does the trend differ between different rank?
Visualize salary across different sex with a violin plot.
Make sure to give the graph a proper title and labels on both axes!
The Quarto cheatsheet is also available here: https://posit.co/resources/cheatsheets/
Check out the R Graph gallery for inspiration and code samples!
Next session: statistical tests in R (we will be using Quarto doc for session 4 and 5)