`dplyr` and `tidyr` (part of the `tidyverse` package)
Environment panel is empty (click on the broom icon to clean it up); Console and Plots too.
`library(tidyverse)` and the `read_csv` portion in the previous session
Use `read_csv` from the `readr` package (part of `tidyverse`) to load our data into a dataframe
Packages from the `tidyverse`. (click here to go to the tidyverse homepage)
Posit has created cheatsheets here! (You can keep this open in another tab for reference during this session!)
`dplyr` cheatsheet | PDF version (I personally prefer the PDF version since it's more visual)
The data is a synthetic dataset derived from the `Salaries` dataset in the `carData` package.
The scenario for this dataset: the college is implementing a new policy that aims to improve faculty’s teaching, research, and service performance score, or TEARS. The new policy, called TED (TEARS Enhancement Directive), is implemented for 3 years, starting from 2020 and concluding in 2023.
The data captures information about the faculty, their TEARS score, and their attitude towards the policy.
| Column | Description |
| --- | --- |
| `pid` | Unique ID to identify each observation |
| `rank` | Faculty's rank (AsstProf, AssocProf, Prof) |
| `discipline` | A ("theoretical" department) or B ("applied" department) |
| `yrs.since.phd` | Years since PhD |
| `yrs.service` | Years of service in the college |
| `sex` | Male or Female |
| `salary` | Average annual salary for the 3-year period (2020 to 2023) in USD |
| `teaching.2020`, `research.2020`, `service.2020` | TEARS scores for 2020, before the implementation of TED |
| `teaching.2023`, `research.2023`, `service.2023` | TEARS scores for 2023, after the implementation of TED |
| `Q1` to `Q5` | Faculty's responses to a 5-item Likert-scale feedback survey. The question was: "Please indicate your agreement with the following statements, with 1 = strongly disagree, 4 = neutral, and 7 = strongly agree." |
It’s good practice to do some preliminary checks on your data to get a better sense of it!
A few things that you can do:
Check for duplicates (hint: dplyr has a function for this!)
Check on overall distributions of the categorical data
Plot the distribution of the numerical/continuous data
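These checks might look something like the sketch below. It assumes the data has been loaded into a dataframe called `fp_data` (the name is just an assumption for illustration):

```r
library(tidyverse)

# Check for duplicates: distinct() drops duplicate rows,
# so compare its row count with the original
fp_data %>% distinct() %>% nrow()
nrow(fp_data)

# Overall distribution of the categorical columns
fp_data %>% count(rank, discipline, sex)

# Distribution of a numerical/continuous column
hist(fp_data$salary)
```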
Scenario: Our Principal Investigator (PI) has multiple research questions to answer. As the Research Assistant (RA), we have been asked to do the following 9 tasks:
Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv
Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv
Make the character columns such as `rank`, `discipline`, and `sex` lowercase.
Convert all the categorical columns such as `rank`, `discipline`, and `sex` to factors. Make sure `rank` is an ordered factor.
Reverse-code the answers to `Q3` and `Q4`. Save the reversed results to `Q3r` and `Q4r`, respectively.
Create a new column called `sex_bin` that dummy-codes the `sex` variable. Assign `male` to 0 and `female` to 1.
Create new columns called `score.2020` and `score.2023` that average each faculty's TEARS scores for the respective year. Save all of the changes you made from step 3 onwards into a new CSV called `faculty_eval_with_scores.csv`.
Generate summary stats of `salary` grouped by `rank` and `discipline`. The summary stats should include the mean, median, max, min, standard deviation, and n (number of observations).
Reshape the TEARS scores (`teaching`, `research`, `service`, and `scores`) for both 2020 and 2023 into a long data shape to prepare for visualization and analysis. (More on data shapes later.)
A strategy I'd like to recommend: briefly read over the `dplyr` and `tidyr` documentation, either the PDF or HTML version, and keep them open in a separate tab so that you can refer to them quickly.
Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv
The first few rows in the CSV:
# A tibble: 400 × 6
pid Q1 Q2 Q3 Q4 Q5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT 3 3 5 4 4
2 0K5IFNJ3 4 5 4 5 4
3 PBTVSOUY 4 4 4 6 4
# ℹ 397 more rows
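One way to produce this (a sketch, assuming the data was loaded into a dataframe called `fp_data`):

```r
library(tidyverse)

fp_data %>%
  select(pid, Q1:Q5) %>%          # keep only the ID and questionnaire columns
  write_csv("faculty-q1q5.csv")   # save the selection to its own CSV
```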
Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv
The first few rows in the resulting CSV:
# A tibble: 257 × 18
pid rank discipline yrs.since.phd yrs.service sex salary teaching.2020
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 EVWK1KP3 Prof B 38 38 Male 231545 52.6
2 0NP9Q0DI Prof A 43 43 Male 205500 72.0
3 IIMZK4CE Prof A 29 7 Male 204000 69.1
# ℹ 254 more rows
# ℹ 10 more variables: teaching.2023 <dbl>, research.2020 <dbl>,
# research.2023 <dbl>, service.2020 <dbl>, service.2023 <dbl>, Q1 <dbl>,
# Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>
It is possible to complete the task like this:
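For example, a sketch using an intermediate dataframe (assuming the data lives in `fp_data`):

```r
library(tidyverse)

# Step 1: keep only faculty earning more than 100k
salary_df <- filter(fp_data, salary > 100000)

# Step 2: arrange from highest to lowest salary
salary_df <- arrange(salary_df, desc(salary))

# Step 3: save the result
write_csv(salary_df, "faculty-100k.csv")
```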
But you’ll notice that it will require us to create another dataframe (salary_df
)!
It may be OK for this case because we only need to do one or two data-wrangling tasks, but what if we need to do three? Or five? What if we have lots of dataframes to wrangle?
Enter the pipe operator `%>%`, which allows us to "chain" functions or tasks. The pipe operator can make things more efficient! Here is the code above re-written with the pipe operator:
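A piped version of the same steps (again assuming the data is in `fp_data`):

```r
library(tidyverse)

fp_data %>%
  filter(salary > 100000) %>%    # keep faculty earning more than 100k
  arrange(desc(salary)) %>%      # highest to lowest
  write_csv("faculty-100k.csv")  # save to CSV
```

No intermediate dataframe is needed: each function's output flows directly into the next one.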
Since we declared the dataframe that we'd like to wrangle (`fp_data`) at the start, we don't have to specify the dataframe again in the wrangling functions `select`, `arrange`, and `write_csv`.
Keyboard shortcut: `Ctrl`+`Shift`+`M` on Windows, `Cmd`+`Shift`+`M` on Mac
Time: 5 minutes!
Retrieve only faculty with rank "Prof" who have served at least 4 years and are from discipline B. Keep only their `pid`, `rank`, `discipline`, and `yrs.service` in a new dataframe called `fp_exec1`, in ascending order of `yrs.service`.
The first few rows of `fp_exec1`:
# A tibble: 135 × 4
pid rank discipline yrs.service
<chr> <chr> <chr> <dbl>
1 M1H58TKZ Prof B 4
2 3QGK7KJ8 Prof B 4
3 T6ZTREEJ Prof B 4
# ℹ 132 more rows
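A possible solution sketch (assuming the data is in `fp_data`):

```r
library(tidyverse)

fp_exec1 <- fp_data %>%
  filter(rank == "Prof", yrs.service >= 4, discipline == "B") %>%
  select(pid, rank, discipline, yrs.service) %>%
  arrange(yrs.service)   # ascending order is the default
```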
Make all character columns such as `rank`, `discipline`, and `sex` lowercase.
The end result:
# A tibble: 400 × 3
rank sex discipline
<chr> <chr> <chr>
1 prof male b
2 prof male b
3 asstprof male b
# ℹ 397 more rows
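One way to do this is with `mutate()` and `across()`, which applies a function to several columns at once (a sketch, assuming the data is in `fp_data`; note we name the columns explicitly so that `pid` keeps its original casing):

```r
library(tidyverse)

fp_data <- fp_data %>%
  mutate(across(c(rank, discipline, sex), tolower))  # lowercase these columns
```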
Convert all the categorical columns such as `rank`, `discipline`, and `sex` to factors. Make sure `rank` is an ordered factor.
The rank column after the change:
tibble [400 × 1] (S3: tbl_df/tbl/data.frame)
$ rank: Ord.factor w/ 3 levels "asstprof"<"assocprof"<..: 3 3 1 3 3 2 3 3 3 3 ...
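A sketch of the conversion (assuming `fp_data`, with the columns already lowercased):

```r
library(tidyverse)

fp_data <- fp_data %>%
  mutate(
    discipline = as.factor(discipline),
    sex        = as.factor(sex),
    # rank has a natural order, so make it an ordered factor
    rank = factor(rank,
                  levels = c("asstprof", "assocprof", "prof"),
                  ordered = TRUE)
  )
```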
Reverse-code the answers to `Q3` and `Q4`. Save the reversed results to `Q3r` and `Q4r`, respectively.
Reverse-coding: a technique used in questionnaire-based research to score some survey items in the opposite direction of their original scale to mitigate response biases and ensure response validity.
The original and reverse-coded columns:
# A tibble: 400 × 5
pid Q3 Q3r Q4 Q4r
<chr> <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT 5 3 4 4
2 0K5IFNJ3 4 4 5 3
3 PBTVSOUY 4 4 6 2
4 FJ32GPV5 3 5 4 4
# ℹ 396 more rows
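On a 7-point scale, a reversed score is `8 - x`, so 1 becomes 7, 4 stays 4, and 7 becomes 1. A sketch (assuming `fp_data`):

```r
library(tidyverse)

fp_data <- fp_data %>%
  mutate(Q3r = 8 - Q3,   # e.g. Q3 = 5 becomes Q3r = 3
         Q4r = 8 - Q4)
```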
Create a new column called `sex_bin` that dummy-codes the `sex` variable. Assign `male` to 0 and `female` to 1.
Dummy coding: representing a categorical variable as a numerical variable; in this case 0 and 1, since the `sex` variable contains only two values, male and female.
The original columns and the dummy-coded one:
# A tibble: 3 × 2
sex sex_bin
<fct> <dbl>
1 male 0
2 female 1
3 male 0
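A sketch using `if_else()` (assuming `fp_data`):

```r
library(tidyverse)

fp_data <- fp_data %>%
  mutate(sex_bin = if_else(sex == "male", 0, 1))  # male -> 0, female -> 1
```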
Create a new column called `score.2020` that averages each faculty's TEARS scores for the year 2020.
Since what we are doing here is a calculation across different columns within a row (i.e., calculating the mean across 3 columns, one row at a time!), we need to apply the `rowwise()` function.
The new column that was added:
# A tibble: 400 × 2
# Rowwise:
pid score.2020
<chr> <dbl>
1 5LJVA1YT 64.4
2 0K5IFNJ3 76.0
3 PBTVSOUY 65.9
# ℹ 397 more rows
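A sketch with `rowwise()` (assuming `fp_data`):

```r
library(tidyverse)

fp_data <- fp_data %>%
  rowwise() %>%  # make mutate() compute one row at a time
  mutate(score.2020 = mean(c(teaching.2020, research.2020, service.2020)))
```

Without `rowwise()`, `mean()` would collapse each column across all 400 rows instead of averaging within each row.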
Time: 5 minutes
This exercise has 3 parts:
Create `score.2023`, which averages each faculty's TEARS scores for 2023.
Create `score_diff`, which subtracts the `score.2020` value from the `score.2023` value.
Save all the changes made from step 3 onwards into a new CSV called `faculty_eval_with_scores.csv`.
The new columns:
# A tibble: 400 × 4
# Rowwise:
pid score.2020 score.2023 score_diff
<chr> <dbl> <dbl> <dbl>
1 5LJVA1YT 64.4 72.5 8.11
2 0K5IFNJ3 76.0 83.9 7.86
3 PBTVSOUY 65.9 79.1 13.2
# ℹ 397 more rows
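A possible solution sketch (assuming `fp_data`, with `score.2020` already created):

```r
library(tidyverse)

fp_data <- fp_data %>%
  rowwise() %>%
  mutate(score.2023 = mean(c(teaching.2023, research.2023, service.2023)),
         score_diff = score.2023 - score.2020)

write_csv(fp_data, "faculty_eval_with_scores.csv")
```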
Generate summary stats of `salary` grouped by `rank` and `discipline`. The summary stats should include the mean, median, max, min, standard deviation, and n (number of observations).
fp_data %>%
group_by(rank, discipline) %>%
summarise(observation = n(),
mean_salary = mean(salary),
median_salary = median(salary),
highest = max(salary),
lowest = min(salary),
std_dev = sd(salary))
# A tibble: 6 × 8
# Groups: rank [3]
rank discipline observation mean_salary median_salary highest lowest std_dev
<ord> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asstp… a 25 74387. 74000 85226 63100 5149.
2 asstp… b 43 84594. 84716 97032 68404 7148.
3 assoc… a 26 83061. 81950. 108413 62884 11051.
4 assoc… b 39 100502. 100944 126431 71065 11201.
5 prof a 132 119870. 112748. 205500 57800 27868.
6 prof b 135 133394. 132261 231545 67559 25910.
Reshape the `pid` and TEARS scores (`teaching`, `research`, `service`, and `scores`) for both 2020 and 2023 into a long data shape to prepare them for visualization and analysis. Save this into a new dataframe called `fp_data_long`.
Wide data:
Each row is a unique observation.
Each column is a variable; the more variables you have, the "wider" the data.
This format is more intuitive for humans!
Long data:
Each row is a single measurement.
A separate column indicates the variable or type of measurement.
This format is more "understandable" by R and more suitable for visualizations.
Wide data:
ID Age Height Weight Income
1 1 25 170 65 50000
2 2 30 160 70 60000
3 3 35 175 80 75000
Long data:
# A tibble: 12 × 3
ID Variable Value
<int> <chr> <dbl>
1 1 Age 25
2 1 Height 170
3 1 Weight 65
4 1 Income 50000
5 2 Age 30
6 2 Height 160
7 2 Weight 70
8 2 Income 60000
9 3 Age 35
10 3 Height 175
11 3 Weight 80
12 3 Income 75000
The end result that we are looking for looks something like the example below. To achieve this, we can use `pivot_longer()` from the `tidyr` package!
# A tibble: 2,400 × 3
pid indicator.year score
<chr> <chr> <dbl>
1 5LJVA1YT teaching.2020 59.7
2 5LJVA1YT teaching.2023 81.9
3 5LJVA1YT research.2020 71.8
4 5LJVA1YT research.2023 79.3
# ℹ 2,396 more rows
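A sketch that reproduces the 2,400-row shape shown above, which covers the teaching, research, and service columns (assuming the data is in `fp_data`):

```r
library(tidyverse)

fp_data_long <- fp_data %>%
  ungroup() %>%  # drop any leftover rowwise grouping first
  select(pid, starts_with(c("teaching", "research", "service"))) %>%
  pivot_longer(-pid,                          # pivot every column except pid
               names_to  = "indicator.year",  # e.g. "teaching.2020"
               values_to = "score")
```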
Time: 5 minutes
Generate summary stats of `yrs.since.phd` grouped by `rank` and `sex`. The summary stats should include the mean, median, max, min, standard deviation, and n (number of observations). It should look something like this:
# A tibble: 6 × 8
# Groups: rank [3]
rank sex observation mean_years median_years longest shortest std_dev
<ord> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asstprof female 12 6.42 5.5 11 4 2.19
2 asstprof male 56 6.11 6 11 3 1.71
3 assocprof female 11 15.2 13 26 10 5.60
4 assocprof male 54 15.6 11.5 53 6 10.6
5 prof female 19 23.7 23 39 12 7.41
6 prof male 248 28.8 28 60 11 10.3
Let's say I have this column called `wrong_column` that I want to remove:
# A tibble: 400 × 2
# Rowwise:
pid wrong_column
<chr> <chr>
1 5LJVA1YT random values
2 0K5IFNJ3 random values
3 PBTVSOUY random values
# ℹ 397 more rows
Remove the wrong column by prefixing its name with `-` (minus) inside `select()`:
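A sketch (assuming the dataframe is `fp_data`):

```r
library(tidyverse)

fp_data <- fp_data %>%
  select(-wrong_column)  # the minus sign drops the column instead of keeping it
```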
# A tibble: 400 × 1
# Rowwise:
pid
<chr>
1 5LJVA1YT
2 0K5IFNJ3
3 PBTVSOUY
# ℹ 397 more rows
The `tidyr` and `dplyr` cheatsheets, and more: https://posit.co/resources/cheatsheets/
Next Session: Introduction to Quarto and data visualizations with ggplot