dplyr and tidyr (part of the tidyverse package)Environment panel is empty (click on broom icon to clean it up)Console and Plots too.library(tidyverse) and read_csv portion in the previous sessionUse read_csv from readr package (part of tidyverse) to load our data into a dataframe
Packages from tidyverse. (click here to go to the tidyverse homepage)
Posit have created cheatsheets here! (you can have this open in another tab for reference for this session!)
dplyr cheatsheet | pdf version (I personally prefer this PDF version since it’s more visual)
The data is a synthetic dataset derived from Salaries dataset in carData package.
The scenario for this dataset: the college is implementing a new policy that aims to improve faculty’s teaching, research, and service performance score, or TEARS. The new policy, called TED (TEARS Enhancement Directive), is implemented for 3 years, starting from 2020 and concluding in 2023.
The data captures information about the faculty, their TEARS score, and their attitude towards the policy.
pid |
Unique ID to identify each observation |
rank |
Faculty’s rank (AsstProf, AssocProf, Prof) |
discipline |
A (“theoretical” department) or B (“applied” department) |
yrs.since.phd |
years since PhD |
yrs.service |
years of service ini college |
sex |
Male or Female |
salary |
Average annual salary for 3 years period (2020 to 2023) in USD |
teaching.2020, research.2020, service.2020 |
TEARS score for 2020, before implementation of TED |
teaching.2023, research.2023, service.2023 |
TEARS score for 2023, after implementation of TED |
Q1 to Q5 |
Faculty’s response to 5 likert scale feedback survey. The question was: Please indicate your agreement with the following statements, with 1 = strongly disagree, 4 = neutral, and 7 = strongly agree.
|
It’s good practice to do some preliminary checks on your data to get a better sense of it!
A few things that you can do:
Check for duplicates (hint: dplyr has a function for this!)
Check on overall distributions of the categorical data
Plot the distribution of the numerical/continuous data
Scenario: Our Principal Investigator (PI) has multiple research questions to answer. As the Research Assistant (RA), we have been asked to do the following 9 tasks:
Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv
Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv
Make the character columns such as rank, discipline, and sex to be in lowercase.
Convert all the categorical columns such as rank, discipline, and sex to Factor. Make sure rank is ordered.
Reverse-code the answers to Q3 and Q4. Save the reversed result to Q3r and Q4r respectively.
Create a new column called sex_bin that dummy codes the sex variable. Assign male to 0 and female to 1.
Create a new column called score.2020 and score.2023 that averages each faculty’s TEARS score for the respective year. Save all of these changes you made from step 3 onwards into a new CSV called faculty_eval_with_scores.csv
Generate a summary stats of salary grouped by rank and discipline. The summary stats should include mean, median, max, min, std, and n (number of observations).
Reshape the TEARS scores (teaching, research, service, and scores) for both 2020 and 2023 into a long data shape to prepare it for visualizations and analysis. (More on data shape later)
A strategy I’d like to recommend: briefly read over the
dplyr+tidyrdocumentation, either the PDF or HTML version, and have them open on a separate tab so that you can refer to it quickly.
Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv
The first few rows in the CSV:
# A tibble: 400 × 6
pid Q1 Q2 Q3 Q4 Q5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT 3 3 5 4 4
2 0K5IFNJ3 4 5 4 5 4
3 PBTVSOUY 4 4 4 6 4
# ℹ 397 more rows
Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv
The first few rows in the resulting CSV:
# A tibble: 257 × 18
pid rank discipline yrs.since.phd yrs.service sex salary teaching.2020
<chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 EVWK1KP3 Prof B 38 38 Male 231545 52.6
2 0NP9Q0DI Prof A 43 43 Male 205500 72.0
3 IIMZK4CE Prof A 29 7 Male 204000 69.1
# ℹ 254 more rows
# ℹ 10 more variables: teaching.2023 <dbl>, research.2020 <dbl>,
# research.2023 <dbl>, service.2020 <dbl>, service.2023 <dbl>, Q1 <dbl>,
# Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>
It is possible to complete the task like this:
But you’ll notice that it will require us to create another dataframe (salary_df)!
It may be OK for this case because we only need to do 1-2 data wrangling task, but what if we need to do 2? or 3? or 5? what if we have lots of dataframe to wrangle?
Enter the pipe operator %>%, which will allow us to “chain” functions or tasks.
Pipe operator can make things more efficient! Here is the code above re-written with pipe operator:
Since we declared the dataframe that we’d like to wrangle on at the start (fp_data), we don’t have to specify the dataframe again on the wrangling functions select, arrange, and write_csv.
Keyboard shortcut:
Ctrl+Shift+Mon Windows,Cmd+Shift+Mon Mac
Time: 5 minutes!
Retrieve only faculty with rank “Prof”, has served at least 4 years, and is from discipline B. Keep only their pid, rank, discipline, and yrs.service into a new dataframe called fp_exec1 in ascending order of yrs.service.
The first few rows of fp_exec1:
# A tibble: 135 × 4
pid rank discipline yrs.service
<chr> <chr> <chr> <dbl>
1 M1H58TKZ Prof B 4
2 3QGK7KJ8 Prof B 4
3 T6ZTREEJ Prof B 4
# ℹ 132 more rows
Make all character columns such as rank, discipline, and sex to be in lowercase.
The end result:
# A tibble: 400 × 3
rank sex discipline
<chr> <chr> <chr>
1 prof male b
2 prof male b
3 asstprof male b
# ℹ 397 more rows
Convert all the categorical columns such as rank, discipline, and sex to Factor. Make sure rank is ordered.
The rank column after the change:
tibble [400 × 1] (S3: tbl_df/tbl/data.frame)
$ rank: Ord.factor w/ 3 levels "asstprof"<"assocprof"<..: 3 3 1 3 3 2 3 3 3 3 ...
Reverse-code the answers to Q3 and Q4. Save the reversed result to Q3r and Q4r respectively.
Reverse-coding: a technique used in questionnaire-based research to score some survey items in the opposite direction of their original scale to mitigate response biases and ensure response validity.
The original and reverse-coded columns:
# A tibble: 400 × 5
pid Q3 Q3r Q4 Q4r
<chr> <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT 5 3 4 4
2 0K5IFNJ3 4 4 5 3
3 PBTVSOUY 4 4 6 2
4 FJ32GPV5 3 5 4 4
# ℹ 396 more rows
Create a new column called sex_bin that dummy codes the sex variable. Assign male to 0 and female to 1.
Dummy coding: represent categorical variables as numerical variables, which in this case 0 and 1 since the
sexvariable contains only 2 values, male and female.
The original columns and the dummy-coded one:
# A tibble: 3 × 2
sex sex_bin
<fct> <dbl>
1 male 0
2 female 1
3 male 0
Create a new column called score.2020 that averages each faculty’s TEARS score for year 2020.
Since what we are doing here is a calculation across different columns in a row (i.e. calculate the mean across 3 columns, one row at a time!), we need to apply rowwise() function.
The new column that was added:
# A tibble: 400 × 2
# Rowwise:
pid score.2020
<chr> <dbl>
1 5LJVA1YT 64.4
2 0K5IFNJ3 76.0
3 PBTVSOUY 65.9
# ℹ 397 more rows
Time: 5 minutes
This exercise have 3 parts:
score.2023score_diff that substract score.2023 value with score.2020 value.faculty_eval_with_scores.csvThe new columns:
# A tibble: 400 × 4
# Rowwise:
pid score.2020 score.2023 score_diff
<chr> <dbl> <dbl> <dbl>
1 5LJVA1YT 64.4 72.5 8.11
2 0K5IFNJ3 76.0 83.9 7.86
3 PBTVSOUY 65.9 79.1 13.2
# ℹ 397 more rows
Generate a summary stats of salary grouped by rank and discipline. The summary stats should include mean, median, max, min, std, and n (number of observations).
fp_data %>%
group_by(rank, discipline) %>%
summarise(observation = n(),
mean_salary = mean(salary),
median_salary = median(salary),
highest = max(salary),
lowest = min(salary),
std_dev = sd(salary))# A tibble: 6 × 8
# Groups: rank [3]
rank discipline observation mean_salary median_salary highest lowest std_dev
<ord> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asstp… a 25 74387. 74000 85226 63100 5149.
2 asstp… b 43 84594. 84716 97032 68404 7148.
3 assoc… a 26 83061. 81950. 108413 62884 11051.
4 assoc… b 39 100502. 100944 126431 71065 11201.
5 prof a 132 119870. 112748. 205500 57800 27868.
6 prof b 135 133394. 132261 231545 67559 25910.
Reshape the pid and TEARS scores (teaching, research, service, and scores) for both 2020 and 2023 into a long data shape to prepare it for visualizations and analysis. Save this into a new datafame called fp_data_long.
Wide Data:
Each row is a unique observation.
Each column is a variable –> the more variables you have, the “wider” is the data
This format is more intuitive for humans!
Long data:
Each row is a unique observation.
There is a separate column indicating the variable or type of measurements
This format is more “understandable” by R, more suitable for visualizations.
Wide data:
ID Age Height Weight Income
1 1 25 170 65 50000
2 2 30 160 70 60000
3 3 35 175 80 75000
Long data:
# A tibble: 12 × 3
ID Variable Value
<int> <chr> <dbl>
1 1 Age 25
2 1 Height 170
3 1 Weight 65
4 1 Income 50000
5 2 Age 30
6 2 Height 160
7 2 Weight 70
8 2 Income 60000
9 3 Age 35
10 3 Height 175
11 3 Weight 80
12 3 Income 75000
The end result that we are looking for looks something like this example below. To achieve this, we can use pivot_longer() from tidyr package!
Here is what it would look like:
# A tibble: 2,400 × 3
pid indicator.year score
<chr> <chr> <dbl>
1 5LJVA1YT teaching.2020 59.7
2 5LJVA1YT teaching.2023 81.9
3 5LJVA1YT research.2020 71.8
4 5LJVA1YT research.2023 79.3
# ℹ 2,396 more rows
Time: 5 minutes
Generate a summary stats of yrs.since.phd grouped by rank and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:
# A tibble: 6 × 8
# Groups: rank [3]
rank sex observation mean_years median_years longest shortest std_dev
<ord> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 asstprof female 12 6.42 5.5 11 4 2.19
2 asstprof male 56 6.11 6 11 3 1.71
3 assocprof female 11 15.2 13 26 10 5.60
4 assocprof male 54 15.6 11.5 53 6 10.6
5 prof female 19 23.7 23 39 12 7.41
6 prof male 248 28.8 28 60 11 10.3
Let’s say I have this column called wrong_column that I want to remove:
# A tibble: 400 × 2
# Rowwise:
pid wrong_column
<chr> <chr>
1 5LJVA1YT random values
2 0K5IFNJ3 random values
3 PBTVSOUY random values
# ℹ 397 more rows
Remove the wrong column with subset -:
# A tibble: 400 × 1
# Rowwise:
pid
<chr>
1 5LJVA1YT
2 0K5IFNJ3
3 PBTVSOUY
# ℹ 397 more rows
tidyr and dplyr cheatsheet, and more! https://posit.co/resources/cheatsheets/
Next Session: Introduction to Quarto and data visualizations with ggplot