Introductory R for Social Sciences - Session 2

Bella Ratmelia

Today’s Outline

  1. Loading our data into RStudio environment
  2. Data wrangling with dplyr and tidyr (part of the tidyverse package)

Checklist when you start RStudio

  • Load the project we created last session and open the R script file.
  • Make sure the Environment panel is empty (click on the broom icon to clean it up)
  • Clear the Console and Plots too.
  • Re-run the library(tidyverse) and read_csv portions from the previous session

Refresher: Loading from CSV into a dataframe

Use read_csv from readr package (part of tidyverse) to load our data into a dataframe

# import tidyverse library
library(tidyverse)

# read the CSV and save into a dataframe called fp_data
fp_data <- read_csv("data/faculty_policy_eval.csv")

# "peek at the data, pay attention to the data types!
glimpse(fp_data)

Cleaning data for analysis

  • Why do it in R? Because it’s much more efficient to do so in R, especially if your data is large (e.g. millions of rows, hundreds of columns) and you have repetitive clean-up tasks.
  • Incorrect or inconsistent data can lead to false conclusions, so it’s important to clean and prep it correctly.
  • Having a clear understanding of the desired data shape is essential, as real data often differs from what you imagine! Refer to the codebook, the actual questionnaire, or the appendix for guidance.
  • Data cleaning techniques differ based on the problems, data type, and the research questions you are trying to answer. Various methods are available, each with its own trade-offs.

About dplyr and tidyr

dplyr provides “verbs” for common wrangling tasks (e.g. filter(), select(), mutate(), summarise()), while tidyr helps you reshape data (e.g. pivot_longer(), pivot_wider()). Both are part of the tidyverse.

About the data

The data is a synthetic dataset derived from Salaries dataset in carData package.

The scenario for this dataset: the college is implementing a new policy that aims to improve faculty’s teaching, research, and service performance score, or TEARS. The new policy, called TED (TEARS Enhancement Directive), is implemented for 3 years, starting from 2020 and concluding in 2023.

The data captures information about the faculty, their TEARS score, and their attitude towards the policy.

Explanatory notes on each column
pid Unique ID to identify each observation
rank Faculty’s rank (AsstProf, AssocProf, Prof)
discipline A (“theoretical” department) or B (“applied” department)
yrs.since.phd years since PhD
yrs.service years of service in the college
sex Male or Female
salary Average annual salary for 3 years period (2020 to 2023) in USD
teaching.2020, research.2020, service.2020 TEARS score for 2020, before implementation of TED
teaching.2023, research.2023, service.2023 TEARS score for 2023, after implementation of TED
Q1 to Q5

Faculty’s responses to a 5-item Likert-scale feedback survey. The prompt was:

Please indicate your agreement with the following statements, with 1 = strongly disagree, 4 = neutral, and 7 = strongly agree.

  1. I feel adequately trained and informed about TED.
  2. TED implementation was effective and efficient.
  3. The communication about TED implementation was confusing.
  4. I found it challenging to understand the reasons behind TED.
  5. TED aligns well with our goals and values.

Prelim checks for your data

It’s good practice to do some preliminary checks on your data to get a better sense of it!

A few things that you can do:

  • Check for duplicates (hint: dplyr has a function for this!)

  • Check on overall distributions of the categorical data

  • Plot the distribution of the numerical/continuous data
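The checks above can be sketched like this (a minimal example against our fp_data; distinct() and count() are the dplyr functions hinted at, and hist() is a quick base-R plot):

```r
library(tidyverse)

# duplicates: distinct() drops repeated rows, so compare the row counts
nrow(fp_data) - nrow(distinct(fp_data))   # 0 means no duplicate rows

# categorical distributions: count() tallies each combination of levels
fp_data %>% count(rank, discipline)

# numerical distribution: a quick histogram of one continuous column
hist(fp_data$salary)
```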

Data wrangling activities specific to our data

Scenario: Our Principal Investigator (PI) has multiple research questions to answer. As the Research Assistant (RA), we have been asked to do the following 9 tasks:

  1. Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv

  2. Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv

  3. Make the character columns such as rank, discipline, and sex lowercase.

  4. Convert all the categorical columns such as rank, discipline, and sex to Factor. Make sure rank is ordered.

  5. Reverse-code the answers to Q3 and Q4. Save the reversed result to Q3r and Q4r respectively.

  6. Create a new column called sex_bin that dummy codes the sex variable. Assign male to 0 and female to 1.

  7. Create new columns called score.2020 and score.2023 that average each faculty’s TEARS score for the respective year. Save all of the changes you made from step 3 onwards into a new CSV called faculty_eval_with_scores.csv

  8. Generate summary stats of salary grouped by rank and discipline. The summary stats should include mean, median, max, min, std, and n (number of observations).

  9. Reshape the TEARS scores (teaching, research, and service) for both 2020 and 2023 into a long data shape to prepare it for visualizations and analysis. (More on data shape later)

A strategy I’d like to recommend: briefly read over the dplyr + tidyr documentation, either the PDF or HTML version, and keep it open in a separate tab so that you can refer to it quickly.

Let’s wrangle our data!

Task #1

Separate the answers to the questionnaires (Q1 to Q5) into a separate CSV called faculty-q1q5.csv

questionnaire_df <- select(fp_data, pid, Q1:Q5) # select columns
write_csv(questionnaire_df, "data-output/faculty-q1q5.csv")

The first few rows in the CSV:

# A tibble: 400 × 6
  pid         Q1    Q2    Q3    Q4    Q5
  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT     3     3     5     4     4
2 0K5IFNJ3     4     5     4     5     4
3 PBTVSOUY     4     4     4     6     4
# ℹ 397 more rows

Task #2

Retrieve only faculty who earned more than 100k, arrange it from highest to lowest, and save it into a separate CSV called faculty-100k.csv

The first few rows in the resulting CSV:

# A tibble: 257 × 18
  pid      rank  discipline yrs.since.phd yrs.service sex   salary teaching.2020
  <chr>    <chr> <chr>              <dbl>       <dbl> <chr>  <dbl>         <dbl>
1 EVWK1KP3 Prof  B                     38          38 Male  231545          52.6
2 0NP9Q0DI Prof  A                     43          43 Male  205500          72.0
3 IIMZK4CE Prof  A                     29           7 Male  204000          69.1
# ℹ 254 more rows
# ℹ 10 more variables: teaching.2023 <dbl>, research.2020 <dbl>,
#   research.2023 <dbl>, service.2020 <dbl>, service.2023 <dbl>, Q1 <dbl>,
#   Q2 <dbl>, Q3 <dbl>, Q4 <dbl>, Q5 <dbl>

Interlude: Pipe Operator ( %>% )

It is possible to complete the task like this:

salary_df <- filter(fp_data, salary > 100000)
salary_df_desc <- arrange(salary_df, desc(salary))
write_csv(salary_df_desc, "data-output/faculty-100k.csv")

But you’ll notice that it requires us to create intermediate dataframes (salary_df and salary_df_desc)!

That may be OK here because we only have two wrangling tasks, but what if we need three? Or five? What if we have lots of dataframes to wrangle?

Enter the pipe operator %>%, which will allow us to “chain” functions or tasks.

Interlude: Pipe Operator ( %>% )

The pipe operator can make things more efficient! Here is the code above re-written with the pipe operator:

fp_data %>% 
    filter(salary > 100000) %>% 
    arrange(desc(salary)) %>% 
    write_csv("data-output/faculty-100k.csv") 

Since we declared the dataframe we’d like to wrangle at the start (fp_data), we don’t have to specify it again in the chained functions filter, arrange, and write_csv.

Keyboard shortcut: Ctrl+Shift+M on Windows, Cmd+Shift+M on Mac

Group exercise 1 (solo attempts ok)

Time: 5 minutes!

Retrieve only faculty with rank “Prof” who have served at least 4 years and are from discipline B. Keep only their pid, rank, discipline, and yrs.service in a new dataframe called fp_exec1, in ascending order of yrs.service.

fp_exec1 <- fp_data %>% 
    filter(rank == "Prof" & yrs.service >= 4 & discipline == "B") %>% 
    select(pid, rank, discipline, yrs.service) %>% 
    arrange(yrs.service) 

The first few rows of fp_exec1:

# A tibble: 135 × 4
  pid      rank  discipline yrs.service
  <chr>    <chr> <chr>            <dbl>
1 M1H58TKZ Prof  B                    4
2 3QGK7KJ8 Prof  B                    4
3 T6ZTREEJ Prof  B                    4
# ℹ 132 more rows

Task #3

Make the character columns such as rank, discipline, and sex lowercase.

cols_to_change <- c("rank", "sex", "discipline")
fp_data <- fp_data %>% 
    mutate(across(all_of(cols_to_change), tolower)) # all_of() is needed when selecting with an external character vector

The end result:

# A tibble: 400 × 3
  rank     sex   discipline
  <chr>    <chr> <chr>     
1 prof     male  b         
2 prof     male  b         
3 asstprof male  b         
# ℹ 397 more rows
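If you would rather not maintain the cols_to_change vector, across() can also select columns by type. A sketch, using the tidyselect helper where():

```r
# alternative: lowercase every character column at once
fp_data <- fp_data %>% 
    mutate(across(where(is.character), tolower))
```

Note that this also lowercases pid, so the explicit vector is the safer choice for our data.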

Task #4

Convert all the categorical columns such as rank, discipline, and sex to Factor. Make sure rank is ordered.

fp_data <- fp_data %>% 
    mutate(across(all_of(cols_to_change), as.factor))

# make rank an ordered factor
fp_data <- fp_data %>% 
    mutate(rank = factor(rank, 
                         levels = c("asstprof", "assocprof", "prof"), 
                         ordered = TRUE))

The rank column after the change:

tibble [400 × 1] (S3: tbl_df/tbl/data.frame)
 $ rank: Ord.factor w/ 3 levels "asstprof"<"assocprof"<..: 3 3 1 3 3 2 3 3 3 3 ...

Task #5

Reverse-code the answers to Q3 and Q4. Save the reversed result to Q3r and Q4r respectively.

Reverse-coding: a technique used in questionnaire-based research to score some survey items in the opposite direction of their original scale to mitigate response biases and ensure response validity.

fp_data <- fp_data %>% 
    mutate(Q3r = 8 - Q3) # scale is 1 to 7

fp_data <- fp_data %>% 
    mutate(Q4r = 8 - Q4)

The original and reverse-coded columns:

# A tibble: 400 × 5
  pid         Q3   Q3r    Q4   Q4r
  <chr>    <dbl> <dbl> <dbl> <dbl>
1 5LJVA1YT     5     3     4     4
2 0K5IFNJ3     4     4     5     3
3 PBTVSOUY     4     4     6     2
4 FJ32GPV5     3     5     4     4
# ℹ 396 more rows
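A quick way to verify the reverse coding: on a 1-to-7 scale, each original item and its reversed twin should always sum to 8. A minimal sanity check:

```r
# every Q3 + Q3r pair (and Q4 + Q4r pair) should equal 8
stopifnot(all(fp_data$Q3 + fp_data$Q3r == 8),
          all(fp_data$Q4 + fp_data$Q4r == 8))
```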

Task #6

Create a new column called sex_bin that dummy codes the sex variable. Assign male to 0 and female to 1.

Dummy coding: representing a categorical variable as a numerical one; in this case 0 and 1, since the sex variable contains only 2 values, male and female.

fp_data <- fp_data %>% 
    mutate(sex_bin = if_else(sex == "male", 0, 1))

The original columns and the dummy-coded one:

# A tibble: 3 × 2
  sex    sex_bin
  <fct>    <dbl>
1 male         0
2 female       1
3 male         0

Task #7

Create a new column called score.2020 that averages each faculty’s TEARS score for year 2020.

Since what we are doing here is a calculation across different columns within a row (i.e. calculating the mean of 3 columns, one row at a time!), we need to apply the rowwise() function first.

fp_data <- fp_data %>% 
    rowwise() %>% 
    mutate(score.2020 = mean(c(teaching.2020, research.2020, service.2020)))

The new column that was added:

# A tibble: 400 × 2
# Rowwise: 
  pid      score.2020
  <chr>         <dbl>
1 5LJVA1YT       64.4
2 0K5IFNJ3       76.0
3 PBTVSOUY       65.9
# ℹ 397 more rows
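One caveat: rowwise() leaves the dataframe in a row-grouped state (note the # Rowwise: header in the output above), which changes how later verbs like mutate() and summarise() behave. It is good practice to drop that grouping once the per-row calculation is done:

```r
# drop the rowwise grouping so later operations see the whole dataframe again
fp_data <- fp_data %>% ungroup()
```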

Group exercise 2 (solo attempts ok)

Time: 5 minutes

This exercise has 3 parts:

  1. Replicate Task #7 for 2023 TEARS scores! Save the average score into a new column called score.2023
  2. Create a new column called score_diff that subtracts score.2020 from score.2023.
  3. Save all of these changes you made from step 3 up to this point into a new CSV called faculty_eval_with_scores.csv
fp_data %>% 
    rowwise() %>% 
    mutate(score.2020 = mean(c(teaching.2020, research.2020, service.2020))) %>% 
    mutate(score.2023 = mean(c(teaching.2023, research.2023, service.2023))) %>% 
    mutate(score_diff = score.2023 - score.2020) %>% 
    write_csv("data-output/faculty_eval_with_scores.csv")

The new columns:

# A tibble: 400 × 4
# Rowwise: 
  pid      score.2020 score.2023 score_diff
  <chr>         <dbl>      <dbl>      <dbl>
1 5LJVA1YT       64.4       72.5       8.11
2 0K5IFNJ3       76.0       83.9       7.86
3 PBTVSOUY       65.9       79.1      13.2 
# ℹ 397 more rows

Task #8

Generate summary stats of salary grouped by rank and discipline. The summary stats should include mean, median, max, min, std, and n (number of observations).

fp_data %>% 
    group_by(rank, discipline) %>% 
    summarise(observation = n(), 
              mean_salary = mean(salary),
              median_salary = median(salary), 
              highest = max(salary),
              lowest = min(salary),
              std_dev = sd(salary))
# A tibble: 6 × 8
# Groups:   rank [3]
  rank   discipline observation mean_salary median_salary highest lowest std_dev
  <ord>  <fct>            <int>       <dbl>         <dbl>   <dbl>  <dbl>   <dbl>
1 asstp… a                   25      74387.        74000    85226  63100   5149.
2 asstp… b                   43      84594.        84716    97032  68404   7148.
3 assoc… a                   26      83061.        81950.  108413  62884  11051.
4 assoc… b                   39     100502.       100944   126431  71065  11201.
5 prof   a                  132     119870.       112748.  205500  57800  27868.
6 prof   b                  135     133394.       132261   231545  67559  25910.

Task #9

Reshape the pid and TEARS scores (teaching, research, and service) for both 2020 and 2023 into a long data shape to prepare it for visualizations and analysis. Save this into a new dataframe called fp_data_long.

Long vs Wide Data

Wide Data:

  • Each row is a unique observation.

  • Each column is a variable -> the more variables you have, the “wider” the data

  • This format is more intuitive for humans!

Long data:

  • Each row is a unique observation.

  • There is a separate column indicating the variable or type of measurements

  • This format is more “understandable” by R and more suitable for visualizations.

Examples

Wide data:

  ID Age Height Weight Income
1  1  25    170     65  50000
2  2  30    160     70  60000
3  3  35    175     80  75000

Long data:

# A tibble: 12 × 3
      ID Variable Value
   <int> <chr>    <dbl>
 1     1 Age         25
 2     1 Height     170
 3     1 Weight      65
 4     1 Income   50000
 5     2 Age         30
 6     2 Height     160
 7     2 Weight      70
 8     2 Income   60000
 9     3 Age         35
10     3 Height     175
11     3 Weight      80
12     3 Income   75000
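Assuming the wide dataframe above is called wide_df (a name made up for this sketch), tidyr can move between the two shapes in either direction:

```r
library(tidyverse)

# hypothetical toy data matching the wide example above
wide_df <- tibble(ID     = 1:3,
                  Age    = c(25, 30, 35),
                  Height = c(170, 160, 175),
                  Weight = c(65, 70, 80),
                  Income = c(50000, 60000, 75000))

# wide -> long: one row per (ID, Variable) pair
long_df <- wide_df %>% 
    pivot_longer(cols = Age:Income, names_to = "Variable", values_to = "Value")

# long -> wide: back to one row per ID
long_df %>% 
    pivot_wider(names_from = "Variable", values_from = "Value")
```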

Back to Task #9…

The end result that we are looking for looks something like the example below. To achieve this, we can use pivot_longer() from the tidyr package!

fp_data_long <- fp_data %>% select(pid, teaching.2020:service.2023) %>%
    pivot_longer(
        cols = c(teaching.2020, teaching.2023, research.2020, research.2023, service.2020, service.2023),
        names_to = "indicator.year",
        values_to = "score"
    )

Here is what it would look like:

# A tibble: 2,400 × 3
  pid      indicator.year score
  <chr>    <chr>          <dbl>
1 5LJVA1YT teaching.2020   59.7
2 5LJVA1YT teaching.2023   81.9
3 5LJVA1YT research.2020   71.8
4 5LJVA1YT research.2023   79.3
# ℹ 2,396 more rows
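Going one step further (not required by the task): pivot_longer() can split the column names at the dot into separate indicator and year columns via names_sep, which is often handy for plotting. A sketch:

```r
# variant: split "teaching.2020" into indicator ("teaching") and year ("2020")
fp_data %>% 
    select(pid, teaching.2020:service.2023) %>% 
    pivot_longer(
        cols = teaching.2020:service.2023,
        names_to = c("indicator", "year"),
        names_sep = "\\.",   # split the names at the literal dot
        values_to = "score"
    )
```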

Group exercise 3 (solo attempts ok)

Time: 5 minutes

Generate summary stats of yrs.since.phd grouped by rank and sex. The summary stats should include mean, median, max, min, std, and n (number of observations). It should look something like this:

# A tibble: 6 × 8
# Groups:   rank [3]
  rank      sex    observation mean_years median_years longest shortest std_dev
  <ord>     <fct>        <int>      <dbl>        <dbl>   <dbl>    <dbl>   <dbl>
1 asstprof  female          12       6.42          5.5      11        4    2.19
2 asstprof  male            56       6.11          6        11        3    1.71
3 assocprof female          11      15.2          13        26       10    5.60
4 assocprof male            54      15.6          11.5      53        6   10.6 
5 prof      female          19      23.7          23        39       12    7.41
6 prof      male           248      28.8          28        60       11   10.3 
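One possible solution, mirroring the Task #8 pattern (the summary column names match the output above):

```r
fp_data %>% 
    group_by(rank, sex) %>% 
    summarise(observation  = n(), 
              mean_years   = mean(yrs.since.phd),
              median_years = median(yrs.since.phd), 
              longest      = max(yrs.since.phd),
              shortest     = min(yrs.since.phd),
              std_dev      = sd(yrs.since.phd))
```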

Bonus: Deleting rows from dataframe

Let’s say I have this column called wrong_column that I want to remove:

# A tibble: 400 × 2
# Rowwise: 
  pid      wrong_column 
  <chr>    <chr>        
1 5LJVA1YT random values
2 0K5IFNJ3 random values
3 PBTVSOUY random values
# ℹ 397 more rows

Remove the unwanted column with select() and the - (minus) operator:

fp_data <- fp_data %>% 
    select(-wrong_column)
# A tibble: 400 × 1
# Rowwise: 
  pid     
  <chr>   
1 5LJVA1YT
2 0K5IFNJ3
3 PBTVSOUY
# ℹ 397 more rows

End of Session 2!

tidyr and dplyr cheatsheet, and more! https://posit.co/resources/cheatsheets/

Next Session: Introduction to Quarto and data visualizations with ggplot