Introductory R for Social Sciences - Session 1

Bella Ratmelia

Welcome!

Some preface before we begin:

  • Instructor(s) introduction

  • Session outline

  • Workshop format

Introduction

  • Bella Ratmelia, Data Services Librarian @ SMU Libraries

  • Teaching Assistants: Shannon (MQF) and Hector (MITB)

  • Workshop website: https://bellaratmelia.quarto.pub/intro-r-socsci/

Sessions Outline

Session 1

  • Introduction to R and RStudio

  • 4 basic data types in R

  • 3 basic data structures in R

  • Objects and Vectors

Session 2

  • Introduction to tibble/dataframe

  • Manipulating dataframe with dplyr and tidyr

Session 3

  • Visualizing data with ggplot

  • Introduction to Quarto

Session 4

  • Inferential statistics in R: t-tests, chi-square, correlations, and ANOVA

  • Calculating Cronbach’s alpha in R (optional)

Session 5

  • Simple Linear Regression

  • Binary Logistic Regression

  • R Best Practices

Workshop Format

  • Live coding! Code along with me for the full tactile learning experience.

  • Occasional in-class exercises

  • (for IDIS100 only) weekly quiz (4 questions MCQ) after each session.

  • Don’t be afraid to ask for help!

Let’s begin!

What is R? What is RStudio?

R: The programming language and the software that interprets the R script

RStudio: An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.

You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so.

7 Reasons to learn R

  1. R is free, open-source, and cross-platform.

  2. R does not involve lots of pointing and clicking - you don’t have to remember a complicated sequence of clicks to re-run your analysis.

  3. R code is great for reproducibility: someone else (including your future self) can obtain the same results from the same dataset and the same analysis.

  4. R is interdisciplinary and extensible

  5. R is scalable and works on data of all shapes and sizes (though admittedly, it is not the best fit for some scenarios, where other languages such as Python may be preferred).

  6. R produces high-quality and publication-ready graphics

  7. R has a large and welcoming community - which means there is lots of help available!

A Tour of RStudio

RStudio layout

Working Directory

  • Working directory -> where R will look for files (scripts, data, etc).

    • By default, this is typically your user home folder (e.g. Documents on Windows), unless the setting has been changed

    • Best practice is to use an R Project to organize your files and data into projects.

    • When using an R Project, the working directory = the project folder.
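
To check where R is currently looking, here is a minimal sketch using two base R functions. The output depends on your own machine, so none is shown here.

```r
# Print the current working directory
getwd()

# List the files and folders inside it
list.files()
```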

Creating the project for this workshop

  1. Go to File > New project. Choose New directory, then New project

  2. Enter 2024-introductory-r as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows

    • Note: Do not put your project inside a OneDrive folder, as sometimes R will have trouble accessing the folder.

  3. This will be your working directory for the rest of the workshop!

  4. Next, let’s create 3 folders inside our working directory:

    • data - we will save our raw data here. It’s best practice to keep the data here untouched.

    • data-output - if we need to modify raw data, store the modified version here.

    • fig-output - we will save all the graphics we created here!
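
You can create these folders by hand in your file explorer, or from R itself. The sketch below is one possible way to do it, assuming your working directory is already the project folder.

```r
# Assumes the working directory is the project folder created above
folders <- c("data", "data-output", "fig-output")

for (folder in folders) {
  # showWarnings = FALSE keeps this quiet if the folder already exists
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that all three folders now exist
dir.exists(folders)
```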

Let’s Code!

Create a new R script - File > New File > R script.

Note: RStudio does not autosave your progress, so remember to save from time to time!

R Objects and Values

In this line of code:

name <- "Anya Forger"

  • "Anya Forger" is a value. This can be either a character, numeric, or boolean data type. (more on this soon)

  • name is the object where we store this value. This is so that we can keep this value to be used later.

  • <- is the assignment operator to assign the value to the object.

    • You can also use =, but generally in R, <- is the convention.

    • Keyboard shortcut: Alt + - in Windows (Option + - in Mac)
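
A small sketch of why storing a value in an object is useful: we can reuse it later, or replace it. The greeting object and the second name are illustrative only.

```r
name <- "Anya Forger"

# Reuse the stored value by referring to the object name
greeting <- paste("Hello,", name)
print(greeting)

# Assigning to the same object again replaces the old value
name <- "Loid Forger"
print(name)
```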

Refresher: Quantitative Data Types

  • Non-Continuous Data

    • Nominal/Categorical: Non-ordered, non-numerical data used to represent a qualitative attribute.

      • Example: nationality, neighborhood, employment status

    • Ordinal: Ordered non-numerical data.

      • Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)

    • Discrete: Numerical data that can only take specific values (usually integers).

      • Example: Shoe size, clothing size

    • Binary: Nominal data with only two possible outcomes.

      • Example: pass/fail, yes/no, survive/not survive

  • Continuous Data

    • Interval: Numerical data that can take any value within a range. It does not have a “true zero”.

      • Example: Celsius scale. A temperature of 0 C does not represent the absence of heat.

    • Ratio: Numerical data that can take any value within a range. It has a “true zero”.

      • Example: Annual income. An annual income of 0 represents no income.

Data Types in R

chara_type <- "Hello World" # Character
num_type <- 123.45 # Numeric (also sometimes called Double)
bool_type <- TRUE # Boolean/Logical (true/false)
int_type <- 25L # Integer (whole numbers)

You can use str or typeof to check the data type of an R object.

typeof(chara_type)
[1] "character"
typeof(num_type)
[1] "double"
str(bool_type) # will tell you the data type and the value inside
 logi TRUE
str(int_type)
 int 25
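
A few more type-related helpers that may come in handy. This is a sketch using base R only; num_from_text is an illustrative object name.

```r
# A number typed without the L suffix is stored as a double, even if it is whole
typeof(25)
typeof(25L)

# is.*() functions test for a type and return TRUE or FALSE
is.character("Hello World")
is.numeric(25L)  # integers count as numeric too

# as.*() functions convert a value from one type to another
num_from_text <- as.numeric("123.45")
typeof(num_from_text)
```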

Arithmetic operations in R

You can do arithmetic operations in R, like so:

100 / 3
[1] 33.33333
11 ^ 2
[1] 121
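
A few more arithmetic operators worth knowing, shown as a quick sketch with illustrative numbers:

```r
# Parentheses control the order of operations
(2 + 3) * 4   # 20
2 + 3 * 4     # 14

# Integer division and remainder (modulo)
17 %/% 5      # 3
17 %% 5       # 2

# ^ raises a number to a power
2 ^ 10        # 1024
```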

Boolean operations in R

Boolean operations in R (will be handy for later):

# AND operations (both sides need to be TRUE for the result to be TRUE)
TRUE & FALSE 
[1] FALSE
# OR operations (only one side needs to be TRUE for the result to be TRUE)
TRUE | FALSE
[1] TRUE
# NOT operations, which is basically flipping TRUE to FALSE and vice versa
!TRUE 
[1] FALSE
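
In practice, the TRUE and FALSE values usually come from comparisons. Here is a sketch; age and is_working_age are made-up names and values for illustration.

```r
age <- 30  # illustrative value

# Comparisons return TRUE or FALSE
age >= 18
age == 65
age != 65

# Combine comparisons with & (AND) and | (OR)
is_working_age <- age >= 18 & age < 65
print(is_working_age)
```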

Functions in R

A function is a block of reusable code designed to do a specific task. Functions take inputs (a.k.a. arguments or parameters), do their thing, and then return a result. (This result can either be printed out, or saved into an object!)

round(123.456, digits = 2)
[1] 123.46

Saving the result to an object:

rounded_num <- round(123.456, digits = 2)
print(rounded_num)
[1] 123.46

In the example above, round is the function. 123.456 and digits = 2 are the arguments/parameters.
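
A short sketch of how arguments work, plus a small function of our own. to_percent is a hypothetical function written for illustration; it is not part of R.

```r
# Arguments can be matched by position
round(123.456, 2)

# Naming the arguments makes the call easier to read
round(x = 123.456, digits = 2)

# If digits is left out, round() uses its default of 0
round(123.456)

# A hypothetical function of our own: convert a proportion to a percentage
to_percent <- function(proportion, digits = 1) {
  round(proportion * 100, digits = digits)
}

to_percent(0.4567)
```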

How do I find out more about a particular function?

You can open the help page for a function by prepending ? to the function name.

E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel)

Data Structures in R: Vectors

  • So far, each of our objects has held a single value. But quite often you may want to group a bunch of values together and save them in a single object.

  • A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)

t1_courses <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")
str(t1_courses)
 chr [1:5] "IDIS110" "IDIS100" "PLE100" "PSYC111" "PSYC103"
print(t1_courses)
[1] "IDIS110" "IDIS100" "PLE100"  "PSYC111" "PSYC103"

Example of numeric vector:

t1_grades <- c(50, 70, 80, 95, 77)
str(t1_grades)
 num [1:5] 50 70 80 95 77

Vector Manipulations: Retrieve and update items

# retrieve the 1st item in the vector
t1_grades[1]
[1] 50
# retrieves the 1st item up to the 3rd item
t1_grades[1:3]
[1] 50 70 80
# update the value of the 1st item
t1_grades[1] <- 65
print(t1_grades[1])
[1] 65
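
A few more ways to pick items by position, as a sketch. The vector is re-created here with the updated first grade so the block runs on its own.

```r
t1_grades <- c(65, 70, 80, 95, 77)

# retrieve the 1st and 5th items
t1_grades[c(1, 5)]

# a negative index drops that position and returns the rest
t1_grades[-1]

# retrieve the last item, whatever the length of the vector
t1_grades[length(t1_grades)]
```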

Vector Manipulations: Retrieve items based on criteria

Let’s say we want to retrieve items that are larger than 75.

The code below will create a boolean vector called criteria that keeps track of whether each item inside t1_grades fulfils our condition. The condition is “value must be > 75”. E.g. if item 1 fulfils our condition, then item 1 is ‘marked’ as TRUE. Otherwise, it will be FALSE.

criteria <- t1_grades > 75 
print(criteria)
[1] FALSE FALSE  TRUE  TRUE  TRUE

This line of code applies the boolean vector criteria to t1_grades, and only retrieves the items that fulfil the condition, i.e. items whose position is marked as TRUE by the criteria vector.

t1_grades[criteria]
[1] 80 95 77

You can shorten the code like this too:

t1_grades[t1_grades > 75]
[1] 80 95 77
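
The same idea extends to combined conditions, counting, and filtering one vector using another. This is a sketch that re-creates both vectors so it runs on its own; top_courses is an illustrative name.

```r
t1_courses <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")
t1_grades <- c(65, 70, 80, 95, 77)

# combine two conditions: larger than 75 AND smaller than 90
t1_grades[t1_grades > 75 & t1_grades < 90]

# count how many items fulfil the condition (each TRUE counts as 1)
sum(t1_grades > 75)

# use a condition on one vector to retrieve items from another
top_courses <- t1_courses[t1_grades > 75]
print(top_courses)
```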

Vector Manipulations: Handling NA values

  • NA marks a missing value, i.e. the absence of a value (0 is still a value!)

  • Summary functions like mean need you to specify in the arguments how missing values should be handled.

missing_vector <- c(21, 22, 23, 24, 25, NA, 27, 28, NA, 30)

# by default, a single NA in the input makes the result NA
mean(missing_vector)
[1] NA
# indicate that we want the NA values to be removed entirely
mean(missing_vector, na.rm = TRUE)
[1] 25
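
To find out where the missing values are, or how many there are, is.na() is a useful companion. A sketch follows; complete_values is an illustrative name.

```r
missing_vector <- c(21, 22, 23, 24, 25, NA, 27, 28, NA, 30)

# is.na() marks each missing position as TRUE
is.na(missing_vector)

# count the missing values
sum(is.na(missing_vector))

# keep only the non-missing values
complete_values <- missing_vector[!is.na(missing_vector)]
print(complete_values)
```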

Data Structures in R: Factors

  • Special data structure in R to deal with categorical data.

  • Can be ordered (ordinal) or unordered (nominal).

  • May look like a normal vector at first glance, so use str() to check.

Unordered (Nominal):

unordered_factor <- factor(c("SOA", "SOSS", "SCIS", "CIS", "YPHSOL")) 
str(unordered_factor)
 Factor w/ 5 levels "CIS","SCIS","SOA",..: 3 4 2 1 5

Ordered (Ordinal):

ordered_factor <- factor(c("Agree", "Disagree", "Neutral", "Disagree"), 
                         ordered = TRUE, 
                         levels = c("Disagree", "Neutral", "Agree")) 
str(ordered_factor)
 Ord.factor w/ 3 levels "Disagree"<"Neutral"<..: 3 1 2 1
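
A few things you can do with a factor once it is set up, shown as a sketch using the same ordered factor:

```r
ordered_factor <- factor(c("Agree", "Disagree", "Neutral", "Disagree"),
                         ordered = TRUE,
                         levels = c("Disagree", "Neutral", "Agree"))

# list the categories in their defined order
levels(ordered_factor)

# count how many times each category appears
table(ordered_factor)

# ordered factors can be compared against one of their levels
ordered_factor < "Agree"
```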

Data Structures in R: Dataframe

  • De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.

  • Similar to spreadsheets!

  • You can create it by hand like so:

t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(50, 70, 80, 95, 77)
)
print(t1_data)
  course_code grade
1     IDIS110    50
2     IDIS100    70
3      PLE100    80
4     PSYC111    95
5     PSYC103    77

Alternatively, here is how to create one using the two vectors that we created earlier:

t1_data <- data.frame(course_code = t1_courses, grade = t1_grades)
print(t1_data)
  course_code grade
1     IDIS110    65
2     IDIS100    70
3      PLE100    80
4     PSYC111    95
5     PSYC103    77
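
Since each column is a vector, what we learned about vectors carries over. Here is a sketch; the passed column and the pass mark of 70 are made up for illustration.

```r
t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(65, 70, 80, 95, 77)
)

# number of rows and columns
nrow(t1_data)
ncol(t1_data)

# a column pulled out with $ is a vector, so vector functions work on it
mean(t1_data$grade)

# add a new column (the pass mark of 70 is a made-up threshold)
t1_data$passed <- t1_data$grade >= 70
print(t1_data)
```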

Most of the time, our dataframe will be generated by loading an external data file such as a CSV, SAV, or XLSX file. Let’s try loading one from a CSV!

[Interlude] Packages in R

  • Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.

    • (The closest analogy I can think of is that they are the equivalent of browser add-ons, in a way)

  • Popular packages: tidyverse, caret, shiny, etc.

  • Installation (you only need to do this once): install.packages("package_name")

  • Loading packages (you need to run this every time you restart RStudio): library(package_name)

Loading data from CSV

  • Make sure to download and save faculty_policy_eval.csv into your data folder.

  • Check out the data dictionary/explanatory notes to learn more about the data, including the column names, the data type inside each column, etc.

  • We need to use the readr package, which is part of the tidyverse. So please install tidyverse first if you have not done so.

Load the CSV and save the content into a tibble/dataframe called fp_data

library(tidyverse) #load tidyverse package

fp_data <- read_csv("data/faculty_policy_eval.csv")
head(fp_data) # print the first few rows

Other functions you can use to “peek” at the dataframe:

dim(fp_data) # number of rows and columns, returned as a vector
names(fp_data) # inspect column names
str(fp_data) # inspect structure
summary(fp_data) # summary stats of data
head(fp_data, n = 5) # view the first 5 rows
tail(fp_data, n = 5) # view the last 5 rows

Basic dataframe manipulations: Retrieving values

Some basic dataframe functions before we move on to data wrangling next week:

fp_data["rank"] # retrieve column by name (returns as tibble/dataframe)
fp_data$rank # another way to retrieve column by name (returns as vector)
fp_data[3] # get the entire 3rd column by position
fp_data[1, 4] # get the cell at row 1, column 4
fp_data[3, ] # get the entire 3rd row
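
To see the difference between these retrieval styles without needing the CSV file, here is a sketch on a small stand-in dataframe. Note that it is a base data.frame rather than a tibble, so some results may print a little differently from fp_data. by_bracket and by_dollar are illustrative names.

```r
# stand-in dataframe; fp_data itself comes from the CSV file
t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(65, 70, 80, 95, 77)
)

by_bracket <- t1_data["grade"]  # single brackets keep the dataframe structure
by_dollar <- t1_data$grade      # $ returns the column as a plain vector

is.data.frame(by_bracket)
is.data.frame(by_dollar)

# a condition in the row position keeps only the matching rows
t1_data[t1_data$grade > 75, ]
```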

End of Session 1!

Next Session: Data wrangling with dplyr and tidyr packages