Introductory R for Social Sciences - Session 1

Bella Ratmelia

Welcome!

Some preface before we begin:

Instructor(s) introduction
Session outline
Workshop format

Introduction

Bella Ratmelia, Data Services Librarian @ SMU Libraries
Teaching Assistants: Shannon (MQF) and Hector (MITB)
Workshop website: https://bellaratmelia.quarto.pub/intro-r-socsci/

Sessions Outline

Session 1

Introduction to R and RStudio
4 basic data types in R
3 basic data structures in R
Objects and Vectors

Session 2

Introduction to tibble/dataframe
Manipulating dataframe with dplyr and tidyr

Session 3

Visualizing data with ggplot
Introduction to Quarto

Session 4

Inferential statistics in R: t-tests, chi-square, correlations, and ANOVA
Calculating Cronbach’s alpha in R (optional)

Session 5

Simple Linear Regression
Binary Logistic Regression
R Best Practices

Workshop Format

Live coding! Code along with me for the full tactile learning experience.
Occasional in-class exercises
(for IDIS100 only) weekly quiz (4 questions MCQ) after each session.
Don’t be afraid to ask for help!

Let’s begin!

What is R? What is RStudio?

R: The programming language and the software that interprets the R script

RStudio: An IDE (Integrated Development Environment) that we use to interact more easily with R language and scripts.

You will need to install both for this workshop. Go to https://posit.co/download/rstudio-desktop to download and install both if you have not done so.

7 Reasons to learn R

R is free, open-source, and cross-platform.
R does not involve lots of pointing and clicking - you don’t have to remember a complicated sequence of clicks to re-run your analysis.
R code is great for reproducibility - when someone else (including your future self) can obtain the same results from the same dataset and same analysis.
R is interdisciplinary and extensible
R is scalable and works on data of all shapes and sizes (though admittedly, it is not the best fit for every scenario, and other languages such as Python may be preferred in some cases).
R produces high-quality and publication-ready graphics
R has a large and welcoming community - which means there is lots of help available!

A Tour of RStudio

RStudio layout

Working Directory

Working directory -> where R will look for files (scripts, data, etc).
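- By default, this is typically your user home folder (e.g. Documents on Windows), which may not be where your files are. Run getwd() in the console to check.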
- Best practice is to use R Project to organize your files and data into projects.
- When using R Project, the working directory = project folder.
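
To see which folder R currently treats as the working directory, here is a minimal sketch using base R. The file name my_file.csv is a placeholder for illustration only, not a file from this workshop.

```r
# Print the current working directory
wd <- getwd()
print(wd)

# Build a path relative to the working directory
# ("data" and "my_file.csv" are placeholder names for illustration)
example_path <- file.path("data", "my_file.csv")
print(example_path)
```

Paths written this way are resolved relative to the working directory, which is why keeping everything inside one project folder makes scripts easier to share.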

Creating the project for this workshop

Go to File > New project. Choose New directory, then New project
Enter 2024-introductory-r as the name for this new folder (or “directory”) and choose where you want to put this folder, e.g. Desktop or Documents if you are on Windows
- Note: Do not put your project inside a OneDrive folder, as R sometimes has trouble accessing it.
This will be your working directory for the rest of the workshop!

Next, let’s create 3 folders inside our working directory:
- data - we will save our raw data here. It’s best practice to keep the data here untouched.
- data-output - if we need to modify raw data, store the modified version here.
- fig-output - we will save all the graphics we created here!
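
If you prefer to create these folders from the console rather than by clicking, one possible sketch with base R is shown below. It assumes your working directory is already the project folder.

```r
# Folder names used in this workshop
folders <- c("data", "data-output", "fig-output")

# Create each folder inside the current working directory if it is missing
for (folder in folders) {
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Confirm that all three folders now exist (one TRUE per folder)
dir.exists(folders)
```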

Let’s Code!

Create a new R script - File > New File > R script.

Note: RStudio does not autosave your progress, so remember to save from time to time!

R Objects and Values

In this line of code:

name <- "Anya Forger"

"Anya Forger" is a value. This can be either a character, numeric, or boolean data type. (more on this soon)
name is the object where we store this value. This is so that we can keep this value to be used later.
<- is the assignment operator to assign the value to the object.
- You can also use =, but generally in R, <- is the convention.
- Keyboard shortcut: Alt + - in Windows (Option + - in Mac)
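
As a small illustration of why storing a value in an object is useful, the sketch below reuses name in a later line. The object greeting and the second name are made up for this example.

```r
name <- "Anya Forger"

# Reuse the stored value to build a new value
greeting <- paste("Hello,", name)
print(greeting)

# Assigning again replaces the old value stored in the object
name <- "Loid Forger"
print(name)

# greeting was built earlier, so it still contains the old name
print(greeting)
```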

Refresher: Quantitative Data Types

Non-Continuous Data
- Nominal/Categorical: Non-ordered, non-numerical data, used to represent qualitative attributes.
  - Example: nationality, neighborhood, employment status
- Ordinal: Ordered non-numerical data.
  - Example: Nutri-grade ratings, frequency of exercise (daily, weekly, bi-weekly)
- Discrete: Numerical data that can only take specific values (usually integers)
  - Example: Shoe size, clothing size
- Binary: Nominal data with only two possible outcomes
  - Example: pass/fail, yes/no, survive/not survive

Continuous Data
- Interval: Numerical data that can take any value within a range. It does not have a “true zero”.
  - Example: Celsius scale. Temperature of 0 C does not represent absence of heat.
- Ratio: Numerical data that can take any value within a range. It has a “true zero”.
  - Example: Annual income. An annual income of 0 represents no income.

Data Types in R

chara_type <- "Hello World" # Character
num_type <- 123.45 # Numeric (also sometimes called Double)
bool_type <- TRUE # Boolean/Logical (true/false)
int_type <- 25L # Integer (whole numbers)

You can use str or typeof to check the data type of an R object.

typeof(chara_type)

[1] "character"

typeof(num_type)

[1] "double"

str(bool_type) # will tell you the data type and the value inside

 logi TRUE

str(int_type)

 int 25
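
Besides str and typeof, base R also has is.* functions that answer a yes/no question about a value's type. The sketch below is a small illustration; the object names are made up for this example.

```r
# Each is.* function returns TRUE or FALSE
is_text <- is.character("Hello World")   # TRUE
is_number <- is.numeric(123.45)          # TRUE
int_is_number <- is.numeric(25L)         # TRUE: integers also count as numeric

# Without the L suffix, a whole number is stored as a double, not an integer
plain_is_int <- is.integer(25)           # FALSE
plain_type <- typeof(25)                 # "double"

print(c(is_text, is_number, int_is_number, plain_is_int))
print(plain_type)
```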

Arithmetic operations in R

You can do arithmetic operations in R, like so:

100 / 3

[1] 33.33333

11 ^ 2 # ^ is the usual exponent operator in R (** also works)

[1] 121
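
A few more operators that tend to come up, shown as a quick sketch. The numbers are arbitrary and the object names are made up for this example.

```r
# Multiplication happens before addition, unless brackets say otherwise
no_brackets <- 2 + 3 * 4       # 14
with_brackets <- (2 + 3) * 4   # 20

# Integer division and remainder (modulo)
whole_part <- 17 %/% 5         # 3
remainder <- 17 %% 5           # 2

print(c(no_brackets, with_brackets, whole_part, remainder))
```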

Boolean operations in R

Boolean operations in R (will be handy for later):

# AND operations (all sides need to be TRUE for the result to be TRUE)
TRUE & FALSE

[1] FALSE

# OR operations (only one side needs to be TRUE for the result to be TRUE)
TRUE | FALSE

[1] TRUE

# NOT operations, which is basically flipping TRUE to FALSE and vice versa
!TRUE

[1] FALSE
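
In practice, the TRUE and FALSE values usually come from comparisons rather than being typed by hand. The sketch below combines comparisons with the operators above; the object grade and its value are made up for illustration.

```r
grade <- 80

# Comparisons return TRUE or FALSE
above_75 <- grade > 75                    # TRUE
is_perfect <- grade == 100                # FALSE (note the double equals sign)

# Combine comparisons with AND, OR, and NOT
between <- grade > 75 & grade < 90        # TRUE
extreme <- grade < 50 | grade > 90        # FALSE
not_above <- !(grade > 75)                # FALSE

print(c(above_75, is_perfect, between, extreme, not_above))
```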

Functions in R

A function is a block of reusable code designed to do a specific task. Functions take inputs (a.k.a. arguments or parameters), do their thing, and then return a result. (This result can either be printed out, or saved into an object!)

round(123.456, digits = 2)

[1] 123.46

Saving the result to an object:

rounded_num <- round(123.456, digits = 2)
print(rounded_num)

[1] 123.46

In the example above, round is the function. 123.456 and digits = 2 are the arguments/parameters.
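
You can also write your own functions. The sketch below is only an illustration of the "inputs in, result out" idea; to_percent and its argument names are hypothetical and not part of R or this workshop's materials.

```r
# A hypothetical function that converts a raw score into a percentage
# score:     the raw score obtained
# max_score: the highest possible score (defaults to 100)
to_percent <- function(score, max_score = 100) {
  round(score / max_score * 100, digits = 1)
}

# Call it like any built-in function and save the result into an object
quiz_percent <- to_percent(45, max_score = 60)
print(quiz_percent)
```

The last expression evaluated inside the curly brackets is what the function returns.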

How do I find out more about a particular function?

You can open the help page for a function by prepending ? to the function name.

E.g. if you want to find out more about the round function, you can run ?round in your R console (bottom left panel).
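
A couple of related base R helpers, shown as a short sketch. The object name round_args is made up for this example.

```r
# Same as ?round, written as a regular function call
help("round")

# Show just the arguments that a function accepts
round_args <- args(round)
print(round_args)

# Search the help pages by keyword when you do not know the function name
# (commented out because it can take a while; ??rounding does the same)
# help.search("rounding")
```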

Data Structures in R: Vectors

The objects we have created so far each contain a single value. But quite often you may want to group a bunch of values of the same type together and save them in a single object.
A vector is a data structure that can do this. It is the most common and basic data structure in R. (pretty much the workhorse of R!)

t1_courses <- c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103")
str(t1_courses)

 chr [1:5] "IDIS110" "IDIS100" "PLE100" "PSYC111" "PSYC103"

print(t1_courses)

[1] "IDIS110" "IDIS100" "PLE100"  "PSYC111" "PSYC103"

Example of numeric vector:

t1_grades <- c(50, 70, 80, 95, 77)
str(t1_grades)

 num [1:5] 50 70 80 95 77
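
One reason vectors are so handy is that most operations apply to every item at once. A minimal sketch, using the same grades as above; bonus_grades and the other new object names are made up for illustration.

```r
t1_grades <- c(50, 70, 80, 95, 77)

# How many items are in the vector?
n_grades <- length(t1_grades)     # 5

# Arithmetic is applied to each item in turn
bonus_grades <- t1_grades + 5     # 55 75 85 100 82
print(bonus_grades)

# Summary functions collapse the vector into a single value
avg_grade <- mean(t1_grades)      # 74.4
top_grade <- max(t1_grades)       # 95
print(c(n_grades, avg_grade, top_grade))
```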

Vector Manipulations: Retrieve and update items

# retrieve the 1st item in the vector
t1_grades[1]

[1] 50

# retrieves the 1st item up to the 3rd item
t1_grades[1:3]

[1] 50 70 80

# update the value of the 1st item
t1_grades[1] <- 65
print(t1_grades[1])

[1] 65
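
A few other indexing patterns, sketched under the assumption that t1_grades holds the updated values from above. The new object names are made up for this example.

```r
t1_grades <- c(65, 70, 80, 95, 77)

# Retrieve several items that are not next to each other
first_and_last <- t1_grades[c(1, 5)]          # 65 77

# A negative index means "everything except this position"
without_first <- t1_grades[-1]                # 70 80 95 77

# Retrieve the last item without hard-coding its position
last_item <- t1_grades[length(t1_grades)]     # 77

print(first_and_last)
print(without_first)
print(last_item)
```

Note that R starts counting positions from 1, not 0.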

Vector Manipulations: Retrieve items based on criteria

Let’s say we want to retrieve items that are larger than 75.

The code below will create a boolean vector called criteria that keeps track of whether each item inside t1_grades fulfils our condition. The condition is “value must be > 75”. E.g. if item 1 fulfils our condition, then item 1 is ‘marked’ as TRUE. Otherwise, it will be FALSE.

criteria <- t1_grades > 75 
print(criteria)

[1] FALSE FALSE  TRUE  TRUE  TRUE

This line of code applies the boolean vector criteria to t1_grades and retrieves only the items that fulfil the condition, i.e. items whose position is marked as TRUE by the criteria vector.

t1_grades[criteria]

[1] 80 95 77

You can shorten the code like this too:

t1_grades[t1_grades > 75]

[1] 80 95 77
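
The same idea extends to combined conditions, and to counting or locating matches. A brief sketch, assuming the same t1_grades values as above; the new object names are made up for this example.

```r
t1_grades <- c(65, 70, 80, 95, 77)

# Combine two conditions with & (AND): larger than 75 and smaller than 90
mid_range <- t1_grades[t1_grades > 75 & t1_grades < 90]   # 80 77
print(mid_range)

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_above_75 <- sum(t1_grades > 75)                         # 3
print(n_above_75)

# which() returns the positions of the matching items
positions <- which(t1_grades > 75)                        # 3 4 5
print(positions)
```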

Vector Manipulations: Handling NA values

NA values indicate missing values, i.e. the absence of a value (0 is still a value!)
Summary functions like mean return NA if any input is NA, unless you specify in the arguments how the NA values should be handled.

missing_vector <- c(21, 22, 23, 24, 25, NA, 27, 28, NA, 30)

# by default, any NA in the input makes the result NA
mean(missing_vector)

[1] NA

# indicate that we want the NA values to be removed entirely
mean(missing_vector, na.rm = TRUE)

[1] 25
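
To find out where the NA values are, or how many there are, base R provides is.na. A minimal sketch using the same vector; the new object names are made up for this example.

```r
missing_vector <- c(21, 22, 23, 24, 25, NA, 27, 28, NA, 30)

# TRUE wherever a value is missing
na_flags <- is.na(missing_vector)
print(na_flags)

# Count the missing values
n_missing <- sum(na_flags)                             # 2
print(n_missing)

# Keep only the items that are NOT missing
complete_values <- missing_vector[!is.na(missing_vector)]
print(complete_values)
```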

Data Structures in R: Factors

Special data structure in R to deal with categorical data.
Can be ordered (ordinal) or unordered (nominal).
May look like a normal vector at first glance, so use str() to check.

Unordered (Nominal):

unordered_factor <- factor(c("SOA", "SOSS", "SCIS", "CIS", "YPHSOL")) 
str(unordered_factor)

 Factor w/ 5 levels "CIS","SCIS","SOA",..: 3 4 2 1 5

Ordered (Ordinal):

ordered_factor <- factor(c("Agree", "Disagree", "Neutral", "Disagree"), 
                         ordered = TRUE, 
                         levels = c("Disagree", "Neutral", "Agree")) 
str(ordered_factor)

 Ord.factor w/ 3 levels "Disagree"<"Neutral"<..: 3 1 2 1
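
A short sketch of what you can do with a factor once it is created, using the same ordered factor as above. The object names level_names, counts and below_agree are made up for this example.

```r
ordered_factor <- factor(c("Agree", "Disagree", "Neutral", "Disagree"),
                         ordered = TRUE,
                         levels = c("Disagree", "Neutral", "Agree"))

# List the categories in their defined order
level_names <- levels(ordered_factor)
print(level_names)

# Count how many times each category appears
counts <- table(ordered_factor)
print(counts)

# Because the factor is ordered, comparisons are meaningful
below_agree <- ordered_factor < "Agree"
print(below_agree)
```

The comparison in the last step only works for ordered factors; on an unordered factor R returns NA with a warning.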

Data Structures in R: Dataframe

De facto data structure for tabular data in R, and what we use for data processing, plotting, and statistics.
Similar to spreadsheets!
You can create it by hand like so:

t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(50, 70, 80, 95, 77)
)
print(t1_data)

  course_code grade
1     IDIS110    50
2     IDIS100    70
3      PLE100    80
4     PSYC111    95
5     PSYC103    77

Alternatively, here is how to create one using the two vectors that we created earlier:

t1_data <- data.frame(course_code = t1_courses, grade = t1_grades)
print(t1_data)

  course_code grade
1     IDIS110    65
2     IDIS100    70
3      PLE100    80
4     PSYC111    95
5     PSYC103    77
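
Each column of a dataframe is a vector, so everything covered in the vector sections still applies. A minimal sketch, assuming the t1_data values shown above; the passed column and its cut-off of 70 are made up for illustration.

```r
t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(65, 70, 80, 95, 77)
)

# Number of rows and columns
n_rows <- nrow(t1_data)   # 5
n_cols <- ncol(t1_data)   # 2

# Pull out one column as a vector and summarise it
avg_grade <- mean(t1_data$grade)   # 77.4
print(c(n_rows, n_cols, avg_grade))

# Add a new column computed from an existing one
# (70 is an arbitrary cut-off chosen for this example)
t1_data$passed <- t1_data$grade >= 70
print(t1_data)
```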

Most of the time, our dataframe will be generated by loading an external data file such as a CSV, SAV, or XLSX file. Let’s try loading one from a CSV!

[Interlude] Packages in R

Packages are collections of R functions, datasets, etc. Packages extend the functionality of R.
- (Closest analogy I can think of is that they’re equivalent of browser add-ons, in a way)
Popular packages: tidyverse, caret, shiny, etc.
Installation (you only need to do this once): install.packages("package name")
Loading packages (you need to run this every time you restart RStudio): library(package name)
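
Put together, the install-once, load-every-session pattern might look like the sketch below, using tidyverse as the example package. The object name is_installed is made up for this example.

```r
# Run once per computer (commented out so it is not re-run by accident)
# install.packages("tidyverse")

# Check whether the package is already installed; returns TRUE or FALSE
is_installed <- requireNamespace("tidyverse", quietly = TRUE)
print(is_installed)

# Load the package for the current session, but only if it is available
if (is_installed) {
  library(tidyverse)
}
```

Note that install.packages needs the package name in quotes, while library accepts it without quotes.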

Loading data from CSV

Make sure to download and save faculty_policy_eval.csv into your data folder.
Check out the data dictionary/explanatory notes to learn more about the data, including the column names, the data type inside each column, etc.
We need to use the readr package, which is part of the tidyverse collection of packages. So please install tidyverse first if you have not done so.

Load the CSV and save the content into a tibble/dataframe called fp_data

library(tidyverse) # load tidyverse package

fp_data <- read_csv("data/faculty_policy_eval.csv")
head(fp_data) # print the first few rows
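
If read_csv complains that the file does not exist, the usual cause is that the file is not where R is looking. One possible way to check before loading is sketched below; it assumes the file was saved under the data folder as described above, and csv_path is a made-up object name.

```r
# Build the path relative to the working directory
csv_path <- file.path("data", "faculty_policy_eval.csv")

if (file.exists(csv_path)) {
  # readr:: makes it explicit which package read_csv comes from
  fp_data <- readr::read_csv(csv_path)
} else {
  # Report where R looked, to help track down the problem
  message("File not found at: ", csv_path)
  message("Current working directory: ", getwd())
}
```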

Other functions you can use to “peek” at the data frame:

dim(fp_data) # return a vector of number of rows and columns
names(fp_data) # inspect columns
str(fp_data) # inspect structure
summary(fp_data) # summary stats of data 
head(fp_data, n = 5) # view the first 5 rows
tail(fp_data, n = 5) # view the last 5 rows

Basic dataframe manipulations: Retrieving values

Some basic dataframe functions before we move on to data wrangling next week:

fp_data["rank"] # retrieve column by name (returns as tibble/dataframe)
fp_data$rank # another way to retrieve column by name (returns as vector)
fp_data[3] # get an entire column by index
fp_data[1, 4] # get a cell at this row, column coord 
fp_data[3, ] # get an entire row
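
To try the same patterns without the CSV, here is a self-contained sketch on the small t1_data dataframe from earlier. It uses a base R data.frame; a tibble such as fp_data behaves slightly differently in that the single-cell lookup returns a 1 x 1 tibble rather than a bare value.

```r
t1_data <- data.frame(
  course_code = c("IDIS110", "IDIS100", "PLE100", "PSYC111", "PSYC103"),
  grade = c(65, 70, 80, 95, 77)
)

col_as_df <- t1_data["grade"]    # column by name, returned as a dataframe
col_as_vec <- t1_data$grade      # column by name, returned as a vector
second_col <- t1_data[2]         # column by position, returned as a dataframe
one_cell <- t1_data[1, 2]        # row 1, column 2
third_row <- t1_data[3, ]        # entire 3rd row, returned as a dataframe

print(one_cell)
print(third_row)
```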

End of Session 1!

Next Session: Data wrangling with dplyr and tidyr packages