Chapter 1 Introduction to R programming

This notebook collects the scripts used for teaching in BIOF1001 for Introduction to R (1 hour teaching). You can get this Rmd file on Moodle or here (right-click and “save link as” to download).

R is a programming language, particularly popular for its power in statistical computing, elegant graphics, and also genomic data analysis. It is a free and open-source software, with active support and development from the community. Additionally, R is relatively easy to get started for scientific computing.

Note To learn and practice R programming, you need to install R and RStudio on your computer. You can follow the instructions for installation in the Appendix A chapter.

1.1 Data types

In R language setting (similar to some other programming languages), there are a few commonly used data types that are predefined in the built-in environment. You can use the class() and typeof() functions to check the class and data type of any variable. In total, R has five data types:

Numeric
Integers
Complex
Logical
Characters

1.1.1 nemeric (or double)

The numeric is for numeric values, as the most common and the default data type. The numeric datatype saves values in double precision (double number of bytes in memory), so the type is also double.

x <- c(1.0, 2.0, 5.0, 7.0)
x
#> [1] 1 2 5 7
class(x)
#> [1] "numeric"
typeof(x)
#> [1] "double"

1.1.2 integer

The integer is another data type used for the set of all integers. You can use the capital ‘L’ notation as a suffix to specify a particular value as the integer data type. Also, you can convert a value into an integer type using the as.integer() function.

y <- c(1L, 2L, 5L, 7L)
y
#> [1] 1 2 5 7
class(y)
#> [1] "integer"
typeof(y)
#> [1] "integer"

# Assign a integer value to y
y = 5
 
# is y an integer?
print(is.integer(y))
#> [1] FALSE

1.1.3 logical

In R, the logical data type takes either a value of true or false. A logical value is often generated when comparing variables.

z <- c(TRUE, TRUE, TRUE, FALSE)
z
#> [1]  TRUE  TRUE  TRUE FALSE
typeof(z)
#> [1] "logical"

1.1.4 character

In R, the character is a data type where you have all the alphabets and special characters. It stores character values or strings. Strings in R can contain alphabets, numbers, and symbols. The character type is usually denoted by wrapping the value inside single or double inverted commas.

w <- c("aa", "bb", "5", "7")
w
#> [1] "aa" "bb" "5"  "7"
typeof(w)
#> [1] "character"

1.1.5 Memeory usage

Each data type is explicitly defined, especially the size of the memory.

When initializing a certain data type, there are a few small bytes used to store basic information. Let’s look at the empty values.

object.size(numeric())
#> 48 bytes
object.size(integer())
#> 48 bytes
object.size(logical())
#> 48 bytes
object.size(character())
#> 48 bytes

To illustrate, let’s use 1000 elements below to show the memory size usage for each data type. As we will see, integer and logical use 4 bytes per element, which is only half of the memory usage by double (numeric) and character of 8 bytes.

object.size(rep(1, 1000))
#> 8048 bytes
object.size(rep(1L, 1000))
#> 4048 bytes
object.size(rep(TRUE, 1000))
#> 4048 bytes
object.size(rep("aa", 1000))
#> 8104 bytes

1.2 Data structures

Data structure is one of the most important features in programming. It involves how the data is organised and can be accessed and modified. Using an appropriate data structure may largely improve computing efficiency.

1.2.1 Vector

Vector is a basic data structure in R. It contains elements in the same data type (no matter double, integer, character or others). You can check the data type by using typeof() function and the length of the vector by length() function.

Since a vector has elements of the same type, this function will try and coerce elements to the same type, if they are different. Coercion is from lower to higher types, i.e., from logical to integer to double to a character.

See more introduction here.

x <- c(1, 2, 5, 7)
x
#> [1] 1 2 5 7
typeof(x)
#> [1] "double"

x <- rep(3, 5)
x
#> [1] 3 3 3 3 3
typeof(x)
#> [1] "double"

x <- 1:12 # integer
x
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
typeof(x)
#> [1] "integer"

1.2.2 Matrix

Matrix is a two-dimensional data structure. It is in principle built based on vector but has more convenient built-in functions for computation. It has rows and columns, both of which can also have names. To check the dimensions, you can use the dim() function.

See more introduction here.

A <- matrix(1:12, nrow=3)
A
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    4    7   10
#> [2,]    2    5    8   11
#> [3,]    3    6    9   12

B <- matrix(1:12, nrow=3, byrow=TRUE)
B
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12

colnames(A) <- c("C1","C2","C3","C4")
rownames(A) <- c("R1","R2","R3")
A
#>    C1 C2 C3 C4
#> R1  1  4  7 10
#> R2  2  5  8 11
#> R3  3  6  9 12

1.2.2.1 Index Vector and Matrix

To index vector, you can use logical or integer, or the element name if it has.

Note, when using an integer for indexing, the index starts from 1 in R, unlike most programming languages where the index starts from 0.

We can also use negative integers to return all elements except those specified. But we cannot mix positive and negative integers while indexing and real numbers if used are truncated to integers.

N.B., when using logical as the index, be careful if the index length is different from the vector.

x <- 1:12

x[3]
#> [1] 3

x[2:5]
#> [1] 2 3 4 5

x[c(2, 5, 6)]                   # index with integer
#> [1] 2 5 6

x[c(TRUE, FALSE, FALSE, TRUE)]  # index with logical value
#> [1]  1  4  5  8  9 12

Now, let’s index a matrix. It is very similar to vector, but it has both rows and columns.

A <- matrix(1:12, nrow=3)
colnames(A) <- c("C1","C2","C3","C4")
rownames(A) <- c("R1","R2","R3")
A
#>    C1 C2 C3 C4
#> R1  1  4  7 10
#> R2  2  5  8 11
#> R3  3  6  9 12

A[1, 2]
#> [1] 4

A[1, "C2"]
#> [1] 4

A[1, c(2, 3)]
#> C2 C3 
#>  4  7

A[1:2, c(2, 3)]
#>    C2 C3
#> R1  4  7
#> R2  5  8

Single row or column matrix will become a vector, unless using drop=FALSE

A[1, 2:4]
#> C2 C3 C4 
#>  4  7 10
dim(A[1, 2:4])
#> NULL

A[1, 2:4, drop=FALSE]
#>    C2 C3 C4
#> R1  4  7 10
dim(A[1, 2:4, drop=FALSE])
#> [1] 1 3

1.2.2.2 Modify values

A[1, 2:4] <- c(-3, -5, 20)
A
#>    C1 C2 C3 C4
#> R1  1 -3 -5 20
#> R2  2  5  8 11
#> R3  3  6  9 12

1.2.3 List

Different from vector that has all elements in the same data type, the list data structure can have components of mixed data types. More broadly, a list can contain a list of any data structure: value, vector, matrix, etc.

We can use str() function to view the structure of a list (or any object).

x <- list(2.5, TRUE, 1:3)
x
#> [[1]]
#> [1] 2.5
#> 
#> [[2]]
#> [1] TRUE
#> 
#> [[3]]
#> [1] 1 2 3

str(x)
#> List of 3
#>  $ : num 2.5
#>  $ : logi TRUE
#>  $ : int [1:3] 1 2 3

We can also have a name for each element:

x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
x
#> $a
#> [1] 2.5
#> 
#> $b
#> [1] TRUE
#> 
#> $c
#> [1] 1 2 3
str(x)
#> List of 3
#>  $ a: num 2.5
#>  $ b: logi TRUE
#>  $ c: int [1:3] 1 2 3

1.2.3.1 Indexing list

Different from vector and matrix, for a list, you need to use double-layer square brackets, either by numeric index or name. Alternatively, you can also use $ symbol with the name.

x[[3]]
#> [1] 1 2 3

x[["c"]]
#> [1] 1 2 3

x$c
#> [1] 1 2 3

1.2.4 Data Frame

Data frame is widely used for rectangular data, where each column has the same data type (vector) but different columns can have different data types (like Excel)

As you guess, the data frame is a special type of list: A list of vectors with the same length.

df <- data.frame("SN" = 1:2, "Age" = c(21,15), 
                 "Name" = c("John","Dora"))
df
#>   SN Age Name
#> 1  1  21 John
#> 2  2  15 Dora

df$Age[2]
#> [1] 15

# View(df)

1.2.5 Factor vs vector

For vector, if it only contains pre-defined values (which can have specified orders), you may consider using factor.

Factor is a data structure used for fields that takes only predefined, finite number of values (categorical data)
The order of predefined values can be specified, instead of alphabetic by default

x = c("single", "married", "married", "single") # vector
x
#> [1] "single"  "married" "married" "single"

class(x)
#> [1] "character"
typeof(x)
#> [1] "character"

y = factor(c("single", "married", "married", "single"))
y
#> [1] single  married married single 
#> Levels: married single

class(y)
#> [1] "factor"
typeof(y)
#> [1] "integer"

1.2.5.1 Change the order of factor levels

z <- factor(c("single", "married", "married", "single") , levels=c("single", "married"))
z
#> [1] single  married married single 
#> Levels: single married

1.2.5.2 Smaller memory as categorical data type

x = c("single", "married", "married", "single") # vector
y = factor(c("single", "married", "married", "single"))

object.size(rep(x, 1000))
#> 32160 bytes
object.size(rep(y, 1000))
#> 16560 bytes

1.3 Read and write files (tables)

Besides operating in the R environment, we also want to read data from files or write results into files.

For tables, R has convenient built-in function read.table() and write.table().

1.3.1 Read file

We can use read.table() function to read tables, e.g., in comma separated values (csv) or tab separated values (tsv) formats. See full manuals: help(read.table) or ?read.table or its online manual

help("read.table")

?read.table

Here, let’s read an example file. Data is available on Moodle and on the github repository

df = read.table("./SRP029880.colData.tsv", sep="\t")
df
#>                source_name group
#> CASE_1 metastasized cancer  CASE
#> CASE_2 metastasized cancer  CASE
#> CASE_3 metastasized cancer  CASE
#> CASE_4 metastasized cancer  CASE
#> CASE_5 metastasized cancer  CASE
#> CTRL_1        normal colon  CTRL
#> CTRL_2        normal colon  CTRL
#> CTRL_3        normal colon  CTRL
#> CTRL_4        normal colon  CTRL
#> CTRL_5        normal colon  CTRL

1.3.2 Write file

The above file is loaded as a data frame. Here, let’s add an extra column to indicate if the sample is frozen or not, then save it into a file.

df$frozen <- c(1, 1, 0, 0, 0, 1, 1, 0, 0, 0)
write.table(df, "./SRP029880.colData.add_frozen.tsv", sep="\t", quote=FALSE)

1.4 Functions and Packages

As experienced above, we have used function multiple times, e.g., read.table and typeof.

As one more example mean() is a function here and it is from the base package

x <- 4:10
mean(x)
#> [1] 7

base::mean(x)
#> [1] 7

Generally speaking, A function is a set of statements organized together to perform a specific task. Many lines of codes are packed into one function & it’s reusable.

A function can be written in the same R file and loaded, or it can be distributed as part of a package. For using such functions, we need to install the corresponding package and load it.

1.4.1 Install packages

It depends on where the package is stored. Please refers to the documentation of the specific package you want to install and use.

CRAN (the Comprehensive R Archive Network): main platform
- For example: install.packages("ggplot2")
Bioconductor: primarily for biology related packages
- For example: BiocManager::install("DESeq2")

As an example, we can install the powerful plotting package ggplot2 from CRAN.

#install.packages("ggplot2")

1.4.2 Apply function repeatly

We may often want to use a certain function for multiple times, e.g., calculate the sample mean for many genes. There are multiple ways to achieve it, e.g., via a for loop. Here, we will introduce apply and its variants for this purpose.

See more introductions here.

1.4.2.1 apply, lapply, sapply and vapply

In short, the apply() function and its variants apply a certain function to each element of a vector, a matrix, or a list.

1.4.2.1.1 apply for matrix

The apply(X, MARGIN, FUN) function works for matrix (or array) for rows or columns. For example, calculating the median of each column:

my.matrx <- matrix(1:15, nrow = 5, ncol = 3)
my.matrx
#>      [,1] [,2] [,3]
#> [1,]    1    6   11
#> [2,]    2    7   12
#> [3,]    3    8   13
#> [4,]    4    9   14
#> [5,]    5   10   15

apply(my.matrx, 2, median)
#> [1]  3  8 13

1.4.2.1.2 lapply, sapply, and vapply for list or vector

The above apply function requires MARGIN, hence won’t work for vector or list. There are a few variants to support lists or vectors for different purposes.

From the manual, we can find out the arguments for these three functions:

lapply(X, FUN, …)
sapply(X, FUN, …, simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, …, USE.NAMES = TRUE)

Let’s look at examples. The lapply works for vector and list and returns a list:

A<-c(1:9)
B<-c(1:12)
C<-c(1:15)
my.lst<-list(A,B,C)

lapply(my.lst, median)
#> [[1]]
#> [1] 5
#> 
#> [[2]]
#> [1] 6.5
#> 
#> [[3]]
#> [1] 8

If you want the return as a vector, you can use sapply() for simplified output:

sapply(my.lst, median)
#> [1] 5.0 6.5 8.0

If you want to check the data type of the return, you can further use the vapply function. An error will be raised if the data type does not match.

Note, the FUN.VALUE argument takes the value as a template, so the use of numeric(1) or any numeric value, e.g., 3 is the same.

vapply(my.lst, median, numeric(1))
#> [1] 5.0 6.5 8.0

vapply(my.lst, median, 3)
#> [1] 5.0 6.5 8.0

# try this
# vapply(my.lst, median, character(1))

1.4.3 Pattern match

Matching patterns between two vectors is a very common task, for example matching the gene names of two files or matching students IDs in two courses.

Two commonly used functions for pattern match are match() and %in%, see the match() documentation:

match returns a vector of the positions of (first) matches of its first argument in its second.
%in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.
They work for any data type: numeric and character

1:6 %in% c(1,3,5,9)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

match(c(3, 4, 5), c(5, 3, 1, 9))
#> [1]  2 NA  1

idx = match(c(3, 4, 5), c(5, 3, 1, 9))
c(5, 3, 1, 9)[idx]
#> [1]  3 NA  5

1.5 Flow Control

1.5.1 Logical operator

The common logical operators are shown in the following table.

Operator	Description	Associativity
!	Logical NOT	Left to right
&	Element-wise logical AND	Left to right
&&	Logical AND	Left to right
\|	Element-wise logical OR	Left to right
\|\|	Logical OR	Left to right

A few notes:

Zero is considered FALSE and non-zero numbers are taken as TRUE.
Operators & and | perform element-wise operation, returning length of the longer operand.
For operators && and ||, it depends on the R version. In R<=4.2.3, it examines only the first element of the operands, returning a single value, while in R=4.3.1, && and || only compare single values and will return errors if input vectors.

x <- c(TRUE, FALSE, 0, 6)
y <- c(FALSE, TRUE, FALSE, TRUE)

!x
#> [1] FALSE  TRUE  TRUE FALSE

x & y
#> [1] FALSE FALSE FALSE  TRUE

x | y
#> [1]  TRUE  TRUE FALSE  TRUE

# try it yourself
# x && y
# x || y

1.5.2 if-else statements

The if-else statement allows us to control when and how particular parts of our code are executed.

# In Q10 in Section 1.8 we have x_mean and x_sd

x_mean = 1.7
x_sd = 2.9

if ( (x_mean > 0.1) && (x_sd < 2.0) ) {
  trend = "increase"
} else{
  trend = "not_increase"
}
print(trend)
#> [1] "not_increase"

x_mean = 1.7
x_st = 2.9

trend = "not_increase"
if ( (x_mean > 0.1) && (x_sd < 2.0) ) {
  trend = "increase"
}
print(trend)
#> [1] "not_increase"

1.5.3 for-loop

Loops, including for loop and while loop are used in programming to repeat a specific block of code, commonly with if-else statements.

For example, calculate the sum from 1 to 10:

sum_val = 0
for (val in seq(1, 10)) {
  # print(val) 
  sum_val = sum_val + val
}
print(sum_val)
#> [1] 55

Or detect the differential expressed genes. Note, NA value may exist in the padj column:

is_DE = rep(FALSE, nrow(df_DEG))

for (i in 1:nrow(df_DEG)) { 
  if (df_DEG$padj[i] <= 0.05) {
    is_DE[i] = TRUE 
  }
}

print(sum(is_DE))

1.6 Plotting

1.6.1 datasets

Let’s use a built-in dataset for illustration: iris (4 flower features in 3 plants)

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

summary(iris)
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#>  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#>  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#>  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#>  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#>  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#>  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
#>        Species  
#>  setosa    :50  
#>  versicolor:50  
#>  virginica :50  
#>                 
#>                 
#>

There are two common ways of plotting:

the built-in plotting functions
the ggplot format

1.6.2 Basic plotting

1.6.2.1 Histogram

hist(iris$Sepal.Length)

1.6.2.2 Scatter plot

plot(x=iris$Sepal.Length, y=iris$Sepal.Width)

1.6.2.3 boxplot

x1 <- iris$Sepal.Length[iris$Species == "setosa"]
x2 <- iris$Sepal.Length[iris$Species == "versicolor"]
x3 <- iris$Sepal.Length[iris$Species == "virginica"]

boxplot(x1, x2, x3)

1.6.3 ggplot2

See more instructions: http://www.sthda.com/english/wiki/ggplot2-essentials

1.6.3.1 Install and load package

if (!requireNamespace("ggplot2", quietly = TRUE))
    install.packages("ggplot2")

library(ggplot2)

1.6.3.2 Histogram

ggplot(iris, aes(x=Sepal.Length)) + 
  geom_histogram(bins = 8)

1.6.3.3 Scatter plot

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) + 
    geom_point()

1.6.3.4 Box plot

ggplot(iris, aes(x=Species, y=Sepal.Length)) + 
    geom_boxplot()

1.7 Scientific computating

1.7.1 Orders of operators

See lecture slides.

If you are not sure about a certain ordering, use brackets!

5 * 2 > 4
#> [1] TRUE

5 * (2 > 4)
#> [1] 0

1.7.2 Functions for statistics

More theory and practice to come in next session

1.7.3 Correlation

cor(iris$Sepal.Length, iris$Petal.Length)
#> [1] 0.8717538

cor.test(iris$Sepal.Length, iris$Petal.Length)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  iris$Sepal.Length and iris$Petal.Length
#> t = 21.646, df = 148, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.8270363 0.9055080
#> sample estimates:
#>       cor 
#> 0.8717538

1.7.4 Hypothesis testing (t test)

x1 <- iris$Sepal.Length[iris$Species == "setosa"]
x2 <- iris$Sepal.Length[iris$Species == "versicolor"]
x3 <- iris$Sepal.Length[iris$Species == "virginica"]

t.test(x2, x3)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  x2 and x3
#> t = -5.6292, df = 94.025, p-value = 1.866e-07
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -0.8819731 -0.4220269
#> sample estimates:
#> mean of x mean of y 
#>     5.936     6.588

1.7.5 Regression

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data=iris)
summary(fit) # show results
#> 
#> Call:
#> lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
#>     data = iris)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.82816 -0.21989  0.01875  0.19709  0.84570 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
#> Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
#> Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
#> Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3145 on 146 degrees of freedom
#> Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
#> F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

This means the fitted regression is:

Sepal.Length ~ 1.856 + 0.65*Sepal.Width + 0.709*Petal.Length - 0.556*Petal.Width

We can check how good the regression is by plotting it out

y_pred <- fit$coefficients[1] + 
  fit$coefficients[2] * iris$Sepal.Width +
  fit$coefficients[3] * iris$Petal.Length +
  fit$coefficients[4] * iris$Petal.Width

cor(iris$Sepal.Length, y_pred)
#> [1] 0.926613

plot(iris$Sepal.Length, y_pred)

1.7.6 Resource links

This notebook adapts contents from these resources:

1.7.7 Coding styling

Elegant styling is crucial for maintenance and collaboration. Here is a highly recommended guideline: http://adv-r.had.co.nz/Style.html

1.8 Exercises

This list of exercises will serve as a 2-hour demonstration in the BIOF1001 course. For other learners, you can go through it as homework. The expected time is around two hours if you have read the R materials carefully together with trying them with your own R environment.

Note, before you get started, please make sure that you are familiar with the panels in RStudio. You may watch this YouTube Video or check this RStudio cheatsheet.

1.8.1 Part 1. Basics (~40min)

Q1: We have 25 students in BIOF1001, to store the final marks (0 to 100 with a precision of 0.1), what data type will we use?

Q2: For the grades (A+ to F), what data type and data structure can be used to keep the memory minimal?

Q3: Make a matrix with name my_matrix, shape of 5 rows and 2 columns and values from 3 to 12. The first row is 3 and 4. Hint: for making a vector from 3 to 12, you may use seq() or :.

Q4: Based on Q3, add the row names to Day1 to Day5 and column names to Lunch and Dinner.

Q5: Based on Q4, get a matrix with shape of 3x1 and values of 6, 8, 10 from the matrix my_matrix.

Q6: What will you get for my_matrix[c(TRUE, FALSE, FALSE, TRUE), ]? Hint: think of recycling if the index length is different from the query dimension (Over-flexibility comes with a price of wrong use).

Q7: If you have two vectors the BIOF1001 marks and grades and one character for teaching performance "good", and you want to store them into one variable, which data structure will you use?

Q8: Now, on your Desktop folder, make a subfolder with name R_exercises and download this file of differentailly expressed genes results Diff_Expression_results.tsv (or link to view) to the folder. Check your current work directory by getwd() function and change the work directory to the folder you just created. Hint: you may use setwd() to change the work directory or use the Session button of RStudio.
Q9: Related to Q8, use the read.table() function to load the file into a data frame with the variable name df_DEG. Hint: You may consider using the full path or just the file name if it’s in the same work directory. Please keep header=TRUE for the argument. Think how to find help page for a certain function.

Q10: Can you calculate the mean and standard deviation of the log2FoldChange? If the mean >0.1 and the standard deviation < 3, set the trend variable as “increase”, otherwise “not_increase”. What will happen if you add this trend variable to the data frame, and why? Hint: use mean() and st() functions for calculating mean and standard deviation.

1.8.2 Part 2. Making plotting (~40min)

Q11: Keep use the df_DEG from part 1 Q9. Now, make a histogram of the log2FoldChange with both basic plotting function hist() and ggplot2.
Q12: Make a plot with x-axis of log10(baseMean) and y-axis of log2FoldChange. Please try both the basic plot() function and ggplot2.
Q13: Now, manipulate the dataframe by adding two columns:
- Add a column log2FC_clip for clipping log2FoldChange to [-5, +5]
- Add a column is_DE for padj < 0.05

Q14: Try the summary() function with the above df_DEG data frame, and also table() function for the is_DE column.
Q15: Based on ggplot2, add the color by the newly added column is_DE.

Q16: Set the colors to “red” and “grey”, and make it in the order of TRUE and FALSE. Hint: use factor and set the levels parameter.
Q17: Save the generated figure into “My_DEG_results.pdf”. Use the ggsave() function. Please set width = 5, height = 4.

You are expected a figure like this

DEG figure

If you want to change labels on x-axis or y-axis and font size, etc., you can simply Google and find examples.

1.8.3 Part 3. For loop and repeating processing (~40min)

Q18: Load the following table from this file on GitHub and View it in RStudio.

It contains expressoon of 619 transcription factors from 7 Nasopharyngeal carcinoma (NPC) samples and 3 nasopharyngeal lymphatic hyperplasia (NLH) samples. https://github.com/StatBiomed/NegabinGLM/raw/main/data/NPC_NLH-Tcell-donorX.tsv (or link to view)

df_NPC = read.table("https://github.com/StatBiomed/NegabinGLM/raw/main/data/NPC_NLH-Tcell-donorX.tsv", sep="\t", header=TRUE)
Q19: Extract the column 5 (HES5) to 623 (LEK4) and make it into a matrix TF_mat Hint: how to index a data frame like index a matrix.

Q20: perform normalization. Divide the TF_mat by the df_NPC$total_counts and multiply by 1000000, and assign it to a new matrix named TF_mat_norm.

You may further consider transformation by log1p(), i.e., log(TF_mat_norm + 1).

Q21: calculate the log fold change on the first gene TP73 in TF_mat_norm between NPC (row 1 to 7) and NLH (row 8 to 10) and perform t-test return the p value and log fold change.

Q22: perform t-test on the all gene in TF_mat_norm between NPC (row 1 to 7) and NLH (row 8 to 10). Hint: think of for loop or apply() function.