Chapter 10 Genetic sequence

10.1 Case study 1: Genetic sequence analysis

10.1.1 Sequence motif and k-mer

Scenario: You are a biomedical data science trainee, and you want to use some real case studies to understand how to perform sequence analysis. From these practices, you hope to understand how the sickle cell disease happens and how the choice of gene editing site makes a potential therapy.

From the lecture, you already understood that:

the sickle cell disease happens due to a single nucleotide mutation on HBB ( coding the hemoglobin bata chain for adult use), causing a missense mutation on amino acid 7 from E to V. You can view more from ClinVar RCV000016574 and OMIM 603903.
the therapy was to re-activate another gene HBG (gamma chain) that was used in the fetus. The HBG gene was repressed by a transcription factor BCL11A (as a protein) that serves as a repressor to HBG.

Q1. How can we understand where the protein of the BCL11A gene prefers to bind on DNA?

Answers:

Using the binding motif database JASPAR: BCL11A’s motif is MA2324.1
It collected the sequences of many binding sites already, check the FASTA file or the HTML file.
It’s Motif can be viewed as this file

Q2. What is the meaning of position frequency matrix (PFM) and how to convert it into a position probability matrix (PPM, often used as position weight matrix PWM)?

MA2324_PFM <- matrix(c(
  343,    180,    125,   5724,    155,    143,   4495,
  5024,   120,    281,     92,   5855,   5964,    641,
  591,    165,   5747,    100,    141,     55,    385,
  307,   5800,    112,    349,    114,    103,    744),
  nrow = 4, byrow = TRUE
)
rownames(MA2324_PFM) = c("A", "C", "G", "T")
MA2324_PFM
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> A  343  180  125 5724  155  143 4495
#> C 5024  120  281   92 5855 5964  641
#> G  591  165 5747  100  141   55  385
#> T  307 5800  112  349  114  103  744

# TODO: define position probability matrix

HowTo:

MA2324_PWM = MA2324_PFM / sum(MA2324_PFM[, 1])
MA2324_PWM
#>         [,1]       [,2]       [,3]       [,4]       [,5]        [,6]       [,7]
#> A 0.05474860 0.02873105 0.01995211 0.91364725 0.02474062 0.022825219 0.71747805
#> C 0.80191540 0.01915403 0.04485235 0.01468476 0.93455706 0.951955307 0.10231445
#> G 0.09433360 0.02633679 0.91731844 0.01596169 0.02250599 0.008778931 0.06145251
#> T 0.04900239 0.92577813 0.01787709 0.05570630 0.01819633 0.016440543 0.11875499

Q3. Let’s make the Motif logo ourselves by using weblogo.threeplusone.com

You may use the first 500 lines of the sequences that we compiled: MA2324.1.sites-h500.txt

HowTo:

Copy and paste to weblogo.threeplusone.com
Check which position has the highest information content in bit.

Q4. Given the motif, how good is a certain sequence in terms of consistency? Take these two sequences as examples:

sequence 1: CGGACCA (>hg38_chr1:1000830-1000836(+))
sequence 2: CTGACCG (hg38_chr1:940785-940791(-))

# Use this One hot encoding function
Onehot_Encode <- function(sequence) {
  idx = match(strsplit(sequence, '')[[1]], c("A", "C", "G", "T"))
  seq_mat = matrix(0, 4, length(idx))
  for (i in 1:length(idx)) seq_mat[idx[i], i] = 1
  seq_mat
}

seq_vec1 = Onehot_Encode("CGGACCA")
seq_vec2 = Onehot_Encode("CTGACCG")
seq_vec2
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,]    0    0    0    1    0    0    0
#> [2,]    1    0    0    0    1    1    0
#> [3,]    0    0    1    0    0    0    1
#> [4,]    0    1    0    0    0    0    0

To calculate the consistency score of a certain sequence \(S\) to a given motif (\(M\) for the motif position weight matrix), you may recall what we introduced in the lecture (same as the reference papers [1, 2]):

\(P(S|M) = \prod_{i=1}^k{M[s_i, i]}\)

Alternatively, you can also transform the sequence via our one hot encoding. Now, please try and see if you can calculate the motif score for the above two sequences (possibly in log10-scale).

Example solution scripts:

Meaning of motif score function: the log-likelihood of seeing a binding site sequence given the binding motif.

Manual calculation step by step:

colSums(seq_vec1 * MA2324_PWM)
#> [1] 0.80191540 0.02633679 0.91731844 0.91364725 0.93455706 0.95195531 0.71747805

colSums(seq_vec2 * MA2324_PWM)
#> [1] 0.80191540 0.92577813 0.91731844 0.91364725 0.93455706 0.95195531 0.06145251

seq_score1 = sum(log10(colSums(seq_vec1 * MA2324_PWM)))
seq_score2 = sum(log10(colSums(seq_vec2 * MA2324_PWM)))

c(seq_score1, seq_score2)
#> [1] -1.946979 -1.468304

Q5. Given a segment of a long sequence, how to find the potential motif sites?

Scan the whole sequence
Find the best matched position

Here, we will take the gene HBG2 as an example, with the initial sequence segment as follows:

ref|NC_000011.10|:c5255019-5254782 Homo sapiens chromosome 11, GRCh38.p14 Primary Assembly
ATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCCACACTATCTCAATGCAAATATCTGT
CTGAAACGGTCCCTGGCTAAACTCCACCCATGGGTTGGCCAGCCTTGCCTTGACCAATAGCCTTGACAAG
GCAAACTTGACCAATAGTCTTAGAGTATCCAGTGAGGCCAGGGGCCGGCGGCTGGCTAGGGATGAAGAAT
AAAAGGAAGCACCCTTCAGCAGTTCCAC

We may load the sequence into R by our Onehot_Encode function:

HBG2_segment <- paste0(
  "ATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCCACACTATCTCAATGCAAATATCTGT",
  "CTGAAACGGTCCCTGGCTAAACTCCACCCATGGGTTGGCCAGCCTTGCCTTGACCAATAGCCTTGACAAG",
  "GCAAACTTGACCAATAGTCTTAGAGTATCCAGTGAGGCCAGGGGCCGGCGGCTGGCTAGGGATGAAGAAT",
  "AAAAGGAAGCACCCTTCAGCAGTTCCAC"
)
HBG2_segment
#> [1] "ATAAAAAAAATTAAGCAGCAGTATCCTCTTGGGGGCCCCTTCCCCACACTATCTCAATGCAAATATCTGTCTGAAACGGTCCCTGGCTAAACTCCACCCATGGGTTGGCCAGCCTTGCCTTGACCAATAGCCTTGACAAGGCAAACTTGACCAATAGTCTTAGAGTATCCAGTGAGGCCAGGGGCCGGCGGCTGGCTAGGGATGAAGAATAAAAGGAAGCACCCTTCAGCAGTTCCAC"

HBG2_seg_vec = Onehot_Encode(HBG2_segment)
# HBG2_seg_vec

Example solution scripts:

# TODO: try it yourself

Q6. With the same sequence above, count the frequency of each 3-mer:

List the all 3-mer
Count each of the 3-mer in the sequence

Example solution scripts:

# TODO: try it yourself

# Option 1: keep using one hot encoding matrix

# Option 2: substr(x, start, stop) 
# you may make a vector with k-mer as name and use the above substr as index
# this is related to dictionary data structure

k = 3
bases = c("A", "C", "G", "T")
kmer_mat = unique(t(combn(rep(bases, k), m = k)))
kmer = apply(kmer_mat, 1, paste, collapse='')

kmer_count = rep(0, length(kmer))
names(kmer_count) = kmer

# Keep trying

10.1.2 Functional mapping

In this part, we will see how the sequence features may help predict molecular phenotype, namely splicing efficiency. You may read more in the original paper Hou & Huang 2021 [4], if you are interested.

Let’s look at the data plicing_efficiency_pred.csv

columns 1 & 2: gene ID and gene name (1850 genes)
column 3: splicing efficiency (log scale)
columns 4 to 99: 96 octamer features (the normalized frequency in intron sequence)

Q7. Now, we can try using the multiple linear regression model to check how predictive the octamer frequencies are for splicing efficiency.

Load the data (scripts below)
Linear regression model
Using 10-fold cross-validation to evaluate the model

You may propose other alternative methods for assessing the association between octamer and splicing efficiency.

library(caret)

df_splicing <- read.csv("plicing_efficiency_pred.csv")

# Define training control
# We also want to have savePredictions=TRUE & classProbs=TRUE
set.seed(0) 
my_trControl <- trainControl(method = "cv", number = 10, 
                             savePredictions = TRUE)

# TODO: try it yourself to fill the rest

10.1.3 References

Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome research, 14(6), 1188-1190.
Schneider, T. D., & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic acids research, 18(20), 6097-6100.
Canver, M. C., Smith, E. C., Sher, F., Pinello, L., Sanjana, N. E., Shalem, O., … & Bauer, D. E. (2015). BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature, 527(7577), 192-197.
Hou, R., & Huang, Y. (2022). Genomic sequences and RNA-binding proteins predict RNA splicing efficiency in various single-cell contexts. Bioinformatics, 38(12), 3231-3237.

Splicing efficiency prediction dataset at https://github.com/StatBiomed/scRNA-efficiency-prediction.

scripts to subset the data :

 df_Y = read.csv(paste0(
   "https://github.com/StatBiomed/scRNA-efficiency-prediction/", 
   "raw/refs/heads/main/data/estimated/sck562_all_stochastical_gamma.csv"
 ), row.names = 1)

 df_X0 = read.csv(paste0(
   "https://github.com/StatBiomed/scRNA-efficiency-prediction/", 
   "raw/refs/heads/main/data/features/humanproteincode_octamer_gene_level.csv"
 ))

 df_use = merge(df_Y, df_X0, by="gene_id", incomparables = NA)
 df_use$velocity_gamma = -log(df_use$velocity_gamma + 0.01)
 colnames(df_use)[3] = "splicing_efficiency"
 # df_use = df_use[-1600, ]

 df_use[, 4:ncol(df_use)] = df_use[, 4:ncol(df_use)] + 0.00001
 df_use[, 3:ncol(df_use)] = round(df_use[, 3:ncol(df_use)], digits=6)

 options(digits = 3)
 write.csv(df_use, "plicing_efficiency_pred.csv", 
           quote = FALSE, row.names = FALSE)

10.1.3.1 Potential resources

Scan motif

# For Q5
motif_score_all = rep(NA, ncol(HBG2_seg_vec)-ncol(MA2324_PWM))
for (i in 1:(ncol(HBG2_seg_vec) - ncol(MA2324_PWM))) {
  motif_score_all[i] = sum(log10(colSums(
    HBG2_seg_vec[, i:(i + ncol(MA2324_PWM) - 1)] * MA2324_PWM
    )))
}
motif_score_all[1:5]

sort(motif_score_all, decreasing = TRUE)[1:7]

# For Q4
# Potential motif score function
# motif_score <- function(sequence, motif_PWM) {
#   seq_vec = Onehot_Encode(sequence)
#   sum(log(colSums(seq_vec * motif_PWM)))
# }

K-mer counting

# For Q6
kmer_count = rep(0, length(kmer))
names(kmer_count) = kmer
for (i in 1:(nchar(HBG2_segment) - k + 1)) {
  a_kmer = substr(HBG2_segment, i, i+2)
  kmer_count[a_kmer] = kmer_count[a_kmer] + 1
}
kmer_count

Cross-validation for linear regression

# For Q7
# Train the model
cv_model <- train(splicing_efficiency ~ ., 
                  data = df_splicing[, 3:ncol(df_splicing)], 
                  method = "lm", trControl = my_trControl)

# Check model performance
print(cv_model)
cor(cv_model$pred$obs, cv_model$pred$pred)

ggplot(cv_model$pred, aes(x=obs, y=pred)) + 
  geom_point()