Chapter 11 Genomics and prediction
11.1 Case study 1: splicing fidelity prediction
Scenario: As a computational biology researcher, you want to find out which sequence-related features regulate splicing fidelity. You conducted a deep RNA-seq experiment and quantified the splicing error frequency (SEF) for 141 splicing error events in yeast. Based on your prior knowledge and curiosity, you plan to define a set of sequence-related features and test how predictive they are.
11.1.1 Questions for Discussion
Q1. What sequence-related features would you think of?
Q2. Given a set of candidate features, what statistical methods can you use to determine if a feature genuinely contributes to splicing fidelity?
Q3. Given a candidate feature set, what could be the best subset of features for predicting splicing fidelity, and in what sense?
11.1.2 Hands-on with regression model
Now, you are given a dataset with 141 splicing error events containing the splicing error frequency (SEF) as the outcome Y and 14 predictors X. The data set and a template notebook are available on Moodle (recommended) and also on this GitHub Repo.
The information for columns:
- RPG: Ribosomal Protein Gene
- event: splicing error event type
- shift: nucleotide shift between canonical (ref, reference) and erroneous (alt, alternative) splice sites
- SEF: splicing error frequency
- intronL_ref: intron length for canonical (ref) intron
- 5ss_bpL_ref: length from the 5' splice site to the branch point
- DeltaG_intron_ref: the free energy score of the RNA secondary structure, calculated by RNAfold. The lower, the more stable.
- 5ss_motif_ref: the motif score of a sequence. The motif position weight matrix (PWM) is obtained from observed genes. The motif score is calculated as the log likelihood of each sequence given the PWM, followed by rescaling so that the maximum value is 100. You can refer to slide 7 in lecture topic 1.
- intronL_alt: intron length for erroneous (alt) intron
df = read.table("df_SEF_prediction.tsv", sep="\t", header=TRUE)
head(df)
#> gene RPG event shift SEF intronL_ref X5ss_bpL_ref
#> 1 TFC3 non_RPG down3ss 17 0.479869423 90 77
#> 2 SEC17 non_RPG down3ss 47 0.021937843 116 77
#> 3 ERD2 non_RPG up3ss 4 0.059303591 97 63
#> 4 LSM2 non_RPG up5ss 56 0.019908116 128 92
#> 5 LSM2 non_RPG up3ss 47 0.077444760 128 92
#> 6 LSM2 non_RPG down3ss 40 0.002844017 128 92
#> DeltaG_intron_ref DeltaG_5ss_bp_ref DeltaG_bp_3ss_ref DeltaG_3ss_ref
#> 1 -0.11444 -0.13377 0.0 -6.2
#> 2 -0.17155 -0.10779 -10.5 -20.3
#> 3 -0.16804 -0.16190 -3.9 -8.7
#> 4 -0.13906 -0.12065 -0.9 -7.0
#> 5 -0.13906 -0.12065 -0.9 -7.0
#> 6 -0.13906 -0.12065 -0.9 -7.0
#> X5ss_motif_ref bp_motif_ref X3ss_motif_ref intronL_alt DeltaG_intron_alt
#> 1 0.7213 0.7703 0.0769 107 -0.10841
#> 2 0.7695 0.9942 0.1461 163 -0.19202
#> 3 0.8495 0.8835 0.1496 93 -0.17527
#> 4 0.8305 0.6667 0.3677 184 -0.15217
#> 5 0.8305 0.6667 0.3677 81 -0.10864
#> 6 0.8305 0.6667 0.3677 168 -0.11667
#> DeltaG_3ss_alt X5ss_motif_alt X3ss_motif_alt
#> 1 -6.8 77.00 0.5223
#> 2 -13.0 70.99 0.2926
#> 3 -8.3 86.10 0.3329
#> 4 -7.0 64.20 0.3955
#> 5 -8.9 80.76 0.6179
#> 6 -4.3 80.76 0.3341
11.1.2.1 A single feature
Q4. Let’s start with a single feature, bp_motif_ref: is there a linear relationship with the output variable SEF (Y)? How do you determine it?
Hint: try a scatter plot to visualise the data. See the example we mentioned before: https://statbiomed.github.io/BMDS-book/introR.html#scatter-plot-1
How about doing a log transform of the SEF score? Consider adding a pseudo count, e.g., 0.0001.
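For example, a minimal sketch (assuming df is loaded as above; the column name SEF_log and the 0.0001 pseudo count are our choices):

```r
# Add a log-transformed SEF with a pseudo count (avoids log(0))
df$SEF_log = log10(df$SEF + 0.0001)

# Scatter plots: raw SEF and log-transformed SEF vs bp_motif_ref
plot(df$bp_motif_ref, df$SEF, xlab = "bp_motif_ref", ylab = "SEF")
plot(df$bp_motif_ref, df$SEF_log, xlab = "bp_motif_ref", ylab = "log10(SEF)")

# A quick numeric check of the linear relationship
cor(df$bp_motif_ref, df$SEF_log)
```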
11.1.2.2 All features with lm
Q5. Can you build a linear model to perform the prediction using all features? How do you obtain the assessment statistics?
Hint: consider using the lm function. See the examples we mentioned before: https://statbiomed.github.io/BMDS-book/introLinearReg.html#simple-linear-regression-with-lm-function
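One possible sketch (assuming a log-transformed response SEF_log as in Q4, and that the 14 numeric features sit in columns 6 to 19 of df, as in the data frame shown above):

```r
# Build a formula combining all numeric sequence features
fm_all = as.formula(paste("SEF_log ~",
                          paste(colnames(df)[6:19], collapse = " + ")))

# Fit the multiple linear regression
fit_all = lm(fm_all, data = df)

# Coefficients, p values, R-squared and F statistic
summary(fit_all)
```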
11.1.2.3 Assessment on test data
Q6. Can you perform the assessment on test data by splitting the data into training and test sets?
We didn’t introduce how to perform training & test splitting for linear regression, but you can adapt the scripts that we introduced for logistic regression.
Hint: you can recall the logistic regression chapter for the training & test splitting: https://statbiomed.github.io/BMDS-book/introClassifier.html#logistic-regression-on-diabetes
You can also just use the code below, but think about what the ratio between the training and test sets is here:
set.seed(0)
idx_train = sample(nrow(df), size=0.5*nrow(df), replace = FALSE)
df_train = df[idx_train, ]
df_test = df[-idx_train, ] # recall the meaning of negative symbol
Now, start training the model on the training set and predict on the test set.
Hint: follow the codes here: https://statbiomed.github.io/BMDS-book/introClassifier.html#logistic-regression-on-diabetes
For calculating R^2, another quick way is to use Pearson’s correlation coefficient R, since squaring it gives an equivalent value. We can do this with the cor() function.
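Putting these steps together, a minimal sketch (assuming a SEF_log column was added to df before the split, e.g. df$SEF_log = log10(df$SEF + 0.0001)):

```r
# Train on the training set using all numeric features (columns 6 to 19)
fm_all = as.formula(paste("SEF_log ~",
                          paste(colnames(df)[6:19], collapse = " + ")))
fit_train = lm(fm_all, data = df_train)

# Predict on the held-out test set
pred_test = predict(fit_train, newdata = df_test)

# R^2 on the test set via Pearson's correlation R
R = cor(pred_test, df_test$SEF_log)
R^2
```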
11.1.2.4 Cross-validation for regression
Q7. Can you perform a 10-fold cross-validation, calculate the R^2 between predicted and observed log(SEF), and plot it with ggplot?
Hint: we didn’t mention cross-validation for linear regression, but again we can adapt the code from the logistic regression chapter: https://statbiomed.github.io/BMDS-book/introClassifier.html#cross-validation
You can also use the code below, but try to fill in the missing parts (marked with question marks ???) yourself.
library(caret)
# Define training control
# We also want savePredictions=TRUE (classProbs=TRUE is only for classification)
set.seed(0)
my_trControl <- trainControl(method = "cv", number = 10,
savePredictions = TRUE)
# Train the model (uncomment the codes below)
# cv_model <- train(???, data = ???,
# method = "lm",
# trControl = my_trControl)
# Summarize the results
# mean(cv_model$resample$Rsquared)
# print(cv_model)
By using print(cv_model), we can see the Rsquared value, which is an average of the R^2 across the K folds (each fold has its own R^2 on its held-out set).
Now, let’s manually calculate the R^2 between observed and predicted Y for all samples in an aggregated manner, even though the Ys are predicted from different models in different folds.
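If you have completed and run the train() call above, caret keeps the out-of-fold predictions (because we set savePredictions = TRUE), so the aggregated R^2 can be sketched as:

```r
# cv_model$pred has one row per held-out sample, with columns
# pred, obs, rowIndex and Resample (the fold label)
head(cv_model$pred)

# Aggregated R^2 across all out-of-fold predictions
cor(cv_model$pred$pred, cv_model$pred$obs)^2
```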
11.1.2.5 A subset of features
Q8. Can you try a subset of features and see if you can improve the performance?
Hint: you can consider the forward method mentioned here, checking the \(p\) value and keeping only the significant features (think about which significance level \(\alpha\) you would like to allow): https://statbiomed.github.io/BMDS-book/introLinearReg.html#multiple-regression-with-lm-function
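A minimal forward-selection sketch (assuming df with a SEF_log column as defined earlier; the significance level alpha = 0.05 is our assumption):

```r
candidates = colnames(df)[6:19]   # the numeric sequence features
selected = c()
alpha = 0.05
repeat {
  remaining = setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # p value of each candidate when added to the current model
  pvals = sapply(remaining, function(f) {
    fm = as.formula(paste("SEF_log ~",
                          paste(c(selected, f), collapse = " + ")))
    summary(lm(fm, data = df))$coefficients[f, "Pr(>|t|)"]
  })
  # stop when no remaining feature is significant at alpha
  if (min(pvals) > alpha) break
  selected = c(selected, names(which.min(pvals)))
}
selected
```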
11.1.3 Open question
Q9. Can you plot a scatter plot of log(SEF) vs X3ss_motif_ref, colored by the Ribosomal Protein Gene (RPG) label? What do you see?
Hint: follow the example on the diabetes data: https://statbiomed.github.io/BMDS-book/introClassifier.html#load-pima-indians-diabetes-database
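One way to sketch this plot (assuming df with a SEF_log column as above):

```r
library(ggplot2)

# Scatter plot of log-transformed SEF vs the 3'ss motif score,
# with points colored by the RPG label
ggplot(df, aes(x = X3ss_motif_ref, y = SEF_log, colour = RPG)) +
  geom_point() +
  labs(x = "3'ss motif score (ref)", y = "log10(SEF)")
```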
Q10. Is the SEF score mainly driven by the RPG label, or is it genuinely related to the 3’ss motif? How do you test it?
Hint: consider adding RPG as a covariate. Is the 3’ss motif still significant?
# Define your new model
lr_fm3 = as.formula(paste("SEF_log ~", paste(colnames(df)[c(2,6:19)], collapse=" + ")))
# Write your code below to fit the model
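A possible sketch for fitting this model (it requires the SEF_log column from earlier, e.g. df$SEF_log = log10(df$SEF + 0.0001)):

```r
# Fit the model with RPG (column 2) included as a covariate
fit_fm3 = lm(lr_fm3, data = df)

# Inspect whether the 3'ss motif term remains significant:
# estimate, standard error, t statistic and p value
summary(fit_fm3)$coefficients["X3ss_motif_ref", ]
```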
Acknowledgement: this case study is adapted from Figure 4 of the following research article:
- Aslanzadeh et al. Transcription rate strongly affects splicing fidelity and cotranscriptionality in budding yeast. Genome Research, 2018, 28(2), 203-213.