Chapter 4 Introduction to Classification
4.1 Logistic regression
In this chapter, we will introduce classification. There are many commonly used classification methods, for example, random forests and support vector machines. Here, we will focus on logistic regression for the following reasons:
- It usually achieves good accuracy, especially after properly transforming the features so that they are approximately linearly related to the output variable.
- It is based on a linear relationship between the explanatory variables and the output variable, making it easy to interpret.
- It has a close connection to neural networks, making it useful for your future studies.
In a nutshell, logistic regression is linear regression combined with a logistic transformation, so that it produces probabilistic predictions. Let’s start with a linear decision boundary.
4.1.1 Linear decision boundary
Taking the iris dataset as an example, we will see that a linear boundary is a line in two-dimensional space and can be specified as a function of two variables, x1 (x-axis) and x2 (y-axis). Here, let’s take an example decision boundary: x2 = x1 - 2.3, or equivalently, z = x1 - x2 - 2.3 = 0, which we can visualize as follows.
## Example decision boundary as a line: x2 = x1 - 2.3; or z = x1 - x2 - 2.3 = 0
# Note, we use x2 as the y-axis. If z = x1 - x2 - 2.3 > 0, it's versicolor
ggplot(data = iris[1:100, ],
aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() + geom_abline(intercept = -2.3)
In high-dimensional space, a linear decision boundary will be a hyperplane. We can further write this linear decision model in a general form for \(p\)-dimensional features, as follows,
\[z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p = 0\]
Then the classification is made as positive if \(z>0\), or negative if \(z<0\).
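To make this decision rule concrete, here is a minimal sketch that applies the example boundary above (z = x1 - x2 - 2.3, with x1 = Sepal.Length and x2 = Sepal.Width) to the first 100 iris flowers. This is the hand-picked illustrative boundary from above, not a fitted model.
## Classify the first 100 iris flowers by the sign of z
z <- iris$Sepal.Length[1:100] - iris$Sepal.Width[1:100] - 2.3
pred <- ifelse(z > 0, "versicolor", "setosa")
# Compare predictions with the observed species
table(pred, droplevels(iris$Species[1:100]))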
Potential issues: there are two potential issues with the above linear classification. First, it cannot give a predicted probability for the label, which is often useful for determining the confidence of a prediction. Second, it is not straightforward to define a loss between the predicted value \(z\) and the observed label \(y\).
To solve these issues, a logistic transformation is introduced to squash the predicted value from \((-\infty, +\infty)\) to \([0, 1]\), serving as the predicted probability. Therefore, logistic regression can be written as follows,
\[P(y=1|X, W) = \sigma(z) = \sigma(w_0 + w_1 x_1 + \cdots + w_p x_p)\] where \(\sigma()\) denotes the logistic function (also known as the sigmoid function), defined as \[\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}\]
4.1.2 Logistic and logit functions
Let’s play around with logistic and logit functions and then visualize them.
Logistic function
Let’s write our first function, logistic(), as follows.
# Write your first function: the logistic (sigmoid), mapping any real value to (0, 1)
logistic <- function(z) {
  exp(z) / (1 + exp(z))
}
# Try it with different values:
logistic(0.1)
#> [1] 0.5249792
logistic(c(-3, -2, 0.5, 3, 5))
#> [1] 0.04742587 0.11920292 0.62245933 0.95257413 0.99330715
This is equivalent to the built-in plogis() function in the stats package for the logistic distribution:
plogis(0.1)
#> [1] 0.5249792
plogis(c(-3, -2, 0.5, 3, 5))
#> [1] 0.04742587 0.11920292 0.62245933 0.95257413 0.99330715
Logit function
Now, let’s look at the logistic function’s inverse, logit(), and define it manually. Note, this function only supports inputs between 0 and 1.
# Define the logit function: the inverse of the logistic, valid for inputs in (0, 1)
logit <- function(x) {
  log(x / (1 - x))
}
# Try it with different values:
logit(0.4)
#> [1] -0.4054651
logit(c(0.2, 0.3, 0.5, 0.7, 0.9))
#> [1] -1.3862944 -0.8472979 0.0000000 0.8472979 2.1972246
logit(c(-1, 2, 0.4))
#> Warning in log(x/(1 - x)): NaNs produced
#> [1] NaN NaN -0.4054651
Again, the built-in stats package has an equivalent function for the logistic distribution, qlogis(), though with a different name.
4.1.3 Visualise logistic and logit functions
Logistic function
# You can use seq() function to generate a vector
# Check how to use it by help(seq) or ?seq
x = seq(-7, 7, 0.3)
df = data.frame('x'=x, 'logistic'=plogis(x))
# You can plot by plot function
# plot(x=df$x, y=df$logistic, type='o')
# Or ggplot2
library(ggplot2)
ggplot(df, aes(x=x, y=logistic)) +
geom_point() + geom_line()
Logit function
x = seq(0.001, 0.999, 0.01)
df = data.frame('x'=x, 'logit'=qlogis(x))
ggplot(df, aes(x=x, y=logit)) +
geom_point() + geom_line()
4.2 Application on Diabetes
4.2.1 Load Pima Indians Diabetes Database
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The datasets consist of several medical predictors (independent variables) and one target outcome (dependent variable). Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
# Install the mlbench library for loading the datasets
if (!requireNamespace("mlbench", quietly = TRUE)) {
install.packages("mlbench")
}
# Load data
library(mlbench)
data(PimaIndiansDiabetes)
# Check the first few lines
dim(PimaIndiansDiabetes)
#> [1] 768 9
head(PimaIndiansDiabetes)
#> pregnant glucose pressure triceps insulin mass pedigree age diabetes
#> 1 6 148 72 35 0 33.6 0.627 50 pos
#> 2 1 85 66 29 0 26.6 0.351 31 neg
#> 3 8 183 64 0 0 23.3 0.672 32 pos
#> 4 1 89 66 23 94 28.1 0.167 21 neg
#> 5 0 137 40 35 168 43.1 2.288 33 pos
#> 6 5 116 74 0 0 25.6 0.201 30 neg
Now, let’s check two potential features, glucose and age, colored by the diabetes labels.
library(ggplot2)
ggplot(data=PimaIndiansDiabetes, aes(x=glucose, y=age)) +
geom_point(aes(color=diabetes))
Before we start fitting models, let’s split the data into training and test sets in a 3:1 ratio. We will do this manually, though there are functions that can do it automatically.
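A minimal sketch of such a manual split is given below. Note that the random seed here is an assumption for illustration, so the exact rows assigned to each set (and hence the numbers reported later) may differ slightly.
## Randomly split into 75% training and 25% test sets
set.seed(0)  # illustrative seed; the original split may have used a different one
train_idx <- sample(nrow(PimaIndiansDiabetes),
                    size = round(0.75 * nrow(PimaIndiansDiabetes)))
df_train <- PimaIndiansDiabetes[train_idx, ]
df_test  <- PimaIndiansDiabetes[-train_idx, ]
dim(df_train)
dim(df_test)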
4.2.2 Fit logistic regression
In logistic regression, the predicted probability of being class 1 is:
\[P(y=1|X, W) = \sigma(w_0 + w_1 x_1 + \cdots + w_p x_p)\]
where \(\sigma()\) denotes the logistic function.
In R, the built-in stats package already has functions to fit generalised linear models (GLMs), and logistic regression is a type of GLM.
Here, let’s start by fitting a logistic regression model with all features on the training data.
Note, we will specify the model family as binomial, since the likelihood used in logistic regression is a Bernoulli likelihood, a special case of the binomial likelihood when the number of trials is n = 1.
## Define formula in different ways
# my_formula = as.formula(diabetes ~ glucose + age)
# my_formula = as.formula(paste("diabetes ~", paste(colnames(PimaIndiansDiabetes)[1:8], collapse = " + ")))
# my_formula = as.formula(diabetes ~ .)
## Fit logistic regression
glm_res <- glm(diabetes ~ ., data=df_train, family = binomial)
## We can use the logLik() function to obtain the log likelihood
logLik(glm_res)
#> 'log Lik.' -281.9041 (df=9)
We can use the summary() function to see more details about the model fit, including the p-values.
summary(glm_res)
#>
#> Call:
#> glm(formula = diabetes ~ ., family = binomial, data = df_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -8.044602 0.826981 -9.728 < 2e-16 ***
#> pregnant 0.130418 0.036080 3.615 0.000301 ***
#> glucose 0.032196 0.004021 8.007 1.18e-15 ***
#> pressure -0.017158 0.006103 -2.811 0.004934 **
#> triceps -0.003425 0.007659 -0.447 0.654752
#> insulin -0.001238 0.001060 -1.169 0.242599
#> mass 0.104029 0.018119 5.741 9.39e-09 ***
#> pedigree 0.911030 0.344362 2.646 0.008156 **
#> age 0.012980 0.010497 1.237 0.216267
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 756.83 on 575 degrees of freedom
#> Residual deviance: 563.81 on 567 degrees of freedom
#> AIC: 581.81
#>
#> Number of Fisher Scoring iterations: 5
## Access the p-values
summary_glm <- summary(glm_res)
print("p-values can be accessed from the summary function:")
#> [1] "p-values can be accessed from the summary function:"
summary_glm$coefficients[, "Pr(>|z|)"]
#> (Intercept) pregnant glucose pressure triceps insulin
#> 2.297922e-22 3.006816e-04 1.178551e-15 4.933980e-03 6.547515e-01 2.425994e-01
#> mass pedigree age
#> 9.388126e-09 8.155617e-03 2.162670e-01
4.2.3 Assess on test data
Now, we can evaluate the accuracy of the model on the 25% test data.
## Train the full model on the training data
glm_train <- glm(diabetes ~ ., data=df_train, family = binomial)
## Predict the probability of having diabetes on the test data
# We can also set a threshold, e.g., 0.5 for the predicted label
pred_prob = predict(glm_train, df_test, type = "response")
pred_label = pred_prob >= 0.5
## Observed label
obse_label = df_test$diabetes == 'pos'
## Calculate the accuracy on test data
# think how accuracy is defined
# we can use (TN + TP) / (TN + TP + FN + FP)
# we can also directly compare the proportion of correctness
accuracy = mean(pred_label == obse_label)
print(paste("Accuracy on test set:", accuracy))
#> [1] "Accuracy on test set: 0.796875"
4.2.4 Model selection and diagnosis
Similar to multiple linear regression, selecting predictive features is essential to ensure good prediction performance on test sets. In theory, there could be \(2^p\) combinations of features for \(p\)-dimensional inputs, making the selection of the best model a challenging task, especially when \(p\) is large.
Here, we use a strategy of “recursive feature elimination”. Namely, we include all features first and then remove features with relatively large p-values (i.e., those for which we fail to reject the null hypothesis that the coefficient is zero).
Besides the p-values, we will also use the test set that we split from the entire dataset, which is a general strategy for assessing performance in machine learning.
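As a side note, R’s built-in step() function (in the stats package) can automate backward elimination using AIC instead of p-values. Below is a minimal sketch under that alternative criterion (the object names glm_full and glm_step are just illustrative); the model it selects may differ from the one obtained by the manual elimination below.
## Automated backward selection by AIC (alternative to manual p-value elimination)
glm_full <- glm(diabetes ~ ., data = df_train, family = binomial)
glm_step <- step(glm_full, direction = "backward", trace = FALSE)
formula(glm_step)  # features retained by AIC-based elimination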
4.2.4.1 Model2: New feature set by removing triceps
## Train model 2 (without triceps) on the training data
glm_mod2 <- glm(diabetes ~ pregnant + glucose + pressure +
insulin + mass + pedigree + age,
data=df_train, family = binomial)
logLik(glm_mod2)
#> 'log Lik.' -282.0038 (df=8)
summary(glm_mod2)
#>
#> Call:
#> glm(formula = diabetes ~ pregnant + glucose + pressure + insulin +
#> mass + pedigree + age, family = binomial, data = df_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -8.0317567 0.8251403 -9.734 < 2e-16 ***
#> pregnant 0.1308094 0.0361230 3.621 0.000293 ***
#> glucose 0.0324606 0.0039854 8.145 3.80e-16 ***
#> pressure -0.0175651 0.0060269 -2.914 0.003563 **
#> insulin -0.0014402 0.0009593 -1.501 0.133291
#> mass 0.1018155 0.0173811 5.858 4.69e-09 ***
#> pedigree 0.9000134 0.3428652 2.625 0.008665 **
#> age 0.0131238 0.0105147 1.248 0.211982
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 756.83 on 575 degrees of freedom
#> Residual deviance: 564.01 on 568 degrees of freedom
#> AIC: 580.01
#>
#> Number of Fisher Scoring iterations: 5
## Predict the probability of having diabetes on the test data
# We can also set a threshold, e.g., 0.5 for the predicted label
pred_prob2 = predict(glm_mod2, df_test, type = "response")
pred_label2 = pred_prob2 >= 0.5
accuracy2 = mean(pred_label2 == obse_label)
print(paste("Accuracy on test set with model2:", accuracy2))
#> [1] "Accuracy on test set with model2: 0.807291666666667"
4.2.4.2 Model3: New feature set by removing triceps and insulin
# Train model 3 (without triceps and insulin) on the training data
glm_mod3 <- glm(diabetes ~ pregnant + glucose + pressure +
mass + pedigree + age,
data=df_train, family = binomial)
logLik(glm_mod3)
#> 'log Lik.' -283.1342 (df=7)
summary(glm_mod3)
#>
#> Call:
#> glm(formula = diabetes ~ pregnant + glucose + pressure + mass +
#> pedigree + age, family = binomial, data = df_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -7.797803 0.802287 -9.719 < 2e-16 ***
#> pregnant 0.130990 0.035957 3.643 0.00027 ***
#> glucose 0.030661 0.003755 8.164 3.23e-16 ***
#> pressure -0.017847 0.005953 -2.998 0.00272 **
#> mass 0.097356 0.016969 5.737 9.61e-09 ***
#> pedigree 0.824150 0.338299 2.436 0.01484 *
#> age 0.015134 0.010426 1.452 0.14663
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 756.83 on 575 degrees of freedom
#> Residual deviance: 566.27 on 569 degrees of freedom
#> AIC: 580.27
#>
#> Number of Fisher Scoring iterations: 5
# Predict the probability of having diabetes on the test data
# We can also set a threshold, e.g., 0.5 for the predicted label
pred_prob3 = predict(glm_mod3, df_test, type = "response")
pred_label3 = pred_prob3 >= 0.5
accuracy3 = mean(pred_label3 == obse_label)
print(paste("Accuracy on test set with model3:", accuracy3))
#> [1] "Accuracy on test set with model3: 0.786458333333333"
4.3 Cross-validation
4.3.1 How to increase test sets
In the last section, we split the whole dataset into 75% for training and 25% for testing. However, when the dataset is small, the test set may not be big enough, hence introducing high variance in the assessment.
One way to reduce this variance in assessment is to perform cross-validation, where we split the data into K folds and use K-1 folds for training and the remaining fold for testing. This procedure is repeated so that each of the K folds serves as the test fold once, and the predictions from all folds are aggregated for a joint assessment.
K is usually taken to be 3, 5, or 10. In the extreme case where K equals the number of samples, we call it leave-one-out cross-validation (LOOCV).
4.3.2 K-fold CV with caret package
Let’s load the dataset (again) first.
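If you are starting from a fresh session, reload the data as before (assuming the mlbench package is already installed, as in Section 4.2.1):
library(mlbench)
data(PimaIndiansDiabetes)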
The procedure of cross-validation is straightforward; therefore, we could implement it manually from scratch (try it yourself with a for loop; a minimal sketch is given below).
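Here is a minimal sketch of manual 5-fold cross-validation with a for loop. It is illustrative only (including the seed); the caret-based version below is what we use for the reported results.
## Manual 5-fold cross-validation with a for loop (illustrative sketch)
set.seed(0)
K <- 5
fold_id <- sample(rep(1:K, length.out = nrow(PimaIndiansDiabetes)))
cv_prob <- rep(NA, nrow(PimaIndiansDiabetes))
for (k in 1:K) {
  # Fit on all folds except fold k, then predict the held-out fold k
  fit_k <- glm(diabetes ~ ., data = PimaIndiansDiabetes[fold_id != k, ],
               family = binomial)
  cv_prob[fold_id == k] <- predict(fit_k, PimaIndiansDiabetes[fold_id == k, ],
                                   type = "response")
}
# Pooled cross-validated accuracy with a 0.5 cutoff
mean((cv_prob >= 0.5) == (PimaIndiansDiabetes$diabetes == 'pos'))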
On the other hand, there are also packages that support cross-validation directly, including the caret package, which we will install and use here. Note, to calculate the ROC curve later, we need to keep the predicted probabilities by setting classProbs = TRUE.
## Install the caret library for cross-validation
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
library(caret)
## Define training control
# We also want to have savePredictions=TRUE & classProbs=TRUE
set.seed(0)
my_trControl <- trainControl(method = "cv", number = 5,
classProbs = TRUE,
savePredictions = TRUE)
## Train the model using train() function
cv_model <- train(diabetes ~ ., data = PimaIndiansDiabetes,
method = "glm",
family=binomial(),
trControl = my_trControl)
## Summarize the results
print(cv_model)
#> Generalized Linear Model
#>
#> 768 samples
#> 8 predictor
#> 2 classes: 'neg', 'pos'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold)
#> Summary of sample sizes: 615, 614, 615, 614, 614
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7708344 0.4695353
We can also access the detailed prediction results, concatenated across the K folds:
cv_res = cv_model$pred
head(cv_res)
#> pred obs neg pos rowIndex parameter Resample
#> 1 neg neg 0.9656694 0.03433058 4 none Fold1
#> 2 neg neg 0.8581071 0.14189290 6 none Fold1
#> 3 neg pos 0.9508306 0.04916940 7 none Fold1
#> 4 neg pos 0.6541361 0.34586388 17 none Fold1
#> 5 neg pos 0.7675666 0.23243342 20 none Fold1
#> 6 neg pos 0.6132685 0.38673152 26 none Fold1
We can double-check the accuracy from these pooled predictions:
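Since cv_res pools the held-out predictions from all five folds, we can recompute the overall accuracy directly. This should closely match the Accuracy reported by train() above (caret averages the per-fold accuracies, so tiny differences are possible).
## Accuracy over the pooled cross-validation predictions
mean(cv_res$pred == cv_res$obs)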
4.4 Metrics and ROC curve
4.4.1 Two types of error
In the above sections, we used accuracy for model diagnosis, either on a single test set or aggregated across multiple folds in cross-validation.
Accuracy is a widely used metric for model evaluation, based on the average error rate. However, this metric still has limitations when assessing the model performance, especially the following two:
- When the samples are highly imbalanced, high accuracy may not indicate a good model. For example, for a dataset with 990 negative samples and 10 positive samples, a trivial model that predicts all samples as negative gives an accuracy of 0.99 (see the short numeric check after this list). Thus, for highly imbalanced samples, we should be careful when interpreting accuracy.
- In many scenarios, our tolerance for false positive errors and false negative errors may be different, and we want to know both for a given model. They are often called type I and type II errors:
- Type I error: false positive (rate)
- Type II error: false negative (rate) - a light-hearted way to remember them is that the two strokes of “II” resemble the strokes of the “N” in Negative.
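As a quick numeric check of the imbalance example above (with the illustrative 990/10 counts):
## Accuracy can be high even when every positive sample is missed
obs  <- c(rep("neg", 990), rep("pos", 10))  # 990 negatives, 10 positives
pred_all_neg <- rep("neg", 1000)            # trivial model: predict everything negative
mean(pred_all_neg == obs)                   # accuracy = 0.99, yet no positive is detected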
Here, we use the diabetes dataset and its cross-validation results above to illustrate the two types of errors and the corresponding model performance evaluation.
## Let's start to define the values for the confusion matrix first
# Recall the difference between & and &&
# Read more: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html
TP = sum((cv_res$obs == 'pos') & (cv_res$pred == 'pos'))
FN = sum((cv_res$obs == 'pos') & (cv_res$pred == 'neg'))
FP = sum((cv_res$obs == 'neg') & (cv_res$pred == 'pos'))
TN = sum((cv_res$obs == 'neg') & (cv_res$pred == 'neg'))
print(paste('TP, FN, FP, TN:', TP, FN, FP, TN))
#> [1] "TP, FN, FP, TN: 151 117 59 441"
We can also use the table() function to get the whole confusion matrix. Read more about the table() function, which counts the frequency of each combination of elements. A similar option is the confusionMatrix() function in the caret package.
## Calculate confusion matrix
confusion_mtx = table(cv_res[, c("obs", "pred")])
confusion_mtx
#> pred
#> obs neg pos
#> neg 441 59
#> pos 117 151
## Similar function confusionMatrix
# conf_mat = confusionMatrix(cv_res$pred, cv_res$obs)
# conf_mat$table
We can also plot the confusion matrix:
## Change to data.frame before using ggplot
confusion_df = as.data.frame(confusion_mtx)
ggplot(confusion_df, aes(pred, obs, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="darkgreen")
We can also compute the false positive rate, false negative rate, and true positive rate. Note, the denominator is always the number of observed samples with the corresponding label, so it is a constant for a specific dataset.
FPR = FP / sum(cv_res$obs == 'neg')
FNR = FN / sum(cv_res$obs == 'pos')
TPR = TP / sum(cv_res$obs == 'pos')
print(paste("False positive rate:", FPR))
#> [1] "False positive rate: 0.118"
print(paste("False negative rate:", FNR))
#> [1] "False negative rate: 0.436567164179104"
print(paste("True positive rate:", TPR))
#> [1] "True positive rate: 0.563432835820896"
4.4.2 ROC curve
In the above assessment, we only used \(P>0.5\) to label a prediction as positive. We can imagine that with a lower cutoff, we will have more false positives and fewer false negatives. Indeed, in different scenarios, people may choose different cutoffs depending on their tolerance for each type of error.
Let’s try a cutoff \(P>0.4\). Think about what the results will be.
## Original confusion matrix
table(cv_res[, c("obs", "pred")])
#> pred
#> obs neg pos
#> neg 441 59
#> pos 117 151
## New confusion matrix with cutoff 0.4
cv_res$pred_new = as.integer(cv_res$pos >= 0.4)
table(cv_res[, c("obs", "pred_new")])
#> pred_new
#> obs 0 1
#> neg 408 92
#> pos 89 179
Therefore, we may want to assess the model performance while varying the cutoff, to obtain a more systematic assessment.
This is exactly what the receiver operating characteristic (ROC) curve provides. It plots the TPR (sensitivity) against the FPR (i.e., 1 - TNR, or 1 - specificity) as the cutoff varies.
To achieve this, we can calculate the FPR and TPR manually by varying the cutoff in a for loop. Read more about for loops and try writing your own; there is also an example in the cardelino package.
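A minimal sketch of such a loop, reusing the pooled cross-validation predictions in cv_res (the cutoff grid is illustrative):
## Compute FPR and TPR over a grid of cutoffs with a for loop
cutoffs <- seq(0, 1, by = 0.01)
FPR_all <- TPR_all <- rep(NA, length(cutoffs))
for (i in seq_along(cutoffs)) {
  pred_i <- cv_res$pos >= cutoffs[i]
  FPR_all[i] <- sum(pred_i & cv_res$obs == 'neg') / sum(cv_res$obs == 'neg')
  TPR_all[i] <- sum(pred_i & cv_res$obs == 'pos') / sum(cv_res$obs == 'pos')
}
# plot(FPR_all, TPR_all, type = 'l')  # a rough ROC curve from these points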
For simplicity, let’s use an existing tool implemented in the pROC package to obtain the key information for making ROC curves: sensitivities (TPR) and specificities (1 - FPR) for a list of thresholds:
## Install the pROC library for plotting ROC curve
if (!requireNamespace("pROC", quietly = TRUE)) {
install.packages("pROC")
}
# library(pROC)
roc_outs <- pROC::roc(cv_res$obs == 'pos', cv_res$pos)
#> Setting levels: control = FALSE, case = TRUE
#> Setting direction: controls < cases
print(paste("The AUC score of this ROC curve is", roc_outs$auc))
#> [1] "The AUC score of this ROC curve is 0.829082089552239"
roc_table <- as.data.frame(
roc_outs[c("sensitivities", "specificities", "thresholds")]
)
head(roc_table)
#> sensitivities specificities thresholds
#> 1 1.0000000 0.000 -Inf
#> 2 1.0000000 0.002 0.001848584
#> 3 1.0000000 0.004 0.003167924
#> 4 1.0000000 0.006 0.004588208
#> 5 1.0000000 0.008 0.005750560
#> 6 0.9962687 0.008 0.007297416
With this roc_table at hand, we can find which threshold we should use for a certain desired FPR or TPR, for example, to have FPR = 0.25:
FPR = 1 - roc_table$specificities
idx = which.min(abs(FPR - 0.25))
roc_table[idx, ]
#> sensitivities specificities thresholds
#> 444 0.7462687 0.75 0.3278307
print(paste("When FPR is closest to 0.25, the threshold is", roc_table[idx, 3]))
#> [1] "When FPR is closest to 0.25, the threshold is 0.327830708502212"
print(paste("When FPR is closest to 0.25, the TPR is", roc_table[idx, 1]))
#> [1] "When FPR is closest to 0.25, the TPR is 0.746268656716418"
With all thresholds available, we can also use ggplot2 directly to make an ROC curve:
library(ggplot2)
## Build the ROC curve; the dashed diagonal shows the performance of a random classifier
g = ggplot(roc_table, aes(x = 1 - specificities, y = sensitivities)) +
geom_line(color = "blue", size = 1) +
geom_abline(linetype = "dashed", color = "grey") +
labs(title = "ROC Curve",
x = "FPR (1 - Specificity)",
y = "TPR (Sensitivity)") +
coord_equal()
## Display the plot with more annotations
g +
annotate("text", x=0.8, y=0.1, label=paste("AUC =", round(roc_outs$auc, 4))) +
geom_point(x = 1 - roc_table[idx, 2], y = roc_table[idx, 1]) +
geom_text(x = 1 - roc_table[idx, 2], y = roc_table[idx, 1],
label = round(roc_table[idx, 3], 3), color = 'black', hjust = -0.3)
Acknowledgement: This notebook is adapted and updated from STAT1005.