This post shows how to construct the receiver operating characteristic (ROC) curve without using predefined functions. The goal is to help you better understand how the ROC curve is constructed and how to interpret it.
Setup
%matplotlib inline
from typing import Tuple, Union
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Load Data and Train Model
The dataset we will use for this post is the famous Titanic dataset. For simplicity, we use a subset of the data that does not contain missing values.
We train a simple logistic regression model using only two columns as independent variables, Fare and Age, to predict whether a passenger survived the disaster.
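The loading and training code is not reproduced here, so the following is only a minimal sketch of what that step might look like. The data frame is randomly generated stand-in data, not the Titanic dataset, and the `Survived` column name, the `test_size`, and the `random_state` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# ASSUMPTION: randomly generated stand-in data, NOT the Titanic dataset.
# Replace this frame with the real subset that has no missing values.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Fare": rng.uniform(5, 100, size=n),
    "Age": rng.uniform(1, 70, size=n),
})
# Hypothetical label, loosely tied to Fare so the model has something to learn.
df["Survived"] = (df["Fare"] + rng.normal(0, 30, size=n) > 50).astype(int)

X = df[["Fare", "Age"]]
y = df["Survived"]

# Hold out a test set; test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit a plain logistic regression on the two features.
model = LogisticRegression()
model.fit(X_train, y_train)
```

The rest of the post only needs a fitted classifier and a held-out `y_test`, so any split that leaves both classes in the test set would work equally well.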
With the model set up, we can go into the core steps for constructing the ROC curve. Constructing the ROC curve takes four steps (adapted from the lecture notes of Professor Spenkuch’s business analytics class).
1. Sort the predicted probability of the “positive” outcome for each observation.
2. For each observation, record the false positive rate (fpr) and true positive rate (tpr) that would result if that observation’s predicted probability were used as the classification threshold.
3. Plot the recorded pairs of tpr and fpr.
4. Connect the dots.
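To make the second step concrete, here is a small sketch on four made-up observations (the labels and probabilities are toy values, not taken from the Titanic model). It counts true and false positives directly with NumPy for each candidate threshold.

```python
import numpy as np

# Toy data, made up for illustration only.
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8])

n_pos = np.sum(y_true == 1)
n_neg = np.sum(y_true == 0)

roc_points = []
for t in np.sort(y_proba):
    # Classify as positive when the probability is at or above the threshold.
    y_pred = y_proba >= t
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    roc_points.append((float(fp / n_neg), float(tp / n_pos)))

print(roc_points)
# [(1.0, 1.0), (0.5, 1.0), (0.5, 0.5), (0.0, 0.5)]
```

The lowest threshold labels every observation positive, so both rates equal 1. As the threshold rises, fewer observations are labeled positive and both rates can only fall or stay the same.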
Let’s walk through these steps one at a time.
First, we get the sorted predicted probabilities of the positive outcome (prediction == 1) with the following two lines of code.
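The two lines themselves are not reproduced here. A plausible sketch follows, using the names `y_pred_proba` and `y_pred_proba_asc` that appear in the later loop. The `proba` array is a stand-in for the output of `model.predict_proba(X_test)`, where `model` and `X_test` are assumed names for the fitted classifier and the held-out features.

```python
import numpy as np

# ASSUMPTION: stand-in for model.predict_proba(X_test). For a binary
# classifier the two columns are P(class 0) and P(class 1) per observation.
proba = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

# The probability of the positive class (prediction == 1) is the second column.
y_pred_proba = proba[:, 1]
# Sort ascending so each value can serve as a candidate threshold.
y_pred_proba_asc = np.sort(y_pred_proba)
```

Note that `y_pred_proba` keeps its original order so that it stays aligned with `y_test`; only the copy used for thresholds is sorted.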
Second, we define a function to calculate the fpr and tpr for a given threshold.
def get_tpr_fpr_pair(
    y_proba: np.ndarray,
    y_true: Union[np.ndarray, pd.Series],
    threshold: float,
) -> Tuple[float, float]:
    """Get the true positive rate and false positive rate based on a certain threshold"""
    # Label an observation positive when its probability reaches the threshold.
    y_pred = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr
With this function, we loop through the sorted probabilities of the positive outcome, use each one as the threshold for get_tpr_fpr_pair, and store the results in two lists.
tpr_list = []
fpr_list = []
for t in y_pred_proba_asc:
    tpr, fpr = get_tpr_fpr_pair(y_pred_proba, y_test, threshold=t)
    tpr_list.append(tpr)
    fpr_list.append(fpr)
Finally, once we have recorded each pair of tpr and fpr, we can plot them to get the ROC curve.
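The plotting code is not shown, so the following is one possible sketch with matplotlib. The two short lists are stand-in values so that the block runs on its own; in the post, `fpr_list` and `tpr_list` come from the loop above. The dashed diagonal and the styling choices are additions for readability, not something the original prescribes.

```python
import matplotlib.pyplot as plt

# Stand-in values for illustration; in the post these come from the loop above.
fpr_list = [1.0, 0.5, 0.5, 0.0]
tpr_list = [1.0, 1.0, 0.5, 0.5]

fig, ax = plt.subplots(figsize=(5, 5))
# Steps 3 and 4: plot each (fpr, tpr) pair and connect the dots.
ax.plot(fpr_list, tpr_list, marker="o", label="ROC curve")
# Diagonal reference line: the expected curve of a random classifier.
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend()
```

A curve that bows toward the top-left corner, well above the diagonal, indicates that the model separates the two classes better than random guessing.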