Technology for Humanity
Fallacy of Symptom-Based Inference (Diagnosis) and Prediction (Prognosis)

Prediction of the 10-year Cardiovascular Heart Disease Using the Framingham Heart Study Data


Chapter 4. Train-Test Split

Machine learning is distinctive in its two-step process of model validation: the model is (1) trained on one portion of the data and (2) tested on another. The dataset is therefore split into two subsets, one for training and one for testing. A third subset, often called a validation set, may be set aside for tuning the model before the final test.
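Before applying the split to our data, here is a minimal sketch of such a three-way split; it uses synthetic stand-in arrays (an assumption, since the real X and y are defined in the next block) and chains two calls to train_test_split:

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real feature matrix and target (illustration only).
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50) % 2

# First hold out 20% as the final test set ...
X_tmp, X_test_d, y_tmp, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=29)
# ... then carve a validation set out of the remaining 80%
# (0.25 of 80% = 20% of the full data, giving a 60/20/20 split).
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=29)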

Import libraries.


from sklearn.model_selection import train_test_split
# Define the target variable (y) and the feature matrix (X)
y = df_scaled['TenYearCHD']
X = df_scaled.drop(['TenYearCHD'], axis = 1)

Data is split into two subsets of X and y each, resulting in four subsets. Important: X_train and y_train must have the same length (number of records) and the same row order. Likewise, X_train and X_test must contain the same variables in the same order; a short alignment check follows the length checks below. Once split, the subsets need no further manipulation.

Split into train and test subsets: 80% / 20%.


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=29)

Make sure the lengths of X_train and y_train are the same.


len(X_train) # Independent variables in the train subset
len(y_train) # Target variable in the train subset
len(X_test) # Independent variables in the test subset

Output:
Out[81]: 2999
Out[82]: 2999
Out[83]: 750
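Because train_test_split returns pandas objects that keep their original row index, the alignment mentioned above can also be verified directly. This is a sketch added for illustration, not part of the original workflow:

# The train features and target must match row for row,
# and the train and test features must share the same columns in the same order.
assert len(X_train) == len(y_train)
assert (X_train.index == y_train.index).all()
assert list(X_train.columns) == list(X_test.columns)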


# Checking balance of outcome variable


target_count = df_scaled.TenYearCHD.value_counts()
print('Class 0:', target_count[0])
print('Class 1:', target_count[1])
print('Proportion:', round(target_count[0] / target_count[1], 2), ': 1')
Output:
Class 0: 3178
Class 1: 571
Proportion: 5.57 : 1
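Figure 1 can be reproduced with a countplot on the unbalanced dataframe. This is a sketch mirroring the Figure 2 code later in this chapter, assuming seaborn and matplotlib are available:

import seaborn as sns
import matplotlib.pyplot as plt

# Bar chart of the raw class frequencies in the unbalanced data.
sns.countplot(df_scaled, x='TenYearCHD', palette="OrRd")
plt.box(False)
plt.xlabel('Heart Disease No/Yes', fontsize=11)
plt.ylabel('Count', fontsize=11)
plt.show()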

Figure 1. Frequency counts of heart disease outcome

""" Valancing Instances
We can use the undersampling technique to reduce the instances of the overrepresented class in the dataset. In our scenario, these methods will bring down the number of fraudulent transactions in our data to a balanced ratio of approximately 50:50. By ensuring equal representation, we prevent classification algorithms from being biased towards the majority class. This prevents misleading results where it may appear that your algorithm is performing exceptionally well (overfitting), while it is actually just consistently predicting the majority class.
We can achieve a balance between the majority and minority classes by randomly selecting observations from the majority class and removing them. Another method is matching where X features of the minor class are matched with those of the major class. Matching is computationally difficult but useful when the length (instances) of the minor class is very small (< 100). """
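As an alternative to the manual undersampling below, the imbalanced-learn package offers RandomUnderSampler, which performs the same random removal in one call. This is a sketch assuming imbalanced-learn is installed:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class rows until both classes have equal counts.
rus = RandomUnderSampler(random_state=42)
X_bal, y_bal = rus.fit_resample(
    df_scaled.drop(['TenYearCHD'], axis=1), df_scaled['TenYearCHD'])
print(Counter(y_bal))  # expect equal counts for both classes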


# First separate the dataset by class and randomly select 611 cases (close to the 571 minority cases) from the majority class.


# Shuffle the dataframe
shuffled_df = df_scaled.sample(frac=1, random_state=4)
# Put all the CHD-positive cases in a separate dataset.
CHD_df = shuffled_df.loc[shuffled_df['TenYearCHD'] == 1]
# Randomly select 611 observations from the CHD-negative (majority) class.
non_CHD_df = shuffled_df.loc[shuffled_df['TenYearCHD'] == 0].sample(n=611, random_state=42)

# Concatenate the two dataframes into one balanced dataset.


normalized_df = pd.concat([CHD_df, non_CHD_df])

# Define y and X again


y = normalized_df['TenYearCHD']
X = normalized_df.drop(['TenYearCHD'], axis = 1)

Split into train and test subsets: 80% / 20%.


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=29)

# Plot the new class counts


sns.countplot(normalized_df, x='TenYearCHD', palette="OrRd")
plt.box(False)
plt.xlabel('Heart Disease No/Yes', fontsize=11)
plt.ylabel('Count',fontsize=11)
plt.show()

Figure 2. Balanced CHD outcomes

Model Pipeline

A model pipeline is the sequential, systematic process by which machine learning models are developed, trained, evaluated, and deployed. It provides a framework for turning raw data into useful insights or predictions. The pipeline begins by defining clear objectives and identifying the relevant variables; data preprocessing steps such as cleaning, normalization, and feature extraction are then applied to improve the quality of the input data.

Once prepared, the dataset is split into training and testing sets, and a model architecture is chosen from algorithms such as decision trees or neural networks. During training, the algorithm learns patterns from the labeled data through iterative optimization until it reaches acceptable performance, for example high accuracy or low loss. The model is then evaluated on unseen test data to assess its generalizability and to surface potential biases in its predictions. If the results are satisfactory, the model can be deployed to make real-time predictions on new incoming data with minimal human intervention.

Throughout this process, diligent documentation maintains transparency about the methods used and facilitates collaboration among the stakeholders building predictive solutions in fields such as healthcare diagnostics and financial forecasting.
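The comparison below wraps each classifier in a one-step Pipeline. In practice a pipeline usually chains preprocessing with the estimator so that preprocessing is fit on the training data only. Here is a minimal two-step sketch (the StandardScaler step is purely illustrative, since df_scaled was already scaled in an earlier chapter):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Two-step pipeline: the scaler is fit only on the training data,
# so no information from the test set leaks into preprocessing.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),        # illustrative; our data is pre-scaled
    ('classifier', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print('Test accuracy: {:.2f}%'.format(pipe.score(X_test, y_test) * 100))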


# Import libraries


from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Testing the accuracy of each model using Pipeline


classifiers = [LogisticRegression(), SVC(), DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=2)]
for classifier in classifiers:
    pipe = Pipeline(steps=[('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print("The accuracy score of {0} is: {1:.2f}%".format(classifier,
        (pipe.score(X_test, y_test)*100)))

Accuracy:
Accuracy indicates how often the classifier is correct: Accuracy = (True Positives + True Negatives) / Total.

Output:
The accuracy score of LogisticRegression() is: 68.35%
The accuracy score of SVC() is: 65.82%
The accuracy score of DecisionTreeClassifier() is: 54.01%
The accuracy score of KNeighborsClassifier(n_neighbors=2) is: 56.12%
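To connect the formula above to these scores, accuracy can be recomputed from the confusion matrix of any one model. This sketch refits the logistic regression pipeline for illustration:

from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[('classifier', LogisticRegression())])
pipe.fit(X_train, y_train)

# ravel() unpacks the 2x2 matrix as TN, FP, FN, TP for binary labels.
tn, fp, fn, tp = confusion_matrix(y_test, pipe.predict(X_test)).ravel()
print('Accuracy = (TP + TN) / Total = {:.2f}%'.format(
    (tp + tn) / (tp + tn + fp + fn) * 100))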


Continue to Chapter 5. Model Evaluation