Technology for Humanity
Fallacy of Symptom-Based Inference (Diagnosis) and Prediction (Prognosis)

Prediction of the 10-year Cardiovascular Heart Disease Using the Framingham Heart Study Data


Chapter 2. FHS Data


Step 1. Prerequisite

This application was developed with Django (4.2.0) as the backend server and HTML and JavaScript as the frontend. To replicate it, all required libraries need to be installed.
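Before continuing, it can help to confirm that the libraries are actually installed. The following is a minimal sketch, not part of the original application: the package list is an assumption inferred from the imports used in this chapter, and `find_missing` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed package list, inferred from the imports used in this chapter.
REQUIRED = ["Django", "pandas", "seaborn", "tabulate"]


def find_missing(packages):
    """Return the names in `packages` that are not installed."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)  # raises if the package is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Missing packages:", find_missing(REQUIRED) or "none")
```

Any package reported as missing can then be installed with the usual package manager before running the steps below.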

Step 2. Import libraries (or packages)

First, import libraries.


import os

import pandas as pd
import seaborn as sns  # statistical plotting library built on matplotlib
from tabulate import tabulate

from main import settings  # Django project settings

os.chdir(settings.BASE_DIR)  # work from the project root when running interactively

Step 3. Import data (in CSV format)

The original data can be downloaded from "https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset". The data was saved in a local directory.


df = pd.read_csv('./data/framingham/dataset/framingham.csv')

The CSV file was imported into a Pandas DataFrame. Pandas is a Python library for data manipulation and numerical analysis. It can read and write data in many file formats, so its portability is excellent. However, unlike some statistical packages (e.g., SPSS, R, Stata), Pandas does not store metadata such as variable and value labels with the dataset. Pandas is appropriate for relatively small datasets (n < a million records). When the data is much larger (i.e., n > a million), other tools and formats (e.g., Polars, Arrow, Parquet) are preferred.
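For files too large to load at once, Pandas itself can still process a CSV in pieces through the `chunksize` argument of `read_csv`. The sketch below illustrates the idea on a tiny in-memory stand-in (the first five `age` and `TenYearCHD` values from Table 2); `count_rows_in_chunks` is a hypothetical helper, not part of the original application.

```python
import io

import pandas as pd

# In-memory stand-in for a large CSV file (first five records of Table 2).
csv_text = "age,TenYearCHD\n39,0\n46,0\n48,0\n61,1\n46,0\n"


def count_rows_in_chunks(source, chunksize):
    """Count rows and CHD cases without loading the whole file at once."""
    n_rows = 0
    n_chd = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        n_rows += len(chunk)
        n_chd += int(chunk["TenYearCHD"].sum())
    return n_rows, n_chd


print(count_rows_in_chunks(io.StringIO(csv_text), chunksize=2))
```

With a real file, `source` would be the file path, and `chunksize` would be chosen to fit comfortably in memory.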

Table 1. Data description

| Category | Variable | Code / Range | Description |
|---|---|---|---|
| Demographic | male | 1: Yes; 0: No | Subject's sex |
| | age | 32 - 70 years | Age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous) |
| | education | 1 - 4 | No further information provided |
| Behavioral | currentSmoker | 1: Yes; 0: No | Whether or not the patient is a current smoker (Nominal) |
| | cigsPerDay | 0 - 60 cigarettes | The number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette) |
| Medical history | BPMeds | 1: Yes; 0: No | Whether or not the patient was on blood pressure medication (Nominal) |
| | prevalentStroke | 1: Yes; 0: No | Whether or not the patient had previously had a stroke (Nominal) |
| | prevalentHyp | 1: Yes; 0: No | Whether or not the patient was hypertensive (Nominal) |
| | diabetes | 1: Yes; 0: No | Whether or not the patient had diabetes (Nominal) |
| Current medical condition | totChol | 100 - 512 mg/dL | Total cholesterol level (Continuous) |
| | sysBP | 70 - 295 mmHg | Systolic blood pressure (Continuous) |
| | diaBP | 45 - 145 mmHg | Diastolic blood pressure (Continuous) |
| | BMI | 15.5 - 56.8 kg/m^2 | Body Mass Index (Continuous) |
| | heartRate | 44 - 143 beats/min | Heart rate (in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values) |
| | glucose | 40 - 394 mg/dL | Glucose level (Continuous) |
| Target variable to predict | TenYearCHD | 1: Yes; 0: No | 10-year risk of coronary heart disease (CHD) |
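The documented ranges in Table 1 can double as a simple validity check on the imported data. The sketch below is illustrative only: `RANGES` copies a subset of the ranges from Table 1, `out_of_range_counts` is a hypothetical helper, and the sample frame uses the first three records of Table 2 plus one deliberately invalid row added for demonstration.

```python
import pandas as pd

# Documented ranges copied from Table 1 (a subset of the variables).
RANGES = {
    "age": (32, 70),
    "totChol": (100, 512),
    "sysBP": (70, 295),
    "heartRate": (44, 143),
}


def out_of_range_counts(df, ranges):
    """Count, per variable, the non-missing values outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts


# First three records of Table 2, plus one made-up invalid row (last entry).
sample = pd.DataFrame({
    "age": [39, 46, 48, 25],
    "totChol": [195, 250, 245, 600],
    "sysBP": [106, 121, 127.5, 120],
    "heartRate": [80, 95, 75, 70],
})
print(out_of_range_counts(sample, RANGES))
# {'age': 1, 'totChol': 1, 'sysBP': 0, 'heartRate': 0}
```

Applied to the full dataset, the same call would be `out_of_range_counts(df, RANGES)`.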

Step 4. Check data

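Note: Variable names were shortened to reduce the column widths.

df.columns.to_list()
# Shortened column names, used only as display headers
keys = ['male', 'age', 'educ', 'Smoker', 'n_cigs', 'BPMeds', 'Stroke',
        'Hyp', 'dm', 'Chol', 'sysBP', 'diaBP', 'BMI', 'Rate', 'gluc', 'CHD']
# Pass the list itself: headers='keys' (a string) would print the original column names
print(tabulate(df.head(10), headers=keys, showindex=False))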

Table 2. Dataset (1st 10 records)

male  age  educ  Smoker  cigs  Meds  Stro  Hyp  dm  Chol  SBP    DBP  BMI    Rate  gluc  CHD
1     39   4     0       0     0     0     0    0   195   106    70   26.97  80    77    0
0     46   2     0       0     0     0     0    0   250   121    81   28.73  95    76    0
1     48   1     1       20    0     0     0    0   245   127.5  80   25.34  75    70    0
0     61   3     1       30    0     0     1    0   225   150    95   28.58  65    103   1
0     46   3     1       23    0     0     0    0   285   130    84   23.1   85    85    0
0     43   2     0       0     0     0     1    0   228   180    110  30.3   77    99    0
0     63   1     0       0     0     0     0    0   205   138    71   33.11  60    85    1
0     45   2     1       20    0     0     0    0   313   100    71   21.68  79    78    0
1     52   1     0       0     0     0     1    0   260   141.5  89   26.36  76    79    0
1     43   1     1       30    0     0     1    0   225   162    107  23.61  93    88    0

# data shape: Check the number of records and variables

df.shape

Out[25]: (4240, 16)

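# data types: check these before any analysis
# Columns with missing values (e.g., education, BPMeds) are stored as float64 even when every recorded value is a whole number.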

df.dtypes

Table 3. Data types

Variable Type
male int64
age int64
education float64
currentSmoker int64
cigsPerDay float64
BPMeds float64
prevalentStroke int64
prevalentHyp int64
diabetes int64
totChol float64
sysBP float64
diaBP float64
BMI float64
heartRate float64
glucose float64
TenYearCHD int64
dtype: object
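Several variables in Table 3 that are conceptually integers (for example `education`) appear as float64. In Pandas, a missing value (NaN) forces an otherwise integer column to a float dtype. The following sketch, using a made-up toy column rather than the real data, shows the effect and one possible remedy, the nullable `Int64` dtype.

```python
import pandas as pd

# Toy column mimicking `education`: whole numbers with one missing value.
education = pd.Series([4, 2, None, 3], name="education")
print(education.dtype)  # float64, because NaN forces a float column

# The nullable integer dtype keeps the missing value as <NA>.
education_int = education.astype("Int64")
print(education_int.dtype)  # Int64
```

Whether to convert depends on the later analysis; many modeling libraries expect plain float or int columns, so this is a display and bookkeeping convenience rather than a required step.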

# check for duplicates

# duplicated() flags every repeat of an earlier row; the sum is the number of duplicate records
df.duplicated().sum()

Out[13]: 0
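One caveat when counting duplicates: `DataFrame.value_counts()` drops rows that contain missing values by default, so a count obtained through `value_counts()` can miss duplicate records that include a NaN, whereas `duplicated().sum()` counts them. The toy frame below (made-up values, not the real data) is a small sketch of the difference.

```python
import pandas as pd

# Made-up records: rows 1 and 2 are identical, and so are rows 3 and 4,
# but the second pair contains a missing glucose value.
toy = pd.DataFrame({
    "age": [39, 46, 46, 61, 61],
    "glucose": [77.0, 76.0, 76.0, None, None],
})

# Direct count of duplicate rows (NaN is treated as equal to NaN here).
n_dup = int(toy.duplicated().sum())

# Count via value_counts(), which skips rows containing NaN by default.
n_dup_value_counts = int(toy[toy.duplicated()].value_counts().sum())

print(n_dup, n_dup_value_counts)  # 2 1
```

Because this dataset has missing values in several variables, the direct `duplicated().sum()` count is the safer of the two.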

# checking for missing values

missing = df.isna().sum().to_frame(name='Missing')
print(tabulate(missing, headers=['Variable', 'Missing'], tablefmt="html"))
null = df[df.isna().any(axis=1)]  # rows that contain at least one missing value
null  # output not shown here

Table 4. Frequency of missing values

Variable Missing
male 0
age 0
education 105
currentSmoker 0
cigsPerDay 29
BPMeds 53
prevalentStroke 0
prevalentHyp 0
diabetes 0
totChol 50
sysBP 0
diaBP 0
BMI 19
heartRate 1
glucose 388
TenYearCHD 0
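The raw counts in Table 4 are easier to judge as percentages of the 4240 records. The sketch below recomputes them from the counts in Table 4 and the record count reported by `df.shape`; the names `N_RECORDS` and `summary` are introduced here for illustration only.

```python
import pandas as pd

N_RECORDS = 4240  # number of records reported by df.shape

# Non-zero missing counts copied from Table 4.
missing = pd.Series(
    {
        "education": 105,
        "cigsPerDay": 29,
        "BPMeds": 53,
        "totChol": 50,
        "BMI": 19,
        "heartRate": 1,
        "glucose": 388,
    },
    name="Missing",
)

summary = missing.to_frame()
summary["Percent"] = (summary["Missing"] / N_RECORDS * 100).round(2)
summary = summary.sort_values("Missing", ascending=False)
print(summary)
```

Glucose stands out at roughly 9% missing, while every other variable is below 3%. On the real DataFrame, `df.isna().mean() * 100` gives the same percentages directly.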

Continue to Chapter 3. FHS Data Clean Up