Technology for Humanity

Prediction of the 10-year Cardiovascular Heart Disease Using the Framingham Heart Study Data

Chapter 2. FHS Data

Step 1. Prerequisite

This application was developed with Django (4.2.0) as the backend server and HTML and JavaScript as the frontend. To replicate this application, all required libraries need to be installed.
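As a quick check before continuing, the sketch below reports which libraries cannot be imported in the current environment. The package list is an assumption based on the imports used in this chapter; it is not an official requirements list for the application.

```python
from importlib.util import find_spec

# Assumed list: top-level import names used in this chapter.
REQUIRED = ["django", "pandas", "seaborn", "tabulate"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name that is reported as not installed can then be installed with the package manager of your choice.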

Step 2. Import libraries (or packages)

First, import libraries.


import os
from main import settings
os.chdir(settings.BASE_DIR) # To run interactive mode
from tabulate import tabulate
import pandas as pd
import seaborn as sns # statistical data visualization library

Step 3. Import data (in CSV format)

The original data can be downloaded from "https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset". The data was saved in a local directory.


df = pd.read_csv('./data/framingham/dataset/framingham.csv')

The CSV file was imported into a Pandas DataFrame. Pandas is a Python library for data manipulation and analysis. It can read and write data in many file formats, so its portability is excellent. However, unlike the datasets of some statistical packages (e.g., SPSS, R, Stata), a Pandas DataFrame does not carry metadata such as variable labels and value labels. Pandas is appropriate for manipulating relatively small datasets (n < 1 million). When the dataset is larger (i.e., n > 1 million), other tools and data formats (e.g., Polars, Arrow, Parquet) are preferred.
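Because the DataFrame itself carries no variable labels, one simple workaround is to keep a small code book next to it. The sketch below is only an illustration: `CODEBOOK` and `describe_columns` are hypothetical names, and the labels are paraphrased from Table 1.

```python
import pandas as pd

# Hypothetical code book: labels paraphrased from Table 1.
CODEBOOK = {
    "male": "Subject's sex (1: Yes, 0: No)",
    "age": "Age in years",
    "TenYearCHD": "10-year risk of CHD (1: Yes, 0: No)",
}


def describe_columns(frame, codebook):
    """Pair each column's dtype with its label from the code book."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "label": [codebook.get(col, "") for col in frame.columns],
    })


# Tiny stand-in frame; the values are the first three rows of Table 2.
sample = pd.DataFrame({
    "male": [1, 0, 1],
    "age": [39, 46, 48],
    "TenYearCHD": [0, 0, 0],
})
summary = describe_columns(sample, CODEBOOK)
print(summary)
```

The same function can be applied to the full `df` once the code book covers all 16 variables.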

Table 1. Data description

Category	Variable	Code	Description
Demographic	male	1: Yes 0: No	Subject's sex
	age	32 - 70 years	Age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
	education	1 - 4	no further information provided
Behavioral	currentSmoker	1: Yes 0: No	whether or not the patient is a current smoker (Nominal)
	cigsPerDay	0 - 60 cigarettes	the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)
Medical history	BPMeds	1: Yes 0: No	whether or not the patient was on blood pressure medication (Nominal)
	prevalentStroke	1: Yes 0: No	whether or not the patient had previously had a stroke (Nominal)
	prevalentHyp	1: Yes 0: No	whether or not the patient was hypertensive (Nominal)
	diabetes	1: Yes 0: No	whether or not the patient had diabetes (Nominal)
Current medical condition	totChol	100 - 512 mg/dL	total cholesterol level (Continuous)
	sysBP	70 - 295 mmHg	systolic blood pressure (Continuous)
	diaBP	45 - 145 mmHg	diastolic blood pressure (Continuous)
	BMI	15.5 - 56.8 kg/m^2	Body Mass Index (Continuous)
	heartRate	44 - 143 pulse/min	heart rate (Continuous: in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
	glucose	40 - 394 mg/dL	glucose level (Continuous)
Target variable to predict	TenYearCHD	1: Yes 0: No	10 year risk of coronary heart disease (CHD)
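The ranges in Table 1 can also serve as a sanity check on the imported data. The following is a minimal sketch, assuming the ranges are inclusive; `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names, and the sample rows are made up for illustration (they are not FHS records).

```python
import pandas as pd

# Expected ranges copied from Table 1 (a subset, for illustration).
EXPECTED_RANGES = {
    "age": (32, 70),
    "sysBP": (70, 295),
    "glucose": (40, 394),
}


def out_of_range_counts(frame, ranges):
    """Count non-missing values outside the documented [low, high] range."""
    counts = {}
    for column, (low, high) in ranges.items():
        values = frame[column].dropna()
        counts[column] = int(((values < low) | (values > high)).sum())
    return counts


# Made-up rows, not FHS data: one age and one glucose value are out of range.
sample = pd.DataFrame({
    "age": [39, 46, 75],
    "sysBP": [106.0, 121.0, 127.5],
    "glucose": [77.0, None, 20.0],
})
result = out_of_range_counts(sample, EXPECTED_RANGES)
print(result)
```

Missing values are skipped here on purpose, since they are counted separately later in this chapter.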

Step 4. Check data

Note: Variable names were shortened to reduce the table width.

df.columns.to_list()
keys = ['male', 'age', 'educ', 'Smoker', 'n_cigs','BPMeds', 'Stroke',
 'Hyp', 'dm', 'Chol', 'sysBP', 'diaBP', 'BMI', 'Rate', 'gluc', 'CHD']
# Pass the list itself; the string 'keys' would print the original column names.
print(tabulate(df.head(10), headers=keys, showindex=False))

Table 2. Dataset (1st 10 records)

male	age	educ	Smoker	cigs	Hyp	Chol	SBP	DBP	BMI	Rate	gluc	CHD
1	39	4	0	0	0	195	106	70	26.97	80	77	0
0	46	2	0	0	0	250	121	81	28.73	95	76	0
1	48	1	1	20	0	245	127.5	80	25.34	75	70	0
0	61	3	1	30	1	225	150	95	28.58	65	103	1
0	46	3	1	23	0	285	130	84	23.1	85	85	0
0	43	2	0	0	1	228	180	110	30.3	77	99	0
0	63	1	0	0	0	205	138	71	33.11	60	85	1
0	45	2	1	20	0	313	100	71	21.68	79	78	0
1	52	1	0	0	1	260	141.5	89	26.36	76	79	0
1	43	1	1	30	1	225	162	107	23.61	93	88	0
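A positional list of headers depends on the column order staying fixed. As an alternative sketch, the short names can be attached by name with `rename`, which returns a renamed copy and leaves the original frame untouched. The pairing of original and short names below is an assumption based on the order of the two lists; `SHORT_NAMES` is a hypothetical name.

```python
import pandas as pd

# Assumed pairing of original column names with the shortened display names.
SHORT_NAMES = {
    "education": "educ",
    "currentSmoker": "Smoker",
    "cigsPerDay": "n_cigs",
    "prevalentStroke": "Stroke",
    "prevalentHyp": "Hyp",
    "diabetes": "dm",
    "totChol": "Chol",
    "heartRate": "Rate",
    "glucose": "gluc",
    "TenYearCHD": "CHD",
}

# Stand-in frame holding a few columns from the first two rows of Table 2.
sample = pd.DataFrame({
    "male": [1, 0],
    "education": [4.0, 2.0],
    "currentSmoker": [0, 0],
    "cigsPerDay": [0.0, 0.0],
})

# rename() ignores mapping keys that are absent and returns a new frame.
display = sample.rename(columns=SHORT_NAMES)
print(display)
```

Columns without an entry in the mapping, such as `male`, keep their original names.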

# data shape: Check the number of records and variables

df.shape

Out[25]: (4240, 16)

# data types: this is important, because columns that contain missing values are read as float64

df.dtypes

Table 3. Data types

Variable	Type
male	int64
age	int64
education	float64
currentSmoker	int64
cigsPerDay	float64
BPMeds	float64
prevalentStroke	int64
prevalentHyp	int64
diabetes	int64
totChol	float64
sysBP	float64
diaBP	float64
BMI	float64
heartRate	float64
glucose	float64
TenYearCHD	int64
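Several float64 columns in Table 3, such as education and BPMeds, hold whole-number codes. If integer codes are preferred, one option is the nullable `Int64` dtype in Pandas, which keeps missing entries as `<NA>`. This is a sketch of that option, not a step taken in this chapter; the sample values are made up.

```python
import pandas as pd

# Made-up values, not FHS data; None marks a missing entry.
sample = pd.DataFrame({
    "education": [4.0, None, 1.0],
    "BPMeds": [0.0, 1.0, None],
})
print(sample.dtypes)  # both columns are float64 because of the missing values

# Convert to the nullable integer dtype; missing entries become <NA>.
converted = sample.astype({"education": "Int64", "BPMeds": "Int64"})
print(converted.dtypes)
```

The conversion only succeeds when every non-missing value is a whole number, so it also acts as a check on the coding.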

# check for duplicates

duplicate_df = df[df.duplicated()]
len(duplicate_df) # number of rows that repeat an earlier row

Out[13]: 0
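To make the meaning of this count concrete, the sketch below applies `duplicated()` to a small made-up frame (not FHS data). By default only the later copies are flagged; `keep=False` flags every row that belongs to a duplicate group.

```python
import pandas as pd

# Made-up rows, not FHS data: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "age": [39, 46, 39],
    "sysBP": [106.0, 121.0, 106.0],
})

# Default keep='first': only the later copy (row 2) is flagged.
n_repeats = int(sample.duplicated().sum())

# keep=False: both rows of the duplicate pair are flagged.
n_involved = int(sample.duplicated(keep=False).sum())

# Drop the later copies and keep the first occurrence.
deduplicated = sample.drop_duplicates()

print(n_repeats, n_involved, len(deduplicated))
```

A result of 0 for the FHS data therefore means that no record repeats an earlier one across all 16 variables.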

# checking for missing values

missing = df.isna().sum().to_frame(name='Missing')
print(tabulate(missing, headers=['Variable', 'Missing'], tablefmt="html"))
null = df[df.isna().any(axis=1)] # rows that contain at least one missing value
null # Printout is not displayed here.

Table 4. Frequency of missing values

Variable	Missing
male	0
age	0
education	105
currentSmoker	0
cigsPerDay	29
BPMeds	53
prevalentStroke	0
prevalentHyp	0
diabetes	0
totChol	50
sysBP	0
diaBP	0
BMI	19
heartRate	1
glucose	388
TenYearCHD	0
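The counts in Table 4 are easier to judge as a share of the 4240 records. The sketch below redoes that arithmetic using only the counts reported above; `N_RECORDS` and `MISSING` are hypothetical names.

```python
# Number of records from df.shape and missing counts from Table 4.
N_RECORDS = 4240
MISSING = {
    "education": 105,
    "cigsPerDay": 29,
    "BPMeds": 53,
    "totChol": 50,
    "BMI": 19,
    "heartRate": 1,
    "glucose": 388,
}

# Percentage of records with a missing value, per variable.
percent_missing = {
    name: round(100 * count / N_RECORDS, 2) for name, count in MISSING.items()
}

# Print from the most to the least affected variable.
for name, pct in sorted(percent_missing.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<12}{pct:6.2f}%")

# Total number of missing cells across all variables.
total_missing_cells = sum(MISSING.values())
print("Total missing cells:", total_missing_cells)
```

Glucose stands out at about 9% missing, while every other variable is below 3%. The number of records with at least one missing value cannot be derived from these column totals, because one record may have several missing variables.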

Continue to Chapter 3. FHS Data Clean Up