Exploring the Prediction Model of the Self-Management of Chronic Conditions Using NHANES
Technology for Humanity
Chapter 1. Data Structure
Characteristics
NHANES datasets are distinctive in the following ways.
- Nationally Representative Sample: The sampling scheme is complex (e.g.,
multistage stratified sampling). National population estimates can be computed using the sampling
weight assigned to each participant.
- Oversampling of Black Participants: Black Americans are about 11% of the total population
in the nation, but they have been oversampled at almost twice their proportion. National statistics,
therefore, should take the sampling weight into account.
- Repeated Survey Cycles: NHANES conducts surveys in two-year cycles, each with a newly
drawn sample. This allows researchers to track population-level changes and trends in
health and nutrition over time, although individual participants are not followed across cycles.
- Comprehensive Health Data: NHANES collects data on a wide range
of topics related to health and nutrition, including demographics, diet and nutrition, physical activity,
chronic conditions, reproductive health, dental health, and environmental exposures.
- In-Depth Data Collection: NHANES uses standardized protocols to collect
data through in-person interviews, physical examinations, laboratory tests, and dietary recalls.
This ensures high-quality data that can be used for accurate analysis.
- Publicly Available Data: Most NHANES data are publicly available online
through the CDC website. This allows for transparency and reproducibility of research findings.
- Large Sample Size: Each survey cycle includes approximately 10,000
participants of all ages. This large sample size allows for more precise estimates and subgroup analysis.
- Utilization of Biomarkers: In addition to self-reported data, NHANES
collects biomarker data from blood and urine samples. This provides objective measures of health outcomes
and exposure to environmental factors.
- Integration with Other Datasets: Some data are restricted to
authorized researchers. With the restricted data, researchers can link NHANES data with other datasets
(e.g., mortality data and medical expenses) to conduct more comprehensive research.
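To make the point about sampling weights concrete, here is a minimal sketch of a weighted estimate. The data are a hypothetical toy sample, not NHANES records. In real files the weight would come from a survey weight variable such as WTMEC2YR, and proper standard errors would also need the survey design variables, which this sketch does not cover.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of values, using one weight per observation."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical toy sample, not NHANES data: 1 = has the condition, 0 = does not.
# The first four respondents come from an oversampled group, so each one
# represents fewer people (smaller weight) than the last two respondents.
has_condition = [1, 1, 0, 0, 0, 0]
weights = [100, 100, 100, 100, 800, 800]

unweighted = sum(has_condition) / len(has_condition)
weighted = weighted_mean(has_condition, weights)

print(f"unweighted share: {unweighted:.3f}")
print(f"weighted share:   {weighted:.3f}")
```

In this toy case the unweighted share overstates the condition because the oversampled group carries it. The weighted share corrects for that by letting each respondent count in proportion to the people they represent.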
Challenges
NHANES data were first collected before the age of machine learning and Python. SAS
(Statistical Analysis System) was, and still is, used for data storage and analysis because,
at that time, it could handle data with sampling weights. In the age of statistics, SAS was an excellent
analysis tool, but in the age of machine learning, it poses several challenges:
- Limited Metadata: The SAS files mostly store coded values, so the metadata must be
stored separately in headers or data dictionaries. Metadata is data about data, such as variable
descriptions and value labels, and it is a critical component in machine learning. CDC provides the
metadata in HTML format. Reconstructing it into ML-usable formats requires a significant amount of
web scraping work.
- Static Dataset: It takes at least several years for a cycle of NHANES
data to become available to the public. For example, the last reliable and published data were from the
2017-2018 cycle, about 6 years ago. In the ML age, that kind of lag may be hard to justify.
- Incompatibility with Machine Learning: SAS data sets cannot be used
directly in ML models, mostly because of the metadata issue. Metadata is a critical component in machine
learning to understand and process variables.
- Few Labels: Not all NHANES data are labeled for prediction.
There are many reasons for this, including privacy concerns. The available labels are grouped into
categories such as diabetes and anemia, and some labels apply to fewer than 10 people in the dataset.
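The metadata problem can be illustrated with a small sketch. The data file holds only codes, and a separate data dictionary is needed to turn them into something readable. The codebook below is a hand-written, hypothetical excerpt (the RIAGENDR coding follows the public NHANES documentation, while the SEQN values are made up), and `decode` is an illustrative helper, not part of any library.

```python
# Hypothetical, hand-written excerpt of a data dictionary.
CODEBOOK = {
    "RIAGENDR": {"label": "Gender", "values": {1: "Male", 2: "Female"}},
}


def decode(variable, code, codebook=CODEBOOK):
    """Translate a coded value into its label; return the code unchanged if unknown."""
    entry = codebook.get(variable)
    if entry is None:
        return code
    return entry["values"].get(code, code)


# Made-up records. Numeric codes are written as floats on purpose,
# because that is how numeric columns usually arrive from SAS files.
records = [
    {"SEQN": 1, "RIAGENDR": 1.0},
    {"SEQN": 2, "RIAGENDR": 2.0},
]

decoded = [{name: decode(name, value) for name, value in row.items()} for row in records]
print(decoded)
```

Without the codebook, a model or a reader sees only 1.0 and 2.0. With it, the same column carries its meaning along.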
My Approach to the Challenges
The challenges listed above might be daunting, but they are not impossible to tackle.
Here is my approach:
- Convert SAS data into more usable formats such as CSV, JSON, Pickle, and Arrow format.
- For the metadata issue, I wrote a Python library to incorporate metadata into a
PostgreSQL table. This metadata will be the basis of prompt engineering and fine-tuning in the future.
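As a rough sketch of the conversion step, the snippet below reads one SAS XPORT file with pandas and writes it in another format. It is not the library mentioned above. The function names and the suffix mapping are assumptions for illustration, and the Arrow branch (written here as a Feather file) additionally needs the pyarrow package.

```python
from pathlib import Path

# Hypothetical mapping from a format name to a file suffix.
SUFFIXES = {"csv": ".csv", "json": ".json", "pickle": ".pkl", "arrow": ".feather"}


def output_path(xpt_path, fmt):
    """Return the path of the converted file, placed next to the source file."""
    if fmt not in SUFFIXES:
        raise ValueError(f"unsupported format: {fmt}")
    return Path(xpt_path).with_suffix(SUFFIXES[fmt])


def convert_xpt(xpt_path, fmt="csv"):
    """Read one SAS XPORT file and write it in the requested format."""
    import pandas as pd  # imported here so output_path works without pandas

    df = pd.read_sas(xpt_path, format="xport")
    out = output_path(xpt_path, fmt)
    if fmt == "csv":
        df.to_csv(out, index=False)
    elif fmt == "json":
        df.to_json(out, orient="records")
    elif fmt == "pickle":
        df.to_pickle(out)
    else:
        df.to_feather(out)  # Arrow-based format; requires pyarrow
    return out
```

A call such as `convert_xpt("nhanes/1999-2000/demo/DEMO.XPT", "csv")` would write `DEMO.csv` next to the source file, assuming that file exists locally.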
Task 1. How Many Files and Variables?
There are two components to NHANES data: SAS XPORT data and HTML metadata.
Data and metadata must be read together in order to use the data in machine learning. The numbers of
files and variables in the Demography, Dietary, Examination, Laboratory, and Questionnaire areas are
summarized below.
Table 1. Summary of NHANES data and metadata, 1999-2018
Area | Files | Variables | Total
Demography | 10 | 542 | 552
Dietary | 20 | 2,355 | 2,375
Examination | 168 | 17,109 | 17,277
Laboratory | 676 | 10,744 | 11,420
Questionnaire | 433 | 13,261 | 13,694
Total | 1,307 | 44,011 | 45,318
Note: All files share one common variable, "SEQN", the unique identification number by
which files may be combined. Some files contain multiple records for the same SEQN, for example records of
vitamin and dietary supplements and records of prescription drugs. These files are not
included in the above summary. However, they are compiled separately when necessary.
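Combining files on SEQN can be sketched as a simple left join. The rows below are made-up values, not NHANES records, and `merge_on_seqn` is an illustrative helper. It assumes the right-hand file has at most one record per SEQN and refuses files that do not, since those need to be compiled separately.

```python
def merge_on_seqn(left, right):
    """Left-join two lists of row dicts on SEQN (right: at most one row per SEQN)."""
    index = {}
    for row in right:
        if row["SEQN"] in index:
            raise ValueError(f"multiple records for SEQN {row['SEQN']}")
        index[row["SEQN"]] = row
    merged = []
    for row in left:
        combined = dict(row)
        combined.update(index.get(row["SEQN"], {}))
        merged.append(combined)
    return merged


# Made-up rows standing in for a demographic file and an examination file.
demo = [{"SEQN": 1, "RIDAGEYR": 34}, {"SEQN": 2, "RIDAGEYR": 61}]
exam = [{"SEQN": 1, "BMXBMI": 27.4}]

merged = merge_on_seqn(demo, exam)
print(merged)
```

Participants missing from the right-hand file keep their left-hand columns only, which mirrors the fact that not every participant appears in every file.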
Task 2. Data Download
NHANES data are stored as SAS XPORT files on the CDC website. Although the files are open to the public
and supposed to be available 24/7, we prefer local copies for speedy processing. We wrote a download
function (which may be available upon request) and stored the data in a folder named nhanes.
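For readers who want to build something similar, here is a minimal sketch of such a downloader. It is not our function. The helper names are hypothetical, and the caller must supply the file URL taken from the CDC website, since this sketch makes no assumption about the URL pattern. It only mirrors the nhanes/cycle/area/file layout used in this chapter.

```python
from pathlib import Path
from urllib.request import urlretrieve


def local_path(cycle, area, filename, root="nhanes"):
    """Build the local path for a file, e.g. nhanes/1999-2000/demo/DEMO.XPT."""
    return Path(root) / cycle / area / filename


def download(url, cycle, area, root="nhanes"):
    """Download one file into the local tree unless it is already there."""
    filename = url.rsplit("/", 1)[-1]
    dest = local_path(cycle, area, filename, root)
    if dest.exists():
        return dest  # keep the existing local copy
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

Skipping files that already exist keeps repeated runs cheap, but it also means an updated file on the server is not picked up automatically.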
Note: NHANES data have no built-in warning or notification about
updates or changes. A ListServ announces changes, but the update must then be processed manually.
Another option is to write a function that regularly checks the 'Date Published' field before running the analysis.
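The date check can be sketched as a comparison between the dates recorded at download time and the dates currently shown on the website. How the published dates are collected is left out here. The function name and the sample dates are hypothetical.

```python
from datetime import date


def files_needing_refresh(local_dates, published_dates):
    """Return names of files that are new or published after the local copy's date."""
    stale = []
    for name, published in published_dates.items():
        local = local_dates.get(name)
        if local is None or published > local:
            stale.append(name)
    return sorted(stale)


# Made-up dates for illustration only.
local_dates = {"DEMO.XPT": date(2020, 1, 1), "BMX.XPT": date(2020, 1, 1)}
published_dates = {
    "DEMO.XPT": date(2020, 1, 1),
    "BMX.XPT": date(2021, 6, 1),
    "BPX.XPT": date(2021, 6, 1),
}

print(files_needing_refresh(local_dates, published_dates))
```

Running this kind of check before each analysis gives a short list of files to download again, instead of refreshing the whole folder.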
The folders under nhanes are as follows. Only the first five files in each folder are shown.
nhanes/
├─────1999-2000/
│ ├─────demo/
│ │ └─────DEMO.XPT
│ ├─────diet/
│ │ ├─────DRXFMT.XPT
│ │ ├─────DRXIFF.XPT
│ │ ├─────DRXTOT.XPT
│ │ ├─────DSBI.XPT
│ │ └─────DSII.XPT
│ ├─────exam/
│ │ ├─────AUX1.XPT
│ │ ├─────AUXAR.XPT
│ │ ├─────AUXTYM.XPT
│ │ ├─────BAX.XPT
│ │ └─────BIX.XPT
│ ├─────labo/
│ │ ├─────L02HBS.XPT
│ │ ├─────L02HPA_A.XPT
│ │ ├─────LAB02.XPT
│ │ ├─────LAB03.XPT
│ │ └─────LAB04.XPT
│ └─────ques/
│   ├─────ACQ.XPT
│   ├─────ALQ.XPT
│   ├─────AUQ.XPT
│   ├─────BAQ.XPT
│   └─────BPQ.XPT
├─────2001-2002/
│ ├─────demo/
│ │ └─────DEMO_B.XPT
│ ├─────diet/
│ │ ├─────DRXFMT_B.XPT
│ │ ├─────DRXIFF_B.XPT
│ │ ├─────DRXTOT_B.XPT
│ │ ├─────DSBI.XPT
│ │ └─────DSII.XPT
│ ├─────exam/
│ │ ├─────AUXAR_B.XPT
│ │ ├─────AUXTYM_B.XPT
│ │ ├─────AUX_B.XPT
│ │ ├─────BAX_B.XPT
│ │ └─────BIX_B.XPT
│ ├─────labo/
│ │ ├─────L02HBS_B.XPT
│ │ ├─────L02HPA_B.XPT
│ │ ├─────L02_B.XPT
│ │ ├─────L03_B.XPT
│ │ └─────L04VOC_B.XPT
│ └─────ques/
│   ├─────ACQ_B.XPT
│   ├─────ALQ_B.XPT
│   ├─────AUQ_B.XPT
│   ├─────BAQ_B.XPT
│   └─────BPQ_B.XPT
├─────2003-2004/
│ ├─────demo/
│ │ └─────DEMO_C.XPT
│ ├─────diet/
│ │ ├─────DR1IFF_C.XPT
│ │ ├─────DR1TOT_C.XPT
│ │ ├─────DR2IFF_C.XPT
│ │ ├─────DR2TOT_C.XPT
│ │ └─────DRXFCD_C.XPT
│ ├─────exam/
│ │ ├─────AUXAR_C.XPT
│ │ ├─────AUXTYM_C.XPT
│ │ ├─────AUX_C.XPT
│ │ ├─────BAX_C.XPT
│ │ └─────BIX_C.XPT
│ ├─────labo/
│ │ ├─────L02HBS_C.XPT
│ │ ├─────L02HPA_C.XPT
│ │ ├─────L02_C.XPT
│ │ ├─────L03_C.XPT
│ │ └─────L04PER_C.XPT
│ └─────ques/
│   ├─────ACQ_C.XPT
│   ├─────ALQ_C.XPT
│   ├─────AUQ_C.XPT
│   ├─────BAQ_C.XPT
│   └─────BPQ_C.XPT
├─────2005-2006/
│ ├─────demo/
│ │ └─────DEMO_D.XPT
│ ├─────diet/
│ │ ├─────DR1IFF_D.XPT
│ │ ├─────DR1TOT_D.XPT
│ │ ├─────DR2IFF_D.XPT
│ │ ├─────DR2TOT_D.XPT
│ │ └─────DRXFCD_D.XPT
│ ├─────exam/
│ │ ├─────AUXAR_D.XPT
│ │ ├─────AUXTYM_D.XPT
│ │ ├─────AUX_D.XPT
│ │ ├─────BMX_D.XPT
│ │ └─────BPX_D.XPT
│ ├─────labo/
│ │ ├─────ALB_CR_D.XPT
│ │ ├─────ALDUST_D.XPT
│ │ ├─────AL_IGE_D.XPT
│ │ ├─────AMDGYD_D.XPT
│ │ └─────B12_D.XPT
│ └─────ques/
│   ├─────ACQ_D.XPT
│   ├─────AGQ_D.XPT
│   ├─────ALQ_D.XPT
│   ├─────AUQ_D.XPT
│   └─────BHQ_D.XPT
└─────2007-2008/
  ├─────demo/
  │ └─────DEMO_E.XPT
  ├─────diet/
  │ ├─────DR1IFF_E.XPT
  │ ├─────DR1TOT_E.XPT
  │ ├─────DR2IFF_E.XPT
  │ ├─────DR2TOT_E.XPT
  │ └─────DRXFCD_E.XPT
  ├─────exam/
  │ ├─────AUXAR_E.XPT
  │ ├─────AUXTYM_E.XPT
  │ ├─────AUX_E.XPT
  │ ├─────BMX_E.XPT
  │ └─────BPX_E.XPT
  ├─────labo/
  │ ├─────ALB_CR_E.XPT
  │ ├─────APOB_E.XPT
  │ ├─────BFRPOL_E.XPT
  │ ├─────BIOPRO_E.XPT
  │ └─────CARB_E.XPT
  └─────ques/
    ├─────ACQ_E.XPT
    ├─────ALQ_E.XPT
    ├─────AQQ_E.XPT
    ├─────AUQ_E.XPT
    └─────BHQ_E.XPT
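Given a local folder laid out like the tree above, the number of files per cycle and area can be tallied with a short script. This is a sketch under the assumption that every data file sits exactly at nhanes/cycle/area/NAME.XPT. The function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_xpt_files(root):
    """Count .XPT files per (cycle, area) under a root laid out as root/cycle/area/file."""
    counts = Counter()
    for path in Path(root).glob("*/*/*.XPT"):
        cycle, area = path.parts[-3], path.parts[-2]
        counts[(cycle, area)] += 1
    return counts
```

For example, `count_xpt_files("nhanes")[("1999-2000", "demo")]` would return 1 for the tree shown above, where that folder holds only DEMO.XPT.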
Continued to Chapter 1.1. Demographic