Fallacy of Symptom-Based Inference (Diagnosis) and Prediction (Prognosis)

Exploring the Prediction Model of the Self-Management of Chronic Conditions Using NHANES


Technology for Humanity

Chapter 1. Data Structure


Characteristics

NHANES datasets have several distinctive characteristics.

  1. Nationally Representative Sample: The sampling scheme is complex (a multistage, stratified design). National population statistics can be estimated using the sampling weight assigned to each participant.
  2. Blacks Oversampled: Blacks are about 11% of the national population, but they are oversampled at almost twice that proportion. National statistics, therefore, must take the sampling weights into account.
  3. Repeated Cross-Sectional Data: NHANES conducts surveys in two-year cycles, drawing a new sample each cycle. This allows researchers to track population-level changes and trends in health and nutrition over time, although the same individuals are not followed from cycle to cycle.
  4. Comprehensive Health Data: NHANES collects data on a wide range of topics related to health and nutrition, including demographics, diet and nutrition, physical activity, chronic conditions, reproductive health, dental health, and environmental exposures.
  5. In-Depth Data Collection: NHANES uses standardized protocols to collect data through in-person interviews, physical examinations, laboratory tests, and dietary recalls. This ensures high-quality data that can be used for accurate analysis.
  6. Publicly Available Data: Most NHANES data are publicly available online through the CDC website. This allows for transparency and reproducibility of research findings.
  7. Large Sample Size: Each survey cycle includes approximately 10,000 participants of all ages. This large sample size allows for more precise estimates and subgroup analysis.
  8. Utilization of Biomarkers: In addition to self-reported data, NHANES collects biomarker data from blood and urine samples. This provides objective measures of health outcomes and exposure to environmental factors.
  9. Integration with Other Datasets: Some data are restricted to only authorized researchers. With the restricted data, researchers can link NHANES data with other datasets (e.g., mortality data and medical expenses) to conduct more comprehensive research.
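
To see why the sampling weight matters for points 1 and 2, here is a minimal sketch with made-up numbers (not NHANES values). An oversampled group pulls the unweighted mean toward itself, while the weighted mean restores each group's population share. The groups, outcomes, and weights below are all hypothetical.

```python
# Hypothetical toy sample: (group, has_condition, sampling_weight).
# Group "B" is oversampled, so each of its records represents fewer people.
sample = [
    ("A", 1, 3000.0),
    ("A", 0, 3000.0),
    ("A", 0, 3000.0),
    ("B", 1, 1000.0),
    ("B", 1, 1000.0),
    ("B", 0, 1000.0),
]

def unweighted_mean(records):
    """Plain sample proportion; every record counts the same."""
    return sum(y for _, y, _ in records) / len(records)

def weighted_mean(records):
    """Proportion in which each record counts in proportion to its weight."""
    total_weight = sum(w for _, _, w in records)
    return sum(y * w for _, y, w in records) / total_weight

print(unweighted_mean(sample))          # 0.5
print(round(weighted_mean(sample), 4))  # 0.4167
```

This covers point estimates only. Standard errors for a complex design also need the stratum and sampling-unit information, which this sketch ignores.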

Challenges

NHANES data were first collected before the age of machine learning and Python. SAS (Statistical Analysis System) was, and still is, used for data storage and analysis because, at that time, it could handle data with sampling weights. In the age of statistics, SAS was an excellent analysis tool, but in the age of machine learning, it poses several challenges:

  1. Limited Metadata: The SAS files mostly store coded numeric values, so the metadata (data about data, such as what each variable and code means) must be kept separately in headers or data dictionaries. Metadata is a critical component in machine learning. CDC provides the metadata in HTML format, and reconstructing it into ML-usable formats requires a significant amount of web scraping.
  2. Static Dataset: It takes at least several years for a cycle of NHANES data to become available to the public. For example, the last reliable, published data are from the 2017-2018 cycle, about 6 years ago. In the ML age, that kind of lag is hard to justify.
  3. Incompatibility with Machine Learning: SAS data sets cannot be used directly in ML models, mostly because of the metadata issue: without metadata, an ML pipeline cannot interpret or process the variables correctly.
  4. Few Labels: Not all NHANES data are labeled for prediction. There are many reasons for this, including privacy concerns. The available labels are grouped into categories such as diabetes and anemia, and some labels have fewer than 10 people in the dataset.
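
As an illustration of the scraping work mentioned in point 1, the sketch below pulls variable names and labels out of a codebook-style HTML fragment using only the standard library. The fragment is a hypothetical, simplified stand-in; the real CDC pages may use different tags and attributes, so the parser would need to be adapted after inspecting them.

```python
from html.parser import HTMLParser

# Hypothetical, simplified fragment in the style of a codebook page.
SAMPLE_HTML = """
<h3 id="SEQN">SEQN - Respondent sequence number</h3>
<dl>
  <dt>Variable Name:</dt><dd>SEQN</dd>
  <dt>SAS Label:</dt><dd>Respondent sequence number</dd>
</dl>
<h3 id="RIAGENDR">RIAGENDR - Gender</h3>
<dl>
  <dt>Variable Name:</dt><dd>RIAGENDR</dd>
  <dt>SAS Label:</dt><dd>Gender</dd>
</dl>
"""

class CodebookParser(HTMLParser):
    """Collect <dt>/<dd> pairs into one dict per <dl> block."""

    def __init__(self):
        super().__init__()
        self.variables = []  # one dict per variable
        self._tag = None     # "dt" or "dd" while inside that element
        self._key = None     # text of the most recent <dt>

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            self.variables.append({})
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.variables:
            return
        if self._tag == "dt":
            self._key = text.rstrip(":")
        elif self._tag == "dd" and self._key:
            self.variables[-1][self._key] = text

parser = CodebookParser()
parser.feed(SAMPLE_HTML)
for variable in parser.variables:
    print(variable)
```

Each resulting dict can then be written to whatever store holds the metadata, keyed by variable name.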

My Approach to the Challenges

The challenges listed above might be daunting, but they are not impossible to tackle. Here is my approach:

  1. Convert SAS data into more usable formats such as CSV, JSON, Pickle, and Arrow.
  2. For the metadata issue, I wrote a Python library that loads the metadata into PostgreSQL tables. This metadata will be the basis for prompt engineering and fine-tuning in the future.
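
Step 1 might look roughly like the sketch below, which mirrors a folder of XPORT files into a parallel folder of CSV files. It is not the library mentioned above. It assumes pandas is installed, and the output folder name nhanes_csv and all function names are hypothetical.

```python
from pathlib import Path

def target_path(src: Path, src_root: Path, dst_root: Path, suffix: str) -> Path:
    """Mirror src (located under src_root) into dst_root with a new suffix."""
    return (dst_root / src.relative_to(src_root)).with_suffix(suffix)

def convert_xpt_to_csv(src: Path, dst: Path) -> None:
    """Read one SAS XPORT file and write it out as CSV."""
    import pandas as pd  # imported here so the path helper works without pandas

    df = pd.read_sas(src, format="xport")
    dst.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(dst, index=False)

def convert_tree(src_root: Path, dst_root: Path) -> None:
    """Convert every .XPT file found under src_root."""
    for src in sorted(src_root.rglob("*.XPT")):
        convert_xpt_to_csv(src, target_path(src, src_root, dst_root, ".csv"))

example = target_path(
    Path("nhanes/1999-2000/demo/DEMO.XPT"), Path("nhanes"), Path("nhanes_csv"), ".csv"
)
print(example.as_posix())  # nhanes_csv/1999-2000/demo/DEMO.csv
```

Swapping df.to_csv for df.to_pickle or df.to_json would produce the other formats in the same way.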

Task 1. How Many Files and Variables?

There are two components to NHANES data: SAS XPORT data and HTML metadata. Data and metadata must be read together in order to use the data in machine learning. The numbers of files and variables in the Demography, Dietary, Examination, Laboratory, and Questionnaire areas are summarized below.

Table 1. Summary of NHANES data and metadata, 1999-2018

Area            Files   Variables   Total (Files + Variables)
Demography         10         542         552
Dietary            20       2,355       2,375
Examination       168      17,109      17,277
Laboratory        676      10,744      11,420
Questionnaire     433      13,261      13,694
Total           1,307      44,011      45,318
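
The Files column of a summary like Table 1 could be tallied with a short directory walk. The sketch below assumes a local layout of root/cycle/area/file and builds a tiny temporary folder to demonstrate; the function name is hypothetical and this is not the code used to produce the table.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_files_by_area(root: Path) -> Counter:
    """Count .XPT files per area folder, assuming root/<cycle>/<area>/<file>."""
    counts = Counter()
    for path in root.glob("*/*/*.XPT"):
        counts[path.parent.name] += 1
    return counts

# Demonstration on a throwaway folder with three empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "nhanes"
    for rel in ("1999-2000/demo/DEMO.XPT", "1999-2000/ques/ACQ.XPT", "2001-2002/ques/ACQ_B.XPT"):
        placeholder = root / rel
        placeholder.parent.mkdir(parents=True, exist_ok=True)
        placeholder.touch()
    demo_counts = count_files_by_area(root)

print(sorted(demo_counts.items()))  # [('demo', 1), ('ques', 2)]
```

Counting variables would additionally require opening each file and reading its column names.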

Note: All files share one common variable, "SEQN", the unique identification number with which all files may be combined. Some files contain multiple records for the same SEQN, for example records of vitamin and dietary supplements and records of prescription drugs. These files are not included in the above summary; they are compiled separately when necessary.
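
The two situations in the note can be sketched in plain Python: one-row-per-person files are combined on SEQN, while multi-record files are first grouped by SEQN. All rows below are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict

# Hypothetical rows; the values are illustrative, not real NHANES records.
demo = [{"SEQN": 1, "RIAGENDR": 2}, {"SEQN": 2, "RIAGENDR": 1}]
exam = [{"SEQN": 1, "BMXWT": 12.5}, {"SEQN": 2, "BMXWT": 75.4}]
drugs = [  # several records for the same SEQN
    {"SEQN": 2, "drug": "drug_a"},
    {"SEQN": 2, "drug": "drug_b"},
]

def merge_on_seqn(*tables):
    """Combine one-row-per-person tables into one dict per SEQN."""
    merged = defaultdict(dict)
    for table in tables:
        for row in table:
            merged[row["SEQN"]].update(row)
    return dict(merged)

def group_by_seqn(table):
    """Collect all records of a multi-record table under their SEQN."""
    grouped = defaultdict(list)
    for row in table:
        grouped[row["SEQN"]].append({k: v for k, v in row.items() if k != "SEQN"})
    return dict(grouped)

people = merge_on_seqn(demo, exam)
print(people[1])             # {'SEQN': 1, 'RIAGENDR': 2, 'BMXWT': 12.5}
print(group_by_seqn(drugs))  # {2: [{'drug': 'drug_a'}, {'drug': 'drug_b'}]}
```

Keeping the multi-record files separate avoids silently duplicating a person's row once per supplement or prescription.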

Task 2. Data Download

NHANES data are stored as SAS XPORT files on the CDC website. The files are open to the public and are supposed to be available 24/7, but I prefer local copies for speedy processing. I wrote a download function (which may be available upon request) and stored the data in a folder named nhanes.
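
The download function itself is not shown here, so the following is only a sketch of what such a helper could look like, not the original. The base URL is a placeholder rather than the real CDC address, and the function names and the opener parameter are hypothetical; the opener makes it possible to try the code without network access.

```python
import io
import shutil
import tempfile
import urllib.request
from pathlib import Path

# Assumption: placeholder base URL; check the CDC site for the actual layout.
BASE_URL = "https://example.invalid/nhanes"

def file_url(cycle: str, filename: str, base_url: str = BASE_URL) -> str:
    """Build the (assumed) URL of one data file."""
    return f"{base_url}/{cycle}/{filename}"

def download(url: str, dest: Path, opener=urllib.request.urlopen) -> Path:
    """Stream url into dest, skipping files that already exist locally."""
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")  # avoid leaving half-written files
    with opener(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest

# Usage with a fake opener, so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "nhanes" / "1999-2000" / "demo" / "DEMO.XPT"
    download(file_url("1999-2000", "DEMO.XPT"), dest, opener=lambda url: io.BytesIO(b"fake bytes"))
    saved = dest.read_bytes()

print(saved)  # b'fake bytes'
```

Writing to a temporary .part file and renaming it means an interrupted download is retried on the next run instead of being mistaken for a complete file.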

Note: NHANES data have no built-in warning or notification about updates or changes. A ListServ announces changes, but each update must then be processed manually. An alternative is to write a function that regularly checks the 'Date Published' field before running an analysis.
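
One way to implement the second option is to keep a small local manifest of the last seen 'Date Published' value per file and compare it with freshly collected values. The sketch below assumes the current dates have already been gathered into a dict (how they are scraped is not shown); the manifest file name, function names, and dates are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def changed_files(manifest_path: Path, current: dict) -> list:
    """Return file names that are new or whose 'Date Published' has changed."""
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    return sorted(name for name, date in current.items() if previous.get(name) != date)

def save_manifest(manifest_path: Path, current: dict) -> None:
    """Store the latest 'Date Published' values for the next comparison."""
    manifest_path.write_text(json.dumps(current, indent=2, sort_keys=True))

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "published.json"
    # Hypothetical dates recorded at the time of the last download.
    save_manifest(manifest, {"DEMO.XPT": "June 2002", "ACQ.XPT": "June 2002"})
    # Hypothetical dates as they might be collected on a later day.
    latest = {"DEMO.XPT": "September 2009", "ACQ.XPT": "June 2002", "BPQ.XPT": "June 2002"}
    stale = changed_files(manifest, latest)

print(stale)  # ['BPQ.XPT', 'DEMO.XPT']
```

Only the files in the returned list need to be downloaded and processed again.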

The folders under nhanes are as follows. Up to five files are shown per folder, for the first five cycles.



nhanes/
├─────1999-2000/
│     ├─────demo/
│     │     └─────DEMO.XPT
│     ├─────diet/
│     │     ├─────DRXFMT.XPT
│     │     ├─────DRXIFF.XPT
│     │     ├─────DRXTOT.XPT
│     │     ├─────DSBI.XPT
│     │     └─────DSII.XPT
│     ├─────exam/
│     │     ├─────AUX1.XPT
│     │     ├─────AUXAR.XPT
│     │     ├─────AUXTYM.XPT
│     │     ├─────BAX.XPT
│     │     └─────BIX.XPT
│     ├─────labo/
│     │     ├─────L02HBS.XPT
│     │     ├─────L02HPA_A.XPT
│     │     ├─────LAB02.XPT
│     │     ├─────LAB03.XPT
│     │     └─────LAB04.XPT
│     └─────ques/
│           ├─────ACQ.XPT
│           ├─────ALQ.XPT
│           ├─────AUQ.XPT
│           ├─────BAQ.XPT
│           └─────BPQ.XPT
├─────2001-2002/
│     ├─────demo/
│     │     └─────DEMO_B.XPT
│     ├─────diet/
│     │     ├─────DRXFMT_B.XPT
│     │     ├─────DRXIFF_B.XPT
│     │     ├─────DRXTOT_B.XPT
│     │     ├─────DSBI.XPT
│     │     └─────DSII.XPT
│     ├─────exam/
│     │     ├─────AUXAR_B.XPT
│     │     ├─────AUXTYM_B.XPT
│     │     ├─────AUX_B.XPT
│     │     ├─────BAX_B.XPT
│     │     └─────BIX_B.XPT
│     ├─────labo/
│     │     ├─────L02HBS_B.XPT
│     │     ├─────L02HPA_B.XPT
│     │     ├─────L02_B.XPT
│     │     ├─────L03_B.XPT
│     │     └─────L04VOC_B.XPT
│     └─────ques/
│           ├─────ACQ_B.XPT
│           ├─────ALQ_B.XPT
│           ├─────AUQ_B.XPT
│           ├─────BAQ_B.XPT
│           └─────BPQ_B.XPT
├─────2003-2004/
│     ├─────demo/
│     │     └─────DEMO_C.XPT
│     ├─────diet/
│     │     ├─────DR1IFF_C.XPT
│     │     ├─────DR1TOT_C.XPT
│     │     ├─────DR2IFF_C.XPT
│     │     ├─────DR2TOT_C.XPT
│     │     └─────DRXFCD_C.XPT
│     ├─────exam/
│     │     ├─────AUXAR_C.XPT
│     │     ├─────AUXTYM_C.XPT
│     │     ├─────AUX_C.XPT
│     │     ├─────BAX_C.XPT
│     │     └─────BIX_C.XPT
│     ├─────labo/
│     │     ├─────L02HBS_C.XPT
│     │     ├─────L02HPA_C.XPT
│     │     ├─────L02_C.XPT
│     │     ├─────L03_C.XPT
│     │     └─────L04PER_C.XPT
│     └─────ques/
│           ├─────ACQ_C.XPT
│           ├─────ALQ_C.XPT
│           ├─────AUQ_C.XPT
│           ├─────BAQ_C.XPT
│           └─────BPQ_C.XPT
├─────2005-2006/
│     ├─────demo/
│     │     └─────DEMO_D.XPT
│     ├─────diet/
│     │     ├─────DR1IFF_D.XPT
│     │     ├─────DR1TOT_D.XPT
│     │     ├─────DR2IFF_D.XPT
│     │     ├─────DR2TOT_D.XPT
│     │     └─────DRXFCD_D.XPT
│     ├─────exam/
│     │     ├─────AUXAR_D.XPT
│     │     ├─────AUXTYM_D.XPT
│     │     ├─────AUX_D.XPT
│     │     ├─────BMX_D.XPT
│     │     └─────BPX_D.XPT
│     ├─────labo/
│     │     ├─────ALB_CR_D.XPT
│     │     ├─────ALDUST_D.XPT
│     │     ├─────AL_IGE_D.XPT
│     │     ├─────AMDGYD_D.XPT
│     │     └─────B12_D.XPT
│     └─────ques/
│           ├─────ACQ_D.XPT
│           ├─────AGQ_D.XPT
│           ├─────ALQ_D.XPT
│           ├─────AUQ_D.XPT
│           └─────BHQ_D.XPT
└─────2007-2008/
      ├─────demo/
      │     └─────DEMO_E.XPT
      ├─────diet/
      │     ├─────DR1IFF_E.XPT
      │     ├─────DR1TOT_E.XPT
      │     ├─────DR2IFF_E.XPT
      │     ├─────DR2TOT_E.XPT
      │     └─────DRXFCD_E.XPT
      ├─────exam/
      │     ├─────AUXAR_E.XPT
      │     ├─────AUXTYM_E.XPT
      │     ├─────AUX_E.XPT
      │     ├─────BMX_E.XPT
      │     └─────BPX_E.XPT
      ├─────labo/
      │     ├─────ALB_CR_E.XPT
      │     ├─────APOB_E.XPT
      │     ├─────BFRPOL_E.XPT
      │     ├─────BIOPRO_E.XPT
      │     └─────CARB_E.XPT
      └─────ques/
            ├─────ACQ_E.XPT
            ├─────ALQ_E.XPT
            ├─────AQQ_E.XPT
            ├─────AUQ_E.XPT
            └─────BHQ_E.XPT
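
A listing in this style can be generated from the local folder with a short recursive function. This is a minimal sketch with a hypothetical function name; it uses shorter branch characters than the listing above and keeps at most five files per folder.

```python
import tempfile
from pathlib import Path

def tree_lines(root: Path, max_files: int = 5, prefix: str = "") -> list:
    """Return tree-style lines for root, keeping at most max_files files per folder."""
    dirs = sorted(p for p in root.iterdir() if p.is_dir())
    files = sorted(p for p in root.iterdir() if p.is_file())[:max_files]
    entries = dirs + files
    lines = []
    for i, entry in enumerate(entries):
        last = i == len(entries) - 1
        branch = "└── " if last else "├── "
        lines.append(prefix + branch + entry.name + ("/" if entry.is_dir() else ""))
        if entry.is_dir():
            child_prefix = prefix + ("    " if last else "│   ")
            lines.extend(tree_lines(entry, max_files, child_prefix))
    return lines

# Demonstration on a throwaway folder holding seven empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "demo").mkdir()
    for i in range(7):
        (root / "demo" / f"F{i}.XPT").touch()
    lines = tree_lines(root)

print("\n".join(lines))
```

With the seven placeholder files, only F0.XPT through F4.XPT appear under demo/.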

Continued in Chapter 1.1. Demographic