---
title: "Capstone Health Analytics Project using Big Autism Data"
author: "SOCR DSPA Training Project"
date: "`r format(Sys.time(), '%B %Y')`"
output:
  html_document:
    theme: spacelab
    highlight: tango
    toc: yes
    number_sections: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: yes
    code_folding: hide
  word_document:
    toc: yes
    toc_depth: '3'
---
This is an open-ended R&D project that SOCR/DSPA Trainees can complete. Creative solutions can be sent to the instructor ([Ivo D. Dinov](https://www.socr.umich.edu/people/dinov/)). Use the Autism Brain Imaging Data Exchange (ABIDE) data to design a meaningful biomedical study examining, characterizing, and contrasting normal and pathological (autism) brain neurodevelopment.
# Autism Brain Imaging Data Exchange (ABIDE) Data
These data consist of derived neuroimaging data, quality assessment (QA) metrics prefixed by *anat_* and *func_*, and manual quality assessment prefixed by *qc_*.
Automated QA Measures: These columns reflect automated metrics where outliers may be identified by a statistical procedure (e.g., $2\sigma$).
*Anatomical* measures:
- Contrast to Noise Ratio [*anat_cnr*]: mean of the gray matter values minus the mean of the white matter values, divided by the standard deviation of the air values.
- Entropy Focus Criterion [*anat_efc*]: Shannon’s entropy is used to summarize the principal directions distribution, higher entropy indicating the distribution is more uniform (i.e., less noisy).
- Foreground to Background Energy Ratio [*anat_fber*]: Mean energy of image values (i.e., mean of squares) within the head relative to outside the head.
- Smoothness of Voxels [*anat_fwhm*]: The full width at half maximum (FWHM) of the spatial distribution of the image intensity values in terms of voxels (e.g., a value of 3 implies smoothness of 3 voxels).
- Percent of Artifact Voxels [*anat_qi1*]: The proportion of voxels with intensity corrupted by artifacts normalized by the number of voxels in the background.
- Signal to Noise Ratio [*anat_snr*]: The mean of image values within gray matter divided by the standard deviation of the image values within air (i.e., outside the head).
*Functional* measures:
- Entropy Focus Criterion [*func_efc*]: Shannon’s entropy is used to summarize the principal directions distribution, higher entropy indicating the distribution is more uniform (i.e., less noisy).
- Foreground to Background Energy Ratio [*func_fber*]: Mean energy of image values (i.e., mean of squares) within the head relative to outside the head. Uses mean functional.
- Smoothness of Voxels [*func_fwhm*]: The full width at half maximum (FWHM) of the spatial distribution of the image intensity values. Uses mean functional.
- Standardized DVARS [*func_dvars*]: The spatial standard deviation of the temporal derivative of the data, normalized by the temporal standard deviation and temporal autocorrelation.
- Fraction of Outlier Voxels [*func_outlier*]: The mean fraction of outliers found in each volume using [`3dTout` command in AFNI](http://afni.nimh.nih.gov/afni).
- Mean Distance to Median Volume [*func_quality*]: The mean distance (1 – Spearman’s rho) between each time-point’s volume and the median volume using [AFNI’s `3dTqual` command](http://afni.nimh.nih.gov/afni).
- Mean Framewise Displacement (FD) [*func_mean_fd*]: A measure of subject head motion, which compares the motion between the current and previous volumes. This is calculated by summing the absolute value of displacement changes in the x, y and z directions and rotational changes about those three axes. The rotational changes are given distance values based on the changes across the surface of a 50mm radius sphere.
- Number FD greater than 0.2mm [*func_num_fd*]: The number of frames or volumes with displacement greater than 0.2mm.
- Percent FD greater than 0.2mm [*func_perc_fd*]: The percent of frames or volumes with displacement greater than 0.2mm.
- Ghost to Signal Ratio [*func_gsr*]: A measure of the mean signal in the ‘ghost’ image (signal present outside the brain due to acquisition in the phase encoding direction) relative to mean signal within the brain.

Manual QA Measures: Manual inspection of the data was carried out by three independent raters; these ratings are stored in the columns prefixed by *qc_*.
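
The framewise displacement (FD) definition given above for *func_mean_fd*, *func_num_fd*, and *func_perc_fd* can be made concrete with a small worked example. This is only a sketch under stated assumptions: the `motion` matrix is made-up toy data, its column layout (three translations in mm, three rotations in radians) is assumed, and `framewise_displacement()` is a hypothetical helper, not part of the ABIDE pipeline.

```{r warning=F, error=F, message=F}
# Toy realignment parameters (assumed layout): one row per volume,
# translations in mm (x, y, z) and rotations in radians (pitch, roll, yaw)
motion <- rbind(
  c(0.00, 0.00, 0.00, 0.000, 0.000, 0.000),
  c(0.10, 0.00, 0.00, 0.000, 0.000, 0.000),
  c(0.10, 0.05, 0.00, 0.002, 0.000, 0.000),
  c(0.40, 0.05, 0.10, 0.002, 0.004, 0.000)
)
colnames(motion) <- c("x", "y", "z", "pitch", "roll", "yaw")

# Hypothetical helper: FD between each volume and the previous one
framewise_displacement <- function(motion, radius_mm = 50) {
  d <- abs(diff(motion))              # absolute volume-to-volume changes
  d[, 4:6] <- d[, 4:6] * radius_mm    # rotations -> arc length on a 50mm sphere
  rowSums(d)                          # sum over the six motion parameters
}

fd <- framewise_displacement(motion)
mean_fd <- mean(fd)                   # analogue of func_mean_fd
num_fd  <- sum(fd > 0.2)              # analogue of func_num_fd
perc_fd <- 100 * mean(fd > 0.2)       # analogue of func_perc_fd
round(c(mean_fd = mean_fd, num_fd = num_fd, perc_fd = perc_fd), 3)
```

Here the percentage is taken over the volume-to-volume differences; whether the original pipeline divides by the number of differences or by the number of volumes is not stated in this document.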
More information and meta-data are available in the data-provenance DOCX in the [DSPA ABIDE Case-Study Folder](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/17_ABIDE_Autism_CaseStudy).
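
As an illustration of the $2\sigma$ rule mentioned for the automated QA measures, the sketch below flags values that lie more than two standard deviations from the column mean. The data frame `qa` is simulated (only the column names mimic ABIDE), and `flag_outliers()` is a hypothetical helper; other outlier rules (e.g., based on the median and MAD) may be preferable in practice.

```{r warning=F, error=F, message=F}
set.seed(1)
# Simulated QA metrics with one deliberately extreme last row (toy data)
qa <- data.frame(
  anat_cnr = c(rnorm(99, mean = 10, sd = 1), 25),
  anat_snr = c(rnorm(99, mean = 12, sd = 2), 40)
)

# Hypothetical helper: TRUE where a value is more than k SDs from the mean
flag_outliers <- function(x, k = 2) {
  abs(x - mean(x, na.rm = TRUE)) > k * sd(x, na.rm = TRUE)
}

outlier_flags <- sapply(qa, flag_outliers)   # logical matrix, one column per metric
colSums(outlier_flags)                       # number of flagged values per metric
which(rowSums(outlier_flags) > 0)            # rows flagged on at least one metric
```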
## Load in the data
```{r warning=F, error=F, message=F}
# install.packages("magrittr")
library(magrittr)
# load ABIDE data (ABIDE_Aggregated_Data.csv)
ABIDE_data <- read.csv('https://umich.instructure.com/files/20935287/download?download_frd=1', header=T)
dim(ABIDE_data) # 1098 2145
attach(ABIDE_data)
```
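With more than two thousand columns, it may help to group the QA variables by the *anat_*, *func_*, and *qc_* prefixes before modeling. The following sketch uses a tiny made-up data frame `toy` (its column names and values are assumptions for illustration); the same `grep()` calls could be applied to `colnames(ABIDE_data)`.

```{r warning=F, error=F, message=F}
# Toy stand-in for a few ABIDE columns (hypothetical names and values)
toy <- data.frame(
  researchGroup = c("Autism", "Control", "Control"),
  anat_cnr      = c(11.2, 9.8, 10.5),
  anat_snr      = c(12.1, 13.4, 11.9),
  func_mean_fd  = c(0.12, 0.31, 0.08),
  qc_rater_1    = c("OK", "fail", "OK")
)

# Regular expressions matching each family of QA columns
prefixes <- c(anat = "^anat_", func = "^func_", qc = "^qc_")
qa_columns <- lapply(prefixes, function(p) grep(p, colnames(toy), value = TRUE))

qa_columns                    # column names grouped by prefix
sapply(qa_columns, length)    # number of columns in each group
```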
## Data Modeling, EDA
```{r warning=F, error=F, message=F}
# Review the data element types
# colnames(ABIDE_data)
# Potential relevant Outcomes (Y)
table(ABIDE_data$researchGroup)
# Autism Control
# 528 570
table(ABIDE_data$subjectSex)
# Data Cleaning (QC)
# Replace the missing IQ values (coded as -9999, i.e., any negative value) with 30
ABIDE_data$iq <- replace(ABIDE_data$iq, ABIDE_data$iq<0, 30)
# Visualize the data
#table(ABIDE_data$iq)
library(plotly)
xLabel <- list(title = "Intelligence (IQ)")
yLabel <- list(title = "Frequency")
plot_ly(x = ~ABIDE_data$iq, type = "histogram") %>%
layout(xaxis = xLabel, yaxis = yLabel)
# MODEL the data
# Fit and plot linear models according to specified predictors and outcomes
fitPlot_LM_Model <- function (Y, X) {
# Y= outcome column name
# X= vector of predictor column names
### .......
# return (myPlot)
}
# Run the full model-fitting and display the model predictions
```
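The `fitPlot_LM_Model()` stub above is intentionally left for trainees to complete. As one possible starting point, the sketch below fits a linear model from an outcome name and a vector of predictor names. Everything here is an assumption for illustration: `fit_lm_sketch()` is a hypothetical helper (not the intended solution), the data frame `toy` is simulated, and the simulated effects are arbitrary rather than findings about ABIDE.

```{r warning=F, error=F, message=F}
set.seed(2)
n <- 200
# Simulated toy data; the relationships below are arbitrary
toy <- data.frame(
  age      = runif(n, 6, 40),
  anat_snr = rnorm(n, 12, 2)
)
toy$iq <- 100 + 0.3 * toy$age + 1.5 * (toy$anat_snr - 12) + rnorm(n, sd = 10)

# Hypothetical helper: Y = outcome column name, X = vector of predictor names
fit_lm_sketch <- function(data, Y, X) {
  model_formula <- reformulate(termlabels = X, response = Y)  # e.g., iq ~ age + anat_snr
  fit <- lm(model_formula, data = data)
  list(fit = fit, observed = data[[Y]], fitted = unname(fitted(fit)))
}

result <- fit_lm_sketch(toy, Y = "iq", X = c("age", "anat_snr"))
summary(result$fit)$coefficients

# Observed vs. fitted values; points near the dashed line are well predicted
plot(result$observed, result$fitted,
     xlab = "Observed outcome", ylab = "Fitted outcome")
abline(0, 1, lty = 2)
```

Passing the data frame explicitly, rather than relying on `attach()`, keeps the helper independent of the global workspace.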
# Predict research group (Autism vs. Control)
```{r warning=F, error=F, message=F}
# Logit modeling
```
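A minimal logit-modeling sketch, assuming the binary outcome is a column like `researchGroup` with levels `Control` and `Autism`. The data are simulated, the predictors and effect sizes are arbitrary choices for illustration, and the 0.5 classification threshold is an assumption.

```{r warning=F, error=F, message=F}
set.seed(3)
n <- 300
# Simulated toy predictors (not real ABIDE values)
toy <- data.frame(
  age          = runif(n, 6, 40),
  func_mean_fd = rexp(n, rate = 8)
)
# Simulated binary outcome with an arbitrary dependence on func_mean_fd
p <- plogis(-1 + 6 * toy$func_mean_fd)
toy$researchGroup <- factor(ifelse(rbinom(n, 1, p) == 1, "Autism", "Control"),
                            levels = c("Control", "Autism"))

# Logistic regression; with this level order the model estimates P(Autism)
logit_fit <- glm(researchGroup ~ age + func_mean_fd, data = toy, family = binomial)
summary(logit_fit)$coefficients

# In-sample predicted probabilities and a confusion table at a 0.5 threshold
pred_prob  <- predict(logit_fit, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, "Autism", "Control"),
                     levels = c("Control", "Autism"))
confusion <- table(Predicted = pred_class, Observed = toy$researchGroup)
accuracy  <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy
```

In-sample accuracy is optimistic; a train/test split or cross-validation would give a fairer estimate.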
# Multiple Imputation of incomplete Data
Introduce some missing-completely-at-random (MCAR) deletions. Impute the missing values and compare the simulated-missing data and models to their complete (original) counterparts.
```{r warning=F, error=F, message=F}
# Introduce simulated MCAR missingness
# Imputation
# Rhat convergence statistics compares the variance between chains to the variance
# within chains (similar to the ANOVA F-test).
# Rhat Values ~ 1.0 indicate likely convergence,
# Rhat Values > 1.1 indicate that the chains should be run longer
# (use large number of iterations)
# Compare the results of the complete-data models to the imputed-data models
# Plot the resulting models and quantify model differences
```
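The workflow outlined in the chunk above can be sketched in base R. Note the assumptions: the data are simulated, and the imputation step is a simplified stochastic regression imputation pooled with Rubin's rules. It stands in for a full multiple-imputation package and is not the method used in the DSPA notes. Because it runs no MCMC chains, the Rhat diagnostic mentioned above does not apply to this sketch.

```{r warning=F, error=F, message=F}
set.seed(4)
n <- 500
# Simulated complete data (toy values, arbitrary effect sizes)
complete <- data.frame(age = runif(n, 6, 40))
complete$iq <- 95 + 0.4 * complete$age + rnorm(n, sd = 12)

# Introduce MCAR missingness: each iq value is deleted with probability 0.2,
# independently of every observed and unobserved value
incomplete <- complete
is_missing <- runif(n) < 0.2
incomplete$iq[is_missing] <- NA

# Imputation model fitted on the observed cases (lm drops NA rows by default)
obs_fit   <- lm(iq ~ age, data = incomplete)
sigma_hat <- summary(obs_fit)$sigma
mu_miss   <- predict(obs_fit, newdata = incomplete[is_missing, , drop = FALSE])

# Create m imputed datasets and refit the analysis model on each
m <- 20
slopes     <- numeric(m)
within_var <- numeric(m)
for (i in seq_len(m)) {
  imputed <- incomplete
  imputed$iq[is_missing] <- mu_miss + rnorm(sum(is_missing), sd = sigma_hat)
  fit_i <- lm(iq ~ age, data = imputed)
  slopes[i]     <- coef(fit_i)["age"]
  within_var[i] <- vcov(fit_i)["age", "age"]
}

# Rubin's rules: pooled estimate and total variance (within + between)
pooled_slope <- mean(slopes)
pooled_se    <- sqrt(mean(within_var) + (1 + 1 / m) * var(slopes))

# Compare with the model fitted on the complete (original) data
complete_slope <- coef(lm(iq ~ age, data = complete))["age"]
round(c(complete = unname(complete_slope), pooled = pooled_slope, pooled_se = pooled_se), 3)
```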
# Mixture Distribution modeling
Using the more elaborate [data mixture distribution modeling in DSPA Chapter 3](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/03_DataVisualization.html#616_Mixture_distribution_data_modeling) as a guide, develop some forward prediction models.
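
As a warm-up for mixture modeling, here is a minimal sketch of the EM algorithm for a two-component Gaussian mixture, written in base R. It assumes a one-dimensional variable and exactly two components. The data `x` are simulated, and the fixed 200 iterations replace a proper convergence check. The DSPA chapter uses more elaborate tools, which this sketch does not reproduce.

```{r warning=F, error=F, message=F}
set.seed(5)
# Simulated one-dimensional data drawn from two Gaussian components (toy values)
x <- c(rnorm(300, mean = 85, sd = 8), rnorm(200, mean = 115, sd = 10))

# Initial guesses: equal weights, means at the quartiles, common SD
pi1   <- 0.5
mu    <- as.numeric(quantile(x, c(0.25, 0.75)))
sigma <- rep(sd(x), 2)

for (iter in 1:200) {
  # E-step: posterior probability that each point belongs to component 1
  d1 <- pi1 * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)

  # M-step: update weight, means, and SDs using the responsibilities
  pi1   <- mean(r1)
  mu    <- c(sum(r1 * x) / sum(r1), sum((1 - r1) * x) / sum(1 - r1))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1)))
}

round(c(weight1 = pi1, mean1 = mu[1], mean2 = mu[2], sd1 = sigma[1], sd2 = sigma[2]), 2)
```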
# Unsupervised clustering
Try some of the DSPA unsupervised clustering and classification techniques on the ABIDE dataset.
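
One simple starting point is k-means clustering on standardized features. The sketch below uses simulated data with two well-separated groups; the column names only mimic ABIDE QA metrics, and the choice of two clusters is an assumption for illustration rather than a recommendation.

```{r warning=F, error=F, message=F}
set.seed(6)
# Simulated features for two latent groups of 50 cases each (toy data)
toy <- data.frame(
  anat_snr     = c(rnorm(50, 10, 1), rnorm(50, 15, 1)),
  func_mean_fd = c(rnorm(50, 0.30, 0.05), rnorm(50, 0.10, 0.03))
)
simulated_group <- rep(c("A", "B"), each = 50)

# Standardize so that both features contribute on a comparable scale
scaled <- scale(toy)

# k-means with several random starts to reduce sensitivity to initialization
km <- kmeans(scaled, centers = 2, nstart = 25)

# Cross-tabulate the recovered clusters against the simulated groups
cluster_table <- table(Cluster = km$cluster, Simulated = simulated_group)
cluster_table
km$size
```

Cluster labels are arbitrary, so agreement should be judged by how cleanly each simulated group maps to one cluster, not by the label numbers.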
# Venture beyond ...
Think out-of-the-box in this interactive-learning project using the ABIDE data. Try to use the RMD source and the provided data to experiment with novel AI/ML techniques. Think of ways to **augment these data** (e.g., increase the sample size and the feature richness).
# References
- [DSPA Techniques](https://dspa.predictive.space/).
- [Autism ABIDE Dataset](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/17_ABIDE_Autism_CaseStudy).