Initial phenotypic cluster analysis#

We perform k-means clustering on the final scores of the following instruments:

Repetitive Behavior Scale - Revised (RBSR)
Developmental Coordination Disorder Questionnaire (DCDQ)
Child Behavior Checklist for ages 6 to 18 years (CBCL_6_18)
Social Communication Questionnaire (SCQ)

These instruments are selected because they have a final score feature. There are 17251 subjects after joining these features and removing subjects with missing values.
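The joining and filtering step described above can be illustrated with a minimal pandas sketch. The table layout, the `subject_id` key, the `*_final_score` column names, and the `join_final_scores` helper are all hypothetical stand-ins, not the actual SPARK schema or the project's code.

```python
from functools import reduce

import pandas as pd

# Toy per-instrument tables; the column names are assumptions for illustration.
rbsr = pd.DataFrame({"subject_id": ["a", "b", "c"], "rbsr_final_score": [12.0, 30.0, None]})
dcdq = pd.DataFrame({"subject_id": ["a", "b", "c"], "dcdq_final_score": [40.0, 55.0, 61.0]})
scq = pd.DataFrame({"subject_id": ["a", "b", "c", "d"], "scq_final_score": [18.0, 22.0, 9.0, 15.0]})


def join_final_scores(tables, key="subject_id"):
    """Inner-join the tables on the subject key, then drop rows with any missing score."""
    joined = reduce(lambda left, right: left.merge(right, on=key, how="inner"), tables)
    return joined.dropna().set_index(key)


features = join_final_scores([rbsr, dcdq, scq])
print(features)
```

In this toy data, subject `d` is dropped by the inner join because it appears in only one table, and subject `c` is dropped because one of its scores is missing.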

K-means on final scores#

K-means is run for \(k=2\) to \(k=15\), repeated 5 times each to calculate means and standard deviations for each metric.
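A repeated sweep of this kind might look like the following sketch, which uses scikit-learn. The `sweep_k_means` helper, the standardization step, the `n_init` value, and the use of the repeat index as the random seed are assumptions made for illustration; they are not taken from the `run_k_means` source. The output columns follow the `k`, `mean_<metric>`, `std_<metric>` naming described under Module contents.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def sweep_k_means(df, k_values=range(2, 16), repeats=5):
    """Fit k-means `repeats` times per k; report mean and std of each metric."""
    # Standardizing the features is an assumption, not confirmed by the source.
    X = StandardScaler().fit_transform(df.to_numpy())
    rows = []
    for k in k_values:
        runs = []
        for seed in range(repeats):
            model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
            labels = model.labels_
            runs.append({
                "inertia": model.inertia_,
                "silhouette": silhouette_score(X, labels),
                "davies_bouldin": davies_bouldin_score(X, labels),
                "calinski_harabasz": calinski_harabasz_score(X, labels),
            })
        runs = pd.DataFrame(runs)
        row = {"k": k}
        for name in runs.columns:
            row[f"mean_{name}"] = runs[name].mean()
            row[f"std_{name}"] = runs[name].std()
        rows.append(row)
    return pd.DataFrame(rows)


# Toy data standing in for the four final-score features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=["rbsr", "dcdq", "cbcl", "scq"])
metrics_df = sweep_k_means(toy, k_values=range(2, 6), repeats=3)
print(metrics_df[["k", "mean_inertia", "mean_silhouette"]])
```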

_static/images/k_means_metrics.png — Metrics (inertia, silhouette score, Davies-Bouldin index, Calinski-Harabasz index) for k-means clustering on the final scores of the instruments against k.#

Interpretation:

In the inertia plot, a sharp drop from \(k=2\) to \(k=6\) then a gradual taper is observed, showing the classic elbow shape. This suggests that adding clusters beyond 6 yields diminishing returns in reducing within-cluster variance.
In the silhouette score plot, values are around 0.15–0.20 across \(k\), which suggests weak cluster separation and possibly overlapping or continuous structure in the data. The small variation across \(k\) shows that there is no strongly preferred \(k\).
The lowest Davies-Bouldin index (lower is better) is observed at \(k=4\), making this the best candidate \(k\) by this metric.
In the Calinski-Harabasz index plot (higher is better), there is a steep decline through all \(k\).

The overall impression is that cluster structure is weak. The best candidate is \(k=4\), but the lack of distinct peaks in the metrics means that no particular \(k\) is strongly supported. There may be continuous structure or noise in the data.

Gaussian mixture model (GMM) on final scores#

We also run Gaussian mixture model (GMM) clustering for \(k=2\) to \(k=15\), repeated 5 times each to calculate means and standard deviations for each metric.
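As a rough sketch of the likelihood-based part of such a sweep, the following uses scikit-learn's `GaussianMixture`. The `sweep_gmm` helper, the default covariance type, and the seeding scheme are assumptions for illustration, not the `run_gmm` implementation. The silhouette, Davies-Bouldin, and Calinski-Harabasz scores could be computed from `gmm.predict(X)` in the same way as for k-means and are omitted here for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture


def sweep_gmm(X, k_values=range(2, 16), repeats=5):
    """Fit a GMM `repeats` times per k; report mean and std of log likelihood, AIC, BIC."""
    rows = []
    for k in k_values:
        runs = []
        for seed in range(repeats):
            gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            runs.append({
                # score() returns the mean per-sample log likelihood, so rescale by n.
                "log_likelihood": gmm.score(X) * len(X),
                "aic": gmm.aic(X),
                "bic": gmm.bic(X),
            })
        runs = pd.DataFrame(runs)
        row = {"k": k}
        for name in runs.columns:
            row[f"mean_{name}"] = runs[name].mean()
            row[f"std_{name}"] = runs[name].std()
        rows.append(row)
    return pd.DataFrame(rows)


# Toy data: two separated 2-D blobs standing in for real features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(6, 1, size=(150, 2))])
gmm_df = sweep_gmm(X, k_values=range(2, 5), repeats=3)
print(gmm_df[["k", "mean_log_likelihood", "mean_aic", "mean_bic"]])
```

Lower AIC and BIC indicate a better trade-off between fit and model complexity; BIC penalizes extra components more heavily than AIC whenever the sample has more than about seven points.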

_static/images/gmm_metrics.png — Metrics (log likelihood, AIC, BIC, silhouette score, Davies-Bouldin index, Calinski-Harabasz index) for GMM clustering on the final scores of the instruments against k.#

K-means on question features#

_static/images/questions_k_means_metrics.png — Metrics (inertia, silhouette score, Davies-Bouldin index, Calinski-Harabasz index) for k-means clustering on the question features of the instruments against k.#

GMM on question features#

_static/images/questions_gmm_metrics.png — Metrics (log likelihood, AIC, BIC, silhouette score, Davies-Bouldin index, Calinski-Harabasz index) for GMM clustering on the question features of the instruments against k.#

Module contents#

asd_strat.commands.initial_pheno.initial_pheno(feature_set: str = '.', spark_pathname: str = '.', cache_pathname: str = '.', output_pathname: str = '.') → None#

Runs clustering algorithms on phenotypic features from SPARK data and generates metrics plots.

Parameters:

feature_set – A string indicating which set of features to use for clustering.
spark_pathname – The SPARK data release directory pathname.
cache_pathname – The cache directory pathname.
output_pathname – The output directory pathname where plots will be saved.

asd_strat.commands.initial_pheno.plot_metrics_against_k(metrics_df: DataFrame, metric_keys: list[str], title: str) → Figure#

Generates a plot of specified metrics against different values of k, where each metric is displayed in its own subplot panel. The function allows for error bars based on the corresponding standard deviations.

Parameters:

metrics_df – A pandas DataFrame containing k values, the mean and standard deviation of each specified metric. The DataFrame must include columns named ‘k’, ‘mean_<metric>’, and ‘std_<metric>’ for all the specified metrics in metric_keys.
metric_keys – A list of strings, where each string is the name of a metric to be plotted. Metrics must have corresponding columns in metrics_df.
title – A string to specify the title of the generated figure.

Returns:

The plot figure.
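To make the expected DataFrame layout concrete, here is a minimal matplotlib sketch of a function with the same inputs and output. It is an illustration under assumptions (one row of panels, `errorbar` for the standard deviations, a hypothetical name `plot_metrics_sketch`), not the source of `plot_metrics_against_k`.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_metrics_sketch(metrics_df, metric_keys, title):
    """Plot mean_<metric> against k with std_<metric> error bars, one panel per metric."""
    fig, axes = plt.subplots(
        1, len(metric_keys), figsize=(4 * len(metric_keys), 3), squeeze=False
    )
    for ax, key in zip(axes[0], metric_keys):
        ax.errorbar(
            metrics_df["k"],
            metrics_df[f"mean_{key}"],
            yerr=metrics_df[f"std_{key}"],
            marker="o",
            capsize=3,
        )
        ax.set_xlabel("k")
        ax.set_ylabel(key)
    fig.suptitle(title)
    fig.tight_layout()
    return fig


# Made-up values purely to exercise the column naming convention.
demo = pd.DataFrame({
    "k": [2, 3, 4],
    "mean_inertia": [900.0, 700.0, 600.0],
    "std_inertia": [5.0, 4.0, 6.0],
    "mean_silhouette": [0.20, 0.18, 0.17],
    "std_silhouette": [0.01, 0.01, 0.02],
})
fig = plot_metrics_sketch(demo, ["inertia", "silhouette"], "Toy metrics against k")
```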

asd_strat.commands.initial_pheno.run_gmm(df: DataFrame, repeats: int = 5) → DataFrame#

Runs Gaussian Mixture Model (GMM) clustering on the provided DataFrame and computes various clustering metrics.

Parameters:

df – A dataframe containing the features to be clustered.
repeats – Number of times to repeat the GMM clustering for each k value.

Returns:

A dataframe containing the clustering metrics for each k value.

asd_strat.commands.initial_pheno.run_k_means(df: DataFrame, repeats: int = 5) → DataFrame#

Runs k-means clustering on the provided DataFrame and computes various clustering metrics.

Parameters:

df – A dataframe containing the features to be clustered.
repeats – Number of times to repeat the k-means clustering for each k value.

Returns:

A dataframe containing the clustering metrics for each k value.
