Paper reproduction#

We attempt to reproduce the results of the Litman et al. paper titled Decomposition of phenotypic heterogeneity in autism reveals underlying genetic programs [1]. The code and scripts used in the paper were accessed from GitHub.

Feature selection#

The list of features used in the paper was obtained by injecting a code-generation snippet into the preprocessing script [2] at line 127, before the dataframes were integrated.

print("\n".join(["SCQ_FEATURES = ["] + [f"    Feat.SCQ_{col.upper()}," for col in scqdf.columns] + ["]"]))
print("\n".join(["BHC_FEATURES = ["] + [f"    Feat.BHC_{col.upper()}," for col in bhcdf.columns] + ["]"]))
print("\n".join(["BHS_FEATURES = ["] + [f"    Feat.BHS_{col.upper()}," for col in bhsdf.columns] + ["]"]))
print("\n".join(["RBSR_FEATURES = ["] + [f"    Feat.RBSR_{col.upper()}," for col in rbsr.columns] + ["]"]))
print("\n".join(["CBCL_6_18_FEATURES = ["] + [f"    Feat.CBCL_6_18_{col.upper()}," for col in cbcl_2.columns] + ["]"]))
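Applied to a toy dataframe standing in for `scqdf` (the column names here are illustrative, not the instrument's real items), the generator emits a ready-to-paste constant list:

```python
import pandas as pd

# Toy stand-in for scqdf -- the real dataframe is built by the
# preprocessing script; these column names are illustrative only.
scqdf = pd.DataFrame(columns=["q01_phrases", "q02_conversation"])

snippet = "\n".join(
    ["SCQ_FEATURES = ["]
    + [f"    Feat.SCQ_{col.upper()}," for col in scqdf.columns]
    + ["]"]
)
print(snippet)
# SCQ_FEATURES = [
#     Feat.SCQ_Q01_PHRASES,
#     Feat.SCQ_Q02_CONVERSATION,
# ]
```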

These features were joined using the init_and_join method.

Note

The ASD column was omitted from BHC, as it was already included via SCQ, mirroring how duplicated columns are removed in the preprocessing script [2].

Preprocessing#

Inclusion criteria ascertained from the preprocessing script [2] involve:

  • An SCQ age of evaluation range of 4–18 years.

  • A BHC age of evaluation range of 4–18 years.

Data transformations include:

  • Mapping male sex assigned at birth to 1 and female sex assigned at birth to 0.

  • Encoding categorical variables as numerical values.

Exclusion criteria include:

  • Columns where more than 10% of the values are missing.

  • Rows with any missing values.
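The criteria above can be sketched in pandas. This is a minimal sketch, assuming illustrative column names (`age_at_eval_years`, `sex`) rather than the actual ones used in the preprocessing script [2]:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, max_missing: float = 0.10) -> pd.DataFrame:
    """Sketch of the inclusion, transformation, and exclusion steps."""
    # Inclusion: age of evaluation between 4 and 18 years.
    df = df[df["age_at_eval_years"].between(4, 18)]
    # Transformation: male sex assigned at birth -> 1, female -> 0.
    df = df.assign(sex=df["sex"].map({"male": 1, "female": 0}))
    # Exclusion: columns where more than `max_missing` of values are missing.
    df = df.loc[:, df.isna().mean() <= max_missing]
    # Exclusion: rows with any remaining missing values.
    return df.dropna()
```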

The preprocessing metrics, including the male to female split, are shown below:

| Metric             | Reproduction | Original Paper |
|--------------------|--------------|----------------|
| Number of subjects | 27688        | 9094           |
| Number of features | 108          | 247            |
| Male subjects      | 21416        | 6818           |
| Female subjects    | 6272         | 2276           |

Observations include a roughly 3-fold increase in subject count, but a loss of 139 features. The male to female proportions have stayed roughly the same. These differences are likely due to the use of a more recent SPARK release (2025-03-31, as opposed to the original paper's 2022-12-12). Raising the maximum missing-value threshold from 10% to 30% yields only 6 more features, which suggests that many subjects have joined the study since the original release but have not completed all the instruments.

When comparing the subject identifiers between the original and the reproduction, we find that 96.4% of the subjects in the original paper are also present in the reproduction. The individuals absent from the reproduction may have requested that their data be removed, or their data may have been found to be erroneous.
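The overlap figure reduces to a set intersection. A minimal sketch (the real identifier sets come from get_original_subject_sp_ids and the reproduced dataframe; the values below are toy data):

```python
def subject_overlap(original_ids: set[str], reproduced_ids: set[str]) -> float:
    """Percentage of original subjects that also appear in the reproduction."""
    if not original_ids:
        return 0.0
    return 100 * len(original_ids & reproduced_ids) / len(original_ids)

# Toy example: 2 of 3 original subjects are retained.
print(round(subject_overlap({"SP1", "SP2", "SP3"}, {"SP1", "SP3", "SP9"}), 1))
# -> 66.7
```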

Fit results#

| Metric                         | Original paper     | Original identifiers | All identifiers   |
|--------------------------------|--------------------|----------------------|-------------------|
| Sample size                    | 5392               | 8767                 | 27688             |
| Case weights                   | 37%, 34%, 19%, 10% | 31%, 27%, 25%, 16%   | 52%, 27%, 14%, 6% |
| Number of estimated parameters |                    | 10229                | 9049              |
| Scaled relative entropy        |                    | 0.9780               | 0.9783            |
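The scaled relative entropy summarises how cleanly the mixture model separates subjects into classes. Assuming the common definition 1 - H / (N log K), where H is the total posterior entropy over N subjects and K classes (the paper's exact formula is not restated here), it can be computed from an (N, K) posterior responsibility matrix:

```python
import numpy as np

def scaled_relative_entropy(posterior: np.ndarray) -> float:
    """1 - H / (N log K) for an (N, K) posterior responsibility matrix."""
    n, k = posterior.shape
    p = np.clip(posterior, 1e-12, 1.0)  # avoid log(0)
    entropy = -np.sum(p * np.log(p))
    return 1.0 - entropy / (n * np.log(k))

# Near-certain class assignments give a value close to 1.
posterior = np.array([[0.99, 0.01, 0.00, 0.00],
                      [0.00, 0.98, 0.01, 0.01]])
print(round(scaled_relative_entropy(posterior), 3))
# -> 0.939
```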

Footnotes

Module contents#

asd_strat.commands.paper_reproduction.find_optimal_k(df, df_descriptor, Z_p, ds, instruments, cache_path)#
asd_strat.commands.paper_reproduction.get_original_subject_sp_ids() → set[str]#

Retrieves the original subject identifiers used by the paper from a text file.

Returns:

A set of unique subject identifiers read from the file.

Return type:

set[str]

asd_strat.commands.paper_reproduction.paper_reproduction(use_original_ids: bool = True, sample_frac: float = True, spark_pathname: str = '.', cache_pathname: str = '.', output_pathname: str = '.') → None#
asd_strat.commands.paper_reproduction.run_model(df, df_descriptor, Z_p, n_components: int = 4, n_init: int = 20, progress_bar: int = 0, verbose: int = 0) → tuple[StepMix, Series]#
asd_strat.commands.paper_reproduction.run_preprocessing(spark_pathname: str, use_original_ids: bool = True, round_values: bool = True, sample_frac: float | None = 0.01) → tuple[Hashable, Hashable, DataFrame, SPARK, list[Inst]]#

Reproduces the paper’s preprocessing.

Parameters:
  • spark_pathname – The SPARK data release directory pathname.

  • use_original_ids (bool, optional) – A flag indicating whether to filter only original subject IDs. Defaults to True.

  • round_values – A flag indicating whether to round the dataframe to the nearest integer. Defaults to True.

  • sample_frac – The proportion of the dataset to sample. Defaults to 0.01 (1%).

Returns:

A tuple containing the preprocessed dataframe, the dataset, and a list of instruments used during processing.

Return type:

tuple[pd.DataFrame, SPARK, list[Inst]]

asd_strat.commands.paper_reproduction.split_dataframe_columns_by_type(df: DataFrame, cat_unique_threshold: int = 10) → tuple[list[str], list[str], list[str]]#

Classify the columns of a DataFrame as binary, categorical, or continuous.

Binary columns are those whose unique non-null values form a subset of {0.0, 1.0}. Categorical columns have a number of unique values less than or equal to the threshold. Continuous columns are the remainder.
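A minimal sketch of this classification rule, assuming the description above (not the module's actual implementation):

```python
import pandas as pd

def split_columns_by_type(df: pd.DataFrame, cat_unique_threshold: int = 10):
    """Classify columns as binary, categorical, or continuous."""
    binary, categorical, continuous = [], [], []
    for col in df.columns:
        uniques = set(df[col].dropna().unique())
        if uniques <= {0.0, 1.0}:  # non-null values all within {0, 1}
            binary.append(col)
        elif len(uniques) <= cat_unique_threshold:
            categorical.append(col)
        else:
            continuous.append(col)
    return binary, categorical, continuous
```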

Parameters:
  • df (pd.DataFrame) – The DataFrame whose columns are to be classified.

  • cat_unique_threshold (int) – The maximum number of unique values for a column to be considered categorical.

Returns:

A tuple of three lists: (binary columns, continuous columns, categorical columns)

Return type:

tuple[list[str], list[str], list[str]]