bioalpha.singlecell.preprocessing.subsample

bioalpha.singlecell.preprocessing.subsample(data: AnnData | ndarray, fraction: float | None = None, n_obs: int | None = None, random_state: None | int | RandomState = 0, copy: bool = False, method: Literal['geosketching', 'random'] = 'geosketching', use_rep: str | None = None, subset_path: str | None = None, **kwargs) → AnnData | Tuple[ndarray, ndarray] | None

Subsample to a fraction of the number of observations.

Parameters:
  • data (Union[AnnData, ndarray]) – The (annotated) data matrix of shape n_obs x n_vars. Rows correspond to cells and columns to genes.

  • fraction (Optional[float], default = None) – Subsample to this fraction of the number of observations.

  • n_obs (Optional[int], default = None) – Subsample to this number of observations. Not compatible with fraction.

  • random_state (AnyRandom, default = 0) – Random seed (or RandomState) used for subsampling. Change it to draw a different subsample.

  • copy (bool, default = False) – If True, subsample a copy of the input AnnData and return it. If False, subsample the input object in place.

  • method (Literal["geosketching", "random"], default = "geosketching") – Which subsampling method to use. “geosketching” requires an ndarray.

  • use_rep (Optional[str], default = None) – Use the indicated representation. "X" or any key for .obsm is valid. If None, the representation is chosen automatically: for .n_vars < 50, .X is used, otherwise “X_pca” is used. If “X_pca” is not present, it is computed with default parameters. Ignored when the input is an AnnData instance and method="random".

  • subset_path (Optional[str], default = None) – H5ADMap data does not support in-place subsetting, so subset_path is passed to the .diet_subset function of the H5ADMap data. This parameter is ignored if copy=True.

  • kwargs (dict) – Any additional arguments will be passed to _sctools.sampling.geosketching.
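
The parameter rules above can be sketched in plain Python. This is only an illustration, not the bioalpha implementation: resolve_n_obs and choose_representation are hypothetical helpers, and the truncating conversion of fraction to a count, the error raised when neither fraction nor n_obs is given, and the KeyError for an unknown use_rep are all assumptions of this sketch.

```python
from typing import Collection, Optional, Tuple


def resolve_n_obs(total: int,
                  fraction: Optional[float] = None,
                  n_obs: Optional[int] = None) -> int:
    """Hypothetical helper: turn `fraction` or `n_obs` into a target count."""
    # The documentation states that the two arguments are not compatible.
    # Rejecting the case where neither is given is an assumption.
    if (fraction is None) == (n_obs is None):
        raise ValueError("pass exactly one of `fraction` or `n_obs`")
    if fraction is not None:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("`fraction` must lie in [0, 1]")
        # Truncation toward zero is an assumption of this sketch.
        return int(fraction * total)
    if not 0 <= n_obs <= total:
        raise ValueError("`n_obs` must lie in [0, total]")
    return n_obs


def choose_representation(n_vars: int,
                          obsm_keys: Collection[str],
                          use_rep: Optional[str] = None) -> Tuple[str, bool]:
    """Return (representation key, whether "X_pca" must be computed first)."""
    if use_rep is not None:
        # "X" or any key of .obsm is documented as valid.
        if use_rep != "X" and use_rep not in obsm_keys:
            raise KeyError(f"unknown representation: {use_rep!r}")
        return use_rep, False
    if n_vars < 50:
        # Few variables: use the data matrix directly.
        return "X", False
    # Otherwise fall back to "X_pca", computing it if it is missing.
    return "X_pca", "X_pca" not in obsm_keys
```

For example, under these assumptions choose_representation(2000, []) selects "X_pca" and flags that it has to be computed, while choose_representation(30, []) selects "X".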

Return type:
  Union[AnnData, Tuple[ndarray, ndarray], None]

Returns:
  X[obs_indices], obs_indices if data is array-like. Otherwise, subsamples the passed AnnData in place and returns None (copy == False), or returns a subsampled copy of it (copy == True).
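
The array-like return contract can be illustrated with a small NumPy sketch. It is a hypothetical stand-in for the method="random" case only, not the bioalpha code: random_subsample is an invented name, drawing without replacement is an assumption, and nothing here says whether the real function returns the indices sorted.

```python
import numpy as np


def random_subsample(X, n_obs, random_state=0):
    """Hypothetical stand-in: return (X[obs_indices], obs_indices)."""
    # Accept None, an int seed, or an existing RandomState,
    # mirroring the types listed for `random_state`.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)
    # Assumption: observations (rows) are drawn without replacement.
    obs_indices = rng.choice(X.shape[0], size=n_obs, replace=False)
    return X[obs_indices], obs_indices


X = np.arange(40, dtype=float).reshape(10, 4)  # 10 cells x 4 genes
X_sub, obs_indices = random_subsample(X, n_obs=3, random_state=0)
```

The second element of the tuple lets the caller apply the same selection to any per-cell metadata kept alongside the matrix, for example labels[obs_indices].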