bioalpha.singlecell.preprocessing.highly_variable_genes
- bioalpha.singlecell.preprocessing.highly_variable_genes(adata: AnnData | H5ADMap, layer: str | None = None, n_top_genes: int | None = None, min_disp: float | None = 0.5, max_disp: float | None = inf, min_mean: float | None = 0.0125, max_mean: float | None = 3, span: float | None = 0.3, n_bins: int = 20, flavor: Literal['seurat', 'cell_ranger', 'seurat_v3'] = 'seurat', subset: bool = False, inplace: bool = True, batch_key: str | None = None, check_values: bool = True, obs_mask: str | None = None, var_mask: str | None = None) → DataFrame | None
Annotate highly variable genes.
Expects logarithmized data, except when flavor='seurat_v3', in which case count data is expected.
For the dispersion-based methods (Seurat and Cell Ranger), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.
For Seurat v3, a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance.
- Parameters:
adata (AnnData) – The annotated data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes.
layer (Optional[str], default = None) – If provided, use adata.layers[layer] for expression values instead of adata.X.
n_top_genes (Optional[int], default = None) – Number of highly variable genes to keep. Mandatory if flavor="seurat_v3".
min_disp (Optional[float], default = 0.5) – If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor="seurat_v3".
max_disp (Optional[float], default = inf) – If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor="seurat_v3".
min_mean (Optional[float], default = 0.0125) – If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor="seurat_v3".
max_mean (Optional[float], default = 3) – If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored. Ignored if flavor="seurat_v3".
span (Optional[float], default = 0.3) – The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor="seurat_v3".
n_bins (int, default = 20) – Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You'll be informed about this if you set settings.verbosity = 4.
flavor (Literal["seurat", "cell_ranger", "seurat_v3"], default = "seurat") – Choose the flavor for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
subset (bool, default = False) – If True, subset in place to highly variable genes; otherwise merely indicate highly variable genes.
inplace (bool, default = True) – Whether to place calculated metrics in .var or return them.
batch_key (Optional[str], default = None) – If specified, highly variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by the number of batches in which they are an HVG. For dispersion-based flavors, ties are broken by normalized dispersion. If flavor="seurat_v3", ties are broken by the median (across batches) rank based on within-batch normalized variance.
check_values (bool, default = True) – Check whether counts in the selected layer are integers. If True, a warning is issued when they are not. Only used if flavor="seurat_v3".
obs_mask (Optional[str], default = None) – If obs_mask is not None, filter cells by adata.obs[obs_mask].
var_mask (Optional[str], default = None) – If var_mask is not None, filter genes by adata.var[var_mask].
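The merge rule described for batch_key can be illustrated with a minimal sketch for the dispersion-based flavors. This is not the library's implementation: the helper name merge_batch_hvgs, the plain-list inputs, the use of the mean normalized dispersion across batches as the tie-breaker, and the choice that higher dispersion wins a tie are all assumptions made for illustration. Only the sort order (number of batches first, normalized dispersion second) comes from the description above.

```python
def merge_batch_hvgs(per_batch_hvg, per_batch_disp_norm, n_top_genes):
    """Hypothetical helper: merge per-batch HVG calls into one selection.

    per_batch_hvg: one list of booleans per batch (True = HVG in that batch).
    per_batch_disp_norm: one list of normalized dispersions per batch.
    """
    n_batches = len(per_batch_hvg)
    n_genes = len(per_batch_hvg[0])

    # Count in how many batches each gene was called highly variable.
    nbatches = [sum(batch[g] for batch in per_batch_hvg) for g in range(n_genes)]

    # Assumption: ties are broken by the mean normalized dispersion across batches.
    mean_disp = [
        sum(batch[g] for batch in per_batch_disp_norm) / n_batches
        for g in range(n_genes)
    ]

    # Sort by batch count first, then by the tie-breaker, both descending.
    order = sorted(
        range(n_genes), key=lambda g: (nbatches[g], mean_disp[g]), reverse=True
    )
    selected = set(order[:n_top_genes])

    return {
        "highly_variable": [g in selected for g in range(n_genes)],
        "highly_variable_nbatches": nbatches,
        "highly_variable_intersection": [n == n_batches for n in nbatches],
    }


# Two batches, three genes: gene 0 is an HVG in both batches.
merged = merge_batch_hvgs(
    per_batch_hvg=[[True, True, False], [True, False, True]],
    per_batch_disp_norm=[[2.0, 1.5, 0.2], [1.0, 0.1, 3.0]],
    n_top_genes=2,
)
print(merged)
```

Genes 1 and 2 are each an HVG in one batch, so the tie-breaker decides between them; under the assumptions above, gene 2 is kept because its mean normalized dispersion is higher.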
- Returns:
Depending on inplace, returns calculated metrics or updates .var with the following fields:
highly_variable (bool) – boolean indicator of highly-variable genes
means: means per gene
dispersions: For dispersion-based flavors, dispersions per gene
dispersions_norm: For dispersion-based flavors, normalized dispersions per gene
variances: For flavor='seurat_v3', variance per gene
variances_norm: For flavor='seurat_v3', normalized variance per gene, averaged in the case of multiple batches
highly_variable_rank (float) – For flavor='seurat_v3', rank of the gene according to normalized variance, median rank in the case of multiple batches
highly_variable_nbatches (int) – If batch_key is given, this denotes in how many batches genes are detected as HVG
highly_variable_intersection (bool) – If batch_key is given, this denotes the genes that are highly variable in all batches
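The per-bin normalization used by the dispersion-based flavors can be sketched in a few lines of standard-library Python. This is an illustrative sketch and not the library's code: the function name normalized_dispersions, the equal-width binning of mean expression, and the use of the sample standard deviation are assumptions. The scaling by the bin mean and standard deviation, and the value of 1 for a bin holding a single gene, follow the description above.

```python
from statistics import mean, stdev


def normalized_dispersions(means, dispersions, n_bins=20):
    """Hypothetical helper: z-score each gene's dispersion within its mean-expression bin."""
    lo, hi = min(means), max(means)
    # Assumption: equal-width bins over the range of mean expression.
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all means are equal

    # Group gene indices by bin; the largest mean is clamped into the last bin.
    members = {}
    for gene, m in enumerate(means):
        bin_index = min(int((m - lo) / width), n_bins - 1)
        members.setdefault(bin_index, []).append(gene)

    result = [0.0] * len(means)
    for genes in members.values():
        if len(genes) == 1:
            # A bin with a single gene has no spread; set its value to 1.
            result[genes[0]] = 1.0
            continue
        values = [dispersions[g] for g in genes]
        mu, sd = mean(values), stdev(values)
        for g in genes:
            # Guard against a zero standard deviation within the bin.
            result[g] = (dispersions[g] - mu) / sd if sd > 0 else 0.0
    return result


# Three low-expression genes share a bin; the fourth gene sits alone in its bin.
print(normalized_dispersions([0.1, 0.2, 0.3, 5.0], [1.0, 2.0, 3.0, 4.0], n_bins=2))
```

Because each gene is compared only with genes of similar mean expression, a gene can be selected in a low-expression bin even when its raw dispersion is smaller than that of genes in other bins.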