bioalpha.h5ad_map.H5ADMap.concatenate
- H5ADMap.concatenate(*args, **kwargs)
Concatenate along the observations axis.
The
uns
,varm
andobsm
attributes are ignored.Currently, this works only in ‘memory’ mode.
Note
For more flexible and efficient concatenation, see:
concat()
.- Parameters:
adatas – AnnData matrices to concatenate with. Each matrix is referred to as a “batch”.
join – Use intersection (‘inner’) or union (‘outer’) of variables.
batch_key – Add the batch annotation to
obs
using this key.batch_categories – Use these as categories for the batch annotation. By default, use increasing numbers.
uns_merge –
Strategy to use for merging entries of uns. These strategies are applied recusivley. Currently implemented strategies include:
None: The default. The concatenated object will just have an empty dict for uns.
”same”: Only entries which have the same value in all AnnData objects are kept.
”unique”: Only entries which have one unique value in all AnnData objects are kept.
”first”: The first non-missing value is used.
”only”: A value is included if only one of the AnnData objects has a value at this path.
index_unique – Make the index unique by joining the existing index names with the batch category, using index_unique=’-’, for instance. Provide None to keep existing indices.
fill_value –
Scalar value to fill newly missing values in arrays with. Note: only applies to arrays and sparse matrices (not dataframes) and will only be used if join=”outer”.
Note
If not provided, the default value is 0 for sparse matrices and np.nan for numpy arrays. See the examples below for more information.
- Returns:
The concatenated
AnnData
, where adata.obs[batch_key] stores a categorical variable labeling the batch.- Return type:
AnnData
Notes
Warning
If you use join=’outer’ this fills 0s for sparse data when variables are absent in a batch. Use this with care. Dense data is filled with NaN. See the examples.
Examples
Joining on intersection of variables.
>>> adata1 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']), ... dict(var_names=['a', 'b', 'c'], annoA=[0, 1, 2]), ... ) >>> adata2 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']), ... dict(var_names=['d', 'c', 'b'], annoA=[0, 1, 2]), ... ) >>> adata3 = AnnData( ... np.array([[1, 2, 3], [4, 5, 6]]), ... dict(obs_names=['s1', 's2'], anno2=['d3', 'd4']), ... dict(var_names=['d', 'c', 'b'], annoA=[0, 2, 3], annoB=[0, 1, 2]), ... ) >>> adata = adata1.concatenate(adata2, adata3) >>> adata AnnData object with n_obs × n_vars = 6 × 2 obs: 'anno1', 'anno2', 'batch' var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2' >>> adata.X array([[2, 3], [5, 6], [3, 2], [6, 5], [3, 2], [6, 5]]) >>> adata.obs anno1 anno2 batch s1-0 c1 NaN 0 s2-0 c2 NaN 0 s3-1 c3 NaN 1 s4-1 c4 NaN 1 s1-2 NaN d3 2 s2-2 NaN d4 2 >>> adata.var.T b c annoA-0 1 2 annoA-1 2 1 annoA-2 3 2 annoB-2 2 1
Joining on the union of variables.
>>> outer = adata1.concatenate(adata2, adata3, join='outer') >>> outer AnnData object with n_obs × n_vars = 6 × 4 obs: 'anno1', 'anno2', 'batch' var: 'annoA-0', 'annoA-1', 'annoA-2', 'annoB-2' >>> outer.var.T a b c d annoA-0 0.0 1.0 2.0 NaN annoA-1 NaN 2.0 1.0 0.0 annoA-2 NaN 3.0 2.0 0.0 annoB-2 NaN 2.0 1.0 0.0 >>> outer.var_names Index(['a', 'b', 'c', 'd'], dtype='object') >>> outer.X array([[ 1., 2., 3., nan], [ 4., 5., 6., nan], [nan, 3., 2., 1.], [nan, 6., 5., 4.], [nan, 3., 2., 1.], [nan, 6., 5., 4.]]) >>> outer.X.sum(axis=0) array([nan, 25., 23., nan]) >>> import pandas as pd >>> Xdf = pd.DataFrame(outer.X, columns=outer.var_names) >>> Xdf a b c d 0 1.0 2.0 3.0 NaN 1 4.0 5.0 6.0 NaN 2 NaN 3.0 2.0 1.0 3 NaN 6.0 5.0 4.0 4 NaN 3.0 2.0 1.0 5 NaN 6.0 5.0 4.0 >>> Xdf.sum() a 5.0 b 25.0 c 23.0 d 10.0 dtype: float64
One way to deal with missing values is to use masked arrays:
>>> from numpy import ma >>> outer.X = ma.masked_invalid(outer.X) >>> outer.X masked_array( data=[[1.0, 2.0, 3.0, --], [4.0, 5.0, 6.0, --], [--, 3.0, 2.0, 1.0], [--, 6.0, 5.0, 4.0], [--, 3.0, 2.0, 1.0], [--, 6.0, 5.0, 4.0]], mask=[[False, False, False, True], [False, False, False, True], [ True, False, False, False], [ True, False, False, False], [ True, False, False, False], [ True, False, False, False]], fill_value=1e+20) >>> outer.X.sum(axis=0).data array([ 5., 25., 23., 10.])
The masked array is not saved but has to be reinstantiated after saving.
>>> outer.write('./test.h5ad') >>> from anndata import read_h5ad >>> outer = read_h5ad('./test.h5ad') >>> outer.X array([[ 1., 2., 3., nan], [ 4., 5., 6., nan], [nan, 3., 2., 1.], [nan, 6., 5., 4.], [nan, 3., 2., 1.], [nan, 6., 5., 4.]])
For sparse data, everything behaves similarly, except that for join=’outer’, zeros are added.
>>> from scipy.sparse import csr_matrix >>> adata1 = AnnData( ... csr_matrix([[0, 2, 3], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s1', 's2'], anno1=['c1', 'c2']), ... dict(var_names=['a', 'b', 'c']), ... ) >>> adata2 = AnnData( ... csr_matrix([[0, 2, 3], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s3', 's4'], anno1=['c3', 'c4']), ... dict(var_names=['d', 'c', 'b']), ... ) >>> adata3 = AnnData( ... csr_matrix([[1, 2, 0], [0, 5, 6]], dtype=np.float32), ... dict(obs_names=['s5', 's6'], anno2=['d3', 'd4']), ... dict(var_names=['d', 'c', 'b']), ... ) >>> adata = adata1.concatenate(adata2, adata3, join='outer') >>> adata.var_names Index(['a', 'b', 'c', 'd'], dtype='object') >>> adata.X.toarray() array([[0., 2., 3., 0.], [0., 5., 6., 0.], [0., 3., 2., 0.], [0., 6., 5., 0.], [0., 0., 2., 1.], [0., 6., 5., 0.]], dtype=float32)