Problem in buildNhoodGraph on a large dataset #290

moinfar · 2023-09-01T14:20:42Z

Describe the bug
Hi,
First, thanks for providing and maintaining the package.
I am using Milo on a dataset of 588k cells. I am trying to run Milo in a Jupyter environment using the rpy2 interface (like here).
Everything is fine when I subset the cells to 10% (59k cells). However, when I run it on the whole dataset of 588k cells, I get an error in the buildNhoodGraph function. It looks like an overflow problem, but I am not sure. Unfortunately, the data is too big for me to upload, but please let me know if I can provide anything else.

Minimum code example
Minimum example to reproduce the error

# python
import os
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import anndata2ri
import rpy2
from rpy2.robjects import r

sc.settings.set_figure_params(dpi=300, frameon=False)

anndata2ri.activate()
%load_ext rpy2.ipython

%%R
library(miloR)
library(igraph)
library(dplyr)
library(ggplot2)
library(scater)
library(patchwork)

MILO_K = 150

# python
original_adata = sc.read("some/path.h5ad")
adata = original_adata
# adata = sc.pp.subsample(original_adata, fraction=0.1, copy=True)

sc.pp.neighbors(adata, n_neighbors=10, use_rep='latent')
sc.tl.umap(adata)

adata_no_knn = adata.copy()
adata_no_knn.obsp = None
adata_no_knn.uns.pop("neighbors")
adata_no_knn

knn_adjacency = adata.obsp["connectivities"]

%%R magic_args="-i adata_no_knn"
LATENT_DIM = ncol(reducedDim(adata_no_knn, 'latent'))
milo <- Milo(adata_no_knn)
milo

%%R magic_args="-i knn_adjacency"
milo_graph <- buildFromAdjacency(knn_adjacency, k=MILO_K, is.binary=TRUE)
graph(milo) <- miloR::graph(milo_graph)
milo <- buildGraph(milo, k=MILO_K, d=LATENT_DIM, reduced.dim='latent')
milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')
plotNhoodSizeHist(milo)

%%R
meta_data = data.frame(colData(milo))[,c("sample_id", "Site", "condition")]
milo <- countCells(milo, meta.data = meta_data, sample="sample_id")

## Calculate distances between cells in neighbourhoods
## for spatial FDR correction
milo <- calcNhoodDistance(milo, d=LATENT_DIM, reduced.dim='latent')

design_df <- distinct(meta_data)
rownames(design_df) <- design_df$sample_id
da_results <- testNhoods(milo, design = ~ Site + condition, design.df = design_df, reduced.dim='latent')

%%R
tryCatch({
    milo <- buildNhoodGraph(milo)
}, error=function(err) {
    print(err)
})

Full error traceback

Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In int2i(as.integer(i), n) : NAs introduced by coercion to integer range
Error in if (any(i < 0L)) { : missing value where TRUE/FALSE needed
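
The warning about coercion "to integer range" fits the overflow suspicion: R's integer type is 32-bit, so any index above 2^31 - 1 becomes NA when passed through as.integer. The sketch below is only a back-of-the-envelope size check, not a confirmed diagnosis. It assumes the number of neighbourhoods is roughly prop × cells (an upper bound, since refinement usually yields fewer) and that some step addresses the cells × nhoods matrix with a single linear index. The helper name is hypothetical.

```python
# Rough size check for the overflow hypothesis (all counts are assumptions).
INT32_MAX = 2**31 - 1  # largest value R's integer type can hold


def dense_index_overflows(n_cells: int, n_nhoods: int) -> bool:
    """Return True if a cells x nhoods matrix has more entries than INT32_MAX."""
    return n_cells * n_nhoods > INT32_MAX


for n_cells in (58_800, 588_000):
    # Assumption: about prop * n_cells neighbourhoods with prop = 0.1 (upper bound).
    n_nhoods = n_cells // 10
    n_entries = n_cells * n_nhoods
    print(n_cells, n_nhoods, n_entries, dense_index_overflows(n_cells, n_nhoods))
```

Under these assumptions the 10% subset stays below the limit while the full dataset exceeds it, which would match the observed behaviour. The real neighbourhood count can be read from ncol(nhoods(milo)).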

Session info
Output of sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS/LAPACK: /dss/dsshome1/03/di93zoz/miniconda3/envs/multigrate2/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    tools     stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] Matrix_1.5-4.1              patchwork_1.1.3            
 [3] scater_1.28.0               scuttle_1.10.2             
 [5] SingleCellExperiment_1.22.0 SummarizedExperiment_1.30.2
 [7] Biobase_2.60.0              GenomicRanges_1.52.0       
 [9] GenomeInfoDb_1.36.1         IRanges_2.34.1             
[11] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[13] MatrixGenerics_1.12.3       matrixStats_1.0.0          
[15] ggplot2_3.4.3               dplyr_1.1.2                
[17] igraph_1.5.1                miloR_1.8.1                
[19] edgeR_3.42.4                limma_3.56.2               

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0          viridisLite_0.4.2        
 [3] vipor_0.4.5               farver_2.1.1             
 [5] viridis_0.6.4             bitops_1.0-7             
 [7] ggraph_2.1.0              RCurl_1.98-1.12          
 [9] tweenr_2.0.2              digest_0.6.33            
[11] rsvd_1.0.5                lifecycle_1.0.3          
[13] statmod_1.5.0             magrittr_2.0.3           
[15] compiler_4.3.1            rlang_1.1.1              
[17] utf8_1.2.3                labeling_0.4.2           
[19] S4Arrays_1.0.5            graphlayouts_1.0.0       
[21] DelayedArray_0.26.7       RColorBrewer_1.1-3       
[23] abind_1.4-5               BiocParallel_1.34.2      
[25] withr_2.5.0               purrr_1.0.2              
[27] grid_4.3.1                polyclip_1.10-4          
[29] fansi_1.0.4               beachmat_2.16.0          
[31] colorspace_2.1-0          scales_1.2.1             
[33] gtools_3.9.4              MASS_7.3-60              
[35] cli_3.6.1                 crayon_1.5.2             
[37] generics_0.1.3            DelayedMatrixStats_1.22.5
[39] ggbeeswarm_0.7.2          ggforce_0.4.1            
[41] stringr_1.5.0             splines_4.3.1            
[43] zlibbioc_1.46.0           parallel_4.3.1           
[45] XVector_0.40.0            vctrs_0.6.3              
[47] BiocSingular_1.16.0       BiocNeighbors_1.18.0     
[49] ggrepel_0.9.3             irlba_2.3.5.1            
[51] beeswarm_0.4.0            packrat_0.9.1            
[53] locfit_1.5-9.8            tidyr_1.3.0              
[55] glue_1.6.2                codetools_0.2-19         
[57] cowplot_1.1.1             stringi_1.7.12           
[59] gtable_0.3.4              ScaledMatrix_1.8.1       
[61] munsell_0.5.0             tibble_3.2.1             
[63] pillar_1.9.0              GenomeInfoDbData_1.2.10  
[65] R6_2.5.1                  sparseMatrixStats_1.12.2 
[67] tidygraph_1.2.3           lattice_0.21-8           
[69] Rcpp_1.0.11               gridExtra_2.3            
[71] pkgconfig_2.0.3

P.S
Some minor differences between 10% and 100%:

Computing Milo on 100% of the data takes a very long time (maybe 10 hours) to reach the mentioned line.
I get this warning when subsetting to 10% of the data on da_results <- testNhoods(...):

In addition: Warning message:
In testNhoods(milo, design = ~Site + condition, design.df = design_df,  :
  Sample names in design matrix and nhood counts are not matched. Reordering


MikeDMorgan · 2023-09-18T07:59:55Z

Hi @moinfar - two things to start with: 1) you don't need to run the nhood refinement with 100% of the data - that is entirely redundant and a sure-fire way to kill all of your statistical power, 2) use the graph-based nhood refinement and spatial FDR correction - it's orders of magnitude faster.

Finally, there is a milopy implementation maintained by @emdann that you may find useful: https://github.com/emdann/milopy

moinfar · 2023-09-18T11:35:53Z

Hi @MikeDMorgan,

Thanks for your reply.

Sorry, I think I am missing something. As far as I understood, the nhood refinement proportion is given as a parameter in the makeNhoods function. I use the following code in both mentioned runs:

milo <- makeNhoods(milo, prop = 0.1, k = MILO_K, d=LATENT_DIM, refined = TRUE, reduced_dims='latent')

The difference between the two runs is that I subset the whole data (initial adata) to 10% in the working one, while I use the whole data in the other. The latter results in the given error. Am I missing something?

Thanks. Can you please explain this in more detail, or point me to the relevant documentation? For graph-based nhood refinement, I think I can pass refinement_scheme="graph", right? But I am not sure whether the spatial FDR correction you mention is something other than what I do now.

Thanks for mentioning the milopy implementation. I will take a look at it.
I would also appreciate it if you could help me with the original error I faced when using the main (588k) data.

MikeDMorgan · 2023-09-18T11:48:41Z

Hi @moinfar, I see. Sorry, from your post I inferred that you had set prop=1.0. The original nhood refinement has time complexity $O(n^2_j)$ for the jth nhood, so it scales roughly quadratically with the number of cells. The newer algorithms scale much better and run in a fraction of the time. The most up-to-date version on Bioconductor has the relevant function documentation. The manuscript is currently in the works.
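
To make the quadratic term concrete, here is a small sketch that counts unordered cell pairs within one neighbourhood. It assumes the cost is dominated by pairwise comparisons, which is one common source of an $O(n^2_j)$ term and is not confirmed for miloR here. The neighbourhood sizes are hypothetical.

```python
import math


def pairwise_comparisons(n_cells_in_nhood: int) -> int:
    """Number of unordered cell pairs in a neighbourhood of the given size."""
    return math.comb(n_cells_in_nhood, 2)


# Hypothetical neighbourhood sizes: a 10x larger nhood needs ~100x more pairs.
for n_j in (150, 1_500):
    print(n_j, pairwise_comparisons(n_j))
```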

I haven't seen that particular error before. It suggests there are NA or Inf values in your nhoods() matrix. Could you check that the scanpy adjacency matrix is binary?
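
One way to run that check on the Python side is sketched below. It assumes the adjacency is a SciPy sparse matrix, as adata.obsp["connectivities"] is in the reproduction code, so its stored entries are available as the .data array. The helper name and the toy values are hypothetical.

```python
import numpy as np


def check_stored_values(values):
    """Report whether the stored entries of a sparse matrix are finite and binary."""
    values = np.asarray(values, dtype=float)
    return {
        "all_finite": bool(np.isfinite(values).all()),
        "is_binary": bool(np.isin(values, (0.0, 1.0)).all()),
    }


# Toy stand-in for the `.data` array of a sparse adjacency (hypothetical weights).
weighted = np.array([0.4, 0.4, 1.0, 1.0])
print(check_stored_values(weighted))

# Setting every stored non-zero entry to 1 gives a binary adjacency.
binarized = (weighted > 0).astype(float)
print(check_stored_values(binarized))
```

On the real data the call would be check_stored_values(knn_adjacency.data). If the values turn out to be weights rather than 0/1, the matrix could be binarized before it is passed to buildFromAdjacency with is.binary=TRUE.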

ManarHashemTaha · 2024-04-24T09:43:44Z

Did you solve the error @moinfar ?

moinfar · 2024-04-25T08:01:29Z

@ManarHashemTaha
Unfortunately not.
I used the newest algorithm, but the problem still persisted.
I checked the scanpy neighborhood graph, and it was OK (no problem in scanpy results).
