The engine uses approximate nearest-neighbor search based on Locality Sensitive Hashing (LSH). LSH maps samples from a high-dimensional space to buckets in such a way that similar cells land in the same bucket with high probability; the buckets are defined in a much lower dimension. The engine hashes large expression datasets of cells into this lower dimension using random hyperplane projections and searches for neighbors in the reduced space. After obtaining this rough estimate of nearby cells, it ranks them using a significance test based on cosine similarity, which adjusts for the noise and bias of the query and reference expression data.
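The idea of random hyperplane hashing can be illustrated with a minimal CPU sketch. This is not the engine's GPU implementation: the function names, the number of hyperplanes, and the single-table lookup are all assumptions made for illustration. Each cell gets one bit per hyperplane (which side of the plane it falls on), cells with the same bit pattern share a bucket, and only the cells in the query's bucket are ranked by cosine similarity.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(reference, hyperplanes):
    """Group reference cells (name -> expression vector) into buckets by signature."""
    buckets = defaultdict(list)
    for name, vec in reference.items():
        buckets[signature(vec, hyperplanes)].append(name)
    return buckets


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_neighbors(vec, reference, buckets, hyperplanes, k=3):
    """Rank the cells in the query's bucket by cosine similarity, best first."""
    candidates = buckets.get(signature(vec, hyperplanes), [])
    scored = [(name, cosine(vec, reference[name])) for name in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because the signature depends only on the sign of each projection, a query that is a scaled copy of a reference cell always falls into that cell's bucket. A single table can miss true neighbors that fall just across a hyperplane, which is why LSH is approximate; practical systems usually combine several tables.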
A GPU (Graphics Processing Unit) is a specialized high-performance computing chip. Its inherently parallel architecture allows independent processes to execute concurrently. The engine uses the GPU to project all cells into lower-dimensional buckets in parallel. For the input query, each cell’s neighbors are likewise computed in parallel. Other routine steps, such as sorting, are also carried out on the GPU.
The engine takes an xlsx file as input. The first column must contain gene symbols. Each subsequent column contains the expression values of one query cell for those genes. The file has no header row; the first row corresponds to the first gene. For example, for a query of 5 cells, the input file should have 6 columns (one for the gene symbols and one for each cell). For a sample query file, click here.
For complete details, see the 'How to Use' section.
For a list of genes used by the engine, click here. The complete engine uses only these genes, in the same order.
The engine is designed to handle such input files. Any gene not listed in the engine’s gene directory is simply discarded. If the genes are not in the expected order, the engine reorders them. If the file covers fewer genes than the engine uses, the engine fills in the missing genes with an expression value of zero.
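The three rules above (discard unknown genes, reorder, zero-fill missing genes) can be sketched as follows. This is only an illustration of the described behavior, not the engine's code; `align_to_reference` and its argument layout are hypothetical names chosen here.

```python
def align_to_reference(query_rows, reference_genes):
    """Align query rows to the engine's gene list.

    query_rows: list of (gene_symbol, [expression value per cell]) in any order.
    reference_genes: the engine's gene symbols in their required order.
    Returns one row per reference gene, in reference order.
    """
    n_cells = len(query_rows[0][1]) if query_rows else 0
    reference_set = set(reference_genes)

    # Genes that are not in the reference list are dropped here.
    by_gene = {gene: values for gene, values in query_rows if gene in reference_set}

    # Walking the reference list fixes the order; absent genes become zeros.
    return [list(by_gene.get(gene, [0.0] * n_cells)) for gene in reference_genes]
```

For instance, with reference genes `["A", "B", "C"]` and query rows for `C`, `X`, and `A`, the row for `X` is dropped, `A` and `C` are placed in reference order, and `B` is filled with zeros.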
To keep your data private, Cell Atlas Search is also available as completely independent standalone software. It can be downloaded from link. The standalone version runs entirely on a GPU-enabled machine.
The engine computes the significance of each cosine similarity value using a permutation test. The p-value captures how significant a similarity value is for that sample relative to a random set. These p-values are adjusted across the different studies to obtain the False Discovery Rate (FDR), which is listed in the results as the Adjusted P Value.
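As a rough illustration of these two steps, the sketch below builds a null distribution by shuffling the query's expression values across genes, derives an empirical p-value, and then adjusts a set of p-values with the Benjamini-Hochberg procedure. The shuffling scheme, the choice of Benjamini-Hochberg, and all function names are assumptions; the engine's actual permutation and adjustment procedures may differ.

```python
import random


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def permutation_null(query, reference, n_perm=1000, seed=0):
    """Null similarities obtained by shuffling the query's values across genes (assumed scheme)."""
    rng = random.Random(seed)
    shuffled = list(query)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_scores.append(cosine(shuffled, reference))
    return null_scores


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed one (with +1 correction)."""
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The +1 correction keeps the empirical p-value from ever being exactly zero, so its smallest possible value is 1 / (n_perm + 1).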
On hitting the submit button, the engine starts finding the nearest neighbors of your query cells. This may take some time, depending on the dataset chosen, the number of nearest neighbors requested, and the number of cells in the query. The system generates a custom URL for each query submitted to the engine, which you can copy. This URL can be used to view the results once they are generated, and to come back to them later. The URL remains valid for 48 hours after it is created.
Search is fast and accurate. A complete run on a single-cell dataset for a bulk query of 4 cells takes around 11-14 seconds. A query of 30 cells takes around 1 minute.
Retry after some time. If the problem persists, email the developers with the query details and the team will get back to you.
Currently this facility is not available on the web server. However, you can include your own expression dataset in the standalone version of the tool: a custom dataset can be converted into searchable data for the standalone version. To make your data globally searchable, you can send us the gene count data, or upload your data to a public repository such as GEO or SRA and notify us by email.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
All modern web browsers, such as Google Chrome, Firefox, and Safari.
The server does not produce the two plots when the query contains too few samples. Specifically, the two plots are produced only when the number of samples in the query is greater than 5.
A query cell is dropped from the results table when the pipeline cannot find a significant nearest neighbor for it. Some possible reasons are:
Although we have tried hard to produce meaningful results, one may sometimes get irrelevant results, primarily due to one of the following reasons: