Median computation: gene_median_dictionary_gc95M
Regarding the median dictionary "gene_median_dictionary_gc95M":
Did you compute the medians on the pretraining dataset or on the fine-tuning dataset?
While reading the code, I see:
gene_median_file : Path
| Path to pickle file containing dictionary of non-zero median
| gene expression values across Genecorpus-30M.
This suggests the medians were computed on Genecorpus-30M.
Also, for the median computation, did you compute it cell-wise or gene-wise?
@ctheodoris
Thanks for your question. Just as the medians were computed on gc30m for the gc30m model, they were recomputed on gc95m for the gc95m model. Use the medians that match the model you are working with. They should NOT be recalculated for fine-tuning datasets. For method, see here:
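(Illustration only, not the method linked above.) Reading the docstring quoted earlier literally, "non-zero median gene expression values" suggests one value per gene: the median of that gene's non-zero values across all cells. The sketch below shows that idea on a toy matrix. The function name, the cells-by-genes layout, and the gene IDs are assumptions made for this example; any normalization applied before taking the median is not shown.

```python
from statistics import median


def nonzero_median_per_gene(expr, gene_ids):
    """Return {gene_id: median of that gene's non-zero values across cells}.

    expr: list of rows, one row per cell, one column per gene
          (toy layout assumed for this example).
    Genes that are zero in every cell are left out of the result.
    """
    medians = {}
    for j, gene in enumerate(gene_ids):
        # Collect this gene's values over all cells, skipping zeros.
        nonzero = [row[j] for row in expr if row[j] > 0]
        if nonzero:
            medians[gene] = float(median(nonzero))
    return medians


# Toy data: 4 cells x 3 genes; the gene IDs are placeholders.
toy_expr = [
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [6.0, 8.0, 0.0],
    [0.0, 5.0, 0.0],
]
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]

toy_medians = nonzero_median_per_gene(toy_expr, toy_genes)
print(toy_medians)
```

Here GENE_C is never expressed, so it gets no entry. The point of the sketch is only the direction of the computation: per gene, across cells, ignoring zeros.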
Thank you so much.
I have one more question.
The "gene_token_dict" has a length of 20275, which I believe corresponds to protein-coding genes. Is that correct?
For the gc95m dictionary, that is correct. The gc30m dictionary contains additional non-protein coding genes. You can check the Ensembl IDs in the dictionary to see which genes are represented.
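A minimal sketch of how one might inspect the dictionary keys, assuming the token dictionary is a pickled Python dict whose gene keys are Ensembl IDs. The helper name and the toy entries are made up for illustration, and the file path in the comment is a placeholder. The "ENSG" prefix only identifies human Ensembl gene IDs; deciding whether a gene is protein-coding would still require looking up its biotype in an Ensembl annotation.

```python
import pickle


def summarize_token_dict(token_dict):
    """Split dictionary keys into Ensembl-style gene IDs and everything else."""
    ensembl = [k for k in token_dict if str(k).startswith("ENSG")]
    other = [k for k in token_dict if not str(k).startswith("ENSG")]
    return {
        "total": len(token_dict),
        "ensembl_gene_ids": len(ensembl),
        "other_keys": sorted(str(k) for k in other),
    }


# Toy stand-in for a real token dictionary (entries are illustrative only).
toy_token_dict = {
    "<pad>": 0,
    "<mask>": 1,
    "ENSG00000000003": 2,
    "ENSG00000000005": 3,
}
summary = summarize_token_dict(toy_token_dict)
print(summary)

# With a real file, loading would look like this (path is a placeholder):
# with open("path/to/token_dictionary.pkl", "rb") as fh:
#     token_dict = pickle.load(fh)
# print(summarize_token_dict(token_dict))
```

The list of Ensembl IDs obtained this way can then be compared against an annotation table to see which gene biotypes are represented.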