median computation : gene_median_dictionary_gc95M

#486
by shakeel604 - opened

Regarding the median file "gene_median_dictionary_gc95M":

Did you compute the medians on the pretraining dataset or on the fine-tuning dataset?
While reading the code, I see:

gene_median_file : Path
| Path to pickle file containing dictionary of non-zero median
| gene expression values across Genecorpus-30M.

So it appears the medians were computed on Genecorpus-30M.

Also, were the medians computed cell-wise or gene-wise?
@ctheodoris

Thanks for your question. As with gc30m, the medians were recomputed on the pretraining corpus for gc95m. Each model should be used with its own matching median file. The medians should NOT be recalculated for fine-tuning datasets. For the method, see here:

https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/pretraining_new_model/obtain_nonzero_median_digests.ipynb
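As a rough illustration of what "non-zero median gene expression values" means in the docstring quoted above, here is a minimal sketch that assumes a small in-memory cells x genes count matrix. The function name `nonzero_median_per_gene` and the toy gene labels are hypothetical, and any per-cell normalization or streaming aggregation done in the linked notebook is omitted, so treat the notebook as the authoritative method.

```python
import numpy as np


def nonzero_median_per_gene(counts, gene_ids):
    """Return {gene_id: median of that gene's non-zero values across cells}.

    counts   : array-like of shape (n_cells, n_genes)
    gene_ids : list of n_genes identifiers (hypothetical labels here)
    """
    counts = np.asarray(counts, dtype=float)
    medians = {}
    for j, gene in enumerate(gene_ids):
        column = counts[:, j]            # one gene across all cells
        nonzero = column[column > 0]     # drop cells where the gene is not detected
        if nonzero.size:                 # skip genes never detected
            medians[gene] = float(np.median(nonzero))
    return medians


# Toy data: 3 cells (rows) x 3 genes (columns); values are made up.
toy_counts = np.array([
    [0, 2, 5],
    [4, 0, 1],
    [6, 0, 3],
])
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_medians = nonzero_median_per_gene(toy_counts, toy_genes)
print(toy_medians)
```

In this sketch each median is taken per gene, over the cells where that gene is non-zero, which yields one value per gene as in a gene-keyed dictionary.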

ctheodoris changed discussion status to closed

Thank you so much.

I have one more question.

The "gene_token_dict" has a length of 20275, which I believe are all protein-coding genes. Is that correct?

For the gc95m dictionary, that is correct. The gc30m dictionary contains additional non-protein coding genes. You can check the Ensembl IDs in the dictionary to see which genes are represented.
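One way to check the Ensembl IDs yourself is sketched below. It assumes the token dictionary is a pickled Python dict whose keys are Ensembl gene IDs (starting with "ENSG") plus possibly a few special tokens. The helper `summarize_token_dict` and the toy file written here are hypothetical stand-ins, so point the function at the actual dictionary file from the repository instead.

```python
import os
import pickle
import tempfile


def summarize_token_dict(path):
    """Load a pickled token dictionary and count its Ensembl-style keys."""
    with open(path, "rb") as fh:
        token_dict = pickle.load(fh)
    ensembl_ids = [k for k in token_dict
                   if isinstance(k, str) and k.startswith("ENSG")]
    ensembl_set = set(ensembl_ids)
    other_keys = [k for k in token_dict if k not in ensembl_set]
    return {
        "total": len(token_dict),
        "ensembl": len(ensembl_ids),
        "other": sorted(map(str, other_keys)),  # e.g. special tokens
    }


# Toy stand-in for the real file (contents are made up for illustration).
toy_dict = {"<pad>": 0, "<mask>": 1,
            "ENSG00000000003": 2, "ENSG00000000005": 3}
toy_path = os.path.join(tempfile.mkdtemp(), "toy_token_dictionary.pkl")
with open(toy_path, "wb") as fh:
    pickle.dump(toy_dict, fh)

summary = summarize_token_dict(toy_path)
print(summary)
```

The resulting list of Ensembl IDs can then be compared against a gene annotation of your choice to see which biotypes are represented.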
