We use a quantile transformation to compute hgu133plus2-like expression values. The hgu133plus2 reference was constructed from 1000 random samples. This step runs automatically after submission.
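A minimal sketch of a rank-based quantile transformation, to illustrate the idea. It assumes the reference distribution is already available as a single vector of values. How the reference is summarized from the 1000 samples is not described here, so the function name, the tie handling, and the interpolation are illustrative choices rather than the URSA(HD) implementation.

```python
import numpy as np

def quantile_transform(values, reference):
    """Map `values` onto the empirical distribution of `reference` by rank."""
    values = np.asarray(values, dtype=float)
    ref_sorted = np.sort(np.asarray(reference, dtype=float))
    # Rank each value (0 = smallest); ties are broken by input order.
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    # Convert ranks to quantiles in [0, 1].
    n = len(values)
    quantiles = ranks / (n - 1) if n > 1 else np.zeros(n)
    # Read off the reference value at each quantile (linear interpolation).
    ref_quantiles = np.linspace(0.0, 1.0, len(ref_sorted))
    return np.interp(quantiles, ref_quantiles, ref_sorted)

# Toy input: the largest submitted value receives the largest reference value.
transformed = quantile_transform([5.0, 1.0, 3.0], [10.0, 20.0, 30.0])
```

After the transformation, the submitted values keep their rank order but follow the distribution of the reference.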
URSA(HD) expects a two-column text file in which the first column contains Entrez IDs or HGNC gene names and the second column contains the corresponding quantified expression values. Refer to example files:
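Separately from the example files, the format described above can be illustrated with a short reader. This is a sketch under assumptions: the tab delimiter, the skipping of header or non-numeric lines, and the function name `read_expression` are not specified by URSA(HD), and the two rows of data are made up.

```python
import csv
import io

def read_expression(handle):
    """Read (gene identifier, expression value) pairs from a two-column file."""
    expression = {}
    # Assumption: columns are tab-separated.
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        try:
            expression[row[0].strip()] = float(row[1])
        except ValueError:
            continue  # skip a header line or a non-numeric value
    return expression

# Made-up content: one Entrez ID and one gene name, each with a value.
example = io.StringIO("7157\t8.42\nBRCA1\t5.10\n")
parsed = read_expression(example)
```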
Theoretically, URSAHD should make “no calls” for a unique disease that is not included in the training set. The SVM margins from each URSAHD disease model would be very small and thus uninformative for the Bayesian network, leading to posterior probabilities close to the prior. That said, we believe that most diseases are related to some extent. In practice, therefore, the wide disease coverage of the URSAHD training set could lead to the detection of related-disease signals in this “novel” disease sample.
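The effect of uninformative evidence on a posterior can be shown with a single Bayes update. This is a deliberately simplified two-outcome sketch, not the URSAHD Bayesian network: the function name, the prior of 0.05, and the likelihood ratios are illustrative assumptions.

```python
def posterior(prior, likelihood_ratio):
    """Posterior probability after one Bayes update, in odds form."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

# Evidence that does not favor either outcome (ratio of 1) leaves the prior unchanged.
uninformative = posterior(0.05, 1.0)
# Strong evidence (ratio of 20) moves the posterior well away from the prior.
informative = posterior(0.05, 20.0)
```

A very small SVM margin plays the role of a likelihood ratio near 1 in this picture, which is why the posterior stays close to the prior.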
To make use of tissue relationships, gene expression experiments were annotated to one or more terms in the BRENDA Tissue Ontology. After an initial substring text-mining of sample descriptions in GEO, term-to-experiment pairs were manually verified against their sample descriptions and associated publication(s) to exclude incorrect or ambiguous pairs. The associated publication (original paper) was examined only when the sample descriptions were ambiguous. Sample annotations were then propagated based on the tissue ontology. Note that experiments were not necessarily annotated to their most specific term in the ontology, although such attempts were made.
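Propagation over the ontology can be sketched as follows, assuming it means that a sample annotated to a term is also annotated to every ancestor of that term. The term names and the parent map are hypothetical placeholders, not actual BRENDA Tissue Ontology entries, and the function names are illustrative.

```python
def ancestors(term, parents):
    """Collect every ancestor of `term` given a child -> parents map."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(annotations, parents):
    """Extend each sample's terms with all ancestor terms."""
    propagated = {}
    for sample, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        propagated[sample] = full
    return propagated

# Hypothetical hierarchy: hepatocyte -> liver -> digestive gland.
parents = {"hepatocyte": ["liver"], "liver": ["digestive gland"]}
annotations = {"sample_1": {"hepatocyte"}}
result = propagate(annotations, parents)
```

With this scheme, a sample annotated only to a broad term still contributes to that term, while a sample annotated to a specific term contributes to the specific term and to all broader ones.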
Manual tissue annotations are available here: manual_annotations_ursa.csv