The diagnosis and prognosis of cancer are among the most
challenging tasks in oncology. To select the most appropriate
treatment for each patient, current personalized medicine
uses data from heterogeneous sources to estimate how a given
disease will evolve in that particular patient. In recent
years, next-generation sequencing data have boosted cancer prediction by
supplying gene-expression information that has enabled diverse machine
learning algorithms to address the problem of cancer subtype
classification, which in turn has contributed to better estimation
of patients’ responses to different treatments. However, the efficacy of these
models is seriously limited by the imbalance between the high
dimensionality of gene expression feature sets and the small number of
samples available for a particular cancer type. To counteract what is known
as the curse of dimensionality, feature selection and extraction methods
have traditionally been applied to reduce the number of input variables
in gene expression datasets. Although these techniques shrink the
input feature space, the prediction performance of traditional
machine learning pipelines that rely on them
remains moderate. In this work, we propose the use of the Pan-Cancer
dataset to pre-train deep autoencoder architectures on a subset composed
of thousands of gene expression samples from very diverse tumor
types. The resulting architectures are subsequently fine-tuned on a
collection of breast cancer samples. This transfer-learning approach
aims to combine supervised and unsupervised deep learning models
with traditional machine learning classification algorithms to tackle the
problem of breast tumor intrinsic-subtype classification.
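
The pre-train, fine-tune, and classify sequence described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from this work: the data are synthetic (a hypothetical `MIX` matrix generates 8 "genes" from 2 hidden factors), a one-hidden-layer linear autoencoder written in plain Python stands in for the deep architectures, fine-tuning is done by simply continuing reconstruction training with a smaller learning rate, and a nearest-centroid rule stands in for the traditional classifiers.

```python
import random

rng = random.Random(0)        # fixed seed so the sketch is reproducible
N_GENES, N_LATENT = 8, 2      # toy sizes; real profiles have thousands of genes
# Hypothetical mixing matrix: each synthetic profile is driven by two hidden factors.
MIX = [[rng.uniform(-1, 1) for _ in range(N_LATENT)] for _ in range(N_GENES)]

def sample(center):
    """Draw one synthetic expression profile around the given factor center."""
    f = [rng.gauss(c, 0.5) for c in center]
    return [sum(m * fj for m, fj in zip(row, f)) + rng.gauss(0, 0.1) for row in MIX]

def encode(E, x):
    """Latent code z = E x for one sample."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in E]

def train(X, E, D, epochs, lr):
    """Batch gradient descent on the mean squared reconstruction error of a
    linear autoencoder (encoder E, decoder D). Updates E and D in place and
    returns the loss measured at the start of each epoch."""
    k, d, n = len(E), len(D), len(X)
    history = []
    for _ in range(epochs):
        gE = [[0.0] * d for _ in range(k)]
        gD = [[0.0] * k for _ in range(d)]
        loss = 0.0
        for x in X:
            z = encode(E, x)
            # Residual r = D z - x
            r = [sum(D[i][j] * z[j] for j in range(k)) - x[i] for i in range(d)]
            loss += sum(ri * ri for ri in r)
            back = [sum(D[i][j] * r[i] for i in range(d)) for j in range(k)]  # D^T r
            for i in range(d):
                for j in range(k):
                    gD[i][j] += 2 * r[i] * z[j]      # dL/dD = 2 r z^T
                    gE[j][i] += 2 * back[j] * x[i]   # dL/dE = 2 (D^T r) x^T
        for i in range(d):
            for j in range(k):
                D[i][j] -= lr * gD[i][j] / n
                E[j][i] -= lr * gE[j][i] / n
        history.append(loss / n)
    return history

def fit_centroids(Z, y):
    """Mean latent code per class label."""
    cents = {}
    for label in set(y):
        rows = [z for z, lab in zip(Z, y) if lab == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, z):
    """Label of the closest centroid in squared Euclidean distance."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(z, cents[lab])))

# Large heterogeneous "pan-cancer" stand-in and a small labeled "breast" stand-in.
pan_cancer = [sample((rng.uniform(-2, 2), rng.uniform(-2, 2))) for _ in range(120)]
CENTERS = {0: (-1.5, 0.5), 1: (1.5, -0.5)}   # two made-up subtypes
labels = [i % 2 for i in range(40)]
breast = [sample(CENTERS[y]) for y in labels]
X_train, y_train = breast[:30], labels[:30]
X_test, y_test = breast[30:], labels[30:]

E = [[rng.uniform(-0.1, 0.1) for _ in range(N_GENES)] for _ in range(N_LATENT)]
D = [[rng.uniform(-0.1, 0.1) for _ in range(N_LATENT)] for _ in range(N_GENES)]

pretrain_loss = train(pan_cancer, E, D, epochs=150, lr=0.01)  # unsupervised pre-training
finetune_loss = train(X_train, E, D, epochs=50, lr=0.005)     # fine-tuning, smaller step

# A traditional classifier operates on the learned low-dimensional codes.
centroids = fit_centroids([encode(E, x) for x in X_train], y_train)
predictions = [predict(centroids, encode(E, x)) for x in X_test]
accuracy = sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)
print(f"pre-training loss {pretrain_loss[0]:.3f} -> {pretrain_loss[-1]:.3f}, "
      f"test accuracy {accuracy:.2f}")
```

The design point the sketch tries to convey is that the encoder is learned mostly from the large unlabeled pool, so the small labeled set only has to adjust it and fit a simple classifier in the reduced space. A realistic pipeline would use a deep learning framework, nonlinear layers, and real expression data.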