
Investigating the effect of different feature selection strategies for classification of gene expression signatures of tumor cells

Tümör hücrelerin gen ifade imzalarinin siniflandirilmasina ilişkin farkli özellik seçim stratejilerinin etkisinin incelenmesi

  1. Thesis No: 529682
  2. Author: ABUBAKHARI SSERWADDA
  3. Advisors: ASST. PROF. DR. YUSUF YASLAN
  4. Thesis Type: Master's
  5. Subjects: Biostatistics
  6. Keywords: Not specified.
  7. Year: 2017
  8. Language: English
  9. University: İstanbul Teknik Üniversitesi
  10. Institute: Fen Bilimleri Enstitüsü
  11. Department: Computer Engineering
  12. Division: Computer Engineering
  13. Page Count: 70

Abstract

This thesis addresses the prediction of distant metastasis in breast cancer microarray gene expression datasets. To this end, gene selection and classification algorithms were applied to these breast cancer microarray datasets. The datasets contain the numerical expression levels of a large number of genes measured with microarray technology, which enables the simultaneous analysis of thousands of genes. To produce meaningful and reliable predictions from the datasets studied, this thesis covers the acquisition of the datasets, the labeling of the data samples, data preprocessing and normalization, differential expression analysis procedures, the application of machine learning gene selection and classification algorithms, and the analysis of the classification performance of the gene subsets obtained from the individual gene selection algorithms. In addition, we discuss the classification accuracy obtained from the overlapping gene subsets, that is, the genes common to the different individual machine learning and differential expression feature selection techniques we used.

Abstract (Translation)

Different researchers identify informative gene subsets for the same cancers, but the subsets overlap very little, probably because each study uses a different individual feature selection algorithm. To reduce this problem, we tried three methods for feature (gene) selection: Random Forest (RF) feature importance, LASSO, and Differential Expression Analysis (DEA). We ranked the genes and selected the top 300 from each method. Random Forest was used as the classification algorithm for all three feature selection methods. Each experiment was run 10 times, and we report the resulting means and standard deviations. When each of the three feature selection methods was used on its own, classification accuracy was generally poor, in the range of 50-60%, which is close to random classification. We then proposed strategies for combining the overlapping genes selected by the different machine learning and differential expression analysis techniques, with the aim of maximizing overall classification accuracy on the breast cancer microarray datasets. As an example of the problem we are tackling, researchers have independently proposed individual feature selection algorithms and built relatively large subsets of 70 and 76 genes for classification and clinical prediction of breast cancer metastasis, yet the two subsets share only one or a few genes even though they were derived from similar datasets. This calls into question the reliability of the genes proposed for clinical prediction in different cancer cases.
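
The ranking and reporting steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation (the thesis work was done in R): the function names `top_k_genes` and `summarize_runs`, the gene labels, and every number below are hypothetical placeholders rather than values from the thesis.

```python
from statistics import mean, stdev


def top_k_genes(scores, k=300):
    """Return the k gene names with the highest scores, best first.

    `scores` maps gene name -> score. What the score is depends on the
    method (assumption: e.g. an RF importance, an absolute LASSO
    coefficient, or a differential expression statistic).
    """
    ranked = sorted(scores, key=lambda gene: scores[gene], reverse=True)
    return ranked[:k]


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(accuracies), stdev(accuracies)


# Toy scores for five placeholder genes (not real data).
toy_scores = {"g1": 0.10, "g2": 0.45, "g3": 0.05, "g4": 0.30, "g5": 0.20}
selected = top_k_genes(toy_scores, k=3)

# Toy accuracies standing in for 10 repeated experiments (not thesis results).
toy_accuracies = [0.5, 0.6, 0.5, 0.6, 0.5, 0.6, 0.5, 0.6, 0.5, 0.6]
avg, sd = summarize_runs(toy_accuracies)
print(selected, round(avg, 3), round(sd, 3))
```

Reporting the standard deviation alongside the mean matters here because a randomized classifier such as Random Forest can give a different accuracy on each run.
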
In the data preprocessing stage, after extracting the raw datasets from the NCBI GEO database, we normalized them with RMA (Robust Multi-array Average), which outputs log2-transformed expression values that are closer to normally distributed, and we removed background noise that does not come from the biological experiments. We then computed the Present, Marginal and Absent (P, M and A) calls, which label "expressed" genes. We eliminated genes called Absent or Marginal and retained only those called Present, which reduced the number of genes from forty or fifty thousand, depending on the dataset, to around one to two thousand in each of the five datasets we used. We applied a t-test, using p-values and log ratios, to identify genes that are differentially expressed between the two classes. All of these statistical and mathematical analyses were carried out in the R programming environment with Bioconductor packages.
We formed four subsets from the intersections of the genes selected by the three independent feature selection algorithms, and a fifth subset as the union of the genes selected by at least two feature selection methods. Specifically, the four intersections are LASSO and DE (differential expression analysis); Random Forest (RF) and LASSO; RF and DE; and RF, DE and LASSO together. The fifth subset is the union of all these intersections. We analyzed the classification accuracy obtained with each of these five new subsets. In general, the subset from the intersection of LASSO and DE gave the best classification accuracy in the majority of the datasets. Its genes also offer greater confidence for classification and clinical diagnosis of breast cancer metastasis, because they are differentially expressed between the classes of the datasets.
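
The five combined subsets described above can be expressed with plain set operations. The following is a minimal sketch under stated assumptions, not the thesis code (which was written in R): the function name `combine_gene_subsets`, the dictionary keys, and the gene labels are all hypothetical.

```python
def combine_gene_subsets(rf, lasso, de):
    """Build five combined subsets from three selected gene collections."""
    rf, lasso, de = set(rf), set(lasso), set(de)
    subsets = {
        "LASSO&DE": lasso & de,
        "RF&LASSO": rf & lasso,
        "RF&DE": rf & de,
        "RF&DE&LASSO": rf & de & lasso,
    }
    # The union of the pairwise intersections is exactly the set of genes
    # chosen by at least two of the three methods.
    subsets["AT_LEAST_TWO"] = (
        subsets["LASSO&DE"] | subsets["RF&LASSO"] | subsets["RF&DE"]
    )
    return subsets


# Placeholder gene labels, not genes from the thesis.
rf_genes = ["g1", "g2", "g3", "g4"]
lasso_genes = ["g2", "g3", "g5"]
de_genes = ["g3", "g4", "g5", "g6"]

combined = combine_gene_subsets(rf_genes, lasso_genes, de_genes)
for name, genes in combined.items():
    print(name, sorted(genes))
```

The three-way intersection is always contained in each pairwise intersection, so adding it to the union would not change the fifth subset.
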
Because the random forest algorithm is randomized, it does not select exactly the same genes for a fixed subset size across iterations, unlike LASSO and DE. We therefore encourage future researchers in related fields to use the LASSO and DE linear model techniques for gene selection. We annotated the informative genes in our final subsets with their biological names using the online Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8, and also carried out gene-annotation enrichment analysis. A number of genes in our final subsets have been reported as related to breast cancer metastasis in several publications on related research and in other cancer resources such as the American Cancer Society. Some of the genes are known basic prognostic genes, others are in clinical trials, and others are observational genes for breast cancer metastasis. Our top identified biomarkers for breast cancer metastasis include, among others, TP53(WRAP53), A4(ANXA4), 6(PTPN6), 1(FXR1), (EIF4G1), GCH1, DHX15, ORC4, MCM and PDLIM5.

Similar Theses

  1. Uyum gösteren cepheler: Bir meta analizi

    Adaptive facades: A meta analysis

    SELİN KARAAĞAÇ

    Master's

    Turkish

    2020

    Architecture, İstanbul Teknik Üniversitesi

    Department of Architecture

    ASSOC. PROF. DR. İKBAL ÇETİNER

  2. Kent ve kentli kimliğinin günümüz konut lansmanları üzerinden okunması: İstanbul'daki son dönem kapalı konut siteleri

    Reading of the city and the citizen identity on contemporary housing launches: Recent gated communities in Istanbul

    HÜMEYRA KILIÇ

    Master's

    Turkish

    2015

    Architecture, İstanbul Teknik Üniversitesi

    Department of Urban Design

    ASSOC. PROF. DR. HATİCE AYATAÇ

  3. Aktivite bazlı kalite maliyetleme sistemi

    Activity based quality costs system

    BEYTULLAH ÖMER MUTLUGÜN

    Master's

    Turkish

    1996

    Business Administration, İstanbul Üniversitesi

    ASST. PROF. DR. NECDET ÖZÇAKAR

  4. Otomasyon yönetiminde insan faktörü ve Türk Otomotiv Sektöründe bir uygulama

    Human factors in automation management and an application in the Turkish Automotive Industry

    HALUK KÜÇÜK

    Master's

    Turkish

    1995

    Engineering Sciences, İstanbul Teknik Üniversitesi

    ASSOC. PROF. DR. İ. HAKKI BİÇER