Yüksek boyutlu verilerde eksik veri değer atama yöntemlerinin sınıflandırma performansına etkisinin simülasyonla karşılaştırılması

Comparison of the effects of missing data imputation methods on classification performance in high-dimensional data through simulation


Thesis No: 828536
Author: BUĞRA VAROL
Advisors: PROF. DR. İMRAN KURT ÖMÜRLÜ
Thesis Type: Doctorate
Subjects: Biostatistics
Keywords: Extreme Learning Machines, Missing Data, Imputation, Classification, Simulation
Year: 2023
Language: Turkish
University: Aydın Adnan Menderes Üniversitesi
Institute: Institute of Health Sciences
Department: Department of Biostatistics
Division: Not specified.
Page Count: 114

Abstract

Objective: This study examines how accurately different missing data imputation methods estimate missing values in simulated high-dimensional data, and how they affect classification performance with extreme learning machines (ELM).

Materials and Methods: Random datasets were generated with n=150 observations, a binary dependent variable, and p=500 independent variables, under different data structures, missing data rates, and correlation levels. Missing values were then created under the missing at random (MAR) mechanism. The missing values were imputed with the mean, median, random, k-nearest neighbors (KNN), random forest imputation (I-RF), and multivariate imputation by chained equations based on classification and regression trees (MICE-CART) methods, as well as with the direct use of regularized regression (DURR) and the indirect use of regularized regression (IURR) methods, which were developed for high-dimensional data. Across simulations of 1000 iterations, the methods' performance in estimating missing values was evaluated by how close their ELM classification scores were to the reference.

Findings: According to the hierarchical clustering analysis applied to the simulation results, methods that performed similarly across the varying missing rates and correlation levels fell into the same cluster. In the algorithm where the variables with missing data were associated with a specific set of variables in the dataset, for all correlation levels, the I-RF, MICE-CART, DURR, and IURR methods, followed by KNN, performed close to the reference and to each other at low missing rates, while at high missing rates only the DURR and IURR methods did so. In the second simulation algorithm, where the data were generated completely at random, the performances of the methods were close to each other across all correlation levels and missing rates.

Conclusion: When the data are generated completely at random, the estimation performance of the methods used in this study is not affected by the relationships between variables or by the missing rate. However, when the variables with missing data are associated with a specific set of variables in the dataset, the DURR and IURR methods in particular are more effective than the others. These methods are less affected than the other methods by the relationships between variables and by changes in the missing rate.
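The simulation design described in the abstract can be illustrated with a small sketch. It is not the thesis code: the equicorrelated Gaussian design, the rank-based MAR rule, the choice of columns, the 100/50 train/test split, and the names `make_data`, `add_mar`, `impute`, `ELM`, and `run_once` are all assumptions made for illustration. Only mean and median imputation are shown, and plain accuracy stands in for the classification scores used in the study.

```python
import numpy as np

def make_data(rng, n=150, p=500, rho=0.5):
    """Equicorrelated Gaussian predictors and a binary outcome (assumed design)."""
    shared = rng.standard_normal((n, 1))
    X = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.standard_normal((n, p))
    # Assumption: the outcome depends on columns 1..5, which later receive missing values.
    y = (X[:, 1:6].sum(axis=1) + rng.standard_normal(n) > 0).astype(int)
    return X, y

def add_mar(rng, X, target_cols, driver_col, rate):
    """MAR: missingness in target_cols depends only on a fully observed driver column."""
    n = X.shape[0]
    ranks = X[:, driver_col].argsort().argsort() + 1   # ranks 1..n of the driver
    prob = 2.0 * rate * ranks / (n + 1)                 # averages to `rate`; needs rate <= 0.5
    mask = rng.random((n, len(target_cols))) < prob[:, None]
    X_miss = X.copy()
    block = X_miss[:, target_cols]                      # fancy indexing returns a copy
    block[mask] = np.nan
    X_miss[:, target_cols] = block
    return X_miss

def impute(X_miss, how="mean"):
    """Column-wise mean or median imputation."""
    stat = np.nanmean if how == "mean" else np.nanmedian
    fill = stat(X_miss, axis=0)
    return np.where(np.isnan(X_miss), fill, X_miss)

class ELM:
    """Minimal extreme learning machine: random hidden layer, least-squares output weights."""
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden, self.seed = n_hidden, seed

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.W = rng.standard_normal((X.shape[1], self.n_hidden)) / np.sqrt(X.shape[1])
        self.b = rng.standard_normal(self.n_hidden)
        # Output weights via the Moore-Penrose pseudo-inverse, targets coded as -1/+1.
        self.beta = np.linalg.pinv(self._hidden(X)) @ (2 * y - 1)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)

def run_once(seed, rate=0.2, rho=0.5, n_train=100):
    """One simulation replicate: test accuracy on complete (reference) and imputed data."""
    rng = np.random.default_rng(seed)
    X, y = make_data(rng, rho=rho)
    X_miss = add_mar(rng, X, target_cols=np.arange(1, 51), driver_col=0, rate=rate)
    versions = {"reference": X,
                "mean": impute(X_miss, "mean"),
                "median": impute(X_miss, "median")}
    scores = {}
    for name, Xv in versions.items():
        model = ELM(seed=seed).fit(Xv[:n_train], y[:n_train])
        scores[name] = float((model.predict(Xv[n_train:]) == y[n_train:]).mean())
    return scores

if __name__ == "__main__":
    # A handful of replicates; the study reports 1000 iterations per scenario.
    results = [run_once(seed) for seed in range(5)]
    for name in ("reference", "mean", "median"):
        print(name, round(float(np.mean([r[name] for r in results])), 3))
```

Under this sketch, a method would be judged by how small the gap is between its score and the `reference` score, averaged over replicates and repeated for each missing rate and correlation level. Other imputation methods could be added as further entries in `versions`.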

Similar Theses

Thesis No: 827860
Title: Forecasting the spread of COVID-19 using deep learning and big data analytics methods
Turkish Title: Derin öğrenme ve büyük veri analitiği yöntemleri kullanarak Covid-19 yayılımının ileriye dönük tahmini
Author: CYLAS KIGANDA
Thesis Type: Master's
Language: English
Year: 2023
Subject: Computer Engineering and Computer Science and Control
University: Gazi Üniversitesi
Department: Department of Computer Science
Advisors: PROF. DR. MUHAMMET ALİ AKCAYOL

Thesis No: 765502
Title: Data-driven prediction and emergency control of transient stability in power systems towards a risk-based optimal power flow operation
Turkish Title: Güç sistemlerinde risk tabanlı optimal güç akışı işletimine yönelik geçici hal kararlılığın veri güdümlü tahmini ve acil durum kontrolü
Author: SEVDA JAFARZADEH
Thesis Type: Doctorate
Language: English
Year: 2022
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Department of Electrical Engineering
Advisors: PROF. VEYSEL MURAT İSTEMİHAN GENÇ

Thesis No: 864178
Title: Early detection of distributed denial of service attacks
Turkish Title: Dağıtık hizmet engelleme saldırılarının erken tespiti
Author: KAĞAN ÖZGÜN
Thesis Type: Master's
Language: English
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Computer Engineering
Advisors: ASSOC. PROF. DR. AYŞE TOSUN KÜHN; ASST. PROF. DR. MEHMET TAHİR SANDIKKAYA

Thesis No: 895465
Title: Vector-driven: A new projection and backprojection algorithm based on vector mapping
Turkish Title: Vector-driven: Vektör haritalamasına dayalı yeni bir projeksiyon ve ters projeksiyon algoritması
Author: İSMAİL MELİK TÜRKER
Thesis Type: Master's
Language: English
Year: 2024
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Department of Electronics and Communication Engineering
Advisors: ASSOC. PROF. DR. İSA YILDIRIM

Thesis No: 940901
Title: Spatio-temporal estimation of PM2.5 concentrations in Gaziantep using transfer learning-based hybrid artificial intelligence models
Turkish Title: Gaziantep'te PM2.5 konsantrasyonunun zamansal ve mekânsal tahminine yönelik transfer öğrenme destekli hibrit yapay zeka modelleri
Author: TÜRKAN ZENGİN GÖMLEKSİZ
Thesis Type: Master's
Language: Turkish
Year: 2025
Subject: Meteorology
University: İstanbul Teknik Üniversitesi
Department: Department of Climate Science and Meteorological Engineering
Advisors: PROF. DR. HÜSEYİN TOROS
