Sınıflandırmada kullanılan veri madenciliği yöntemlerinin performanslarının veri seti özelliklerine göre karşılaştırılması

Comparison of performance of data mining methods used for classification in terms of data characteristics


Thesis No: 630646
Author: GÖRKEM CEYHAN
Advisor: PROF. DR. İSMAİL KARAKAYA
Thesis Type: Doctoral
Subjects: Education and Training
Keywords: Artificial Neural Networks, Random Forest Algorithm, Support Vector Machine, Classification and Regression Trees, Logistic Regression, PISA 2015
Year: 2020
Language: Turkish
University: Gazi Üniversitesi
Institute: Institute of Educational Sciences
Department: Educational Sciences
Division: Not specified.
Page Count: 237

Abstract

The aim of this study is to examine the classification performance of Artificial Neural Networks (ANN), the Random Forest algorithm (RF), Support Vector Machines (SVM), Classification and Regression Trees (CART), and Logistic Regression (LR) on PISA 2015 science achievement scores with respect to the number of categories of the dependent variable, the number of independent variables, and sample size. The study uses data from 169326 of the 15-year-old students who took part in PISA 2015. These are the students who remained after students were removed from the data set on the basis of variables from questionnaires that were not administered in all countries. The independent variables for the classification models were selected using a correlation analysis computed on the continuous scores of the dependent variable, together with VIF values. Based on these results, the independent variables selected were mathematics achievement score; the index of economic, social and cultural status; epistemological beliefs; cultural possessions at home; enjoyment of science; home educational resources; home possessions; environmental awareness; learning time; and science self-efficacy.

The performance of the methods was first examined by the number of categories of the dependent variable. For the cases in which the dependent variable has 2, 3, and 6 classes, 25 study groups of size 5000 each were drawn from the data set by random selection, taking into account the percentage of students in each class. The ANN, RF, SVM, CART, and LR methods were then applied to classification models containing 10 independent variables. The results show that the classification performance of every method increases as the number of classes of the dependent variable decreases.

Performance by number of independent variables was examined for the 2-category dependent variable, the case in which the methods showed their best classification performance. The ANN, RF, SVM, CART, and LR methods were applied to classification models containing 10, 7, and 4 independent variables, respectively. The classification performance of the methods showed no significant change with the number of independent variables. The mathematics achievement score, which is highly correlated with the science achievement score, was then removed from the models, and the five methods were applied again to classification models with 9, 6, and 3 independent variables. When the mathematics achievement score is absent from the models, the classification performance of every method increases as the number of independent variables increases.

Performance by sample size was examined for the classification models with a 2-category dependent variable and 10 independent variables, the case in which the methods showed their best classification performance. For each of the sample sizes 100, 250, 500, 1000, 2500, and 5000, 25 study groups were formed from the data set by random selection. The results show that the classification performance of every method varies with sample size. ANN, SVM, and LR produced higher and mutually similar values at sample sizes of 500 and above than at 100 and 250, and therefore classified better at those sizes. RF and CART showed higher classification performance at sample sizes of 1000 and above than at 100, 250, and 500.

The methods were also compared with one another under all conditions. The results indicate that ANN, SVM, and LR classify better than RF and CART. In addition, ANN, SVM, and LR performed similarly to one another, while RF performed better than CART.
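The abstract names VIF (variance inflation factor) values as one input to the choice of independent variables, without stating the software or the cutoff used. As a rough illustration only, the sketch below computes VIFs as the diagonal of the inverse of the predictors' correlation matrix, which is a standard identity. The function names (`pearson`, `invert`, `vif`) and the toy data are hypothetical and are not taken from the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Map each predictor name to its VIF.

    columns: dict of predictor name -> list of values (hypothetical input shape).
    VIF_j is the j-th diagonal entry of the inverse correlation matrix.
    """
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}


if __name__ == "__main__":
    # Toy data: x and y correlate at 0.8, so each VIF is 1 / (1 - 0.64).
    print(vif({"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]}))
```

A large VIF means a predictor is close to a linear combination of the others. Which threshold counts as too large is a convention, and the abstract does not say which one was applied.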

Abstract (Translation)

This study examines the classification performance of Artificial Neural Networks (ANN), the Random Forest algorithm (RF), Support Vector Machines (SVM), Classification and Regression Trees (CART), and Logistic Regression (LR) on PISA (2015) science achievement scores in terms of the number of classes of the dependent variable, the number of independent variables, and sample size. The population of the research is all 15-year-old students who participated in the PISA (2015) application. The target universe consists of the 169326 students who remained after students were excluded from the data set on the basis of variables from questionnaires that were not applied in all countries. The independent variables for the classification models were determined using a correlation analysis based on the continuous scores of the dependent variable, together with VIF values. In line with the results, the variables chosen as independent variables are mathematics achievement score; economic, social and cultural status index; epistemological beliefs; cultural possessions at home; enjoyment of science; home educational resources; home possessions; environmental awareness; learning time; and science self-efficacy.

The performances of the methods were first examined according to the number of classes of the dependent variable. For this purpose, for the cases in which the dependent variable has 2, 3, and 6 classes, 25 samples of size 5000 were drawn from the target universe by random selection, taking into account the student percentages of the classes. The ANN, RF, SVM, CART, and LR methods were then applied to the classification models with ten independent variables. The results show that the classification performance of all methods increases when the number of classes of the dependent variable decreases.

The performances of the methods according to the number of independent variables were examined for the two-class dependent variable, the case in which the methods showed the best classification performance. The ANN, RF, SVM, CART, and LR methods were applied to classification models with 10, 7, and 4 independent variables, respectively. The classification performances of the methods did not change significantly with the number of independent variables. The mathematics achievement score, which is highly correlated with the science achievement score, was then removed from the models, and the ANN, RF, SVM, CART, and LR methods were applied to classification models with 9, 6, and 3 independent variables. According to the results, when the mathematics achievement score is not included in the models, the classification performance of all methods increases as the number of independent variables increases.

The performances of the methods according to sample size were examined using the classification models with the two-class dependent variable and ten independent variables, in which the methods showed the best classification performance. Twenty-five samples were drawn from the target universe by random selection for each sample size of 100, 250, 500, 1000, 2500, and 5000. The results show that the classification performance of all methods varies with sample size. The ANN, SVM, and LR methods produced higher values, similar to one another, at sample sizes of 500 and above than at 100 and 250, and therefore showed better classification performance at those sizes. The RF and CART methods produced higher values at sample sizes of 1000 and above than at 100, 250, and 500.

The performances of the methods were also compared with one another under all conditions, and it was concluded that the classification performance of the ANN, SVM, and LR methods was better than that of the RF and CART methods. In addition, the ANN, SVM, and LR methods performed similarly to each other, and the RF method showed better classification performance than the CART method.
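The sampling design in the abstract (25 randomly drawn study groups per condition, with the class percentages of the dependent variable taken into account) could be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis procedure: proportional allocation with largest-remainder rounding is one plausible reading of "considering the student percentages of the classes", and the names `class_quotas` and `draw_study_groups` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_quotas(labels, group_size):
    """Split group_size across classes in proportion to their share of labels.

    Largest-remainder rounding makes the quotas sum to group_size exactly.
    (Assumed allocation rule; the abstract does not specify one.)
    """
    counts = Counter(labels)
    total = len(labels)
    exact = {c: group_size * n / total for c, n in counts.items()}
    quotas = {c: int(x) for c, x in exact.items()}
    leftover = group_size - sum(quotas.values())
    # Give the remaining seats to the classes with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas


def draw_study_groups(labels, group_size, n_groups=25, seed=0):
    """Return n_groups lists of row indices, each a proportional stratified sample."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for i, c in enumerate(labels):
        rows_by_class[c].append(i)
    quotas = class_quotas(labels, group_size)
    groups = []
    for _ in range(n_groups):
        picked = []
        for c, k in quotas.items():
            # Sample without replacement inside a group; groups may overlap.
            picked.extend(rng.sample(rows_by_class[c], k))
        rng.shuffle(picked)
        groups.append(picked)
    return groups


if __name__ == "__main__":
    # Toy two-class outcome with a 60/40 split (made-up data).
    labels = ["low"] * 600 + ["high"] * 400
    for size in (100, 250, 500):
        groups = draw_study_groups(labels, group_size=size, n_groups=25, seed=1)
        print(size, len(groups), Counter(labels[i] for i in groups[0]))
```

Each list of indices would then be used to fit and score the five classifiers, so that every method sees the same 25 groups within a condition. Whether the groups were allowed to overlap is not stated in the abstract; the sketch allows it.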

Similar Theses

Thesis No: 927419
Title: Sağlık harcamaları belirleyicilerinin veri madenciliği yöntemleri ile modellenmesi
Title (English): The modeling of determinants of health expenditures using data mining methods
Author: MELİHA MELİŞ GÜNALTAY
Thesis Type: Doctoral
Language: Turkish
Year: 2024
Subject: Health Management
University: Ankara Üniversitesi
Department: Health Management
Advisor: PROF. DR. GÜLBİYE YAŞAR

Thesis No: 953712
Title: Enhancing botnet detection using federated learning in IoT networks
Title (Turkish): IoT ağlarında federe öğrenme yöntemini kullanarak botnet tespitinin geliştirilmesi
Author: NİLÜFER USLAN
Thesis Type: Master's
Language: English
Year: 2025
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Computer Engineering
Advisor: PROF. DR. ŞERİF BAHTİYAR

Thesis No: 869634
Title: Bağımsız denetim görüşlerinin tahmin edilmesinde veri madenciliği yöntemlerinin karşılaştırılması: Borsa İstanbul'da bir uygulama
Title (English): Comparison of data mining methods for audit opinions prediction: An application in Borsa Istanbul
Author: ZAFER KARDEŞ
Thesis Type: Doctoral
Language: Turkish
Year: 2024
Subject: Business Administration
University: Afyon Kocatepe Üniversitesi
Department: Business Administration
Advisor: PROF. DR. TUĞRUL KANDEMİR

Thesis No: 804910
Title: Yapay zeka yöntemleri ile uzaktan eğitimdeki sorunların tespiti ve öğrencilerin akademik performanslarının tahmin edilmesi
Title (English): Detecting the problems in distance education and predicting the academic performance of students by using artificial intelligence methods
Author: HALİT IRMAK
Thesis Type: Doctoral
Language: Turkish
Year: 2023
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Üniversitesi
Department: Informatics
Advisor: DOÇ. DR. ZÜMRÜT ECEVİT SATI

Thesis No: 763872
Title: Dizel makinanın makina öğrenmesi yöntemi kullanılarak modellenmesi ve karar-destek mekanizması oluşturulması
Title (English): Machine learning method based marine diesel engine modelling and decision-support system setting
Author: TOLGA ŞAHİN
Thesis Type: Doctoral
Language: Turkish
Year: 2022
Subject: Mechanical Engineering
University: İstanbul Teknik Üniversitesi
Department: Mechanical Engineering
Advisor: PROF. DR. CEVAT ERDEM İMRAK
