Yükseköğretim kurumlarındaki öğrenci terkini tahmin etmeye yönelik makine öğrenmesi modellerinin incelenmesi ve açıklanabilirliği

Analysis and explainability of machine learning models for predicting student dropout in higher education


Thesis No: 885424
Author: ESRA SİLER KARABACAK
Advisor: ASSOC. PROF. DR. YUSUF YASLAN
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2024
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Lisansüstü Eğitim Enstitüsü (Graduate School)
Department: Department of Computer Engineering
Discipline: Computer Engineering
Number of Pages: 65

Abstract

In this study, various machine learning models were examined to address the problem of student dropout in higher education institutions, with an emphasis on the explainability of these models. The dataset used in the study was obtained from Tecnologico de Monterrey and contains 143,326 records belonging to 121,584 students for the years 2014-2020. Because these data mix high-school and university levels, only the data of 64,641 university students were used. Because of the class imbalance in the dataset, the data were balanced using the SMOTE algorithm. The methods used in the study include embedding techniques and various machine learning algorithms. Embedding techniques were used to increase the representational power of the student data, and their effect on prediction performance was measured. The classification algorithms used are Multilayer Perceptron (MLP), Logistic Regression (LR), Decision Trees (DT), K-Nearest Neighbors (KNN), Naïve Bayes (NB), AdaBoost, XGBoost, and Random Forest (RF). These were chosen from among the algorithms frequently used in previous studies. Experimental results are presented comparatively for the dataset enriched with embedding techniques and for the raw dataset. Because the dataset is class-imbalanced, F1 score and ROC AUC were considered alongside accuracy as evaluation criteria, since they reflect how well the models separate the classes. The results showed that the XGBoost and Random Forest (RF) models achieved the highest performance on both the embedding-enriched dataset and the raw dataset. For the XGBoost and RF models, 5-fold cross-validation and test results were analyzed in detail. To make the models explainable, the LIME algorithm was used to explain the reasons behind individual predictions.
Explainability improves the understanding of the models and increases the transparency of educational processes. The results of the study show that explainable machine learning algorithms are effective in addressing the student dropout problem. In addition, enriching the data with embedding techniques appears promising. Future work aims to improve model performance by using larger datasets and different embedding techniques. Moreover, wider use of explainability methods will provide more valuable information to administrators and educators working to reduce student dropout.
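
The balancing step described in the abstract can be illustrated with a simplified sketch of the SMOTE idea: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is not the thesis code. The function name `smote_sketch`, the parameter values, and the toy data are all assumptions made for illustration, and a real study would more likely rely on a library implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: two numeric features for four dropout records.
dropouts = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(dropouts, n_new=6)
balanced_minority = dropouts + new_points
print(len(balanced_minority))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training split only, so that the test data keeps its original class distribution.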

Abstract (Translation)

This thesis addresses the student dropout problem in higher education institutions by applying various machine learning models and examining their explainability. The study uses a dataset obtained from Tecnologico de Monterrey containing records of 121,584 students from 2014 to 2020. After data cleaning, high-school students were removed and the records of 64,641 university students were used. Given the class imbalance in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) was applied to balance the data. The methods include embedding techniques to enhance the representation of categorical data and several classification algorithms. The results demonstrate the effectiveness of these methods in predicting student dropout and highlight the importance of model explainability.

Student dropout is a critical issue in higher education institutions, affecting not only the students' futures but also the institutions' performance and reputation. Accurately predicting which students are at risk of dropping out can enable timely interventions and support. This thesis explores the use of machine learning models to predict student dropout and examines the explainability of these models to ensure transparency and trust in the predictions.

The dataset used in this study was provided by Tecnologico de Monterrey and includes 143,326 records of 121,584 students from 2014 to 2020. It contains 50 variables covering various aspects of the students' academic performance and demographics. Because of the imbalance between the dropout and non-dropout classes, the SMOTE algorithm was applied to balance the training data, ensuring that the models are trained on a representative sample of both classes. Extensive data cleaning and processing were applied to the dataset following the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Embedding techniques were used to enhance the representation of categorical data.
By creating dense vector representations of features such as 'english.evaluation', 'school.cost', 'socioeconomic.level', 'failed.subject.first.period', 'dropped.subject.first.period', 'sports', 'culture', and 'leadership', the relationships between different categories were captured, providing a richer representation of the data.

The study implemented and compared various machine learning models, including:

• Multilayer Perceptron (MLP)
• Logistic Regression (LR)
• Decision Trees (DT)
• K-Nearest Neighbors (KNN)
• Naïve Bayes (NB)
• AdaBoost
• XGBoost
• Random Forest (RF)

These models were evaluated on both the original dataset and the dataset enhanced with embeddings. The performance metrics used are accuracy, precision, recall, F1 score, and ROC AUC.

The experimental results are presented in two parts: cross-validation results and test results. The performance of the models on the original dataset and on the dataset with embeddings was compared to assess the impact of embeddings on predictive performance. The cross-validation results showed that the XGBoost and Random Forest models performed best on both datasets. The average performance metrics for these models were significantly higher when embeddings were used, demonstrating the effectiveness of embeddings in capturing complex relationships in the data.

On the test set, the XGBoost and Random Forest models again showed superior performance. Accuracy, precision, recall, F1 score, and ROC AUC were all higher for the models trained on the dataset with embeddings than for those trained on the original dataset. The confusion matrices for the test results gave insight into the types of errors made by the models. For instance, the XGBoost model produced a high number of true negatives and relatively few false positives and false negatives, indicating its robustness in identifying students who are not at risk of dropping out.
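
The evaluation metrics named above all follow from the confusion matrix and the predicted scores. The sketch below computes them from scratch on a hypothetical toy example. The labels, scores, 0.5 threshold, and function names are assumptions for illustration and are not results from the thesis.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 (dropout) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 derived from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 3 dropouts (1) among 8 students.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7, 0.3, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

On imbalanced data, accuracy can look high even when most dropouts are missed, which is why precision, recall, F1, and ROC AUC are reported alongside it.
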
To ensure the transparency and trustworthiness of the models, the explainability of the predictions was assessed using the LIME algorithm. LIME explains individual predictions: by approximating the model locally with an interpretable model, it shows how much each feature contributes to a given prediction and highlights the most influential features behind the model's decision. This helps in understanding why a particular student is predicted to drop out or to continue their studies.

The findings of this study indicate that embedding techniques significantly enhance the performance of machine learning models in predicting student dropout. The use of explainability methods such as LIME ensures that the models are not only accurate but also transparent and interpretable, which is crucial for gaining the trust of educational administrators and policymakers.

Future work will focus on expanding the dataset to include more institutions and on exploring additional embedding techniques to further improve model performance. Incorporating real-time data and developing an early warning system for student dropout will also be considered.
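
The local-approximation idea behind LIME can be sketched in a few lines: perturb one instance, weight the perturbed points by their proximity to it, and fit a weighted linear model to the black-box outputs. The code below is a simplified illustration of that idea and does not use the `lime` package. The black-box function, its coefficients, the feature values, and all parameter settings are hypothetical; only the two feature names are taken from the abstract.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def local_surrogate(predict, instance, n_samples=500, scale=0.5,
                    width=1.0, seed=0):
    """Fit a proximity-weighted linear model to `predict` around `instance`.
    Returns [intercept, coef_1, ..., coef_d]."""
    rng = random.Random(seed)
    d = len(instance)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, scale) for v in instance]
        offsets = [zi - vi for zi, vi in zip(z, instance)]
        # Closer perturbations get larger weights (exponential kernel).
        w = math.exp(-sum(o * o for o in offsets) / width ** 2)
        row = [1.0] + offsets  # features centred on the instance
        y = predict(z)
        for i in range(d + 1):
            xtwy[i] += w * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += w * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)


def black_box(x):
    """Hypothetical dropout model: x = [failed.subject.first.period,
    english.evaluation]; the coefficients are made up for illustration."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.5 * x[1])))


student = [1.0, 1.0]  # hypothetical (scaled) feature values
intercept, w_failed, w_english = local_surrogate(black_box, student)
print(intercept, w_failed, w_english)
```

In this toy setting a positive coefficient means the feature pushes the prediction toward dropout near this student, and a negative one pushes it away. The coefficients describe the model's behaviour only in the neighbourhood of the explained instance, not globally.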

Similar Theses

Thesis No: 794302
Title: Yükseköğretimin metalaşmasına ilişkin vakıf yükseköğretim kurumlarındaki öğrenci ve öğretim elemanlarının görüşleri
Title (Translation): Views of students and academicians from foundation higher education institutions on commodification of higher education
Author: DENİZ FIRAT
Thesis Type: Doctorate
Language: Turkish
Year: 2023
Subject: Economics
University: Ankara Üniversitesi
Department: Department of Educational Administration and Policy
Advisor: PROF. DR. HASAN HÜSEYİN AKSOY

Thesis No: 892920
Title: Makine öğrenimi algoritmalarını kullanarak öğrenci akademik performans tahmini
Title (Translation): Student academic performance prediction using machine learning algorithms
Author: AIGERIM SULTANALI
Thesis Type: Master's
Language: Turkish
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: Gazi Üniversitesi
Department: Department of Information Systems
Advisor: PROF. DR. HASAN ÇAKIR

Thesis No: 117659
Title: Yükseköğretim öğrencilerinin anadili yeterliklerinin değerlendirilmesi
Title (Translation): Higher education students proficiency in their native language
Author: SERPİL SAMUR
Thesis Type: Master's
Language: Turkish
Year: 2002
Subject: Education and Training
University: Ankara Üniversitesi
Department: Department of Turkish Language Education and Teaching
Advisor: PROF. DR. CAHİT KAVCAR

Thesis No: 277992
Title: Yükseköğretim kurumlarındaki grafik eğitimi için temel tasarım eğitimi konulu örnek interaktif CD tasarımı ve tasarıma ilişkin öğretim elemanı ve öğrenci görüşleri
Title (Translation): Sample of an interactive CD design related to the education of basic design for graphic design at higher education institutions and opinions of professors and students on this sample design
Author: ÇAĞRI GÜMÜŞ
Thesis Type: Master's
Language: Turkish
Year: 2010
Subject: Education and Training
University: Gazi Üniversitesi
Department: Department of Applied Arts Education
Advisor: ASST. PROF. DR. TUTKU DİLEM KALAFAT ALPASLAN

Thesis No: 840992
Title: Teacher study groups as a model of continuous professional development for tertiary level EFL teachers: A case study
Title (Translation): Yükseköğretim kurumlarındaki İngilizce öğretmenlerinin sürekli mesleki gelişimi için bir model olarak öğretmen çalışma grupları: Bir durum çalışması
Author: SERHAT BAŞAR
Thesis Type: Doctorate
Language: English
Year: 2023
Subject: Education and Training
University: Dokuz Eylül Üniversitesi
Department: Department of Foreign Languages Education
Advisor: PROF. DR. HATİCE İREM ÇOMOĞLU
