Makine öğrenmesinde ayrık öbekleme ve sınıflandırma algoritmaları

Discrete clustering and classifications in machine learning


Thesis No: 609482
Author: KEREM KABİL
Advisor: Assoc. Prof. Dr. ATABEY KAYGUN
Thesis Type: Master's
Subjects: Science and Technology, Mathematics
Keywords: Not specified.
Year: 2019
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Fen Bilimleri Enstitüsü
Department: Department of Mathematical Engineering
Division: Mathematical Engineering
Page Count: 115

Abstract

This thesis aims to convey the mathematical foundations of several machine learning algorithms, such as K-Means, K-Nearest Neighbor, Naive Bayes, Decision Trees, Logistic Regression, and Support Vector Machines, in a comprehensive and accessible way supported by small examples; to show how a machine learning model should be built with these algorithms; and to show how the performance of such a model changes across datasets with different properties and how that performance should be interpreted. The importance of machine learning is steadily increasing. Reports and analyses prepared with descriptive analytics and diagnostic analytics are giving way to predictions and analyses prepared with predictive analytics and prescriptive analytics, because people are now more interested in what will happen in the future than in what happened in the past, and data about the future has become more valuable. Machine learning makes forward-looking predictions and inferences possible in the fields where data is more valuable. Machine learning is a field of artificial intelligence that aims to let a system learn from its past or current experience by means of a model and make predictions about similar events that may occur in the future. The success of a prediction made with machine learning is directly related to the model that was built. It is therefore very important to construct machine learning models, which rest on mathematical and statistical foundations, well. This creates the need for a firm command of the mathematical foundations of machine learning algorithms in order to build a good model. It is equally important to analyze the data to be worked on carefully.
Other requirements for a good machine learning model are applying data preprocessing steps such as data cleaning, missing-data checks, data transformation, and data scaling correctly to the dataset, and knowing which validation method should be used for which dataset. Different machine learning models can be used on a dataset, but not every model will perform equally well on it. Being able to judge which model is the most suitable for the dataset at hand is therefore at least as important as being able to build a good model, and a sound understanding of the performance metrics used to evaluate a model is of great importance. In this thesis, with these considerations in mind, the mathematical foundations of the machine learning algorithms are presented, and the processes of building a machine learning model and evaluating it are described. Models for the algorithms whose theory is presented are coded on three datasets taken from the UCI Machine Learning Repository (Mushroom, Congressional Voting Records, Tic-Tac-Toe), which differ in dimension, size, and data type and whose class variables are categorical. The coding was done in the Python programming language in Jupyter Notebook. Some outputs were visualized with Tableau Desktop.

Abstract (Translation)

In this thesis, we investigate commonly used classification and clustering algorithms in machine learning: K-Means, K-Nearest Neighbor, Naive Bayes, Decision Tree, Logistic Regression, and Support Vector Machines. Our goal is to present the mathematical background of these algorithms with small supporting examples, to explain how to build a machine learning model with them, and to show how the performance of such a model changes across datasets with different properties. We also aim to explain how these performance results should be interpreted. In addition, because machine learning requires people from different fields of study to work together, care must be taken to explain machine learning and its related topics in a way that people with different levels of expertise can understand; doing so is another goal of this thesis. The importance of machine learning has been increasing day by day, mainly because data is becoming more valuable. People and companies are now interested in questions such as "What will happen in the future?" and "How can we make it happen?". As a result, reports and analyses based on descriptive analytics and diagnostic analytics have begun to lose their importance, while predictions and analyses based on predictive analytics and prescriptive analytics are becoming more valuable. Machine learning makes it possible to work on predictive and prescriptive analytics in the fields where data is more valuable. For these reasons, machine learning is actively used in many areas, such as medicine and technology, and the number of these areas will grow over time.
Machine learning is the field of artificial intelligence that aims to enable a system to learn from past or current experience with the help of a model, and to make predictions about similar events that may occur in the future. The learning process begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples provided. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly. Machine learning models generally fall into two learning types: supervised learning and unsupervised learning. In supervised learning, the dataset needs to be separated into training data and test data. K-Nearest Neighbor, Naive Bayes, Decision Tree, Logistic Regression, and Support Vector Machines are all supervised learning algorithms. In supervised learning, we can build prediction and classification models based on both input and output data. In unsupervised learning, on the other hand, there is no need to separate the dataset into training data and test data. K-Means clustering and hierarchical clustering are unsupervised learning algorithms. In unsupervised learning, we can group and interpret data based on input data alone. The success of a prediction made with machine learning algorithms is directly related to the model that was built. It is therefore very important to build machine learning models, which rest on mathematical and statistical foundations, well. This calls for comprehensive knowledge of the mathematical background of machine learning algorithms in order to build a good model. It is also important to analyze the dataset under study well and to apply data preprocessing steps such as data cleaning, missing-data checks, data conversion, and data scaling correctly.
Choosing which validation method to use for which dataset, such as k-fold cross-validation, leave-one-out cross-validation, the hold-out method, or the re-substitution method, is another requirement for a good machine learning model. Because different validation methods may lead to overfitting or underfitting problems, the choice of validation method is directly related to prediction or classification performance. Because some validation methods have worse run time on some datasets, the choice is also related to model run time. Different machine learning models may be used on a dataset, but they may not deliver the same performance. Evaluating which model is the most accurate is therefore as important as building a model, and a good understanding of the performance metrics used in model evaluation is of great importance. Although all of these performance metrics are important, some practitioners act otherwise and assume that evaluating accuracy alone is enough. This can be a serious mistake when evaluating a machine learning model or comparing two or more models. Because not every dataset is balanced, we also need to calculate performance metrics other than accuracy. Precision, recall, and F-measure are other commonly used performance metrics in machine learning, and each has a different importance for the model. Beyond those, we investigate another performance metric in this thesis, called the Cohen Kappa Score or Cohen Kappa Statistic. The Cohen Kappa Score measures whether accuracy depends on chance. Because this metric validates accuracy in a way, it is a really important performance metric. In short, each performance metric in machine learning has a different importance and meaning.
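To make the point about accuracy on imbalanced data concrete, the following minimal sketch computes accuracy, precision, recall, F-measure, and Cohen's kappa for a toy binary problem. The function name `binary_metrics`, the toy labels, and the convention of returning 0.0 when a ratio is undefined are assumptions made for this illustration; none of them is taken from the thesis.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, F1 and Cohen's kappa for two label lists."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))

    # Counts for the class treated as "positive".
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / n
    # Assumed convention for this sketch: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical imbalanced data: 5 positives, 95 negatives,
# and a classifier that always predicts the majority class.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
scores = binary_metrics(y_true, y_pred)
print(scores)
```

With these toy labels the always-negative classifier reaches an accuracy of 0.95 while recall and kappa are both 0, which is the kind of gap that motivates reporting more than accuracy.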
Therefore, we calculate all of these performance metrics when we need to evaluate the performance of a machine learning model. We also calculate them to compare different machine learning models and to determine which model is the best. In this thesis, considering the sensitivities mentioned above, the mathematical backgrounds of machine learning algorithms are given, and the processes of building and evaluating a machine learning model are explained. To better explain the logic of the investigated algorithms, we present fully worked-out short examples for each algorithm covered in this thesis. In the last chapter, we apply these algorithms to different datasets taken from the UCI Machine Learning Repository and analyze their performance by evaluating the performance metric values for each algorithm on each dataset. These datasets differ in number of instances, data types, and dimension. In this thesis, all machine learning models are built using k-fold cross-validation. Because the performance scores of a model at each cross-validation step give us information about the performance on each fold, we calculate not only the average performance scores of the models on each dataset but also the performance score at each cross-validation step. Also in the last chapter, the Cohen Kappa Score and the accuracy metric are compared. With this comparison, we try to show that the Cohen Kappa Score serves as a validation of the accuracy metric. Thus, we can better compare different machine learning models and analyze their performance in depth. All of these machine learning algorithms are coded on the datasets taken from the UCI Machine Learning Repository using the Python programming language in Jupyter Notebook, and some of the graphs are visualized using Tableau.
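The per-fold evaluation described above can be sketched as follows. This is a minimal, dependency-free illustration and not the thesis code: the helper names (`k_fold_indices`, `hamming`, `predict_1nn`, `cross_validate`), the unshuffled contiguous folds, the 1-nearest-neighbor classifier with Hamming distance, and the six-row categorical toy dataset are all assumptions chosen to keep the example small.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def hamming(a, b):
    """Number of positions where two categorical rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def predict_1nn(train_X, train_y, row):
    """Label of the closest training row (ties go to the earliest row)."""
    best = min(range(len(train_X)), key=lambda i: hamming(train_X[i], row))
    return train_y[best]


def cross_validate(X, y, k):
    """Return the accuracy obtained on each of the k held-out folds."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(predict_1nn(train_X, train_y, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Hypothetical categorical data: the label follows the first feature,
# except for the last row, whose label is deliberately flipped.
X = [("x", "a"), ("o", "a"), ("x", "b"), ("o", "b"), ("x", "c"), ("o", "c")]
y = ["pos", "neg", "pos", "neg", "pos", "pos"]

fold_scores = cross_validate(X, y, k=3)
mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

On this toy data the first two folds score 1.0 and the last fold scores 0.5, so the average alone would hide which fold is responsible for the drop.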

Similar Theses

Thesis No: 418882
Title: Sınıflandırma problemlerinde meta-sezgisel optimizasyon yöntemlerinin özellik seçimi ve ayrıklaştırma amacıyla kullanımı
Title (Translation): Utilization of metaheuristic optimization methods for feature selection and discretization on classification problems
Author: İSMAİL KOÇ
Thesis Type: Master's
Language: Turkish
Year: 2016
Subject: Computer Engineering and Computer Science and Control
University: Selçuk Üniversitesi
Department: Department of Computer Engineering
Advisor: Asst. Prof. Dr. İSMAİL BABAOĞLU

Thesis No: 338228
Title: Fuzzy cognitive maps for emotion modeling
Title (Translation): Bulanık bilişsel haritalar yardımıyla insan duygularının modellenmesi
Author: HASAN MURAT AKINCI
Thesis Type: Master's
Language: English
Year: 2013
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Control Engineering
Advisor: Asst. Prof. Dr. ENGİN YEŞİL

Thesis No: 884316
Title: Penalized stable regression
Title (Translation): Cezalandırılmış stabil regresyon
Author: İREM SARIBAŞ
Thesis Type: Master's
Language: English
Year: 2024
Subject: Mathematics
University: İstanbul Teknik Üniversitesi
Department: Department of Mathematical Engineering
Advisor: Assoc. Prof. Dr. GÜL İNAN

Thesis No: 885338
Title: Metin ön işleme fazının makine öğrenmesinde sınıflandırmaya etkileri
Title (Translation): Effects of text preprocessing phase on classification in machine learning
Author: ESME GÜL TOPRAK
Thesis Type: Master's
Language: Turkish
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: Haliç Üniversitesi
Department: Department of Computer Engineering
Advisor: Assoc. Prof. Dr. ÜLVİYE HACIZADE

Thesis No: 729822
Title: Makine öğrenmesi algoritmaları ile kalp hastalığı tahmini
Title (Translation): Prediction of heart disease with machine learning algorithms
Author: GÜNEŞ GÜRSOY
Thesis Type: Master's
Language: Turkish
Year: 2022
Subject: Computer Engineering and Computer Science and Control
University: Maltepe Üniversitesi
Department: Department of Computer Engineering
Advisor: PROF. DR. ASAF VAROL
