Yazılım projelerinde iş gücü tahmini için makine öğrenmesi yöntemlerinin karşılaştırılması

Comparison of machine learning methods for software project effort estimation

Thesis No: 517471
Author: VEHBİ YURDAKURBAN
Advisors: PROF. DR. TAKUHİ NADİA ERDOĞAN
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2018
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Fen Bilimleri Enstitüsü
Department: Computer Engineering
Division: Not specified.
Page Count: 57

Abstract

Software projects are strategically important for nearly all companies, regardless of sector, in running their operations. This importance has grown steadily over the last 10 years. Both software companies and many companies of various sizes whose core business is not software develop software in-house and commission software projects from software houses. In planning software projects, estimating effort accurately is important for predicting project costs correctly and for completing projects on time. Research on software effort estimation has been carried out for a long time. Software effort estimation methods are examined under subheadings such as algorithmic methods, statistical methods and Machine Learning based methods. Within the scope of the thesis, these studies were evaluated, a literature survey was carried out, and the prominent methodologies are presented in the literature review chapter. Three different Machine Learning based effort estimation methods were selected as the subject of the thesis. These methods are based on the Decision Tree, Naive Bayes and Multiple Regression Analysis models. A Windows-based desktop application was developed to run the models. Publicly available Machine Learning components were selected and used to implement the models in software. Many earlier studies were examined and taken as a basis when selecting the parameters used to train and test the models; however, parameters were eliminated because not all parameters could be obtained retrospectively for some projects and because the values of some parameters were not distinguishing. All three models were trained with the same training data set and tested with the same test data set. The training and test data were obtained from a software house that has been operating in the national software market for 10 years. The data set contains parameter data for 64 different software projects. The k-fold validation method was used to train and test the models.
By using this method, all data in the data set were evaluated for both training and testing. For all three models, after the algorithm was run on the test data, the estimated effort values were compared with the actual effort values, and error measures were determined for the effort values estimated by the Machine Learning algorithms. The Magnitude of Relative Error (MRE), Mean Magnitude of Relative Error (MMRE) and Prediction Quality (PRED(25)) values were used to determine the error measures. When the results were compared, the Multiple Regression Analysis model was found to give the most accurate estimates. The Decision Tree model ranked second and the Naive Bayes model last. This is thought to be because the values of some of the selected parameters in the test set were not sufficiently distinguishing. To improve the results, approaches such as changing the parameter set, retraining the models with data collected from different software houses, and training and testing on more project data can be followed.
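
The abstract names the three error measures but does not state their formulas. As a reference sketch, the definitions commonly used in the effort estimation literature are given below; the symbols are assumptions introduced here, with y_i the actual effort of project i, ŷ_i the estimated effort, and n the number of test projects.

```latex
\mathrm{MRE}_i = \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}, \qquad
\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MRE}_i, \qquad
\mathrm{PRED}(25) = \frac{1}{n} \, \bigl\lvert \{\, i : \mathrm{MRE}_i \le 0.25 \,\} \bigr\rvert
```

A lower MMRE and a higher PRED(25) both indicate more accurate estimates.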

Abstract (Translation)

Software products have become strategically important to the daily operations of nearly all companies across sectors. Both software houses and companies whose main business is not software depend heavily on software products. These companies design and develop software products with their in-house teams or outsource their software development to third-party software houses. Since the number of software projects being developed has increased greatly in the last 10 years, new methods are being developed in software project management. New agile project management methodologies have been developed and adapted. Software project effort estimation has gained crucial importance, since project budgets are calculated using techniques based on development effort and project costs are predicted using effort estimation techniques. Software projects that cannot be completed within the estimated duration directly increase project costs, and project budgets are exceeded. During the planning phase, resources should be planned as accurately as possible, since both under-planned and over-planned resources directly affect project budgets. If the required resources are underestimated, projects cannot be completed on time or resources have to be added in the late stages of projects, which increases the initial project budget. If the required resources are overestimated, the project will be completed on time but the initial project budget will be larger than necessary. In the first chapter of this study, the software effort estimation problem is described and the scope and aim of the thesis are stated. The second chapter contains the literature review, in which software effort estimation techniques are surveyed and the most prominent ones are summarized. 
In the third chapter, the selected data set is explained, the implemented machine learning models are described and the parameters used are summarized. The fourth chapter consists of the training and test results and comments on improving the performance of the machine learning models. The data set and the test results for the selected algorithms are listed in the appendix. Software effort estimation methods are categorized as algorithmic methods, statistical methods and Machine Learning based methods. Studies have been carried out in the software project estimation area for a long time. After Nelson's work in the 1960s, the SLIM methodology was developed in the 1970s. These early models influenced the SEER and COCOMO models that were developed in the 1980s. In 1995, the COCOMO II model was developed as a framework for software effort estimation. The COCOMO II model focused on object-oriented software methodology. More recently, studies have increasingly drawn on various areas of Machine Learning and Statistical Analysis. Decision Trees, Neural Networks, Bayesian Networks, Classification and Decision Trees, Case Based Reasoning, Multiple Regression Analysis and Logistic Regression Analysis are some of the prominent software effort estimation models. Three different Machine Learning based software effort estimation methods were selected for the thesis study. These methods are based on the Decision Tree, Naive Bayes and Multiple Regression Analysis models. A basic Windows desktop application was developed to implement the effort estimation algorithms. Publicly available Machine Learning components were inspected and some of them were used to implement the estimation models. Training and test parameter data were obtained from a local software house that has been developing software projects for the last 10 years. The training and test data consist of 64 different software projects that were developed mainly for corporate customers. 
All of the projects were designed and developed by 18 different engineers using the object-oriented software design methodology. The k-fold cross-validation technique was used to train and test the models. The entire data set was randomly divided into 10 nearly equal subsets. The algorithms were run 10 times. In each iteration, one of the 10 subsets was selected as the test data set and the remaining 9 subsets were used as the training data set. By using the k-fold validation technique, all the data in the data set were used for both training and testing. The same training and test subsets were used for each of the three models. Previous work in the software effort estimation domain was used to select the parameters for training and testing the models. Although the initial parameter set consisted of 40 parameters used to predict software development effort, most of it could not be used in this study, since some or all of the projects lacked some parameter data and a subset of the parameters did not contain distinguishing information. 10 parameters were used to train and test the models. Statistical values such as minimum, maximum, mean and standard deviation were calculated and examined. Decision Trees are algorithms that are widely used in classification problems. The root node of the tree contains the parameter whose value is searched for, the inner nodes contain each of the test parameters and each leaf contains the decision expression. Although there are different algorithms for Decision Tree construction, the C4.5 algorithm was used in this study. The Naive Bayes algorithm is a classification method in which statistically independent parameters directly affect the classification result. This model is based on Bayes' Theorem and can be used effectively on large data sets. The Naive Bayes classifier performs well even if parameter independence cannot be guaranteed. 
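
The 10-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation (which was a Windows desktop application built on components the abstract does not name); the function names `k_fold_indices` and `train_test_splits` and the fixed seed are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k nearly equal, randomly shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # hypothetical fixed seed, for repeatability
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 64 projects and 10 folds, as in the abstract.
splits = list(train_test_splits(64, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)  # [7, 7, 7, 7, 6, 6, 6, 6, 6, 6]
```

Because the same seed produces the same folds, passing identical splits to each model reproduces the condition that all three models see the same training and test subsets.
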
Multiple Regression Analysis is a statistical model that uses a set of parameters whose values are known to estimate a parameter whose value is unknown. This model does not test the parameter values for linearity, and it is used with continuous parameter values. All three algorithms were trained and tested using the k-fold validation technique. After running each of the algorithms 10 times, the error values were calculated for the machine learning models. The Magnitude of Relative Error (MRE), Mean Magnitude of Relative Error (MMRE) and Prediction Quality (PRED(25)) formulas were used to obtain the error values. Afterwards, the test results were compared with the actual effort values. This comparison shows that the Multiple Regression Analysis model produced the best estimates. The next best performing model is the Decision Tree algorithm, whereas the worst performing model is the Naive Bayes algorithm. Results were visualized as bar charts. This outcome can be explained by a couple of reasons. First, after inspecting the training and test data manually, we found that some of the parameter values in the test data set are not sufficiently distinguishing to generate consistent estimates. Second, since we eliminated more than half of the parameter set, we may have missed the opportunity to train the models with some of the most important parameters. Also, when we inspect the parameter set manually, some parameters, such as image file count, video file count and text file count, seem to be irrelevant to software effort estimation. The results show that the Decision Tree model performed better than the Naive Bayes model. 
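
To make the error measures concrete, the following sketch computes MRE, MMRE and PRED(25) using their usual definitions from the effort estimation literature. The effort values are hypothetical numbers invented for illustration and are not taken from the thesis data set; the function names are likewise assumptions.

```python
def mre(actual, predicted):
    """Magnitude of Relative Error for one project."""
    return abs(actual - predicted) / actual


def mmre(actuals, predictions):
    """Mean Magnitude of Relative Error over all projects."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


def pred(actuals, predictions, threshold=0.25):
    """Fraction of projects whose MRE is at most `threshold` (PRED(25) by default)."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= threshold)
    return hits / len(actuals)


# Hypothetical actual and estimated efforts (e.g. in person-days).
actual_effort = [100.0, 200.0, 400.0, 80.0]
estimated_effort = [110.0, 160.0, 520.0, 40.0]

mmre_value = mmre(actual_effort, estimated_effort)
pred25_value = pred(actual_effort, estimated_effort)
print(f"MMRE = {mmre_value:.3f}")      # MMRE = 0.275
print(f"PRED(25) = {pred25_value:.2f}")  # PRED(25) = 0.50
```

Under these measures, a model ranks higher when its MMRE is lower and its PRED(25) is higher.
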
Although our initial assumption was that the Naive Bayes model generally performs better than the Decision Tree model, this result may have been caused by the efficient formation of the decision tree constructed by the C4.5 algorithm. Also, the Naive Bayes model may have generated similar effort estimates for most of the test parameter set. To find the real reason, we should run our tests using different training and test data. To generate better estimates, data should be obtained from various software houses and software departments of companies, yielding a larger data set with different project sizes, features and technologies. This would enable us both to work with project data with different estimated and actual sizes and to use different sets of parameters. Furthermore, since we only had the opportunity to work with a data set of 64 projects, this training and test data size may not be enough to generate accurate estimates. As future work, we plan to obtain data from different companies and create a new parameter set that is more appropriate for current software development processes and techniques. We also plan to test and validate the selected parameters in order to eliminate the unnecessary ones.

Similar Theses

Thesis No: 886529
Title: A dataset quality enhancement method for fine-grained just-in-time software defect prediction models
Title (Turkish): İnce taneli tam zamanında yazılım hata tahmin modelleri için veri kalitesi iyileştirme yöntemi
Author: İREM FİDANDAN
Thesis Type: Master's
Language: English
Year: 2024
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Computer Engineering
Advisors: ASSOC. PROF. DR. FEZA BUZLUCA

Thesis No: 621937
Title: Prediction of software project costs using object-oriented metrics
Title (Turkish): Nesne tabanlı metrikler kullanılarak yazılım projeleri maliyetlerinin tahmin edilmesi
Author: ADEM DİLBAZ
Thesis Type: Master's
Language: Turkish
Year: 2020
Subject: Computer Engineering and Computer Science and Control
University: Ankara Üniversitesi
Department: Computer Engineering
Advisors: ASST. PROF. DR. BÜLENT TUĞRUL

Thesis No: 919590
Title: Development of machine learning models for project effort prediction and explanation using SHAP method
Title (Turkish): Proje efor tahmini için makine öğrenmesi modellerinin geliştirilmesi ve SHAP yöntemi kullanılarak açıklanması
Author: ESMA NUR KAYA
Thesis Type: Master's
Language: Turkish
Year: 2025
Subject: Management Information Systems
University: Sivas Cumhuriyet Üniversitesi
Department: Management Information Systems
Advisors: ASST. PROF. DR. YASİN GÖRMEZ

Thesis No: 954897
Title: Predicting the employee skills required for future R&D projects using machine learning algorithms
Title (Turkish): Gelecek arge projelerinin gereksinim duyduğu çalışan yeteneklerinin makine öğrenmesi algoritmaları kullanarak tahminlenmesi
Author: İREM TAŞKIRAN
Thesis Type: Master's
Language: Turkish
Year: 2025
Subject: Computer Engineering and Computer Science and Control
University: İSTANBUL NİŞANTAŞI ÜNİVERSİTESİ
Department: Software Engineering
Advisors: ASST. PROF. DR. GÜLSÜM ŞANAL; ASST. PROF. DR. AHMET KILIÇ; PROF. DR. HÜSEYİN PEHLİVAN

Thesis No: 945011
Title: Exploring the potential of digital twin technology to improve factors affecting construction productivity during the construction phase
Title (Turkish): Yapım aşamasında inşaat verimliliğini etkileyen faktörlerin iyileştirilmesinde dijital ikiz teknolojisinin potansiyelinin incelenmesi
Author: İREM KOMAR
Thesis Type: Master's
Language: English
Year: 2025
Subject: Architecture
University: İstanbul Teknik Üniversitesi
Department: Architecture
Advisors: PROF. DR. HÜSNÜ MURAT GÜNAYDIN
