Gezgin satıcı probleminin hadoop üzerinde çalışan paralel genetik algoritma ile çözümü

Parallel genetic algorithm to solve traveling salesman problem on hadoop cluster

PDF İndir

Tez No: 350674
Yazar: HARUN RAŞİT ER
Danışmanlar: PROF. DR. NADİA ERDOĞAN
Tez Türü: Yüksek Lisans
Konular: Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol, Computer Engineering and Computer Science and Control
Anahtar Kelimeler: Belirtilmemiş.
Yıl: 2013
Dil: Türkçe
Üniversite: İstanbul Teknik Üniversitesi
Enstitü: Fen Bilimleri Enstitüsü
Ana Bilim Dalı: Bilgisayar Mühendisliği Ana Bilim Dalı
Bilim Dalı: Belirtilmemiş.
Sayfa Sayısı: 83

Özet

MapReduce, günümüzde birçok büyük ölçekli firmanın büyük miktardaki verileri işlemek için kullandıkları bir dağıtık programlama altyapısıdır. MapReduce altyapısının asıl amacı, büyük miktardaki verileri paralel olarak işlemektir. Ancak son zamanlarda işlemci-yoğun programlar için de kullanılmaya başlanmıştır. Google tarafından ortaya atılan bu programlama modeli - isminden de anlaşılacağı gibi - ?map? ve ?reduce? fonksiyonlarından oluşmaktadır. Kullanıcı bir MapReduce programı yazmak istediğinde en azından bu iki fonksiyonu gerçeklemek zorundadır. Map fonksiyonu dışarıdan anahtar/değer ikilisi alır, bunları işler ve çıktı olarak yeni bir anahtar/değer ikilisi verir. Reduce fonksiyonu ise; bir anahtar değeri ve o anahtar değerine sahip tüm değerlerin listesini girdi olarak alır. Listedeki veriler üzerinde işlem yaptıktan sonra çıktı olarak bir anahtar/değer ikilisi verir. MapReduce programlama modeli, büyük miktardaki verileri işlemek için yüksek maliyetli süper bilgisayarlar ve veri tabanı sunucuları yerine sıradan bilgisayarlardan oluşan bir ağ kullanır. Ağ üzerindeki bilgisayarlar üzerinde map ve reduce fonksiyonları paralel olarak çalıştırılır. MapReduce mimarisi ağ üzerinde veriyi taşımak yerine verinin bulunduğu makinelere ilgili map veya reduce fonksiyonunu gönderir. Böylece ağ üzerinde oluşacak bant genişliği problemini ortadan kaldırır. Hadoop, MapReduce altyapısının ücretsiz ve açık-kaynak gerçeleştirimidir. Hadoop kullanıcıya iki yapı sunmaktadır. Bunlardan birisi kullanıcının dağıtık program geliştirmesini ve çalıştırmasını sağlayan MapReduce yapısıdır. Diğeri ise; geliştirilen MapReduce programının girdi ve çıktılarının saklandığı dağıtık dosya sistemi olan Hadoop Dağıtık Dosya Sistemi?dir (Hadoop Distributed File System - HDFS). HDFS üç parçadan oluşur. Namenode, HDFS üzerindeki verilerin yönetimini yapmaktan sorumludur. Secondary Namenode, adından anlaşılacağı üzere, Namenode?un yedeği olarak çalışmaktadır. Datanode ise, gerçek verilerin saklandığı bölümdür. Hadoop sisteminde Namenode ve Secondary Namenode birer adet bulunur ve master makine üzerinde çalışır. DataNode ise bir veya daha fazla sayıda olabilir ve slave makineler üzerinde çalışır. Hadoop MapReduce yapısı ise; Jobtracker ve TaskTracker olmak üzere iki ana parçadan oluşur. JobTracker Hadoop işlerinin çalıştırılmasını koordine eder. Kullanıcıdan çalıştırılacak olan işi alır ve onun düzgün bir şekilde çalıştırılıp sonlandırılmasını sağlar. Tasktracker kullanıcınının yazmış olduğu map ve reduce fonksiyonlarını çalıştırır. Hadoop sisteminde bir adet JobTracker bulunur ve master makine üzerinde çalışır. TaskTracker bir veya birden fazla olabilir ve slave makineler üzerinde çalışır. Genetik Algoritma, optimizasyon problemlerinin çözümünde sıklıkla kullanılan sezgisel arama algoritmasıdır. Genetik algoritma doğal hayattaki en iyi olanın hayatta kalması prensibine göre çalışır. Buna göre her jenerasyonda en iyi bireyler hayatta kalır ve bir sonraki jenerasyon için yeni bireylerin oluşturulmasına katkı sağlarlar. Genetik algoritma doğal evrim sürecinden esinlerek geliştirilmiştir. Buna göre ilk olarak bir başlangıç populasyonu oluşturulur. Bu ilgili problem için oluşturulan rastgele çözümler kümesidir. Daha sonra populasyondaki her bireyin ne kadar iyi çözüm oldukları hesaplanır. Sonraki adımda bireyler ne kadar iyi olduklarına göre eşleşme için seçilir. Seçilen bireyler, aralarında gen değişimi gerçekleştirerek yeni bireyleri oluştururlar. Son olarak yeni bireyler belirlenen bir olasılıkla mutasyona uğrarlar. Bu şekilde yeni bir jenerasyon ortaya çıkar. Yeni jenerasyondaki bireylerin herhangi birisinin sonlanma koşulunu sağlayıp sağlamadığı kontrol edilir. Eğer sonlanma koşulu sağlanıyorsa algoritma sonlanır. Aksi taktirde, algoritma önceki adımları tekrar eder. Genetik algoritmalar her ne kadar kabul edilebilir bir zaman içerisinde optimum çözüme yakın çözümler üretseler de bazı problemler içermektedir. Bunlardan en önemli ikisi; uygunluk değerinin hesaplanması ve algoritmanın yerel optimum değere takılmasıdır. Özellikle populasyondaki bireylerin sayısının fazla olduğu durumlarda bireylerin uygunluk değerinin hesaplanması çok fazla zaman alabilmektedir. Bu sorunların çözümü için paralel genetik algoritmalar kullanılır. Paralel genetik algoritmalar uygunluk değerinin hesaplanma yöntemine göre, tek veya birden çok populasyon kullanımına göre ve birden fazla populasyon kullanımı durumunda, bireylerin populasyonlar arasındaki değişim yöntemine göre sınıflandırılmaktadır. Buna göre genetik algoritmalar üç ana sınıfa ayrılabilir. Bunlar; master-slave modeli, göç alan statik alt-populasyonlar modeli ve statik kesişen alt-populasyonlar modelidir. Göç-alan statik alt-populasyonlar modelinde birden fazla alt-populasyon bulunur. Alt-populasyonlar birbirlerinden bağımsız olarak bireyler üzerinde genetik operatörleri uygularlar. Ve belirlenen aralıklarla göç operatörü yardımı ile bireyler diğer populasyonlara göç ettirilir. Bu şekilde algoritma çözüm uzayının çok daha büyük bir kısmını taramış olur ve yerel optimum değerine takılma sorununu ortadan kaldırır. Bu çalışmada Hadoop ağı üzerinde genetik algoritmanın paralelleştirilmesi gerçekleştirilmiştir. Paralelleştirme yöntemi olarak ise göç-alan statik alt-populasyonlar yöntemi kullanılmıştır. Buna göre her alt-populasyon ayrı bir map/reduce görevi olarak farklı bir bilgisayarda çalışır. Böylece tüm alt-populasyonlar paralel olarak farklı makinelerde genetik operatörleri bireylere uygular. Hadoop işinin tamamlanması sürecinde; ilk olarak tüm map görevleri HDFS?den tüm populasyonlara ait bireyleri okurlar. Bireyler populasyon kimliklerine göre reducer fonksiyonuna gönderilir. Buna göre aynı populasyon kimlik numarasına sahip bireyler aynı reducer tarafından işlenir. Populasyonlar belirlenen sürede evrim geçirdikten sonra sonuçlar tekrar HDFS?e yazılır ve Hadoop işi sonlanır. Bir sonraki Hadoop işi ise daha önce yazılan populasyonları okuyarak başlar ve aynı süreç, sonlanma koşulu sağlanana kadar devam eder. Gerçekleştirilen genetik algoritmanın test edilmesi için gezgin satıcı problemi kullanılmıştır. Bilindiği gibi gezgin satıcı problemi (GSP) birçok arama yönteminin test edilmesinde kullanılmaktadır. GSP, bağlı bir çizgede tüm noktaların bir kez ziyaret edilerek başlangıç noktasına geri dönen en kısa yolun bulunması problemidir. GSP, NP-zor kombinasyonel optimizasyon problemidir ve bilgisayar bilimlerinde önemli bir yere sahiptir. Gerçekleştirilen ardışıl genetik algoritma ve daha önceki yapılan çalışmalarda kullanılan genetik algoritmaların sonuçları karşılaştırılmıştır. Ayrıca MapReduce ağı üzerinde paralelleştirilen genetik algoritmanın sonuçları ile ardışıl genetik algoritmanın sonuçları karşılaştırılmıştır. Sonuçlara bakıldığında, gerçekleştirilen ardışıl genetik algoritma diğer algoritmalardan daha kısa sürede daha iyi sonuçlar üretmiştir. Ayrıca paralel genetik algoritma, problem boyutunun büyük olduğu durumlarda ardışıl algoritmadan daha kısa sürede daha iyi sonuçlar üretmiştir.

Özet (Çeviri)

MapReduce is a distributed framework that large-scaled companies use to handle large data sets. The main objective of this framework is to process huge data set in a parallel manner using a cluster of commoduty computers. It is also used in CPU-intensive computations in recent years. Google has proposed the MapReduce model that enables users to develop and run distributed programs easily. MapReduce programming model has two main components that are called map and reduce funtions. It is necessity to implement these two functions at least in order to write a MapReduce program. Map function reads the input data as key/value pairs. Then it produce an intermediate key/value pairs. Reduce function reads the results of map funtions. It takes a key and a list of values associated with this key as its input data. It performs required operations on the list of values. At the end of the MapReduce job, reduce function writes the results to a file as key/value pairs. MapReduce programming model uses commodity computer cluster instead of large database servers or super computers in order to process data. Map and reduce functions are run in the machines on the cluster in parallel. MapReduce model uses data locality. As a result, data is not transferred on network but the map and reduce funtions are sent to the machine where the data is stored instead. This eliminates the network bandwidth problem. Hadoop is open source implementation of MapReduce framework for writing and running distributed applications that process large amounts of data. It provides two main concepts for the users: MapReduce programming model and Hadoop Distributed Fie System (HDFS). MapReduce programming model enables users to develop and run the application. HDFS is a distributed file system on Hadoop cluster in order to store input and output data of MapReduce program. HDFS contains three main components. Namenode is responsible for managing data on HDFS. Secondary Namenode, as the name suggests, runs as a backup for Namenode. Datanode stores the real data on HDFS. Hadoop cluster has only one Namenode and one Secondary Namenode running on master machine. And one or more Datanode running on slave machines on the cluster. Hadoop MapReduce consists of two main components called Jobtracker and TaskTracker. Jobtracker coordinates the running of Hadoop jobs on the cluster. It receives the job file from the user and manage the cluster to run the job consistently and gives the results back to user. Tasktracker runs the map and reduce tasks in parallel. Hadoop cluster contains only one Jobtracker running on master machine and one or more Tasktracker running on slave machines. Genetic algorithm is a heuristic search method that is commonly used in the solution of optimization problems. It uses Darwin?s Theory where the only best individuals survive. According to the theory, the best individuals survive in each generation and contribute to creation of new individuals for the next generation. Genetic algorithm is inspired from natural evolution. The first step of the algorithm is creation of initial population. The initial population is a set of random solutions for the problem. The next step is to evaluate the fitness of each individual. Then individuals are selected according to their fitness values for crossover. Crossover operator changes some genes between selected individuals and constructs the new individuals. The last step is mutation which changes some genes of the new individuals randomly in order to conserve the diversity of solutions. As a result a new generation is obtained. Before the next iteration, the algorithm checks whether any of the individuals in the population satisfies termination condition. If it does the algorithm terminates, otherwise the algorithm goes into the next iteration where fitness calculation, selection, crossover and mutation operators are applied to new population. One of the drawbacks of the genetic algorithm is that it can get stuck in local optimum. In order to avoid this situation, we propose preventing consanguineous marriage. Our policy in this respect is selecting two individuals that have different gene orders over a particular percent for the crossover operation. Setting a percentage value too high prevents the algorithm from converging. Otherwise, if the percentage value is minimized too much, it has no effect on avoiding from local optima. After some experimentation, we have chosen the percentage value as %20. Therefore, if two parents have similarity over %80, they cannot be chosen for the crossover. Eventhough the genetic algorithm produces near optimal solutions in a reasonable time, there are some problems to be addressed. The two of most important problems are fitness evaluation time of individuals and getting stuck in local optima as mentioned above. Fitness evaluation can be very time consuming and sometimes impossible especially when the population size is large. If the genetic algorithm gets stuck in local optima, it will never find the global optimum solution. In order to overcome these problems, parallel genetic algorithm (PGA) is used. The calculation of fitness evaluation, the use of single or multiple population, in case of multiple populations, the way individual exchange is carried out are some of the criteria according which PGAs are classified. PGAs are classified into Master-slave, multiple populations with migration and multiple populations without migration. Master slave parallelization method includes one master machine and one or more slave machines. Master machine manages the whole genetic algorithm phases. This method contains one population. While fitness calculation is processed by slave machines in parallel, the other phases of genetic algorithm are carried out by master machine sequentially. This method can be implemented as synchronous or asynchronous. In synchronous method, master machine waits all slave machines to complete fitness calculations. After all slaves send the results, the master machine continues in order to complete the other phases of genetic algorithm. The slowest slave machine determines the performance of the algorithm in this case. In asynchronous method, master machine does not wait for all slave machines to complete their tasks. Master machine applies the genetic operators for those individuals that are received recently and again waits for some individuals to be sent by slave machines. Static populations without migration model uses more than one populations. Exchange of traits between sub-populations is done by individuals that reside in overlapping areas. These individuals belong to more than one population and they compete in different sub-populations. The genetic operators should be applied sycnhronously on these individuals that reside in overlapping areas. Static population with migration model, which is used in this study, contains more than one population. All sub-populations apply genetic operators independent of each other. And some individuals are transferred to other populations via migration operator at specified intervals. As a result, sub-populations search the space in different directions and the algorithm avoids getting stuck into local optima. Parallelization of genetic algorithm on MapReduce framework is implemented on Hadoop cluster in this study. Multiple population with migration method is used as parallelization method. Each sub-population evolves in a different map/reduce task on different machines. As a result sub-popuations apply genetic operators to their individuals in parallel. The first step of a Hadoop job is that map functions read all individuals from HDFS. Individuals are grouped according to their population identifiers and sent to reducer funtions. Reducer receives a population identifier and a list of individuals sharing the population identifier. Reducers apply genetic operators to individuals and populations evolve until migration phase. At he end of reduce funtion, individuals are written back to HDFS. The key point is to write indviduals that will migrate to other populations with different population identifiers. So they will compete within different populations in the next Hadoop job. The next Hadoop job starts to read individuals from HDFS and continue running as the previous job. This process continues until the termination condition is met. Travelling Salesman Problem (TSP) is used to test the algorithm implemented. TSP is a common bechmark for testing optimization algorithms. TSP can be defined as finding shortest possible tour which visits every node exactly once and returns to starting node in a connected graph. TSP is an important NP-Hard combinatorial opmimization problem that is studied in computer science. Implemented sequential genetic algorithm is compared with other implemetations studied before. MapReduce parallel genetic algorithm is also compared with the sequential genetic algorithm. Compared to other implementations, the implemented sequential genetic algorithm always finds better results and takes less time. The comparison of MapReduce PGA and sequential GA shows that MapReduce PGA always finds better results than the sequential one. While sequential GA takes less time for small size of problems, MapReduce PGA exhibits better perfomance when the problem size increases. The main objective of this study is to introduce MapReduce framework for the next studies. And also, it tests the performance of Hadoop for cpu-intensive algorithms. Recent studies practice on parallelization of GA on MapReduce framework also, but they all concentrate on parallelizing fitness function. In this study, multiple population parallelization method is adopted to MapReduce infrastructure. And it is shown that multiple population PGA can best fit to MapReduce model and it can be scaled to a larger cluster without any effort.

Benzer Tezler

Tez No
84453
Gezgin satıcı probleminin çözümünde sinirsel ağ yaklaşımı
Neural network approach in the solution of traveling salesman problem
KAAN ASLAN
Yüksek Lisans
Türkçe
1999
Endüstri ve Endüstri Mühendisliği Eskişehir Osmangazi Üniversitesi
Endüstri Mühendisliği Ana Bilim Dalı
DOÇ. DR. A. SERMET ANAGÜN
Tez No
655965
Gezgin satıcı probleminin çözümü için geliştirilmiş uyarlanabilir bir genetik algoritma tasarımı
An improved adaptive genetic algorithm design for solving traveling salesman problem
MERVE GENEL
Yüksek Lisans
Türkçe
2021
Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol Van Yüzüncü Yıl Üniversitesi
Elektrik-Elektronik Mühendisliği Ana Bilim Dalı
DOÇ. DR. RIDVAN SARAÇOĞLU
Tez No
743612
Gezgin satıcı probleminin DBSCAN kümeleme yöntemiyle analizi
Analizing traveling salesman problem with DBSCAN method
UĞUR SİNAN EREN
Yüksek Lisans
Türkçe
2022
Endüstri ve Endüstri Mühendisliği Kocaeli Üniversitesi
Endüstri Mühendisliği Ana Bilim Dalı
DR. ÖĞR. ÜYESİ YILDIZ ŞAHİN
Tez No
539711
Gezgin satıcı probleminin karınca kolonisi algoritması ile çözüm performansının arttırılmasında parametre optimizasyonu
Parameter optimization to increase solution performance of travelling Salesman problem by using ant colony algorithm
KUMRU AKŞEHİR
Yüksek Lisans
Türkçe
2019
İstatistik Ondokuz Mayıs Üniversitesi
İstatistik Ana Bilim Dalı
DOÇ. DR. TALAT ŞENEL
Tez No
681749
Gezgin satıcı probleminin çözümü için guguk kuşu arama algoritma tabanlı yeni bir hibrit metasezgisel yöntem
A new hybrid metaheuristic method based on cuckoo search algorithm for solving the traveling salesman problem
MUSTAFA FURKAN BERKAYA
Yüksek Lisans
Türkçe
2021
Endüstri ve Endüstri Mühendisliği Konya Teknik Üniversitesi
Endüstri Mühendisliği Ana Bilim Dalı
DOÇ. DR. AHMET SARUCAN

Geri Dön