Automated curriculum design for reinforcement learning with graph theory and evaluation heuristics

Çizge kuramı ve değerlendirme bazlı sezgisel yöntemler ile pekiştirmeli öğrenme için otomatik müfredat tasarımı


Thesis No: 676636
Author: ANIL ÖZTÜRK
Advisor: ASSOC. PROF. DR. NAZIM KEMAL ÜRE
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2021
Language: English
University: İstanbul Teknik Üniversitesi (Istanbul Technical University)
Institute: Graduate School
Department: Department of Computer Engineering
Division: Computer Engineering
Number of Pages: 109

Abstract (Translation)

In reinforcement learning, a model explores its environment to learn the environment dynamics and a policy that leads it to the goal. Environment-centric or model-centric auxiliary methods can help the model learn the target task faster. Model-centric solutions mostly change the model's architecture and update sensitivity, while environment-centric solutions adjust the environment dynamics and the difficulty of the target task. To analyze the environment thoroughly, the model must first discover most of the critical and basic states (exploration) and then learn the values of all important states (exploitation). However, the more a model leans toward exploration, the further it moves from exploitation, and vice versa. Both approaches offer ways to address this exploration-exploitation dilemma, one of the most intensively studied topics in reinforcement learning. When the environment or task is relatively difficult, learning the target task in a single attempt can be slow or even impossible. The model may fail to learn conditions that require long-term planning or immediate action because of the large number of variables involved and the size of the space of their combinations. Curriculum learning lets the model pass through phases of distinctly different difficulty during training. In this context, methods have been proposed that control how often the model sees the samples it collects from the environment, that present pre-designed discrete environment configurations in order of increasing difficulty, or that teach a harder task so that the model performs better on an easier one.
Automated curriculum learning methods aim to minimize the domain expertise that curriculum learning requires and the effort spent on model optimization. Strategies frequently used in automated curriculum learning algorithms include student-teacher neural networks that work cooperatively, predefined methods that lower or raise the difficulty immediately according to the model's learning outcomes, and dynamic modification of the environment's states, rewards and goals. While designing the proposed algorithm, two preliminary studies were carried out to investigate and observe the effects, advantages, disadvantages and types of curriculum learning relative to the basic learning process. The first study tested how well an autonomous traffic vehicle model adapts to changing traffic situations. To improve its adaptability, the model was trained in traffic environments that varied during training. The vehicle model trained in a difficult, complex and highly random environment was more successful in simpler traffic scenarios than a model trained in a simple traffic scenario from the very beginning. The second study observed, in a realistic physics simulation, how the learning ability of an autonomous driving model changes under varying road types and weather conditions. By comparing standard training runs of the vehicle model under different environment parameters, a route from easy to difficult was created for the model. Models that underwent phased training along this route were significantly more successful in the target (most difficult) environment than the other models. This study proposes an algorithm-based strategy that enables the model to learn such difficult problems.
The strategy uses graph theory and reward metrics as its basis. The reinforcement learning environments were made configurable through variables, and an algorithm was designed that determines the values and ordering of those variables so that the model has a more stable training process. Every combination of the defined variables is modeled as a separate environment. These discrete environments are ranked by difficulty by comparing the differences in their variables. For each possible environment change that respects the learned difficulty order (moving only from easy to hard), the rewards obtained by the model are compared. The changes in reward become the edge weights of the resulting curriculum graph. The magnitude of a reward change reflects the difference between the reward the model obtained initially and the reward it obtains in the new environment. A shortest-path algorithm is then run on the curriculum graph to find a route from a starting environment to the target environment with the smallest possible total reward change. This route is taken as the curriculum training route, and the model learns the environment combinations along it in sequence, so that by the end of the process it has learned to perform its task in the target environment. To test the proposed algorithm, well-known virtual game environments were made parameterizable with custom variables. Environments known to be easy to learn were avoided as far as possible. Because one of the hypotheses behind the algorithm is that it shortens training time, relatively difficult problems were chosen.
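The graph construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: each environment is represented as a tuple of per-variable difficulty levels, an edge is allowed only when no variable becomes easier, and `reward_change` is a hypothetical callable standing in for the measured reward change between two environments. The function names and the toy numbers are invented for the example.

```python
import heapq
from itertools import product


def build_curriculum_graph(levels, reward_change):
    """Build a directed graph over all variable combinations.

    levels: number of difficulty levels per variable, e.g. (2, 2).
    reward_change: hypothetical callable (src, dst) -> measured reward change.
    """
    nodes = list(product(*(range(n) for n in levels)))
    graph = {}
    for src in nodes:
        # Assumption: dst is reachable only if no variable gets easier.
        graph[src] = [
            (dst, abs(reward_change(src, dst)))
            for dst in nodes
            if dst != src and all(d >= s for s, d in zip(src, dst))
        ]
    return graph


def shortest_curriculum(graph, start, target):
    """Dijkstra's algorithm: route with the least total reward change."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]


# Toy reward changes for two variables with two levels each (made-up numbers).
toy_changes = {
    ((0, 0), (0, 1)): 1.0,
    ((0, 0), (1, 0)): 4.0,
    ((0, 0), (1, 1)): 9.0,
    ((0, 1), (1, 1)): 2.0,
    ((1, 0), (1, 1)): 3.0,
}
graph = build_curriculum_graph((2, 2), lambda s, d: toy_changes[(s, d)])
path, cost = shortest_curriculum(graph, (0, 0), (1, 1))
print(path, cost)
```

With these toy weights, the direct jump to the hardest environment costs 9.0, while the route through `(0, 1)` costs 3.0, so the intermediate environment is selected as a curriculum stage.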
For each environment, the algorithm was run 10 times independently with fixed random seeds to ensure that the results are reproducible. The results are reported so that the outputs of all 10 trials can be interpreted together. PPO was chosen as the model type for all experiments, together with deep learning architectures and algorithms. The Python programming language was used throughout, and dedicated compute servers running Ubuntu were used to keep the runtime as short as possible. The test outputs show that the proposed algorithm reaches the results of the normal training process kx times faster on average and contributes +k% to the results at the end of the process. The proposed method gives better results than the standard method in most test environments; in some environments it performed on par with the standard method, with no improvement in output. Because the parameters of the models created during training and the initial conditions of the environments are sources of randomness, a seed was assigned to every random component that could be seeded, and results were averaged over 10 trials to obtain more stable results for those that could not. The results indicate that the curricula produced by the algorithm have clear advantages over standard training. This study also shows that data-centric (environment-centric) innovations are as important as model-centric developments and inventions. During the experiments, there were cases in which a variable thought to be important had no effect at all, and in which variables expected to have a positive effect had a negative one. With the proposed curriculum learning strategy, the curriculum most conducive to learning can be created for the artificial intelligence model without expert-level knowledge of the environment.
Besides producing an efficient curriculum route, this also offers an opportunity to discover the importance of a variable that is new to the environment or whose effect is in doubt. The study could be extended with Gaussian-based random sampling methods to provide more adaptive variable values. The results would be expected to resemble the contribution that hyper-parameter optimization makes to other machine learning techniques.
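The evaluation protocol described in the abstract (10 independent runs with fixed seeds, reported jointly) can be sketched as follows. `run_trial` is a hypothetical stand-in for one full training run, and its noisy score is a placeholder rather than a real result; only the seeding and aggregation pattern is the point.

```python
import random
import statistics


def run_trial(seed):
    """Hypothetical stand-in for one training run; returns a final score."""
    rng = random.Random(seed)  # all randomness in the trial flows from the seed
    return 100.0 + rng.gauss(0.0, 5.0)  # placeholder score, not a real result


def evaluate(seeds):
    """Run one trial per seed and summarize the scores across trials."""
    scores = [run_trial(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


seeds = list(range(10))  # one fixed seed per independent trial
mean_score, std_score = evaluate(seeds)
print(f"mean={mean_score:.2f} std={std_score:.2f}")
```

Because every trial draws its randomness from its own seed, calling `evaluate(seeds)` again reproduces the same summary, while averaging across seeds smooths out run-to-run variation.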

