Gömülü sistemlerde sesli komut tanıma

Voice command recognition in embedded systems


Thesis No: 633411
Author: CAN ÇETİN
Advisors: PROF. DR. MUSTAFA DOĞAN
Thesis Type: Master's
Subjects: Mechatronics Engineering
Keywords: Not specified.
Year: 2020
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Fen Bilimleri Enstitüsü
Department: Department of Mechatronics Engineering
Program: Mechatronics Engineering Program
Page Count: 130

Abstract

Interpersonal communication most commonly takes place through speech. Sound is a basic and essential component of speech. In its simplest definition, sound is the set of vibrations, perceptible by the ear or by sensitive instruments, produced when the vocal organs shape the air coming from the lungs. Speech is the transformation of these vibrations in the throat and mouth into a complex structure that the human mind can interpret within a particular grammatical framework. In light of the above, the aim of the thesis is to analyze the audio data that people produce to express particular commands, and to train a system, built around an embedded processor architecture, whose features suit this analysis. Voice and speech recognition is becoming a very popular technology today. Its largest users are settings where speech must be transcribed very quickly, such as courts and prosecutors' offices, and telephone applications for bank customer service, where simple transactions must be completed quickly. In addition, voice processing technology is rapidly entering embedded architectures. For example, while sitting in front of the television, instead of searching for a program with the remote control, you can say the name of the program you want, filter the listings faster, and reach it very quickly. Likewise, in automotive technology, controlling the radio by voice while driving at high speed is a major improvement in safety and comfort. The work carried out in this thesis aims to add to the literature new studies that support the comfort and convenience brought by voice processing technology. Our goal is to detect the relevant commands with the trained system and to perform the action corresponding to each command. The greatest achievement of the thesis is that, through the sound analysis and the dataset built from it, commands can be used in any desired language, regardless of the language family to which the command word belongs.
To obtain the best result in this study, the sound analysis steps, such as filter selection, windowing functions, and feature extraction functions, were tried comparatively, and the method used in the thesis was chosen on that basis. Deep learning is an artificial intelligence technique that mimics the working of the human brain. It is designed to build models that can be used for data processing and decision making. It requires a very large neural network and a large amount of accessible data. While machine learning uses simpler concepts, deep learning works with artificial neural networks designed to mimic how people think and learn. Neural networks are made up of layers, just as the human brain is made up of neurons. Nodes in each layer are connected to the adjacent layers. The more layers a network has, the deeper it is said to be. A single neuron in the human brain receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes, and a weight is assigned to each of them. A node with a heavier weight has more influence on the next layer of nodes. The final layer combines the weighted inputs to produce an output. Deep learning systems require powerful hardware because they process large amounts of data and involve complex mathematical calculations. Even with such advanced hardware, however, training a deep learning model can take weeks. Deep learning systems need large amounts of data to produce accurate results, so information is fed in as large datasets. When processing data, artificial neural networks can classify it using the answers to a series of binary true-or-false questions that involve highly complex mathematical calculations. For example, a face recognition program works by learning to detect and recognize first the edges and lines of faces, then the more significant parts of faces, and finally overall representations of faces. Over time, the program trains itself and the probability of a correct answer increases.
In this case, the face recognition program will identify faces accurately over time. On the embedded side of the thesis, a 32-bit microcontroller with a 600 MHz clock frequency is used together with its peripheral components. Using the dataset provided by Google AI Lab, at a desk close to the microphone and in a noise-free environment, five English words were selected as commands, and a model combining MFCC (Mel Frequency Cepstral Coefficients) with LSTM (Long Short-Term Memory networks) was trained using the TensorFlow and Keras libraries. As for the Cortex architecture, the M7 processor, a member of the Cortex-M series, is a high-efficiency, high-performance embedded processor with low interrupt latency, low-cost debugging features, and backward compatibility. At the end of training on Colab, the success rate was 95.07%. Similarly, using MFCC and CNN (Convolutional Neural Networks), a success rate of 88.03% was obtained. These models were converted into C/C++ software, taking floating-point rules into account, for the i.MX RT 1060 embedded processor architecture, whose eIQ platform supports TFLite. Comparing the two methods in terms of RAM and flash optimization, model applicability, and compatibility, the "MFCC and CNN" method was judged the more useful of the two for voice recognition projects on embedded systems. Compared with the PC experiments, the embedded system showed performance losses. The causes fall into three main categories: sound insulation problems, user errors, and environmental effects. Noise in the recognition environment reduces recognition performance, both by distorting the spectral properties of speech and by causing endpoints to be detected incorrectly. The acoustic properties of the room in which recognition takes place also affect the spectral properties of speech.
In light of these problems, future work will focus on a larger dataset, speech direction detection, active noise filtering, a voice child lock, learning the acoustic properties of the environment where the device is located, and relaying voice commands across devices connected to Bluetooth mesh networks.
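
The abstract names windowing functions and MFCC feature extraction without giving details. As a rough illustration only, the following Python sketch shows two building blocks that commonly precede MFCC computation: Hamming-windowed framing and mel-spaced filter centre frequencies. The function names, the frame sizes, and the choice of the `2595 * log10(1 + f / 700)` mel convention are assumptions made for this example and are not taken from the thesis.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels (one common convention; assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hamming_window(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames and window each frame."""
    window = hamming_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Centre frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Illustrative values only: a short ramp signal, 4-sample frames, hop of 2.
frames = frame_signal([float(i) for i in range(10)], frame_len=4, hop_len=2)
centers = mel_center_frequencies(num_filters=3, low_hz=0.0, high_hz=8000.0)
```

A full MFCC pipeline would follow these steps with a power spectrum per frame, a mel filter bank, a logarithm, and a discrete cosine transform; those stages are omitted here for brevity.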

Abstract (Translation)

Interpersonal communication most commonly takes place through speech. Sound is a basic and crucial constituent of speech. In its simplest definition, sound consists of vibrations, perceptible by the ear or by sensitive instruments, that are produced when the vocal organs shape the air coming from the lungs. Speech is the transformation of these vibrations in the throat and mouth into a complex structure that the human mind can perceive within a certain grammatical framework. In light of this, the purpose of the thesis is to analyze the audio data that people produce to express certain commands, and to train a system, built around an embedded processor architecture, whose features suit this analysis. Voice and speech recognition is becoming a very popular technology in today's world. Its main users include phone applications on platforms such as Android and iOS, courts and prosecutors' offices, where speech must be transcribed very quickly, and bank customer service, where simple transactions must be handled quickly. In addition, sound processing technology is rapidly entering embedded architectures. For example, while sitting in front of the TV, instead of searching for a program with the remote, you can say the name of the program you are looking for, filter the listings faster, and reach it very quickly. Likewise, in automotive technology, controlling the radio with your voice while driving at high speed is a great improvement in safety and comfort. The study carried out in this thesis aims to add to the literature new work that supports the comfort and convenience brought by sound processing technology. Our purpose is to detect the relevant commands with the trained system and to perform the action corresponding to each command.
The biggest achievement of the thesis is that, by means of the sound analysis and the dataset created from it, commands can be used in any desired language, regardless of the language family to which the command word belongs. To obtain the best result in this study, the sound analysis steps, such as filter selection, windowing functions, and feature extraction functions, were all tried comparatively, and the method used in the thesis was chosen on that basis. In addition to the sound analysis, the machine learning methods suited to the architecture were surveyed and compared in the related literature, and the chosen method was used in the thesis. Deep learning is an artificial intelligence technique that mimics the working of the human brain. It is designed to create models that can be used for data processing and decision making. It requires a very large neural network and a large amount of accessible data. While machine learning uses simpler concepts, deep learning works with artificial neural networks designed to mimic how people think and learn. Neural networks consist of layers, just as the human brain is made up of neurons. Nodes in each layer are connected to the adjacent layers. The more layers a network has, the deeper it is said to be. A single neuron in the human brain receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes, and a weight is assigned to each of them. A node with a heavier weight has more impact on the next layer of nodes. The last layer combines the weighted inputs to produce an output. Deep learning systems require powerful hardware because they process large amounts of data and involve complex mathematical calculations. Even with such advanced equipment, however, training a deep learning model can take weeks.
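
The layered, weighted structure described above can be made concrete with a minimal forward-pass sketch in plain Python. Every name, the sigmoid activation, and the hand-picked weights below are hypothetical and chosen only for illustration; they do not come from the thesis, whose models were built with TensorFlow and Keras.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One fully connected layer.

    Each node sums its inputs multiplied by that node's weights,
    adds a bias, and applies the sigmoid activation.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs


def forward(inputs, layers):
    """Pass the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.5], [2.0, 0.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                    # output layer
]
output = forward([1.0, 1.0], layers)
```

With a positive input, a larger weight yields a larger activation in the next layer, which is the sense in which a "heavier" connection has more influence. Training, which adjusts these weights from data, is not shown.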
Deep learning systems require large amounts of data to produce accurate results, so information is fed in as large datasets. When processing data, artificial neural networks can classify it using the answers to a series of binary true-or-false questions that involve highly complex mathematical calculations. For example, a face recognition program works by learning to identify and recognize first the edges and lines of faces, then the more important parts of faces, and finally the general representations of faces. Over time, the program trains itself and the probability of correct answers increases. In this case, the face recognition program will correctly identify faces over time. On the embedded side of the thesis, a 32-bit microcontroller with a 600 MHz clock frequency is used together with its peripheral components. Using the dataset provided by Google AI Lab, at a desk close to the microphone and in a noise-free environment, five English words were chosen as commands, and a model combining MFCC (Mel Frequency Cepstral Coefficients) with LSTM (Long Short-Term Memory networks) was trained using the TensorFlow and Keras libraries. As for the Cortex architecture, the M7 processor, which belongs to the Cortex-M series, is a high-efficiency, high-performance embedded processor with low interrupt latency, low-cost debugging features, and backward compatibility. At the end of training on Colab, the success rate was 95.07%. Similarly, using MFCC and CNN (Convolutional Neural Networks), a success rate of 88.03% was achieved. These models were converted into C/C++ software, taking floating-point rules into account, for the i.MX RT 1060 embedded processor architecture, whose eIQ platform supports TFLite. Keras records each training epoch while a deep learning model is being trained. This default record is available for every deep learning model that is trained.
It records the training metrics for each epoch, including the loss and accuracy (for classification problems) on the validation dataset, if one is set. The history object is returned by calls to the fit() function, which is used to train the model. The metrics are stored in a dictionary in the history member of the returned object. When a model is trained with Keras, its accuracy and loss on the validation data may vary from case to case. In general, loss should decrease and accuracy should increase with each epoch. When training a machine learning model, one of the main things to avoid is overfitting. This occurs when the model fits the training data well but cannot generalize and make accurate predictions on data it has not seen before. To find out whether their models are overfitting, data scientists use a technique called cross-validation, in which they divide their data into two parts: the training set and the validation set. The training set is used to train the model, while the validation set is used only to evaluate its performance. The metrics on the training set show how the model is progressing in training, whereas the metrics on the validation set measure the quality of the model, that is, how well it makes predictions on new data. Accordingly, the terms loss and accuracy denote the loss and accuracy on the training set, while the validation loss (val_loss) and validation accuracy (val_acc) are the loss and accuracy on the validation set. Comparing the two methods in terms of RAM and flash optimization, model applicability, and compatibility, the "MFCC and CNN" method was judged the more useful of the two for voice recognition projects on embedded systems. Compared with the PC experiments, the embedded system showed performance losses.
The causes can be grouped into three main categories: sound insulation problems, user errors, and environmental effects. Noise in the recognition environment reduces recognition performance, both by distorting the spectral properties of speech and by causing false detection of endpoints. The acoustic properties of the room in which recognition takes place also affect the spectral properties of speech. In light of these problems, future work will focus on a larger dataset, speech direction detection, active noise filtering, a voice child lock, learning the acoustic properties of the environment where the device is located, and relaying voice commands across devices connected to Bluetooth mesh networks.
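
The training/validation split and the per-epoch metrics dictionary described in the abstract can be sketched in plain Python. The dictionary below only imitates the shape of the `history` attribute on the object returned by Keras `fit()`; its numbers are invented for illustration and are not results from the thesis. The helper names are hypothetical, and the metric keys vary by Keras version (for example `val_acc` in older releases and `val_accuracy` in newer ones).

```python
def split_train_validation(samples, validation_fraction=0.2):
    """Hold out the last part of the data for validation only."""
    n_val = round(len(samples) * validation_fraction)
    cut = len(samples) - n_val
    return samples[:cut], samples[cut:]


def best_validation_epoch(history):
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


def looks_overfit(history):
    """Flag the typical overfitting pattern.

    True when training loss is still lower at the last epoch than at
    the best validation epoch, while validation loss has risen again.
    """
    loss, val_loss = history["loss"], history["val_loss"]
    best = best_validation_epoch(history) - 1
    if best == len(val_loss) - 1:
        return False  # validation loss was still improving at the end
    return loss[-1] < loss[best] and val_loss[-1] > val_loss[best]


# Invented numbers, shaped like a Keras history dictionary.
example_history = {
    "loss": [1.20, 0.80, 0.50, 0.30, 0.20],
    "val_loss": [1.25, 0.90, 0.70, 0.75, 0.85],
}

train_set, validation_set = split_train_validation(list(range(10)))
best_epoch = best_validation_epoch(example_history)
overfit = looks_overfit(example_history)
```

In this invented example the validation loss bottoms out at epoch 3 and then rises while the training loss keeps falling, which is the pattern the abstract describes as the model fitting the training data but failing to generalize.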

Similar Theses

Thesis No: 921173
Title: A novel real-time full parameters estimator and multi-purpose MPC for pmasynrms: A sparsity and parallelization based approach
Title (Translation): Pmasynrmler için yenilikçi gerçek zamanlı tüm parametre kestirimcisi ve çok amaçlı bir MPC: Seyrek ve koşut zamanlı bir yaklaşım
Author: ALPER TAP
Thesis Type: Doctorate
Language: English
Year: 2025
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Department of Electrical Engineering
Advisor: PROF. DR. LALE ERGENE

Thesis No: 244149
Title: Reduced instruction set processor design
Title (Translation): İndirgenmiş komut setli işlemci tasarımı
Author: ALİ ŞENTÜRK
Thesis Type: Master's
Language: English
Year: 2009
Subject: Computer Engineering and Computer Science and Control
University: Çukurova Üniversitesi
Department: Department of Computer Engineering
Advisor: ASST. PROF. DR. MUSTAFA GÖK

Thesis No: 517131
Title: Çağdaş kabile konutlarının doğal yerel veriler bağlamında analizi
Title (Translation): Analysis of contemporary tribal huts according to local natural properties
Author: SELİN KÜÇÜK
Thesis Type: Master's
Language: Turkish
Year: 2018
Subject: Anthropology
University: İstanbul Teknik Üniversitesi
Department: Department of Architecture
Advisor: PROF. DR. NİHAL ARIOĞLU

Thesis No: 363546
Title: Modelling, control and implementation of an unmanned vertical take-off and landing aircraft
Title (Translation): Dikey iniş kalkış yapabilen bir insansız hava aracının modellenmesi, kontrolü ve gerçeklenmesi
Author: FARABİ AHMED TARHAN
Thesis Type: Master's
Language: English
Year: 2014
Subject: Computer Engineering and Computer Science and Control
University: İstanbul Teknik Üniversitesi
Department: Department of Control and Automation Engineering
Advisor: PROF. DR. HAKAN TEMELTAŞ

Thesis No: 553703
Title: Voice recognition system with score level fusion methods and embedded system design
Title (Translation): Skor seviyesi füzyon metotları ile ses tanıma sistemi ve gömülü sistem tasarımı
Author: CİHAN AKIN
Thesis Type: Master's
Language: English
Year: 2019
Subject: Electrical and Electronics Engineering
University: İstanbul Teknik Üniversitesi
Department: Department of Electrical and Electronics Engineering
Advisor: ASSOC. PROF. MÜRVET KIRCI
