Karma söz üretme yöntemi ile Türkçe yazılı metinden söze geçme

Text-to-speech in Turkish language by using a mixed speech synthesis method

Thesis No: 39615
Author: MURAT SERVET ERER
Advisor: PROF. DR. AHMET DERVİŞOĞLU
Thesis Type: Master's
Subjects: Electrical and Electronics Engineering
Keywords: Not specified.
Year: 1994
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Fen Bilimleri Enstitüsü
Department: Not specified.
Division: Not specified.
Number of Pages: 113

Abstract

This study examines the conversion of arbitrary written Turkish text into speech (artificial speech). Using a personal computer equipped with a Sound Blaster Pro 2.0 sound card and programs written in Turbo Pascal 6.0, the conversion of any written Turkish text into artificial speech has been implemented. For this purpose a new Mixed Speech Synthesis Method has been developed, based on the combined use of two general speech synthesis methods: the Direct Speech Synthesis Method (in which a previously digitized and stored speech signal is converted back to analog form) and the Rule-Based Speech Synthesis Method (in which artificial speech is produced according to the rules of grammar). The method rests on the syllabic structure of Turkish. During development, the syllable types of Turkish were examined. To reduce the memory that would be required if every syllable were used in a text-to-speech application, a special syllable decomposition method is given that expresses syllables of more than two letters in terms of one- and two-letter syllables and auxiliary consonants. In the application, the written text is first treated as a combination of one- and two-letter syllables and auxiliary consonants and is decomposed into these elements in accordance with the rules of grammar. This stage falls within the scope of the Rule-Based Speech Synthesis Method. The one- and two-letter syllables and auxiliary consonants in the decomposed text are then voiced one after another, in their order in the text, by the Direct Speech Synthesis Method, which completes the conversion of the text into speech.
The implementation requires a one-time preparatory stage in which natural pronunciations of the one- and two-letter syllables and auxiliary consonants are digitized and stored in separate files for later use. A specially developed amplitude equalization algorithm was applied to the sound files obtained at this stage, reducing the differences in maximum and minimum amplitude levels among the files and thereby improving intelligibility.
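The abstract does not describe how the amplitude equalization algorithm works. As a rough illustration of the general idea only, the sketch below uses plain peak normalization, which is a stand-in technique and not the thesis's algorithm. It assumes unsigned 8-bit samples with silence at 128; the function names and the `target_peak` parameter are hypothetical.

```python
def peak_deviation(samples, midpoint=128):
    """Largest distance of any sample from the silence level."""
    return max(abs(s - midpoint) for s in samples)


def equalize_peak(samples, target_peak, midpoint=128):
    """Scale one recording so that its peak deviation becomes target_peak.

    Assumption: samples are unsigned 8-bit values (0..255) centered on 128.
    This is ordinary peak normalization, used here as a stand-in.
    """
    peak = peak_deviation(samples, midpoint)
    if peak == 0:
        return list(samples)  # a silent recording is left unchanged
    gain = target_peak / peak
    scaled = (midpoint + round((s - midpoint) * gain) for s in samples)
    # Clamp to the valid 8-bit range after scaling.
    return [min(255, max(0, value)) for value in scaled]


quiet = [128, 138, 118, 128]   # made-up recording with a small swing
loud = [128, 255, 1, 128]      # made-up recording with a large swing
balanced = [equalize_peak(rec, target_peak=100) for rec in (quiet, loud)]
print([peak_deviation(rec) for rec in balanced])
```

After this step both made-up recordings swing by the same amount around 128, which is the kind of level difference the abstract says was reduced.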

Abstract (Translation)

Communication between people and computerized systems is becoming more effective as speech synthesis and text-to-speech methods advance in both hardware and software. In this study, a method for text-to-speech in Turkish has been designed and implemented by combining the Direct Speech Synthesis Method with the Speech Synthesis Method Based on Linguistic Rules. Direct speech synthesis is the simplest way of producing artificial speech: analog speech is converted to digital form by sampling and stored, and the stored binary data is converted back to an analog speech signal whenever it is needed. Speech synthesis based on linguistic rules is described as one of the most difficult artificial speech production methods, because it depends strictly on the linguistic rules of the language in use. Every language has its own linguistic rules, so a solution to the text-to-speech conversion problem has to be designed specifically for the target language; a method used for one language cannot be carried over completely to another. For example, if a text-to-speech algorithm developed for English is applied to Turkish, errors occur such as mispronounced words and unrecognized Turkish alphabet characters. The first step is therefore to study the rules of the language carefully. Turkish is built on syllables: the words in a text are formed from different combinations of syllables, and these syllables can be separated from one another without any loss in their pronunciation. According to the linguistic rules of Turkish, a syllable can contain at most five letters.
Using this property, syllables can be classified into five groups by the number of letters they contain: one, two, three, four and five letters. One-letter syllables are the vowels, so there are eight of them in Turkish. Two-letter syllables are the vowel-consonant and consonant-vowel combinations. Since the Turkish alphabet has eight vowels and 21 consonants, the total number of two-letter syllables can be calculated as 336. Three-letter syllables are the consonant-vowel-consonant, consonant-consonant-vowel and vowel-consonant-consonant combinations. Although the total number of three-letter syllables can be calculated as 10584, only a couple of thousand of them are meaningful. Four-letter syllables are the consonant-vowel-consonant-consonant and consonant-consonant-vowel-consonant combinations; only a few of them are meaningful. Five-letter syllables are not originally Turkish. They are consonant-consonant-vowel-consonant-consonant combinations and are meaningful words on their own. The text-to-speech conversion method described in this study works in two stages: the text is first separated into its syllables using the speech synthesis method based on linguistic rules, and the digitized speech data of the individual syllables is then converted to analog sound one after another using the direct speech synthesis method, so that the sequence forms the complete artificial speech of the text. The main linguistic difficulty of this method is the irregularity of the process of separating text into syllables. The main technical difficulty is the memory capacity it demands.
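The syllable counts quoted above follow from simple counting over vowel (V) and consonant (C) positions. The short sketch below reproduces the figures 8, 336 and 10584 from the stated alphabet sizes; the helper names are illustrative only and do not come from the thesis.

```python
from math import prod

VOWELS = 8        # number of vowels, as stated in the summary
CONSONANTS = 21   # number of consonants, as stated in the summary


def count_pattern(pattern):
    """Count the letter strings that fit one pattern such as "CVC"."""
    return prod(VOWELS if symbol == "V" else CONSONANTS for symbol in pattern)


def count_patterns(patterns):
    """Total over several patterns, e.g. all two-letter syllable shapes."""
    return sum(count_pattern(p) for p in patterns)


one_letter = count_patterns(["V"])
two_letter = count_patterns(["VC", "CV"])
three_letter = count_patterns(["CVC", "CCV", "VCC"])
print(one_letter, two_letter, three_letter)  # 8 336 10584
```

These are upper bounds on letter combinations; as the summary notes, only a fraction of the three-letter combinations are meaningful syllables.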
If all syllable types of Turkish were sampled and stored in the computer (assuming a speech frequency band of 5 kHz, a sampling frequency of 10 kHz in accordance with the Nyquist theorem, and one byte per sample without any coding), storing the total of 4000 syllables would require approximately 14 Mbytes of memory. Sampling the voice data of all the syllables would also take a long time. To reduce the memory demand, a new separation method has been developed. In this method, the text is divided only into one- and two-letter syllables and some consonants (pseudo-syllables). Three-, four- and five-letter syllables are divided into one- and two-letter syllables by special rules. A three-letter syllable of the consonant-vowel-consonant form can be regarded as the combination of two two-letter syllables that share the same vowel at their junction, so it can be separated into two two-letter syllables. A three-letter syllable of the consonant-consonant-vowel or vowel-consonant-consonant form can be regarded as the combination of a two-letter syllable, formed by the vowel and the consonant nearest to it, and a consonant pseudo-syllable. In the same way, a four-letter syllable is first separated into a one-consonant pseudo-syllable and a three-letter syllable; the three-letter syllable is then separated into two two-letter syllables, giving: one-consonant pseudo-syllable + two-letter syllable + two-letter syllable. A five-letter syllable is first separated into a four-letter syllable and a one-consonant pseudo-syllable. Then, as the four-letter syllable is separated into a one-consonant pseudo-syllable and a three [...] above.
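The separation rules above can be sketched in a few lines. This is a minimal reading of the summary, not the thesis's TXTAYR.PAS program: it assumes the input is a single lowercase syllable containing exactly one vowel, and the function name and the vowel set used here are assumptions of this sketch.

```python
TURKISH_VOWELS = set("aeıioöuü")  # the eight vowels; input assumed lowercase


def decompose_syllable(syllable):
    """Split one syllable into one- and two-letter syllables plus
    single-consonant pseudo-syllables, following the rules in the summary."""
    positions = [i for i, ch in enumerate(syllable) if ch in TURKISH_VOWELS]
    if len(positions) != 1:
        raise ValueError("expected a syllable with exactly one vowel")
    v = positions[0]
    before, vowel, after = syllable[:v], syllable[v], syllable[v + 1:]

    # Consonants not adjacent to the vowel become pseudo-syllables.
    elements = list(before[:-1])
    if before:
        elements.append(before[-1] + vowel)   # consonant + vowel
    if after:
        elements.append(vowel + after[0])     # vowel + consonant
    if not before and not after:
        elements.append(vowel)                # a bare vowel syllable
    elements.extend(after[1:])
    return elements


for example in ["a", "el", "kal", "alt", "tren", "türk"]:
    print(example, decompose_syllable(example))
```

Note that a consonant-vowel-consonant syllable such as "kal" yields "ka" and "al", so the vowel is voiced twice; this is the repetition at the junction that the rules describe.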
During the separation, the spaces between words and the punctuation marks are replaced with special characters that mark a stop (silence) period for the computer program explained in the second step. The separated artificial speech elements and the special characters are written to another file, separated by spaces. This format is defined as the AYR format, and the file is called an AYR-type file. In the second step of the second procedure, the AYR-type file is split into parts of a length that can be used as a command by the Sound Blaster voice file joining software (JOINTVOC.EXE). The voice files of the artificial speech elements are then joined in the order in which they occur in the text, completing the text-to-speech conversion. To implement the Turkish text-to-speech algorithm, a personal computer equipped with a Sound Blaster Pro 2.0 multimedia sound card is used as hardware. Developed and produced by Creative Labs, Inc., the Sound Blaster Pro is a stereo 8-bit sound card with two separate DAC (digital-to-analog converter) units for stereo analog output and a mono/stereo selectable ADC (analog-to-digital converter) unit for sampling the analog input. A microphone or a line-in sound source can be connected to the card's input, and headphones, loudspeakers or a tape deck can be connected to its output terminals. The minimum sampling frequency is 4000 Hz and the maximum is 44.1 kHz; the sampling frequency can be changed with software designed for the card. The analog input is sampled to 8 bits (1 byte), so the amplitude of the input signal is quantized to one of 256 levels. The minimum peak level of the analog input signal is sampled as 0, the maximum peak level as 255, and the 0 V level as 128.
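The step in which the space-separated AYR file is split into command-sized parts can be illustrated as follows. The summary gives neither the pause character nor the length limit nor the JOINTVOC.EXE command syntax, so the "_" pause mark, the `max_chars` value and the function name below are all hypothetical, and no actual command line is built.

```python
def split_into_parts(tokens, max_chars):
    """Group AYR tokens, in order, into parts whose space-joined length
    does not exceed max_chars (a hypothetical command-length limit)."""
    parts, current, length = [], [], 0
    for token in tokens:
        added = len(token) + (1 if current else 0)  # +1 for the separator
        if current and length + added > max_chars:
            parts.append(current)                   # close the full part
            current, length, added = [], 0, len(token)
        current.append(token)
        length += added
    if current:
        parts.append(current)
    return parts


# "türk dil" after separation, with "_" as an assumed pause mark.
ayr_line = "tü ür k _ di il"
for part in split_into_parts(ayr_line.split(), max_chars=8):
    print(" ".join(part))
```

Each part keeps the elements in their original order, so joining the corresponding voice files part by part preserves the order of the text.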
The Sound Blaster Pro sound card has a built-in LPF (low-pass filter) and an HPF (high-pass filter) in the input stage. The cut-off frequency of the LPF is 3.2 kHz and that of the HPF is 8.8 kHz. The filters are selected with software designed for the SB Pro. Many Turkish text files have been converted to speech with this method, and intelligible speech has been obtained. An echo effect is audible, mainly because of the repetition of the junction vowel that arises when a three-letter syllable is separated into two two-letter syllables. Text-to-speech conversion can be used in many areas of daily life to make it more comfortable. Speech synthesis systems, mostly based on the direct speech synthesis method, are used in talking cars, elevators, telephone services, computer games and so on. With the development of new systems, especially those using neural network technology, it will become very easy for blind people to obtain information from written documents.
The names of the computer programs in this study are as follows.
HISTOGRA.PAS: designed to investigate the histogram and some other characteristics of a voice file.
EQ363.PAS: designed to calculate the volume level limitation values according to the given tolerance.
DENGELE.PAS: designed to equalize the peak-to-peak volume differences among the voice files of the 363 artificial speech elements.
TXTAYR.PAS: designed to convert a TXT-format file into an AYR-format file.
AYROKU.PAS: designed to convert artificial speech elements into artificial speech one after another.
The programs, written in the Borland Turbo Pascal 6.0 programming language, are given with complete listings and explanations in Appendix 1, Appendix 2, Appendix 3, Appendix 4, Appendix 5, Appendix 6 and Appendix 7.

