Yazım kurallarına uygun yazılmamış Türkçe metinleri makine çevirisi yöntemleriyle normalleştirme
Normalizing non-canonical Turkish texts using machine translation approaches
- Thesis No: 601204
- Advisor: ASSOC. PROF. DR. AHMET CÜNEYD TANTUĞ
- Thesis Type: Master's
- Subjects: Computer Engineering and Computer Science and Control
- Keywords: Not specified.
- Year: 2019
- Language: Turkish
- University: İstanbul Teknik Üniversitesi
- Institute: Fen Bilimleri Enstitüsü
- Department: Bilgisayar Mühendisliği Ana Bilim Dalı
- Discipline: Bilgisayar Mühendisliği Bilim Dalı
- Page Count: 75
Abstract
With the spread of social media use in our lives, online-generated content has reached unprecedented volumes. Because it is mostly written without following grammar rules, such text is difficult to process with traditional natural language processing tools. Text normalization is the task of converting non-canonically written text into its grammatically correct form. This step usually precedes other natural language processing tools and improves their performance. Beyond traditional normalization approaches, text normalization can also be framed as a machine translation problem between non-canonical and canonical text. However, training such systems requires large amounts of parallel data. Unlike other machine translation problems, no previously annotated data exists for this task, so this thesis first focuses on generating synthetic parallel data. Two methods are used: randomly corrupting clean words according to certain rules, and matching clean-noisy word pairs. These methods require two corpora, one dominated by misspelled words and one by correctly spelled words. To build the corpus of misspelled words, twenty-five million unique tweets were scraped from Twitter between November 2018 and January 2019; a subtitle corpus was chosen as the source of clean words. After the words in these two corpora were preprocessed, parallel data was generated with the methods above. The clean-noisy word pair matching method is based on a weighted Levenshtein distance. Error types that this method cannot capture, or captures only with difficulty, were simulated by randomly corrupting clean words according to certain rules. Machine translation systems were trained on the generated synthetic parallel data, and a text normalization system was built around them with pre- and post-processing components. The resulting system was observed to outperform previous systems by a wide margin.
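A minimal sketch of the rule-based corruption method mentioned above, in Python. The rule classes (vowel dropping, diacritic stripping, letter repetition) correspond to error types named in the abstracts, but the exact rule set and probabilities used in the thesis are not given here, so the ones below are illustrative assumptions:

```python
import random

# Illustrative corruption rules approximating common Turkish social-media
# errors; the thesis's actual rules and probabilities may differ.
ASCII_MAP = str.maketrans("çğıöşü", "cgiosu")  # diacritic stripping
VOWELS = set("aeıioöuü")

def corrupt(word: str, rng: random.Random) -> str:
    """Apply one randomly chosen corruption rule to a clean word."""
    rule = rng.choice(["drop_vowel", "strip_diacritics", "repeat_letter"])
    if rule == "drop_vowel":
        idxs = [i for i, ch in enumerate(word) if ch in VOWELS]
        if idxs:
            i = rng.choice(idxs)
            return word[:i] + word[i + 1:]      # vowel omission: "görüyor" -> "göryor"
    elif rule == "strip_diacritics":
        return word.translate(ASCII_MAP)        # "görüyor" -> "goruyor"
    elif rule == "repeat_letter":
        i = rng.randrange(len(word))
        return word[:i + 1] + word[i] * rng.randint(1, 3) + word[i + 1:]  # "çooook"-style stress
    return word

rng = random.Random(42)
pairs = [(w, corrupt(w, rng)) for w in ["görüyorsun", "geldi", "çok"]]
print(pairs)  # synthetic clean-noisy training pairs
```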
Abstract (Translation)
With the increase of online user-generated content (UGC), text normalization has gained huge importance in natural language processing. People now express their ideas and thoughts on social media platforms such as Twitter, Facebook and YouTube. Additionally, it is common to have a comment section on news pages and a review section on e-commerce pages. Ignoring grammatical rules, omitting letters and making other spelling mistakes are common when posting content to these places. These data are quite valuable to firms that want to keep track of their customers' satisfaction. Nevertheless, colloquial writing makes it difficult to process the immense amount of data generated by humans. Many NLP tools are designed to work well only with formally written text rather than informal text. Due to UGC's idiomatic nature, traditional NLP tools require a preprocessing step that we call text normalization.
Text normalization for Turkish has been categorized into six groups: letter case correction, diacritic restoration, vowel restoration, accent normalization, spelling correction, and other errors (repeating characters, abbreviations). Since labeled data is limited in this field, previous systems tend to use rule-based systems and statistical systems trained on automatically created data. However, these systems cannot generalize as well as machine learning algorithms do. It has been shown that machine translation systems are capable of correcting not only local errors like letter omission but also grammatical errors like subject-verb disagreement.
Labeled data for text normalization is a scarce resource, and labeling data, especially training data, takes a lot of time and human effort. Given the lack of manually labeled data for the text normalization problem and the data hunger of machine translation systems, the need to produce data artificially emerged. Producing data only with rule-based methods (such as dropping vowels or replacing letters with similar ones) inevitably introduces bias into the data, and such methods may sometimes produce unrealistic words. Moreover, not every error type in the language can be enumerated in advance, and even the known ones can be very difficult to simulate, so rules alone are insufficient for producing artificial data. Therefore, in order to obtain a more realistic corpus, the artificial data was produced using a weighted Levenshtein algorithm. Since this method matches noisy words obtained from a real collection against clean words, it is expected to produce more realistic data than the rule-based method. Clean words are collected from an existing collection and filtered through a morphological analyzer. Twitter is chosen as the ill-formed word source, and millions of tweets are scraped from the site; these tweets are then parsed with a tokenizer and the non-word components are eliminated. However, due to computational constraints, even though catching local errors is easy with this approach, it is more difficult to spot errors that spread over multiple words (e.g. question phrases like gruyormusun? vs. görüyor musun?). To tackle this problem, we use an additional rule-based approach that roughly simulates these types of errors. Since Turkish is a morphologically rich, agglutinative language, it has a very large vocabulary; therefore, it is hard to catch all the spelling errors found in Turkish with a pure Levenshtein algorithm.
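As a rough sketch of the weighted Levenshtein matching described above, assuming a cost table where Turkish diacritic confusions such as o/ö and u/ü are cheaper than arbitrary substitutions; the actual weights are tuned in the thesis and extended with the features described next:

```python
# Minimal weighted Levenshtein sketch. The cost values below are
# illustrative assumptions; the thesis tunes its own weights and adds
# further features (adjacent-key deletion, repetitive letters, thresholds).
CHEAP_SUBS = {("c", "ç"), ("g", "ğ"), ("i", "ı"), ("o", "ö"), ("s", "ş"), ("u", "ü")}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    # Diacritic confusions are cheap because they are frequent in tweets.
    if (a, b) in CHEAP_SUBS or (b, a) in CHEAP_SUBS:
        return 0.3
    return 1.0

def weighted_levenshtein(noisy: str, clean: str,
                         ins_cost: float = 1.0, del_cost: float = 1.0) -> float:
    """Standard edit-distance DP with per-pair substitution costs."""
    m, n = len(noisy), len(clean)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost,
                           dp[i][j - 1] + ins_cost,
                           dp[i - 1][j - 1] + sub_cost(noisy[i - 1], clean[j - 1]))
    return dp[m][n]

print(weighted_levenshtein("goruyorsun", "görüyorsun"))  # 0.6 with these weights
```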
Based on this, new features have been added to the Levenshtein algorithm to capture spelling errors more broadly. One class of spelling errors seen in online environments is typographical errors, among them the accidental typing of a letter's neighbor on the keyboard. For example, it is possible to write "geldfi" when trying to write "geldi", since the letters d and f are adjacent on Q keyboards. It is impossible to catch these errors with the existing Levenshtein weights, so a "Delete Adjacent Weight" feature is added to the algorithm to detect them. Another frequent spelling error is the intentional repetition of a certain letter to stress a word or to write the spoken language as it is; since the Levenshtein algorithm only compares two strings against each other, it cannot capture errors that occur within a single string. To overcome this, a "Repetitive Letter Weight" feature, which considers the letters within the same string, is added to the algorithm. Finally, a "Stopping by Threshold" feature is added to minimize the running time of the algorithm.
To produce clean-noisy word pairs of high quality, the parameters of the Levenshtein algorithm need to be set precisely. To achieve that, the parameters are first set intuitively and then fine-tuned on Val_small over repeated runs. After the parameters are determined and the algorithm is run, a list of noisy words is obtained for every clean word. These lists do not always contain correct noisy words, and the incorrect ones need to be dismissed. Eliminating all mismatches is not possible because it would require manual control, but leaving the lists intact would hurt the performance of the translation system. Therefore, a selection method is needed that minimizes the negative effect on the translation system and automatically eliminates as many false matches as possible without requiring manual control. As the selection method, a popular genetic-algorithm operator, "Roulette Wheel Selection", is chosen (a sketch of this selection step follows below). Using this method, first a Levenshtein distance is selected from the 0.0 to 1.0 range; then one word at that distance is randomly selected as the noisy counterpart of that specific clean word. Apart from these, cases such as catching the wrong spelling of multi-word phrases and of proper names also arise; such errors are simulated randomly according to predefined rules.
The proposed translation pipeline consists of three main components: a pre-processing component, a translation component, and a post-processing component. The pre-processing component provides basic normalization at the letter level and also converts all letters to lower case, which shrinks the vocabulary and thus reduces the complexity of the translation task. The post-processing component truecases the output of the translation component, which consists of only lower-case letters. Lastly, the translation component performs the normalization itself. In this thesis, it is proposed to solve the Turkish text normalization problem using machine translation approaches. In contrast to the previous state of the art, the proposed system solves the problem with fewer components and takes context into account. Furthermore, the system is less affected by human errors as it contains no rule-based components, and it does not require external natural language processing tools.
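The roulette wheel selection step might be wired up as in the following sketch. The inverse-distance fitness used for the wheel is an assumption, since the abstract only states that a distance in the 0.0-1.0 range is drawn first and a word at that distance is then picked uniformly:

```python
import random
from collections import defaultdict

def roulette_select(candidates: dict[str, float], rng: random.Random) -> str:
    """Pick one noisy candidate for a clean word via roulette wheel selection.

    candidates maps noisy word -> (e.g. length-normalized) Levenshtein
    distance in [0.0, 1.0]. Assumption: smaller distances get larger wheel
    slices; the thesis does not spell out its exact fitness function here.
    """
    # Group candidates by distance, then spin the wheel over the distances.
    by_dist = defaultdict(list)
    for word, dist in candidates.items():
        by_dist[dist].append(word)
    dists = list(by_dist)
    weights = [1.0 - d for d in dists]                 # illustrative fitness
    chosen_dist = rng.choices(dists, weights=weights, k=1)[0]
    # Within the chosen distance, pick uniformly at random.
    return rng.choice(by_dist[chosen_dist])

rng = random.Random(0)
print(roulette_select({"goruyorsun": 0.06, "görüyosun": 0.1, "gruyorsun": 0.2}, rng))
```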
According to the results, the character-based statistical machine translation method proposed in this thesis outperforms the best available study by high margins on several test sets. These results show that the unsupervised parallel data generation methods and the machine translation method presented here perform strongly on the Turkish text normalization problem.
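Put together, the three-component pipeline summarized in the abstract could be skeletonized as below; translate is a stub standing in for the trained character-based MT model, and the sentence-initial truecasing heuristic is an illustrative assumption:

```python
# Sketch of the pre-process -> translate -> post-process pipeline.

def preprocess(text: str) -> str:
    """Letter-level cleanup plus lowercasing to shrink the vocabulary.
    Note: production code would need Turkish-aware casing ("I" -> "ı")."""
    return " ".join(text.split()).lower()

def translate(text: str) -> str:
    """Placeholder for the noisy->clean machine translation component."""
    return text  # the real system would decode with the trained model

def postprocess(text: str) -> str:
    """Truecase the all-lowercase translation output (naive heuristic)."""
    return text[:1].upper() + text[1:] if text else text

def normalize(text: str) -> str:
    return postprocess(translate(preprocess(text)))

print(normalize("  gruyormusun  Beni "))  # -> "Gruyormusun beni" (translation stubbed out)
```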
Similar Theses
- Kısa metinlerde varlık ismi tanıma
Named entity recognition on Turkish short texts
BEYZA EKEN
Master's
Turkish
2015
Computer Engineering and Computer Science and Control; İstanbul Teknik Üniversitesi; Bilgisayar Mühendisliği Ana Bilim Dalı
ASST. PROF. DR. AHMET CÜNEYD TANTUĞ
- Doğal dil işleme ile Türkçe yazım hatalarının denetlenmesi
Turkish spell check with natural language processing
AYNUR DELİBAŞ
Master's
Turkish
2008
Computer Engineering and Computer Science and Control; İstanbul Teknik Üniversitesi; Bilgisayar Mühendisliği Ana Bilim Dalı
PROF. DR. EŞREF ADALI
- Diacritic restoration of Turkish sentences
Türkçe cümlelerde fonetik işaretlerin düzeltilmesi
HÜSEYİN EKİCİ
Master's
English
2021
Computer Engineering and Computer Science and Control; Galatasaray Üniversitesi; Bilgisayar Mühendisliği Ana Bilim Dalı
ASST. PROF. DR. İSMAİL BURAK PARLAK
PROF. TANKUT ACARMAN
ASST. PROF. DR. CEMAL OKAN ŞAKAR
- Sürûrî Efendi'nin Şerh-i Mesnevî'si (VI. cilt) (metin-dizin)
Sharh-e Mathnawi of Sururi Efendi (volume VI) (text-index)
SEYİT ALİM ÖZDEMİR
Master's
Turkish
2022
Eastern Languages and Literature; Kırıkkale Üniversitesi; Doğu Dilleri ve Edebiyatları Ana Bilim Dalı
PROF. DR. ADNAN KARAİSMAİLOĞLU