Türkçe cümlelerde isim tamlamalarının bulunması

Noun phrase chunking of Turkish sentences


Thesis No: 353806
Author: KÜBRA ADALI
Advisors: ASST. PROF. DR. AHMET CÜNEYD TANTUĞ
Thesis Type: Master's
Subjects: Computer Engineering and Computer Science and Control
Keywords: Natural Language Processing, Noun Phrases, Shallow Parsing, Machine Learning, Conditional Random Fields, Parallel Corpus
Year: 2014
Language: Turkish
University: İstanbul Teknik Üniversitesi
Institute: Fen Bilimleri Enstitüsü
Department: Computer Engineering
Division: Not specified.
Number of Pages: 72

Abstract

This thesis aims to detect the boundaries of the noun phrases in Turkish sentences. Finding the noun phrases in Turkish sentences is a component that can support, or form a basis for, work such as named entity recognition, parsing of Turkish sentences, sentence-level semantic analysis, text mining, information extraction, and dependency analysis. For this reason, beyond its own function, it is a study that can support other natural language processing tools. The structure of the system is based on a rule-based system and on Conditional Random Fields, a type of sequence classifier. Because the system relies mainly on a machine learning technique, the first goal was a machine learning model that could give the best results. To this end, after the test data was fixed, the training data that could give the best results was selected from the remaining data. Contrary to conventional methods, the training data was created using a Turkish-English parallel corpus. The English side of the parallel corpus was parsed with one natural language processing tool, and the parallel corpus was aligned with another. The Turkish counterparts of the parsed English noun phrases were then found through the alignment, producing training data in which the noun phrase boundaries are marked. From this training data, the sentences giving the best results were selected using various parameters, and the model was optimized. In conclusion, noun phrase extraction was carried out first with a rule-based system and then with a sequence classifier, and the training data for this sequence classifier was produced automatically. The results obtained with the sequence classifier are better than those of the rule-based system.

Abstract (Translation)

This thesis aims to detect the boundaries of the noun phrases in Turkish sentences. Noun phrase chunking of Turkish sentences can support, or serve as a baseline for, named entity recognition, parsing of Turkish sentences, sentiment analysis, text mining, information extraction, dependency parsing, text summarization, machine translation, and similar tasks. To understand, interpret, or use the meaning of a sentence, the usual first step is to parse it with a dependency parser or a constituency parser. However, when only summarized information is needed from a huge amount of data, as in information extraction, full parsing can take an unnecessarily large effort and increase the error rate. In these cases a shallower analysis is more useful and easier to apply, so shallow parsing, also called chunking, is chosen for these natural language processing modules. The most common type of shallow parsing is noun phrase chunking, because noun phrases commonly contain the main words of the sentence. There are two main previous works for Turkish. The first is a rule-based noun phrase chunker that uses a dependency parser together with rules over the dependencies and morphological features. The second is not strictly a noun phrase chunker; it finds the chunks that are constituents of the sentence. We used two different system architectures. The first is a rule-based system that serves as the baseline, and the second is based on Conditional Random Fields, a type of sequence classifier, and uses morphological features and dependency relations. For the rule-based system, we built a human-annotated set.
Half of this set is used as the development set, from which the rules are formed, and the other half is used as the test set. The rule-based system has three main parts: preprocessing, dependency parsing, and the rule set. The first part consists of five steps: normalization, sentence splitting, word tokenization, morphological analysis, and morphological disambiguation. In the normalization step, the text is deasciified, vowelized, spell-checked, and passed through the other normalization steps. The morphological analyses produced by the first part are passed to the second part, which produces the dependency parses of the sentences. From the second part we obtain the relation types and relation numbers. In the last part, we apply rules that use the morphological features taken from the first part and the dependency relations taken from the second part. To produce the rules, we used statistics on the counts of chunk labels obtained from the development set. The rules also use the relation types, relation numbers, and sentence chunks. After finishing the development of the rules, we evaluated the whole rule-based system on our test set. Our main system uses machine learning and the same test set as the rule-based system. Because it relies on machine learning, the aim is to optimize a model that gives the best results for noun phrase chunking. To this end, after the test set is isolated from the corpus, the training sentences that give the best results are selected from the rest of the corpus. Instead of conventional manual annotation of training data, an automatic system that requires a Turkish-English parallel corpus is used to produce the annotated training set. This automatic production of chunked sentences for use as training data involves three processes.
First, the English side of the parallel corpus is parsed and annotated with second-level noun phrase chunks by an English parsing tool. Second, the parallel corpus is aligned word by word by another natural language processing tool, GIZA++. Third, the noun phrases chunked on the English side are located in the aligned Turkish sentences and annotated as noun phrase chunks. In total, we used two different natural language processing tools for the automatic annotation of the training data. After the automatic chunking process, we tried to select the most accurately chunked sentences, because the annotated sentences contain errors that result from the natural language processing tools and the alignment mechanism. To select the most accurate sentences, we normalized the scores given by the natural language processing tools by sentence length. The annotated sentences that give the best results are selected using these normalized scores and some heuristic filters. The filtered sentences are used as the training set, and our model is optimized with this automatically annotated training set. After selecting the training set, we defined 14 different features and ran a set of experiments to select the optimum feature combination for training our conditional random fields model. In these experiments, the best-scoring feature is selected and added to the combination at each step. We ran them on half of our training data because the number of experiments was large and we wanted to finish them in less time. At the final step, we selected eleven features and used the selected group of automatically annotated sentences with this optimum eleven-feature combination. We trained our optimum model with a conditional random fields tool and tested it on our test set.
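The step-by-step feature experiments described above correspond to what is usually called greedy forward selection. The following is a minimal sketch of that idea, not the thesis's actual code: the function name `forward_select`, the stopping rule (stop when no remaining feature improves the score), the feature names, and the toy scoring function are all assumptions made for illustration. In a real experiment, `evaluate` would train a conditional random fields model with the given features and return its chunking score on held-out data.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection: at each step add the single feature that
    gives the highest score, and stop when no addition improves the score
    (the stopping rule is an assumption, not stated in the abstract)."""
    selected: List[str] = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-feature extension of the current combination.
        scored = [(evaluate(frozenset(selected + [f])), f) for f in remaining]
        step_score, step_feature = max(scored, key=lambda pair: pair[0])
        if step_score <= best_score:
            break  # no remaining feature improves the current combination
        selected.append(step_feature)
        remaining.remove(step_feature)
        best_score = step_score
    return selected, best_score


# Toy stand-in for "train a CRF with these features and score it".
# Feature names and gains are invented for illustration only.
TOY_GAINS = {"pos_tag": 0.50, "case_suffix": 0.20, "dep_relation": 0.10}


def toy_evaluate(features: FrozenSet[str]) -> float:
    useful = sum(TOY_GAINS.get(f, 0.0) for f in features)
    noise = 0.05 * sum(1 for f in features if f not in TOY_GAINS)
    return useful - noise


chosen, score = forward_select(
    ["surface_form", "pos_tag", "dep_relation", "case_suffix"], toy_evaluate
)
print(chosen)  # ['pos_tag', 'case_suffix', 'dep_relation']
```

Each step costs one model training per remaining feature, which is consistent with the abstract's remark that the experiments were run on half of the training data to save time.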
The results we get from the second system are promising. In conclusion, noun phrase chunking is performed with a sequence classifier in addition to a rule-based system, and the annotated sentences for the training set are produced automatically. The results we obtain from the second system are much better than those of the first, rule-based system. Our second system has three notable properties. The first is language independence: as long as a parallel corpus exists between English and the language to be chunked, our method is applicable to that language. The second is that our method produces annotated data automatically, with no need for manual annotation. The final property is that it is the first system in which machine learning is used for noun phrase chunking and conditional random fields are used for Turkish.
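The automatic annotation that underlies the second and third properties can be illustrated with a small sketch of projecting English noun phrase spans onto Turkish tokens through word-alignment links. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `project_np_chunks`, the B-NP/I-NP/O tag scheme, the representation of alignments as index pairs (rather than actual GIZA++ output), and the rule of taking the contiguous span that covers all linked tokens are hypothetical choices.

```python
from typing import List, Set, Tuple


def project_np_chunks(
    n_target_tokens: int,
    source_chunks: List[Tuple[int, int]],
    alignment: Set[Tuple[int, int]],
) -> List[str]:
    """Project source-side NP spans onto target tokens as B-NP/I-NP/O tags.

    source_chunks: inclusive (start, end) token spans of English NPs.
    alignment: (source_index, target_index) word-alignment links.
    """
    tags = ["O"] * n_target_tokens
    for start, end in source_chunks:
        # Target tokens linked to any token inside this English NP.
        linked = sorted({t for s, t in alignment if start <= s <= end})
        if not linked:
            continue  # the NP has no aligned target words
        lo, hi = linked[0], linked[-1]
        # Assumption: use the contiguous span covering all linked tokens,
        # and skip it if it would overlap an already projected chunk.
        if any(tag != "O" for tag in tags[lo:hi + 1]):
            continue
        tags[lo] = "B-NP"
        for i in range(lo + 1, hi + 1):
            tags[i] = "I-NP"
    return tags


# English: The(0) small(1) child(2) read(3) the(4) book(5)
# Turkish: Küçük(0) çocuk(1) kitabı(2) okudu(3)
english_nps = [(0, 2), (4, 5)]
links = {(1, 0), (2, 1), (3, 3), (4, 2), (5, 2)}
print(project_np_chunks(4, english_nps, links))
# ['B-NP', 'I-NP', 'B-NP', 'O']
```

Because nothing in the projection is specific to Turkish, the same procedure applies to any language paired with English in a parallel corpus, which is the language-independence property claimed above. Alignment and parsing errors carry over into the projected tags, which is why the abstract describes a later filtering step.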

Similar Theses

Thesis No
671177
II. Uluslararası Türk Dili ve Edebiyatı Sempozyumu bildirilerinin dil ve kavram incelemesi
II. Language and style analysis of the International Turkish Language and Concept Symposium papers
SEDANUR SÖNMEZ
Master's
Turkish
2020
Turkish Language and Literature, İstanbul Arel Üniversitesi
Department of Turkish Language and Literature Education
ASSOC. PROF. DR. ALİ TAŞTEKİN
Thesis No
414518
Processing genitive possessive long distance dependencies in Turkish
Türkçe ilgi - iyelik yapılarından oluşan ayrık ilişkilerin işlemlenmesi
SEDA AKPINAR
Master's
English
2015
Linguistics, Boğaziçi Üniversitesi
Department of Linguistics
ASSOC. PROF. DR. HAYRİYE MİNE NAKİPOĞLU DEMİRALP
Thesis No
205872
Nurullah Ataç'ın denemelerinde devrik yapılar
Inverted structures in Nurullah Ataç's essays
YETER TORUN
Doctorate
Turkish
2005
Turkish Language and Literature, Çukurova Üniversitesi
Department of Turkish Language and Literature
PROF. DR. MEHMET ÖZMEN
Thesis No
189319
Sait Faik Abasıyanık'ın Son Kuşlar isimli eserindeki hikayelerin kelime grupları ve Türkçe eğitimi bakımından değerlendirilmesi
The review of word groups in the stories of Son Kuşlar by Sait Faik Abasıyanık and evaluation of these word groups in terms of Turkish education
FATMA VANLI AKPINAR
Master's
Turkish
2006
Education and Training, Dokuz Eylül Üniversitesi
Department of Secondary Social Sciences Education
ASSOC. PROF. DR. ŞERİF ALİ BOZKAPLAN
PROF. DR. İLHAN GENÇ
ASST. PROF. DR. MEHMET YARDIMCI
Thesis No
777009
Yeni Uygur Türkçesinde eksiltili yapılar
Elliptical constructions in contemporary Uyghur Turkish
AHMET DAĞTEKİN
Master's
Turkish
2022
Linguistics, Nevşehir Hacı Bektaş Veli Üniversitesi
Department of Contemporary Turkic Dialects and Literatures
ASSOC. PROF. DR. NEŞE HARBALİOĞLU
