Linguistic category induction and tagging using the paradigmatic context representations with substitute words

Düşey kelime bağlamlarını olası kelimeler ile temsil ederek dil bilimsel sözcük kümeleri ve etikletlerinin bulunması

Thesis No: 352482
Author: MEHMET ALİ YATBAZ
Advisors: ASSOC. PROF. DR. DENİZ YURET
Thesis Type: Doctoral
Subjects: Computer Engineering and Computer Science and Control
Keywords: Not specified.
Year: 2014
Language: English
University: Koç Üniversitesi
Institute: Fen Bilimleri Enstitüsü (Institute of Science)
Department: Department of Computer Engineering
Division: Not specified.
Number of Pages: 160

Abstract

This thesis defines a new paradigmatic relation for representing word contexts. The paradigmatic relation of a word is the relation formed by the possible words that can take its place through substitution in its context. The syntagmatic relation, by contrast, is the relation established between a word and the words that precede or follow it. The possible substitutes of a word are computed with a statistical language model trained on raw data. As a result, word contexts are represented by the distributions of words that can appear in that context. This thesis defines several natural language processing models that can use this new paradigmatic relation and demonstrates their application to different sequence labeling problems in natural language processing. The basic goal of sequence labeling in natural language processing is to find the label sequence that corresponds one-to-one to a given word sequence. The models therefore take a word sequence as input and produce a label sequence as output, with one label per word. In unsupervised models the output sequence consists of the cluster names of the words, whereas in supervised models it consists of predefined labels for the words. Five different models are defined in this thesis. The first model is unsupervised and aims to cluster words using their substitute word distributions. The second model is an unsupervised model of the co-occurrence frequencies of a given word and its substitute words. The third model performs probabilistic voting using the words that can replace a given word. Unlike the first two models, it is a supervised model that requires the possible labels of each word. The fourth model proposes two methods that can be used together with hidden Markov models, which are widely used in sequence labeling. Like the previous model, it requires the possible labels of each word.
The last model in the thesis is the noisy channel model, which aims to recover the originally intended message from the noisy channel and the received message. In this model each context is a channel, each word is the received message, and the word's label is the intended message. In the final part of the thesis, these models are applied to labeling problems with different characteristics. The first two models are applied to unsupervised part-of-speech induction. The probabilistic voting model is tested on Turkish morphological disambiguation. The methods based on hidden Markov models are applied to supervised part-of-speech tagging. Finally, the noisy channel model is tested on word sense disambiguation.

Abstract (Translation)

This thesis introduces a new paradigmatic representation of word contexts. Paradigmatic representations of word context are constructed from the potential substitutes of a word, in contrast to syntagmatic representations, which are constructed from the properties of neighboring words. The potential substitutes are calculated by using a statistical language model that is trained on raw text without any annotation or supervision. Thus, each context is represented as a distribution of substitute words. This thesis introduces models with different properties that can incorporate the new paradigmatic representation, and discusses the applications of these models to the tagging task in natural language processing (NLP). In a standard NLP tagging task, the goal is to build a model in which the input is a sequence of observed words, and the output, depending on the level of supervision, is a sequence of cluster-ids or predefined tags. We define 5 different models with different properties and supervision requirements. The first model ignores the identity of the word, and clusters the substitute distributions without requiring supervision at any level. The second model represents the co-occurrences of words with their substitute words, and thus incorporates the word identity and context information at the same time. To construct the co-occurrence representation, this model discretizes the substitute distribution. The third model uses probabilistic voting to estimate the distribution of tags in a given context. Unlike the first and second models, this model requires the availability of a word-tag dictionary which can provide all possible tags of each given word. The fourth model proposes two extensions to the standard HMM-based tagging models in which both the word identity and the dependence between consecutive tags are taken into consideration.
The last one introduces a generative probabilistic model, the noisy channel model, for tagging tasks in which the word-tag frequencies are available. In this model, each context C is modeled as a distinct channel through which the speaker intends to transmit a particular tag T using a possibly ambiguous word W. To reconstruct the intended message (T), the hearer uses the distribution of possible tags in the given context Pr(T|C) and the possible words that can express each tag Pr(W|T). The models are applied and analyzed on NLP tagging tasks with different characteristics. The first two models are tested on unsupervised part-of-speech (POS) induction in which the objective is to cluster syntactically similar words into the same group. The probabilistic voting model is tested on the morphological disambiguation of Turkish, with the objective of disambiguating the correct morphological parse of a word, given the available parses. The HMM-based model is applied to the part-of-speech tagging of English, with the objective of determining the correct POS tag of a word, given the available tags. Finally, the last model is tested on the word-sense disambiguation of English, with the objective of determining the correct sense of a word, given the word-sense frequencies.
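
The noisy channel decoding step described in the abstract can be illustrated with a minimal sketch. It assumes that both Pr(T|C) and Pr(W|T) are already available as plain dictionaries; the abstract does not say how the thesis estimates them, so the function name `decode_tag`, the data layout, and the toy probabilities below are all hypothetical and are not taken from the thesis.

```python
def decode_tag(word, tag_given_context, word_given_tag):
    """Score each tag T by Pr(T|C) * Pr(W|T) and return the best tag.

    tag_given_context: dict mapping tag -> Pr(T|C) for one fixed context C
    word_given_tag:    dict mapping tag -> {word: Pr(W|T)}
    Both tables are assumed inputs; their estimation is out of scope here.
    """
    scores = {
        # A word never seen with a tag gets probability 0 under that tag.
        tag: p_tag * word_given_tag.get(tag, {}).get(word, 0.0)
        for tag, p_tag in tag_given_context.items()
    }
    best_tag = max(scores, key=scores.get)
    return best_tag, scores


# Hypothetical toy numbers for a single context C.
tag_given_context = {"NOUN": 0.7, "VERB": 0.3}
word_given_tag = {
    "NOUN": {"run": 0.01, "dog": 0.05},
    "VERB": {"run": 0.04, "eat": 0.03},
}

best_tag, scores = decode_tag("run", tag_given_context, word_given_tag)
print(best_tag)  # VERB
```

In this toy case the context favors NOUN, but the word "run" is more likely to express VERB, so the product Pr(T|C) Pr(W|T) selects VERB. The scores are left unnormalized, because only the ranking of the tags matters for picking the most probable one.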

Similar Theses

Thesis No: 349576
Turkish morphological disambiguation using multiple conditional random fields
Çoklu koşullu rassal alanlar kullanarak Türkçe biçimbilimsel belirsizlik giderme
RAZIEH EHSANI
Master's
English
2013
Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi
Department of Computer Engineering
PROF. DR. EŞREF ADALI
ASST. PROF. GÜLŞEN ERYİĞİT
Thesis No: 421061
Türkçe sözcük anlam belirsizliği giderme
Word sense disambiguation for Turkish
BAHAR İLGEN
Doctoral
Turkish
2015
Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi
Department of Computer Engineering
PROF. DR. EŞREF ADALI
ASST. PROF. DR. AHMET CÜNEYD TANTUĞ
Thesis No: 931318
Kur'ân-ı Kerim'de temennî ve teraccî üslûbu
The Qur'anic style of wishful thinking and teraccî
BETÜL CANFEZA ŞEN
Doctoral
Turkish
2025
Linguistics, İzmir Katip Çelebi Üniversitesi
Department of Basic Islamic Sciences
ASSOC. PROF. DR. İZZET MARANGOZOĞLU
Thesis No: 966228
Kıpçak grubu Türk lehçelerinde çokluk ve birliktelik bildiren ifadeler
Expressions indicating multiplicity and unity in the Kypchaq group of Turkic dialects
ALEYNA ALEVSAÇAN
Master's
Turkish
2025
Linguistics, Fırat Üniversitesi
Department of Contemporary Turkic Dialects and Literatures
PROF. DR. SÜLEYMAN KAAN YALÇIN
Thesis No: 819488
Sumerce'de ekler ve Türkçe'ye yansımaları
Affixes in Sumerian and their reflection on Turkish
OĞUZHAN ABACI
Doctoral
Turkish
2023
Ancient Languages and Cultures, Nevşehir Hacı Bektaş Veli Üniversitesi
Department of History
PROF. DR. LÜTFİ GÜRKAN GÖKÇEK
