
Türkçe yazım denetleyen editör

Turkish spelling checker editor

  1. Thesis No: 22031
  2. Author: K. MESUT YARIMBIYIKLI
  3. Advisor: Assoc. Prof. Dr. TAKUHİ NADİA ERDOĞAN
  4. Thesis Type: Master's Thesis
  5. Subjects: Computer Engineering and Computer Science and Control
  6. Keywords: Not specified.
  7. Year: 1992
  8. Language: Turkish
  9. University: İstanbul Teknik Üniversitesi
  10. Institute: Fen Bilimleri Enstitüsü (Institute of Science and Technology)
  11. Department: Not specified.
  12. Discipline: Not specified.
  13. Page Count: 92

Abstract

ABSTRACT As the personal computer world develops, the services offered to users are also diversifying. Among these services, word processors are the most frequently used. The subject of this thesis is the design and implementation of a word processor program that eases the work of people writing Turkish text. The purpose of this application program, called the Spell Checking Editor, is to check the spelling of Turkish words interactively, as they are typed. The spell checking algorithm used within the editor verifies the correctness of a word's spelling by scanning the root and suffix dictionaries. Instead of storing in the dictionary every word derived from the same root with derivational and inflectional suffixes, a much smaller root dictionary is obtained by stripping off the easily recognized suffixes. Since the words being looked up do not appear directly in the dictionary, the spell checking algorithm is rather complex; in return, it shrinks the dictionary substantially. Fast lookup in the root dictionary is an expected service, and binary search was chosen as the root search algorithm. The roots in the dictionary are grouped by length and stored in compressed form. While the editor runs, the entire dictionary is loaded into main memory. The Spell Checking Editor is a WYSIWYG editor. The editor keeps the text in a linked-list structure built in main memory. It can perform the basic services expected of a word processor. The problems encountered in the practical work carried out for the thesis, the solutions found, and the results obtained are presented together with the related topics. The implemented software is provided as an appendix.

Abstract (Translation)

TURKISH SPELLING CHECKER EDITOR SUMMARY

By the seventies, it was well known that the computer age had begun. In the eighties, computers came into common use; IBM PCs and compatibles created new opportunities for small companies and for personal use of computers. Now, in the nineties, everybody knows that PCs can offer an easier world. Document preparation is one of the expanding uses of PCs. Word processors of all kinds offer numerous functions for entering and formatting documents according to the users' requirements and preferences. However, it has long been noted that the use of computers in this application area need not be limited to just formatting, but can extend to helping the user improve the quality of the document. A number of tools have been developed for analyzing text and suggesting changes that improve the readability of documents. The simpler task of checking the spelling of the words in a text - usually a boring and error-prone job - is ideally suited for computers, as it is a repetitive task that requires fast reading and a good memory. The reasons for attacking the problem of spelling error detection for Turkish are manifold: more and more documents in Turkish business and government work are being prepared using computers and word processors, and it is clear that such usage will increase significantly in the years to come. The motivation of this study is to help develop tools that can make the creation of high-quality documents in Turkish easier. Also, Turkish is a language that differs significantly from the languages of the Indo-European group in the way words are formed. The spelling checking techniques developed for those languages are not readily applicable to Turkish. Hence, understanding and solving the problem of spelling error detection for Turkish is itself an interesting challenge. Before starting to discuss spelling checkers, we shall take a look at editors.
The earliest text editors for microcomputers were line editors - editors that allow the user to display and edit only one line of text at a time. The next generation of text editors is WYSIWYG editors. The acronym WYSIWYG stands for the phrase "What You See Is What You Get" - a characteristic of most modern editors. When using a WYSIWYG editor, the user sees what the finished text will look like as it is manipulated. Text is entered by simply typing it in, and at all times the context surrounding the place where the editing is being performed is visible. A RAM-based editor is an editor that loads an entire disk file into memory at once and keeps it there while it is being edited. Swapping or virtual editors use the system memory to store only the amount of text that can be edited at any one time. One of the most important characteristics of the design of a text editor is the data structure it uses to store textual data in memory while it is being edited. There are a number of possible techniques, and each affects the performance of specific editing tasks. The simplest and most obvious way to arrange the text is simply as an array of strings, each with a given maximum length. This method is extremely fast and simple, but makes very poor use of space: each line takes up the full maximum length whether it is empty or full. The next technique also allocates a fixed chunk of memory to store the text in, but does not waste as much space: the text is simply read into one large block of memory. This structure displays its greatest weakness when text is inserted into the buffer. In the linked-list approach, the text consists of a linked list of nodes, each of which contains pointers to: 1) the previous line, 2) the following line, and 3) a string containing the line of text. This approach is more complicated to program than the previous two, but provides higher performance.
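The linked-list approach described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; all names are invented. Note how inserting a line only touches the neighbouring nodes, regardless of document size.

```python
class LineNode:
    """One line of text in the editor buffer."""
    def __init__(self, text):
        self.prev = None   # pointer to the previous line
        self.next = None   # pointer to the following line
        self.text = text   # string containing the line of text

class TextBuffer:
    """The text as a doubly linked list of line nodes."""
    def __init__(self):
        self.head = None

    def insert_after(self, node, text):
        """Insert a new line after `node` (or at the top if node is None).
        Only the neighbouring pointers change, so insertion cost does not
        grow with the size of the document."""
        new = LineNode(text)
        if node is None:                     # insert at the top of the buffer
            new.next = self.head
            if self.head:
                self.head.prev = new
            self.head = new
        else:
            new.prev, new.next = node, node.next
            if node.next:
                node.next.prev = new
            node.next = new
        return new

    def lines(self):
        """Walk the list from the head and collect the text of each line."""
        node, out = self.head, []
        while node:
            out.append(node.text)
            node = node.next
        return out
```

Compare this with the array-of-strings layout, where inserting a line forces every following line to be moved.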
Spelling programs can be classified into two groups: spelling checkers identify misspelled words in an input text file, while spelling correctors suggest a list of the most likely correct words after detecting a misspelled word. Obviously, a spelling corrector is significantly more complicated than a spelling checker. Spelling checkers can themselves be loosely classified into two groups: in batch programs, the input words are sorted and any duplicates are eliminated, so one pass through the list and the dictionary is enough to check all input tokens. This contrasts with interactive programs, which check each word as it is encountered in the input file. All spelling checkers use an external list of correctly spelled words in a data structure that serves the function of a dictionary. The structure of the dictionary is of great importance: a simple data structure will ease development and maintenance, but performance may be crucial, especially in the interactive versions. The structure must allow for very fast searches, and for performance reasons it is desirable to keep the dictionary in main memory. Compact representation of the dictionary is also an important issue. By removing affixes and storing only the root words, the dictionary size can be reduced significantly. However, not every affix-root combination is valid, and a misspelled word which forms an invalid combination may go undetected. A solution to this problem is to flag each word in the dictionary with its legal affixes. Then, after the root and the affixes are found, the flags associated with each root can be examined to see whether the particular affix is legal for this root. Although such solutions are applicable in languages like English, where the number of affixes is rather limited, they are not readily applicable in the case of Turkish, where the number of possible affixes is upwards of 300. Using such a technique for Turkish can increase the dictionary size significantly.
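The flag technique can be sketched as follows, a minimal illustration only: the tiny dictionary, the English examples and all names are invented, and real checkers encode the flags far more compactly. Each root is stored once, together with the set of affixes it may legally take.

```python
# Each root maps to the set of affixes that are legal for it (its "flags").
DICTIONARY = {
    "walk": {"s", "ed", "ing"},   # walk, walks, walked, walking
    "go":   {"es", "ing"},        # go, goes, going
}

def check_word(word):
    """Accept `word` if it is a bare root, or a root followed by one affix
    that the root's flags declare legal."""
    if word in DICTIONARY:
        return True
    for root, legal_affixes in DICTIONARY.items():
        if word.startswith(root) and word[len(root):] in legal_affixes:
            return True
    return False
```

The point of the flags is visible in a word like "goed": the root "go" and the affix "ed" both exist, but their combination is rejected because "ed" is not among the flags of "go".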
In order to reduce the flag size for Turkish, classifying the roots and affixes into groups can be useful. The use of data compression in spell checker dictionaries is a good way to save memory. Frequency-dependent coding is one of the compression techniques: in virtually all text, some symbols occur more often than others. This observation suggests an encoding scheme in which common symbols are assigned short codes and rare symbols are assigned long codes. An algorithm due to Huffman can be used to produce an approximation to the best result. Besides the dictionary structures, searching algorithms are quite important parts of spelling checkers. The sequential search is easy to code; as its name suggests, it checks every word in the file sequentially. On average, the test is made on n/2 elements; in the best case it will test only one element, and in the worst case n elements. If the information is stored on disk, the search time can be very long, but if the data is unsorted, this is the only method available. If the data to be searched is in sorted order, then a superior method, called the binary search, can be used to find a match. The method uses the divide-and-conquer approach: it first tests the middle element; if that element is larger than the key, it then tests the middle element of the first half; otherwise, it tests the middle element of the second half. This process is repeated until either a match is found or there are no more elements to test. In binary search, the number of comparisons is given by lg(n). A special type of searching, called hashing, is also used in spelling algorithms. Hashing requires constant time per operation on average; in the worst case, this method requires time proportional to the size of the set. By careful design, however, we can make the probability of hashing requiring more than a constant time per operation arbitrarily small.
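The binary search described above can be written in a few lines. This is a generic sketch over a sorted list of root words, not the thesis's compressed-dictionary version; the sample roots are illustrative.

```python
def binary_search(sorted_roots, key):
    """Divide-and-conquer lookup: test the middle element and discard the
    half that cannot contain the key. Returns True if `key` is present,
    using about lg(n) comparisons."""
    lo, hi = 0, len(sorted_roots) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_roots[mid] == key:
            return True
        if sorted_roots[mid] > key:   # key can only be in the first half
            hi = mid - 1
        else:                         # key can only be in the second half
            lo = mid + 1
    return False
```

On a dictionary of, say, 30,000 roots, this needs at most about 15 comparisons, against 15,000 on average for sequential search.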
In static hashing we shall consider two different forms of hashing. One, called open or external hashing, allows the set to be stored in a potentially unlimited space, and therefore places no limit on the size of the set. The second, called closed or internal hashing, uses a fixed space for storage and thus limits the size of the set. The disadvantage of static hashing is that it requires a fixed set of buckets, while most databases grow larger over time. Dynamic hash functions allow the database to be modified dynamically; this is called dynamic hashing. The Bloom filtering method reduces the amount of space required to hold the hash-coded information when compared with conventional methods. Bloom filters provide a probabilistic way to test set membership: if a valid word is checked for membership, it will always be accepted. A Bloom filter consists of a large bit array and a collection of independent hash transforms into the range of the bit array size. Representing the dictionary as a bit array reduces the space required by dictionaries; the reduction in space is accomplished by accepting the possibility of a small fraction of errors. Arzu Bayramoğlu has developed a spelling algorithm called Turkish Word Analysis as a master's thesis. That thesis presents a system for analysing Turkish words without entering morphological relations. The word analysing process investigates whether the written word is valid according to Turkish word structure or not. In order to investigate this validity, one has to either search for the word in a list of all possible Turkish words, or try to generate it from a list of roots and suffixes and a set of word-generating rules to find out whether the analysed word can be produced. It is not possible to store and search all possible words in an agglutinative language such as Turkish. All morphemes occurring at the beginning of words, such as roots and stems, make up the root morpheme dictionary.
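A minimal sketch of the Bloom filter idea described above: a bit array plus several independent hash transforms. Real implementations derive the array size and hash count from the expected dictionary size and acceptable error rate; the numbers and the salted-SHA-256 hashing here are illustrative assumptions, not from the thesis.

```python
import hashlib

BITS = 1024            # size of the bit array
NUM_HASHES = 3         # number of independent hash transforms

def _positions(word):
    """Map `word` to NUM_HASHES positions in the bit array
    (here by salting and hashing the word)."""
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}:{word}".encode()).hexdigest()
        yield int(digest, 16) % BITS

def add(bit_array, word):
    """Insert a dictionary word by setting its bits."""
    for pos in _positions(word):
        bit_array[pos] = 1

def maybe_contains(bit_array, word):
    """False means the word is definitely absent; True means it is present,
    up to a small probability of a false positive."""
    return all(bit_array[pos] for pos in _positions(word))
```

A valid word is always accepted, as the summary notes; the price is that a misspelling occasionally hits only set bits and slips through, which is the "small fraction of errors" traded for the compact bit array.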
Loan words, emphatic forms, some female and male personal names, family names, national names, city names, substantives, pronouns, adjectives, verbs, adverbs, postpositions, conjunctions and interjections are all placed in the root dictionary. Because morphological relations are not entered, the system deals only with conjugational suffixes, but some derivational suffixes are also in the suffix dictionary: it would not be logical to put all the words formed with these suffixes into the root dictionary. Almost all Turkish suffixes are subject to the vowel and consonant harmony rules. This means that a Turkish morpheme may often have 2, 4, 8, 16 or even 24 allomorphs. The analysing process starts with hyphenation. If the word fails in this subprocess, the entire process stops. In case of success, the word enters the analysis subprocess, which requires more time than the hyphenation. The word analysis subprocess involves four main steps. The first step is root recognition with a dictionary look-up, to determine where the root morpheme ends and the suffix morphemes begin. The second step is suffix recognition, and the third is testing the root and the suffixes for structural validity. Turkish roots can be classified into two main classes, substantival and verbal: the verbal class comprises the verbs, while the substantival class comprises nouns and adjectives, and the suffixes that each of these groups can receive are different. The fourth and last step is testing the root and suffixes against the Turkish harmony rules. An important part of the system is its database, which contains the Turkish roots, the suffixes and their properties. The database is based on indexing, with index files organized in binary tree order. The data files consist of data records; each root record contains the root itself, the type of the root, and a flag which shows whether it is subject to vowel harmony.
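The vowel harmony test in the fourth step can be sketched as follows. This is a deliberately simplified illustration of major (front/back) vowel harmony only; the thesis's rules also cover consonant harmony and the exceptions flagged per root (e.g. loan words), which are ignored here.

```python
# Turkish vowels split into a front and a back class; under major vowel
# harmony, a harmonic word draws all its vowels from one class.
FRONT = set("eiöü")
BACK = set("aıou")

def obeys_vowel_harmony(word):
    """True if every vowel of `word` is front, or every vowel is back."""
    vowels = [ch for ch in word.lower() if ch in FRONT or ch in BACK]
    return all(v in FRONT for v in vowels) or all(v in BACK for v in vowels)
```

This is also why a suffix has several allomorphs: the plural suffix, for instance, surfaces as "-ler" after front-vowel stems ("evler") and "-lar" after back-vowel stems ("okullar").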
Each suffix record contains the suffix, the types of roots that can receive it, and the root type it converts the root into. The aim of this thesis is to rearrange the Turkish word analysis algorithm so that it can run "on-line" in a word processing environment. With efficient data structures and searching algorithms, both the dictionary size and the searching time are reduced to a minimum. After an analysis of dictionary structures, binary search is merged with a compression technique and a new dictionary algorithm is developed. The index files are abandoned; the roots are compressed and gathered into four groups, which form the new dictionary files. A word processing system is implemented and interfaced to the Turkish word analysis program. The word processor is built on a linked list made up of nodes, each representing one line of the text. The node structure consists of two pointers to the adjacent lines, a pointer to the line string, a length variable and a status flag. The status flag keeps the status of the line within its paragraph and records whether a hyphenation occurs in the line. On-line and off-line spell checking, hyphenation, and root-dictionary expansion commands are specially developed for the word processor. Besides the spell checking utilities, the word processor can perform the ordinary word processing commands, such as file operations and block operations. While developing the analysis program, some unsatisfactory points were noticed in the spelling algorithm: a better root representation and an extended group of affixes could strengthen it, but increasing the number of affixes requires a detailed morphological analysis. The on-line spell checking of the Turkish Spelling Checker Editor performed successfully; the response time for checking a word and the memory usage of the editor were good enough for personal computers.

Similar Theses

  1. Türkiye'de resmi ilanların Basın İlan Kurumu ya da Valilikler aracılığıyla dağıtımının gazeteler üzerindeki etkisi (Kastamonu-Çankırı örneği)

    The effect of distribution of official announcements through Press Advertising Agency (PAI) or Governorates, in Turkey on newspapers (A case study of Kastamonu and Çankırı)

    ÖZGÜR ALANTOR

    Master's Thesis

    Turkish

    2021

    Journalism, Kastamonu Üniversitesi

    Department of Journalism

    Assoc. Prof. Dr. ERSOY SOYDAN

  2. Basında okurluk araştırmaları -Türkiye Avrupa karşılaştırması-

    Printed media research -Comparison of Turkey and Europe-

    COŞKUN HALICI

    Master's Thesis

    Turkish

    1998

    Journalism, Anadolu Üniversitesi

    Prof. Dr. YILMAZ BÜYÜKERŞEN

  3. Basının gündem belirleme işlevi üzerine ampirik bir çalışma; yazılı basının 2009 yerel seçimlerine bakışının değerlendirilmesi

    An empirical study of the press's agenda-setting function; an evaluation of the print media's view of the 2009 local elections

    KORAY KOPAN

    Master's Thesis

    Turkish

    2010

    Journalism, Gazi Üniversitesi

    Department of Journalism

    Prof. Dr. NAZİFE GÜNGÖR

  4. Avrupa Topluluğu (Birliği) Hukuku'nda hizmet ihaleleri

    Public procurement of services in the European Community (Union) Law

    LALE BURCU ÖNÜT

    Master's Thesis

    Turkish

    2006

    Law, Dokuz Eylül Üniversitesi

    Department of Public Law

    Assoc. Prof. Dr. MELTEM KUTLU GÜRSEL

  5. Bulanık mantık ve yapay sinir ağları ile Türkçe yazım denetleyicisi

    Turkish spell checker and correction with fuzzy logic and artificial neural networks

    SİMLA DİLSİZ

    Master's Thesis

    Turkish

    2005

    Computer Engineering and Computer Science and Control, İstanbul Teknik Üniversitesi

    Department of Computer Engineering

    Prof. Dr. EŞREF ADALI