Çizge ve içerik verilerinde kolektif sınıflandırma algoritmalarının karşılaştırılması

A comparison of collective classification techniques on network and content data

PDF İndir

Tez No: 486558
Yazar: ÖZGE ATASEVEN
Danışmanlar: YRD. DOÇ. DR. YUSUF YASLAN
Tez Türü: Yüksek Lisans
Konular: Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol, Computer Engineering and Computer Science and Control
Anahtar Kelimeler: Belirtilmemiş.
Yıl: 2017
Dil: Türkçe
Üniversite: İstanbul Teknik Üniversitesi
Enstitü: Fen Bilimleri Enstitüsü
Ana Bilim Dalı: Bilgisayar Bilimleri ve Mühendisliği Ana Bilim Dalı
Bilim Dalı: Bilgisayar Mühendisliği Bilim Dalı
Sayfa Sayısı: 71

Özet

Nesnelerin kategorilerine, bağlantılarına, içeriklerine göre sınıflandırılmasına finans, sosyal bilimler, iletişim, bilgisayarda görü ve biyoloji gibi birçok araştırma alanında ihtiyaç duyulur. Özellikle son yıllarda bu çalışma alanlarında kullanılan ağ verilerinin içerikleri ve birbirleriyle olan ilişkilerini çözmeye çalışmak günden güne önem kazanmaktadır. Sınıflandırma problemlerinde, nesnelerin sadece öznitelikleri kullanılarak yapılan sınıflandırma işlemleri, metin sınıflandırma gibi uygulama alanlarında İçerik-Tabanlı sınıflandırma olarak adlandırılmaktadır. Geleneksel sınıflandırma algoritmaları yalnızca bağımsız içerik özelliklerine bakarak sınıflandırma işlemini gerçekleştirirken Kolektif Sınıflandırma algoritmaları hem bağımsız nesneleri (içerik verileri) hem de bağımlı nesneleri (çizge verilerini) kullanarak sınıflandırma işlemini gerçekleştirir. Bu algoritmalar doküman, görüntü, ilişki tipi sınıflandırma ya da farklı özelliklerdeki düğümlerin tespiti gibi işlemlerde sınıflandırıcının performasını arttırmak için nesneler arasındaki çizge verilerini kullanırlar. Kolektif sınıflandırma üzerine yapılan çalışmalar, ağ verilerinin bilgilerini de sınıflandırıcılara dahil ettiği için yalnızca içerik verileri kullanan sınıflandırma algoritmalarına göre daha yüksek doğruluk sonuçları elde eder. Ağ verisinin yapısı çizge şeklindedir ve içeriğindeki her bir satır iki düğüm arasındaki ilişkileri temsil eder. Bu ilişkiler dökümanlar arası atıf, kişiler arası akrabalık, görüntüler arası benzerlik olabilir. Bu ilişki bilgilerinin sınıflandırıcılara dahil edilmesi nesneler arası benzerliklerin daha ön plana çıkmasını sağlar. Sınıflandırma işlemini çizge verileri arasında gerçekleştiren algoritmalardan biri Iterative Classification Algorithm (ICA)' dir. ICA algoritması sınıflandırmayı gerçekleştirirken etiketi belli olmayan düğümün komşularının etiket bilgilerini kullanır. Bu yüzden bu algoritma yerel (lokal) sınıflandırma algoritması olarak adlandırılır. ICA algoritması etiketsiz bir düğümün etiketini belirlerken incelediği komşu düğümlerin de etiketi belli değilse özyinelemeli bir şekilde etiketsiz komşulara geçici değerler atar. İçerik verisi ile ağ verisini birlikte kullanabilmek için yerel sınıflandırıcı olan Naïve Bayes ve K-En Yakın Komşuluk (KNN) algoritmaları ICA algoritması ile birleştirilerek Naïve Bayes-ICA, Naïve Bayes-KNN şeklinde kullanılır. Her iki algoritmada ele alınan etiketsiz düğümler için Naïve Bayes ve KNN içerik verisini sınıflandırırken, ICA algoritması ağ verisini sınıflandırır. Yerel sınıflandırıcılara alternatif olarak Loopy Belief Propagation (LBP) ve Mean Field Relaxation Labeling (MF) gibi global sınıflandırıcılar da bulunmaktadır. Bu algoritmaların global olarak adlandırılma sebebi, ağ verilerinde sınıflandırma yaparken global bir amaç fonksiyonu (global objective function) kullanması ve Markov rasgele alanlar ile ağın belli parçalarından sınıflandırılacak düğümleri elde etmesidir. Özetle bu çalışmada, yerel ve global sınıflandırıcılar içerik ve bağlantı tabanlı veri kümelerine uygulanmış, sınıflandırıcıların doğruluk değerleri karşılaştırılmış ve bu sınıflandırıcıların birleştirilmesiyle daha yüksek başarım elde edilebilmesi amaçlanmıştır.

Özet (Çeviri)

The classification of objects according to categories, links or topics on network data is an important research area such as finance, social science, communication, computer vision and biology. Solving problems within network data by using each node's relation with the others, where for each node its features and relations with other nodes are available, become more common and important research topic in the last decade. If only attributes in the objects are used for classification, the classification is called Content-Only classification. Collective classification algorithms use both independent objects (content dataset) and interconnect objects (graph dataset). These algorithms use link structure among the instances in order to improve the classification performance. This iterative algorithm which uses network dataset is called Iterative Classification Algorithm (ICA). The format of the network structure is graph data type and it includes the cites relationship between objects. Moreover, content dataset includes word and label information for document classification tasks. To use content data with this graph data we have combined the local classifiers as Naive Bayes-ICA and K-Nearest Neigbors (KNN)-ICA for“Collective Classification”. These two algorithms provide local classification because of using neighborhood of unknown labeled object that needs to be classified. ICA is one of the most successful local classification algorithms that uses local classifiers. It also has different variations such as Iterative Collective Classification (ICC) which is a generalization of ICA that use case-based classifier and Bayesian classifier. Collective classification techniques aim to improve the classification performance of linked data by utilizing unknown nodes in the network that are classified by using known nodes and network structure. They can be grouped into local and global classification algorithms and use both link and content data for classifying the objects which are shown as nodes in the graph. The class labels can be both estimated by using content only Naïve Bayes classifier and collective classifier as well. K-Nearest Neighbors (KNN) is another classifier that is used to label an unlabeled object. The KNN algorithm measures the closeness of the objects by looking at the words and citations of known and unknown labels in the network. Then, it looks at the neighbors in a particular proximity of the objects and assign the value of the label with the highest number of labels to the object. The similarities between unlabeled node and its neighbor labeled nodes are computed using the content information and weighted using the degree of the labeled nodes. Thus a node which is a higher degree neighbor of an unlabeled node will have small similarity value. As an alternative to local classifiers, there are also global classifiers that use Markov Random Fields (MRF) as Loopy Belief Propagation (LBP) and Mean Field Relaxation Labeling (MF) algorithms. We focus on both single-label and multi-label dataset classification by using these global algorithms and compare the approximation results for these two type of datasets. Global classification algorithms use pairwise MRF to assign the class label for an unlabeled node by defining a global objective function. The algorithm uses clique potential of nodes which are found through marginal probability distribution of observed nodes. However, it is not an easy task to achieve numerically. In order to calculate one marginal probability, it is required to sum an exponentially large number of variables. Therefore, two approximate inference algorithms LF and MF are used to compute marginal probabilities. Both of the algorithms embrace the idea of avoiding the complexity in calculating marginal probability distribution. They use trial distribution rather than directly using pairwise MRF probability distribution. We begin with making trial distribution suitable for MRF distribution. Afterwards, it is easy to get marginal probabilities from trial distribution. Each of the random variables in MRF sends a message to unlabeled node which is a neighbor of unlabeled node. This message consists of sum of the neighbor clique potentials and includes the probability of the label information. The LBP algorithm is applied on pairwise MRF. Another approximate inference algorithm that can be applied to pairwise MRF is MF. As in the LBP algorithm MF algorithm computes marginal probabilities for each messages and it repeats the process until all marginal probabilities are stabilized. When the process is completed, the calculated marginal probabilities are given as a belief result. We also rearrange the MF for adapting the multi-label classification named as Multi-Label Mean Field Relaxation Labeling (ML-MF). For each class label we do an MF and considered all class labels together during performance evaluation. Global classification algorithms aim to optimize a global objective function. As a global classification algorithm LBP is successful in different areas such as fraud detection on online retailers or classification in social network. However, it does not guarantee convergence due to increased cycles in the heterogeneous networks. In this research, we use two different types of datasets namely single-labeled and multi-labeled datasets. CORA and Citeseer are the single-labeled document classification datasets where document topics are assigned as a label for each document. CORA is a bibliographic dataset with 2708 documents, 1433 words, 5429 links and 7 labels which are Rule Learning, Neural Network, Theory, Reinforcement Learning, Probabilistic Method, Genetic Algorithm. CiteSeer is also one of the most commonly analyzed bibliographic dataset with 3312 documents, 4732 links and 5 labels which are Agents, IR, DB, AI, HCI, ML. Whereas as a multi-labeled dataset Terrorist-Rel has 4 different relation information (family, accomplice, contact, congregate) for each person which are treated as class labels. Multi-label classification resembles multi-class classification, except classes are not mutually exclusive. In multi-label dataset an instance can be assigned simultaneously into multiple classes. This dataset has the relationship between terrorists and represents each event of a terrorist as a feature. These events are used to obtain the relation between two terrorist nodes such as family or contact relationship. In our experiment two terrorist node could belong to same organization and same family. The terrorists are connected by a total number of 851 relation nodes with event information and it has 8592 links between each other. There are 4 sub-datasets that represents the relationship. If two people are in the same family (e.g. father-son, husband-wife) they have family relation. If two people are members of the same terrorist organization, they have accomplice relation. If two people have contacted each other (e.g email, phone), they have contact relation. If two people use the same facility (e.g. go to the same training camp), they have congregate relation. All of datasets (CORA, Citeseer, Terrorist-Rel) have graph dataset and these relational datasets have binary class labels. Thus, the relation between two terrorists is represented as a node. If the label of this node is family, contact, non-accomplice, noncongregate this means that these two terrorists are in the same family and contact each other but they have never been in the same terrorist organization and never used the same facility. In the experimental results, the test datasets are obtained using the snowball sampling (SS) method by selecting initial node and expanding the other nodes iteratively. In this study, single-labeled linked data classification problem has been evaluated using ICA-KNN, ICA-Naïve Bayes, LBP, MF algorithms on bibliographic dataset. Then we have extended our experiments on terrorism relation multi-labeled linked dataset by using Multi-Label Loopy Belief Propagation (ML-LBP), Multi-Label Mean Field Relaxation Labeling (ML-MF) global classification algorithms. To sum up, in this thesis we consider both single and multi-labeled linked data classification problem using local and global classification algorithms. Initially, single-labeled linked data classification problem is evaluated using ICA-KNN, ICA-Naïve Bayes, LBP and MF algorithms on bibliographic datasets. Then we extend our experiments on terrorism relation multilabeled linked dataset by using ML-LBP, ML-MF global classification algorithms. The experimental results show that for single-labeled linked data the best classification accuracy is obtained by MF global classification algorithm. For multi-labeled data both ML-LBP and ML-MF algorithms perform similarly.

Benzer Tezler

Tez No
142597
Yeni Cami'nin akustik açıdan performans değerlendirmesi
Evaluation of the acoustical performance of the New Mosque
EVREN YILDIRIM
Yüksek Lisans
Türkçe
2003
Mimarlık İstanbul Teknik Üniversitesi
Mimarlık Ana Bilim Dalı
PROF. DR. SEVTAP YILMAZ DEMİRKALE
Tez No
286511
Matematik problemlerinin çözümünde öğretmen adaylarının kullandıkları stratejilerin ve gösterim şekillerinin analizi
Analyzing pre-service teachers? strategies and representations regarding their solutions to mathematical problems
FATMA CEMRE PEHLİVAN
Yüksek Lisans
Türkçe
2011
Eğitim ve Öğretim Dokuz Eylül Üniversitesi
Ortaöğretim Fen ve Matematik Alanları Eğitimi Ana Bilim Dalı
YRD. DOÇ. DR. ESRA BUKOVA GÜZEL
Tez No
596805
Okul öncesi dönem çocuklarının en çok izledikleri çizgi filmlerde toplumsal cinsiyet kalıp yargılarının incelenmesi
Examination the gender phenomenon in the most watched cartoons by preschool children
BÜŞRA GÜL KALEM
Yüksek Lisans
Türkçe
2019
Eğitim ve Öğretim Gazi Üniversitesi
Okul Öncesi Eğitimi Ana Bilim Dalı
DR. ÖĞR. ÜYESİ GÜLHAN GÜVEN
Tez No
385111
Fusion of multimodal information for multimedia information retrieval
Çoğulortam bilgi erişimi için çok kipli bilginin birleştirilmesi
TURGAY YILMAZ
Doktora
İngilizce
2014
Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol Orta Doğu Teknik Üniversitesi
Bilgisayar Mühendisliği Bölümü
PROF. DR. ADNAN YAZICI
Tez No
268047
Structural scene analysis of remotely sensed images using graph mining
Uydu görüntülerinin çizge madenciliği ile yapısal sahne analizi
BAHADIR ÖZDEMİR
Yüksek Lisans
İngilizce
2010
Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol İhsan Doğramacı Bilkent Üniversitesi
Bilgisayar Mühendisliği Ana Bilim Dalı
YRD. DOÇ. DR. SELİM AKSOY

Geri Dön