Parallel processing of large scale genomic data

No title translation available.

  1. Thesis No: 402722
  2. Author: MÜCAHİD KUTLU
  3. Advisors: DR. GAGAN AGRAWAL
  4. Thesis Type: Doctorate
  5. Subjects: Computer Engineering and Computer Science and Control
  6. Keywords: Not specified.
  7. Year: 2015
  8. Language: English
  9. University: The Ohio State University
  10. Institute: Foreign Institute
  11. Department: Not specified.
  12. Division: Not specified.
  13. Number of Pages: 172

Abstract

No abstract available.

Abstract (Translation)

An increasing amount of genomic data is becoming available to researchers with the development of high-throughput, low-cost sequencing technologies. Analysis of such data has significant potential for new scientific and medical advances. However, as the amount of available data increases, the use of parallelism and the effective utilization of computing resources become even more critical. Thus, novel parallelization approaches and frameworks that help researchers develop parallel applications without dealing with the low-level details of parallel coding are urgently needed for new advances in genomic research. In this dissertation, we introduce parallel genomic data analysis tools and middleware systems for easily developing efficient parallel genomic applications. With the proposed frameworks and parallel algorithms, we address the following challenges: (1) How should genomic data be partitioned for parallel SNP calling and sequence quantification tools? (2) Is it possible to utilize existing genomic applications in parallel executions? (3) How can we take advantage of domain-specific knowledge to increase the performance of the applications? (4) How should data-intensive tasks be scheduled? (5) How can we implement efficient parallel genomic applications for memory-constrained many-core architectures such as Intel Xeon Phi?
First, we focus on identifying variants in large-scale genomic data in parallel. After examining possible approaches, we identify one that does not require any communication. However, achieving load balance is non-trivial because of the data-dependent nature of the processing. We develop three scheduling schemes: a dynamic scheme, which reduces scheduling overheads by using two different chunk sizes; a static scheme, which uses a pre-processing step to estimate workloads; and a combined scheme. We evaluate our schemes with various configurations and analyze their performance.
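To make the idea of a static scheme concrete, the following is a minimal sketch of workload-aware assignment. It is not the dissertation's algorithm: the abstract only states that a pre-processing step estimates workloads. The greedy longest-processing-time heuristic, the function name `static_schedule`, and the chunk names and costs are all assumptions chosen for illustration.

```python
import heapq


def static_schedule(estimated_work, num_workers):
    """Assign chunks to workers from estimated workloads (greedy LPT heuristic).

    estimated_work: dict mapping a hypothetical chunk id to an estimated cost,
    for example a read count gathered in a pre-processing pass (an assumption).
    Returns a list of (total_estimated_load, worker_id, [chunk ids]),
    ordered by worker id.
    """
    # Min-heap keyed by current load, so the least-loaded worker is popped first.
    # Worker ids are unique, so the chunk lists are never compared.
    heap = [(0, worker, []) for worker in range(num_workers)]
    heapq.heapify(heap)
    # Place the most expensive chunks first; ties are broken by chunk id.
    ordered = sorted(estimated_work.items(), key=lambda kv: (-kv[1], kv[0]))
    for chunk, cost in ordered:
        load, worker, chunks = heapq.heappop(heap)
        chunks.append(chunk)
        heapq.heappush(heap, (load + cost, worker, chunks))
    return sorted(heap, key=lambda entry: entry[1])


# Hypothetical per-chunk cost estimates.
estimates = {"chr1": 9, "chr2": 7, "chr3": 6, "chr4": 5, "chr5": 4, "chr6": 2}
for load, worker, chunks in static_schedule(estimates, 2):
    print(worker, load, chunks)
```

Because the assignment is fixed before processing starts, no scheduling messages are needed at run time, but the balance is only as good as the estimates. A dynamic scheme trades that the other way, which is presumably why a combined scheme is also evaluated.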
Second, we develop a middleware system, PAGE, which supports 'mapreduce-like' processing but differs significantly from a system like Hadoop in order to be useful and effective for parallelizing the analysis of genomic data. In particular, it can work with map functions written in any language, thus allowing existing serial tools (even those for which only an executable is available) to be used as map functions. It can therefore greatly simplify parallel application development in scenarios where complex data formats and/or nuanced serial algorithms are involved, as is often the case for genomic data. It supports multiple partitioning methods and provides different scheduling schemes and execution models to match the nature of algorithms common in genetic research. In evaluating our middleware system, we show that PAGE can parallelize various genomic applications and achieve high parallel efficiency and scalability.
Third, we focus on the data-intensive computation challenge in genomic applications and develop a novel framework, RE-PAGE, for processing genomic data stored on distributed disks. We pay attention to scheduling and load balancing, particularly in view of the data-intensive nature of the target applications. Specifically, our framework includes: (1) use of domain-specific information in the formation of data chunks (which can be of non-uniform sizes), (2) intelligent replication and placement of each chunk on a small number of nodes, and (3) scheduling schemes for achieving load balance when data movement costs outweigh processing costs and the chunks are of non-uniform sizes.
As stated above, the amount of available genomic data is increasing rapidly with recent developments in sequencing technologies. At the same time, computing technologies are also developing at an enormous speed. For instance, the trend in computing technologies is towards architectures with a large number of cores and a smaller memory size per core (e.g., Intel Xeon Phi). Innovative solutions that meet the requirements of parallel genomic data processing under the constraints of the new computational architectures are urgently needed. Thus, as a fourth contribution, we introduce a novel middleware system, GEM, for implementing shared-memory parallel genomic applications on memory-constrained many-core architectures. We propose a load-map-reduce approach and a novel scheduling scheme to decrease I/O contention and to prevent over-consumption of the limited memory. We also use domain-specific knowledge to decrease the memory requirements of the tasks. In our experiments, we show that GEM achieves high scalability on the Intel Xeon Phi architecture and outperforms other existing frameworks for genomic data processing.
In our last work, we focus on the probabilistic assignment of fragments that are ambiguously mapped to target sequences. This is a very significant but time-consuming procedure for the downstream analysis of genomic data. We develop a distributed-memory parallel version of a popular probabilistic fragment assignment tool, eXpress, which is based on the expectation-maximization algorithm. We discuss possible data distribution techniques and propose a parallelization approach that preserves the original algorithm's accuracy and does not require communication among all processes at the end of each iteration. In our experiments, we show that our approach achieves high speedup over eXpress without decreasing its accuracy.
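The expectation-maximization idea behind probabilistic fragment assignment can be illustrated with a small serial sketch. This is a simplified batch EM written for illustration only: it is not the eXpress implementation and not the parallel approach proposed in the dissertation. It ignores target lengths and bias models and assumes every fragment maps to at least one target. The function name `em_assign` and the toy input are hypothetical.

```python
def em_assign(fragments, targets, iterations=50):
    """Simplified batch EM for assigning fragments to target sequences.

    fragments: list of lists; each inner list names the targets a fragment
    maps to (assumed non-empty).
    Returns (abundances, responsibilities), where responsibilities[i][t] is
    the probability that fragment i originates from target t.
    """
    # Start from uniform abundance estimates.
    theta = {t: 1.0 / len(targets) for t in targets}
    resp = []
    for _ in range(iterations):
        # E-step: split each fragment among its candidate targets in
        # proportion to the current abundance estimates.
        resp = []
        for candidates in fragments:
            total = sum(theta[t] for t in candidates)
            resp.append({t: theta[t] / total for t in candidates})
        # M-step: the new abundance of a target is its expected share
        # of all fragments.
        counts = {t: 0.0 for t in targets}
        for shares in resp:
            for t, p in shares.items():
                counts[t] += p
        theta = {t: counts[t] / len(fragments) for t in targets}
    return theta, resp


# Toy input: two fragments map uniquely to A, one to B, one to both.
abundances, responsibilities = em_assign(
    [["A"], ["A"], ["A", "B"], ["B"]], ["A", "B"]
)
print(abundances)
print(responsibilities[2])
```

In a data-parallel version of this batch scheme, the per-target `counts` computed in the M-step are the values that processes holding different fragments would have to combine after every iteration, which is the kind of global communication the approach described above is designed to avoid.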

Similar Theses

  1. Hadoop tabanlı büyük ölçekli görüntü işleme altyapısı

    Hadoop based large scale image processing infrastructure

    İLGİNÇ DEMİR

    Master's

    Turkish

    2012

    Computer Engineering and Computer Science and Control

    Kocaeli University

    Department of Computer Engineering

    ASST. PROF. DR. AHMET SAYAR

  2. Genel amaçlı bir yapay sinir ağının karma bir donanımla gerçeklenmesi

    Mixed mode hardware design of a general purposed artificial neural network

    BURCU ERKMEN

    Doctorate

    Turkish

    2007

    Electrical and Electronics Engineering

    Yıldız Technical University

    Department of Electronics and Communications Engineering

    PROF. DR. TÜLAY YILDIRIM

  3. The solution of large-scale electromagnetic problems with MLFMA on single-GPU systems

    Büyük ölçekli elektromanyetik problemlerin ÇSHÇY ile tekli-GİB sistemlerinde çözümü

    MEHMET FATİH ERKAL

    Master's

    English

    2022

    Electrical and Electronics Engineering

    İhsan Doğramacı Bilkent University

    Department of Electrical and Electronics Engineering

    PROF. DR. VAKUR BEHÇET ERTÜRK

    DR. BARIŞCAN KARAOSMANOĞLU

  4. Geniş ölçekli veriler üzerinde sınıflandırma ve bölütleme amaçlı evrişimsel sinir ağı ve istatistiksel modellerin geliştirilmesi

    Development of convolutional neural network and statistical models for classification and segmentation on large-scale data

    NURULLAH ÇALIK

    Doctorate

    Turkish

    2019

    Computer Engineering and Computer Science and Control

    Yıldız Technical University

    Department of Electronics and Communications Engineering

    PROF. DR. LÜTFİYE DURAK ATA

  5. Platform development for parallel operation of single board computers

    Tek kart bilgisayarlarla paralel işlem yapabilmesi için platform geliştirilmesi

    KÜBRA KARADAĞ

    Master's

    English

    2017

    Mechatronics Engineering

    Dokuz Eylül University

    Department of Mechatronics Engineering

    ASST. PROF. DR. ÖZGÜR TAMER