Laboratory for Large-Scale Biomedical Data Technology

Research Topics

Our team develops core technologies for handling large-scale biomedical data, building on data engineering techniques. To this end, we are currently focusing on the following studies and technology developments.

(1) Development of SCPortalen Database, which aims to provide reprocessed single-cell RNA-seq data for accessible downstream use and reuse for researchers globally

Owing to advances in single-cell expression profiling technologies, many research groups worldwide have published single-cell transcriptome datasets. However, these datasets often lack important supporting information (metadata) and are generated with different protocols and processing analyses, which makes them difficult to compare and reuse for downstream analysis. To solve this issue, we are developing a new public database (SCPortalen: Single-cell centric database). In this database, we collect published single-cell datasets, curate their metadata, and reprocess them using a unified computational pipeline. In this way, we normalize data from varied protocols and provide a structured database that can be easily queried for downstream use. Currently, we are working to improve the database by adding new analyses and quality control (QC) methods and by enhancing data exploration.

(2) Construction of Reference Datasets of Transcription Start Sites

We focus on transcription start sites (TSSs) as reference points for efficiently integrating various types of annotation data on transcriptional regulation and for comparing them with experimental data. However, few TSS datasets are available that can be used as a reference. We are therefore constructing a dataset of reference transcription start sites (refTSS), which we have published on our website. Currently, we are working to expand the dataset to include more TSSs and annotations.

(3) Data Coordination Center for Large-Scale Data Production Projects

In large-scale data production projects, such as the FANTOM project, the data coordination center plays an important role in evaluating data quality and in providing access to the datasets and their metadata. Our laboratory has established and operated the data coordination centers for the FANTOM6 project, the RIKEN single-cell project, and others. We keep records of our data coordination activities, and collect and develop technologies and know-how for future activities.

(4) Transcriptome Analysis of Aged and Disease Targets

We apply the data analysis technologies we have developed to elucidate transcriptional regulation systems and to develop diagnostic tools for diseases. Specifically, we are studying transcriptional regulation using transcriptome data from aged cardiovascular disease patients together with their medical records. We are also studying mycetoma, which is on the WHO list of Neglected Tropical Diseases (NTDs).



  1. Software and tools
  2. Biological databases