Laboratory for Large-Scale Biomedical Data Technology

Research Topics

Our team develops core technologies, grounded in data engineering, for handling large-scale biomedical data. To this end, we are currently focusing on the following research and technology development.

(1) Development of a Database to provide reprocessed single-cell RNA-seq data for accessible downstream use and reuse by researchers worldwide

Thanks to improvements in single-cell expression profiling technologies, many research groups worldwide have published single-cell transcriptome datasets. However, these datasets often lack important supporting information (metadata) and are generated with different protocols and processing pipelines, which makes them difficult to compare and reuse in downstream analysis. To address this issue, we have created a QC tool (1-1), evaluated analysis pipelines (1-2), and developed a new public database (1-3).

(1-1) Development of SkewC: a quality assessment tool applicable to single-cell RNA-seq data

We have created the SkewC QC tool. Its methodology assesses the gene body coverage of each cell and uses the skewness of that coverage as a quality measure. SkewC can process any type of scRNA-seq dataset, regardless of the protocol. The tool is designed to prevent misclustering and false clusters by identifying, isolating, and removing cells with skewed gene body coverage profiles.
https://doi.org/10.1016/j.isci.2022.103777
https://doi.org/10.1016/j.xpro.2022.102038
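
As a rough illustration of using skewness as a per-cell quality measure, the sketch below treats a cell's gene body coverage profile (coverage in bins running from the 5' end to the 3' end) as a distribution over position and computes its standardized third moment. This is not the SkewC implementation: the function names, the binning, and the threshold of 0.5 are assumptions made for this example only.

```python
def positional_skewness(coverage):
    """Skewness of bin position, weighted by the coverage in each bin.

    `coverage` lists coverage values from the 5' end to the 3' end.
    A flat profile gives 0; a 3'-biased profile gives a negative value.
    """
    total = sum(coverage)
    if total == 0:
        return 0.0
    positions = range(1, len(coverage) + 1)
    weights = [c / total for c in coverage]
    mu = sum(w * p for w, p in zip(weights, positions))
    var = sum(w * (p - mu) ** 2 for w, p in zip(weights, positions))
    if var == 0:
        return 0.0
    third = sum(w * (p - mu) ** 3 for w, p in zip(weights, positions))
    return third / var ** 1.5


def flag_skewed_cells(profiles, threshold=0.5):
    """Map each cell ID to True when its profile is skewed beyond the threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    return {
        cell: abs(positional_skewness(cov)) > threshold
        for cell, cov in profiles.items()
    }


# Hypothetical profiles with 10 bins each.
profiles = {
    "cell_flat": [1] * 10,
    "cell_3prime_biased": [0, 0, 0, 0, 0, 0, 1, 2, 4, 8],
}
flags = flag_skewed_cells(profiles)
```

Cells flagged in this way could then be set aside before clustering, which is the role the QC step plays in the workflow described above.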

(1-2) Evaluation of computational pipelines for single-cell data

The quality of scRNA-seq analysis results can be critically affected by the analysis methods and technologies used. Cell Ranger, developed by 10x Genomics, is one of the major pipelines for scRNA-seq analysis and has been updated several times. We have carefully evaluated new and previous versions of Cell Ranger and chosen the best way to analyze scRNA-seq datasets.

(1-3) Construction of a new public database, SCPortalen

We have developed a new public database, SCPortalen (Single-cell centric database). In this database, we collect published single-cell datasets, curate their metadata, and reprocess them using a unified computational pipeline. In this way, we normalize data from varied protocols and provide a structured database that is easily queried for downstream use. In the newest version, SCPortalen2, we have expanded the number of datasets, incorporated data generated with 10x Genomics protocols, and focused on adding 5'-end sequenced data. We now also provide a Single Cell Dataset Discovery (SCDD) interface to explore thousands of human and mouse scRNA-seq datasets. Currently, we are working to improve the database further by adding new datasets and quality control (QC) methods and by enhancing data exploration.
https://doi.org/10.1093/nar/gkx949

(2) Construction of Reference Datasets of Transcription Start Sites

We focus on transcription start sites (TSSs) as reference points for efficiently integrating various types of annotation data on transcriptional regulation and for comparing them with experimental data. However, few TSS datasets can be used as a reference. We are therefore constructing a dataset of reference transcription start sites (refTSS), which we have published on our website (https://reftss.riken.jp/). Currently, we are working to expand the dataset to include more TSSs and annotations.
https://doi.org/10.1016/j.jmb.2019.04.045
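
To show what using TSSs as a reference point can look like in practice, the following minimal sketch assigns an experimentally observed position to the nearest reference TSS on the same chromosome and strand. The record layout, the TSS names, and the 100 bp window are assumptions for this example and do not reflect the actual refTSS file format or any refTSS tool.

```python
from bisect import bisect_left


def build_index(ref_tss):
    """Group reference TSSs by (chromosome, strand) and sort them by position.

    `ref_tss` is an iterable of (chrom, position, strand, name) tuples.
    """
    index = {}
    for chrom, pos, strand, name in ref_tss:
        index.setdefault((chrom, strand), []).append((pos, name))
    for entries in index.values():
        entries.sort()
    return index


def nearest_tss(index, chrom, pos, strand, max_distance=100):
    """Return (position, name) of the closest reference TSS, or None.

    Only TSSs on the same chromosome and strand and within
    `max_distance` bp (a hypothetical window) are considered.
    """
    entries = index.get((chrom, strand), [])
    if not entries:
        return None
    i = bisect_left(entries, (pos,))
    # The closest entry is either just before or at the insertion point.
    candidates = entries[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda e: abs(e[0] - pos))
    return best if abs(best[0] - pos) <= max_distance else None


# Hypothetical reference entries, for illustration only.
reference = [
    ("chr1", 1000, "+", "tssA"),
    ("chr1", 5000, "+", "tssB"),
    ("chr1", 1020, "-", "tssC"),
]
index = build_index(reference)
hit = nearest_tss(index, "chr1", 1030, "+")
```

Once experimental signals and annotations are both keyed to the same reference TSS names, datasets from different sources can be joined on that key.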

(3) Development of an Integrated Transcriptional Regulation Data Platform and cis-Regulatory Element Database

In transcriptional regulation studies, various heterogeneous large-scale datasets are available, and their integrative use is becoming increasingly important for further understanding of transcriptional regulation mechanisms. Against this background, we are jointly building a data platform named INTRARED, which integrates information about cis-regulatory elements (CREs), trans-factors, and epigenomes. The data platform consists of two databases: fanta.bio and ChIP-Atlas. fanta.bio is a database of CREs, which are genomic regions that contribute to regulating the expression of genes and transcripts on the same chromosome, together with their locations and activities in various cell types and states. ChIP-Atlas is a database of trans-factors that bind to CREs and of epigenomes that affect CREs and trans-factors. Our laboratory collaborates on the development of INTRARED and the fanta.bio database.
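
One simple form of integrating CRE locations with trans-factor binding data is an interval overlap, sketched below. The tuple layouts, identifiers, and factor names are hypothetical and are not the schema or API of fanta.bio, ChIP-Atlas, or INTRARED. The quadratic loop is chosen for clarity rather than speed.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def factors_per_cre(cres, peaks):
    """List the trans-factors whose binding peaks overlap each CRE.

    `cres` holds (cre_id, chrom, start, end) tuples and `peaks` holds
    (factor, chrom, start, end) tuples; both layouts are assumptions.
    """
    result = {}
    for cre_id, chrom, start, end in cres:
        bound = {
            factor
            for factor, p_chrom, p_start, p_end in peaks
            if p_chrom == chrom and overlaps(start, end, p_start, p_end)
        }
        result[cre_id] = sorted(bound)
    return result


# Hypothetical records, for illustration only.
cres = [("cre1", "chr1", 100, 200), ("cre2", "chr1", 500, 600)]
peaks = [
    ("TF_A", "chr1", 150, 250),
    ("TF_B", "chr1", 200, 300),
    ("TF_A", "chr2", 100, 200),
]
bound_factors = factors_per_cre(cres, peaks)
```

For genome-scale data, an interval tree or a sorted sweep would likely be a better fit than this nested loop.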

(4) Data Coordination for Large-Scale Data Production Projects

In large-scale data production projects, including the FANTOM project, the data coordination center plays an important role in evaluating data quality and providing access to the datasets and their metadata.

(4-1) Management of data coordination center

Our laboratory has established and operated data coordination centers for projects such as the FANTOM6 project and the RIKEN single-cell project. We keep records of our data coordination activities, and we collect and develop technologies and know-how for future activities.

(4-2) ZENBU platform for biomedical data analysis and sharing

ZENBU is an interactive website that provides a platform for scientists to load and share data through secured collaborations, and to interactively process and visualize that data through both a novel genome browser and a data-interactive, report-page-based system for building data portal websites.

The ZENBU genome browser is a powerful genomics visualization platform that allows users to work directly with BAM files by providing on-demand data processing and visualization. ZENBU allows multiple data files to be merged and processed within a track to create deep multi-experiment visualizations. Because processing is on demand, the same uploaded data can be visualized in multiple ways in different tracks. ZENBU provides a rich palette of visualization styles for genomic annotations, for expression via graphs and heatmaps, and for interactions via connection arcs and interaction maps. ZENBU was developed during the FANTOM5 project, which profiled the transcript expression of nearly 2000 different tissues and cells; ZENBU can process and visualize all of these simultaneously within single tracks.

ZENBU-Reports is a web application for creating interactive scientific web portals through graphical interfaces, while providing storage and secured collaborative sharing for data uploaded by users. ZENBU-Reports provides the scientific visualization elements commonly used in supplementary websites, publications, and presentations. It offers a complete solution for the interactive display and dissemination of data and analysis results during the full lifespan of a scientific project, both during the active research phase and after publication of the results.

Full documentation for the ZENBU system is available at https://zenbu-wiki.gsc.riken.jp/. Please cite the following paper when referring to ZENBU:

Severin J, Lizio M, Harshbarger J, Kawaji H, Daub CO, Hayashizaki Y; FANTOM Consortium, Bertin N, Forrest ARR: “Interactive visualization and analysis of large-scale sequencing datasets using ZENBU.” Nature Biotechnology 32(3): 217-219 (2014). PubMed ID 24727769.

Resources

Software and Tools
- SkewC (https://github.com/LSBDT/SkewC/)
- HDRGenome (https://github.com/LSBDT/HDRGenome/)
- Moirai2 workflow engine (https://github.com/moirai2/moirai2/)
- ZENBU-Reports (https://fantom.gsc.riken.jp/zenbu/reports)

Biological Databases
- FANTOM Web Resource (https://fantom.gsc.riken.jp/)
- SCPortalen (https://single-cell.riken.jp/)
- refTSS (https://reftss.riken.jp/)
- SSTAR (https://fantom.gsc.riken.jp/5/sstar/)
- fanta.bio (https://fanta.bio/)
- INTRARED (https://www.intrared.org/)
- ZENBU (https://fantom.gsc.riken.jp/zenbu/)

RIKEN Center for Integrative Medical Sciences