New Storage Hierarchies & Hyper-Convergence Research Projects
Multi-Level Hierarchy
Multi-level Memory/Storage Hierarchy
The traditional memory/storage hierarchy consists mainly of DRAM and magnetic disk storage. It delivers high I/O performance by keeping frequently accessed blocks in DRAM and by emulating a block interface on top of fast DRAM. However, this hierarchy has begun to face significant scaling problems. It also incurs significant logging overhead to keep filesystem and application I/O consistent.
One promising way to improve the existing memory/storage hierarchy is to take advantage of new technologies and fully integrate them into it. A number of non-volatile memory (NVM) technologies, such as PCM, can address the scalability issues of the memory hierarchy. Flash-based storage devices (SSDs) and shingled write drives (SWDs) are other emerging storage technologies that show great promise for improving the performance and space efficiency of the current storage hierarchy. These new technologies fundamentally change the traditional hierarchy. However, many challenging research issues must be investigated before they can be fully integrated. Figure 2 illustrates one possible integration of these technologies into a new memory/storage hierarchy.
The main goal of this project is to understand what prevents the existing memory/storage hierarchy, in both its hardware and software architectures, from fully utilizing these new technologies.
In terms of hardware architecture, we are investigating the device-specific properties of the new storage hierarchy in order to devise efficient data placement and migration techniques. For example, the existing storage hierarchy assumes that data is promoted or demoted one level at a time. We believe that with additional levels of storage devices in the hierarchy, such as SWDs, this assumption no longer leads to the right migration decisions. Depending on the properties of each device in the hierarchy, a bypassing strategy that skips levels is required as well. Therefore, a new set of caching and replacement policies should be integrated into the current hardware architecture to take advantage of emerging storage devices.
In terms of software architecture, we are looking at possible modifications to Linux filesystems such as ext3 so that they can take advantage of NVM in the memory hierarchy. We believe an appropriate way to integrate NVM into current systems is through a filesystem and memory-mapping techniques. However, existing filesystems and mapping software are optimized to accelerate block I/O access using DRAM. As a result, they cannot fully exploit the promising features of emerging NVM, such as scalability and non-volatility. For this reason, we are developing a new kernel API, along with the necessary changes to the current Linux filesystem software architecture, to accommodate these new assumptions.
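The difference between level-by-level migration and bypassing can be illustrated with a minimal sketch. The tier names, the access-frequency thresholds, and the helper names (`BlockStats`, `target_tier`, `migration_path`) are all hypothetical choices made for illustration; they are not the project's actual policy.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical tier order, fastest first. Names and thresholds are
# illustrative assumptions, not measurements from the project.
TIERS = ["DRAM", "NVM", "SSD", "SWD"]


@dataclass
class BlockStats:
    accesses_per_hour: float  # observed access frequency
    sequential: bool          # True if the block is part of a sequential stream


def target_tier(stats: BlockStats) -> str:
    """Pick a destination tier directly from the block's properties."""
    if stats.accesses_per_hour >= 100:
        return "DRAM"
    if stats.accesses_per_hour >= 10:
        return "NVM"
    # Cold, sequential data is assumed to suit shingled drives.
    if stats.accesses_per_hour < 1 and stats.sequential:
        return "SWD"
    return "SSD"


def migration_path(current: str, stats: BlockStats, bypass: bool = True) -> List[str]:
    """Return the tiers a block is written to on its way to the target tier."""
    src = TIERS.index(current)
    dst = TIERS.index(target_tier(stats))
    if src == dst:
        return []
    if bypass:
        return [TIERS[dst]]  # jump straight to the target, skipping levels
    step = 1 if dst > src else -1
    # Traditional behavior: move one level at a time.
    return [TIERS[i] for i in range(src + step, dst + step, step)]


cold = BlockStats(accesses_per_hour=0.1, sequential=True)
print(migration_path("DRAM", cold, bypass=False))  # ['NVM', 'SSD', 'SWD']
print(migration_path("DRAM", cold, bypass=True))   # ['SWD']
```

In this toy model, bypassing turns three writes into one for a cold sequential block, which is the kind of saving a device-aware migration policy would aim for.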
Data Deduplication
Deduplication is the process of dividing a large data stream into multiple chunks and eliminating duplicates by storing only the unique chunks. Each redundant chunk is replaced with a pointer to the stored unique chunk. This technique has been widely adopted not only in backup systems but also in primary storage. We have three research projects related to deduplication.
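The chunk-and-pointer idea can be sketched in a few lines. This is a minimal illustration that assumes fixed-size chunking and SHA-256 fingerprints; the text does not specify either choice, and production systems often use content-defined chunking instead. The function names `deduplicate` and `restore` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(stream: bytes, chunk_size: int = 4096) -> Tuple[Dict[str, bytes], List[str]]:
    """Split a stream into chunks and keep one copy of each unique chunk."""
    store: Dict[str, bytes] = {}  # fingerprint -> unique chunk data
    recipe: List[str] = []        # ordered fingerprints, acting as pointers
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # store only the first occurrence
        recipe.append(fingerprint)
    return store, recipe


def restore(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Rebuild the original stream by following the pointers in order."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
store, recipe = deduplicate(data)
print(len(recipe), len(store))  # 3 2
assert restore(store, recipe) == data
```

Three chunks are referenced but only two are stored, because the first and third chunks share a fingerprint.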
(1) TDDFS: A Tier-aware Data Deduplication based File System
In this project, a unified file system manages several storage tiers (e.g., HDD and flash-based SSD). Data deduplication is applied when cold data is migrated from tier 1 (T1) to tier 2 (T2). For files that are reloaded into T1, the deduplication structure is also maintained in T1, which improves migration efficiency and space utilization. The architecture is shown in Figure 3.
(2) SMRTS: An SMR-based Tiered Store
Shingled magnetic recording (SMR) drives are well suited to storing the containers of a data deduplication system, because containers are written sequentially and never updated. We designed an SMR-based tiered store that reduces the storage cost of a tiered architecture while maintaining its high performance. Data is read and written on SSD, while cold data is deduplicated and migrated to the SMR drives. The architecture is shown in Figure 4.
(3) Optismr: Restore-Performance Optimization for Deduplication Systems Using SMR Drives
In this project, restore performance is optimized by allocating or replicating containers across different SMR zones so that the distance the disk head travels during the restore process is reduced. The initial design is shown in Figure 5.
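The effect of container placement on head movement can be illustrated with a toy model. The sketch below assumes that the distance between zone indices is a reasonable proxy for head travel and uses a simple "first use" placement heuristic; both are assumptions made for illustration, not the project's actual allocation algorithm, and replication is not modeled. The function names are hypothetical.

```python
from typing import Dict, List


def head_travel(restore_order: List[str], zone_of: Dict[str, int]) -> int:
    """Sum of zone-to-zone distances when reading containers in restore order."""
    zones = [zone_of[container] for container in restore_order]
    return sum(abs(b - a) for a, b in zip(zones, zones[1:]))


def allocate_by_first_use(restore_order: List[str], containers_per_zone: int) -> Dict[str, int]:
    """Fill zones with containers in the order a restore first touches them."""
    zone_of: Dict[str, int] = {}
    for container in restore_order:
        if container not in zone_of:
            zone_of[container] = len(zone_of) // containers_per_zone
    return zone_of


# Containers read during one restore; c7 is shared and read twice.
restore_order = ["c1", "c7", "c2", "c7", "c3"]

# Layout in write order: c7 was written much later, so it sits far away.
write_order_layout = {"c1": 0, "c2": 1, "c3": 2, "c7": 6}
restore_aware_layout = allocate_by_first_use(restore_order, containers_per_zone=1)

print(head_travel(restore_order, write_order_layout))    # 20
print(head_travel(restore_order, restore_aware_layout))  # 5
```

In this example, placing the shared container next to the containers that reference it cuts the modeled travel from 20 zone steps to 5.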
Optimizing S/W Stack for SSD
Optimizing Software Stack for NVMe SSD
Ceph Bluestore is the latest object store backend, and it solves the "double journaling" problem of earlier backend designs. Bluestore writes each object only once and stores the object's location separately, as metadata in RocksDB.
NVMe SSD: The increasing I/O speed of flash led to the NVMe protocol, which supports a large number of queues and large queue depths. Furthermore, because the kernel NVMe driver cannot keep up with the response time of NVMe SSDs, user-space poll-mode NVMe drivers (e.g., Intel SPDK) have been proposed to fully exploit the performance of NVMe SSDs.
RocksDB on NVMe: When using SPDK-driven NVMe SSDs directly as the storage backend of RocksDB, we found that the interaction between RocksDB and SPDK cannot fully utilize the performance of the NVMe SSDs. The focus of our project is therefore to pinpoint the performance bottlenecks of the metadata store stack in Ceph Bluestore (especially RocksDB) and to propose optimizations accordingly. In other words, RocksDB, the metadata store in Ceph Bluestore, was not initially designed for user-space NVMe drivers, so we will investigate potential bottlenecks in the Bluestore metadata path when integrating SPDK.
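The write-once design can be pictured with a toy model. In the sketch below, a `bytearray` stands in for the raw device and a plain dictionary stands in for RocksDB; the class `ToyObjectStore` and its methods are hypothetical and are not part of Ceph's API.

```python
from typing import Dict, Tuple


class ToyObjectStore:
    """Toy model: object data is written once; its location lives in a key-value map."""

    def __init__(self) -> None:
        self.device = bytearray()                      # stands in for the raw block device
        self.metadata: Dict[str, Tuple[int, int]] = {}  # stands in for RocksDB: name -> (offset, length)
        self.data_writes = 0                           # number of times object data hit the device

    def put(self, name: str, data: bytes) -> None:
        offset = len(self.device)
        self.device += data                            # the single data write
        self.data_writes += 1
        self.metadata[name] = (offset, len(data))      # small metadata update only

    def get(self, name: str) -> bytes:
        offset, length = self.metadata[name]
        return bytes(self.device[offset:offset + length])


store = ToyObjectStore()
store.put("obj-a", b"hello")
store.put("obj-b", b"world")
print(store.get("obj-b"))     # b'world'
print(store.metadata["obj-b"])  # (5, 5)
print(store.data_writes)      # 2
```

Each object's payload reaches the device exactly once, and only a small location record goes through the key-value store, which is why the performance of that metadata path matters.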
OSD & HBase
Improving HBase for Object Storage
In the current HBase architecture, the HBase RegionServer is deployed on the same host as the HDFS DataNode. This gives very good data locality and keeps network traffic between hosts light. However, most HBase users have their own OSD server pool, and the RegionServer hosts are separate from the storage nodes. As a result, a huge amount of RPC I/O is required between the HBase RegionServers and the OSD servers during GET/SCAN operations.
We deploy a LocalScanner on the OSD servers, and some of the I/O-intensive requests are processed directly by the LocalScanner, so that only the requested data is transmitted to the RegionServer over the network. With the LocalScanner, some GET, SCAN, and compaction operations can be processed locally, which improves performance when network bandwidth is low. However, managing the LocalScanner and coordinating request processing remains challenging.
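The network saving from processing a scan next to the data can be sketched as follows. The two functions are hypothetical stand-ins that only count how many rows would cross the network; they do not use the HBase client API, and the row format is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, int]  # hypothetical row: (row key, value)


def remote_scan(osd_rows: List[Row], predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Baseline: ship every row to the RegionServer, then filter there."""
    shipped = list(osd_rows)  # all rows cross the network
    return [row for row in shipped if predicate(row)], len(shipped)


def local_scan(osd_rows: List[Row], predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """LocalScanner-style: filter on the OSD server, ship only the matches."""
    shipped = [row for row in osd_rows if predicate(row)]
    return shipped, len(shipped)


rows = [("row%03d" % i, i) for i in range(100)]


def wanted(row: Row) -> bool:
    return row[1] % 10 == 0


remote_result, remote_shipped = remote_scan(rows, wanted)
local_result, local_shipped = local_scan(rows, wanted)

assert remote_result == local_result
print(remote_shipped, local_shipped)  # 100 10
```

Both paths return the same rows, but the local path transmits a tenth of the data in this example; the actual benefit depends on how selective the request is.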