Scaling Up The Performance of Distributed Key-Value Stores Using Emerging Technologies for Big Data Applications [thesis]

Author

Hebatalla Eldakiky (Ph.D. 2021)

Abstract

The explosion in the amount of data that accompanied the development of the internet and cloud computing has prompted much research into systems that can store and process this data efficiently. Because data is generated by different sources with non-uniform structures, NoSQL databases emerged as a solution thanks to their flexibility and high performance. Key-value stores, one category of NoSQL databases, are widely used in many big data applications. This wide usage stems from their efficiency in handling data in key-value format and their flexibility to scale out without significant database redesign. With such a huge amount of data, a key-value store cannot keep its data on a single storage server, so the data has to be partitioned across multiple storage instances. Key-value queries have to access the partition information to locate the target key-value pairs and be directed to the storage node that physically holds the data. This introduces additional forwarding steps in the path to the target storage node, and these steps increase the query response time. Recently, the power and flexibility of software-defined networks, together with the evolution of programmable switches, have led to a programmable network infrastructure in which in-network computation can help accelerate application performance. This can be achieved by offloading some computational tasks to the network to improve data access performance when applications access storage through the network. However, what kind of computational tasks should be delegated to the network to accelerate application performance? To solve the partition management problem in key-value stores, we developed TurboKV, an in-switch coordination model that uses programmable switches as partition management nodes and monitoring stations to scale up the performance of distributed key-value stores.
Our in-switch coordination model relieves the storage nodes of the load of routing requests without introducing any additional forwarding steps in the path to the target storage node. Moreover, some key-value stores omit transactions because they reduce scalability and performance, which are the key goals of any key-value store system. This effect is due to the complexity, locking, and starvation introduced by transactions and to their interference with non-transactional operations. To provide efficient support for transactions in key-value stores, we propose TransKV, an extension of our first work, TurboKV, that introduces networking support for transaction processing in distributed key-value stores. TransKV uses programmable switches as a transaction coordinator that decides whether a transaction can proceed to be processed by the storage nodes or should be aborted in the network. On the storage node side, Seagate developed a new drive called the "Kinetic drive". The Kinetic drive is an independent active disk accessible over an Ethernet connection. This enables applications to connect directly to the drive via its IP address and retrieve a piece of data. A Kinetic drive can also carry out key-value pair operations on its own. Thus, in large-scale data management, a set of Kinetic drives can be used to exploit parallelism in satisfying user requests and to relieve the bottleneck caused by the queuing of requests in a storage server that manages multiple HDDs/SSDs. On the other hand, a Kinetic drive has limited bandwidth and capacity. Therefore, a careful allocation scheme is needed to allocate key-value pairs to a set of Kinetic drives, taking into account each drive's limited bandwidth and capacity. To this end, we developed a key-value pair allocation strategy for Kinetic drives.
This strategy takes into consideration data popularity and the limited capacity and bandwidth of each Kinetic drive to avoid queuing at the drive level.
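
To make the allocation constraint concrete, the following is a minimal sketch of one possible popularity-aware placement, not the strategy developed in the thesis. It assumes each key-value pair has a known size and request load, and each drive a remaining capacity and bandwidth budget; the names `Drive` and `allocate`, the units, and the greedy rule (hottest pairs first, onto the feasible drive with the most spare bandwidth) are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical model of one drive's remaining budgets."""
    name: str
    capacity: int      # storage still free (arbitrary unit, assumed)
    bandwidth: float   # request load still servable (arbitrary unit, assumed)
    keys: list = field(default_factory=list)


def allocate(pairs, drives):
    """Greedy placement: hottest pairs first, each on the feasible drive
    with the most spare bandwidth.

    pairs: iterable of (key, size, load) tuples.
    Returns {key: drive name}; raises ValueError if a pair fits nowhere.
    """
    placement = {}
    # Place popular (high-load) pairs first so they spread across drives.
    for key, size, load in sorted(pairs, key=lambda p: p[2], reverse=True):
        # A drive is feasible only if both budgets can absorb the pair.
        feasible = [d for d in drives
                    if d.capacity >= size and d.bandwidth >= load]
        if not feasible:
            raise ValueError(f"no drive can hold {key!r}")
        target = max(feasible, key=lambda d: d.bandwidth)
        target.capacity -= size
        target.bandwidth -= load
        target.keys.append(key)
        placement[key] = target.name
    return placement


drives = [Drive("kd0", capacity=100, bandwidth=10.0),
          Drive("kd1", capacity=100, bandwidth=10.0)]
pairs = [("a", 40, 6.0), ("b", 40, 5.0), ("c", 40, 3.0), ("d", 40, 2.0)]
placement = allocate(pairs, drives)
print(placement)
```

In this toy input the two hottest pairs end up on different drives, which is the intuition behind accounting for popularity: no single drive's bandwidth budget is exhausted while another sits idle.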

Keywords

cloud computing, big data
