A Network Cost-aware Geo-distributed Data Analytics System [conference paper]

Conference

20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID) - May 11-14, 2020

Authors

Kwangsung Oh (Ph.D. student), Abhishek Chandra (professor), Jon Weissman (professor)

Abstract

Many geo-distributed data analytics (GDA) systems have focused on the network performance-bottleneck, inter-data center network bandwidth, to improve performance. Unfortunately, these systems may encounter a cost-bottleneck because they have not considered data transfer cost, one of the most expensive and heterogeneous resources in a multi-cloud environment. In this paper, we present Kimchi, a network cost-aware GDA system that meets the desired cost-performance tradeoff by exploiting data transfer cost heterogeneity to avoid the cost-bottleneck. Kimchi makes cost-aware task placement decisions given inputs including data transfer cost, network bandwidth, input data size and locations, and the desired cost-performance tradeoff preference. In addition, Kimchi remains mindful of data transfer cost in the presence of dynamics. A Kimchi prototype has been implemented on Spark, and experiments show that it reduces cost by 14% to 24% without impacting performance and reduces query execution time by 45% to 70% without impacting cost, compared to three baseline approaches: centralized, vanilla Spark, and bandwidth-aware (e.g., Iridium). More importantly, Kimchi allows applications to explore a much richer cost-performance tradeoff space in a multi-cloud environment.
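
The placement inputs listed in the abstract (data transfer cost, network bandwidth, input data size and locations, and a tradeoff preference) can be illustrated with a toy sketch. This is not Kimchi's algorithm, which the abstract does not describe. It assumes a single task that needs all inputs at one site, parallel transfers, and a hypothetical weight `alpha` standing in for the tradeoff preference. All names, prices, and bandwidths below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    input_gb: float  # input data already stored at this site (assumed known)


def place_task(sites, price_per_gb, bandwidth_gbps, alpha):
    """Pick the site where a single task should run (toy model, not Kimchi).

    price_per_gb[(src, dst)]   : hypothetical transfer price in $/GB
    bandwidth_gbps[(src, dst)] : hypothetical link bandwidth in Gbit/s
    alpha                      : 1.0 weighs only cost, 0.0 weighs only time
    Returns (site_name, total_cost_usd, transfer_time_s).
    """
    candidates = []
    for dst in sites:
        others = [s for s in sites if s.name != dst.name]
        # Total cost of pulling every remote input to dst.
        cost = sum(s.input_gb * price_per_gb[(s.name, dst.name)] for s in others)
        # Transfers are assumed to run in parallel, so the slowest one dominates.
        time = max(
            (s.input_gb * 8 / bandwidth_gbps[(s.name, dst.name)] for s in others),
            default=0.0,
        )
        candidates.append((dst.name, cost, time))

    # Normalize both metrics so the weighted sum mixes comparable quantities.
    max_cost = max(c for _, c, _ in candidates) or 1.0
    max_time = max(t for _, _, t in candidates) or 1.0

    def score(candidate):
        _, cost, time = candidate
        return alpha * cost / max_cost + (1 - alpha) * time / max_time

    return min(candidates, key=score)


# Made-up two-site example: most data lives at "aws", but egress from "gcp"
# is priced higher per GB.
sites = [Site("aws", 100.0), Site("gcp", 20.0)]
price = {("aws", "gcp"): 0.02, ("gcp", "aws"): 0.12}
bw = {("aws", "gcp"): 1.0, ("gcp", "aws"): 1.0}

cost_first = place_task(sites, price, bw, alpha=1.0)  # chooses "gcp"
time_first = place_task(sites, price, bw, alpha=0.0)  # chooses "aws"
```

In this example the cheapest placement and the fastest placement differ, which is the kind of cost-performance tradeoff the abstract refers to. Intermediate values of `alpha` move between the two extremes.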

Link to full paper

A Network Cost-aware Geo-distributed Data Analytics System

Keywords

distributed systems
