Enabling Fundamental Cacheability for Distributed Deep Learning Training


Department of Electrical and Computer Engineering

Location: Edwin A. Stevens 330 or via Zoom

https://stevens.zoom.us/j/99616274687?pwd=bFAuNbvbmS5kuXzhrd84OMobVY3qIy.1

Passcode: 754083

Speaker: Ali R. Butt, Professor of Computer Science, Virginia Tech

ABSTRACT

Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose new challenges for storage system design. DLT is I/O intensive, as data samples must be fetched continuously from remote storage. Accelerators such as GPUs have been used extensively to support these applications, but as accelerators become more powerful and more data-hungry, I/O performance lags behind, creating a crucial performance bottleneck, especially in distributed DLT. At the same time, exponentially growing dataset sizes make it impossible to store these datasets entirely in memory. While today's DLT frameworks typically use a random sampling policy that treats all samples as equally important, recent findings indicate that they are not: different data samples contribute differently to improving a model's accuracy. This observation creates an opportunity for DLT I/O optimizations that exploit the data locality enabled by importance sampling. In this talk, I'll present the design of SHADE, a new DLT-aware caching system that detects fine-grained importance variations at the per-sample level and leverages this variation to make informed caching decisions for a distributed DLT job. SHADE employs a number of optimizations that significantly improve the cache hit ratio of a DLT job and, in turn, its training performance. I will also provide an overview of ongoing I/O, storage, and large-scale distributed systems projects at our lab, DSSL.
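To make the importance-aware caching idea concrete, below is a minimal Python sketch of one way such a policy could work. This is an illustration of the general concept only, not SHADE's actual design: the ImportanceCache class, its methods, and the eviction rule are all hypothetical. The cache keeps the highest-scored samples resident and evicts the lowest-scored sample when full, so the samples that importance sampling revisits most often are served from memory rather than remote storage.

import heapq

class ImportanceCache:
    """Hypothetical importance-aware sample cache (not SHADE's implementation).

    Holds at most `capacity` samples; when full, evicts the sample whose
    current importance score is lowest, so high-importance samples stay
    resident across epochs.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}         # sample_id -> cached sample payload
        self.importance = {}   # sample_id -> latest importance score
        self.heap = []         # (score, sample_id); may hold stale entries

    def update_importance(self, sample_id, score):
        # Called after a training step, e.g. using the per-sample loss
        # as a proxy for importance.
        self.importance[sample_id] = score
        if sample_id in self.data:
            heapq.heappush(self.heap, (score, sample_id))

    def get(self, sample_id):
        # None signals a cache miss; the caller fetches from remote storage.
        return self.data.get(sample_id)

    def put(self, sample_id, sample, score):
        if sample_id in self.data:
            self.data[sample_id] = sample
            self.update_importance(sample_id, score)
            return
        while len(self.data) >= self.capacity:
            self._evict_lowest()
        self.data[sample_id] = sample
        self.importance[sample_id] = score
        heapq.heappush(self.heap, (score, sample_id))

    def _evict_lowest(self):
        # Lazy deletion: pop heap entries until one matches the live score;
        # entries invalidated by importance updates are simply skipped.
        while self.heap:
            score, sid = heapq.heappop(self.heap)
            if sid in self.data and self.importance.get(sid) == score:
                del self.data[sid]
                return
        # Heap exhausted by stale entries; evict an arbitrary resident item.
        self.data.pop(next(iter(self.data)))

# Example: a capacity-2 cache under importance-driven eviction.
cache = ImportanceCache(capacity=2)
cache.put("a", "sample-a", score=0.9)
cache.put("b", "sample-b", score=0.1)
cache.put("c", "sample-c", score=0.5)   # evicts "b", the lowest-importance sample
assert cache.get("b") is None and cache.get("a") is not None

The min-heap with lazy deletion keeps eviction at O(log n) even as scores are updated every step; a real distributed design would also have to coordinate cache contents and importance scores across workers, which this sketch omits.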

BIOGRAPHY


Dr. Ali R. Butt is a Professor of Computer Science (and ECE by courtesy) and Associate Department Head for Faculty Development at CS@Virginia Tech. He is an ACM Distinguished Member. He received his Ph.D. degree in Electrical and Computer Engineering from Purdue University in 2006. He is a recipient of an NSF CAREER Award (2008), IBM Faculty Awards (2008, 2015), a VT College of Engineering (COE) Dean's award for "Outstanding New Assistant Professor" (2009), an IBM Shared University Research Award (2009), and NetApp Faculty Fellowships (2011, 2015). He was named a VT COE Faculty Fellow in 2013. Ali was an Academic Visitor at IBM Almaden Research Center (Summer 2012) and a Visiting Research Fellow at Queen's University Belfast (Summer 2013). He has served as an Associate Editor for IEEE Transactions on Cloud Computing (2018-present), ACM Transactions on Storage (2016-present), IEEE Transactions on Parallel and Distributed Systems (2013-2016), Cluster Computing: The Journal of Networks, Software Tools and Applications (2013-present), and Sustainable Computing: Informatics and Systems (2010-2015). He is an alumnus of the National Academy of Engineering's US Frontiers of Engineering (FOE) Symposium (2009), the US-Japan FOE (2012), and the National Academy of Sciences' AA Symposium on Sensor Science (2015). He was also an organizer for the US FOE in 2010. Ali's research interests are in cloud and high-performance computing systems; systems support for machine and deep learning applications; file, I/O, and storage systems; distributed systems; and large-scale experimental computer systems. At Virginia Tech he leads the Distributed Systems & Storage Laboratory (DSSL) and directs the stack@cs Center for Computer Systems.