Episode 010 | September 28, 2021

Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.

Muthian's interests lie broadly in the space of large-scale distributed systems, storage, and systems for deep learning, blockchains, and information retrieval. Prior to joining Microsoft Research, he worked at Google for about 10 years, where a large part of his work focused on building key infrastructure powering Google web search, in particular the query engine for web search. Muthian obtained his Ph.D. from the University of Wisconsin-Madison in 2005 in the area of file and storage systems, and a B.E. from CEG, Anna University, in 2000.

Transcript

Muthian Sivathanu: Continued innovation in systems, in efficiency and cost, is going to be crucial to drive the next generation of AI advances, right. The last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently, and so on. And if you think about the theory of deep learning, people knew about backpropagation and neural networks 25 years ago, and we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been the advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in determining the next generation of advancement in AI.

[Music]

Sridhar Vedantham: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that's impacting technology and society. I'm your host, Sridhar Vedantham.

[Music]

Sridhar Vedantham: Artificial intelligence, Machine Learning, Deep Learning, and Deep Neural Networks are today critical to the success of many industries. But they are also extremely compute intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.

[Music]

Sridhar Vedantham: So Muthian, welcome to the podcast, and thanks for making the time for this.

Muthian Sivathanu: Thanks Sridhar, pleasure to be here.

Sridhar Vedantham: And what I'm really looking forward to, given that we seem to be in some kind of final stage of the pandemic, is to actually be able to meet you face to face again after a long time. Unfortunately, we've had to do a remote podcast again, which isn't all that much fun.
Muthian Sivathanu: Right, right. Yeah, I'm looking forward to the time when we can actually do this again in office.

Sridhar Vedantham: Yeah. Ok, so let me jump right into this. You know, we keep hearing about things like AI and deep learning and deep neural networks and so on and so forth. What's very interesting in all of this is that we tend to hear about the end product, which is, you know, what actually impacts businesses, what impacts consumers, what impacts the health care industry, for example, in terms of AI. It's a little bit of a mystery, I think, to a lot of people as to how all this works, because what goes on behind the scenes to actually make AI work is generally not talked about.

Muthian Sivathanu: Yeah.

Sridhar Vedantham: So, before we get into the meat of the podcast, do you want to speak a little bit about what goes on in the background?

Muthian Sivathanu: Sure. So, machine learning, Sridhar, as you know, and deep learning in particular, is essentially about learning patterns from data. A deep learning system is fed a lot of training examples, examples of input and output, and then it automatically learns a model that fits that data. This is typically called the training phase; the training phase is where it takes the data and builds a model that fits it. Now, what is interesting is, once this model is built, which was really meant to fit the training data, the model is really good at answering queries on data that it had never seen before, and this is where it becomes useful. These models are built in various domains: it could be for recognizing an image, for converting speech to text, and so on. And what has in particular happened over the last 10 or so years is that there has been significant advancement both on the theory side of machine learning, that is, new algorithms, new model structures that do a better job at fitting the input data to a generalizable model, as well as rapid innovation in systems infrastructure, which actually enables the model to do its work, which is very compute intensive, in a way that's scalable, economically feasible, cost effective, and so on.

Sridhar Vedantham: OK, Muthian, so it sounds like there's a lot of compute actually required to make things like AI and ML happen. Can you give me a sense of what kind of resources are needed, or how intensive the resource requirement is?

Muthian Sivathanu: Yeah. So, the resource usage of a machine learning model is a direct function of how many parameters it has. The more complex the data set, the larger the model gets, and correspondingly it requires more compute resources. To give you an idea, the early machine learning models, which performed simple tasks like recognizing digits and so on, could run on a single server machine in a few hours. But just over the last two years, for example, the size of the largest useful model, the one that achieves state-of-the-art accuracy, has grown by nearly three orders of magnitude. And what that means is that to train these models today you need thousands and thousands of servers, and that's infeasible. Also, accelerators, or GPUs, have really taken over in the last six or seven years. A single V100 GPU today, a Volta GPU from NVIDIA, can run about 140 trillion operations per second, and you need several hundreds of them to actually train a model like this. And they run for months together: to train a 175-billion-parameter model, like the recent GPT-3, you need on the order of thousands of such GPUs, and it still takes a month.
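To get a feel for where figures like "thousands of GPUs for a month" come from, here is a rough back-of-the-envelope sketch, not from the conversation itself. It uses the commonly cited approximation of about 6 floating-point operations per parameter per training token; the token count, utilization, and GPU count below are illustrative assumptions.

```python
# Rough estimate of GPT-3-scale training time; all figures are
# illustrative assumptions, not numbers quoted in the podcast.

params = 175e9            # model parameters (GPT-3 scale)
tokens = 300e9            # training tokens (approximate published GPT-3 figure)
total_flops = 6 * params * tokens     # ~6 FLOPs/param/token rule of thumb

peak_flops_per_gpu = 140e12   # ~140 TFLOPS peak, as quoted for a V100
utilization = 0.30            # assumed sustained fraction of peak
num_gpus = 3000               # "on the order of thousands" of GPUs

cluster_throughput = peak_flops_per_gpu * utilization * num_gpus
days = total_flops / cluster_throughput / 86400
print(f"~{days:.0f} days of continuous training")   # roughly a month
```

Even with generous utilization assumptions, the arithmetic lands in the range Muthian describes, which is why efficiency gains in the systems layer translate directly into dollars and iteration speed.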
Sridhar Vedantham: A month? That sounds like a humongous amount of time.

Muthian Sivathanu: Exactly, right? So that's why, just as I told you how advances in the theory of machine learning, in terms of new algorithms, new model structures, and so on, have been crucial to the recent advance in the relevance and practical utility of deep learning, equally important has been this advancement in systems. Because given this huge explosion of compute demands that these workloads place, we need fundamental innovation in systems to actually keep pace, to make sure that you can train them in reasonable time, and that you can do that at reasonable cost.

Sridhar Vedantham: Right. Ok, so you know, for a long time I was generally under the impression that if you wanted to run bigger and bigger models and bigger jobs, essentially you had to throw more hardware at it, because at one point hardware was cheap. But I guess that kind of applies only to the CPU scenario, whereas the GPU scenario tends to become really expensive, right?

Muthian Sivathanu: Yep, yeah.

Sridhar Vedantham: Ok, so in which case, when there is basically some kind of limit being imposed because of the cost of GPUs, how does one actually go about tackling this problem of scale?

Muthian Sivathanu: Yeah, so the high-level problem ends up being that you have limited resources, and you can view this from two perspectives, right. One is from the perspective of a machine learning developer or researcher who wants to build a model to accomplish a particular task. From the perspective of that user, there are two things you need. A, you want to iterate really fast, because deep learning, incidentally, is this special category of machine learning where the exploration is largely by trial and error. So, if you want to know which model actually works, which parameters or which hyperparameter set actually gives you the best accuracy, the only way to really know for sure is to train the model to completion, measure accuracy, and then you would know which model is better, right. So, as you can see, the iteration time, the time to train a model and run inference on it, directly impacts the rate of progress you can achieve. The second aspect that the machine learning researcher cares about is cost: you want to do this without spending a lot in dollar terms.

Sridhar Vedantham: Right.

Muthian Sivathanu: Now, from the perspective of, let's say, a cloud provider who runs this huge farm of GPUs and then offers it as a service for researchers, for users to run machine learning models, their objective function is cost, right. So, to support a given workload, you need to support it with as few GPUs as possible. Or, in other words, if you have a certain amount of GPU capacity, you want to support as large a workload as possible on it.
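Muthian's point about trial-and-error exploration is easy to make concrete. The sketch below is illustrative only; the model, synthetic data, and hyperparameter grid are all assumptions, not anything discussed in the episode. It shows why iteration time dominates: every candidate configuration pays the full cost of training before its accuracy can even be measured.

```python
# Illustrative hyperparameter sweep: each candidate must be trained to
# completion before its accuracy is known, so total cost scales with
# (number of candidates) x (full training time). All values are assumed.
import itertools
import torch
from torch import nn

def make_data(n=2000):
    x = torch.randn(n, 20)
    y = (x.sum(dim=1) > 0).long()   # synthetic binary labels
    return x, y

def train_and_evaluate(lr, hidden, epochs=20):
    x, y = make_data()
    model = nn.Sequential(nn.Linear(20, hidden), nn.ReLU(), nn.Linear(hidden, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                 # the expensive part: full training
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    x_test, y_test = make_data(500)         # held-out data the model never saw
    acc = (model(x_test).argmax(dim=1) == y_test).float().mean().item()
    return acc

# Grid of candidate hyperparameters; real sweeps are far larger.
best = max(
    ((lr, hidden, train_and_evaluate(lr, hidden))
     for lr, hidden in itertools.product([0.01, 0.1, 1.0], [16, 64])),
    key=lambda t: t[2],
)
print(f"best lr={best[0]}, hidden={best[1]}, accuracy={best[2]:.2f}")
```

On a shared cluster, each point in that grid is a separate training job, which is exactly why the provider side of the problem, packing many such jobs onto as few GPUs as possible, matters so much.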