[Lex Fridman] Deep Learning Basics: Introduction and Overview
Welcome, everyone, to 2019. It's really good to see everybody here make it in the cold. This is 6.S094, Deep Learning for Self-Driving Cars. It is part of a series of courses on deep learning that we're running throughout this month. The website where you can get all the content, the videos, the lectures, and the code is deeplearning.mit.edu. The videos and slides will be made available there, along with a GitHub repository that accompanies the course. Assignments for registered students will be emailed later in the week. And you can always contact us with questions, concerns, and comments at hcai@mit.edu (HCAI, for human-centered AI).
So let's start with the basics, the fundamentals. To summarize in one slide: what is deep learning? It is a way to extract useful patterns from data in an automated way, with as little human effort involved as possible, hence the "automated." How? The fundamental aspect that we'll talk about a lot is the optimization of neural networks. The practical nature, which we'll provide through the code and so on, is that there are libraries that make it accessible and easy to do some of the most powerful things in deep learning using Python, TensorFlow, and friends. The hard part always with machine learning, and artificial intelligence in general, is asking good questions and getting good data. A lot of times the exciting aspect of what the news covers, and of what is published in the prestigious conferences, on arXiv, or in a blog post, is the methodology.
The hard part is applying that methodology to solve real-world problems, to solve fascinating, interesting problems, and that requires data. That requires asking the right questions of that data, organizing that data, and labeling and selecting aspects of that data that can reveal the answers to the questions you ask. So why has there been this breakthrough over the past decade in the application of neural networks, of the ideas in neural networks? What has happened? What has changed? They've been around since the 1940s, and the ideas have been percolating even before. First, the digitization of information, data: the ability to access data easily in a distributed fashion across the world. All kinds of problems now have a digital form, so they can be accessed by learning algorithms. Hardware, compute: both the Moore's law of CPUs and GPUs, and ASICs like Google's TPU systems, hardware that enables the efficient, effective, large-scale execution of these algorithms. Community.
People here, people all over the world being able to work together, to talk to each other, to feed the fire of excitement behind machine learning: GitHub and beyond. The tooling, as we'll talk about: TensorFlow, PyTorch, and everything in between that enables a person with an idea to reach a solution in less and less time. Higher and higher levels of abstraction empower people to solve problems in less and less time with less and less knowledge, where the idea and the data become the central point, not the effort that takes you from the idea to the solution. And there's been a lot of exciting progress, some of which we'll talk about: from face recognition to the general problem of scene understanding and image classification, to speech, text, natural language processing, transcription, and translation, to medical applications and medical diagnosis. And cars: being able to solve many aspects of perception in autonomous vehicles with drivable area, lane detection, and object detection. Digital assistants, the ones on your phone and, beyond, the ones in your home.
Ads and recommender systems, from Netflix to search to social, Facebook. And of course the deep reinforcement learning successes in the playing of games, from board games to StarCraft and Dota. Let's take a step back. Deep learning is more than a set of tools to solve practical problems. Pamela McCorduck said in '79, "AI began with the ancient wish to forge the gods." Throughout our history, throughout human civilization, we've dreamed about creating echoes of whatever is in this mind of ours in the machine, and creating living organisms. From the popular culture of the 1800s with Frankenstein to Ex Machina, this vision, this dream of understanding intelligence and creating intelligence has captivated all of us. And deep learning is at the core of that, because there are aspects of it, the learning aspects, that captivate our imagination about what is possible.
Given data and methodology, how far can learning, learning to learn, and beyond take us? And here, visualized, is just 3% of the neurons and one millionth of the synapses in our own brain. There's this incredible structure in our mind, and there are only echoes of it, small shadows of it, in the artificial neural networks that we're able to create. But nevertheless those echoes are inspiring to us. The history of neural networks on this pale blue dot of ours started quite a while ago, with summers and winters, with excitement and periods of pessimism. It started in the forties with neural networks, and the implementation of those neural networks as a perceptron in the fifties; then ideas of backpropagation, restricted Boltzmann machines, and recurrent neural networks in the seventies and eighties; then convolutional neural networks and the MNIST dataset, with datasets beginning to percolate, and LSTMs and bidirectional RNNs in the nineties; and the rebranding and rebirth of neural networks under the flag of deep learning and deep belief nets in 2006. Then the birth of ImageNet in 2009, the dataset on which the possibilities of what deep learning can bring to the world were first illustrated in recent years. And AlexNet, the network that did exactly that on ImageNet, with a few ideas like dropout that improve neural networks over time, year by year improving the performance of neural networks.
In 2014, the idea of GANs, which Yann LeCun called the most exciting idea of the last 20 years: generative adversarial networks, the ability to generate data with very little supervision, to generate ideas. After forming a representation, from the high-level abstractions of what is extracted from the data, they are able to generate new samples, to create. The idea of being able to create as opposed to memorize is really exciting. And on the applied side, in 2014 with DeepFace, the ability to do face recognition. There have been a lot of breakthroughs on the computer vision front, that being one of them. The world was inspired, captivated, in 2016 with AlphaGo and in '17 with AlphaZero, beating, with less and less effort, the best players in the world at Go, a problem that for most of the history of artificial intelligence was thought to be unsolvable.
And there are new ideas with capsule networks. And this past year, 2018, was the year of natural language processing, with a lot of interesting breakthroughs, Google's BERT and others that we'll talk about: breakthroughs in the ability to understand language, to understand speech, and everything, including generation, that's built around that. And there's a parallel history of tooling, starting in the sixties with the perceptron and the wiring diagrams, and ending this year with PyTorch 1.0 and TensorFlow 2.0. These have really solidified into exciting, powerful ecosystems of tools that enable you to do a lot with very little effort. The sky is the limit, thanks to the tooling. So let's go then from the big picture to the smallest.
Everything should be made as simple as possible. So let's start simple, with a little piece of code, before we jump into the details and a big run-through of everything that is possible in deep learning. At the very basic level, with just a few lines of code, really six here, six little pieces of code, you can train a neural network that understands what's going on in an image. The classic that I will always love is the MNIST dataset: the handwritten digits, where the input to a neural network or machine learning system is the picture of a handwritten digit and the output is the number that's in that picture. It's as simple as this. In the first step, import the library, TensorFlow.
Second step, import the dataset, MNIST. Third step, like Lego bricks stacked on top of each other, build the neural network layer by layer, with an input layer, a hidden layer, and an output layer. Step four, train the model, as simple as a single line: model.fit. Step five, evaluate the model on the testing dataset. And that's it. In step six you're ready to deploy, you're ready to predict what's in the image. It's as simple as that. And much of this code, obviously much more elaborate, rich, interesting, and complex, we'll be making available on GitHub, in the repository that accompanies these courses.
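The six steps can be sketched roughly as follows. This is a minimal sketch, not the slide's exact code: the function name `run_mnist_demo`, the layer sizes, the optimizer, and the number of epochs are assumptions chosen for illustration. It assumes TensorFlow 2.x is installed and that MNIST can be downloaded on first use, and the import sits inside the function only so that defining it does not require TensorFlow.

```python
def run_mnist_demo(epochs=1):
    """Train a small MNIST classifier; returns (test_accuracy, predicted_digit)."""
    # Step 1: import the library.
    import tensorflow as tf

    # Step 2: load the MNIST dataset and scale pixel values to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Step 3: stack the layers: input, one hidden layer, output.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Step 4: train the model.
    model.fit(x_train, y_train, epochs=epochs)

    # Step 5: evaluate on the held-out test set.
    _, test_accuracy = model.evaluate(x_test, y_test)

    # Step 6: predict the digit in a single test image.
    probabilities = model.predict(x_test[:1])
    predicted_digit = int(probabilities.argmax(axis=1)[0])
    return test_accuracy, predicted_digit
```

Calling `run_mnist_demo()` downloads the data and trains for one pass over the training set.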
Today we've released the first tutorial, on driver scene segmentation. I encourage everybody to go through it. And then on the tooling side, in one slide, before we dive into neural networks and deep learning: amongst many other things, TensorFlow is a deep learning library, an open-source library from Google. It's the most popular one to date, the most active, with a large ecosystem. It's not just something you import in Python to solve some basic problems. There's an entire ecosystem of tooling.
There are different levels of APIs. Much of what we'll do in this course will be with the highest-level API, Keras. But there's also the ability to run in the browser with TensorFlow.js, on the phone with TensorFlow Lite, and in the cloud without any need to have computer hardware or any of the libraries set up on your own machine. You can run all the code that we're providing in the cloud with Google Colab, Colaboratory. And there's the optimized ASIC hardware that Google has optimized for TensorFlow with their TPU, the Tensor Processing Unit; the ability to visualize with TensorBoard; and models provided in TensorFlow Hub. And there's just an entire ecosystem including, most importantly I think, documentation and blogs that make it extremely accessible to understand the fundamentals of the tooling that allows you to solve problems from natural language processing to computer vision to GANs, generative adversarial networks, and everything in between, with deep reinforcement learning and so on. So that's why we're excited to work both on the theory in this course, in this series of lectures, and on the tooling and the applied side with TensorFlow. It really makes these ideas exceptionally accessible.
So deep learning, at the core, is the ability to form higher and higher levels of abstraction, of representations of data and raw patterns: higher and higher levels of understanding of patterns. And those representations are extremely important and effective for being able to interpret data. Under certain representations data is trivial to understand: cat versus dog, blue dot versus green triangle. Under others it's much more difficult. In this task, drawing a line to separate the two classes under polar coordinates is trivial. Under Cartesian coordinates it is very difficult, well, impossible to do accurately.
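The polar-versus-Cartesian point can be checked on a toy dataset. The sketch below is an illustration under assumptions, not the data from the slide: it places one class on a circle of radius 1 and the other on a circle of radius 3, then asks how well a single straight line can separate them in each representation. The helper names and the choice of 12 line directions are hypothetical.

```python
import math

def make_rings(n_per_ring=50, inner_radius=1.0, outer_radius=3.0):
    """Return (points, labels): label 0 on the inner circle, 1 on the outer."""
    points, labels = [], []
    for label, radius in enumerate((inner_radius, outer_radius)):
        for k in range(n_per_ring):
            angle = 2 * math.pi * k / n_per_ring
            points.append((radius * math.cos(angle), radius * math.sin(angle)))
            labels.append(label)
    return points, labels

def best_threshold_accuracy(values, labels):
    """Best accuracy achievable by thresholding a single feature."""
    best = 0.0
    for t in values:
        predictions = [1 if v > t else 0 for v in values]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        best = max(best, accuracy, 1 - accuracy)  # allow either orientation
    return best

def best_line_accuracy(points, labels, n_directions=12):
    """Best accuracy of a straight line in the (x, y) plane, over a few directions."""
    best = 0.0
    for d in range(n_directions):
        theta = math.pi * d / n_directions
        projection = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
        best = max(best, best_threshold_accuracy(projection, labels))
    return best

points, labels = make_rings()
radius = [math.hypot(x, y) for x, y in points]  # polar representation

acc_cartesian = best_line_accuracy(points, labels)
acc_polar = best_threshold_accuracy(radius, labels)
```

In the polar representation a single threshold on the radius separates the classes perfectly, while no line tried in the Cartesian plane comes close.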
And that's a trivial example of a representation. So our task with deep learning, and with machine learning in general, is forming representations that map the topology, whatever the topology, the rich space of the problem that you're trying to deal with, of the raw inputs. Map it in such a way that the final representation is trivial to work with: trivial to classify, trivial to perform regression on, trivial to generate new samples of that data. And that representation, of higher and higher levels of representation, is really the dream of artificial intelligence. That is what understanding is: making the complex simple, like Einstein, back a few slides ago, said.
And that, with Jürgen Schmidhuber and whoever else said it, I don't know, has been the dream of all of science in general. The history of science is the history of compression progress, of forming simpler and simpler representations of ideas. The model of the universe, of our solar system, with the Earth at the center of it is much more complex to do physics on than a model where the Sun is at the center. Those higher and higher levels of simple representations enable us to do extremely powerful things. That has been the dream of science and the dream of artificial intelligence. And why deep learning? What is so special about deep learning in the grander world of machine learning and artificial intelligence? It's the ability to more and more remove the input of human experts, to remove the human from the picture, the costly, inefficient effort of human beings.
Deep learning automates much of the extraction of features. It gets us closer and closer to the raw data without the need for human involvement, human expert involvement: the ability to form representations from the raw data, as opposed to having a human being extract features, as was done in the eighties, nineties, and early aughts, with which the machine learning algorithms can then work. The automated extraction of features enables us to work with larger and larger datasets, removing the human completely except from the supervision, the labeling step, at the very end. It doesn't require the human expert. But at the same time there are limits to our technologies. There's always a balance between excitement and disillusionment. The Gartner hype cycle, as much as we don't like to think about it, applies to almost every single technology. Of course the magnitude of the peaks and the troughs is different.
But I would say we're at the peak of inflated expectations with deep learning. And that's something we have to think about as we talk about some of the ideas and exciting possibilities of the future. And with self-driving cars, which we'll talk about in future lectures in this course, we're at the same point. In fact, we're a little bit beyond the peak. And so it's up to us, this is MIT, the engineers and the people working on this in the world, to carry us through the trough, to carry us through to the future as the ups and downs of the excitement progress forward into the plateau of productivity. Why else not deep learning? Look at real-world applications, especially humanoid robotics, robotic manipulation, and even, yes, autonomous vehicles. The majority of the aspects of autonomous vehicles do not involve machine learning to an extensive degree to date. The problems are not formulated as data-driven learning.
Instead they're model-based optimization methods that don't learn from data over time. And from the speakers that follow over these couple of weeks we'll get to see how much machine learning is starting to creep in. But in the examples shown here, with the amazing humanoid robotics of Boston Dynamics, to date almost no machine learning has been used except for trivial perception. It's the same with autonomous vehicles: almost no machine learning or deep learning has been used except with perception, some aspect of enhanced perception from the visual texture information. Plus, what's starting to be used a little bit more is recurrent neural networks to predict the future, to predict the intent of the different players in the scene in order to anticipate what the future is. But these are very early steps. Most of the success you see today, the 10 million miles Waymo has achieved, has been attributed mostly to non-machine-learning methods.
Why else not deep learning? Here's a really clean example of unintended consequences, of ethical issues we have to really think about. When an algorithm learns from data based on an objective function, a loss function, the power, the consequences of an algorithm that optimizes that function are not always obvious. Here's an example of a human player playing the game of CoastRunners. It's a boat racing game where the task is to go around the racetrack and try to win the race. And the objective is to get as many points as possible. There are three ways to get points: the finishing time, how long it took you to finish; the finishing position, where you were in the ranking; and picking up turbos, those little green things along the way, which give you points.
Okay, simple enough. So we design an agent, in this case an RL agent, that optimizes for the rewards. And what we find, on the right here, is that the agent discovers that the optimal strategy actually has nothing to do with finishing the race or the ranking. You can get many more points by just focusing on the turbos and collecting those little green dots, because they regenerate. So you go in circles over and over, slamming into the wall, collecting the green turbos. Now that's a very clear example of a well-reasoned, well-formulated objective function that has totally unexpected consequences, at least without considering those consequences ahead of time. And so that shows the need for AI safety, for a human in the loop of machine learning. That's why not deep learning exclusively.
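A toy calculation shows why a reward-maximizing agent might prefer circling. This is only a sketch: every number below (the finish bonus, the points per turbo, the loop length, the episode horizon) is hypothetical and not taken from the actual game, and the two scripted policies stand in for what a learned agent would discover.

```python
def episode_return(policy, horizon=100):
    """Total points for a scripted policy in a toy race (all numbers hypothetical)."""
    finish_bonus = 50      # points for completing the race
    turbo_points = 10      # points per turbo pickup
    turbos_on_track = 3    # turbos passed once on the way to the finish
    loop_period = 5        # steps for one circle past a regenerating turbo

    if policy == "finish_race":
        # Drive to the finish line; the episode ends there.
        return finish_bonus + turbos_on_track * turbo_points
    if policy == "circle_turbos":
        # Never finish; collect one regenerated turbo per loop until time runs out.
        return (horizon // loop_period) * turbo_points
    raise ValueError(f"unknown policy: {policy}")

returns = {p: episode_return(p) for p in ("finish_race", "circle_turbos")}
best_policy = max(returns, key=returns.get)
```

Under these assumed numbers the circling policy scores higher, so an optimizer that sees only the point total has no reason to finish the race.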
The challenge of deep learning algorithms, of deep learning applied, is to ask the right question and understand what the answers mean. You have to take a step back and look at the difference, the distinction, the levels, the degrees of what the algorithm is accomplishing. For example, image classification is not necessarily scene understanding. In fact, it's very far from scene understanding. Classification may be very far from understanding. And the data can vary drastically across the different benchmarks and datasets used: professionally done photographs versus synthetically generated images versus real-world data. And the real-world data is where the big impact is.
So oftentimes the one doesn't transfer to the other. That's the challenge of deep learning. Solving all of these problems of different lighting variations, of pose variation, of inter-class variation, all the things that we take for granted as human beings with our incredible perception system, all have to be solved in order to gain greater and greater understanding of a scene. And there are all the other things we have to close the gap on that we're not even close to yet. Here's an image from Andrej Karpathy's blog from a few years ago, of former President Obama stepping on a scale. We can classify, we can do semantic segmentation of the scene, we can do object detection, we can do a little bit of 3D reconstruction from a video version of the scene. But what we can't do well is all the things we take for granted. We can't tell the images in the mirrors from reality.
We can't deal with the sparsity of information: from just a few pixels of President Obama's face we can still identify him as the president. The 3D structure of the scene: that there's a foot on top of a scale, that there are human beings behind, all from a single image. Things we can trivially do using all the common-sense semantic knowledge that we have, machines cannot do. The physics of the scene: that there's gravity. And the biggest thing, the hardest thing, is what's on people's minds, and what's on people's minds about what's on other people's minds, and so on: mental models of the world, being able to infer what people are thinking about. On being able to infer what people are looking at, there's been a lot of exciting work here at MIT.
But we're not even close to solving that problem either. And what they're thinking about, we haven't even begun to really think about that problem, and we do it trivially as human beings. And I think at the core of that, and I'm harping on the visual perception problem because it's one we really take for granted as human beings, especially when trying to solve real-world problems, especially when trying to solve autonomous driving, is that we have 540 million years of data for visual perception, so we take it for granted. We don't realize how difficult it is. And we kind of focus all our attention on this recent development of 100,000 years of abstract thought, being able to play chess, being able to reason. But visual perception is nevertheless extremely difficult at every single layer of what's required to perceive, interpret, and understand the fundamentals of a scene. And a trivial way to show that is all the ways you can mess with these image classification systems by adding a little bit of noise. Over the last few years there have been a lot of papers, a lot of work, showing that you can mess with these systems by adding noise. Here the system predicts a dog with 99% confidence; add a little bit of distortion and the system immediately predicts with 99% confidence that it's an ostrich.
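The way many tiny pixel changes add up can be sketched on a toy model. This is a swapped-in illustration, not the deep image classifiers from the papers mentioned: it uses a hypothetical linear "dog versus not dog" logistic classifier with made-up weights, and perturbs each pixel by a small step against the sign of its weight, in the style of the fast gradient sign method. The image, weights, and `epsilon` are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return 1.0 if v > 0 else -1.0

def predict_dog_probability(weights, pixels):
    """Logistic 'dog vs. not dog' score of a flattened image."""
    return sigmoid(sum(w * x for w, x in zip(weights, pixels)))

n_pixels = 1000
# Hypothetical trained weights: alternating +1 / -1.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(n_pixels)]

# Image content: strong contrast (+/-0.3) that the weights cancel out, plus a
# faint pattern (+/-0.005) aligned with the weights that makes the model say "dog".
image = [0.5 + 0.3 * (1.0 if (i // 2) % 2 == 0 else -1.0) + 0.005 * sign(w)
         for i, w in enumerate(weights)]

# For the true label "dog", the loss gradient with respect to each pixel has the
# sign of -w, so a step of epsilon along that sign lowers the dog score fastest.
epsilon = 0.01
adversarial = [x - epsilon * sign(w) for x, w in zip(image, weights)]

p_clean = predict_dog_probability(weights, image)
p_adversarial = predict_dog_probability(weights, adversarial)
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))
```

No pixel moves by more than 0.01 on a 0-to-1 scale, far less than the 0.3 contrast in the image, yet the predicted probability of "dog" drops from above 99% to below 1%.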
And you can do that kind of manipulation with just a single pixel. So that's just a clean way to show the gap between image classification on an artificial dataset like ImageNet and the real-world perception that has to be solved, especially for life-critical situations like autonomous driving. I really like this visualization from Max Tegmark of the rising sea over the landscape of human competence, from Hans Moravec. And this is the difference, as we progress forward and discuss some of these machine learning methods: there is human intelligence, general human intelligence, let's call it Einstein here, that's able to generalize over all kinds of problems, from the common sense to the incredibly complex. And then there is the way we've been doing especially data-driven machine learning, which is savants, which is specialized intelligence.
Extremely smart at a particular task, but not able to transfer except in a very narrow neighborhood on this little landscape of different skills: arts, cinematography, and book writing at the peaks, and chess, arithmetic, theorem proving, and vision at the bottom, in the lake. And there's this rising sea as we solve problem after problem. The question is: can the methodology and the approach of deep learning, of everything we're doing now, keep the sea rising? Or do fundamental breakthroughs have to happen in order to generalize and solve these problems? On the specialized side, where the successes are, the systems essentially boil down to this: given a dataset and the ground truth for that dataset, here the apartment cost in the Boston area, be able to input several parameters and, based on those parameters, predict the apartment cost. That's the basic premise behind the successful supervised deep learning systems of today. If you have good enough data, there's good enough ground truth, and the problem can be formalized, we can solve it. Some of the recent promise, on which we'll do an entire series of lectures in the third week, is in deep reinforcement learning, which showed that from raw sensory information with very little annotation, through self-play, systems that learn without human supervision are able to perform extremely well in these constrained contexts. Take the question of a video game.
Here, Pong from pixels: being able to perceive the raw pixels of this Pong game as raw input and learn the fundamental physics of this game, to understand how this game behaves and how to win it. That's kind of a step toward general-purpose artificial intelligence. But it is a very small step, because it's in a simulated, very trivial situation. That's the challenge that's before us: with less and less human supervision, be able to solve huge real-world problems. Up top is supervised learning, where the majority of the teaching is done by human beings through the annotation process, through labeling all the data, by showing different examples. And further and further down are semi-supervised learning, reinforcement learning, and unsupervised learning: removing the teacher from the picture, and making that teacher extremely efficient when it is needed.
Of course, data augmentation is one way, as we'll talk about: taking a small number of examples and messing with that set of examples, augmenting it through trivial and through complex methods of cropping, stretching, shifting, and so on, including through generative networks, modifying those images to grow a small dataset into a large one, to decrease further and further the input of the human teacher. But still, that's quite far away from the incredibly efficient teaching and learning that humans do. This is a video, and there are many of them online, of a human baby walking for the first time. We learn to do this, you know, it's one-shot learning. One day you're on all fours, and the next day you put your two hands up and then you figure out the rest. One shot. Well, kind of-ish.
You can kind of play around with it. But the point is you're extremely efficient: with only a few examples you're able to learn the fundamental aspects of how to solve a particular problem. Machines in most cases need a million, and sometimes more, examples, depending on the life-critical nature of the application. The data flow of supervised learning systems is this: there's input data, there's a learning system, and there is output. Now, in the training stage, for the output we have the ground truth, and so we use that ground truth to teach the system. In the testing stage, when it goes out into the wild, there's new input data over which we have to generalize with the learning system and make our best guess. In the training stage, the process with neural networks is: given the input data for which we have the ground truth, pass it through the model and get the prediction.
And given that we have the ground truth, we can compare the prediction to the ground truth, look at the error, and based on the error adjust the weights. The types of predictions we can make are regression and classification. Regression is continuous and classification is categorical. Here, if we look at weather, the regression problem asks what the temperature is going to be tomorrow, and the classification formulation of that problem asks whether it is going to be hot or cold, with some threshold definition of what hot or cold is. That's regression and classification. On the classification front, it can be multi-class, which is the standard formulation, where a particular entity can only be one thing. And then there's multi-label, where a particular entity can be multiple things. And overall, the input to the system doesn't have to be a single sample of the particular dataset, and the output doesn't have to be a single sample of the ground truth dataset.
It can be a sequence: sequence to sequence, a single sample to a sequence, a sequence to a single sample, and so on. From video captioning to translation to natural language generation to, of course, the one-to-one of general computer vision. Okay, that's the bigger picture. Let's step back from the big to the small, to a single neuron, inspired by our own brain, by the biological neural networks in our brain, the computational block that is behind a lot of the intelligence in our mind. The artificial neuron has inputs with weights on them, plus a bias, an activation function, and an output. It's inspired by this thing, which I showed before.
Here is visualized the thalamocortical system, with 3 million neurons and 476 million synapses. The full brain has 100 billion neurons and 1,000 trillion synapses. ResNet and some of the other state-of-the-art networks have in the tens or hundreds of millions of edges, of synapses. So the human brain has about 10 million times more synapses than artificial neural networks (1,000 trillion against roughly 100 million), and there are other differences. The topology is asynchronous and not constructed in layers. The learning algorithm for artificial neural networks is backpropagation; for our biological networks we don't know. That's one of the mysteries of the human brain. There are ideas, but we really don't know.
On power consumption, human brains are much more efficient than neural networks. That's one of the problems that we're trying to solve, and ASICs are beginning to solve some of these problems. And on the stages of learning: in biological neural networks you really never stop learning. You're always learning, always changing, both the hardware and the software. In artificial neural networks, oftentimes there's a distinct training stage and a distinct testing stage, when you release the thing into the wild. Online learning is an exceptionally difficult thing that we're still in the very early stages of. This neuron is the fundamental computational block behind neural networks. It takes a few inputs, applies weights, which are the parameters that are learned, sums them up, adds the bias, also a learned parameter, puts the result into a nonlinear activation function, and gives an output.
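The computation of a single neuron described here fits in a few lines. This is a minimal sketch: the sigmoid is just one possible choice of nonlinear activation function, and the input, weight, and bias values are hypothetical.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values chosen only for illustration.
output = neuron(inputs=[1.0, 0.0, 1.0], weights=[0.5, -1.0, 0.5], bias=-1.0)
```

Here the weighted sum plus bias is exactly zero, so the sigmoid returns 0.5; larger weighted sums push the output toward 1 ("excited") and smaller ones toward 0.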
And the task of this neuron is to get excited based on certain aspects of the layers, features, and inputs that came before. And in that ability to discriminate, to get excited by certain things and not excited by other things, it holds a little piece of information at whatever level of abstraction it is. So when you combine many of them together you have knowledge: different levels of abstraction form a knowledge base that's able to represent, understand, or even act on a particular set of raw inputs. And you stack these neurons together in layers, increasing both in width and in depth, and there are a lot of different architecture variants. But they begin at this basic fact: with just a single hidden layer of a neural network, the possibilities are endless. It can approximate any arbitrary function. A neural network with a single hidden layer, given enough neurons in that layer, can approximate any function.
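One concrete way to see the single-hidden-layer claim, at least for a smooth one-dimensional function, is to set the weights by hand rather than learn them. The sketch below is an illustration under assumptions: it uses ReLU hidden units whose combined output is the piecewise-linear interpolant of the target, with `math.sin` on the interval from 0 to pi as an arbitrary example target. The function names are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_network(f, lo, hi, n_hidden):
    """Hand-set weights of a one-hidden-layer ReLU net that interpolates f on [lo, hi]."""
    step = (hi - lo) / n_hidden
    knots = [lo + i * step for i in range(n_hidden + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n_hidden)]
    # Hidden unit i computes relu(x - knots[i]); its output weight is the
    # change in slope of the interpolant at that knot.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    hidden_biases = [-k for k in knots[:-1]]
    out_bias = f(lo)
    return hidden_biases, out_weights, out_bias

def network(x, hidden_biases, out_weights, out_bias):
    """Forward pass: one hidden ReLU layer followed by a linear output unit."""
    hidden = [relu(x + b) for b in hidden_biases]
    return out_bias + sum(w * h for w, h in zip(out_weights, hidden))

params = build_relu_network(math.sin, 0.0, math.pi, n_hidden=50)
grid = [math.pi * k / 1000 for k in range(1001)]
max_error = max(abs(network(x, *params) - math.sin(x)) for x in grid)
```

With 50 hidden units the worst-case error on this interval is below one thousandth, and it shrinks further as hidden units are added.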
That means any other neural network, with multiple layers and so on, is just an interesting optimization of how we can discover those functions. The possibilities are endless. And the other aspect here is that the mathematical underpinnings of neural networks, with the weights and the differentiable activation functions, are such that the few steps from the inputs to the outputs are deeply parallelizable. And that's the other aspect, on the compute side: the parallelizability of neural networks is what enables some of the exciting advancements on the graphics processing unit, the GPU, and with ASICs, TPUs: the ability to run across machines, across GPU units, at very large distributed scale to train and perform inference on neural networks. Activation functions: these activation functions, put together, are tasked with optimizing a loss function. For regression, that loss function is usually mean squared error.
There are a lot of variants. And for classification it is cross-entropy loss. In the cross-entropy loss the ground truth is zero or one. In the mean squared error it's a real number. And so with the loss function, and the weights, the bias, and the activation functions propagating forward through the network from the input to the output, we use the algorithm of backpropagation (I wish I did an entire lecture on it last time) to adjust the weights: to have the error flow backwards through the network and adjust the weights such that, once again, the weights that were responsible for producing the correct output are increased, and the weights that were responsible for producing the incorrect output are decreased.
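The two loss functions and one round of forward pass, backward pass, and weight update can be sketched for the smallest possible case. This is an illustrative sketch, not the lecture's code: it trains a single sigmoid neuron with one weight on one hypothetical training example, and the gradients are written out by hand from the chain rule rather than computed by a library.

```python
import math

def mean_squared_error(predictions, targets):
    """Regression loss: targets are real numbers."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def binary_cross_entropy(p, y):
    """Classification loss: the ground truth y is 0 or 1, p is the predicted probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def forward(w, b, x):
    """Forward pass of a single sigmoid neuron."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def backward(w, b, x, y):
    """Backward pass: gradients of the cross-entropy loss with respect to w and b."""
    error = forward(w, b, x) - y  # dLoss/dz for sigmoid plus cross-entropy
    return error * x, error

w, b = 0.0, 0.0
x, y = 2.0, 1.0          # one hypothetical training example
learning_rate = 0.5
losses = []
for _ in range(20):
    losses.append(binary_cross_entropy(forward(w, b, x), y))
    grad_w, grad_b = backward(w, b, x, y)
    w -= learning_rate * grad_w  # step against the gradient
    b -= learning_rate * grad_b

regression_loss = mean_squared_error([2.5, 0.0], [3.0, -0.5])
```

The loss starts at ln 2 (the neuron outputs 0.5 before any training) and falls with every update, because each step moves the weight and bias in the direction that makes the correct output more likely.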
The forward pass gives you the error; the backward pass computes the gradients. And based on the gradients, the optimization algorithm, combined with the learning rate, adjusts the weights. The learning rate is how fast the network learns. And all of this is possible on the numerical computation side with automatic differentiation. The optimization problem, given those gradients that are computed in the backward flow through the network, is solved with stochastic gradient descent. There are a lot of variants of these optimization algorithms that address various problems, from dying ReLUs to vanishing gradients. There are a lot of different parameters, momentum and so on, that really just boil down to all the different problems that are solved with nonlinear optimization. Then there's mini-batch size: what is the right size of a batch, or really it's called a mini-batch when it's not the entire dataset, based on which you compute the gradients to adjust the learning?
Do you do it over a very large amount, or do you do it with stochastic gradient descent for every single sample of the data? If you listen to Yann LeCun and a lot of recent literature, small mini-batch sizes are good. He says, "Training with large mini-batches is bad for your health. More importantly, it's bad for your test error. Friends don't let friends use mini-batches larger than 32." A larger batch size means more computational speed because you don't have to update the weights as often. But a smaller batch size empirically produces better generalization. The problem we're often trying to solve, on the broader scale of learning, is overfitting. And the way we solve it is regularization.
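To make the training loop described above concrete (forward pass, backward pass, weight update scaled by the learning rate, one mini-batch at a time), here is a minimal sketch of mini-batch stochastic gradient descent fitting a straight line. The toy data, the function name, and the hyperparameter values are assumptions chosen for illustration, and the gradients are written out by hand rather than obtained through automatic differentiation.

```python
import random

def sgd_fit_line(xs, ys, learning_rate=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a new random split into mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction error on this mini-batch.
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Backward pass: gradients of the mean squared error w.r.t. w and b.
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Update: step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 with no noise.
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_fit_line(xs, ys)
```

Changing `batch_size` here trades off how often the weights are updated against how noisy each gradient estimate is, which is the trade-off the quote about mini-batches of 32 is pointing at.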
We want to train on a dataset without memorizing it to an extent that you only do well on that training dataset. So you want it to be generalizable into the future, to the things that you haven't seen yet. So obviously this is a problem for small datasets and also for the sets of parameters that you choose. Shown here is an example of a sine curve trying to fit a particular set of data versus a 9th degree polynomial trying to fit a particular set of data, the blue dots. The 9th degree polynomial is overfitting. It does very well for that particular set of samples but does not generalize well in the general case. And the trade-off here is that, as you train further and further, at a certain point there's a deviation between the error being decreased to 0 on the training set and going to 1 on the test set. And that's the balance we have to strike.
That's done with the validation set. So you take a piece of the training set for which you have the ground truth, you call it the validation set, and you set it aside. And you evaluate the performance of your system on that validation set. And after you notice that your trained network is performing poorly on the validation set for a prolonged period of time, that's when you stop. That's early stopping. Basically it's getting better and better and better, and then for some period of time, there's always noise of course, and after some period of time it's definitely getting worse. And that's where we need to stop. So that provides an automated way of discovering when you need to stop.
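The early stopping rule described above can be sketched in a few lines. This is one common way to phrase it, not necessarily the exact rule used in the course: the `patience` parameter (how many epochs without improvement count as a "prolonged period") and the list of validation losses are assumptions for illustration.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the index of the best epoch seen before training is stopped.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for a prolonged period: stop here
    return best_epoch

# Validation loss improves, gets noisy, then clearly gets worse.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.40]
best = early_stopping_epoch(losses, patience=3)
```

With these numbers the loop stops at epoch 6 and keeps the weights from epoch 3. The later value of 0.40 is never reached, which is the price paid for stopping automatically.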
And there's a lot of other regularization methodologies. Of course, as I mentioned, dropout is a very interesting approach, and its variants, of simply, with a certain probability, randomly removing nodes in the network, both the incoming and outgoing edges, randomly throughout the training process. And there's normalization. Normalization is obviously always applied at the input. So whenever you have a dataset that has different lighting conditions, different variations, data from different sources and so on, you have to put it all on the same level ground so that we're learning the fundamental aspects of the input data as opposed to some less relevant semantic information like lighting variation and so on. So we usually always normalize. For example, if it's computer vision with pixels from 0 to 255, you always normalize to 0 to 1, or negative 1 to 1, or normalize based on the mean and the standard deviation. That's something you should almost always do.
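Both ideas above, input normalization and dropout, are small enough to sketch directly. The function names are made up for illustration. The dropout shown here is the "inverted" variant that rescales the surviving activations during training; that rescaling is an assumption on my part, since the lecture only describes randomly removing nodes.

```python
import random

def normalize_pixels(pixels):
    """Scale 8-bit pixel values from the range 0..255 to the range 0..1."""
    return [p / 255.0 for p in pixels]

def standardize(values):
    """Normalize based on the mean and the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant inputs
    return [(v - mean) / std for v in values]

def dropout(activations, drop_probability, rng, training=True):
    """Zero each activation with probability `drop_probability` (must be < 1).

    Survivors are divided by the keep probability so the expected
    value of each activation is unchanged. At inference nothing is dropped.
    """
    if not training or drop_probability == 0.0:
        return list(activations)
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
scaled = normalize_pixels([0, 128, 255])
centered = standardize([1.0, 2.0, 3.0])
dropped = dropout([1.0] * 1000, drop_probability=0.5, rng=rng)
```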
The thing that enabled a lot of breakthrough performances in the past few years is batch normalization. It's performing this same kind of normalization later on in the network, looking at the inputs to the hidden layers and normalizing based on the batch of data on which you're training, normalized based on the mean and the standard deviation. Batch renormalization fixes a few of the challenges with batch normalization, which is that, given that you're normalizing during training on the mini-batches in the training dataset, that doesn't directly map to the inference stage in testing. And so, by keeping a running average across both training and testing, you're able to asymptotically approach a global normalization. So there's this idea, across all the weights, not just the inputs, to normalize the world at all the levels of abstraction that you're forming. And batch renorm solves a lot of these problems during inference. And there's a lot of other ideas, from layer to weight to instance normalization to group normalization.
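Here is a minimal sketch of the batch normalization step for a single feature, together with the running average that is kept for use at inference. The names `gamma`, `beta`, `epsilon`, and `momentum` follow common convention but are assumptions here, and this sketch covers plain batch normalization only, not the extra correction terms of batch renormalization.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, epsilon=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it.

    Returns the normalized values plus the batch mean and variance,
    which the caller can fold into running averages.
    """
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    scale = (variance + epsilon) ** 0.5  # epsilon avoids division by zero
    normalized = [gamma * (x - mean) / scale + beta for x in batch]
    return normalized, mean, variance

def update_running(running_value, batch_value, momentum=0.9):
    """Exponential running average of a batch statistic, used at inference."""
    return momentum * running_value + (1.0 - momentum) * batch_value

# Inputs to a hidden unit for one mini-batch of four samples.
normalized, mean, variance = batch_norm([1.0, 2.0, 3.0, 4.0])
running_mean = update_running(0.0, mean)
```

During training each mini-batch is normalized with its own statistics. At inference there may be no batch at all, so the running averages stand in for them, which is the mismatch between training and testing that the paragraph above refers to.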
And you can play with a lot of these ideas in the TensorFlow Playground at playground.tensorflow.org, which I highly recommend. So now let's run through a bunch of different ideas, some of which we'll cover in future lectures, of what all of this is in this world of deep learning: from computer vision to deep reinforcement learning, to the different small-level techniques, to the large natural language processing. So, convolutional neural networks, the thing that enables image classification. These convolutional filters slide over the image and are able to take advantage of the spatial invariance of visual information: that the features associated with a cat in the top left corner are the same as the features associated with cats in the top right corner, and so on. Images are just a set of numbers, and our task is to take that image and produce a classification, and to use the spatial invariance of visual information to slide a convolution filter across the image and learn that filter, as opposed to assigning equal value to features that are present in various regions of the image. And stacked on top of each other, these convolution filters can form high-level abstractions of visual information in images. With AlexNet, as I've mentioned, on the ImageNet dataset and challenge captivating the world with what is possible with neural networks, these have been further and further improved, surpassing human performance, with, of special note, GoogLeNet with the inception module.
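The sliding convolution filter described above can be sketched in plain Python. This is an illustrative version with no padding and a stride of 1; like most deep learning libraries it does not flip the kernel, so strictly speaking it computes a cross-correlation. The tiny image and the hand-written edge filter are made-up examples; in a real network the filter values are learned.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (lists of rows) and sum the elementwise products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for row in range(out_h):
        output_row = []
        for col in range(out_w):
            total = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[row + i][col + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output

# A 3x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
response = conv2d(image, vertical_edge)
```

The same filter weights are applied at every position, so the edge produces the same response wherever it appears in the image. That weight sharing is what the spatial invariance argument refers to.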
There's different ideas that came along: ResNet with the residual blocks and SENet most recently. So the object detection problem is the next step in visual recognition. Image classification is just taking the entire image and saying what's in the image. Object detection and localization is saying: find all the objects of interest in the scene and classify them. The region-based methods, like Faster R-CNN shown here, take the image, use a convolutional neural network to extract features in that image, and generate region proposals: here's a bunch of candidates that you should look at. And within those candidates, it classifies what they are and generates four parameters, the bounding box that captures that thing. So object detection and localization ultimately boils down to a bounding box, a rectangle, with a class that's the most likely class in that bounding box.
And you can really summarize region-based methods as: you generate the region proposals (here's a little pseudocode), do a for loop over the region proposals, and perform detection inside that for loop. The single-shot methods remove the for loop. It's a single pass through. Take, for example, SSD, shown here. Take a pre-trained neural network that's been trained to do image classification and stack a bunch of convolutional layers on top. From each layer, extract features that are then able to generate, in a single pass, the bounding box predictions and the classes associated with those bounding boxes. The trade-off here, and this is where the popular YOLO v1, v2, and v3 come from, is oftentimes between performance and accuracy. So single-shot methods are often less performant, especially in terms of accuracy on objects that are really far away, or rather objects that are small in the image, or really large.
Then the next step up in visual perception, visual understanding, is semantic segmentation. That's what the tutorial that we presented here on GitHub is covering. Semantic segmentation is the task of, as opposed to classifying the entire image or detecting the objects as a bounding box, assigning at a pixel level the boundaries of what the object is. In full scene segmentation, every single pixel is classified by which class that pixel belongs to. And the fundamental aspect there, which we'll cover a little bit, or a lot more, on Wednesday, is taking an image classification network, chopping it off at some point, having it perform the encoding step of compressing a representation of the scene, and then taking that representation with a decoder and upsampling in a dense way to the pixel-level classification. For that upsampling there's a lot of tricks that we'll talk through that are interesting, but it ultimately boils down to the encoding step of forming a representation of what's going on in the scene, and then the decoding step that upsamples to the pixel-level annotation, the classification of all the individual pixels. And as I mentioned here, the underlying idea applied most extensively, most successfully in computer vision is transfer learning. The most commonly applied way of transfer learning is taking a pre-trained neural network like ResNet and chopping it off at some point.
It's chopping off the fully connected layer or layers, some parts of the layers, and then taking a new dataset and retraining that network. So what is this useful for? For every single application of computer vision in industry, when you have a specific application, like you want to build a pedestrian detector. If you want to build a pedestrian detector and you have a pedestrian dataset, it's useful to take ResNet trained on ImageNet or COCO, trained in the general case of vision perception, and take that network, chop off some of the layers, and then retrain on your specialized pedestrian dataset. And depending on how large that dataset is, some of the previous layers from the pre-trained network should be fixed, frozen, and sometimes not, depending on how large the data is. And this is extremely effective in computer vision, but also in audio, speech, and NLP.
And so, as I mentioned with the pre-trained networks, they are ultimately forming representations of the data based on which the classification, the regression, the prediction is made. But the cleanest example of this is the autoencoder, forming representations in an unsupervised way. The input is an image and the output is that exact same image. So why do we do that? Well, if you add a bottleneck in the network, where the network is narrower in the middle than it is on the inputs and the outputs, it's forced to compress the data down into a meaningful representation. That's what the autoencoder does. You're training it to reproduce the input at the output, and to reproduce it with a latent representation that is smaller than the original raw data. That's a really powerful way to compress the data.
It's used for removing noise and so on, but it's also just an effective way to demonstrate a concept. It can also be used for embeddings: you have a huge amount of data and you want to form a compressed, efficient representation of that data. Now, this is completely unsupervised. In practice, if you want to form an efficient, useful representation of the data, you want to train it in a supervised way. You want to train it on a discriminative task where you have labeled data and the network is trained to identify cat versus dog. A network that's trained in a discriminative way, in an annotated, supervised learning way, is able to form a better representation. But nevertheless the concept stands. And one way to visualize these concepts is a tool that I really love, projector.tensorflow.org, which is a way to visualize these different representations, these different embeddings.
You should definitely play with it, and you can insert your own data. Okay. Going further and further in this direction of unsupervised learning and forming representations is generative adversarial networks: from these representations, being able to generate new data. And the fundamental methodology of GANs is to have two networks. One is the generator, one is the discriminator, and they compete against each other in order for the generator to get better and better at generating realistic images. The generator's task is, from noise, to generate images, based on a certain representation, that are realistic. And the discriminator is the critic that has to discriminate between real images and those generated by the generator. And both get better together.
The generator gets better and better at generating realistic images to trick the discriminator, and the discriminator gets better and better at telling the difference between real and fake, until the generator is able to generate some incredible things. So, shown here in the work by NVIDIA, the ability to generate realistic faces has skyrocketed in the past three years. These are samples of celebrity photos that they have been able to generate. Those are all generated by a GAN. There's the ability to generate temporally consistent video over time with GANs. And then there's the ability, shown at the bottom right, from NVIDIA (I'm sure they'll also talk about it), on a pixel level, from the semantic pixel segmentation on the right, to generate completely the scene on the left: all the raw, rich, high-definition pixels on the left.
The natural language processing world, same: forming representations, forming embeddings with word2vec, the ability, from words, to form representations that can then be used efficiently to reason about the words. The whole idea of forming representations of the data is taking a huge vocabulary of, you know, a million words. You want to be able to map it into a space where words that are far apart from each other in a Euclidean sense, in the Euclidean distance between words, are semantically far apart from each other as well. So things that are similar are together in that space. And one way of doing that, with skip-grams for example, is looking at a source text and turning a large body of text into a supervised learning problem by learning to predict, from a particular word, all its neighbors. So you train a network on the connections that are commonly seen in natural language. And based on those connections, you're able to know which words are related to each other.
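The step of turning raw text into a supervised learning problem can be sketched as generating (word, neighbor) training pairs. This is only the data preparation side of skip-grams, not the network itself; the `window` parameter and the example sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center word, neighboring word) pairs within `window` positions."""
    pairs = []
    for center, word in enumerate(tokens):
        start = max(0, center - window)
        stop = min(len(tokens), center + window + 1)
        for neighbor in range(start, stop):
            if neighbor != center:
                pairs.append((word, tokens[neighbor]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
```

Each pair is one training example: the first word is the input and the second is the label to predict. No human annotation is needed, because the labels come from the text itself.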
Now the main thing here, and I won't get into too many details, is that with the input vector representing the words and the output vector representing the probability that those words are connected to each other, both are thrown away in the end. The main thing is the middle, the hidden layer. That representation gives you the embedding that represents these words in such a way that, in the Euclidean space, the ones that are close together are semantically close together and the ones that are not are semantically far apart. And natural language and other sequence data, text, speech, audio, video, relies on recurrent neural networks. Recurrent neural networks are able to learn temporal dynamics in the data, sequence data, and are able to generate sequence data. The challenge is that they're not able to learn long-term context. Because a recurrent neural network is trained by unrolling it and doing backpropagation, without any tricks the backpropagated gradient fades away very quickly. So you're not able to memorize the context in a longer form of the sentences unless there are extensions. Here, with LSTMs and GRUs, long-term dependency is captured by allowing the network to forget information and allowing it to freely pass information through in time.
So: what to forget, what to remember, and every time deciding what to output. And all of those aspects have gates that are all trainable, with sigmoid and tanh functions. Bidirectional recurrent neural networks, from the nineties, are an extension often used for providing context in both directions. So recurrent neural networks, simply defined in the vanilla way, are learning representations for what happened in the past. Now, in many cases it's not a real-time operation, in that you're able to also look into the future; you look into the data that follows after the sequence. So it's beneficial to do a forward pass through the network beyond the current point and then back. The encoder-decoder architecture in recurrent neural networks is used very much when the sequence on the input and the sequence on the output are not required to be of the same length.
The task is to first, with the encoder network, encode everything on the input sequence. So this is useful for machine translation, for example: encoding all the information in the input sequence in English, and then, in the language you're translating to, given that representation, keep feeding it into the decoder recurrent neural network to generate the translation. The input might be much smaller or much larger than the output. That's the encoder-decoder architecture. And then there's improvements. Attention is the improvement on this encoder-decoder architecture that allows you to, as opposed to taking the input sequence, forming a representation of it, and that's it, actually look back at different parts of the input.
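One way to picture "looking back at different parts of the input" is the sketch below. It assumes a simple dot-product score between a decoder query vector and each encoder state; the lecture does not say which scoring function is used, so that choice, the function names, and the toy vectors are all assumptions.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Weight each encoder state by its dot-product similarity to the query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return context, weights

# One state per input position; the query comes from the decoder.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([0.0, 5.0], encoder_states)
```

Instead of a single fixed summary of the input, the decoder gets a fresh weighted mix of the encoder states at every output step, with the weights showing which input positions it is attending to.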
So you're not just relying on the single vector representation of the entire input. And a lot of excitement has been around the idea, as I mentioned, that some of the dream of artificial intelligence and machine learning in general has been to remove the human more and more from the picture, being able to automate some of the difficult tasks. So AutoML from Google, and just the general concept of neural architecture search, NASNet: the ability to automate the discovery of the parameters of a neural network and the ability to discover the actual architecture that produces the best result. So with neural architecture search, you have basic modules similar to the ResNet modules. And with a recurrent neural network, you keep assembling a network together, assembling it in such a way that it minimizes the loss of the overall classification performance. And it's shown that you can then construct a neural network that's much more efficient and much more accurate than the state of the art on classification tasks like ImageNet, here shown with a plot.
Or at the very least competitive with the state of the art, SENet. It's super exciting that, as opposed to, like I said, stacking Lego pieces yourself, the final result is essentially that you step back and you say: I have a dataset with the labels, with the ground truth, which is what the dream of Google AutoML is. I have the dataset, you tell me what kind of neural network will do best on this dataset. And that's it. So all you bring is the data; it constructs the network through this neural architecture search and returns to you the model, and that's it. It makes it possible to solve many of the real-world problems that essentially boil down to: I have a few classes I need to be very accurate on, here's my dataset. And then that converts the problem of a deep learning researcher to the problem of what's more commonly called a data science engineer, where the task, as I said, focuses on what is the right question and what is the right data to solve that question. And deep reinforcement learning, taking further steps along the path of decreasing human input: deep reinforcement learning is the task of an agent acting in the world based on the observations of the state and the rewards received in that state.
It knows very little about the world and learns from the very sparse nature of the reward, sometimes only, in the gaming context, when you win or lose, or in the robotics context, when you successfully accomplish a task or not. With that very sparse reward, it is able to learn how to behave in that world. Here, with cats learning how the bell maps to the food, and a lot of the amazing work at OpenAI and DeepMind on robotic manipulation and navigation through self-play in simulated environments. And of course the best is our own deep reinforcement learning competition with DeepTraffic, in which all of you can participate. And I encourage you to try to win that. With no supervised knowledge, no human supervision, through sparse rewards from the simulation or through self-play constructs, these agents are able to learn how to operate successfully in this world. And those are the steps we're taking towards artificial general intelligence. This is the exciting part, from the breakthrough ideas that we'll talk about on Wednesday.
From natural language processing to generative adversarial networks, able to generate arbitrary data, high-resolution data, create data really from this understanding of the world, to deep reinforcement learning, being able to learn how to act in the world with very little input from human supervision: it's taking further and further steps. And there's been a lot of exciting ideas going by different names, sometimes misused, sometimes overused, sometimes misinterpreted: transfer learning, meta-learning, and hyperparameter and architecture search. Basically removing the human as much as possible from the menial tasks and involving the human only on the fundamental side, as I mentioned with the racing boat, on the ethical side, and on the things that us humans at least pretend to be quite good at, which is understanding the fundamental big questions, understanding the data that empowers us to solve real-world problems, and understanding the ethical balance that needs to be struck in order to solve those problems well. And as I show on the bottom right, that's our job here in this room, our job for all the engineers in the world: to solve these problems and progress forward through the current summer, and through the winter if it ever comes. So with that I'd like to thank you, and you can get the videos, code, and so on online at deeplearning.mit.edu. Thank you very much, guys.