Deep neural networks have achieved great success in many applications. However, for a long time there has been no rigorous theoretical explanation of why deep networks work so well. In this talk, I will introduce my recent work on the optimization theory for training deep neural networks with the Rectified Linear Unit (ReLU) activation function. In particular, I focus on the binary classification problem and show that, for a broad family of loss functions and with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. I will further discuss our understanding of the optimization process for training deep ReLU networks, as well as the basic ideas behind the proofs of the theoretical results.
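
As an illustrative sketch (not taken from the talk), the NumPy snippet below trains an over-parameterized two-layer ReLU network with full-batch gradient descent on a toy binary classification dataset using the logistic loss. With random Gaussian initialization and a hidden width much larger than the number of samples, the training loss is typically driven close to zero. The data, width, learning rate, and the simplification of training only the first layer with a fixed output layer are arbitrary choices for demonstration, not the setting of the theoretical results.

```python
import numpy as np

# Hypothetical demonstration: over-parameterized two-layer ReLU network,
# full-batch gradient descent, logistic loss, toy binary classification data.
rng = np.random.default_rng(0)

n, d, m = 20, 5, 2000                              # n samples, input dim d, hidden width m >> n
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
y = rng.choice([-1.0, 1.0], size=n)                # labels in {-1, +1}

W = rng.standard_normal((d, m)) * np.sqrt(2.0 / d) # random Gaussian first-layer weights
v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed random output layer (simplification)

lr = 1.0
for step in range(2001):
    H = X @ W                                      # pre-activations, shape (n, m)
    A = np.maximum(H, 0.0)                         # ReLU activations
    f = A @ v                                      # network outputs, shape (n,)
    margins = y * f
    loss = np.mean(np.log1p(np.exp(-margins)))     # logistic (cross-entropy) training loss

    # Gradient of the loss with respect to the first-layer weights W.
    g = -y / (1.0 + np.exp(margins))               # d(loss_i)/d(f_i), before the 1/n factor
    dA = np.outer(g, v)                            # d(loss)/dA, shape (n, m)
    dH = dA * (H > 0.0)                            # ReLU derivative
    dW = X.T @ dH / n
    W -= lr * dW

    if step % 500 == 0:
        print(f"step {step:4d}  training loss {loss:.6f}")
```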