What is the 0-1 loss function? Why can't the 0-1 loss function or classification error be used as a loss function for optimizing a deep neural network?

The 0-1 loss function is a basic method used in classification to count the number of samples a hypothesis/prediction function (machine learning model) misclassifies on given training data. Each data point accumulates a loss of 1 if it is misclassified and 0 if it is correctly classified. The normalized zero-one loss returns the fraction of misclassified training samples, also referred to as the training error.
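The definition above can be sketched in a few lines of NumPy; the labels and predictions here are hypothetical placeholders:

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions for a binary task.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

# Unnormalized 0-1 loss: the count of misclassified samples.
zero_one_count = int(np.sum(y_pred != y_true))

# Normalized 0-1 loss: the fraction misclassified, i.e. the training error.
zero_one_loss = float(np.mean(y_pred != y_true))

print(zero_one_count)  # 2
print(zero_one_loss)   # 0.4
```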

Why can’t the 0-1 loss function or classification error be used as a loss function for optimizing a deep neural network?

The zero-one loss is often used to evaluate classifiers in binary and multi-class classification settings. However, it is rarely useful for guiding optimization procedures: the 0-1 loss is non-convex and discontinuous, and because it is piecewise constant, its gradient is zero wherever it is defined and undefined at the jumps. Gradient-based methods therefore receive no signal telling them which direction improves the model, and directly minimizing the 0-1 loss is computationally intractable in general. In practice, deep networks are instead trained with differentiable surrogate losses, such as cross-entropy, that approximate the 0-1 loss while providing useful gradients.
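The "no gradient signal" point can be illustrated with a small sketch: for a toy one-parameter linear classifier (the data and weight below are made up for illustration), a finite-difference estimate of the 0-1 loss gradient is zero, while the cross-entropy surrogate yields a usable descent direction:

```python
import numpy as np

def zero_one(w, x, y):
    # Predict class 1 if w*x > 0; loss is 1 on a mistake, 0 otherwise.
    pred = (w * x > 0).astype(int)
    return np.mean(pred != y)

def cross_entropy(w, x, y):
    # Differentiable surrogate: binary cross-entropy on sigmoid outputs.
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy data: three 1-D points with binary labels (illustrative only).
x = np.array([1.0, 2.0, -1.0])
y = np.array([1, 1, 0])

w, eps = 0.5, 1e-4
# Central finite-difference "gradient" of each loss at w.
g01 = (zero_one(w + eps, x, y) - zero_one(w - eps, x, y)) / (2 * eps)
gce = (cross_entropy(w + eps, x, y) - cross_entropy(w - eps, x, y)) / (2 * eps)

print(g01)  # 0.0 -- the loss surface is flat almost everywhere
print(gce)  # nonzero -- the surrogate provides a descent direction
```

The 0-1 loss only changes when a prediction flips sign, so a small perturbation of the weight leaves it unchanged; the cross-entropy loss responds smoothly to the same perturbation.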