Is softmax cross-entropy the best loss for categorical outcomes in deep learning?
I have been working with deep learning and TensorFlow for a while. Surprisingly, the majority of textbooks and lectures use softmax cross-entropy as the loss function for categorical outcomes, as if it were the default and the best choice. My mathematical intuition is that the nonlinearity of the log interferes with finding the optimal cost by biasing the loss toward extreme values, which can lead to oscillation during optimization. So I set out to design a new loss function based on cosine similarity with L2 normalization. This function treats each outcome dimension equally and does not bias toward extremes the way cross-entropy does.
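Written out, the loss I have in mind is simply the negative cosine similarity between the L2-normalized logits and the one-hot label, compared with the usual softmax cross-entropy:

$$
\mathcal{L}_{\cos} = -\sum_{n}\sum_{k} \frac{z_{nk}}{\lVert z_n + \epsilon \rVert_2}\, y_{nk},
\qquad
\mathcal{L}_{\mathrm{CE}} = -\sum_{n}\sum_{k} y_{nk}\, \log \frac{e^{z_{nk}}}{\sum_{j} e^{z_{nj}}},
$$

where $z_n$ are the logits for example $n$, $y_n$ is its one-hot label, and $\epsilon = 10^{-4}$ is added to each logit only to keep the norm away from zero (see the code below).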
Here are the results on the MNIST dataset with the usual CNN found in many deep learning textbooks. All I did was replace softmax cross-entropy with the loss function I wrote. Each loss function was tested five times, and the average of the five runs is reported as the final result:
1) tf.nn.softmax_cross_entropy_with_logits: 0.98408 (error rate: 0.01592)
2) tf.nn.softmax_cross_entropy_with_logits_v2: 0.98622 (error rate: 0.01378)
3) tf.nn.sigmoid_cross_entropy_with_logits: 0.98276 (error rate: 0.01724)
4) My loss function based on cosine similarity: 0.99178 (error rate: 0.00822)
Here is my loss function code; the entire Python script is available on request:
import tensorflow as tf

def costFunction(logits, labels):
    # Per-example squared L2 norm of the logits; the small 0.0001 offset keeps the norm away from zero.
    inter = tf.reshape(tf.reduce_sum(tf.square(tf.add(logits, 0.0001)), 1), [tf.shape(logits)[0], 1])
    # Scale each logit vector to (approximately) unit length.
    s_out = tf.div(logits, tf.sqrt(inter))
    # Negative cosine similarity with the one-hot labels, summed over the batch.
    cost = tf.negative(tf.reduce_sum(tf.multiply(s_out, labels)))
    return cost
cost = costFunction(logits=pred, labels=y)
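Two design notes on the code: the 0.0001 offset only keeps the denominator away from zero when all logits vanish, and because the logits are L2-normalized, each example contributes a value in roughly [-1, 1] to the cost. The loss is therefore bounded, whereas log-loss grows without bound for a confidently wrong prediction, which is exactly the bias toward extremes I wanted to avoid.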
From these results, my function reduces the error rate by almost half compared with the cross-entropy functions in the TensorFlow package. I also tried a simple one-layer NN on MNIST instead of the CNN, and my function again gave the best result. I welcome and encourage other researchers to test this function on other datasets.
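For anyone who wants to reproduce the one-layer comparison quickly, here is a minimal sketch of what I mean (TensorFlow 1.x with the tutorial MNIST loader; the initialization, optimizer, and batch settings are just my choices, not part of the claim):

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# One fully connected layer producing raw logits (no softmax on the output).
W = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
b = tf.Variable(tf.zeros([10]))
pred = tf.matmul(x, W) + b

cost = costFunction(logits=pred, labels=y)  # the cosine-similarity loss defined above
train_step = tf.train.AdamOptimizer(0.001).minimize(cost)

correct = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(2000):
        batch_x, batch_y = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_x, y: batch_y})
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels}))

Swapping the cost line for tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=pred)) gives the cross-entropy baseline to compare against.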
If I have not made any fatal mistake, these results suggest that simple loss functions better than softmax cross-entropy really do exist. I am curious why so many textbooks and lectures treat softmax cross-entropy as the default for categorical outcomes without offering any alternatives. Perhaps the authors have been intimidated by the name of Shannon. Maybe, in this huge tide of AI, most people are busy chasing it rather than thinking about it, let alone improving it.