So bad, it’s good… What is Anti-Learning?
I am currently researching this topic and would like to explain it to a wider audience, based on an extract from one of my recent papers [0] and my YouTube video. Please share your thoughts about it!
Much of the teaching of machine learning revolves around using local knowledge, in N-dimensional space, to iterate towards a global solution. This relies upon the premise that there will be a fitness ‘hill’ to climb on the way to the fitness ‘peak’ [4]. There are many examples of algorithms reaching local optima, and of methods to escape or avoid these fitness sub-peaks [5], but overall there is an assumption that supervised, semi-supervised and unsupervised learning methodologies that use local information will eventually reach the best possible solution and therefore provide the best possible model of the data.
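To make that premise concrete, here is a minimal sketch of hill climbing by gradient ascent. The two-peak fitness function and all the names in it are invented for illustration, not taken from the paper: a walker that starts on the wrong slope settles on the ‘fitness sub-peak’ and never finds the higher hill.

```python
# A toy sketch of the hill-climbing premise: use only local
# information (a numerical gradient) to iterate towards a peak.
# The two-hill fitness function is invented for illustration.
import numpy as np

def fitness(x):
    # Global peak near x = +2, lower sub-peak near x = -2.
    return np.exp(-(x - 2) ** 2) + 0.5 * np.exp(-(x + 2) ** 2)

def hill_climb(x, step=0.01, iters=5000):
    for _ in range(iters):
        grad = (fitness(x + 1e-5) - fitness(x - 1e-5)) / 2e-5
        x += step * grad  # always move uphill
    return x

print(hill_climb(-1.0))  # settles near the sub-peak at x ~ -2
print(hill_climb(+1.0))  # climbs to the global peak at x ~ +2
```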
However, there exists a range of problems where the use of local information to iteratively improve the accuracy of a model fails entirely. In fact, it fails so badly that the resulting model predicts categories in an unseen validation set at a rate much worse than guessing; this is called Anti-Learning [6]. Data that exhibits this kind of behaviour comes from a range of sources, both real and synthetic. Problems in the areas of cancer biomarker modelling, yeast genomics and chemical regulation are all real-world examples where anti-learnable datasets exist. Synthetic data exhibiting this phenomenon includes some traditional matrices and various versions of exclusive-OR data. We think anti-learning should be a crucial part of any machine learning curriculum, given that it elegantly explains key practices such as the requirement for validation and the use of high-granularity cross-validation.
The meaningfulness of nearest-neighbour-style clustering in high-dimensional data has been discussed previously [2], and it has been argued that there are serious problems with using this approach alone [3], yet this is exactly the approach taught in most undergraduate modules on machine learning.
Anti-learning is the situation where, no matter how hard you optimise your algorithms, the performance of your machine-learnt model on data not used for its training is reproducibly worse than the probability of guessing the answer. Thus, for a two-class problem, if your model is consistently less than 50% accurate on unseen data, the underlying relationship is being anti-learnt. Most people are baffled by this scenario, but its explanation ironically lies in one of the simplest standard problems used to teach the need for multilayer perceptrons [1].
The exclusive-OR problem is a trivial problem that many teachers use to explain why linear modelling solutions are insufficient for non-linear problems [7]. When the two inputs are both true or both false, the output should be false; when the first and second inputs differ, the output should be true [8]. The figure above (from [8]) illustrates why no single hyperplane can separate the two classes.
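As a quick illustration, here is a minimal sketch, assuming scikit-learn is available (the variable names are mine, not from [7] or [8]). Because no straight line separates the two XOR classes, a single-layer perceptron can classify at most three of the four points correctly, however long it trains:

```python
# A linear model failing on XOR: training accuracy never reaches 100%.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # the four XOR inputs
y = np.array([0, 1, 1, 0])                      # true iff inputs differ

clf = Perceptron(max_iter=1000, random_state=0).fit(X, y)
print(clf.score(X, y))  # at most 0.75, never 1.0
```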
This problem requires a non-linear decision boundary to be generalised in order to accurately represent the data, and as such is a non-linear problem. The main stumbling block, though, is that the entire dataset is required to build the model, leaving no data with which to test it. If you leave one value out of the training set, the model built will always give the incorrect answer for that excluded value: its nearest neighbours in the training data all carry the opposite label, so any model that generalises from local similarity is pushed towards the wrong prediction. For the exclusive-OR problem, where we know we have all available data, modelling using all the data is possible in theory. However, this is a trivial case: for most problems we model the existing data in an effort to approximate or predict outcomes for unrepresented or future scenarios, so to test whether we have generalised we need to assess performance on data not used to train the model.
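The sketch below, again assuming scikit-learn, makes this leave-one-out failure concrete. A 1-nearest-neighbour classifier stands in for any model that generalises from local information: every held-out XOR point is misclassified, giving 0% accuracy on unseen data, far worse than the 50% you would get by guessing.

```python
# Leave-one-out cross-validation on XOR: every prediction is wrong,
# because each held-out point's nearest neighbours all carry the
# opposite label.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # the four XOR inputs
y = np.array([0, 1, 1, 0])                      # true iff inputs differ

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(f"Leave-one-out accuracy: {correct}/{len(X)}")  # 0/4, vs ~2/4 by guessing
```

Note that a model this reliably wrong is, in a perverse sense, informative: for a two-class problem, inverting every prediction of a 0%-accurate model yields a 100%-accurate one, which is why reproducibly below-chance performance points at structure in the data rather than at noise.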
REFERENCES
[1] S. K. Pal and S. Mitra. "Multilayer perceptron, fuzzy sets, and classification." IEEE Transactions on Neural Networks, 3(5):683-697, 1992.
[2] C. Roadknight, U. Aickelin, A. Ladas, D. Soria, J. Scholefield, and L. Durrant. "Biomarker clustering of colorectal cancer data to complement clinical classification." In Proceedings of the 2012 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 187-191. IEEE, 2012.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. "When is nearest neighbor meaningful?" In Proceedings of the International Conference on Database Theory, pp. 217-235, 1999.
[4] K. Huang, H. Yang, I. King, and M. Lyu. "Global Learning vs. Local Learning." In Machine Learning: Modeling Data Locally and Globally, pp. 13-27. Springer, 2008.
[5] D. Gorse, A. Shepherd, and J. G. Taylor. "A classical algorithm for avoiding local minima." In Proceedings of the World Congress on Neural Networks, pp. 364-369. 1994.
[6] A. Kowalczyk and O. Chapelle. "An analysis of the anti-learning phenomenon for the class symmetric polyhedron." In International Conference on Algorithmic Learning Theory, pp. 78-91. Springer, Berlin, Heidelberg, 2005.
[7] Y. Zhao, B. Deng, and Z. Wang. "Analysis and study of perceptron to solve XOR problem." In Proceedings of the 2nd International Workshop on Autonomous Decentralized System, pp. 168-173. IEEE, 2002.
[8] M. Lewin. "All About XOR." Overload Journal #109, June 2012. https://accu.org/index.php/journals/1915
COMMENTS

Graduated from the University of Kentucky (1 year ago):
I realize this is an old article now, but I'm just coming across it. I have managed to run into this anti-learning phenomenon twice now in my career, and the information out there on how to deal with it is abysmal. It can also be hard to notice if you aren't looking at the distribution of your model scores. For instance, I have a data set where my neural network has an AUC score max of 0.94, min of 0, and STD of 0.3, meaning on some cross-validations it trains normally and on others it anti-learns. If you just look at the average you will be misled. I am going to check out your references, but seeing as it has been a few years, do you have a guide or resource you can recommend for how to train a model with data that has a propensity to anti-learn?
A servant leader, Innovator, Intellectually curious (5 years ago):
An interesting read! Thank you for sharing.
Functional Business Analyst | Systems Analyst | Six Sigma Green Belt (6 years ago):
Awesome articles, Uwe. Can I send an email as well so that you can send me some more papers on this topic?
Professor & Associate Dean, College of Computing & Information Sciences at Karachi Institute of Economics & Technology
6 å¹´Great article Uwe. Can you recommend some more articles on this topic, please? I would like to share these with my students as well.
Data & AI Scientist | PhD | ML with Impact | Certified Generative AI Specialist | Multi-Industry Experience
6 å¹´Interesting! (fyi, Caihao Cui)