Adding Human Rationality to MNIST via Neural Cellular Automata

In my doctorate, back in 2006, I studied Business Administration and had an idea of mixing Electronic Engineering concepts with business. At that time, I knew interdisciplinarity was the way to generate innovative solutions. This decision allowed me to get a full scholarship from the university. After doing extensive literature review during my two years of masters, where I studied social networks, I came up with the idea of simulating human interactions that occurred in the real world, specifically the perception of quality in professional services.

In the beginning I was exploring possibilities and came across genetic algorithms, studying the work of Melanie Mitchell (1996) and John Holland. As I had previously worked as a dentist, I knew genetics, crossover, mutation and other concepts from biology. The concept of evolution and natural selection was inspiring and opened my eyes to other vital concepts: complexity, environment feedback, emergence, closed and open systems (Bertalanffy, 1950).

Nowadays, these are among the research areas of the Santa Fe Institute, in New Mexico, where Melanie teaches. These concepts, along with epistemology classes covering the work of Popper, Bachelard, Kant, Descartes and others, led me to Edgar Morin, a French philosopher who has written extensively about complexity. Morin criticizes reductionism, stating that it isolates phenomena from context and disregards randomness. He supports recursive scientific methods and transdisciplinarity, arguing that the whole is contained in each one of the parts of a system. This is directly related to complexity, where the whole system is usually greater than the sum of its individual parts.

That’s when I approached cellular automata, inspired by a friend of mine: simple, local rules that generate complex patterns and emergent phenomena. As I will explain ahead, a cellular automaton is an iterative closed system whose evolution in time depends upon the context (environment) in which individual parts (cells, individuals) are inserted. Recent deep learning work, like GFlowNets, from Bengio et al. (2021), presents a quite interesting similarity with cellular automata concepts, given that the flow of information follows paths inside a given system.

But how exactly do cellular automata relate to perception of quality? In fact, human perception of quality depends upon tangible (comfort, physical appearance) and intangible (honesty, willingness to help, confidence) aspects of a business. People interact with each other and, in many cases, word-of-mouth advertising plays a critical role in spreading positive (or negative) information about a business. This flow of information can be modeled with cellular automata.

So, after studying theoretical concepts of quality perception in the literature, I did market research with real people (adults) at two different times: one at time zero and the second at time zero plus 4 months. That’s called longitudinal market research.

My idea was to develop a model whose initial condition was inherited from the results of the first market research response values and evolve it to simulate human interactions that led to the values of the second market research. Then, compare results of the model with the second market research, as seen in Figure 1.

Figure 1. Methodology of the simulation comparison

Epistemologically speaking, I tried to get rid of a priori bias considering concepts from complexity, emergence of global phenomena from simple rules in a way that could be modeled via agent-based modeling, without trying to force the result I wanted to happen. Risky. I didn't have any preconceived ideas about the final output. I mean, although I extensively studied theoretical concepts from business domain and complexity, the whole experiment could go wrong. There was a possibility I could reject the hypothesis that stated that my model could simulate human interactions with reasonable accuracy.

The market research was applied to customers of aesthetic services using a 5-point Likert scale (categorical), from disagree completely (0) to agree completely (4), with a central neutral response, following the SERVQUAL structure (Parasuraman et al., 1985, 1988) and validated through qualitative research (interviews). The full methodology and results of the study are detailed in Zimbres (2009), published by Elsevier.

These response values, 0, 1, 2, 3 and 4, were used as the initial state of the cellular automaton for each individual.

Figure 2. Likert scale from disagree completely (0) to agree completely (4).

Below you can see the evolution of a cellular automaton regarding the Honesty indicator. Starting from a single horizontal line (time=0), individuals (square cells) interact via local rules and generate the second line below (time=1) and so forth. It’s interesting to see how dissatisfaction (red squares) spreads through individuals. It’s also possible to find a periodic contour condition, the pattern inside the black squares, a configuration that repeats over time influencing different sets of individuals.

Figure 3. Five state cellular automaton evolving for the quality perception indicator Honesty.

An interesting behavior that emerged from the cellular automaton, shown below, is the oscillation of the sum of cell states, accompanied by a decrease in the Euclidean distance between individuals' vectors of cell states across indicators. This suggests the emergence of consensus, given that the difference in quality perception among individuals decreases, which is properly simulated by the cellular automaton model.

Figure 5. Oscillatory behavior given by the sum of cellular automata cell values and stabilization in the tenth cycle.
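
One way to quantify the consensus described above is to track the mean pairwise Euclidean distance between individuals' state vectors at each time step. The sketch below is only an illustration: the function name `mean_pairwise_distance` and the toy data are assumptions, and the thesis may have computed the distance differently.

```python
import math
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs of state vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical data: each inner list holds one individual's Likert states (0..4)
# for three indicators, before and after some cellular automaton steps.
before = [[0, 4, 2], [4, 0, 2], [2, 2, 2]]
after = [[2, 3, 2], [3, 2, 2], [2, 2, 2]]

# A smaller value after the steps means perceptions moved closer together.
print(mean_pairwise_distance(before), mean_pairwise_distance(after))
```

A falling value over successive time steps would correspond to the consensus effect discussed in the text.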


Cellular Automata

Let’s now explain how these interactions occur. Cellular automata are finite state machines where cells, usually arranged in a lattice, interact with each other following simple rules. As in any complex system, individuals’ states evolve according to their own states and the neighborhood (environment) they interact with. From simple rules, complex patterns emerge, and often these complex patterns have little relation to the simplicity of the initial conditions.

In my thesis I used a 5 state cellular automaton (CA) rule, but let’s start from the basics, a 2 state one-dimensional cellular automaton. Consider that one central cell has two neighbors, one on the right and one on the left. As it can only have 2 states, these states can be 0 or 1. For rule 232 of CAs, a majority rule, we have the following Truth Table:

Figure 6. Truth table of 3 bits.

The rule number can then be obtained by summing the products of the cell's final states and their positional multipliers (powers of 2). That can be translated into the following transition table:

Figure 7. Transition table for one-dimensional two states cellular automaton rule 232.
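
The relation between the rule number and the transition table can be checked in a few lines. This is a minimal sketch for the two-state, three-cell case described above; the helper names `rule_table` and `rule_number` are mine, not from the thesis.

```python
from itertools import product

def rule_table(number):
    """Map each (left, center, right) neighborhood to its next state."""
    table = {}
    for left, center, right in product((0, 1), repeat=3):
        position = 4 * left + 2 * center + right   # neighborhood read as a 3-bit number
        table[(left, center, right)] = (number >> position) & 1
    return table

def rule_number(table):
    """Sum of final states times their positional multipliers (powers of 2)."""
    return sum(out * 2 ** (4 * l + 2 * c + r) for (l, c, r), out in table.items())

table = rule_table(232)

# Majority rule: the next state is 1 exactly when at least two of the three cells are 1.
assert all(out == (sum(nbh) >= 2) for nbh, out in table.items())

# Rebuilding the number from the table gives 128 + 64 + 32 + 8 = 232.
assert rule_number(table) == 232
```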

That is, for each cell c(i,j) in the lattice, where i and j are the row and the column respectively, a function Sc(t) = S(t;i,j) describes the state of cell c at time t. So, at time t+1, the state S(t+1;i,j) is given by:

S(t+1;i,j) = [S(t;i,j) + δ] mod k

where -k ≤ δ ≤ k and k is the number of states of cell c. The formula for δ is:

δ = μ if condition (a) is true

δ = -S(t;i,j) if condition (b) is true

δ = 0 otherwise

where conditions (a) and (b) change according to the rule.

This means that the central cell, according to its own state and the states of its left and right neighbors at time t, updates its value at time t+1, sometimes spreading the information over the lattice.
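
The update formula can be written as a small function. This is a sketch under stated assumptions: conditions (a) and (b) depend on the rule and are not specified here, so they are passed in as plain booleans, and the function name `next_state` and the example values are hypothetical.

```python
def next_state(s, k, mu, cond_a, cond_b):
    """Return S(t+1) = (S(t) + delta) mod k for one cell.

    cond_a and cond_b stand in for the rule-specific conditions (a) and (b).
    """
    if cond_a:
        delta = mu        # shift the state by mu
    elif cond_b:
        delta = -s        # reset the state to 0
    else:
        delta = 0         # keep the current state
    return (s + delta) % k

k = 5  # five states, as in the Likert scale 0..4
print(next_state(3, k, mu=1, cond_a=True, cond_b=False))   # state moves up by one
print(next_state(4, k, mu=1, cond_a=True, cond_b=False))   # wraps around because of mod k
print(next_state(3, k, mu=1, cond_a=False, cond_b=True))   # reset
print(next_state(3, k, mu=1, cond_a=False, cond_b=False))  # unchanged
```

The modulo keeps every result inside the valid range of states, which is why δ can be negative or larger than the distance to the top state.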

Figure 8. Cellular automaton evolution of one-dimensional rule 126.
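
An evolution like the one in Figure 8 can be reproduced with a short script. This is a minimal sketch, assuming a ring of 11 cells and a single active cell as the initial condition; the width, the seed and the helper names `step` and `evolve` are my choices for illustration.

```python
def step(cells, rule):
    """One synchronous update of a two-state, three-cell-neighborhood CA on a ring."""
    n = len(cells)
    return [
        # cells[i - 1] wraps to the last cell when i == 0
        (rule >> (4 * cells[i - 1] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

def evolve(cells, rule, steps):
    """Return the list of rows from time 0 to time `steps`."""
    rows = [cells]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows

width = 11
seed = [0] * width
seed[width // 2] = 1          # single active cell in the middle

for row in evolve(seed, rule=126, steps=3):
    print("".join("#" if c else "." for c in row))
```

Each printed line is one time step, so reading the output top to bottom corresponds to reading the figure from time=0 downwards.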

Rendered in 3D, two-dimensional CAs (Packard and Wolfram, 1985) can generate very complex patterns, like a pyramidal structure:

Figure 9. Evolution of a two-dimensional cellular automaton.

In my thesis, although the search space of possible rules was on the order of 10^87 (exactly 5^125 = 2350988701644575015937473074444491355637331113544175043017503412556834518909454345703125 possible rules) and computational resources were limited to an AMD machine with 512 MB of RAM, it was possible to find a cellular automaton rule that could mimic human interactions with an accuracy of 73.80%. In fact it was not luck, but what I call “random guided search” over 4 months. I used Mathematica to do the modeling.
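
The size of this search space follows from counting: with k states and a neighborhood of 3 cells there are k^3 possible neighborhoods, and each one can map to any of the k states. A minimal sketch of that count (the function name `rule_space_size` is mine):

```python
def rule_space_size(k, neighborhood):
    """Number of distinct CA rules: one of k outputs for each of k**neighborhood inputs."""
    return k ** (k ** neighborhood)

n_rules = rule_space_size(k=5, neighborhood=3)   # 5 ** 125, computed exactly

print(n_rules)
print(len(str(n_rules)))                         # number of decimal digits
print(rule_space_size(k=2, neighborhood=3))      # the two-state case
```

Python integers have arbitrary precision, so the full number is computed exactly rather than as a floating-point approximation.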

Nowadays, some researchers are developing Neural Cellular Automata (NCA), i.e., cellular automata integrated into deep learning architectures. Alexander Mordvintsev (Google), Ettore Randazzo (Google), Eyvind Niklasson (Google) and Michael Levin (Allen Discovery Center) developed a regenerative NCA based on morphogenesis, that is, “the process of an organism’s shape development”. In their work, they iterate generative neural networks in such a way that the final error with respect to the desired output serves as an incremental update (loss for backpropagation) to the following iteration, using 16-channel images, a Sobel filter for the 2D convolution and the concept of residual neural networks, as seen in Figure 10.

Figure 10. Neural Cellular Automata. Source: Mordvintsev, Randazzo, Niklasson and Levin (2020)

Given that I could find a cellular automaton rule that reasonably mimics human rationality within all existing limitations, I was curious to know what happens if we add this rationality to a discriminative neural architecture with a simple dataset such as MNIST. Interestingly, the neural net behaved like a human. I will explain.


The Experiment

Consider that we have the following 5x5 kernel for a regular deep learning architecture:

Figure 11. Regular kernel 5 x 5.

Now, imagine that we apply a one-dimensional cellular automaton rule to each row of this kernel (t=0), evolving each row vertically, in depth, and take the next configuration (t=1). For the cellular automaton, I used a toroidal configuration, where the leftmost cell interacts with the rightmost cell. To make this simple in Python, I padded the original 5x5 kernel on both sides, artificially creating a toroid, like the one in Figure 12 (left), where a two-dimensional lattice (right) folds around itself.

Figure 12. Toroid (left) created by folding a two dimensional lattice (right) where the cellular automaton runs.
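
The wrap-around neighborhood of a toroidal row can also be expressed with modular indexing instead of padding. This is a minimal sketch of the idea, not the code used in the experiment; the function name `toroidal_neighborhoods` is mine. NumPy's `np.pad` with `mode='wrap'` should give an equivalent wrapped border for arrays.

```python
def toroidal_neighborhoods(row):
    """Return (left, center, right) for every cell, wrapping around the ends."""
    n = len(row)
    return [(row[(i - 1) % n], row[i], row[(i + 1) % n]) for i in range(n)]

row = [1, 0, 1, 0, 1]          # first row of a 5x5 kernel
for cell, nbh in enumerate(toroidal_neighborhoods(row)):
    print(cell, nbh)
```

With this indexing the first cell sees the last cell as its left neighbor, and the last cell sees the first cell as its right neighbor, which is what folding the lattice into a ring means.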

I then ran the rule from my thesis, rule 2159062512564987644819455219116893945895958528152021228705752563807959237655911950549124, for one step vertically on each row of the above kernel.

Below, its transition table is shown:

Figure 13. Transition table for rule 2159062512564987644819455219116893945895958528152021228705752563807959237655911950549124
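
A transition table like this one can be derived from the rule number by reading its base-5 digits, one digit per neighborhood. The sketch below uses only the standard library and assumes the usual convention that the least significant digit belongs to neighborhood (0, 0, 0); the names `RULE`, `K` and `decode_rule` are mine.

```python
from itertools import product

# Rule number quoted in the text
RULE = 2159062512564987644819455219116893945895958528152021228705752563807959237655911950549124
K = 5  # number of cell states

def decode_rule(rule, k=K, cells=3):
    """Return {neighborhood: next_state} from the base-k digits of `rule`."""
    table = {}
    for nbh in product(range(k), repeat=cells):   # (0,0,0), (0,0,1), ... (4,4,4)
        rule, digit = divmod(rule, k)             # peel off the least significant digit
        table[nbh] = digit
    return table

table = decode_rule(RULE)

# Round trip: the digits weighted by powers of 5 give the rule number back.
rebuilt = sum(d * K ** (25 * l + 5 * c + r) for (l, c, r), d in table.items())
assert rebuilt == RULE

print(len(table))          # one entry per neighborhood
print(table[(0, 0, 0)])    # next state of a cell whose neighborhood is all zeros
```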

This cellular automaton iteration on the original kernel created a new kernel, asymmetric, as seen below:

Figure 14. Neural cellular automata kernel 5 x 5.

Although the initial condition (original kernel) was symmetric, this altered kernel is evidently asymmetric, meaning that the “human” kernel I created is biased towards the left side. This makes sense and may be considered natural, as humans have their bias in decisions, motivated by self-interest. This also corroborates the Nash equilibrium (1950), where each agent pursues what is best for herself/himself, generating a suboptimal outcome for the collective.

Python code for cellular automata:


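import itertools

import numpy as np

# Rule number from the thesis (88 decimal digits), kept as an exact Python int
regra=2159062512564987644819455219116893945895958528152021228705752563807959237655911950549124

base1=5                       # number of cell states (Likert answers 0..4)
states=np.arange(0,base1)
dimensions=5                  # width of the kernel

## KERNEL 5x5

kernel=[[1, 0, 1, 0, 1],
       [0, 1, 0, 1, 0],
       [0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1]]

def cellular_automaton():
    global kernel

    lista=states

    # Pad one cell on every side, then fill the border with the adjacent row/column.
    # Note: this copies the nearest edge values; a strict toroidal wrap would be
    # np.pad(kernel, 1, mode='wrap').
    kernel=np.pad(kernel, (1, 1), 'constant', constant_values=(0))
    kernel[0]=kernel[1]
    kernel[-1]=kernel[-2]

    # np.transpose returns a view, so these assignments fill the border columns of kernel
    kernel2=np.transpose(kernel)
    kernel2[0]=kernel2[1]
    kernel2[-1]=kernel2[-2]

    kernel=np.transpose(kernel2)

    # All 5**3 = 125 neighborhoods (left, center, right), from (4,4,4) down to (0,0,0)
    all_possible_states=np.array([p for p in itertools.product(lista, repeat=3)])[::-1]

    # Base-5 digits of the rule number, left-padded with zeros to 125 entries
    zeros_all_possible_states = np.zeros(all_possible_states.shape[0])
    final_states = [int(i) for i in np.base_repr(int(regra),base=base1)]
    zeros_all_possible_states[-len(final_states):]=final_states

    # Next state of the central cell for each neighborhood, stored as [0, state, 0]
    final_state_central_cell=[]
    for i in range(0,len(zeros_all_possible_states)):
        final_state_central_cell.append([0,int(zeros_all_possible_states[i]),0])

    # Transition table: each entry pairs a neighborhood with its final state
    initial_and_final_states=[]
    for i in range(0,len(all_possible_states)):
        initial_and_final_states.append(np.array([all_possible_states[i],np.array(final_state_central_cell[i])]))

    def ca(row):
        # For each cell, find its 3-cell neighborhood in the table and take the new state
        out=[]
        for cell in range(0,dimensions):
            out.append(final_state_central_cell[next((i for i, val in enumerate(all_possible_states) if np.all(val == kernel[row][cell:cell+3])), -1)][1])
        return out

    # Apply one CA step to every original (non-border) row of the padded kernel
    kernel=np.array([item for item in map(ca,range(1,kernel.shape[0]-1))])
    return kernel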

For the neural net architecture I used 2 convolutions, max pooling, 2 dropouts, batch normalization and 3 fully connected layers, totaling 3,328,474 parameters. I didn’t do data augmentation to improve results. Regularization was not applied.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self,kernel):
        # kernel: float tensor holding the cellular automaton kernel; with 28x28 inputs,
        # a shape of (1, 1, 5, 5) gives the flattened size 64*7*7 = 3136 used below
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(32, 64, 3, 1,bias=False)
        self.conv2 = nn.Conv2d(1, 64, 10, 1,bias=False)
        self.dropout1 = nn.Dropout(0.2)
        self.dropout2 = nn.Dropout(0.4)
        self.fc1 = nn.Linear(3136, 28*28)
        self.fc2 = nn.Linear(28*28, 128)
        self.fc3 = nn.Linear(128, 10)
        torch.nn.init.xavier_uniform_(self.conv1.weight)
        torch.nn.init.xavier_uniform_(self.conv2.weight)
        torch.nn.init.xavier_uniform_(self.fc1.weight)
        torch.nn.init.xavier_uniform_(self.fc2.weight)
        torch.nn.init.xavier_uniform_(self.fc3.weight)
        self.batch_norm = nn.BatchNorm1d(3136)
        # Replace conv1's weight with the fixed (non-trainable) kernel; the shape
        # declared for conv1 above is therefore not the one used in forward()
        self.conv1.weight = nn.Parameter(kernel,requires_grad=False)


    def forward(self, x):
        # Flattened input image, reused as the residual (lateral) input to fc2
        res = x.view(x.size(0), 784)
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(x)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.batch_norm(x)
        x = self.fc1(x)
        x = F.relu(x)
        # Residual step: average the fc1 output with the original flattened image
        x = self.fc2(torch.mean(torch.stack((x,res)),0))
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc3(x)
        output = F.log_softmax(x, dim=1)
        return output

I then trained the network with the “human” kernel and got a suboptimal result: 98.78% accuracy. Counterintuitive, right? We always search for the best possible result. However, human performance on MNIST is reported to average 98.29% accuracy. So, at first it looks like my experiment partially succeeded, because adding a “human” cellular automaton kernel to an MNIST classifier produced results close to human accuracy. This suboptimal outcome is easy to understand if you consider what Herbert Simon said back in 1978 regarding bounded rationality: people are unable to reach optimal results because of their cognitive limitations, because they only have access to part of the information, and because of information asymmetry.

As data scientists, we are used to pursuing optimal results, even in the search for Artificial General Intelligence. But bearing in mind that humans are suboptimal and inefficient, what kind of parameters should we consider? Could the memory reinforcement we use when teaching children boost human performance? Could this strategy be used to improve reinforcement learning models instead of stochastic processes?

Given these questions, and in an attempt to improve results, I added a residual layer as a lateral input to the upper fully connected layer. Results jumped to 99.29% accuracy with the 5x5 kernel and to 99.38% accuracy with the 3x3 kernel.

The result achieved with the 3x3 kernel is 0.11 percentage points better than that of Acharya et al. (Google AI and Amazon Search) in their 2021 paper entitled “Robust Training in High Dimensions via Block Coordinate Geometric Median Descent”, where they solved MNIST with a 1.16-million-parameter CNN. The authors use a residual memory mechanism added to the gradient, “inspired by [an] error feedback mechanism”. According to them, “compensating for the loss incurred due to approximation through a memory mechanism is a common concept in the feedback control and signal processing literature”. In their case, L2 regularization was applied. Although they applied corruption to MNIST, the 99.27% accuracy was achieved on a clean dataset.

Results were very similar to those of Granmo et al. (2019), who got 99.40% accuracy on MNIST using a mixed model from the Tsetlin Automaton (Xiao et al., 2017) and the Finite State Learning Automaton (Narendra, 1989) applied to filters. According to the authors, the “Nash equilibrium balances false positives against false negative classifications, while combating overfitting with frequent pattern mining principles”.

My proposed model also scores higher on MNIST than the SEER model RG-64gf, from Goyal et al. (Meta AI Research, 2022), who got 99.32% accuracy with a model that has 250 million parameters and 27 layers.

Notice that the neural net I developed is not small and is quite shallow, and I used ReLU activation functions, so vanishing gradients and weight saturation should NOT be a problem. However, adding a residual layer improved results. Then I started to ponder: is the suboptimal behavior of human beings due to lack of memory? If we refresh human rationality with the original idea (residual layer), could we improve human performance? This looks like a key point in this experiment: if we were dealing with adults, adding memory means saying: “Look, this is a nine, this is an eight, this is a six”. It doesn’t make sense, right? Adults already know what a 6, an 8 or a 2 is. So, it looks like we are dealing with a child. It makes sense to say to a child: “Look, this is a 6, remember? This is an 8, right?”. My sister, who has kids, told me children learn numbers between 4 and 6 years of age. Coincidentally, Kosoy et al. also studied children of exactly this age (4-6 years old) in their 2022 paper, relating child development and learning to Reinforcement Learning strategies.

These findings, namely the effectiveness of the memory refresher as a tool for generalization and the probable age at which this strategy is effective, are very similar to the results of Moely et al. (1992). According to them, “children of grades 2-3 are unlikely to generate effective strategies in all but [...] are also very amenable to training in memory strategy use”. So it seems that children 7-8 years old are already sensitive to memory stimulation, but considering MNIST, the digit task, we got an age interval of 4-6 years old. Could this sensitivity to memory stimulation start years earlier? Probably yes.

Moely et al. found that “instructions [to students] that promote strategy generalization were rarely seen. Because teachers' strategy suggestions were usually quite task-specific, it is perhaps not surprising that so little instruction in generalization was found”. This is also an issue in Reinforcement Learning models (Kosoy, 2022).

Consider transformers, for instance. Although they are able to do different tasks, such as machine translation (Vaswani et al., 2017), text classification and even computer vision (Khan et al., 2022), they are task-specific, because you basically have to alter their structure according to different tasks, thus not generalizing. In this regard, Gato, from DeepMind (Reed et al., 2022) seems to be the first step into solving this generalization issue given that it is a multi-modal, multi-task neural network that uses the same network with the same weights for each one of the possible tasks that it aims to solve. The only adaptation necessary is input preprocessing.

According to Forsberg, Adams and Cowan (2021), it seems that “working memory ability is related to various measures of educational attainment”. According to the authors, the lack of memory seems to be a “bottleneck for long-term learning, constraining the ability to learn the meaning of new concepts and encode new information”. This probably happens because of the connections we usually make between different concepts. Encoding new information means learning, that is, using concepts from other domains and past experience (memory) to understand new concepts and ideas. For neural networks, this may mean the ability to update weights given a gradient.

This experiment added human rationality to a neural network via kernel modification and also refreshed the memory of the neural net via a residual layer. The neural net behaved like a 4-5-year-old child. For me, this finding is very interesting, because it was totally unintended and because MNIST is such a trivial task, totally appropriate for a kid.

Then, my final question is: if we treat our deep learning models as 5-year-old kids and apply child psychology to make them learn, could we achieve better results and more generalization power? Evidence says yes. Some researchers are already studying this relationship in reinforcement learning models, like Kosoy et al. from Berkeley, DeepMind and MIT (2020, 2022).

The full code of this experiment, including the checkpoint of the trained neural network, is available on my GitHub.


Acknowledgements

The author thanks the following organizations for their support and insights provided: Universidade Mackenzie for the scholarships during my masters and doctorate, Wolfram Research for the scholarship during my doctorate, Google Cloud and Google Developers for the sponsorship and continuous support of my activities and DeepMind for the insights.


References

A. Acharya, A. Hashemi, U. Topcu, S. Sanghavi, I. Dhillon, P. Jain. Robust Training in High Dimensions via Block Coordinate Geometric Median Descent. arXiv:2106.08882, 2021.

L. Bertalanffy. An Outline of General System Theory. The British Journal for the Philosophy of Science, 1950.

Y. Bengio, T. Deleu, E.J. Hu, S. Lahlou, M. Tiwari, E. Bengio. GFlowNet Foundations. arXiv:2111.09266, 2021.

L.J.M. Coulthard. Measuring service quality: A review and critique of research using SERVQUAL. International Journal of Market Research, v. 46, n. 4, 2004.

A. Forsberg, E. Adams, N. Cowan. The role of working memory in long-term learning: Implications for childhood development. Psychology of Learning and Motivation. Volume 74, 2021, Pages 1-45, Elsevier, 2021.

P. Goyal, Q. Duval, I. Seessel, M. Caron, I. Misra, L. Sagun, A.Joulin, P. Bojanowski. Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision. arXiv:2202.08360, 2022.

S. Khan, M. Naseer, M. Hayat, S.W. Zamir, F.S. Khan, M. Shah. Transformers in Vision: A Survey. arXiv:2101.01169, 2022.

E. Kosoy, A. Liu, J. Collins, D. Chan, J. Hamrick, N. R. Ke, S.H. Huang, B. Kaufmann, J. Canny, A. Gopnik. Learning Causal Overhypotheses through Exploration in Children and Computational Models. arXiv:2005.02880v2, 2022

E. Kosoy, J. Collins, D.M. Chan, S. Huang, D. Pathak, P. Agrawal, J. Canny, A. Gopnik, J.B. Hamrick. Exploring exploration: comparing children with RL agents in unified environments. arXiv:2005.02880v2, 2020.

J.F. Nash Jr. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, USA 36, 48–9, 1950.

N. Packard, S. Wolfram. Two-dimensional cellular automata. Journal of Statistical Physics volume 38, pages 901–946,1985.

M. Mitchell. An Introduction to Genetic Algorithms. Cambridge, Mass.: MIT Press, 1996.

B.E. Moely, S.S. Hart, L. Leal, K.A. Santulli, N. Rao, T. Johnson, L.B. Hamilton. The Teacher's Role in Facilitating Memory and Study Strategy Development in the Elementary School Classroom. Child Development, Volume 63, Issue 3, 1992.

A. Mordvintsev, E. Randazzo, E. Niklasson, M. Levin. Growing Neural Cellular Automata, 2020. Available at: https://distill.pub/2020/growing-ca/

K.S. Narendra, A.L. Thathachar. Learning Automata: An Introduction. Prentice-Hall, Inc., 1989.

A. Parasuraman, V.A. Zeithaml, L.L. Berry. A conceptual model of service quality and its implications for future research. Journal of Marketing, v. 49, Fall 1985.

A. Parasuraman, V.A. Zeithaml, L.L. Berry. Servqual: A Multiple-Item Scale For Measuring Consumer Perceptions of Service Quality. Journal of Retailing, v. 64, n. 1, Spring 1988.

S. Reed, K. Żołna, E. Parisotto, S.G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Giménez, Y. Sulsky, J. Kay, J.T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, N. de Freitas. A Generalist Agent. arXiv:2205.06175, 2022.

H.A. Simon. Models of Bounded Rationality. Cambridge, Mass.: MIT Press, 1978.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.

S. Wolfram. A new kind of science. Canada: Wolfram Media Inc., 2002.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

R.A. Zimbres, P.P.B. Oliveira. Dynamics of Quality Perception in a Social Network: A Cellular Automaton Based Model in Aesthetics Services. Electronic Notes in Theoretical Computer Science. Volume 252, 1, Pages 157-180, 2009.

R.A. Zimbres. A dinâmica da percepção de qualidade em serviços estéticos sob uma perspectiva diádica, 2009. English: Dynamics of quality perception in aesthetic services under a dyadic perspective, 2009. Available at: https://dspace.mackenzie.br/handle/10899/23257
