Using Machine Learning for Data Anonymization.

Hi, my name is Aamir Mirza, a Data Scientist based in Melbourne, Australia. It has been a while since I posted a new article on Artificial Intelligence or Machine Learning; the COVID-19 lockdown gave me the perfect excuse to put pen to paper.

In a previous engagement, I had the opportunity to create an ML (Machine Learning) model using some third-party data. Because of privacy-related concerns, the third party was reluctant to part with the data required. I am sure other Data Scientists and BI analysts have encountered this problem, where their models require (as input) third-party data or data with privacy-related restrictions.

When I spoke to some of my colleagues, they expressed similar sentiments concerning either third-party data or sensitive data from within their own organizations.

With this in mind, I decided to tackle the problem in an unusual way. The goal was to take third-party data, anonymize it, and still make sure it could be used for downstream ML. Welcome to sparse CNN autoencoders. While I applied this architecture to a particular kind of data, there is no technical or mathematical reason why it cannot be applied to other domains.

The architecture goes like this: we have two VGG16 CNN networks, one acting as the encoder and the other as the decoder, and we train them on data that can be encoded as a 3D frame (the input to the CNN). The secret sauce is the sparse layer that connects the encoder and the decoder.
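To make this concrete, here is a minimal sketch of what such a network could look like, assuming TensorFlow/Keras; the 2048-unit latent size, the 10% activity level, and the decoder's layer sizes are illustrative choices of mine, not the exact model from my engagement.

```python
# A minimal sketch of the encoder / sparse-layer / decoder pipeline.
# Assumes TensorFlow 2.x; all sizes below are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, Model

SPARSE_DIM = 2048       # size of the sparse latent layer (assumed)
ACTIVE_FRACTION = 0.10  # roughly 10% of latent units active at a time

def top_k_sparsify(x, frac=ACTIVE_FRACTION):
    """Keep activations at or above the k-th largest per sample; zero the rest."""
    k = max(1, int(x.shape[-1] * frac))
    kth = tf.math.top_k(x, k=k).values[:, -1:]   # k-th largest value per row
    return tf.where(x >= kth, x, tf.zeros_like(x))

inp = layers.Input(shape=(224, 224, 3))

# Encoder: the VGG16 convolutional stack, trained from scratch here.
encoder = tf.keras.applications.VGG16(include_top=False, weights=None,
                                      input_tensor=inp, pooling="avg")
latent = layers.Dense(SPARSE_DIM, activation="relu")(encoder.output)
sparse = layers.Lambda(top_k_sparsify, name="sparse_layer")(latent)

# Decoder: a mirrored upsampling stack that reconstructs the input frame.
x = layers.Dense(7 * 7 * 512, activation="relu")(sparse)
x = layers.Reshape((7, 7, 512))(x)
for filters in (512, 256, 128, 64, 32):
    x = layers.Conv2DTranspose(filters, 3, strides=2,
                               padding="same", activation="relu")(x)
out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)

autoencoder = Model(inp, out)
```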

The input to the neural network is the same as the output. At first this looks very strange: why would anyone train a neural network whose input and output are the same? This type of architecture is called an autoencoder, and the idea is for the ANN (Artificial Neural Network) to learn a hidden (latent) representation of the information it is trained on. If you are still confused, don't worry; I will provide ample visual drawings to drive the point home.
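Training then amounts to handing the network the same frames as both input and target. Continuing the sketch above, and assuming x_train and x_val are arrays of preprocessed frames scaled to [0, 1]:

```python
# The training target is the input itself: the network must learn to
# reconstruct each frame from its own sparse latent code.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_train, x_train,            # same tensor on both sides
                validation_data=(x_val, x_val),
                epochs=50, batch_size=32)
```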

Once the network is trained to reconstruct the original input, we chop off the decoder and keep only the encoder plus the sparse layer. The sparse layer is now the output of our trained network. The new architecture is a VGG16 CNN plus a sparse layer, with the decoder part gone.
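Continuing the sketch, chopping off the decoder amounts to building a new Keras model that ends at the sparse layer. The name x_sensitive below is a hypothetical stand-in for the confidential inputs:

```python
# Keep only the encoder up to the sparse layer. Its output, not the
# reconstruction, is what we share downstream.
sparse_encoder = Model(autoencoder.input,
                       autoencoder.get_layer("sparse_layer").output)
codes = sparse_encoder.predict(x_sensitive)  # anonymized sparse representation
```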

Here is a visual representation of the full autoencoder.

[Figure: the full autoencoder, a VGG16 encoder and VGG16 decoder joined by the sparse layer]

At the middle point, we insert the sparse layer, which looks like this.

[Figure: the sparse middle layer, with only a small fraction of neurons active at any time]

This is a visual representation of the middle layer. At any given moment, only 10% of the neurons are active, and the decoder has to reconstruct the data from this layer alone.

As the ANN goes through the training process with millions of inputs, it starts to converge and the accuracy of the network goes up. In my case, the accuracy worked out to be around 83%, which is still a good outcome, as I will explain later.

Why Sparse?

To understand why sparsity is the secret sauce here, let us first understand dense representations in the context of ANNs. To illustrate the point, here is a primer using ASCII: to save space or memory, bits of information are packed together so tightly that changing even a single bit completely changes the meaning; for example, 88 ('X') becomes 120 ('x'). This is a dense representation.
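A quick illustration of this brittleness in Python (the particular bit flipped is just an example):

```python
# Dense codes are brittle: flipping one bit of an ASCII byte yields a
# completely different character.
c = ord("X")                        # 88, binary 0b1011000
flipped = c ^ 0b0100000             # flip a single bit
print(chr(c), "->", chr(flipped))   # X -> x: one bit changed the meaning
```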

A sparse representation instead encodes information in a much larger bit space while using only around 2 to 5 percent of that space. The advantage of this sparse space is that even if we are off by a few bits, the overall representation does not change.
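A small numeric sketch of that robustness, with illustrative sizes of my own choosing (a 2048-bit space with 40 active bits, roughly 2%):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 2048, 40                          # 2048-bit space, ~2% of bits active

a = np.zeros(N, dtype=bool)
a[rng.choice(N, size=K, replace=False)] = True

b = a.copy()
b[rng.choice(np.flatnonzero(b), size=4, replace=False)] = False  # lose 4 bits

print(np.count_nonzero(a & b) / K)       # 0.9: still clearly the same code
```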

A sparse representation does not suffer from the problems we encountered with the dense representation. If you are interested in exploring this topic a bit further, here is the link. To cut a long story short, sparsity ensures that the latent layer of our ANN has enough combinatorial space for information to express itself without the bit collisions that can happen in dense space.

You might ask what all of this has to do with data anonymization. As I mentioned before, once the ANN is trained, we are only interested in the encoder plus the sparse layer.

For every piece of confidential or sensitive data fed into the ANN, we get a sparse output that represents the original data in this sparse latent space. We can then use that output to train further neural networks without compromising confidentiality.

Similar inputs create similar sparse representations, and we can measure similarity as a function of the overlap of active bits between different outputs in latent space.
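A minimal sketch of such an overlap measure, assuming the latent codes have been binarized into boolean arrays:

```python
import numpy as np

def similarity(a, b):
    """Fraction of active bits shared by two sparse binary codes."""
    shared = np.count_nonzero(a & b)
    union = np.count_nonzero(a | b)
    return shared / union if union else 1.0
```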

We can create new inputs for downstream ANNs simply by performing bitwise OR, XOR, and AND operations while maintaining fixed sparsity. This is just not possible with a regular dense ANN. All of this is explored in great detail in the links below.
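For example, here is one hypothetical way to OR two codes while restoring a fixed number of active bits; the random tie-breaking is my own illustrative choice:

```python
import numpy as np

def combine(a, b, k, rng=np.random.default_rng(0)):
    """Bitwise-OR two sparse codes, then drop bits at random so the
    result keeps a fixed count k of active bits."""
    union = a | b
    idx = np.flatnonzero(union)
    keep = rng.choice(idx, size=min(k, idx.size), replace=False)
    out = np.zeros_like(union)
    out[keep] = True
    return out
```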

Without the decoder, the latent representation can never be used to reconstruct the original data. The sparse representation can therefore be made freely available to train downstream ANNs for classification or regression. We simply pass the encoder to the third party, who encodes the data and sends it to us.

Finally, let us come back to why even 83% accuracy is enough for the ANN to achieve its goal. One great property of sparse networks is their ability to withstand noise, which comes out of the box as a result of sparsity. Even if the ANN does not capture the complete representation of the input space, it is still good enough for the latent layer to do its job. This also depends on the size of the sparse latent space and the level of sparsity; as a rule of thumb, I would not go beyond 10%.

Sparse ANNs are an active area of research, mostly because of their close relationship to biological neurons. According to neuroscience, the brain itself represents information as series of sparse activations in cortical neurocircuits. There are tremendous advantages to this, but that is another post for another day.


For those interested in exploring this a bit further, please use the following link.


