Words that AI can understand
Unlike humans, AI (Artificial Intelligence) can operate only on numbers, so human language must be converted into streams of numerical values. One could think that this is simple because words are built from distinct units, which in English are letters that already have numerical codes such as ASCII. However, there are two major problems. First, letters carry no meaning and only represent sounds. Second, the entire vocabulary of a human language is too vast to build a practical language model that a machine could be trained on. Since letters do not represent meaning and there are too many words, the solution has to meet both problems in the middle: if words can be broken into smaller groups of letters that are reused across many words but still carry some meaning, a much smaller vocabulary can represent a natural language.
Humans learn similarly: we do not memorize written words but rather groups of sounds and their meanings. Evidence of this is that babies can understand and speak before learning a written language, and people who are illiterate can still communicate verbally. Fluency in a language requires more than knowing those groups of sounds; they must be associated with their context of use, in other words their semantics. Memorizing words alone is ineffective unless the context is learned at the same time, which happens through extensive reading and live verbal communication. Artificial neural networks learn the same way, by processing vast amounts of text. While people match words to known sounds, machines match the groups of letters mentioned earlier, called tokens, to the words they receive. Like people, machines must learn semantics to make sense of the word sequences they receive as inputs, and to do so they must process a lot of text with valid syntax and context. Otherwise, a machine can mistune its parameters and function incorrectly. Much has been said about bias in artificial intelligence, but people learn the same bias if they only have access to inappropriate content such as fake news or propaganda.
To understand the conversion of words into streams of numerical values, called embedding, let’s look at the example shown in Figure 1. Tokenization.
During the first step of preparing word sequences, such as sentences or paragraphs, for processing by a neural network, all words are tokenized. This process takes each word from a sequence and breaks it into one or more tokens that match the word’s text. Once the tokens are found, the corresponding vectors of numbers are concatenated into a word embedding, as shown in Figure 2. Word embedding. Those component vectors are called token embeddings.
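The breaking of a word into known tokens can be sketched as a greedy longest-match search over a token vocabulary. The vocabulary below is a toy example invented for illustration; real tokenizers use vocabularies of tens of thousands of subwords learned from data.

```python
# Toy token vocabulary (hypothetical; real vocabularies are learned from text).
TOKEN_VOCAB = {"un", "break", "able", "think", "ing"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest matching vocabulary tokens, left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest candidate first
            if word[i:j] in TOKEN_VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # fall back to a single letter
            i += 1
    return tokens

print(tokenize("unbreakable"))  # ['un', 'break', 'able']
print(tokenize("thinking"))     # ['think', 'ing']
```

Each resulting token is then looked up in an embedding table, and the token embeddings are concatenated into the word embedding described above.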
The general idea in choosing those numbers is that tokens that often appear near each other in text should have more similar embedding vectors than tokens that are usually far apart. Since token embeddings are vectors, and in this example they have two components (in practice many more are used), we can picture them as follows.
The most commonly used measure of similarity is cosine similarity, which for two vectors T1 and T2 can be expressed as cos(θ) = (T1 · T2) / (‖T1‖ ‖T2‖).
The numerator is the dot product of the two vectors, while the denominator is the product of their norms. In practice, the dot product alone is often sufficient to measure similarity because the vectors are usually normalized or close to unit length. The values of embedding vectors can be trained with backpropagation by putting nearby and distant tokens together in a training batch and tuning their coordinates based on their similarities.
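Cosine similarity is straightforward to compute directly from that definition. A minimal sketch, using two-dimensional toy embeddings like those pictured above (the numeric values are illustrative, not taken from a trained model):

```python
import math

def cosine_similarity(t1: list[float], t2: list[float]) -> float:
    """cos(theta) = (t1 . t2) / (|t1| * |t2|)"""
    dot = sum(a * b for a, b in zip(t1, t2))
    norm1 = math.sqrt(sum(a * a for a in t1))
    norm2 = math.sqrt(sum(b * b for b in t2))
    return dot / (norm1 * norm2)

# Vectors pointing in similar directions score close to 1,
# orthogonal vectors score close to 0.
print(cosine_similarity([0.9, 0.4], [0.8, 0.5]))   # high similarity
print(cosine_similarity([0.9, 0.4], [-0.4, 0.9]))  # low similarity
```

For unit-length vectors the denominator equals 1, which is why the dot product alone usually suffices in practice.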
There are quite a few methods of embedding, but we focus on one that is suitable for transformer neural networks. Although the embedding just shown can already be processed by an RNN (Recurrent Neural Network) with LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) elements, there is a clear advantage to using a Transformer network. The first advantage is speed: unlike an RNN, which is sequential, a Transformer can process words in parallel, taking advantage of multiple processing units such as the cores of a CPU (Central Processing Unit) or GPU (Graphics Processing Unit). The second is the length of a sequence: LSTM and GRU networks tend to “forget” the context at the beginning of a sequence if it is too long.
However, for a Transformer to learn or recognize the context of a sequence of words, the embedding just shown is insufficient. Because words are processed in parallel, the Transformer must be informed of the location of each word in the sequence. Again, one could think this is simple because we could assign each word a positional index within the sequence, but that approach has a fundamental problem: since the index is small at the beginning and grows toward the end of a sequence, the network would assume that the same word carries greater importance at the end of a sequence than at the beginning. Several methods of positional encoding can be found in various publications, e.g. “Attention Is All You Need” by Ashish Vaswani et al., but we will focus on sinusoidal encoding with fixed wavelengths, as shown in Figure 4. Positional encoding.
Instead of assigning an index, this method generates positional values from sine and cosine functions at progressively longer wavelengths. It is similar to binary encoding, but instead of zeros and ones (e.g. 01001, 11011) it uses real numbers, which better fits the format of the embedding. Once the vectors formed by summing the embedding and positional encoding of each word are created, the self-attention vectors can be computed, but that is the topic of the next article in this series.
Now it is time for a final word to answer the question: what does artificial intelligence have to do with cybersecurity? As we mentioned before, the Vault Platform and Vaulter app are intended to protect primarily the end users’ assets as an augmentation of existing security models, which focus mainly on service providers. Think of Vaulter as the security provider for edge computing devices. This may increase the complexity of the process, but it also opens up new avenues for how services can be provided to end users. Artificial intelligence can remove many of the obstacles that such complexity creates. Since this system is about improving security on the user’s side, many security-critical tasks cannot be delegated to cloud-based AI but must be handled on the user’s device, even if that smaller version of AI has limited ability compared to its big sister in the cloud.
Thank you for reading our newsletter and stay tuned for more updates from Vault Security!