OFFENSIVE AI - When Good Computers Go Bad!

I am always being asked for examples of how Machine Learning can be used by the Bad Guys. I have always struggled to find real-world examples.

A few months back, while I was searching on Google to discover how to fix something of a more pressing nature, I spotted the headline and a few preview lines of an article explaining why web portals now ask me how many of 9 possible pictures contain fire hydrants to prove “I am not a robot”. Apparently, according to the couple of lines I read, AI has proved a threat to “Captcha”. I should have read more, or at least checked the date. But at the time it seemed a great answer to this question.

What is CAPTCHA?

CAPTCHAs are those annoying words written in bizarre fonts that help sites reduce the impact of brute-force attacks; these attacks could be used to defeat password changes, feed spam engines and enable any number of bot-based malfeasance. Even though it is incredibly frustrating, some of the phrases could sometimes bring a smile to my lips: “Improve Wages”, “Badger Dispute”, “peacock tap”...

[Image: example of a text CAPTCHA]

This mechanism has been used for what seems like decades and you have all seen it before. What I bet you didn’t know is that CAPTCHA apparently stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”, a made-up acronym that screams of being chosen before the supposed full name; and in point of fact the technique bears no similarity to the Turing Test.

Nonetheless, adding CAPTCHA to your web app as a security measure can bolster the sign-in, sign-up or password-change processes, and it has proved popular. It is mildly effective and not difficult to include.

Badness

My first impression of the headline was that this was one of the rare uses of machine learning by bad actors in an offensive application. This was noteworthy. Surely AI being used to “recognise” the weird fonts and defeat this authentication mechanism could be exciting. I should have checked the date: the article was talking about defeating the newer version of CAPTCHA. This mundane and altogether less interesting improvement is based on behaviour (whatever that means, that passed me by) and is shown below.

[Image: the newer, behaviour-based CAPTCHA]

Reading only one line more (I could have missed more key points!), I discovered the text-based version had long been defeated. But all was not lost: that was defeated by AI.

And my heart was set. My thoughts were: “I can design a machine learning proof of concept to do that easily on the back of a fag packet and then write the code”. This mobilised me; who cares if some professor has done it before or I am a little off track.

Clever?

Well, not really. I hate to burst anybody’s bubble, but it isn’t as difficult as creating HAL, the naughty computer in “2001: A Space Odyssey”. If you look below at these few samples, the task becomes clear.

[Image: sample text CAPTCHAs]


Observation shows us that there is a finite set of distinct characters/alphabets being utilised, with the aim of defeating the workaday “screen-scraping and replay attack” techniques that would otherwise work on more uniform HTML with an Arial font. Here something "thinking" is needed to interpret the image of the character and then type it in as machine-readable characters. The intended human can do that easily, but so can a Neural Network.

Observation also shows that these fonts repeat themselves. It is also clear that each word draws its characters from a single font style and that there are a limited number of character sets. Take a look at the words where there are repeated letters: the repeated characters are identical.

[Image: CAPTCHA words containing repeated letters]

This means the population of character sets is relatively small and the characters within each set will be limited to [A-Za-z0-9]. This makes harvesting a representative population for teaching a learning system feasible. The normal screen-scraping methods cannot be used directly to defeat the authentication, but they can be used to repeatedly collect the entire population of letters and digits to build a learning dataset. A human operator then labels each character, after which it becomes a simple example of supervised learning using a Neural Network.

To emulate this, I simplified my proof-of-concept example. Instead of compromising CAPTCHA and similar tools, I imagined a similar situation where I interpret handwritten characters in the range 0-9. Surely this is a valid test! After I achieved success, I could always expand the test to include the words above (i.e. the weird captcha word dpbaiajz from above) to check that the technique worked. My clumsy handwritten digits are shown below.

[Image: the author's handwritten digits 0-9]

The Experiment

STEP 1: Having penned the digits using my touch-screen Lenovo Yoga (yes, my handwriting is really that bad), I saved each initially as a 32 by 32 pixel Black & White BMP. Then, for reasons of compatibility explained elsewhere (buy the second edition of my last book in a month or two), I converted the pictures into 28 by 28 images. This was achieved with the extremely rare and complex MS Paint program.

Cheating, I hear you cry! That’s not automated capture; you are interpreting the character yourself. No: as mentioned before, these are distinct characters stored in a one-word image. If you look at the magnified example below, there is clearly white space in between each letter. It would be a simple task to find public domain software that could capture the image and then use another program to extract the individual characters from the word. It is an easy “edge detection” problem.

[Image: magnified CAPTCHA word showing white space between the letters]
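To make the point about white space concrete, here is a minimal Python sketch of splitting a word into characters. It assumes the word image has already been reduced to rows of 0 (background) and 1 (ink), and it simply cuts wherever a whole column is blank. That is a column-projection trick rather than true edge detection, and the function name is made up for illustration.

```python
def split_characters(rows):
    """Split a word image (rows of 0/1, 1 = ink) into character images at blank columns."""
    width = len(rows[0])
    # A column "has ink" if any row has a 1 in it.
    has_ink = [any(row[x] for row in rows) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # first inked column of a new character
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column closes the current character
            start = None
    if start is not None:
        spans.append((start, width))       # character touching the right-hand edge
    # Cut every row at the same column boundaries.
    return [[row[a:b] for row in rows] for a, b in spans]
```

This only works while the letters do not touch or overlap, which is what the magnified example suggests for this style of CAPTCHA.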

STEP 2: The task here was to extract this data into a form that a NN could use. Black & White bitmap images are stored upside down (bottom row first), with each byte representing eight 1-bit pixels. NNs generally operate on numbers in the range -1 to +1 or 0 to 1, so I chose the latter as our data format. All I had to do was write out a comma-separated (CSV) ASCII art file where 0 represented a white background and 1 represented an ink line of our character.

I did a “pretty-print” to show what I mean:

[Image: pretty-printed ASCII art of the digits]

Above, our digits can be seen quite clearly in the ASCII art image produced by our extraction program.

Achieving this wasn’t too challenging: five minutes on Wikipedia to get the format for bitmap images (BMP), and then 25 hours here or there to do a bit of C with a “for-loop” and “sprintf” sorted it out. I am particularly pleased with this sprintf, so I have included it below.

[Image: the sprintf line from the C extraction program]
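The C program itself only survives as an image, so the following is not the original code. It is a rough Python sketch of the same extraction, assuming an uncompressed 1-bit BMP with the common 40-byte info header. The function names are hypothetical, and `ink_bit=0` assumes palette entry 0 is black, which should be checked against the file's palette.

```python
import struct

def bmp_1bpp_to_rows(data, ink_bit=0):
    """Decode a 1-bit BMP (as bytes) into rows of 0 (background) / 1 (ink), top row first."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset = struct.unpack_from("<I", data, 10)[0]   # where the pixel data starts
    width, height = struct.unpack_from("<ii", data, 18)
    bits_per_pixel = struct.unpack_from("<H", data, 28)[0]
    if bits_per_pixel != 1:
        raise ValueError("expected a 1-bit image")
    stride = ((width + 31) // 32) * 4        # each row is padded to a 4-byte boundary
    rows = []
    for r in range(abs(height)):
        start = pixel_offset + r * stride
        row_bytes = data[start:start + stride]
        row = []
        for x in range(width):
            # The most significant bit of each byte is the leftmost pixel.
            bit = (row_bytes[x // 8] >> (7 - x % 8)) & 1
            row.append(1 if bit == ink_bit else 0)
        rows.append(row)
    if height > 0:                           # positive height: file stores the bottom row first
        rows.reverse()
    return rows

def to_csv_line(label, rows):
    """Flatten a character into 'label,p1,p2,...' for the learning file."""
    return ",".join([str(label)] + [str(p) for row in rows for p in row])
```

For a 28 by 28 image, `to_csv_line` yields the label followed by 784 values, one line per character.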

One digit of raw CSV with the label (“1”) for training data is shown below using MS EXCEL.

[Image: one digit of raw CSV, labelled “1”, in MS Excel]

STEP 3: Back to school, teaching the Neural Network.

This is simple. Produce several variants for each digit 0-9, run the program to produce an image file for each, add the appropriate label, and then pump this into your neural network program -- simples.

The key "magic" of the neural network program is shown below using SKLEARN (simply because I like SKLEARN); you could easily use the data engineering tooling of your choice.

[Image: the SKLEARN training code]
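Since the code above is only available as an image, here is a minimal sketch of what such a training step might look like with scikit-learn's `MLPClassifier`. It is not the original program: the helper names, the single hidden layer of 64 units and the iteration cap are all assumptions chosen for illustration.

```python
from sklearn.neural_network import MLPClassifier

def parse_line(line):
    """Split 'label,p1,...,pN' into (label, [pixels as ints])."""
    fields = line.strip().split(",")
    return fields[0], [int(v) for v in fields[1:]]

def train(lines, hidden=(64,), seed=0):
    """Fit a small feed-forward network on labelled 0/1 pixel rows."""
    labels, pixels = zip(*(parse_line(line) for line in lines))
    model = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=seed)
    model.fit(list(pixels), list(labels))
    return model
```

Once trained, `model.predict([pixels])` returns the guessed label for an unseen character, and `model.score(test_pixels, test_labels)` gives the fraction guessed correctly.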

STEP 4: Testing the Proof of Concept model

I took the original images, cloned them and produced a small test population of variants by squishing them, thinning or thickening the lines, and generally shifting them about a bit. With virtually no effort I got a success rate in excess of 80%.

[Image: the SKLEARN test code]

The SKLEARN code is shown above.
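The test variants were made by hand, but two of the distortions mentioned (shifting and thickening) are easy to sketch in code. This is an illustrative Python sketch, not the method used for the experiment; it assumes each character is a list of rows of 0/1 values and the function names are made up.

```python
def shift(rows, dx, dy):
    """Move the ink dx pixels right and dy pixels down; ink pushed off the edge is lost."""
    h, w = len(rows), len(rows[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if rows[y][x] and 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = 1
    return out

def thicken(rows):
    """Set a pixel to ink if it, or any of its four direct neighbours, is ink."""
    h, w = len(rows), len(rows[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x), (y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and rows[ny][nx]:
                    out[y][x] = 1
                    break
    return out
```

Negative values of `dx` and `dy` move the character left and up.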

So what makes the attack possible?

This old version of CAPTCHA solved a problem of its time, but times change. There are some flaws derived from absent security basics:

  • This authentication used a finite list of words and characters that did not use any known computer font, which would defeat the basic OCR technology of the day. Great! But as the lists were finite, there was low entropy, which made the mechanism's obsolescence foreseeable. In simple terms, it could be memorised by a NN.
  • The full database of characters was available to all. Once a suitable recognition technology was available, the population of characters was easily farmed for learning. There were few restrictions on any automated program just collecting word after word by pressing the “please give me another” button. The defence against this would be an equivalent to a traditional “password lockout” count.
  • This same mechanism meant that if you did not know the answer to any particular CAPTCHA question, you could just press this redo button to get one you did know; you got unlimited attempts. A traditional “password lockout” count would inhibit this too.
  • There was no “login time-out” on entering the CAPTCHA. This gave as much time as needed for our “AI” to chug away and “break the code”.

So, because the system did not implement what we call traditional controls (an equivalent of a login retry limit and password lockout), it was eminently more hackable.

So how do I improve our lab-based simulation to make it more effective?
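The missing controls listed above can be sketched in a few lines. This is a hypothetical illustration in Python, not a description of how any real CAPTCHA service works: the class name, the limits and the injectable clock are all assumptions.

```python
import time

class ChallengeGate:
    """Hypothetical per-client limits on new challenges, wrong answers and solve time."""

    def __init__(self, max_challenges=3, max_failures=3, ttl_seconds=60, clock=time.monotonic):
        self.max_challenges = max_challenges
        self.max_failures = max_failures
        self.ttl_seconds = ttl_seconds
        self.clock = clock                  # injectable so the time-out can be tested
        self.challenges = 0
        self.failures = 0
        self.answer = None
        self.issued_at = None

    def locked(self):
        return self.challenges > self.max_challenges or self.failures >= self.max_failures

    def issue(self, answer):
        """Record a new challenge; returns False once the client has asked for too many."""
        self.challenges += 1
        if self.locked():
            return False
        self.answer = answer
        self.issued_at = self.clock()
        return True

    def check(self, response):
        """Accept only a correct, in-time answer from a client that is not locked out."""
        if self.locked() or self.answer is None:
            return False
        expired = self.clock() - self.issued_at > self.ttl_seconds
        ok = (not expired) and response == self.answer
        self.answer = None                  # each challenge can be answered only once
        if not ok:
            self.failures += 1
        return ok
```

Capping the number of challenges limits both farming and "redo until I know it", while the time-to-live stops a slow solver from chugging away indefinitely.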

How can we make the attack better?

There are several things that could be done to make my model more effective.

The first would be to add some pre-filtering to standardise the text position. This, in my day, would be done by a convolutional neural network layer: one or more additional layers for filtering in a neural network, most commonly applied to analysing visual imagery. The takeaway here is PRE-filter. In our case, top-justify and left-justify:

  • Top-justify: This is simplicity itself: “if PIXEL[0][0..27] == 0 then shift up (deleting row 0)”. Effectively, if row 0 contains all zeros, shift all the rows up. The results can be seen below.
[Image: a digit before and after top-justifying]
  • Left-justify: This is the same operation as the above, only it works on column 0. Programmatically: “if PIXEL[0..27][0] == 0 then shift left”.
  • Obviously, increasing the size of the learning database beyond the few dozen used will generally improve results. I did this by loading the MNIST characters. It was because of the MNIST project that I moved from a 32 by 32 system to the 28 by 28 one described above.
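The two justify operations can be written out as a short Python sketch. It assumes each character is a list of rows of 0/1 values and repeats the single shift from the pseudocode until ink reaches the edge; the function names are made up for illustration.

```python
def top_justify(rows):
    """Shift the character up until row 0 contains ink, padding the bottom with blank rows."""
    rows = [list(r) for r in rows]
    width = len(rows[0])
    if not any(any(r) for r in rows):
        return rows                          # blank image: nothing to align
    while not any(rows[0]):
        rows.pop(0)                          # delete the empty top row...
        rows.append([0] * width)             # ...and keep the height constant
    return rows

def left_justify(rows):
    """Shift the character left until column 0 contains ink, padding the right with blanks."""
    rows = [list(r) for r in rows]
    if not any(any(r) for r in rows):
        return rows
    while not any(r[0] for r in rows):
        rows = [r[1:] + [0] for r in rows]   # drop column 0, append a blank column
    return rows
```

Applying both to every training and test character puts the ink in the same corner each time, so the network does not have to learn position as well as shape.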

MNIST

The Modified National Institute of Standards and Technology (MNIST) database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by modifying samples from NIST datasets, which were collected from American Census Bureau employees and American high school students.

The MNIST database contains 60,000 training images and 10,000 testing images.

The MNIST data uses 28x28 pixel images with a 0-255 greyscale coding, as shown below.

[Image: an MNIST digit as a 28x28 grid of 0-255 greyscale values]

I used the code below to convert from the 0-255 into the simple 0 or 1 format.

# Convert MNIST greyscale pixels (0-255) to black & white ASCII art (0 or 1)
# and write out a labelled learning file.

$x = Get-Content .\mnist_test.csv       # read in a huge file: one character per line
$bwchars = @()                          # empty array for our new black & white database of characters

for ($i = 1; $i -lt $x.Count; $i++)     # loop round each character; starts at 1, assuming line 0 is a header row
{
    $f = $x[$i].Split(",")              # make an array: the label followed by 784 greyscale pixels
    $char = $f[0]                       # training label
    for ($n = 1; $n -lt 785; $n++)      # for each "pixel"
    {
        $char += ","
        if ([int]$f[$n] -gt 0) {        # convert from greyscale to 1 or 0, i.e. black or white
            $char += "1"                # this pixel is foreground, so set it to black
        }
        else {
            $char += "0"                # this is a background pixel, so white
        }
    }
    $bwchars += $char
}

# write out our b&w characters as learning data for our neural net
$bwchars | Out-File -Encoding ASCII character-asciiimage-learning-with-label.csv

Conclusion

I have noted in the last week or so that there seem to be many posts on AI and CAPTCHA. I have made a concerted effort not to read them lest I am accused of plagiarism; I am hoping they are covering different areas or that I have gone too far off track.

At least with this article, you get to see a different perspective. One which steps through the process to demystify some of the "magic".






Chris Brookes-Mann

HM Principal Specialist Inspector | Chemicals, Explosives and Microbiological Hazards Division

5 years ago

I suppose in terms of proof of principle, the fact that handwritten postcodes/ZIP codes have been read by machines for years (decades?) shows it’s been possible for a long time even if the capability came a great expense in the past. As for the current “breed” of reCAPTCHAs, this xkcd comic probably explains the current situation better than I ever could! https://xkcd.com/1897/
