Handwritten Text Recognition using Deep Learning (CNN & RNN)

1. ABSTRACT

Handwriting detection is the capacity of a computer to accept and analyze legible handwritten input from sources such as paper documents, touch screens, and pictures. Handwritten text recognition is one kind of pattern recognition, whose goal is to classify data or objects into one of a set of groups or categories. Traditional handwriting recognition algorithms have relied on handcrafted features and a large amount of prior knowledge.

It is difficult to train an optical character recognition (OCR) system under these conditions. In recent years, deep learning algorithms have produced breakthrough results in handwriting recognition research.

Nonetheless, the steady growth in the volume of handwritten data, along with the availability of vast computing power, demands improvements in recognition accuracy and warrants further exploration. Convolutional neural networks (CNNs) are particularly good at perceiving the structure of handwritten characters and words in ways that aid the automatic extraction of distinguishing features, making CNNs well suited to handwriting recognition problems. This technology will be used to identify text in various formats.

Handwriting itself has evolved to the point that many kinds of handwritten characters are encountered, including digits, numerals, cursive script, symbols, and scripts in English and other languages.

Keywords: CNN, RNN, CTC, TensorFlow, OCR, Softmax.

2. INTRODUCTION

An OCR system depends mainly on the extraction of features and the discrimination/classification of these features based on patterns. Handwritten OCR has received increasing attention as a subfield of OCR.

Handwritten Text Recognition is a technology that is much needed in today's world. Before this technology was properly implemented, we relied on writing text by hand, which can result in errors. Physical data is difficult to store and access efficiently, and manual labor is required to keep it properly organized. Throughout history, severe loss of data has occurred because of traditional methods of storing it. Modern technology lets people store data on machines, where storage, organization, and access are relatively easy.

By adopting Handwritten Text Recognition software, data that was traditionally stored on paper becomes easier to store and access, and it also gains more security. One such example of handwritten text recognition software is Google Lens. The aim of our project is to build an application that can recognize handwriting using concepts of deep learning. We approach the problem with CNNs because they provide better accuracy on such tasks.

3. OBJECTIVE

Handwritten text recognition is one of the areas of pattern recognition. The purpose of pattern recognition is to categorize or classify data or objects into one of a set of classes or categories. Handwriting recognition is defined as the task of transforming a language represented in its spatial form of graphical marks into its symbolic representation. Each script has a set of icons, known as characters or letters, which have certain basic shapes.

The goal of handwriting recognition is to identify input characters or images correctly so they can then be passed on to automated processing systems. This system will be applied to detect writing in different formats. Handwriting has developed into many kinds of handwritten characters, such as digits, numerals, cursive script, symbols, and scripts in English and other languages. The automatic recognition of handwritten text can be extremely useful in applications where large volumes of handwritten data must be processed, such as recognizing addresses and postcodes on envelopes, interpreting amounts on bank checks, document analysis, and verifying signatures. Computers therefore need to be able to read documents and data to ease document processing.

Artificial Intelligence is a branch of computer science that endeavors to replicate or simulate human intelligence in a machine, so machines can perform tasks that typically require human intelligence. Some programmable functions of AI systems include planning, learning, reasoning, problem solving, and decision making.

Artificial intelligence systems are powered by algorithms, using techniques such as machine learning, deep learning and rules. Machine learning algorithms feed computer data to AI systems, using statistical techniques to enable AI systems to learn. Through machine learning, AI systems get progressively better at tasks, without having to be specifically programmed to do so.

Deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence. Deep learning algorithms attempt to draw conclusions similar to those humans would reach by continually analyzing data with a given logical structure. To achieve this, deep learning uses a multi-layered structure of algorithms called neural networks.

The design of a neural network is based on the structure of the human brain. Just as we use our brains to identify patterns and classify different types of information, neural networks can be taught to perform the same tasks on data.

Natural language processing (NLP) refers to the branch of computer science, and more specifically the branch of artificial intelligence (AI), concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics (rule-based modelling of human language) with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker's or writer's intent and sentiment.

NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time. There’s a good chance we’ve interacted with NLP in the form of voice-operated GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, and other consumer conveniences. But NLP also plays a growing role in enterprise solutions that help streamline business operations, increase employee productivity, and simplify mission-critical business processes.

CNN image classification takes an input image, processes it, and classifies it under certain categories (e.g., dog, cat, tiger, lion). A computer sees an input image as an array of pixels of size h x w x d (h = height, w = width, d = depth), depending on the image resolution. For example, an RGB image may be a 6 x 6 x 3 array (3 refers to the RGB channels), while a grayscale image may be a 4 x 4 x 1 array, where 3 and 1 are the number of color values needed to represent a pixel. The following are the types of layers commonly found in CNNs (a minimal Keras sketch appears after the list):

1. Convolution – The main building block of a CNN is the convolutional layer, a mathematical operation that merges two sets of information. The convolution is applied to the input data using a convolution filter to produce a feature map; it is the first layer used to extract features from an input image.

2. Striding – The number of pixels by which the filter shifts over the input image is called the stride. For example, when the stride is 1, the filter moves 1 pixel at a time.

3. Non-Linearity – The activation function generally used is ReLU (Rectified Linear Unit), a non-linear operation whose output is:

f(x) = max(0, x)

4. Pooling Layer – After a convolution operation we usually perform pooling to reduce the dimensionality. This reduces the number of parameters, which in turn shortens training time and reduces the chance of overfitting. Pooling layers downsample each feature map independently, reducing its height and width while keeping the depth intact. Unlike the convolution operation, pooling has no parameters: it slides a window over its input and simply takes the maximum value in the window, so the largest element in each region of the feature map is selected.
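As a rough illustration of how these layer types fit together, the following is a minimal sketch using TensorFlow/Keras. The input size, filter counts, and the 62-class Softmax output are illustrative assumptions, not the exact architecture described later in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolution: slide 32 filters of size 3x3 over the input (stride 1) to build feature maps.
    layers.Conv2D(32, kernel_size=(3, 3), strides=(1, 1), activation="relu",
                  input_shape=(28, 28, 1)),
    # Pooling: take the maximum over 2x2 windows, halving height and width.
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    # Flatten the feature maps and classify with a Softmax output layer.
    layers.Flatten(),
    layers.Dense(62, activation="softmax"),  # e.g. 62 classes: digits plus upper/lowercase letters
])
model.summary()
```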

4. LITERATURE REVIEW

We have decided to use the MNIST dataset [1]; several accomplishments have already been achieved with it. Even before deep learning, handwritten text recognition was possible, but accuracies were low or the datasets were relatively small, as noted by Line Eikvil [2]. That paper also discusses uses of OCR alongside technologies such as speech recognition, radio frequency, vision systems, magnetic stripes, bar codes, and optical mark reading. Classifying the MNIST dataset, a dataset of handwritten digits, is a popular machine learning task. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis" by Simard, Steinkraus, and Platt is a valuable paper for understanding the use of convolutional neural networks (CNNs). For word recognition, a paper by Pham et al. used a 2-layer CNN that fed into a bidirectional recurrent neural network (RNN) with Long Short-Term Memory (LSTM) cells [3].

The best model implemented, in our view, is by Graves and Schmidhuber, with a multidimensional RNN structure [4]. Another paper on handwritten text recognition by M.J. Castro-Bleda also dealt with a dataset containing slanted words and corrected them during pre-processing [5]. "Development of English Handwritten Recognition Using Deep Neural Network" by Teddy Surya and Ahmad Fakhrur uses a deep neural network model with two encoding layers and one Softmax layer on the EMNIST dataset; its accuracy was much better than the earlier proposed patterned and feedforward ANNs (artificial neural networks) [6]. Handwritten text recognition can also be achieved with relaxation convolutional neural networks (R-CNN) and alternately trained relaxation convolutional neural networks (ATRCNN), as done by ChunPeng Wu and Wei Fan [7]. Our model achieved accuracy over 87 percent using convolutional neural networks from the Keras [8] library.

In the past few years, the CNN model has been extensively employed for handwritten digit recognition on the MNIST benchmark database. Some researchers have reported accuracies as high as 98% or 99% for handwritten digit recognition [8]. An ensemble model has been designed using a combination of multiple CNN models; the recognition experiment was carried out on MNIST digits, and an accuracy of 99.73% was reported [9]. Later, this "7-net committee" was extended to a "35-net committee" experiment, and the improved recognition accuracy was reported as 99.77% on the same MNIST dataset. An extraordinary recognition accuracy of 99.81% was reported by Niu and Suen by integrating the SVM's (support vector machine's) capability of minimizing structural risk with a CNN model's capability of extracting deep features in an MNIST digit recognition experiment [1]. Bend-directional feature maps were investigated using CNNs for in-air handwritten Chinese character recognition [9]. Recently, the work of Alvear-Sandoval et al. achieved a 0.19% error rate on MNIST by building diverse ensembles of deep neural networks (DNNs). However, on careful investigation, it can be observed that such high recognition accuracies on the MNIST dataset are achieved through ensemble methods only. Ensemble methods help improve classification accuracy, but at the cost of higher testing complexity and increased computational cost in real-world applications.

The EMNIST dataset is a collection of handwritten alphanumeric characters derived from the NIST Special Database 19. Each image is converted to a 28 x 28 format, and a dataset structure that directly matches the original is used. The training set has 697,932 images and the test set has 116,323 images of uppercase and lowercase letters and the numerals 0-9, each mapped to its corresponding class. The training and test sets are available as lists of lists: each item of the outer list represents an image, and the inner list holds the intensity values of its 784 pixels (since each image is 28 x 28 pixels), ranging from 0-255. Both the test and training images have a white foreground on a black background, and both are stored horizontally flipped and rotated 90 degrees clockwise. Y_train and Y_test are arrays containing numbers ranging from 0 to 61, since there are 10 numerals (0-9) plus 26 uppercase and 26 lowercase letters, which adds up to 62 classes. [1]
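Because each image arrives as a flat list of 784 values and is stored flipped and rotated, it has to be reshaped and re-oriented before training. The following is a small NumPy sketch under the assumption that the raw images and labels are held in Python lists as described above; the variable and function names are illustrative.

```python
import numpy as np

# x_raw: list of images, each a list of 784 intensity values (0-255);
# y_raw: integer labels 0-61. These names are assumptions for the sketch.
def prepare(x_raw, y_raw):
    x = np.array(x_raw, dtype=np.float32).reshape(-1, 28, 28)
    # The stored images are horizontally flipped and rotated 90 degrees clockwise,
    # so undo that here (rotate counter-clockwise, then flip). Verify one sample
    # visually, since conventions differ between dataset loaders.
    x = np.flip(np.rot90(x, k=1, axes=(1, 2)), axis=2)
    x = (x / 255.0)[..., np.newaxis]          # scale to [0, 1], add a channel axis
    y = np.array(y_raw, dtype=np.int64)
    return x, y
```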

5. SYSTEM DESIGN

Handwritten Text Recognition (HTR) systems take handwritten text in the form of scanned images as input, as shown in Figure 1. We are going to build a neural network (NN) trained on word-images from the IAM dataset. Because the input layer (and therefore all the other layers) can be kept small for word-images, NN training is possible on the CPU (of course, a GPU would be better). For the implementation of HTR, the minimum requirement is TensorFlow.

The block diagram of the proposed handwritten character classification is shown in Figure 2. The proposed recognition technique relies on a convolutional neural network (CNN) model with a feature-mapped output layer. Our proposed model will classify the given input into one of 10 classes using the CNN when classifying the Urdu numerals; similarly, when classifying the Urdu characters, the same model will classify the given character into one of 12 classes (see Figure 2). Details of the different phases of our proposed model follow in the subsections below.

A process model groups processes of the same nature into a single model; it is thus a description of a process at the type level. Since the process model is at the type level, a process is an instantiation of it. The same process model is used repeatedly for the development of many applications and therefore has many instantiations. One possible use of a process model is to prescribe how things must, should, or could be done, in contrast to the process itself, which is what really happens. A process model is roughly an anticipation of what the process will look like; what the process actually becomes is determined during system development.

The goal of a process model is to be:

Descriptive

  • Track what happens during a process.
  • Take the point of view of an external observer who looks at the way a process has been performed and determines the improvements that must be made to make it perform more effectively or efficiently.

Prescriptive

  • Define the desired processes and how they should/could/might be performed.
  • Establish rules, guidelines, and behavior patterns which, if followed, would lead to the desired process performance. They can range from strict enforcement to flexible guidance.

Explanatory

  • Provide explanations about the rationale of processes.
  • Explore and evaluate several possible courses of action based on rational arguments.
  • Establish an explicit link between processes and the requirements that the model needs to fulfil.
  • Pre-define points at which data can be extracted for reporting purposes.

1. Image Preprocessing

The image is preprocessed using different image processing operations such as image inversion, grayscale conversion, and image thinning.
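A minimal sketch of such a preprocessing step is shown below, assuming OpenCV is used. The erosion step only approximates thinning; a dedicated thinning routine could be substituted.

```python
import cv2
import numpy as np

def preprocess(path):
    """Grayscale conversion, inversion, and a thinning-like cleanup (illustrative)."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # grayscale conversion
    inverted = cv2.bitwise_not(gray)                  # invert: dark ink becomes bright foreground
    # Binarise with Otsu's threshold, then erode to roughly approximate stroke thinning.
    _, binary = cv2.threshold(inverted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    thinned = cv2.erode(binary, np.ones((2, 2), np.uint8), iterations=1)
    return thinned
```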

2. Segmentation

After preprocessing of the image, segmentation is done with the help of the following steps (a rough code sketch follows the list):

  • Remove the borders
  • Divide the text into rows
  • Divide the rows (lines) into words
  • Divide the word into letters
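One simple way to implement these segmentation steps is with projection profiles on the preprocessed binary image: empty rows separate lines of text, and empty columns separate words (and, with a smaller gap threshold, letters). The sketch below is illustrative and assumes a white foreground on a black background, as produced by the preprocessing step above.

```python
import numpy as np

def split_on_gaps(binary, axis, min_size=2):
    """Split a binary image along an axis wherever its projection profile is empty."""
    profile = binary.sum(axis=axis)           # per-row sums (axis=1) or per-column sums (axis=0)
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                          # entering a text region
        elif value == 0 and start is not None:
            if i - start >= min_size:
                segments.append((start, i))    # leaving a text region
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

def segment(binary):
    # Rows of text come from the horizontal profile, words from vertical profiles per row.
    lines = [binary[r0:r1, :] for r0, r1 in split_on_gaps(binary, axis=1)]
    return [[line[:, c0:c1] for c0, c1 in split_on_gaps(line, axis=0)] for line in lines]
```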

3. Feature Extraction

Once the characters are segmented, we generate binary glyphs and calculate the sum of each row's and each column's values as features.
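A sketch of this feature extraction, assuming a binary glyph stored as a NumPy array, could look like the following.

```python
import numpy as np

def glyph_features(glyph):
    """Concatenate row sums and column sums of a binary glyph into a feature vector."""
    binary = (glyph > 0).astype(np.float32)       # ensure a 0/1 glyph
    row_sums = binary.sum(axis=1)                 # one value per row
    col_sums = binary.sum(axis=0)                 # one value per column
    return np.concatenate([row_sums, col_sums])   # e.g. 28 + 28 = 56 features for a 28x28 glyph
```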

4. Classification

In this phase, we train and test the neural network to finally identify the answer.

The proposed model architecture involves pre-processing the data and then feeding it into the CNN, whose output is fed into the RNN network, which then finalises and gives us our output.
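A hedged sketch of such a CNN-plus-RNN (CRNN) architecture in Keras is shown below. The image size, the number of convolutional and LSTM layers, and the size of the character set are illustrative assumptions; training such a model would additionally require a CTC loss, which is only noted in a comment here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CHARS = 80   # illustrative character-set size (assumption)

def build_crnn():
    inputs = layers.Input(shape=(32, 128, 1), name="image")   # height x width x channels
    # CNN feature extractor: convolution + pooling stages shrink the image
    # while building progressively richer feature maps.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)                         # -> (16, 64, 32)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)                         # -> (8, 32, 64)
    # Treat the width axis as time: move it first, then flatten height x channels.
    x = layers.Permute((2, 1, 3))(x)                           # -> (32, 8, 64)
    x = layers.Reshape((32, 8 * 64))(x)                        # -> (32, 512)
    # RNN: bidirectional LSTMs read the feature sequence left-to-right and right-to-left.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    # Per-timestep Softmax over the character set (+1 for the CTC blank symbol).
    outputs = layers.Dense(NUM_CHARS + 1, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_crnn()
model.summary()
# Training would use a CTC loss (e.g. tf.keras.backend.ctc_batch_cost) so that
# unsegmented label sequences can be aligned with the per-timestep predictions.
```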

CONTROL FLOW DIAGRAM

The large class of applications with the following characteristics requires control flow modelling:

  • Applications that are driven by events rather than data.
  • Applications that produce control flow information rather than reports or displays.
  • Applications that process information within a specific time.

DEPLOYMENT DIAGRAM

A deployment diagram shows the allocation of processes to processors in the physical design of a system. A deployment diagram may represent all or part of the process architecture of a system.

Training the Model:

  • The dataset chosen by us, the MNIST dataset, will be loaded and a subset of it will be used after dividing it into training and testing sets.
  • First, we develop a model architecture with a mixture of CNN, RNN, and other desired layers, using deep learning libraries in Python.
  • We then perform the pre-processing operations required on the dataset before feeding the data to the training model.
  • The model is then put under training, and we train multiple models by changing the parameters slightly so that we have several models to test (a brief sketch of this loop follows the list).
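A brief sketch of this training loop, assuming Keras and the MNIST loader from tf.keras.datasets, might look as follows; the subset sizes and hyperparameter variants are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, datasets

# Load MNIST, keep a subset, and split it into training and test sets.
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()
x_train, y_train = x_train[:20000] / 255.0, y_train[:20000]
x_test, y_test = x_test[:4000] / 255.0, y_test[:4000]
x_train, x_test = x_train[..., None], x_test[..., None]      # add channel dimension

def build_model(filters, dropout):
    """Small CNN whose hyperparameters are varied slightly between runs."""
    return models.Sequential([
        layers.Conv2D(filters, 3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(dropout),
        layers.Dense(10, activation="softmax"),
    ])

trained = []
for filters, dropout in [(32, 0.2), (32, 0.4), (64, 0.2)]:    # slightly different variants
    m = build_model(filters, dropout)
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    m.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0)
    trained.append(m)
```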


Testing the Model:

  • The models trained in the earlier steps will now be tested on data they have not seen before.
  • Different models will yield different accuracies, and from these we select the most accurate model (see the evaluation sketch after this list).
  • We can then use this model to finally build and deploy our HTR service for users.
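Continuing the sketch above, the trained variants can be evaluated on the held-out test data and the most accurate one kept; the file name used for saving is hypothetical.

```python
# Evaluate every trained variant on data it has not seen, then keep the best one.
accuracies = [m.evaluate(x_test, y_test, verbose=0)[1] for m in trained]
best_model = trained[max(range(len(accuracies)), key=accuracies.__getitem__)]
print("test accuracies:", accuracies)
best_model.save("best_htr_model.keras")   # hypothetical filename, reused by the API sketch below
```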

Proposed System

In the proposed system, after training and saving the model, we plan to deploy the model in the form of an API, or by using a deployment service on the web. We will then create an interface to the model through which the user will be able to access it and use it as needed. We plan to develop a website-style interface at first, where the model is called on the user's request to recognize the characters.
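As one possible deployment sketch, the saved model could be wrapped in a small web API, here assuming Flask; the endpoint name, model path, and preprocessing are illustrative assumptions rather than the final design.

```python
import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)
model = tf.keras.models.load_model("best_htr_model.keras")   # hypothetical saved model path

@app.route("/recognize", methods=["POST"])
def recognize():
    # Expect an uploaded image file; convert it to the 28x28 grayscale format the model expects.
    img = Image.open(request.files["image"].stream).convert("L").resize((28, 28))
    x = np.array(img, dtype=np.float32)[None, ..., None] / 255.0
    probs = model.predict(x)
    return jsonify({"prediction": int(np.argmax(probs))})

if __name__ == "__main__":
    app.run()
```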

CONCLUSION

Using modern techniques like neural networks to implement deep learning for basic tasks that any human performs in the blink of an eye, such as text recognition, only scratches the surface of the potential behind machine learning. There are countless possibilities and applications of this technology. Traditional OCR used to work like a biometric device: photo-sensor technology was used to gather the match points of physical attributes and convert them into a database of known types. With the help of modern techniques like convolutional neural networks, we are able to scan and understand words with an accuracy never seen before.

We aim to achieve the highest level of accuracy possible for a lightweight, easy-to-train model, with the help of freely available GPU resources and compute time. We will select and finalize the best-performing CRNN architecture for our model.


Reference: Khalkar, Rohini, Dikhit, Adarsh & Goel, Anirudh (2021). Handwritten Text Recognition using Deep Learning (CNN & RNN). IARJSET, 8, 870-881. doi:10.17148/IARJSET.2021.86148


