Text detection in a natural scene image


This blog post attempts to explain, in depth, text localization techniques for real-world natural scene images. We will implement a text detection algorithm from scratch and use it to detect text.

Table of contents:

  1. Description
  2. Business Problem and use case
  3. Problem statement and expected solution
  4. Data Overview
  5. Deep learning approach
  6. EAST architecture and model description
  7. Training and loss
  8. Implementation and results
  9. Model quantization
  10. Further improvements and future work
  11. References


Recognizing text in an image or scene has long been a challenging task. We humans have no problem detecting and understanding text in the real world, whatever its shape, size, orientation or color. On top of this, the image may have noise, blur, low-light conditions and so on.

So our challenge here is to automatically detect text in a real-world image.

Note that this is different from general OCR techniques, where the text is structured and usually set against an even background.

Business problem and use cases

If we can detect text in a noisy scene image, it opens the door to many further use cases such as text recognition and text translation. This will help reduce the cost of manual labor in many sectors. For example, from a phone camera we can detect text in one language and translate it into English or the person's native language, so that they can read addresses or road names in a foreign environment and find their own way. This can also be used in the advertising industry.

One more use case is helping visually impaired people: combined with advances in text-to-speech, text detection can find text in any format, font or orientation and convert it into speech for them to understand.

Problem statement

Given a set of natural scene images containing text, we have to detect the text in them across different orientations, colors and sizes. Each detected text region will be enclosed by a rotated bounding rectangle that fits the text as accurately as possible.

Data source

Data is downloaded from the Synthtext dataset. Annotations are given in a .mat file, from which we will extract them into a pandas dataframe.

Dataset overview

This data contains 858,750 synthetic scene-image files (.jpg) split across 200 directories, out of which we have randomly sampled 5860 images for training. This number can be increased for more intensive training.

The annotations for each image contain the image name, the bounding box coordinates and the text, all at word level.

Bounding box coordinates -> word-level bounding-boxes for each image, represented by tensors of size 2x4xNWORDS_i, where:

  • the first dimension is 2 for x and y respectively,
  • the second dimension corresponds to the 4 points (clockwise, starting from top-left),
  • and the third dimension, of size NWORDS_i, corresponds to the number of words in the ith image.
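The layout above can be unpacked into one 4-point polygon per word. The snippet below is a minimal sketch on plain nested lists with made-up coordinates; the function name `words_from_bbox_tensor` is hypothetical and not taken from the dataset's tooling.

```python
def words_from_bbox_tensor(bbox):
    """Convert a 2 x 4 x NWORDS nested list into one 4-point polygon per word.

    bbox[0][p][w] is the x coordinate and bbox[1][p][w] the y coordinate of
    corner p (clockwise from top-left) of word w.
    """
    xs, ys = bbox
    n_words = len(xs[0])
    return [
        [(xs[p][w], ys[p][w]) for p in range(4)]
        for w in range(n_words)
    ]


# Toy annotation with two words (made-up coordinates, for illustration only).
bbox = [
    [[10, 60], [50, 120], [50, 120], [10, 60]],  # x of the 4 corners, per word
    [[20, 22], [20, 22], [40, 44], [40, 44]],    # y of the 4 corners, per word
]
polygons = words_from_bbox_tensor(bbox)
print(polygons[0])  # corners of the first word, clockwise from top-left
```

Each polygon can then be stored as one row entry in the dataframe next to its word.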

For more details about the data, please visit this link: Synthtext dataset.

Deep learning problem

Using a set of real-world scene images in which word-level text is annotated with bounding boxes, we have to train a deep learning model (CNN) that, given a new image, detects each word separately. Different words on the same line are detected separately once the model learns that consecutive words are separated by a space.

We will use the EAST text detector's architecture to implement our own CNN-based model, with the help of OpenCV for image processing and for generating rotated bounding boxes.

Now we will understand each component used to solve this problem.

EAST — An Efficient and Accurate Scene Text detector

EAST is a simple yet powerful pipeline that yields fast and accurate text detection in natural scenes. The pipeline directly predicts words or text lines of arbitrary orientations and shapes, eliminating unnecessary intermediate steps with a single neural network.


Model Architecture:

The model is a fully-convolutional neural network adapted for text detection that outputs dense per-pixel predictions of words or text lines. This eliminates intermediate steps such as candidate proposal, text region formation and word partition. The post-processing steps only include thresholding and NMS on predicted geometric shapes.

EAST adopts the idea of the U-shaped network (U-Net) to merge feature maps gradually while keeping the up-sampling branches small. Together, this gives a network that can both utilize features from different levels and keep the computation cost small. The U-shaped design lets later stages reuse the feature maps produced by earlier layers.

The model can be decomposed into three parts: the feature extractor stem, the feature merging branch and the output layer, as shown in the figure above.

Feature extractor stem:

The original paper used PVANet, but I will use a pre-trained ResNet50 model's weights as the feature extractor stem. This branch extracts basic information from images such as shapes, edges, colors and patterns, which makes it easier for the later stages to learn feature maps for detecting text.

Feature merging branch:

In each merging stage, the feature map from the last stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map. Next, a conv1×1 bottleneck cuts down the number of channels and reduces computation, followed by a conv3×3 that fuses the information to produce the output of this merging stage. Following the last merging stage, a conv3×3 layer produces the final feature map of the merging branch and feeds it to the output layer.

Output layer:

The final output layer contains several 1×1 convolution operations that project the 32 channels of feature maps into a 1-channel score map Fs and a multi-channel geometry map Fg. Here Fg has 5 output channels: 4 for the distances from a pixel to the box edges and 1 for the rotation angle.

Training and loss:

During training we have two outputs/targets for which we compute and minimize the loss.

Score map:

The positive area of the bounding box/quadrangle on the score map is designed to be roughly a shrunk version of the original one. It is simply a binary mask in which the (shrunk) bounding box area is positive.

Shrunk positive area of bounding box

Geometry map:

The geometry map is either RBOX (rotated box) or QUAD. I used the RBOX implementation, as it gave me better results than the QUAD-style geometry map. The final shape of the map is the same as the image shape.

Geometry map generation intuition

For each pixel that has a positive score in the score map, we calculate its distances to the 4 boundaries of the text box and put them into the 4 channels of the RBOX ground truth at the corresponding pixel of the geometry map. One more channel keeps the angle by which the box is rotated.
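To make the target generation concrete, here is a minimal sketch for a single axis-aligned box (rotation angle 0). It is a simplification and not the implementation used in this post: the function name `rbox_maps` is hypothetical, the positive area is shrunk by a fixed pixel `margin` instead of the edge-proportional shrinking used by EAST, and plain nested lists stand in for arrays.

```python
def rbox_maps(height, width, box, margin=1):
    """Build a score map and a 5-channel geometry map for one axis-aligned box.

    box = (x_min, y_min, x_max, y_max) in pixel coordinates, inside the image.
    Geometry channels per positive pixel: distance to the left, right, top and
    bottom edge of the full box, followed by the rotation angle (0 here).
    """
    x_min, y_min, x_max, y_max = box
    score = [[0] * width for _ in range(height)]
    geo = [[[0.0] * 5 for _ in range(width)] for _ in range(height)]
    # Only pixels inside the shrunk box are marked positive.
    for y in range(y_min + margin, y_max - margin + 1):
        for x in range(x_min + margin, x_max - margin + 1):
            score[y][x] = 1
            geo[y][x] = [x - x_min, x_max - x, y - y_min, y_max - y, 0.0]
    return score, geo


score, geo = rbox_maps(height=8, width=12, box=(2, 1, 9, 6), margin=1)
print(geo[3][4])  # distances and angle stored at pixel (x=4, y=3)
```

Note that the distances are measured to the full box, while the positive mask covers only the shrunk area.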

Loss function:

Loss for Score map: The paper uses a class-balanced cross-entropy loss for the score map, because there is a large imbalance between text and background pixels.

However, I have used the Dice coefficient loss, which worked better for me. It is a commonly used loss function for semantic segmentation. It also takes into account the global and local composition of pixels, thereby providing better boundary detection than a weighted cross-entropy.


The equation for it is as below:

P-true being input target score map and P-pred being predicted output score map.
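As a reference point, a common form of the Dice loss is 1 - 2·Σ(P-true·P-pred) / (ΣP-true + ΣP-pred). The sketch below computes it in plain Python so the arithmetic is visible; the smoothing term `eps` and the exact variant are assumptions, and in training the same formula would be written with TensorFlow ops.

```python
def dice_loss(p_true, p_pred, eps=1e-6):
    """Dice loss = 1 - 2*sum(p_true*p_pred) / (sum(p_true) + sum(p_pred)).

    p_true and p_pred are 2D score maps (lists of rows) with values in [0, 1].
    eps avoids division by zero when both maps are empty.
    """
    flat_true = [v for row in p_true for v in row]
    flat_pred = [v for row in p_pred for v in row]
    intersection = sum(t * p for t, p in zip(flat_true, flat_pred))
    total = sum(flat_true) + sum(flat_pred)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


target = [[0, 1], [1, 0]]
perfect = dice_loss(target, [[0, 1], [1, 0]])   # prediction matches the target
partial = dice_loss(target, [[0, 1], [0, 0]])   # one of two text pixels found
disjoint = dice_loss(target, [[1, 0], [0, 1]])  # no overlap at all
print(perfect, partial, disjoint)
```

The loss moves from 0 for a perfect overlap towards 1 as the overlap disappears, regardless of how many background pixels the image has.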

Loss for Geometry map: We cannot use L1 or L2 loss directly to regress the 4 distances and 1 angle value, because the sizes of text in natural scenes vary greatly and this would bias the loss towards larger and longer text regions. So we need the regression loss to be scale invariant.

So we will use the IOU loss, since it is invariant to the scale of the objects. IOU is simply Intersection over Union. Since d1, d2, d3 and d4 are the distances from a pixel to the left, right, top and bottom edges of the box, both the intersection and the union areas can be computed easily.

- log of intersection over union

Next, the loss of rotation angle is computed as below:

Loss for rotation angle

Finally, the overall geometry loss is the weighted sum of the IOU loss and the angle loss, given by Lg = L-IOU + Lambda * L-theta, where Lambda is set to 20 in my implementation.
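The pieces of the geometry loss can be sketched for a single pixel as follows. This is an illustration under assumptions: the angle term uses the 1 - cos(difference) form from the EAST paper, `eps` is an added smoothing term, and the function names are hypothetical. Distances are ordered (left, right, top, bottom) as described above.

```python
import math


def iou_loss(d_true, d_pred, eps=1e-6):
    """-log(IoU) of two boxes given as (left, right, top, bottom) distances
    measured from the same pixel."""
    l_t, r_t, t_t, b_t = d_true
    l_p, r_p, t_p, b_p = d_pred
    area_true = (l_t + r_t) * (t_t + b_t)
    area_pred = (l_p + r_p) * (t_p + b_p)
    # Both boxes contain the pixel, so the overlap extends min(...) per side.
    w_inter = min(l_t, l_p) + min(r_t, r_p)
    h_inter = min(t_t, t_p) + min(b_t, b_p)
    inter = w_inter * h_inter
    union = area_true + area_pred - inter
    return -math.log((inter + eps) / (union + eps))


def angle_loss(theta_true, theta_pred):
    """1 - cos of the angle difference (angles in radians)."""
    return 1.0 - math.cos(theta_pred - theta_true)


def geometry_loss(d_true, d_pred, theta_true, theta_pred, lam=20.0):
    """Lg = L-IOU + Lambda * L-theta for one pixel."""
    return iou_loss(d_true, d_pred) + lam * angle_loss(theta_true, theta_pred)


# Predicted box is narrower on the left: IoU = 6 / 8 = 0.75.
print(iou_loss((2, 2, 1, 1), (1, 2, 1, 1)))
```

In training, these per-pixel values would be averaged over the positive pixels of the score map.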


The network is trained using the Adam optimizer. I kept the learning rate at 1e-3 until the 3rd epoch and then at 1e-4 until the 30th epoch. After noticing that the loss was no longer decreasing much, I lowered the learning rate to 1e-5 after the 30th epoch. This was done using Keras callbacks.
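A step schedule like this can be written as a small function. The sketch below is one possible reading of the description; the exact epoch boundaries (0-indexed epochs, switching at epochs 3 and 30) are an assumption, and `lr_schedule` is a hypothetical name.

```python
def lr_schedule(epoch, lr=None):
    """Step learning-rate schedule.

    `epoch` is 0-indexed, which is how Keras passes it to schedule functions.
    `lr` (the current learning rate) is accepted but not used.
    """
    if epoch < 3:
        return 1e-3
    if epoch < 30:
        return 1e-4
    return 1e-5


# Learning rate at a few epochs of a 40-epoch run.
for epoch in (0, 2, 3, 29, 30, 39):
    print(epoch, lr_schedule(epoch))
```

A function with this signature can be wrapped in `tf.keras.callbacks.LearningRateScheduler(lr_schedule)` and passed to `model.fit` through the `callbacks` argument.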

The batch size was 24. Each epoch trained on 4688 images, and the remaining 1172 of the total 5860 images were held out for validation at each epoch. The model was trained for a total of 40 epochs.

Obtaining final results from predictions

The geometries obtained after thresholding are then merged using NMS, i.e. Non-Maximum Suppression.

Since geometries from adjacent pixels are highly correlated and often belong to the same bounding box, we will use locality-aware NMS instead of standard NMS.

In locality-aware NMS the geometries are merged row by row, and each merged bounding box is the average of the two given bounding boxes' coordinates, weighted by their scores in the score map. Pseudocode for locality-aware NMS is shown below:

locality aware NMS
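For readers who prefer runnable code, here is a minimal Python sketch of the same idea. It is simplified to axis-aligned boxes `(x1, y1, x2, y2, score)` rather than rotated ones, the 0.3 overlap threshold is an assumption, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2, score)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(g, p):
    """Score-weighted average of the coordinates; the scores are summed."""
    total = g[4] + p[4]
    coords = [(g[4] * g[i] + p[4] * p[i]) / total for i in range(4)]
    return (*coords, total)


def standard_nms(boxes, thresh):
    """Keep the highest-scoring boxes, dropping those that overlap a kept box."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept


def locality_aware_nms(boxes, thresh=0.3):
    """`boxes` must be in row-first order, i.e. the order pixels are scanned."""
    merged, prev = [], None
    for g in boxes:
        if prev is not None and iou(g, prev) > thresh:
            prev = weighted_merge(g, prev)  # fold g into the running box
        else:
            if prev is not None:
                merged.append(prev)
            prev = g
    if prev is not None:
        merged.append(prev)
    return standard_nms(merged, thresh)


# Two near-duplicate boxes from neighbouring pixels plus one separate box.
boxes = [(0, 0, 10, 4, 0.9), (1, 0, 11, 4, 0.9), (30, 0, 40, 4, 0.8)]
result = locality_aware_nms(boxes)
print(result)
```

Because neighbouring boxes are merged in a single pass before standard NMS runs, far fewer candidates reach the pairwise comparison step.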

That covers everything we need to understand about the architecture of the EAST text detection model. Now let us dive into the implementation.


After importing the annotations we will have our final dataframe looking something like this:

final dataframe

The ‘imnames’ column holds the image names, the ‘txt’ column holds the word-level text, and the corresponding bounding boxes (bbox) are stored as a list of lists.

Input Pipeline

We will build the input pipeline with TensorFlow's ‘tf.data.Dataset’ API.

Loss :

Then we will be defining the losses for our model.

After everything is defined, we will compile our model


After the model is trained, we need to look at what it predicts. Inference is done using the code below.

The output of our trained model:

It also performs well on images where the text form and background are very different from our training dataset (Synthtext), for example on an image from ICDAR2015.

Another image from a very different type of dataset

As we can see in the above results, our model detects unstructured text in the images very well, given the small amount of training and a very simple approach. Combined with other image processing techniques and more training, I expect it to give even better results.

Model analysis:

After training our model we got the following losses on our train and test data.

Train loss:

train loss

Test loss:

test loss

Now we will analyze the distribution of the train and test losses per epoch during training.

As we can see, both the train and test losses have their highest density around a loss of 0.3.

Let us categorize the losses into 3 categories:

We will analyze how the data is distributed across these categories: the best losses are those at or below the 33rd percentile of the training loss, the worst are those at or above the 75th percentile, and everything in between is average, since lower loss values mean better results.
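The categorization can be sketched as below. The loss values are made up for illustration, the function names are hypothetical, and applying the training-loss thresholds to the test losses is my reading of the description rather than a confirmed detail.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def categorize(losses, best_cut, worst_cut):
    """Label each loss as 'best', 'average' or 'worst' (lower loss is better)."""
    labels = []
    for loss in losses:
        if loss <= best_cut:
            labels.append("best")
        elif loss >= worst_cut:
            labels.append("worst")
        else:
            labels.append("average")
    return labels


# Made-up losses, for illustration only.
train_losses = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
test_losses = [0.15, 0.35, 0.50, 0.95]

best_cut = percentile(train_losses, 33)   # upper bound of the "best" category
worst_cut = percentile(train_losses, 75)  # lower bound of the "worst" category
test_labels = categorize(test_losses, best_cut, worst_cut)
print(test_labels)
```

Counting the labels per category gives the percentages shown in the bar plots.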

Categorizing training loss:

Categorizing test loss:

As we can see in the above plots of the test losses, our model performs better on the test data. From the bar plot of the test data we can infer that around 44% of the data falls in the best category, 36% in average and only around 19% in worst.

From these observations we can conclude that our model is performing well on test images.

Model Quantization

Quantization refers to techniques for performing computations and storing tensors at lower bit widths than floating-point precision. This greatly reduces the size of the trained model compared to the original TensorFlow (Keras) model, with comparable performance.

Keras model size:

model in Mb: 278.0773

Quantization of our model:
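A minimal sketch of how such a conversion could look with TensorFlow Lite post-training quantization is given below. It assumes the "Float16" and "Dynamic" models reported in this post correspond to TFLite float16 and dynamic range quantization; the helper names are hypothetical, and reporting sizes in units of 1024 × 1024 bytes is also an assumption.

```python
def quantize_keras_model(keras_model, mode="dynamic"):
    """Convert a Keras model to TFLite bytes with post-training quantization.

    mode="dynamic": dynamic range quantization (weights stored as 8-bit).
    mode="float16": weights stored as float16.
    """
    if mode not in ("dynamic", "float16"):
        raise ValueError(f"unknown mode: {mode!r}")
    import tensorflow as tf  # imported lazily; only needed for the conversion

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    if mode == "float16":
        converter.target_spec.supported_types = [tf.float16]
    return converter.convert()


def size_in_mb(model_bytes):
    """Size of a serialized model in units of 1024 * 1024 bytes."""
    return len(model_bytes) / (1024 * 1024)


# Usage with a trained model (not run here):
# tflite_bytes = quantize_keras_model(model, mode="float16")
# print("Float16 model in Mb:", size_in_mb(tflite_bytes))
```

The returned bytes can be written to a `.tflite` file and loaded with `tf.lite.Interpreter` for inference.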

Quantized model’s sizes:

After quantization we got the model sizes reduced as given below:

Float16 model in Mb: 92.0973
Dynamic model in Mb: 23.6607

As we can see, the quantized models are much smaller than the original Keras model.

This enables us to use the model on memory-limited devices such as IoT devices, mobile phones and cameras, or even in the cloud. Know more about it here.

In the above experiments I found that the quantized models also give similar results.

Here is a video of detecting text in real time using trained models:

Check out the full implementation code, with deployment, on my GitHub repository.

Further improvements and future work

In this experiment our main focus was to test if our implementation of setting up a text detection system from scratch works reasonably well or not. I have done various experiments to make it work well, but there is a lot of room for improvement.

We can greatly improve this model to work in even more arbitrary conditions by using an even bigger dataset containing images from different types of environments and text styles. Training for more epochs should also yield better results and could bring the model closer to industry standards.

My future work includes further extending the model’s output to recognize text after detection. After recognizing we can also translate the detected words to other languages.

I will update this blog with my future work soon.

My github repository.

My Linkedin profile.



