Best Practices for Automating with Image and Text Recognition

Kasper Fehrend

Senior Product Evangelist at Leapwork

Image and text recognition make up the backbone of automating virtual desktop applications. This article covers some of the basics of working with image and text recognition. We'll also present best practices and solutions for tackling some of the challenges inherent to image and text recognition.

Leapwork comes with two categories of building blocks for automating with image and text recognition: building blocks based on image recognition and building blocks based on text recognition (OCR).

How does Leapwork image recognition work?

Image recognition is the "art" of finding one image within another image. Typically, you will have one image defined at design time (captured into your Leapwork automation flows) and another that is a screenshot of the actual application taken while the automation flow is running. When the flow runs, Leapwork looks for the captured image in the screenshots and acts according to the defined flow.


Technically, image recognition compares a matrix of numbers with another matrix of numbers and returns whether the first matrix appears within the second. One challenge is that the two matrices can change if the screen resolution changes. For example, if the automation flow is executed on another machine or the resolution has changed, the accuracy of finding the captured image in the screenshot can decrease, leading to less robust automation flows.

An image consisting of 3×3 pixels, shown on its own and as part of a larger image. Image recognition looks for an image within an image, or a matrix within a matrix.
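To make the matrix idea concrete, here is a minimal, naive exact-match search in Python (a conceptual sketch only, not Leapwork's actual implementation):

```python
import numpy as np

def find_template(screenshot: np.ndarray, template: np.ndarray):
    """Naive exact-match image search: slide the captured image (template)
    across the screenshot and return the top-left (row, col) of the first
    position where every pixel matches."""
    H, W = screenshot.shape[:2]
    h, w = template.shape[:2]
    for row in range(H - h + 1):
        for col in range(W - w + 1):
            if np.array_equal(screenshot[row:row + h, col:col + w], template):
                return (row, col)
    return None  # the captured image was not found in the screenshot
```

This sketch also makes the resolution problem concrete: if the application renders at a different resolution than the one the image was captured at, the pixel matrices no longer line up and an exact match fails.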

How does Leapwork text recognition work?

Text recognition is based on pattern recognition: Leapwork searches an area of the screen for patterns that match letters. Letters come in different fonts, colors, and sizes, and a text's background can be an image or a gradient pattern, which makes it harder to recognize the actual letters and numbers on the screen.
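As a rough illustration of what an OCR engine does (outside Leapwork, using the open-source Tesseract engine via the pytesseract wrapper; the file name is hypothetical):

```python
from PIL import Image
import pytesseract  # pip install pytesseract; also requires the Tesseract binary

# Converting to grayscale strips color and background noise, which often
# makes the letter patterns easier for the engine to isolate.
img = Image.open("screenshot_region.png").convert("L")  # hypothetical file
print(pytesseract.image_to_string(img))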

In the following, we'll present best practices for handling common challenges related to working with text and image recognition. The best practices address:

  • How to capture icons when backgrounds change
  • How to find images that move
  • Defining part of the screen as an area of focus to speed up the recognition process
  • Using image collections for more robust recognition
  • Remote design and execution
  • Adjusting the precision of image recognition
  • Configuring the OCR engine for better text recognition

Following these tips will significantly improve the quality of your automation flows that rely on image and text recognition. 

Capturing icons when backgrounds change

The background color behind an icon can change, so don’t include parts of the background when capturing an icon.

A “hover” effect can change how an icon looks when the mouse pointer hovers over it, for instance showing a brighter or darker version. This can usually be handled by closing all open windows as part of the test run: set the 'Action' property on the Start building block to "Close all windows".

The Start building block in Leapwork

A “selected” or “opened” effect can change how an icon looks when selected. For instance, the Chrome icon in the Windows taskbar looks different before Chrome is opened compared to when browser instances are already open. This can typically be solved by using the Image Collection feature (described below).

No browsers open: The Chrome browser icon when the application is not open

At least one browser open: The Chrome browser icon when the application is open

Finding images that move

One situation that can occur in all types of applications is that an image is shown first in one place and then moved to another. For example, on some websites all resources are first loaded into the page and then "bootstrapped" into position. Another example is a dialog box in a desktop application that is shown and then centered on the screen.

In both cases, Leapwork can find the image in its initial position and then continue the flow. However, if the image changes position after it has been found, the automation flow will fail.

Checking the “Await no movement” property on the Click Image building block solves this problem. It tells the image recognition engine to wait until the screen has not changed for a period of time before starting to search for the image.

The Click Image building block
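Conceptually, "await no movement" amounts to taking repeated screenshots until the screen has been still long enough. A simplified illustration (not Leapwork's internal logic; the parameter names are made up):

```python
import time
import numpy as np
from PIL import ImageGrab  # Pillow's screen capture (Windows/macOS)

def await_no_movement(quiet_seconds=1.0, poll_interval=0.2, timeout=30.0):
    """Wait until consecutive screenshots have been identical for
    `quiet_seconds`, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    last = np.asarray(ImageGrab.grab())
    stable_since = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        current = np.asarray(ImageGrab.grab())
        if not np.array_equal(current, last):
            stable_since = time.monotonic()  # something moved; reset the clock
            last = current
        elif time.monotonic() - stable_since >= quiet_seconds:
            return True  # screen has settled; safe to search for the image
    return False  # screen never settled within the timeout
```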

Defining part of the screen as an area of focus

For image recognition, and especially for text recognition, it is best practice to use "Areas". An "Area" is a sub-section of the entire screen; it tells the image/text recognition engine to limit its search for the captured image or a specific text pattern to that area. Typically, you will define an area covering the part of the screen where you expect the image or text to appear, plus some margin.

Specifying an area has two main purposes:

  • You ensure that you find the right instance of the captured image/text. If the image or text appears multiple times on the screen, you could otherwise match the wrong occurrence.
  • Execution is considerably faster when the image/text recognition engine only has to search a fraction of the screen instead of the whole screen, as the sketch below illustrates.
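Both points show up in a small sketch (reusing the naive find_template search from earlier): restricting the search to an area means comparing far fewer positions, and a hit is unambiguous within that region:

```python
def search_in_area(screenshot, template, area):
    """Search only inside `area` = (left, top, width, height) and translate
    any hit back to full-screen coordinates."""
    left, top, width, height = area
    region = screenshot[top:top + height, left:left + width]
    hit = find_template(region, template)  # naive search sketched earlier
    if hit is None:
        return None
    row, col = hit
    return (top + row, left + col)  # position relative to the full screen
```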

More information about using and defining areas.

Image collections

The Image Collection feature allows you to capture two or more images into a collection and then use the collection when searching for an image. This means, for example, that you can capture the same button in different states (no focus, in focus, hovered, pressed, etc.), add all the captured images to one collection, and then have the automation flow click or find the button regardless of its state. This increases the robustness of a flow and its tolerance for changes.

In the example below we have captured the search button - "Go" - from a Windows desktop application.

A simple button in a desktop application

The button can have four different looks depending on focus and hover effect:

A button in four different states

All four states have been captured, and the images are now located as resources under the flow in the asset menu:

A collection of images in leapwork

In the example above, the images have been renamed to make them easier to identify. Hovering over an image in the asset menu pops up a thumbnail view of the image.

To create an Image Collection, click "New" + "Capture" + "Image collection". This creates a new, empty Image Collection in the asset menu. It is also possible to right-click the folder where the Image Collection should be located and select "Capture" + "Image Collection".

Image Collections can be identified by this logo in the asset menu:

The Image Collections logo

Once added, it is best practice to rename the image collection to something meaningful to make it easier to maintain and reuse the image collection across multiple flows.

Adding images to an Image Collection is simple: just drag and drop images from anywhere in the asset menu onto the image collection. To view the images in the collection, double-click it to open the "Edit image collection" dialog.

Screenshot of the "Edit image collection" dialog

In the dialog it is possible to edit and change the images individually if needed.

Screenshot of the edit image dialog

You can now use the collection in a building block by dragging the collection onto the image field in the building block:

A Leapwork flow using an image collection

When the building block is executed, it searches the screen for the images in the collection one by one. As soon as one of the images is found, the search stops, the block performs its action (e.g. clicking the image), and execution is handed over to the next building block in the flow.
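In illustrative Python terms (a sketch of the behavior, not the actual engine), the collection is consumed roughly like this:

```python
def find_first_in_collection(screenshot, collection):
    """Try each captured state of the control in order and stop at the
    first one found on screen, e.g. the "Go" button in its four states."""
    for name, template in collection.items():
        hit = find_template(screenshot, template)  # naive search from earlier
        if hit is not None:
            return name, hit  # e.g. ("go_hovered", (412, 880))
    return None  # none of the captured states are visible
```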

The image collection can also be used to handle different resolutions of the same icon/image if you know you will run the same automation flow in different resolutions. It can also be used to handle different states of icons.

The image resources are shared within a project, so collections can be used in multiple automation flows. For example, you can create a "Chrome icon" collection that contains all relevant states of the Chrome icon in the Windows taskbar and use it across all automation flows that operate with Chrome. The bonus is that you only have to maintain the image collection in one place instead of in every automation flow.

Remote design and execution

A typical Leapwork setup consists of a number of workstations with Leapwork Studio installed, a Controller installed on a common/shared server to make sharing easy, and one or more machines dedicated to executing the automation flows. Automation flows that use image and text recognition interact with the actual screen, which means that if you run them on your local machine, you can't work on it at the same time. This is the reason for using "remote machines" to run automation flows.

To make your automation flows independent of differences in screen resolution between the machines where they can be executed, you can define an Environment pointing to a "remote machine" and capture images on that machine instead of on your local workstation. This way you capture images directly on the machine where the automation flow will be executed, ensuring that the screen resolution is always the same.

To create a "remote machine" you need to install the Leapwork Agent on a dedicated workstation that is accessible from both Leapwork Studio and the Leapwork Controller. Once the remote machine is up and running you can define an Environment in Studio pointing to this machine. You can find more info here.

Once the environment is created, you can select it as the 'Preview environment' on the design canvas. In the example below, "Amazon Cloud Remote" is an environment pointing to a cloud-hosted (Amazon) server where the Leapwork Agent is installed.

When the 'Preview environment' points to a remote machine, a "terminal" window pops up when you capture new images, allowing you to capture directly on the remote machine instead of on your local machine.

Adjusting precision of image recognition

The building blocks that use image recognition have a property named Precision, which is accessible by expanding the building block. The Precision property has two sub-properties:

  • Pixels: The tolerated level of deviation in the pixel-by-pixel comparison.
  • Color: The sensitivity to changes in color density. The color density of the same set of pixels can vary with the hardware used.

Here you set the accepted level of accuracy for the image recognition. The default is "Pixel perfect", which means there has to be a perfect match, pixel by pixel, before the captured image is considered found on the screen. In some cases a higher level of tolerance is needed. The advice is to start with 'Pixel perfect' for both properties and then relax them one level at a time until the image recognition works as intended.
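As a rough model of the two sub-properties (the parameter names and thresholds below are illustrative, not Leapwork's actual internals), a tolerant comparison could look like this:

```python
import numpy as np

def matches_with_tolerance(region, template, max_bad_pixels=0, color_tolerance=0):
    """A pixel 'matches' if every color channel differs by at most
    `color_tolerance`; the region matches if at most `max_bad_pixels`
    pixels fail that test. (0, 0) corresponds to 'Pixel perfect'."""
    diff = np.abs(region.astype(int) - template.astype(int))
    bad = np.any(diff > color_tolerance, axis=-1)  # per-pixel pass/fail map
    return int(bad.sum()) <= max_bad_pixels
```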

Configuring the OCR engine

For the building blocks using OCR (text recognition), you can change the settings for the OCR engine to optimize how the characters are recognized.

Choosing the engine

You can choose between two different built-in OCR engines in the building block configurations:

  • "Default": This is based on Tesseract version 3.5 which is an open-source engine used by literally all OCR engines. 
  • "Default (new)": This engine is based on Tesseract version 4.0 which uses a neural network architecture (LSTM) to optimize the engine. This architecture is considered to be the future within all types of recognition software (images, speech, video, text etc.)

Both "Default" and "Default (new)" are working engines, but because of the different technologies, one engine might be a better fit for some applications. In case the OCR building blocks are not behaving as expected, one option is to try to change to the other engine.

If the built-in OCR engines in Leapwork do not match your requirements, it is possible to switch the engine to ABBYY.

ABBYY is a world-leading OCR engine, but using it requires a separate ABBYY license. Also be aware that ABBYY requires some infrastructure work to set up, so in most cases the built-in engines are the best option.

Contact our Customer Success Team to get started with ABBYY.

Choosing the OCR mode

You can choose between two different OCR Modes. In short, it's a choice between speed and quality.

  • Fast Speed: The OCR engine performs two recognition runs in parallel: one in a normal color scheme (black text on white background) and one using inverted colors. This mode is faster than the "High quality" setting, so if the characters are found correctly, simply keep using this setting.
  • High Quality: The OCR engine performs four recognition runs in parallel: two in a normal color scheme and two in inverted colors. This setting is slower than the "Fast speed" setting, but might be required if the OCR engine is not returning the characters correctly.
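The idea of parallel runs over normal and inverted color schemes can be sketched with the open-source Tesseract engine (illustration only; Leapwork's built-in engines handle this for you):

```python
from concurrent.futures import ThreadPoolExecutor
from PIL import Image, ImageOps
import pytesseract

def ocr_fast_mode(img: Image.Image) -> list[str]:
    """'Fast speed' sketch: two recognition runs in parallel, one on the
    image as-is and one on an inverted copy (light text on dark background)."""
    variants = [img.convert("RGB"), ImageOps.invert(img.convert("RGB"))]
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(pytesseract.image_to_string, variants))
```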

Adjusting OCR precision levels

OCR precision sets the accuracy of the OCR results at the character level: a higher OCR precision level requires the engine to have higher confidence before a character is considered matched.

With a high precision you can be very confident that the characters found are the correct characters.

On the other hand, a high precision can result in some characters not being found. Setting a lower precision means that, in general, more characters are found, but the assurance that they are the right characters is lower. The right setting is therefore a balance between finding all the right characters and not including noise that pollutes the results, and it will depend on the font, colors, background, and size of the text.

The precision can be set on a scale from 0 to 100: 0 returns everything the OCR engine recognized, and 100 returns only the best possible recognized results.
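One way to picture the precision scale is as a confidence cut-off on the raw OCR output. This sketch uses Tesseract's per-word confidence as a stand-in for Leapwork's per-character precision (the function name is made up):

```python
import pytesseract
from PIL import Image

def read_with_precision(img: Image.Image, precision: int = 70) -> str:
    """Keep only words the engine reports with confidence >= `precision`.
    precision=0 keeps everything recognized; precision=100 keeps only
    the most certain results."""
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    words = [word for word, conf in zip(data["text"], data["conf"])
             if word.strip() and float(conf) >= precision]
    return " ".join(words)
```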

The default Precision Levels are:

  1. High: Use this when you are confident that the characters are large and clear enough (not hazy or compacted) to be recognized reliably by the OCR engine. The predefined value is 70.
  2. Medium: Use this when the characters may or may not be recognized reliably; the engine searches the defined area for all plausible characters. The predefined value is 50.
  3. Low: Use this when you are less sure that the characters can be recognized; the engine also accepts matches outside the dictionary in the defined area. The predefined value is 30.
  4. Very Low: Use this when you are least sure that the characters can be recognized; the engine accepts nearly all possible characters, in and outside the dictionary. The predefined value is 20.
  5. Custom: Use this to set a custom precision value (confidence factor) in the range 0-100.

Learn more about Leapwork and no-code test automation in our webinar.
