ABBYY Teaches Computers to Read Burmese

The Republic of the Union of Myanmar, formerly known as Burma, is a country in Southeast Asia. From 1962 to 2010, Burma was ruled by a military junta, but in the past five years it has been opening up to the outside world, establishing trade and cultural links with other nations.

The Burmese language comprises many dialects, but all of them share a core alphabet that is used for official texts and in print media. This shared alphabet has 33 consonants and 12 auxiliary characters. Regional dialects may also use other characters, and the complete list is about three times the size of the core alphabet. Luckily, our job was limited to recognizing standard Burmese texts set in the popular Myanmar 3 font at a size of at least 10 points. Text images could be grayscale, black-and-white or color, and their resolution had to be at least 300 dpi. This is what a typical Burmese text looks like:

[Figure 1: a sample of typical Burmese text]

In the preliminary stage of the project, we had to achieve an OCR accuracy of 75%; the minimum target accuracy for the project as a whole was 94%.

The Burmese script is a so-called alphasyllabary, where each consonant letter also conveys a “default” vowel sound. Other vowel sounds are transcribed using special characters and diacritics above, below, before, after or even around a consonant.

The letters are mostly made up of semi-circles, because in the past texts were written on palm leaves, which could be easily damaged by straight-line incisions.

Burmese is a tonal language. There are three main tones (high, low, and creaky) and two secondary tones (checked and falling).

[Figure 2]

Since tones also have to be transcribed in writing, the Burmese script effectively has two kinds of diacritic symbols, which may be placed above, below or both above and below the main letter. This two-tier diacritic system poses a serious challenge for OCR software, but more on this later.

To make things even more complicated, some combinations of letters can be fused together to form a new character.

In the most general terms, optical character recognition proceeds as follows. When an OCR program receives an image, it performs some preliminary processing, converting the image to black-and-white and correcting any visible distortions. Next, it detects zones that contain different kinds of text (headings, body text, footnotes), pictures, and tables. The text blocks are then parsed into lines, lines into words, and words into letters. After the individual letters are recognized, the document is reassembled from the bottom up. Image processing and block detection are the same for Burmese texts as for texts in most other languages, but detecting lines is a tricky business.

Because of the abundant diacritics, it was very hard to teach the program to identify short lines, and here is why. Our algorithms use a number of features that characterize a line of text, and one of these features is an imaginary base line on which all main characters sit. The program needs to know where to draw a base line in order to be able to generate plausible hypotheses about individual characters.

The program uses statistical data to detect base lines. To gather the necessary statistics, it looks for peaks on histograms generated for the black dots that make up the letters. On histograms for European alphabets, there are three clearly visible peaks which correspond to the base line and to the height of the lower-case letters:

[Figure 3: histogram for a line of text in a European alphabet]

In Burmese, however, the numerous diacritics that fall outside the normal height of the line result in additional statistically meaningful peaks in the histogram. For this reason, our algorithms, which were originally geared toward European scripts, failed to correctly identify the important parameters of Burmese text lines.
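The peak search described above can be illustrated with a small sketch. It is not the production algorithm, only a naive version under stated assumptions: the image is a list of pixel rows where 1 means a black dot, and the names `row_histogram` and `find_peaks`, the `min_height` threshold and all the numbers are made up for illustration.

```python
from typing import List

def row_histogram(image: List[List[int]]) -> List[int]:
    """Count black pixels (value 1) in every pixel row of a binary image."""
    return [sum(row) for row in image]

def find_peaks(hist: List[int], min_height: int) -> List[int]:
    """Return row indices of local maxima that reach at least min_height.

    Naive on purpose: the first row of a flat plateau is reported as a peak
    and no smoothing is applied.
    """
    peaks = []
    for i, value in enumerate(hist):
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i + 1 < len(hist) else -1
        if value >= min_height and value > left and value >= right:
            peaks.append(i)
    return peaks

# Toy "European" line: row 1 is the top of the lower-case letters,
# row 4 is the base line, row 5 holds a single descender pixel.
EUROPEAN_LINE = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
european_hist = row_histogram(EUROPEAN_LINE)
european_peaks = find_peaks(european_hist, min_height=4)

# Hypothetical histogram of a line with dense marks above (row 0)
# and below (row 7) the main letters: two extra peaks appear.
diacritic_hist = [4, 0, 5, 2, 2, 6, 1, 4]
diacritic_peaks = find_peaks(diacritic_hist, min_height=4)
```

In the first case the only peaks are the two rows that delimit the main letters; in the second, the rows occupied by marks above and below the letters pass the same threshold, which is the kind of ambiguity the paragraph above describes.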

In the figure below, the program has correctly detected the first two base lines but failed to detect the third:

[Figure 4: Burmese text with the first two base lines detected and the third missed]

Some adjustments had to be made to the line detection algorithm to make it work on Burmese texts as well.

Once the lines are detected, we can start looking for gaps between words and letters. This time, a horizontal histogram is used, with larger gaps assumed to be spaces between words (or, in the case of Burmese, clauses) and smaller gaps interpreted as spaces between letters. Detection of gaps in Burmese texts presented almost no problems, unlike, say, in Thai, where there are almost no gaps. (Yes, our technology can recognize texts written in Thai, and in about 200 other languages.)
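As a rough sketch of this step, the snippet below counts black pixels per column, finds runs of empty columns inside the line, and labels each run by its width. The function names, the `word_gap_min_width` threshold and the sample numbers are assumptions made for the example, not values from our engine.

```python
from typing import List, Tuple

def column_histogram(image: List[List[int]]) -> List[int]:
    """Count black pixels (value 1) in every pixel column of a binary image."""
    return [sum(column) for column in zip(*image)]

def find_gaps(hist: List[int]) -> List[Tuple[int, int]]:
    """Return (start, width) for each run of empty columns inside the line.

    Empty margins at the very left and very right are ignored.
    """
    gaps = []
    start = None
    for x, value in enumerate(hist):
        if value == 0:
            if start is None:
                start = x
        elif start is not None:
            if start > 0:  # a run starting at column 0 is the left margin
                gaps.append((start, x - start))
            start = None
    return gaps  # a run still open at the end is the right margin

def classify_gaps(gaps: List[Tuple[int, int]],
                  word_gap_min_width: int) -> List[Tuple[int, int, str]]:
    """Label wide gaps as word (or clause) breaks and narrow ones as letter breaks."""
    return [
        (start, width, "word" if width >= word_gap_min_width else "letter")
        for start, width in gaps
    ]

tiny_hist = column_histogram([[1, 0, 1],
                              [1, 0, 0]])

# Hypothetical column histogram of a short line fragment.
line_hist = [0, 3, 4, 0, 2, 5, 0, 0, 0, 0, 4, 3, 0]
labelled = classify_gaps(find_gaps(line_hist), word_gap_min_width=3)
```

Here the one-column gap at position 3 is labelled a letter break and the four-column gap at position 6 a word break. A fixed width threshold is the simplest possible rule; a real system would more likely derive it from the statistics of the line.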

Once we have divided the lines into smaller fragments, we attempt to divide the fragments into individual characters. Once again, we look for peaks and troughs on a histogram, troughs corresponding to possible gaps between letters. Some of the gaps can be detected with a very high degree of certainty, while others have to be verified by means of various heuristics.
The figure below shows a histogram for an English word.

[Figure 5: histogram for an English word]

The large number of semicircular characters in the Burmese script produces many “false” peaks and troughs, making it harder to detect gaps, but the histogram method works for Burmese, too.

[Figure 6: histogram for a Burmese text fragment]
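The difference between gaps that are certain and gaps that need verification can be sketched as follows. This is a simplified illustration, assuming a column histogram of black pixels for one fragment; `candidate_cuts`, the `trough_max` threshold and the sample values are hypothetical.

```python
from typing import List, Tuple

def candidate_cuts(hist: List[int], trough_max: int) -> List[Tuple[int, str]]:
    """Propose cut positions between letters from a column histogram.

    An empty column is a cut we can be sure of. A shallow local minimum
    (at most trough_max black pixels) is only a candidate that later
    heuristics would have to confirm or reject. Adjacent empty columns
    each produce their own entry; merging them is left out for brevity.
    """
    cuts = []
    for x in range(1, len(hist) - 1):
        value = hist[x]
        if value == 0:
            cuts.append((x, "certain"))
        elif value <= trough_max and value <= hist[x - 1] and value <= hist[x + 1]:
            cuts.append((x, "verify"))
    return cuts

# Hypothetical histogram: a clean gap at column 2, touching letters at column 5.
fragment_hist = [4, 6, 0, 5, 7, 1, 6, 3, 5, 4]
cuts = candidate_cuts(fragment_hist, trough_max=2)
```

With these numbers, column 2 is reported as a certain cut and column 5 as one that needs verification, while the dip at column 7 is too shallow to be considered. Rounded strokes tend to create many dips of the second kind, which is why the verification heuristics matter more for Burmese.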

Now we can attempt to recognize individual characters, or graphemes, to be precise. A grapheme is a graphical representation of a character, but the correspondence is not one-to-one. In European texts, one grapheme may correspond to more than one character (e.g., the upper-case “C” and the lower-case “c” are one grapheme) and one character may be conveyed by several graphemes (e.g., the letter “a” may be represented by different graphemes in different fonts).

[Figure 7]

There are no standard lists of graphemes, so we compile them manually, specifying all possible characters for each grapheme. Graphemes are translated into characters at a later stage, when word candidates are generated.
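A hand-compiled grapheme list of this kind can be pictured as a simple lookup table. The sketch below is only an illustration: the grapheme names and the table contents are invented for the example, and the real lists are far larger.

```python
from itertools import product
from typing import Dict, List

# Hypothetical grapheme table: each grapheme lists every character it may stand for.
GRAPHEME_TO_CHARS: Dict[str, List[str]] = {
    "c_shape": ["C", "c"],       # one shape, two possible characters
    "a_double_storey": ["a"],    # two shapes that stand for
    "a_single_storey": ["a"],    # the same character
    "t_shape": ["t"],
}

def word_candidates(graphemes: List[str]) -> List[str]:
    """Expand a sequence of recognized graphemes into all possible character strings."""
    options = [GRAPHEME_TO_CHARS[g] for g in graphemes]
    return ["".join(chars) for chars in product(*options)]

candidates = word_candidates(["c_shape", "a_single_storey", "t_shape"])
```

For the three graphemes above, the candidates are "Cat" and "cat"; choosing between them is the job of the later word-level stage, not of the grapheme recognizer.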

As we noted earlier, there are a great many diacritic characters in the Burmese script, and many of them can be fused with their main letter to form a new character:

[Figure 8: diacritics fused with their main letters]

If a diacritic mark is physically separate from its letter, we first recognize the letter, then the diacritic, and finally combine the results to obtain a grapheme. If a diacritic and its letter form an indivisible unit, we attempt to recognize it in its entirety.

Fused characters are so common in the Burmese writing system that we had to teach our technology to recognize 3,500 new graphemes, which is a great deal more than we usually add for a new language.

After we have recognized a grapheme, we must translate it into Unicode characters, which will then make up words. The process is pretty straightforward for European languages, where we recognize characters one by one and then translate them into Unicode. The Burmese fused characters, however, require special treatment.

There is a certain correct order in which Burmese letters must be entered from the keyboard so that Windows can join them together. Some characters have to be typed after all the other characters have been typed, so that Windows can put them in their right place at the beginning of the syllable.

For example, to type this word in a text editor:

[Figure 9: the example word]

a user must key in the following sequence of characters:

[Figure 10: the sequence of characters to key in]

We have added a special post-correction module to our technology which ensures that resulting words comply with these typing rules. Once all of the text has been recognized, the module reads it once again to check that the order of characters is correct. Burmese is a very well structured language, and there are enough formal rules to do these checks.
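To give a flavor of what such an order check involves, here is a minimal sketch that is not our post-correction module. It relies on one fact of the Unicode encoding of the Myanmar script: the vowel sign E (U+1031) and the medial RA (U+103C) are drawn to the left of the consonant but stored after it. The function name, the simplified consonant range U+1000 to U+1021 and the rule set are assumptions for illustration; real syllables involve many more rules.

```python
VOWEL_SIGN_E = "\u1031"   # drawn before the consonant, stored after it
MEDIAL_RA = "\u103c"      # wraps the consonant from the left, stored after it
MEDIALS = {"\u103b", "\u103c", "\u103d", "\u103e"}  # medial YA, RA, WA, HA

def is_consonant(ch: str) -> bool:
    """Simplified test: letters U+1000..U+1021 of the Myanmar block."""
    return "\u1000" <= ch <= "\u1021"

def visual_to_logical(glyphs: str) -> str:
    """Reorder characters from left-to-right visual order to stored order.

    Handles only two cases: vowel sign E and medial RA seen before a consonant.
    The stored order produced is: consonant, medials (by code point), vowel sign E.
    """
    out = []
    i, n = 0, len(glyphs)
    while i < n:
        j = i
        prefix = []
        while j < n and glyphs[j] in (VOWEL_SIGN_E, MEDIAL_RA):
            prefix.append(glyphs[j])
            j += 1
        if prefix and j < n and is_consonant(glyphs[j]):
            consonant = glyphs[j]
            j += 1
            medials = [p for p in prefix if p == MEDIAL_RA]
            while j < n and glyphs[j] in MEDIALS:
                medials.append(glyphs[j])
                j += 1
            out.append(consonant)
            out.extend(sorted(medials))
            if VOWEL_SIGN_E in prefix:
                out.append(VOWEL_SIGN_E)
            i = j
        else:
            out.append(glyphs[i])  # nothing to reorder at this position
            i += 1
    return "".join(out)

# The word for "Myanmar" as seen left to right on the page (medial RA first) ...
visual = "\u103c\u1019\u1014\u103a\u1019\u102c"
# ... and as it is stored in Unicode text.
logical = visual_to_logical(visual)
```

The sketch moves the medial RA behind its consonant and leaves the rest of the word untouched. Because such rules are formal and local to a syllable, they can be checked in a single pass over the recognized text.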

We have briefly outlined the main challenges that we faced when teaching our technology to read texts in Burmese. The project took us four months to complete, yielding a recognition accuracy of 97% (the customer’s requirement was at least 94%). Support for more Burmese fonts may be added in the future.