Types of PDFs and why it is important to use OCR to make your PDFs work for you

Have you ever encountered PDFs that simply would not do what you wanted? For example, a PDF that would not let you select text to copy it? Or a search for a word you know is in the document that returned no results?
The reason for this is very simple, and with the right tools the problem is easy to solve.

Why do PDFs behave differently?

PDF documents can be categorised into three different types, depending on the way the file is created. How it was originally created defines whether the content of the PDF (text, images, tables) can be accessed or whether it is “locked” in an image of the page.

To understand the PDF structure, think of it in layers. The top layer is just the image of the page, like a photograph. To access the text, the PDF needs a second layer, a text layer, which sits hidden under the image layer.

“True” or Digitally Created PDFs
Created using software such as Microsoft® Word, Excel® or via the “print” function within a software application (virtual printer).
Consist of text and images.
  • Searchable
  • Content can be accessed to annotate and re-use

“Image-only” or Scanned PDFs
Created by scanning paper documents on all-in-one devices and office scanners, or when converting an image such as jpg or tiff into a PDF.
Contain just the scanned or photographed images of pages, without an underlying text layer. The content is “locked” in a snapshot-like image.
  • Not searchable
  • Content cannot be accessed

Searchable Scanned PDFs
Result from applying OCR (Optical Character Recognition) to scanned PDFs or other image-based documents.
A text layer is added to the image layer, usually placed underneath.
  • Made searchable using OCR
  • Content can be accessed to annotate and re-use; some limitations can occur, e.g. with graphical elements and images
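
The three categories can be expressed as a small decision rule. The Python sketch below is only an illustration: it assumes some PDF library has already reported, for each page, whether it carries a text layer and whether a scanned image covers it. The `Page` record and the `classify_pdf` function are hypothetical names made up for this example and are not part of any ABBYY product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical per-page summary that a PDF parser would fill in."""
    has_text_layer: bool   # selectable text exists on the page
    has_page_image: bool   # a scan or photograph covers the page


def classify_pdf(pages: List[Page]) -> str:
    """Map a document to one of the three categories described above."""
    if not pages:
        raise ValueError("document has no pages")

    any_text = any(p.has_text_layer for p in pages)
    all_scanned = all(p.has_page_image for p in pages)

    if not any_text:
        # Nothing to select or search: the content is locked in images.
        return "image-only"
    if all_scanned:
        # Page images with a text layer underneath, typically added by OCR.
        return "searchable scanned"
    # Text without a full-page image on every page: created digitally.
    return "true"


print(classify_pdf([Page(has_text_layer=False, has_page_image=True)]))
```

Real documents can mix page types (for example, a digitally created report with a few scanned appendices), so a practical tool would probably classify page by page rather than assign one label to the whole file.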

What is OCR and how is it relevant for working with PDFs?

Many scanners are capable of creating PDF documents, but all a scanner can do is create an image, or snapshot, of the document. That image is nothing more than a collection of black and white or coloured dots, known as a raster image; it contains no other data. To extract and repurpose data from scanned documents or “image-only” PDFs, you need OCR software such as ABBYY FineReader, or a PDF tool with integrated OCR such as ABBYY PDF Transformer, which will recognise the letters in the image, assemble them into words and then assemble the words into sentences. Only after this process can you access and edit the content of the original “image-only” document.

Optical Character Recognition (OCR) or text recognition unlocks the information “trapped” in a scanned/photographed image of a document. OCR software “reads” the content of a document (text and structure) by interpreting character images and assigning them their text equivalent. This makes it possible to transfer the content and layout of the document into searchable and editable formats.
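
To make the idea of “interpreting character images and assigning them their text equivalent” more concrete, here is a toy Python sketch. It is not how ABBYY FineReader or any production OCR engine works. It assumes a tiny raster line in which every character is exactly 3 dots wide and 5 dots high, separated by one blank column, and it picks the best-matching letter from a made-up set of templates. All names (`TEMPLATES`, `split_glyphs`, `recognise`) are hypothetical.

```python
# Hypothetical 3x5 glyph templates: '#' is an inked dot, '.' is background.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
}

GLYPH_WIDTH = 3  # assumed fixed character width, in dots


def split_glyphs(image_rows):
    """Cut a raster line into glyph-sized tiles separated by one blank column."""
    width = len(image_rows[0])
    return [
        tuple(row[start:start + GLYPH_WIDTH] for row in image_rows)
        for start in range(0, width, GLYPH_WIDTH + 1)
    ]


def match_score(glyph, template):
    """Count the dots on which the glyph and the template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognise(image_rows):
    """Assign each glyph image the letter whose template matches it best."""
    letters = []
    for glyph in split_glyphs(image_rows):
        best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
        letters.append(best)
    return "".join(letters)


# A raster "scan" of three characters: nothing but dots until OCR is applied.
scan = [
    "#...###.###",
    "#...#.#..#.",
    "#...#.#..#.",
    "#...#.#..#.",
    "###.###..#.",
]
print(recognise(scan))
```

Because the sketch picks the closest template instead of demanding an exact match, a stray dot in the scan does not necessarily change the result. Real OCR software additionally has to find lines and characters of varying size, handle many fonts and languages, and reconstruct the page layout.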


What does this mean for your daily work with PDFs?

So now you know: every time you tried to select text in a PDF document but simply could not, or searched for a keyword and got no results, you were almost certainly dealing with a scanned “image-only” PDF.
With OCR, available through tools such as ABBYY FineReader, you can convert scanned “image-only” PDF documents into PDFs containing selectable and searchable text. This enables easy management, copying and indexing of the content as well as full-text search.

Your work with PDF documents becomes easier and more productive because:

  • You can deal with scanned paper documents and “image-only” PDFs almost in the same manner as with digitally created PDFs.
  • You can select text to highlight, comment, and make annotations when collaborating with your colleagues.
  • You can find and access information from your documents much faster, without the need to dig through piles of paper.
  • You can use the “Search and Redact” function to redact multiple instances of where confidential information appears in your documents.
  • You can reuse information from your documents without manually retyping it.
  • You can simply do your work instead of struggling with that PDF file.