Software

Portable Document Format (“PDF”) and Optical Character Recognition (“OCR”)

Written By Foo Juyuan

First published on 22 February 2018


A. INTRODUCTION

What is Portable Document Format (“PDF”)?

Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic. They can also be signed electronically and are easily viewed using the free Acrobat Reader DC software [1] or other third-party PDF viewers.

For a lawyer, being able to convert files into PDF format is a necessity, especially as it is the only file format accepted for uploading documents to eLitigation. Conversion can usually be done via the standard “Save as PDF” option in your word processing software. PDF files are more than images of documents. They can embed fonts so that the fonts are available wherever the file is viewed, which is useful when you want to preserve the document’s printed appearance. This way, you need not be concerned that the intended recipient will receive a document with formatting errors caused by software compatibility issues.

What is Optical Character Recognition (“OCR”)?

OCR is the process of converting the image of a scanned document into actual text that can be selected, copied, pasted and, most importantly, indexed and searched.

By default, scanned and faxed documents, including those saved as PDF files, are stored as images rather than as text. Documents scanned in the office or received from external parties by email are often image-based and not text-based. This means that the PDF file or scanned file is only an image of the document. As a result, the file cannot be indexed by the computer’s operating system or by any document management system (“DMS”). A search will never find the document, and word processing software cannot edit it.
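To make the indexing point concrete, here is a minimal sketch of a word index of the kind a search tool might build. It is an illustration only, not how any particular operating system or DMS works; the file names and contents are hypothetical, and the empty string stands for an image-only scan from which no text can be extracted.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            # Strip common punctuation so "affidavit." matches "affidavit".
            index[word.strip('.,;:()"')].add(name)
    return index


def search(index, word):
    """Return the sorted file names that contain the given word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical file names and contents, for illustration only.
documents = {
    "affidavit_typed.pdf": "The deponent affirms the contents of this affidavit.",
    "affidavit_scanned.pdf": "",  # image only: there is no text layer to extract
}

index = build_index(documents)
print(search(index, "affidavit"))  # only the text-based file is listed
```

The scanned file contributes no words to the index, so no search term can ever lead back to it until OCR has given it a text layer.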

OCR technology converts an image-based document, such as a scanned PDF, into a text-enabled document. With good OCR software, it is possible to identify all the text in the document. The end result is that searching for information becomes far easier, faster and more efficient. It is even possible to insert text from the document directly into a word processing file such as a Microsoft Word document, adding a whole new level of functionality, because the user can simply cut and paste the text that has been converted.

Some might argue that OCR is no longer merely a “nice to have” in a law firm but a necessity. When the reading, understanding and interpretation of words is the bread and butter of a law firm, even a single-character error can lead to a loss of meaning or a misinterpreted context. Accordingly, the question has now shifted from “whether OCR software is needed” to “how accurate is the OCR software I am getting”.
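One common way to put a number on “how accurate” is the character error rate: the number of single-character insertions, deletions and substitutions needed to turn the OCR output into the correct text, divided by the length of the correct text. The sketch below is a plain illustration of that measure and is not taken from any OCR product; the sample sentence is made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance between the two texts, divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Made-up example: one misread character reverses the meaning of the clause.
reference = "The tenant shall not assign the lease."
ocr_output = "The tenant shall now assign the lease."

print(edit_distance(reference, ocr_output))
print(f"{character_error_rate(reference, ocr_output):.1%}")
```

Here a single wrong character out of 38 gives an error rate of roughly 2.6%, which sounds small yet turns “shall not” into “shall now”. A low error rate is therefore necessary but does not by itself guarantee that the meaning survived.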

B. SCANNERS’ BUILT-IN OCR

Some scanners come with pre-installed OCR software. The benefit is that the OCR process happens immediately when a document is scanned, without any action by the end-user. The drawback is that this method will only OCR documents that you scan yourself, and not scanned documents that you receive from other parties such as clients, clerks or opposing counsel.

C. STAND-ALONE OCR SOFTWARE

You may wish to purchase and install OCR software on all the computers in your firm and instruct your staff to OCR every document. Adobe Acrobat (not Adobe Reader) is one of the most commonly used programs that can OCR PDF files without creating a new file in the process. Nuance’s Power PDF is another such product.

However, this method relies too heavily on the discretion of every person in your firm, and it also means you will have to install and maintain OCR software on every single computer your firm uses, which is a costly decision. Some products, such as Nuance’s Power PDF, provide additional features (at a cost) where PDF creation and OCR can be automated simply by dragging and dropping files into a dedicated folder.
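The drag-and-drop folder idea can be sketched in a few lines. This is an illustration of the general pattern only and does not describe how Power PDF or any other product implements it; the folder names, the `process_drop_folder` function and the `fake_ocr` stand-in are all hypothetical, and a real setup would call an actual OCR engine at the marked step.

```python
import shutil
import tempfile
from pathlib import Path


def process_drop_folder(inbox, done, ocr):
    """Run `ocr` on every PDF in `inbox`, then move the file to `done`.

    Returns the names of the files that were handled.
    """
    inbox, done = Path(inbox), Path(done)
    done.mkdir(parents=True, exist_ok=True)
    handled = []
    for pdf in sorted(inbox.glob("*.pdf")):
        ocr(pdf)  # assumption: a real OCR engine would be called here
        shutil.move(str(pdf), str(done / pdf.name))
        handled.append(pdf.name)
    return handled


def fake_ocr(pdf_path):
    """Stand-in for a real OCR engine; it only reports the file it was given."""
    print(f"would OCR {pdf_path.name}")


# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as root:
    inbox = Path(root) / "inbox"
    inbox.mkdir()
    (inbox / "letter.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (inbox / "notes.txt").write_text("not a PDF")
    handled = process_drop_folder(inbox, Path(root) / "done", fake_ocr)
    print(handled)
```

In practice a function like this would be run on a schedule or by a folder-watching service, so that anything dropped into the folder is processed without the user having to remember to do it.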

While many free OCR tools are available on the internet, such freeware will require you to create and save a new Word file as part of the text-identification process.

D. INTEGRATED, AUTOMATIC OCR

The DMS that you use to store, organize and manage your matter documents may be able to perform OCR for you automatically. The best way to achieve a paperless law office is to implement OCR in a way that ensures every scanned document, whether PDF, image or otherwise, is OCR’ed every time, without any user intervention and without having to install extra software. This way, it will not matter how the document reaches your firm.


Profile of Author(s):

https://www.linkedin.com/in/juyuanfoo/

[1] https://acrobat.adobe.com/us/en/acrobat/about-adobe-pdf.html