With the shift from physical to digital documents, extracting data from scanned documents through OCR and machine learning has become an essential convenience. To make that extraction accurate, research facilities and corporations have pushed forward both computer vision and Natural Language Processing (NLP).
Deep learning now makes it possible to extract far more than plain text from scans: tables, key-value pairs, and other structures can be recovered as well. Many OCR vendors offer products built for this purpose, serving both individuals and businesses.
This article explores current technology for extracting data from scanned documents. We'll walk through a Python tutorial and then look at popular solutions on the market that offer scanned document data extraction through OCR and machine learning.
Data extraction is the process of converting unstructured data into information that programs can interpret, so that people can process the data further.
Here, we list several of the most common types of data to be extracted from scanned documents.
The most common and most important task in data extraction from scanned documents is extracting text. This process, while seemingly straightforward, is in fact very difficult, because scanned documents usually arrive as images. In addition, the extraction method depends heavily on the type of text.
While text is densely printed most of the time, the ability to extract sparse text from poorly scanned documents or from handwritten letters with drastically varying styles is equally important. This process lets programs convert images into machine-encoded text, which can then be organized from unstructured data (without fixed formatting) into structured data for further analysis.
Tables are one of the most popular ways to store data, because the format is easy for people to read. Extracting tables from scanned documents requires technology beyond character detection: one must detect the lines and other visual features in order to extract the table properly and convert it into structured data for further computation.
Computer vision methods (described in detail in the following sections) are heavily used to achieve high-accuracy table extraction.
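The line-detection idea can be illustrated with a minimal sketch in plain Python. It assumes the scan has already been binarized into a grid of 0/1 pixels (1 meaning ink) and that table rules are perfectly straight and axis-aligned; the function names `find_rule_lines` and `cells_between` are hypothetical, and real pipelines would typically rely on a computer vision library and tolerate skew and broken lines.

```python
def find_rule_lines(pixels, min_fill=0.9):
    """Return (rows, cols): indices whose ink coverage is at least min_fill."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    # A row counts as a horizontal rule when nearly all of its pixels are ink.
    rows = [y for y in range(height) if sum(pixels[y]) >= min_fill * width]
    # Same test applied column by column for vertical rules.
    cols = [
        x for x in range(width)
        if sum(pixels[y][x] for y in range(height)) >= min_fill * height
    ]
    return rows, cols


def cells_between(lines):
    """Turn sorted rule positions into half-open (start, end) spans between them."""
    return [(a + 1, b) for a, b in zip(lines, lines[1:]) if b - a > 1]


# Toy 5x7 "scan": an outer border plus one inner horizontal and one vertical rule.
grid = [[int(c) for c in row] for row in [
    "1111111",
    "1001001",
    "1111111",
    "1001001",
    "1111111",
]]
rows, cols = find_rule_lines(grid)
print(rows, cols)            # [0, 2, 4] [0, 3, 6]
print(cells_between(rows))   # [(1, 2), (3, 4)]
print(cells_between(cols))   # [(1, 3), (4, 6)]
```

Each pair of a row span and a column span identifies one table cell, which could then be cropped and passed to OCR on its own.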
Key-value pairs (KVPs) are a common alternative format used for data storage in documents.
KVPs are essentially two data items, a key and a value, linked together as one. The key serves as a unique identifier for retrieving the value. A classic KVP example is the dictionary, where the words are the keys and the corresponding definitions are the values. These pairs, while usually unnoticed, appear very frequently in documents: survey questions such as name and age, and the prices of items in invoices, are all implicitly KVPs.
However, unlike tables, KVPs often exist in unknown formats and are sometimes even partially handwritten. For example, keys could be pre-printed in boxes while the values are handwritten when the form is completed. Therefore, finding the underlying structures needed to perform KVP extraction automatically remains an ongoing research problem even for the most advanced facilities and labs.
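For the simplest case, KVP extraction can be sketched as a pattern match over text that OCR has already produced. The example below is only an illustration under a strong assumption: each key and its value sit on one line, separated by a colon. The names `KVP_PATTERN` and `extract_kvps` are hypothetical, and the handwritten or boxed layouts described above would need layout-aware models instead.

```python
import re

# Assumed layout: "<key> : <value>" on a single line; keys are at most 40 characters.
KVP_PATTERN = re.compile(r"^\s*([^:]{1,40}?)\s*:\s*(.+?)\s*$")


def extract_kvps(ocr_text):
    """Collect key-value pairs from colon-separated lines of OCR output."""
    pairs = {}
    for line in ocr_text.splitlines():
        match = KVP_PATTERN.match(line)
        if match:
            key, value = match.groups()
            pairs[key.lower()] = value  # normalize keys so lookups are case-insensitive
    return pairs


sample = "Name: Jane Doe\nAge : 42\nThis line has no pair\nTotal: $19.99"
print(extract_kvps(sample))
# {'name': 'Jane Doe', 'age': '42', 'total': '$19.99'}
```

The result is an ordinary Python dictionary, which mirrors the key-value structure described above and can be serialized or validated directly.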
Finally, it is also very important to extract or capture data from figures within a scanned document. Charts such as pie charts and bar charts often carry crucial information. A good extraction process should be able to use the legends and numeric labels to recover at least part of the data behind a figure, and to read machine-readable marks such as barcodes or QR codes for further use.
Data extraction involves Optical Character Recognition (OCR) and Natural Language Processing (NLP).
OCR converts images of text into machine-encoded text, while NLP analyzes the resulting words to infer meaning. Other computer vision techniques, such as box and line detection, often accompany OCR to extract the data types mentioned above, such as tables and KVPs, for more comprehensive extraction.
The core improvements behind the data-extraction pipeline are tightly connected to the advances in deep learning that have contributed greatly to computer vision and natural language processing (NLP).
Deep learning is a major driver of the current enthusiasm for artificial intelligence and has been pushed to the forefront of numerous applications. In traditional engineering, the goal is to design a system or function that produces an output from a given input. Deep learning instead starts from example inputs and outputs and learns the intermediate relationship between them with a neural network, so that the relationship extends to new, unseen data.
A neural network, in its basic form a multi-layer perceptron (MLP), is a machine-learning architecture inspired by how human brains learn. The network contains neurons, which mimic biological neurons and “activate” to different degrees depending on their input. Sets of neurons form layers, and multiple layers are stacked into a network that can produce predictions of many forms (e.g., image class labels or bounding boxes for object detection).
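The layered structure can be made concrete with a small forward pass written in plain Python. This is a sketch for illustration only: the sigmoid activation, the layer sizes, and the function names `dense` and `mlp_forward` are assumptions, the weights are placeholders rather than trained values, and practical systems use a deep learning framework.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of the inputs, adds a bias, then activates."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Two inputs -> hidden layer of three neurons -> one output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
prediction = mlp_forward([1.0, 2.0], layers)
print(prediction)  # a single value between 0 and 1
```

Training consists of adjusting the weights and biases so that predictions on known inputs match their known outputs; only the forward computation is shown here.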
In computer vision, one variant of the neural network is used especially heavily: the convolutional neural network (CNN). Instead of only fully connected layers, a CNN uses convolutional kernels that slide across tensors (multi-dimensional arrays) to extract features. Combined with conventional network layers, CNNs are very successful in image-related tasks and form the basis for OCR and other feature detection.
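The sliding-kernel operation can be sketched in a few lines of plain Python. The function name `convolve2d` is hypothetical, the version below uses no padding and a stride of one, and, like most deep learning libraries, it computes what is strictly a cross-correlation (the kernel is not flipped). In a real CNN the kernel values are learned rather than hand-picked.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and take a product-sum at every position."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# A tiny image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds where brightness increases from left to right.
edge_kernel = [[-1, 1]]
print(convolve2d(image, edge_kernel))  # [[0, 1, 0], [0, 1, 0]]
```

The output is non-zero only at the column where the dark-to-bright edge sits, which is the kind of local feature that character strokes and table lines produce in a scan.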
NLP, on the other hand, relies on a different family of networks designed for sequential data. Unlike images, where one image is typically independent of the next, text prediction benefits greatly when the words before or after are also considered. For years the dominant choice was the long short-term memory network (LSTM), a recurrent network that takes its previous results as inputs when predicting the current one. Bidirectional LSTMs were also often adopted to improve predictions by considering both the preceding and the following context. More recently, transformers, which use an attention mechanism, have risen to prominence because of their greater flexibility, and they give better results than these earlier networks on sequential data.
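The attention mechanism at the heart of transformers can be sketched as follows. This is a simplified single-head scaled dot-product attention in plain Python, with hypothetical function names and toy vectors standing in for word representations; real transformers add learned projections, multiple heads, and positional information.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, mix the values according to how well the query matches each key."""
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Three toy "word" vectors attending to one another (self-attention).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = attention(words, words, words)
print(len(mixed), len(mixed[0]))  # 3 2
```

Because every position can look at every other position in one step, context from both earlier and later words is available without processing the sequence one token at a time.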
The main goal of data extraction is to convert data from unstructured documents into structured formats. Highly accurate retrieval of text, figures, and data structures makes downstream numerical and contextual analysis much easier.
Businesses and large organizations deal with thousands of similarly formatted documents every day: big banks receive numerous applications on identical forms, and research teams have to analyze piles of forms for statistical analysis. Automating the initial step of extracting data from scanned documents therefore significantly reduces repetitive manual work and lets workers focus on analyzing data and reviewing applications instead of keying in information.
To give a clearer view of how data extraction works in practice, we show two ways of extracting data from scanned documents.
One may build a simple OCR data extraction engine with PyTesseract as follows:
    try:
        from PIL import Image
    except ImportError:
        import Image
    import pytesseract

    # If you don't have the tesseract executable in your PATH,
    # set its full path explicitly, for example:
    # pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'

    # Simple image to string
    print(pytesseract.image_to_string(Image.open('test.png')))

    # List of available languages
    print(pytesseract.get_languages(config=''))

    # French text image to string
    print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))

    # In order to bypass the image conversions of pytesseract, just use a relative or absolute image path
    # NOTE: In this case you should provide tesseract-supported images or tesseract will return an error
    print(pytesseract.image_to_string('test.png'))

    # Batch processing with a single file containing the list of multiple image file paths
    print(pytesseract.image_to_string('images.txt'))

    # Timeout/terminate the tesseract job after a period of time
    try:
        print(pytesseract.image_to_string('test.jpg', timeout=2))    # Timeout after 2 seconds
        print(pytesseract.image_to_string('test.jpg', timeout=0.5))  # Timeout after half a second
    except RuntimeError as timeout_error:
        # Tesseract processing is terminated
        pass

    # Get bounding box estimates
    print(pytesseract.image_to_boxes(Image.open('test.png')))

    # Get verbose data including boxes, confidences, line and page numbers
    print(pytesseract.image_to_data(Image.open('test.png')))

    # Get information about orientation and script detection
    print(pytesseract.image_to_osd(Image.open('test.png')))

    # Get a searchable PDF
    pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
    with open('test.pdf', 'w+b') as f:
        f.write(pdf)  # pdf type is bytes by default

    # Get HOCR output
    hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

    # Get ALTO XML output
    xml = pytesseract.image_to_alto_xml('test.png')
For more information about the code, check out the official pytesseract documentation.
In simple terms, the code extracts data such as text and bounding boxes from a given image. While fairly useful, the engine is nowhere near as strong as those offered by advanced commercial solutions, whose providers have substantial computational power for training their models.
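As a possible next step, the verbose output of `image_to_data` can be turned into structured records. The sketch below parses the tab-separated text that the function returns by default and keeps only words above a confidence threshold. The helper name `parse_tesseract_tsv`, the threshold of 60, and the sample rows are assumptions made for illustration; the sample is hand-written rather than real OCR output, so the snippet runs without Tesseract installed.

```python
import csv
import io


def parse_tesseract_tsv(tsv_text, min_conf=60.0):
    """Return word records (text, confidence, bounding box) from image_to_data TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])  # layout rows (page, block, line) carry a confidence of -1
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "conf": conf,
                "box": (int(row["left"]), int(row["top"]), int(row["width"]), int(row["height"])),
            })
    return words


# Hand-written sample in the column layout that image_to_data produces.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tInvoice\n"
    "5\t1\t1\t1\t1\t2\t102\t92\t40\t24\t41\t#l23\n"
)
print(parse_tesseract_tsv(SAMPLE_TSV))
# [{'text': 'Invoice', 'conf': 96.0, 'box': (36, 92, 60, 24)}]
```

With real input, the string would come from `pytesseract.image_to_data(Image.open('test.png'))`. Filtering on confidence is a simple way to keep poorly recognized words out of downstream processing.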
The second approach uses the Google Cloud Vision API to run OCR on a PDF or TIFF file stored in Google Cloud Storage (GCS):

    def async_detect_document(gcs_source_uri, gcs_destination_uri):
        """OCR with PDF/TIFF as source files on GCS"""
        import json
        import re
        from google.cloud import vision
        from google.cloud import storage

        # Supported mime_types are: 'application/pdf' and 'image/tiff'
        mime_type = 'application/pdf'

        # How many pages should be grouped into each json output file.
        batch_size = 2

        client = vision.ImageAnnotatorClient()

        feature = vision.Feature(
            type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)

        gcs_source = vision.GcsSource(uri=gcs_source_uri)
        input_config = vision.InputConfig(
            gcs_source=gcs_source, mime_type=mime_type)

        gcs_destination = vision.GcsDestination(uri=gcs_destination_uri)
        output_config = vision.OutputConfig(
            gcs_destination=gcs_destination, batch_size=batch_size)

        async_request = vision.AsyncAnnotateFileRequest(
            features=[feature], input_config=input_config,
            output_config=output_config)

        operation = client.async_batch_annotate_files(
            requests=[async_request])

        print('Waiting for the operation to finish.')
        operation.result(timeout=420)

        # Once the request has completed and the output has been
        # written to GCS, we can list all the output files.
        storage_client = storage.Client()

        match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
        bucket_name = match.group(1)
        prefix = match.group(2)

        bucket = storage_client.get_bucket(bucket_name)

        # List objects with the given prefix.
        blob_list = list(bucket.list_blobs(prefix=prefix))
        print('Output files:')
        for blob in blob_list:
            print(blob.name)

        # Process the first output file from GCS.
        # Since we specified batch_size=2, the first response contains
        # the first two pages of the input file.
        output = blob_list[0]

        json_string = output.download_as_bytes().decode('utf-8')
        response = json.loads(json_string)

        # The actual response for the first page of the input file.
        first_page_response = response['responses'][0]
        annotation = first_page_response['fullTextAnnotation']

        # Here we print the full text from the first page.
        # The response contains more information:
        # annotation/pages/blocks/paragraphs/words/symbols
        # including confidence scores and bounding boxes
        print('Full text:\n')
        print(annotation['text'])
Ultimately, Google's document OCR, used here through the Cloud Vision API, lets you extract a lot of information from documents with high accuracy. The service also offers features for specific uses, including text extraction from both ordinary document images and in-the-wild images.
Please take a look at the Google Cloud Vision documentation for more.
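The comments in the code above mention that the response nests pages, blocks, paragraphs, words, and symbols. A minimal sketch of walking that hierarchy to rebuild words with their confidence scores might look like this. The helper name `words_from_annotation` is hypothetical, and the sample dictionary is a hand-built stand-in shaped like a `fullTextAnnotation` entry, not real API output.

```python
def words_from_annotation(annotation, min_conf=0.0):
    """Rebuild (word, confidence) pairs by walking pages > blocks > paragraphs > words > symbols."""
    words = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # Each word is stored as a list of single-character symbols.
                    text = "".join(symbol.get("text", "") for symbol in word.get("symbols", []))
                    conf = word.get("confidence", 0.0)
                    if text and conf >= min_conf:
                        words.append((text, conf))
    return words


# Hand-built stand-in for response['responses'][0]['fullTextAnnotation'].
sample_annotation = {
    "text": "Total 42",
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"confidence": 0.98, "symbols": [{"text": c} for c in "Total"]},
                    {"confidence": 0.55, "symbols": [{"text": c} for c in "42"]},
                ],
            }],
        }],
    }],
}
print(words_from_annotation(sample_annotation, min_conf=0.9))  # [('Total', 0.98)]
```

Inside `async_detect_document`, the same helper could be called on the `annotation` variable to flag low-confidence words for manual review.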
Besides the large corporations that offer APIs for document data extraction, several other solutions provide highly accurate PDF OCR services. We present several PDF OCR options that specialize in different aspects, as well as some recent research prototypes that show promising results*:
*Side note: Many OCR services target tasks such as images in the wild. We skipped those services because we focus only on PDF document reading.
Armed with an understanding of the key concepts, tools, and platforms covered in this article, you'll be well-equipped to implement or enhance data extraction capabilities in your own projects and organizations. As scanned document data extraction technologies continue to evolve, staying on top of the latest advancements will be key to maximizing efficiency, insights, and competitive advantage.