What is Amazon Textract?

Amazon Textract is a service that automatically extracts text and data from scanned documents. Amazon Textract goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

Many companies today extract data from documents and forms through manual data entry that’s slow and expensive or through simple optical character recognition (OCR) software that requires manual customization or configuration. Rules and workflows for each document and form often need to be hard-coded and updated with each change to the form or when dealing with multiple forms. If the form deviates from the rules, the output is often scrambled and unusable.

Amazon Textract overcomes these challenges by using machine learning to instantly “read” virtually any type of document and accurately extract text and data without manual effort or custom code. With Textract you can quickly automate document workflows, enabling you to process millions of document pages in hours. Once the information is captured, you can act on it within your business applications, for example to initiate the next steps in a loan application or in medical claims processing. Additionally, you can create smart search indexes, build automated approval workflows, and better maintain compliance with document archival rules by flagging data that may require redaction.

Optical Character Recognition (OCR)

Amazon Textract uses Optical Character Recognition (OCR) technology to automatically detect printed text and numbers in a scan or rendering of a document, such as a legal document or a scan of a book. 
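
As a rough illustration of how detected text might be consumed, the sketch below pulls the text of each detected line out of a response dictionary. It assumes the response is a dictionary with a "Blocks" list whose entries carry "BlockType" and "Text" fields, as described in the Textract API reference; in practice such a dictionary would come from a call like boto3's detect_document_text, which is left out here so the example runs without AWS credentials. The function name extract_lines and the sample data are made up for this example.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order the blocks appear."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written sample shaped like a text-detection response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Chapter 1", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Chapter", "Confidence": 99.3},
        {"BlockType": "WORD", "Id": "w2", "Text": "1", "Confidence": 98.9},
        {"BlockType": "LINE", "Id": "l2", "Text": "It was a bright cold day.", "Confidence": 97.4},
    ]
}

lines = extract_lines(sample_response)
print("\n".join(lines))
```

WORD blocks are skipped on purpose: each LINE block already contains the joined text of its words, so collecting both would duplicate the content.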

Form Extraction

Amazon Textract enables you to detect key-value pairs in document images automatically so that you can retain the inherent context of the document without any manual intervention. A key-value pair is a set of linked data items. For instance, on a document the field “First Name” would be the key and “Jane” would be the value. This makes it easy to import the extracted data into a database or to provide it as a variable into an application. With traditional OCR solutions, keys and values are extracted as simple text. The relationship between them is lost unless hard-coded rules are written and maintained for each form. 
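
The “First Name” / “Jane” example can be sketched in code. The snippet below assumes the block layout described in the Textract API reference for form analysis: KEY_VALUE_SET blocks tagged with an EntityTypes of KEY or VALUE, a VALUE relationship from each key to its value block, and CHILD relationships to the WORD blocks that hold the text. The helper names (text_of, extract_key_values) and the sample blocks are hypothetical and written by hand for illustration.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def text_of(block: Block, by_id: Dict[str, Block]) -> str:
    """Join the text of the WORD blocks linked to `block` through CHILD relationships."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(blocks: List[Block]) -> Dict[str, str]:
    """Map each detected key's text to the text of its linked value block(s)."""
    by_id = {b["Id"]: b for b in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts = [
            text_of(by_id[value_id], by_id)
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for value_id in rel["Ids"]
        ]
        pairs[text_of(block, by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample shaped like a form-analysis response (not real output).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "First"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
]

print(extract_key_values(sample_blocks))
```

Because the result is an ordinary dictionary, it can be passed straight to a database insert or used as variables in an application, which is the use the paragraph above describes.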

Table Extraction

Amazon Textract preserves the composition of data stored in tables during extraction. This is helpful for documents that are largely composed of structured data, such as financial reports or medical records that have column names in the top row of the table followed by rows of individual entries. You can use this feature to automatically load the extracted data into a database using a pre-defined schema. For example, rows of item numbers and quantities in an inventory report retain their association, which makes it easy to increment item totals in an inventory management application.
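
A minimal sketch of the inventory example follows. It assumes the table layout described in the Textract API reference: a TABLE block whose CHILD relationships point to CELL blocks, each carrying 1-based RowIndex and ColumnIndex fields and CHILD links to WORD blocks. The function names, the small _cell and _word sample builders, and the sample data are hypothetical. The sketch also treats the first row as the header, which matches the documents described above but is an assumption about the input.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def cell_text(cell: Block, by_id: Dict[str, Block]) -> str:
    """Join the WORD blocks that a CELL block links to through CHILD relationships."""
    words: List[str] = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(
                by_id[i]["Text"] for i in rel["Ids"] if by_id[i]["BlockType"] == "WORD"
            )
    return " ".join(words)


def table_to_grid(table: Block, by_id: Dict[str, Block]) -> List[List[str]]:
    """Rebuild a row-major grid of cell text from a TABLE block."""
    cells: Dict[Tuple[int, int], str] = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id[cell_id]
            if cell["BlockType"] == "CELL":
                cells[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell, by_id)
    n_rows = max((r for r, _ in cells), default=0)
    n_cols = max((c for _, c in cells), default=0)
    # Row and column indexes are 1-based; missing cells become empty strings.
    return [
        [cells.get((r, c), "") for c in range(1, n_cols + 1)]
        for r in range(1, n_rows + 1)
    ]


def grid_to_records(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as column names and return one dict per remaining row."""
    if not grid:
        return []
    header, *rows = grid
    return [dict(zip(header, row)) for row in rows]


def _cell(cell_id: str, row: int, col: int, word_id: str) -> Block:
    """Sample-data builder (illustration only)."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


def _word(word_id: str, text: str) -> Block:
    """Sample-data builder (illustration only)."""
    return {"Id": word_id, "BlockType": "WORD", "Text": text}


# Hand-written sample shaped like a table-analysis response (not real output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4", "c5", "c6"]}]},
    _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
    _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
    _cell("c5", 3, 1, "w5"), _cell("c6", 3, 2, "w6"),
    _word("w1", "Item"), _word("w2", "Quantity"),
    _word("w3", "A-100"), _word("w4", "3"),
    _word("w5", "B-200"), _word("w6", "12"),
]

by_id = {b["Id"]: b for b in sample_blocks}
grid = table_to_grid(by_id["t1"], by_id)
records = grid_to_records(grid)
print(records)
```

Each record keeps the item number and its quantity together, so loading the rows into a table with a matching schema becomes a direct mapping.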

Bounding Boxes

All extracted data is returned with bounding box coordinates. A bounding box is a polygon frame that encloses each piece of identified data, such as a single word, a line, a table, or an individual cell within a table. This makes it possible to audit where a word or number came from in the source document, and to guide users in document search systems that return scans of original documents as search results. For example, when searching medical records for patient history details, users can see where in the source document a result appears and note that location for future searches.
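
To highlight a result on the original scan, the returned coordinates need to be mapped onto the image. The sketch below assumes the geometry format described in the Textract API reference, where a bounding box has Left, Top, Width, and Height fields expressed as ratios of the page width and height (values between 0 and 1). The function name to_pixel_box, the sample box, and the image size are made up for this example.

```python
from typing import Dict, Tuple


def to_pixel_box(
    bounding_box: Dict[str, float], image_width: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a ratio-based bounding box to (left, top, right, bottom) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return (left, top, right, bottom)


# Hand-written sample geometry (not real output) on a 1000 x 800 pixel scan.
sample_box = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}

pixel_box = to_pixel_box(sample_box, image_width=1000, image_height=800)
print(pixel_box)  # (250, 400, 750, 500)
```

The resulting tuple uses the (left, top, right, bottom) order that many imaging libraries accept for drawing rectangles, so a search interface could outline the matched word on the scanned page.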

Adjustable Confidence Thresholds

When information is extracted from documents, Amazon Textract returns a confidence score for everything it identifies so that you can make informed decisions about how you want to use the results. For instance, if you are extracting information from tax documents and want to ensure high accuracy, you can create business logic that flags any extracted information with a confidence score lower than 95% for review by a human. However, you may choose a lower threshold for other types of documents where an error has little to no negative consequence, such as processing resumes or digitizing archived documents.
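
The 95% rule described above could be expressed as a small piece of business logic. The sketch below assumes each extracted block carries a "Confidence" field on a 0 to 100 scale, as described in the Textract API reference. The function name split_by_confidence, the default threshold, and the sample values are illustrative choices, not part of the service.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def split_by_confidence(
    blocks: List[Block], threshold: float = 95.0
) -> Tuple[List[Block], List[Block]]:
    """Split blocks into (accepted, needs_review) around a confidence threshold."""
    accepted: List[Block] = []
    needs_review: List[Block] = []
    for block in blocks:
        # Blocks without a score are treated as 0.0 so they go to human review.
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hand-written sample values (not real output).
sample_blocks = [
    {"Text": "Total income", "Confidence": 99.2},
    {"Text": "48,210", "Confidence": 94.9},
    {"Text": "2019", "Confidence": 95.0},
]

accepted, needs_review = split_by_confidence(sample_blocks, threshold=95.0)
print([b["Text"] for b in needs_review])  # ['48,210']
```

For lower-stakes documents such as resumes, the same function can be called with a smaller threshold so that fewer items are routed to a reviewer.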
