OCR Plugin

Overview

The qbo.Attachment.GoogleDrive IService plugin supports Optical Character Recognition (OCR) of documents stored in a QBO3 system. Three methods are supported:
  • Attachment/OcrToDescription: OCR text is saved to Attachment.Description
  • Attachment/OcrToAttachment: OCR text is saved as a new Attachment
  • Attachment/Ocr: calls OcrToDescription

Ocr To Description

Storing the text of a document in the Attachment.Description column of the database enables SQL queries against the text of the document. To fully leverage complex searching, full-text indexing of the Attachment.Description field is recommended. Large-volume full-text searching can be a measurable performance drain on the database server, so use this feature judiciously.

Ocr To Attachment

Storing the text of a document as another Attachment in QBO3 allows leveraging third-party document search engines, such as Amazon CloudSearch, or internal corporate search appliances. It is more complicated to configure than OcrToDescription because you must also orchestrate delivery of the text documents to the third-party search engine, but it scales horizontally.

Extracting Data From OCR Text

The qbo.Score.TextParsingEngine can extract data points from OCR text. Given a document type (e.g. a Mortgage Note):
  • Create a Score Template that uses the TextParsingEngine, and applies to an Attachment
  • For each data point to be extracted, create an Item under the score, with the Prompt field specifying the pattern of text to be extracted
    • each item should specify a value type: money, date, float, boolean, string
  • Call Score/Calculate?Object=Attachment&ObjectID={AttachmentID}&Template={Score Template Name}
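The Template value is a template name that may contain spaces, so it needs URL encoding when the call is made over HTTP. A minimal sketch of building the request URL follows; the host name, attachment ID, and template name are hypothetical placeholders, and only the Score/Calculate path and its three parameters come from the step above.

```python
from urllib.parse import quote, urlencode


def score_calculate_url(base_url, attachment_id, template):
    """Build a Score/Calculate URL for an Attachment (base_url is site-specific)."""
    query = urlencode(
        {"Object": "Attachment", "ObjectID": attachment_id, "Template": template},
        quote_via=quote,  # encode spaces as %20 rather than "+"
    )
    return f"{base_url.rstrip('/')}/Score/Calculate?{query}"


# Hypothetical host, ID, and template name, for illustration only.
url = score_calculate_url("https://qbo.example.com", 12345, "Mortgage Note Parser")
print(url)
# https://qbo.example.com/Score/Calculate?Object=Attachment&ObjectID=12345&Template=Mortgage%20Note%20Parser
```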
The trick to extracting text is to set up patterns to look for. For example, given the following text:

I promise to pay U.S. $432,987.65 (this amount is called 'Principle'). I will pay interest at a yearly rate of 3.97%.
My monthly payment will be in the amount of U.S. $1,507.55. Payments start on 1/12/17, with late charges starting 15 days later.
Mortgage is issued to John Doe, known as the mortgagor. PMI Required: yes.

create items:
  • LoanAmount: value type is Money, prompt is "I promise to pay U.S. ${Value} (this amount is called 'Principle')"
  • InterestRate: value type is Float, prompt is "I will pay interest at a yearly rate of {Value}%."
  • Payment: value type is Money, prompt is "My monthly payment will be in the amount of U.S. ${Value}."
  • StartDate: value type is Date, prompt is "Payments start on {Value}, with late charges starting"
  • Borrower: value type is String, prompt is "Mortgage is issued to {Value}, known as the mortgagor."
  • Insurance: value type is Boolean, prompt is "PMI Required: {Value}"
The engine will look for the "prompt" string of text, where the {Value} can be any string matching the item's value type. For example, given the prompt "I promise to pay U.S. ${Value} (this amount is called 'Principle')", the following results would be obtained:
  • "I promise to pay U.S. $350,500.25 (this amount is called 'Principle')" => 350500.25
  • "I promise to pay U.S. $350500.25 (this amount is called 'Principle')" => 350500.25
  • "I promise to pay U.S. $3505,00.25 (this amount is called 'Principle')" => no match because 3505,00.25 is not a valid monetary amount
  • "I promise to pay U.S. $12/25/2017 (this amount is called 'Principle')" => no match
  • "I promise to pay U.S. $abc123 (this amount is called 'Principle')" => no match
  • "I promise to pay U.S. 350,500.25 (this amount is called 'Principle')" => no match
    • note the prompt expects a $ before {Value}, and this string does not contain the $
    • you can omit $ from money expressions; the engine is smart enough to equate $350,500.25 with 350,500.25
    • a better prompt would be "I promise to pay U.S. {Value} (this amount is called 'Principle')"
For the geeks in the crowd, the following regular expressions are used to parse this data:
  • money: (\s)*(\${0,1})(\d+|\d{1,3}(,\d{3})*)(\.\d+)(\s)*
  • float: (\s)*(\d+|\d{1,3}(,\d{3})*)(\.\d+)(\s)*
  • date: (\s)*(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})(\s)*
  • boolean: (\s)*(true|false|1|0|yes|no|y|n)(\s)*
  • string: (.)*
You can test these expressions against text using a regex lint website.
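The expressions can also be tried locally. The sketch below is one plausible way to combine a prompt with a value-type expression: it escapes the literal text of the prompt and splices the expression in place of {Value}. It is an illustration under that assumption, not the TextParsingEngine's actual implementation, and the helper names prompt_to_regex and extract are hypothetical.

```python
import re

# Value-type expressions copied verbatim from the list above.
VALUE_PATTERNS = {
    "money": r"(\s)*(\${0,1})(\d+|\d{1,3}(,\d{3})*)(\.\d+)(\s)*",
    "float": r"(\s)*(\d+|\d{1,3}(,\d{3})*)(\.\d+)(\s)*",
    "date": r"(\s)*(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})(\s)*",
    "boolean": r"(\s)*(true|false|1|0|yes|no|y|n)(\s)*",
    "string": r"(.)*",
}


def prompt_to_regex(prompt, value_type):
    """Escape the literal parts of a prompt and splice in the value pattern."""
    before, _, after = prompt.partition("{Value}")
    value = "(?P<value>" + VALUE_PATTERNS[value_type] + ")"
    return re.compile(re.escape(before) + value + re.escape(after))


def extract(prompt, value_type, text):
    """Return the matched value as a stripped string, or None if nothing matches."""
    match = prompt_to_regex(prompt, value_type).search(text)
    return match.group("value").strip() if match else None


note = "I promise to pay U.S. $350,500.25 (this amount is called 'Principle')"
strict = "I promise to pay U.S. ${Value} (this amount is called 'Principle')"
loose = "I promise to pay U.S. {Value} (this amount is called 'Principle')"

print(extract(strict, "money", note))                   # 350,500.25
print(extract(loose, "money", note))                    # $350,500.25
print(extract(strict, "money", note.replace("$", "")))  # None
print(extract(loose, "money", note.replace("$", "")))   # 350,500.25
```

As written, the money and float expressions require a decimal part, so under this sketch a whole-dollar amount such as $1,500 would not match.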

Notes

  • This plugin leverages Google Documents' OCR functionality. The document is not persisted in the Google cloud; it is transmitted, OCRed, downloaded, and deleted, all with encryption over the wire.
  • For full functionality, install the qbo.Attachment.ABCPDF plugin as well to facilitate:
    • converting non-PDF to PDF files prior to OCR, and
    • breaking large PDFs into smaller chunks to OCR documents larger than 2MB
Configurable application settings (qbo.Attachment.GoogleDrive.Properties.Settings) include:
  • X509CertificatePath: path to a Google-provided P12 (PKCS #12) certificate for making Google API calls (defaults to a Quandis account)
  • X509CertificatePassword: password for the Google-provided P12 certificate
  • ServiceAccountEmail: email account of the Google service account used to access the Google Drive API
  • GoogleApplicationName: Google project name authorized to access the Drive API
  • SubscriptionPrefix: ObjectSubscription prefix when creating SubscriberID records for an attachment uploaded to Google Drive
  • DeleteAfterOcr: if true, any OCRed documents will be deleted as soon as their text is retrieved
  • OcrChunkThreshold: size, in bytes, that triggers calling Attachment/SplitPdf, to OCR in chunks
  • OcrThrowOnError: if true, any errors on any chunks will result in raising an error. If false, an Attachment.OnRuleWarning event will be raised instead
  • OcrRetryAttempts: number of retry attempts for a single chunk. Used to handle 'spurious' network errors.
  • OcrErrorMessage: if OcrThrowOnError is false, this text will be injected into the text stream in place of a failed chunk
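
The interaction between the retry and error settings can be sketched as follows. This illustrates the behavior described above and is not the plugin's code: the function ocr_chunks and its ocr_chunk callback are hypothetical, and it assumes OcrRetryAttempts counts retries made after the first attempt.

```python
def ocr_chunks(chunks, ocr_chunk, retry_attempts=3, throw_on_error=True,
               error_message="[OCR failed]"):
    """OCR each chunk in order, retrying failures; returns (text, warnings).

    ocr_chunk is a caller-supplied function (hypothetical) that returns the
    text of one chunk or raises an exception on failure.
    """
    texts, warnings = [], []
    for index, chunk in enumerate(chunks):
        last_error = None
        for _ in range(1 + retry_attempts):  # first attempt plus retries
            try:
                texts.append(ocr_chunk(chunk))
                break
            except Exception as error:  # e.g. a spurious network error
                last_error = error
        else:
            # Every attempt for this chunk failed.
            if throw_on_error:
                raise RuntimeError(f"OCR failed for chunk {index}") from last_error
            warnings.append(f"chunk {index}: {last_error}")
            texts.append(error_message)  # placeholder in the text stream
    return "".join(texts), warnings
```

With throw_on_error set to False, the caller receives the complete text with the placeholder standing in for each failed chunk, plus a list of warnings that plays the role of the Attachment.OnRuleWarning event.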

Notes

As of 6/8/2017, the Google NuGet packages used by the OCR plugin reference Newtonsoft.Json 9.0.*, while qbo.Core references Newtonsoft.Json 10.0.2.*. To resolve this version mismatch, ensure your web.config contains the following under the root configuration node:

<runtime>
  <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
    <dependentAssembly>
      <assemblyIdentity name="Newtonsoft.Json" publicKeyToken="30ad4fe6b2a6aeed" culture="neutral" />
      <bindingRedirect oldVersion="0.0.0.0-10.0.0.0" newVersion="10.0.0.0"/>
    </dependentAssembly>
  </assemblyBinding>
</runtime>



