OCR Plugin

Overview

The qbo.Attachment.GoogleDrive IService plugin supports Optical Character Recognition (OCR) of documents stored in a QBO3 system. There are three methods supported:
  • Attachment/OcrToDescription: OCR text is saved to Attachment.Description
  • Attachment/OcrToAttachment: OCR text is saved as a new Attachment
  • Attachment/Ocr: calls OcrToDescription

Ocr To Description

Storing the text of a document in the Attachment.Description column of the database enables SQL queries against the text of the document. To fully leverage complex searching, full-text indexing of the Attachment.Description field is recommended. High-volume full-text searching can be a measurable performance drain on the database server, so use this feature judiciously.
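
As a minimal sketch of the kind of query this enables, the example below uses an in-memory SQLite database as a stand-in for the QBO3 database. Only the Attachment table name and the Description column come from the text above; the AttachmentID and FileName columns, the sample rows, and the find_attachments helper are assumptions made for illustration. A plain LIKE match is shown; on a production server a full-text index would normally take its place.

```python
import sqlite3


def find_attachments(conn, phrase):
    """Return (AttachmentID, FileName) rows whose OCR text contains phrase."""
    cursor = conn.execute(
        "SELECT AttachmentID, FileName FROM Attachment "
        "WHERE Description LIKE ? ORDER BY AttachmentID",
        (f"%{phrase}%",),  # parameterized to avoid SQL injection
    )
    return cursor.fetchall()


# Hypothetical schema and data: a stand-in for the real Attachment table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Attachment ("
    "AttachmentID INTEGER PRIMARY KEY, FileName TEXT, Description TEXT)"
)
conn.executemany(
    "INSERT INTO Attachment (FileName, Description) VALUES (?, ?)",
    [
        ("Note.pdf", "this promissory note is made by the borrower"),
        ("Appraisal.pdf", "uniform residential appraisal report"),
        ("Allonge.pdf", "allonge to promissory note"),
    ],
)

matches = find_attachments(conn, "promissory note")
for attachment_id, file_name in matches:
    print(attachment_id, file_name)
```

A substring scan like this reads every row, which is one reason full-text indexing is recommended once the number of OCRed documents grows.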

Ocr To Attachment

Storing the text of a document as another Attachment in QBO3 allows you to leverage third-party document search engines, such as Amazon CloudSearch, or internal corporate search appliances. It is more complicated to configure than OcrToDescription because you must also orchestrate delivery of the text documents to the third-party search engine, but it scales horizontally.

Notes

  • This plugin leverages the OCR functionality of Google Documents. The document is not persisted in the Google cloud; it is transmitted, OCRed, downloaded, and deleted, all with encryption over the wire.
  • For full functionality, install the qbo.Attachment.ABCPDF plugin as well to facilitate:
    • converting non-PDF to PDF files prior to OCR, and
    • breaking large PDFs into smaller chunks to OCR documents larger than 2MB

Configurable application settings (qbo.Attachment.GoogleDrive.Properties.Settings) include:
  • X509CertificatePath: path to a Google-provided P12 (PKCS #12) certificate for making Google API calls (defaults to a Quandis account)
  • X509CertificatePassword: password for the Google-provided P12 certificate
  • ServiceAccountEmail: email account of the Google service account used to access the Google Drive API
  • GoogleApplicationName: Google project name authorized to access the Drive API
  • SubscriptionPrefix: ObjectSubscription prefix when creating SubscriberID records for an attachment uploaded to Google Drive
  • DeleteAfterOcr: if true, any OCRed documents will be deleted as soon as their text is retrieved
  • OcrChunkThreshold: size, in bytes, that triggers calling Attachment/SplitPdf, to OCR in chunks
  • OcrThrowOnError: if true, an error on any chunk raises an error. If false, an Attachment.OnRuleWarning event is raised instead.
  • OcrRetryAttempts: number of retry attempts for a single chunk. Used to handle 'spurious' network errors.
  • OcrErrorMessage: if OcrThrowOnError is false, this text is injected into the text stream in place of a failed chunk
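
The way the chunking and error settings above interact can be sketched roughly as follows. This is not the plugin's implementation: the OcrSettings class, the ocr_document function, the default values, and the fake splitter and OCR callables are all hypothetical, and the sketch assumes that OcrRetryAttempts counts retries in addition to the first attempt and that splitting happens only when the size exceeds OcrChunkThreshold.

```python
from dataclasses import dataclass


@dataclass
class OcrSettings:
    """Stand-ins for the plugin settings; the default values are assumptions."""
    ocr_chunk_threshold: int = 2 * 1024 * 1024  # OcrChunkThreshold, in bytes
    ocr_throw_on_error: bool = False            # OcrThrowOnError
    ocr_retry_attempts: int = 3                 # OcrRetryAttempts
    ocr_error_message: str = "[OCR failed]"     # OcrErrorMessage


def ocr_document(pdf, split_pdf, ocr_chunk, settings, warnings):
    """OCR a document, splitting it into chunks when it exceeds the threshold."""
    if len(pdf) > settings.ocr_chunk_threshold:
        chunks = split_pdf(pdf)  # plays the role of Attachment/SplitPdf
    else:
        chunks = [pdf]
    texts = []
    for index, chunk in enumerate(chunks):
        last_error = None
        # One initial attempt plus the configured number of retries.
        for _ in range(1 + settings.ocr_retry_attempts):
            try:
                texts.append(ocr_chunk(chunk))
                last_error = None
                break
            except Exception as error:
                last_error = error
        if last_error is not None:
            if settings.ocr_throw_on_error:
                raise last_error
            # Stand-in for raising an Attachment.OnRuleWarning event.
            warnings.append(f"chunk {index}: {last_error}")
            texts.append(settings.ocr_error_message)
    return "\n".join(texts)


# Fake collaborators so the sketch runs without Google Drive or a PDF library.
def fake_split(pdf):
    return [pdf[i:i + 4] for i in range(0, len(pdf), 4)]


def fake_ocr(chunk):
    if chunk == b"BAD!":
        raise IOError("simulated network error")
    return chunk.decode("ascii").lower()


settings = OcrSettings(ocr_chunk_threshold=4, ocr_retry_attempts=1,
                       ocr_error_message="[unreadable]")
warnings = []
text = ocr_document(b"PAGEBAD!DONE", fake_split, fake_ocr, settings, warnings)
print(text)
print(warnings)
```

With OcrThrowOnError set to false, the failed chunk is replaced by the OcrErrorMessage text and the remaining chunks are still processed; with it set to true, the same failure would stop the whole operation.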

