Document Extraction API Cookbook

Overview

The Document Extraction API helps businesses extract data from unstructured or semi-structured documents. It uses AI to streamline data extraction, though it comes with a learning curve and takes some intuition to use well.

At the core of the system is the JSON schema, which provides a type-safe target for data transformations. End users do not need to understand every nuance of the JSON Schema specification, but those who do can build more tailored workflows.

API Documentation

For detailed API documentation, visit: Swagger API Documentation

Endpoints

Opinionated Endpoints: Designed for quick, turn-key operations (for example, parse-invoice).
Unopinionated Endpoints: Require a JSON schema for more customized extraction (for example, document-extract).

Simple Discovery Workflow for Invoices

Take a Sample: Select 5-10 invoices.
Run Invoices Through the parse-invoice Endpoint: Use the provided CURL command.
Examine Results and Integrate: Review the JSON output and integrate it into your workflow.

Example CURL Command

curl -X 'POST' \
  'https://documents.teachprotege.ai/parse-invoice' \
  -H 'accept: application/json' \
  -H 'Authorization: <api-token>' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@Invoice83916608.PDF;type=application/pdf' \
  -F 'resolution=HIGH' \
  -F 'pre_process=true' \
  -F 'first_page_only=false' \
  -F 'process_items=true' \
  -F 'process_summary=true'

Sample Response

{
  "summary": {
    "invoiceNumber": "58720435",
    "PON": "PO-XYZ-090755",
    "invoiceDate": "6/12/24",
    "shipDate": "6/12/24",
    "soldToName": "FUTURA INDUSTRIES CORP",
    "soldToAddress": "215 EAST BAY LANE COAST CITY 72855 CA USA"
  },
  "items": [
    {
      "itemNumber": "4960K62",
      "itemSerialNumber": null,
      "itemDescription": "Fixed-Setpoint High-Pressure Flow Switch, Straight, 1/2 NPT Male, Brass Body, 0.5 gpm Set Point",
      "itemPrice": "185.15",
      "itemQuantity": "1",
      "itemTotal": "185.15"
    }
  ],
  "usage": {
    "input_tokens": 3660,
    "output_tokens": 212,
    "num_pages": 2,
    "pre_process": true,
    "resolution": "HIGH",
    "price": 0.074048
  }
}
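
When examining results, note that the sample response returns prices and quantities as strings. A minimal sketch of one sanity check you might run before integrating the output, assuming plain decimal strings without currency symbols or thousands separators; the helper name `check_line_items` is hypothetical and not part of the API:

```python
import json
from decimal import Decimal, InvalidOperation

# Trimmed copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "summary": {"invoiceNumber": "58720435", "PON": "PO-XYZ-090755"},
  "items": [
    {
      "itemNumber": "4960K62",
      "itemPrice": "185.15",
      "itemQuantity": "1",
      "itemTotal": "185.15"
    }
  ],
  "usage": {"input_tokens": 3660, "output_tokens": 212, "num_pages": 2}
}
"""


def check_line_items(response: dict) -> list[str]:
    """Return item numbers whose price * quantity does not equal the total.

    Items with missing or non-numeric values are also reported, since they
    need a manual look either way.
    """
    flagged = []
    for item in response.get("items", []):
        try:
            price = Decimal(item["itemPrice"])
            quantity = Decimal(item["itemQuantity"])
            total = Decimal(item["itemTotal"])
        except (KeyError, TypeError, InvalidOperation):
            flagged.append(item.get("itemNumber"))
            continue
        if price * quantity != total:
            flagged.append(item.get("itemNumber"))
    return flagged


response = json.loads(SAMPLE_RESPONSE)
print(check_line_items(response))
```

Decimal is used instead of float so that amounts such as 185.15 compare exactly.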

This workflow handles the majority of use cases effectively. However, some organizations require a specific subset of information from their documents, which is where JSON Schemas come into play.

JSON Schema Discovery Workflow

Take a Sample: Select 5-10 documents.
Craft a JSON Schema: Define the schema to represent the information you want to extract.
Examine Results: Review the extracted data.
Iterate on Your Schema: Refine the schema based on initial results.

Given the nature of AI solutions, the quality of your JSON schema significantly affects the model's ability to extract accurate information. Field descriptions can provide guidelines to the extraction process in natural language, improving extraction quality.

For more information on JSON Schema specifications, visit: JSON Schema Specification

Example JSON Schema for Invoices

Let's say we need to process invoices for goods purchased in Mexico, and an important field to extract is the buyer's RFC Number (Registro Federal de Contribuyentes) for VAT calculations.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Sell-To Information",
  "type": "object",
  "properties": {
    "soldToName": {
      "type": "string",
      "description": "The name of the entity to which products are sold."
    },
    "soldToAddress": {
      "type": "string",
      "description": "The complete address of the entity to which products are sold."
    },
    "rfcNumber": {
      "type": "string",
      "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies."
    },
    "invoiceNumber": {
      "type": "string",
      "description": "The unique number identifying this particular invoice."
    },
    "invoiceDate": {
      "type": "string",
      "description": "The date on which the invoice was issued.",
      "format": "date"
    },
    "purchaseOrderNumber": {
      "type": "string",
      "description": "The purchase order number associated with this invoice."
    }
  },
  "required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"]
}
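
Because field descriptions guide the extraction, it can be worth checking a draft schema before sending it. The following is a rough sketch of such a check, not a full JSON Schema validator; the function name `lint_schema` and the two rules it applies are assumptions made for illustration:

```python
def lint_schema(schema: dict) -> list[str]:
    """Report properties without a description and required names
    that are not defined under "properties"."""
    problems = []
    properties = schema.get("properties", {})
    for name, spec in properties.items():
        if not str(spec.get("description", "")).strip():
            problems.append(f"{name}: missing description")
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f"{name}: required but not defined in properties")
    return problems


# A deliberately flawed draft of the invoice schema above.
draft = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Sell-To Information",
    "type": "object",
    "properties": {
        "soldToName": {
            "type": "string",
            "description": "The name of the entity to which products are sold.",
        },
        "rfcNumber": {"type": "string"},
    },
    "required": ["soldToName", "rfcNumber", "purchaseOrderNumber"],
}

for problem in lint_schema(draft):
    print(problem)
```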

Example CURL Command

curl -X 'POST' \
  'https://documents.teachprotege.ai/document-extract' \
  -H 'accept: application/json' \
  -H 'Authorization: <API-Token>' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@Invoice83916608.PDF;type=application/pdf' \
  -F 'resolution=HIGH' \
  -F 'pre_process=false' \
  -F 'json_schema={   "$schema": "http://json-schema.org/draft-07/schema#",   "title": "Sell-To Information",   "type": "object",   "properties": {     "soldToName": {       "type": "string",       "description": "The name of the entity to which products are sold."     },     "soldToAddress": {       "type": "string",       "description": "The complete address of the entity to which products are sold."     },     "rfcNumber": {       "type": "string",       "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies."     },     "invoiceNumber": {       "type": "string",       "description": "The unique number identifying this particular invoice."     },     "invoiceDate": {       "type": "string",       "description": "The date on which the invoice was issued.",       "format": "date"     },     "purchaseOrderNumber": {       "type": "string",       "description": "The purchase order number associated with this invoice."     }   },   "required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"] }'
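
Inlining the schema by hand, as in the command above, breaks as soon as a description contains a single quote. One possible way around this is to generate the form argument from a schema kept in code. This is a sketch using only the Python standard library; `schema_form_argument` is a hypothetical helper, and the trimmed schema is for illustration only:

```python
import json
import shlex

# Trimmed schema; note the apostrophe in the description.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Sell-To Information",
    "type": "object",
    "properties": {
        "rfcNumber": {
            "type": "string",
            "description": "The buyer's RFC (Registro Federal de Contribuyentes) number.",
        },
    },
    "required": ["rfcNumber"],
}


def schema_form_argument(schema: dict) -> str:
    """Return a shell-safe value for curl's -F option carrying the schema."""
    # Compact separators keep the schema on a single line.
    compact = json.dumps(schema, separators=(",", ":"))
    # shlex.quote escapes embedded single quotes for POSIX shells.
    return shlex.quote("json_schema=" + compact)


print("-F", schema_form_argument(schema))
```

The printed argument can be pasted into the curl command in place of the hand-written `-F 'json_schema=...'` line.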

Sample Output

{
  "data": [
    {
      "soldToName": "Delmonte Industrial Corp",
      "soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
      "rfcNumber": "DIM881024BA4",
      "invoiceNumber": "98765432",
      "invoiceDate": "7/15/24",
      "purchaseOrderNumber": "PO-124578"
    }
  ],
  "usage": {
    "input_tokens": 1163,
    "output_tokens": 108,
    "num_pages": 1,
    "pre_process": false,
    "resolution": "HIGH",
    "price": 0.028431
  }
}
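
Note that the sample output returns invoiceDate as "7/15/24" even though the schema declares a date format, so some post-processing may be needed. A minimal sketch, assuming dates arrive as month/day/two-digit-year (for invoices from other locales the order may differ); both helper names are hypothetical:

```python
from datetime import datetime

REQUIRED = [
    "soldToName", "soldToAddress", "rfcNumber",
    "invoiceNumber", "invoiceDate", "purchaseOrderNumber",
]

# First entry of "data" from the sample output above.
record = {
    "soldToName": "Delmonte Industrial Corp",
    "soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
    "rfcNumber": "DIM881024BA4",
    "invoiceNumber": "98765432",
    "invoiceDate": "7/15/24",
    "purchaseOrderNumber": "PO-124578",
}


def missing_required(record: dict, required: list[str]) -> list[str]:
    """Return required field names that are absent, null, or empty."""
    return [name for name in required if record.get(name) in (None, "")]


def to_iso_date(text: str, fmt: str = "%m/%d/%y") -> str:
    """Convert a date string to ISO 8601 (YYYY-MM-DD).

    The default format is an assumption based on the sample output.
    """
    return datetime.strptime(text, fmt).date().isoformat()


print(missing_required(record, REQUIRED))   # []
print(to_iso_date(record["invoiceDate"]))   # 2024-07-15
```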

Troubleshooting

The `document-extract` Endpoint is Returning Incorrect Information.

Check Resolution: Set resolution to HIGH if your document is large or finely detailed (many pixels), so that its text remains legible.
Improve Field Descriptions: Make descriptions clearer to avoid ambiguity.
Split Extractions: Break down extractions into multiple calls if dealing with varied or extensive information across a long document.
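
As an illustration of splitting extractions, one schema can be divided into smaller schemas, each sent in its own document-extract call. This is a sketch only; `split_schema` and the group names are hypothetical, and how you group fields depends on your documents:

```python
def split_schema(schema: dict, groups: dict[str, list[str]]) -> dict[str, dict]:
    """Build one smaller schema per group of property names."""
    parts = {}
    for title, names in groups.items():
        part = {
            "title": title,
            "type": "object",
            "properties": {name: schema["properties"][name] for name in names},
            # Keep only the required names that belong to this group.
            "required": [n for n in schema.get("required", []) if n in names],
        }
        if "$schema" in schema:
            part["$schema"] = schema["$schema"]
        parts[title] = part
    return parts


# Trimmed version of the invoice schema used earlier.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Sell-To Information",
    "type": "object",
    "properties": {
        "soldToName": {"type": "string", "description": "The name of the entity to which products are sold."},
        "rfcNumber": {"type": "string", "description": "RFC number for the entity."},
        "invoiceNumber": {"type": "string", "description": "The unique number identifying this invoice."},
        "invoiceDate": {"type": "string", "description": "The date the invoice was issued.", "format": "date"},
    },
    "required": ["soldToName", "rfcNumber", "invoiceNumber", "invoiceDate"],
}

parts = split_schema(schema, {
    "Buyer": ["soldToName", "rfcNumber"],
    "Invoice Header": ["invoiceNumber", "invoiceDate"],
})
for title, part in parts.items():
    print(title, sorted(part["properties"]))
```

Each resulting schema is sent as the json_schema value of a separate call, and the responses are merged afterwards.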

The parse-invoice endpoint says it is processing more pages than the document has.

The parse-invoice endpoint is doing a lot for you, so you don't have to think about it! It is designed to efficiently extract an invoice summary and invoice items from a multi-page document.

In order to do this correctly, it processes each page through a series of extractions in parallel. By leveraging our AI-enabled Data Stewardship algorithm, we can reduce overall hallucinations and errors in the output data.

Concretely, a single call to parse-invoice processes twice the number of pages in the document. To achieve similar results yourself, you would need to make multiple calls to the document-extract endpoint with different schemas.

This also increases the API response time for invoices with complex data or a significant number of items.

Our system currently measures usage as a function of pages and text token usage as follows:

cost_per_image = base_cost * resolution_modifier
price = (input_tokens * cost_per_input_token) + (output_tokens * cost_per_output_token) + (num_pages * cost_per_image)
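
The formula can be sketched in code as follows. The rates below are made-up placeholders, not the service's actual prices; the class and function names are hypothetical. Only the structure of the calculation comes from the formula above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder pricing inputs; the real values are not published here."""
    cost_per_input_token: float
    cost_per_output_token: float
    base_cost: float
    resolution_modifier: float


def estimate_price(input_tokens: int, output_tokens: int, num_pages: int, rates: Rates) -> float:
    """Apply the usage formula: token costs plus a per-page image cost."""
    cost_per_image = rates.base_cost * rates.resolution_modifier
    return (
        input_tokens * rates.cost_per_input_token
        + output_tokens * rates.cost_per_output_token
        + num_pages * cost_per_image
    )


# Hypothetical rates, chosen only to make the example run.
example_rates = Rates(
    cost_per_input_token=0.00001,
    cost_per_output_token=0.00003,
    base_cost=0.01,
    resolution_modifier=2.0,
)

# Token and page counts taken from the parse-invoice sample response.
print(round(estimate_price(3660, 212, 2, example_rates), 6))
```

Because num_pages for parse-invoice counts processed pages rather than pages in the file, use the value reported in the response's usage block when reconciling costs.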

We are working to make the cost structure more intuitive over time. In the meantime, the rule of thumb is:

Aspect	Implication	Cost
JSON Schema Size	Larger JSON Schema requires more tokens for parsing.	More Tokens -> Higher Cost
Number of Pages	Each page is processed to extract detailed information.	More Pages -> Higher Cost
Resolution	High-resolution documents require more detailed processing.	Higher Resolution -> Higher Cost

The `parse-invoice` endpoint is taking a long time to return results.

Given the nature of our "opinionated endpoints", the time a call takes is a function of the amount of information being extracted (i.e., the number of output tokens). It is not possible to estimate the duration of a call precisely, but it can be understood intuitively as a function of document pages and JSON schema complexity.

Aspect	Explanation	Duration Impact
Number of Output Tokens	The duration is linked to the amount of information being extracted, represented by the number of output tokens.	More Output Tokens -> Longer Processing Time
Number of Pages	More pages mean more content to process.	More Pages -> Longer Processing Time
JSON Schema Complexity	A complex JSON schema requires more detailed and extensive processing.	Higher Complexity -> Longer Processing Time
