Document Extraction API Cookbook

Overview

The Document Extraction API extracts structured data from unstructured or semi-structured documents such as invoices. It uses AI models to do the extraction, which streamlines data entry, but there is a learning curve: good results depend on developing a feel for how the request options and the schema affect the output.

At the core of the system is the JSON schema, which defines a typed target for the extracted data. You do not need to know every nuance of the JSON Schema specification to use the API, but those who do can build more tailored extraction workflows.

API Documentation

For detailed API documentation, visit: Swagger API Documentation

Endpoints

  • Opinionated Endpoints: Designed for quick, turn-key operations (for example, parse-invoice).
  • Unopinionated Endpoints: Require a JSON schema for more customized extraction (for example, document-extract).

Simple Discovery Workflow for Invoices

  1. Take a Sample: Select 5-10 invoices.
  2. Run Invoices Through the parse-invoice Endpoint: Use the provided CURL command.
  3. Examine Results and Integrate: Review the JSON output and integrate it into your workflow.

Example CURL Command

curl -X 'POST' \
'https://documents.teachprotege.ai/parse-invoice' \
-H 'accept: application/json' \
-H 'Authorization: <api-token>' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@Invoice83916608.PDF;type=application/pdf' \
-F 'resolution=HIGH' \
-F 'pre_process=true' \
-F 'first_page_only=false' \
-F 'process_items=true' \
-F 'process_summary=true'
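For readers scripting the same call, here is a minimal Python sketch that builds (but does not send) an equivalent multipart request using only the standard library. The URL, header names, and form fields are copied from the curl command above; the function name build_parse_invoice_request and the choice of urllib are illustrative assumptions, not part of the API.

```python
import uuid
import urllib.request

API_URL = "https://documents.teachprotege.ai/parse-invoice"  # from the curl example


def build_parse_invoice_request(pdf_bytes, filename, api_token, **options):
    """Build (but do not send) a multipart POST mirroring the curl command."""
    # Defaults copied from the curl example; callers can override any of them.
    fields = {
        "resolution": "HIGH",
        "pre_process": "true",
        "first_page_only": "false",
        "process_items": "true",
        "process_summary": "true",
    }
    fields.update(options)

    boundary = uuid.uuid4().hex
    parts = []
    # One plain form part per option.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The file part carries the PDF bytes and its content type.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'.encode()
        + pdf_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "accept": "application/json",
            "Authorization": api_token,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes it easy to inspect the request while working through a sample of invoices.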

Sample Response

{
  "summary": {
    "invoiceNumber": "58720435",
    "PON": "PO-XYZ-090755",
    "invoiceDate": "6/12/24",
    "shipDate": "6/12/24",
    "soldToName": "FUTURA INDUSTRIES CORP",
    "soldToAddress": "215 EAST BAY LANE COAST CITY 72855 CA USA"
  },
  "items": [
    {
      "itemNumber": "4960K62",
      "itemSerialNumber": null,
      "itemDescription": "Fixed-Setpoint High-Pressure Flow Switch, Straight, 1/2 NPT Male, Brass Body, 0.5 gpm Set Point",
      "itemPrice": "185.15",
      "itemQuantity": "1",
      "itemTotal": "185.15"
    }
  ],
  "usage": {
    "input_tokens": 3660,
    "output_tokens": 212,
    "num_pages": 2,
    "pre_process": true,
    "resolution": "HIGH",
    "price": 0.074048
  }
}
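When examining results, one quick sanity check is whether each line item's total equals price times quantity. The sketch below assumes the numeric fields always arrive as plain decimal strings, as they do in the sample response; real responses may contain nulls or formatted numbers that need extra handling. The helper name check_item_totals is hypothetical.

```python
from decimal import Decimal


def check_item_totals(response):
    """Return item numbers whose itemTotal != itemPrice * itemQuantity."""
    mismatches = []
    for item in response.get("items", []):
        # Decimal avoids the rounding surprises of binary floats for money.
        price = Decimal(item["itemPrice"])
        quantity = Decimal(item["itemQuantity"])
        total = Decimal(item["itemTotal"])
        if price * quantity != total:
            mismatches.append(item["itemNumber"])
    return mismatches


# Item values taken from the sample response above.
sample = {
    "items": [
        {
            "itemNumber": "4960K62",
            "itemPrice": "185.15",
            "itemQuantity": "1",
            "itemTotal": "185.15",
        }
    ]
}
print(check_item_totals(sample))  # prints []
```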

This workflow handles the majority of use cases effectively. However, some organizations require a specific subset of information from their documents, which is where JSON Schemas come into play.

JSON Schema Discovery Workflow

  1. Take a Sample: Select 5-10 documents.
  2. Craft a JSON Schema: Define the schema to represent the information you want to extract.
  3. Examine Results: Review the extracted data.
  4. Iterate on Your Schema: Refine the schema based on initial results.

Given the nature of AI solutions, the quality of your JSON schema significantly affects the model's ability to extract accurate information. Field descriptions can provide guidelines to the extraction process in natural language, improving extraction quality.
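Because field descriptions guide the extraction, it can help to check that none are missing before sending a schema. The following is a small sketch of such a check; the function name and the draft schema are illustrative assumptions, and it only walks "properties" and array "items", not every JSON Schema keyword.

```python
def properties_missing_descriptions(schema, path=""):
    """Return dotted paths of properties that lack a non-empty description."""
    missing = []
    for name, prop in schema.get("properties", {}).items():
        full_path = f"{path}.{name}" if path else name
        if not str(prop.get("description", "")).strip():
            missing.append(full_path)
        # Recurse into nested objects and into arrays of objects.
        if prop.get("type") == "object":
            missing.extend(properties_missing_descriptions(prop, full_path))
        items = prop.get("items")
        if isinstance(items, dict):
            missing.extend(properties_missing_descriptions(items, full_path + "[]"))
    return missing


# Hypothetical draft schema with two undocumented fields.
draft = {
    "type": "object",
    "properties": {
        "invoiceNumber": {
            "type": "string",
            "description": "The unique number identifying this invoice.",
        },
        "shipDate": {"type": "string"},
        "items": {
            "type": "array",
            "description": "Line items on the invoice.",
            "items": {
                "type": "object",
                "properties": {"itemTotal": {"type": "string"}},
            },
        },
    },
}
print(properties_missing_descriptions(draft))  # prints ['shipDate', 'items[].itemTotal']
```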

For more information on JSON Schema specifications, visit: JSON Schema Specification

Example JSON Schema for Invoices

Let's say we need to process invoices for goods purchased in Mexico, and an important field to extract is the buyer's RFC Number (Registro Federal de Contribuyentes) for VAT calculations.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Sell-To Information",
  "type": "object",
  "properties": {
    "soldToName": {
      "type": "string",
      "description": "The name of the entity to which products are sold."
    },
    "soldToAddress": {
      "type": "string",
      "description": "The complete address of the entity to which products are sold."
    },
    "rfcNumber": {
      "type": "string",
      "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies."
    },
    "invoiceNumber": {
      "type": "string",
      "description": "The unique number identifying this particular invoice."
    },
    "invoiceDate": {
      "type": "string",
      "description": "The date on which the invoice was issued.",
      "format": "date"
    },
    "purchaseOrderNumber": {
      "type": "string",
      "description": "The purchase order number associated with this invoice."
    }
  },
  "required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"]
}

Example CURL Command

curl -X 'POST' \
'https://documents.teachprotege.ai/document-extract' \
-H 'accept: application/json' \
-H 'Authorization: <api-token>' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@Invoice83916608.PDF;type=application/pdf' \
-F 'resolution=HIGH' \
-F 'pre_process=false' \
-F 'json_schema={ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Sell-To Information", "type": "object", "properties": { "soldToName": { "type": "string", "description": "The name of the entity to which products are sold." }, "soldToAddress": { "type": "string", "description": "The complete address of the entity to which products are sold." }, "rfcNumber": { "type": "string", "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies." }, "invoiceNumber": { "type": "string", "description": "The unique number identifying this particular invoice." }, "invoiceDate": { "type": "string", "description": "The date on which the invoice was issued.", "format": "date" }, "purchaseOrderNumber": { "type": "string", "description": "The purchase order number associated with this invoice." } }, "required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"] }'
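Hand-writing the single-line json_schema value is error-prone. One option, sketched here, is to keep the schema as a readable structure and serialize it when building the command. The schema below is a shortened copy of the example above; the variable names are illustrative.

```python
import json
import shlex

# Shortened copy of the example schema; a real call would include every property.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Sell-To Information",
    "type": "object",
    "properties": {
        "rfcNumber": {
            "type": "string",
            "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies.",
        },
    },
    "required": ["rfcNumber"],
}

# Compact separators keep the form field short; the content is unchanged.
schema_field = json.dumps(schema, separators=(",", ":"))

# shlex.quote makes the value safe to paste into a POSIX shell command.
curl_argument = "-F " + shlex.quote("json_schema=" + schema_field)
print(curl_argument)
```

Note that curl's -F option gives special meaning to some characters in a value, such as a semicolon; if a description contains one, curl's --form-string option, which takes the value literally, may be the safer choice.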

Sample Output

{
  "data": [
    {
      "soldToName": "Delmonte Industrial Corp",
      "soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
      "rfcNumber": "DIM881024BA4",
      "invoiceNumber": "98765432",
      "invoiceDate": "7/15/24",
      "purchaseOrderNumber": "PO-124578"
    }
  ],
  "usage": {
    "input_tokens": 1163,
    "output_tokens": 108,
    "num_pages": 1,
    "pre_process": false,
    "resolution": "HIGH",
    "price": 0.028431
  }
}
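Before integrating output like this, it is worth checking that every required field came back and normalizing the date. The sketch below uses the record from the sample output. It assumes dates arrive in month/day/two-digit-year order, as the sample suggests; other documents may differ, so treat the date format as an assumption to verify against your own sample. The helper names are hypothetical.

```python
from datetime import datetime

# The "required" list from the example schema.
REQUIRED = [
    "soldToName", "soldToAddress", "rfcNumber",
    "invoiceNumber", "invoiceDate", "purchaseOrderNumber",
]


def missing_fields(record):
    """Return required fields that are absent, null, or empty."""
    return [name for name in REQUIRED if not record.get(name)]


def to_iso_date(text):
    """Convert 'M/D/YY' (assumed order) to 'YYYY-MM-DD'."""
    return datetime.strptime(text, "%m/%d/%y").date().isoformat()


record = {
    "soldToName": "Delmonte Industrial Corp",
    "soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
    "rfcNumber": "DIM881024BA4",
    "invoiceNumber": "98765432",
    "invoiceDate": "7/15/24",
    "purchaseOrderNumber": "PO-124578",
}
print(missing_fields(record))              # prints []
print(to_iso_date(record["invoiceDate"]))  # prints 2024-07-15
```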

Troubleshooting

The document-extract endpoint is returning incorrect information.

  1. Check Resolution: Set resolution to HIGH for large or densely printed pages, so that small text stays legible.
  2. Improve Field Descriptions: Rewrite ambiguous descriptions so that each field clearly states what to extract.
  3. Split Extractions: For a long document, or one containing varied or extensive information, break the extraction into multiple calls with smaller schemas.
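The "Split Extractions" step can be sketched as follows: divide one schema into several smaller schemas by property name, send each in its own document-extract call, then merge the returned records. Both helper names are hypothetical, and how you group the properties is a judgment call for your documents.

```python
import copy


def split_schema(schema, groups):
    """Return one sub-schema per group of property names."""
    parts = []
    for names in groups:
        part = copy.deepcopy(schema)
        # A name that is not in the schema raises KeyError, which surfaces typos early.
        part["properties"] = {n: part["properties"][n] for n in names}
        # Keep only the required entries that still exist in this part.
        part["required"] = [n for n in schema.get("required", []) if n in names]
        parts.append(part)
    return parts


def merge_records(records):
    """Combine the records returned by each call into one dictionary."""
    merged = {}
    for record in records:
        merged.update(record)
    return merged


# Illustrative schema with abbreviated property definitions.
schema = {
    "type": "object",
    "properties": {
        "soldToName": {"type": "string"},
        "rfcNumber": {"type": "string"},
        "invoiceNumber": {"type": "string"},
    },
    "required": ["soldToName", "rfcNumber", "invoiceNumber"],
}
buyer, invoice = split_schema(schema, [["soldToName", "rfcNumber"], ["invoiceNumber"]])
print(sorted(buyer["properties"]))  # prints ['rfcNumber', 'soldToName']
print(invoice["required"])          # prints ['invoiceNumber']
```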

The parse-invoice endpoint says it is processing more pages than the document has.

The parse-invoice endpoint does a lot of work on your behalf. It is designed to extract both an invoice summary and the invoice items from a multi-page document.

To do this correctly, it runs each page through a series of extractions in parallel. Our AI-enabled Data Stewardship algorithm then reduces hallucinations and errors in the output data.

Concretely, a single call to parse-invoice processes twice as many pages as the document contains. To achieve similar results yourself, you would need to make multiple calls to the document-extract endpoint with different schemas.

This also increases the API response time for invoices with complex data or a large number of items.

Our system currently measures usage as a function of pages and text token usage as follows:

cost_per_image = base_cost * resolution_modifier
price = (input_tokens * cost_per_input_token) + (output_tokens * cost_per_output_token) + (num_pages * cost_per_image)
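As a worked illustration of the formula, here is a small Python sketch. The formula itself comes from the lines above, but the actual rates are not published in this document, so every rate in the example call is a made-up placeholder and the result is not a real price.

```python
def estimate_price(input_tokens, output_tokens, num_pages, *,
                   cost_per_input_token, cost_per_output_token,
                   base_cost, resolution_modifier):
    """Apply the usage formula above to one call's usage figures."""
    cost_per_image = base_cost * resolution_modifier
    return (input_tokens * cost_per_input_token
            + output_tokens * cost_per_output_token
            + num_pages * cost_per_image)


# Placeholder rates for illustration only; they are NOT the API's real rates.
example = estimate_price(
    1000, 100, 2,
    cost_per_input_token=0.00001,
    cost_per_output_token=0.00003,
    base_cost=0.01,
    resolution_modifier=2.0,
)
print(round(example, 6))
```

The input_tokens, output_tokens, and num_pages arguments correspond to the fields of the "usage" object returned with each response.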

We are working to make the cost structure more intuitive over time. In the meantime, the rules of thumb are:

Aspect | Implication | Cost
JSON Schema Size | Larger JSON Schema requires more tokens for parsing. | More Tokens -> Higher Cost
Number of Pages | Each page is processed to extract detailed information. | More Pages -> Higher Cost
Resolution | High-resolution documents require more detailed processing. | Higher Resolution -> Higher Cost

The parse-invoice endpoint is taking a long time to return results.

For opinionated endpoints such as parse-invoice, response time is a function of the amount of information being extracted (that is, the number of output tokens). It is not possible to estimate the duration of a call precisely, but it grows with the number of document pages and the complexity of the JSON schema.

Aspect | Explanation | Duration Impact
Number of Output Tokens | The duration is linked to the amount of information being extracted, represented by the number of output tokens. | More Output Tokens -> Longer Processing Time
Number of Pages | More pages mean more content to process. | More Pages -> Longer Processing Time
JSON Schema Complexity | A complex JSON schema requires more detailed and extensive processing. | Higher Complexity -> Longer Processing Time