Document Extraction API Cookbook
Overview
The Document Extraction API uses AI models to extract structured data from unstructured or semi-structured documents. It streamlines data extraction, though it comes with a learning curve: getting good results takes some intuition for how the extraction behaves.
At the core of the system is the JSON schema, which provides a type-safe target for data transformations. End users do not need to understand every nuance of the JSON Schema specification, but those who do can build more tailored workflows.
API Documentation
For detailed API documentation, visit: Swagger API Documentation
Endpoints
- Opinionated Endpoints: Designed for quick, turn-key operations.
- Unopinionated Endpoints: Require the use of a JSON schema for more customized extraction.
Simple Discovery Workflow for Invoices
- Take a Sample: Select 5-10 invoices.
- Run Invoices Through the parse-invoice Endpoint: Use the provided CURL command.
- Examine Results and Integrate: Review the JSON output and integrate it into your workflow.
Example CURL Command
curl -X 'POST' \
'https://documents.teachprotege.ai/parse-invoice' \
-H 'accept: application/json' \
-H 'Authorization: <api-token>' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@Invoice83916608.PDF;type=application/pdf' \
-F 'resolution=HIGH' \
-F 'pre_process=true' \
-F 'first_page_only=false' \
-F 'process_items=true' \
-F 'process_summary=true'
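For readers who would rather call the endpoint from code, the following is a minimal Python sketch of the same request. It assumes the third-party `requests` package is installed; the helper names (`form_bool`, `build_parse_invoice_fields`, `parse_invoice`) and the 120-second timeout are illustrative choices, not part of the API. The URL, headers, and form fields are copied from the curl command above.

```python
from pathlib import Path

API_URL = "https://documents.teachprotege.ai/parse-invoice"


def form_bool(value: bool) -> str:
    """Render a Python bool the way the curl example does ('true' / 'false')."""
    return "true" if value else "false"


def build_parse_invoice_fields(
    resolution: str = "HIGH",
    pre_process: bool = True,
    first_page_only: bool = False,
    process_items: bool = True,
    process_summary: bool = True,
) -> dict[str, str]:
    """Return the non-file form fields used in the curl example."""
    return {
        "resolution": resolution,
        "pre_process": form_bool(pre_process),
        "first_page_only": form_bool(first_page_only),
        "process_items": form_bool(process_items),
        "process_summary": form_bool(process_summary),
    }


def parse_invoice(pdf_path: str, api_token: str) -> dict:
    """Send one PDF to parse-invoice and return the decoded JSON body."""
    # Third-party dependency, imported lazily so the helpers above work without it.
    import requests

    path = Path(pdf_path)
    with path.open("rb") as handle:
        response = requests.post(
            API_URL,
            headers={"accept": "application/json", "Authorization": api_token},
            # requests sets the multipart Content-Type (with boundary) itself.
            files={"file": (path.name, handle, "application/pdf")},
            data=build_parse_invoice_fields(),
            timeout=120,  # illustrative value, not an API requirement
        )
    response.raise_for_status()
    return response.json()
```

Calling `parse_invoice("Invoice83916608.PDF", "<api-token>")` would then return the parsed response as a dictionary.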
Sample Response
{
"summary": {
"invoiceNumber": "58720435",
"PON": "PO-XYZ-090755",
"invoiceDate": "6/12/24",
"shipDate": "6/12/24",
"soldToName": "FUTURA INDUSTRIES CORP",
"soldToAddress": "215 EAST BAY LANE COAST CITY 72855 CA USA"
},
"items": [
{
"itemNumber": "4960K62",
"itemSerialNumber": null,
"itemDescription": "Fixed-Setpoint High-Pressure Flow Switch, Straight, 1/2 NPT Male, Brass Body, 0.5 gpm Set Point",
"itemPrice": "185.15",
"itemQuantity": "1",
"itemTotal": "185.15"
}
],
"usage": {
"input_tokens": 3660,
"output_tokens": 212,
"num_pages": 2,
"pre_process": true,
"resolution": "HIGH",
"price": 0.074048
}
}
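When examining results, note that numeric values such as itemPrice and itemQuantity arrive as strings. One possible sanity check, sketched below in Python, is to confirm that price times quantity matches itemTotal for each item. The function name `check_item_totals` is hypothetical, and the sketch assumes the values are plain decimal strings as in the sample (no currency symbols or thousands separators).

```python
from decimal import Decimal


def check_item_totals(response: dict) -> list[str]:
    """Return the itemNumber of every item whose price * quantity != itemTotal."""
    mismatched = []
    for item in response.get("items", []):
        # Decimal avoids the rounding surprises of binary floats for money.
        expected = Decimal(item["itemPrice"]) * Decimal(item["itemQuantity"])
        if expected != Decimal(item["itemTotal"]):
            mismatched.append(item["itemNumber"])
    return mismatched


# Trimmed copy of the sample response above.
sample = {
    "items": [
        {
            "itemNumber": "4960K62",
            "itemPrice": "185.15",
            "itemQuantity": "1",
            "itemTotal": "185.15",
        }
    ]
}

mismatches = check_item_totals(sample)
```

An empty list means every line item is internally consistent; any item numbers returned are worth reviewing by hand.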
This workflow handles the majority of use cases effectively. However, some organizations require a specific subset of information from their documents, which is where JSON Schemas come into play.
JSON Schema Discovery Workflow
- Take a Sample: Select 5-10 documents.
- Craft a JSON Schema: Define the schema to represent the information you want to extract.
- Examine Results: Review the extracted data.
- Iterate on Your Schema: Refine the schema based on initial results.
Given the nature of AI solutions, the quality of your JSON schema significantly affects the model's ability to extract accurate information. Field descriptions can provide guidelines to the extraction process in natural language, improving extraction quality.
For more information on JSON Schema specifications, visit: JSON Schema Specification
Example JSON Schema for Invoices
Let's say we need to process invoices for goods purchased in Mexico, and an important field to extract is the buyer's RFC Number (Registro Federal de Contribuyentes) for VAT calculations.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Sell-To Information",
"type": "object",
"properties": {
"soldToName": {
"type": "string",
"description": "The name of the entity to which products are sold."
},
"soldToAddress": {
"type": "string",
"description": "The complete address of the entity to which products are sold."
},
"rfcNumber": {
"type": "string",
"description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies."
},
"invoiceNumber": {
"type": "string",
"description": "The unique number identifying this particular invoice."
},
"invoiceDate": {
"type": "string",
"description": "The date on which the invoice was issued.",
"format": "date"
},
"purchaseOrderNumber": {
"type": "string",
"description": "The purchase order number associated with this invoice."
}
},
"required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"]
}
Example CURL Command
curl -X 'POST' \
'https://documents.teachprotege.ai/document-extract' \
-H 'accept: application/json' \
-H 'Authorization: <API-Token>' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@Invoice83916608.PDF;type=application/pdf' \
-F 'resolution=HIGH' \
-F 'pre_process=false' \
-F 'json_schema={ "$schema": "http://json-schema.org/draft-07/schema#", "title": "Sell-To Information", "type": "object", "properties": { "soldToName": { "type": "string", "description": "The name of the entity to which products are sold." }, "soldToAddress": { "type": "string", "description": "The complete address of the entity to which products are sold." }, "rfcNumber": { "type": "string", "description": "RFC (Registro Federal de Contribuyentes) number for the entity, specific to Mexican companies." }, "invoiceNumber": { "type": "string", "description": "The unique number identifying this particular invoice." }, "invoiceDate": { "type": "string", "description": "The date on which the invoice was issued.", "format": "date" }, "purchaseOrderNumber": { "type": "string", "description": "The purchase order number associated with this invoice." } }, "required": ["soldToName", "soldToAddress", "rfcNumber", "invoiceNumber", "invoiceDate", "purchaseOrderNumber"] }'
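Inlining a long schema into a shell command is error-prone, so it may be easier to keep the schema as a data structure and serialize it when building the request. The Python sketch below shows one way to do that, along with a quick check that every name in "required" is actually defined under "properties". The helper names are hypothetical, and the schema is a shortened version of the one above.

```python
import json


def undefined_required_fields(schema: dict) -> list[str]:
    """Return names listed in "required" that have no entry in "properties"."""
    properties = schema.get("properties", {})
    return [name for name in schema.get("required", []) if name not in properties]


def schema_form_value(schema: dict) -> str:
    """Serialize the schema compactly for use as the json_schema form field."""
    return json.dumps(schema, separators=(",", ":"))


# Shortened version of the invoice schema, for illustration only.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Sell-To Information",
    "type": "object",
    "properties": {
        "rfcNumber": {
            "type": "string",
            "description": "RFC (Registro Federal de Contribuyentes) number "
            "for the entity, specific to Mexican companies.",
        },
    },
    "required": ["rfcNumber"],
}

# Form fields matching the curl command above (the file part is sent separately).
form_fields = {
    "resolution": "HIGH",
    "pre_process": "false",
    "json_schema": schema_form_value(schema),
}
```

Running `undefined_required_fields(schema)` before each request catches typos in field names early, before they turn into confusing extraction results.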
Sample Output
{
"data": [
{
"soldToName": "Delmonte Industrial Corp",
"soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
"rfcNumber": "DIM881024BA4",
"invoiceNumber": "98765432",
"invoiceDate": "7/15/24",
"purchaseOrderNumber": "PO-124578"
}
],
"usage": {
"input_tokens": 1163,
"output_tokens": 108,
"num_pages": 1,
"pre_process": false,
"resolution": "HIGH",
"price": 0.028431
}
}
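Note that the sample output returns invoiceDate as "7/15/24" even though the schema declares "format": "date". A post-processing step can check that required fields came back non-empty and normalize the date. The Python sketch below is one way to do this; the function names are hypothetical, and it assumes dates arrive in month/day/two-digit-year order as in the sample, which may not hold for every document.

```python
from datetime import datetime


def missing_fields(record: dict, schema: dict) -> list[str]:
    """Return required field names that are absent, null, or empty in a record."""
    return [
        name
        for name in schema.get("required", [])
        if record.get(name) in (None, "")
    ]


def normalize_date(value: str) -> str:
    """Convert an M/D/YY string (assumed ordering) to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, "%m/%d/%y").date().isoformat()


# Only the "required" list matters for this check.
schema = {
    "required": [
        "soldToName", "soldToAddress", "rfcNumber",
        "invoiceNumber", "invoiceDate", "purchaseOrderNumber",
    ]
}

# First element of "data" from the sample output above.
record = {
    "soldToName": "Delmonte Industrial Corp",
    "soldToAddress": "100 Montego Way, Nuevo Santander, 43758 Nuevo Leon NL, Mexico",
    "rfcNumber": "DIM881024BA4",
    "invoiceNumber": "98765432",
    "invoiceDate": "7/15/24",
    "purchaseOrderNumber": "PO-124578",
}

problems = missing_fields(record, schema)
iso_date = normalize_date(record["invoiceDate"])
```

If your documents use a different date convention, adjusting the field description in the schema (for example, asking for a specific format) may be a better fix than post-processing.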
Troubleshooting
The document-extract endpoint is returning incorrect information.
- Check Resolution: Set resolution to HIGH for large or finely detailed documents so that small text stays legible.
- Improve Field Descriptions: Make descriptions clearer to avoid ambiguity.
- Split Extractions: Break down extractions into multiple calls if dealing with varied or extensive information across a long document.
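The "Split Extractions" suggestion can be sketched in Python as follows: one large schema is divided into smaller schemas, each covering a group of related fields, and each smaller schema is sent in its own document-extract call. The function name `split_schema` and the grouping shown are illustrative assumptions; how to group fields depends on your documents.

```python
def split_schema(schema: dict, groups: list[list[str]]) -> list[dict]:
    """Split one schema into several, each holding one group of properties."""
    parts = []
    for names in groups:
        # Copy top-level keys such as "$schema", "title", and "type".
        part = {
            key: value
            for key, value in schema.items()
            if key not in ("properties", "required")
        }
        part["properties"] = {name: schema["properties"][name] for name in names}
        part["required"] = [
            name for name in schema.get("required", []) if name in names
        ]
        parts.append(part)
    return parts


# Reduced schema with descriptions omitted to keep the example short.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "soldToName": {"type": "string"},
        "rfcNumber": {"type": "string"},
        "invoiceNumber": {"type": "string"},
        "invoiceDate": {"type": "string", "format": "date"},
    },
    "required": ["soldToName", "rfcNumber", "invoiceNumber", "invoiceDate"],
}

buyer_schema, invoice_schema = split_schema(
    schema,
    [["soldToName", "rfcNumber"], ["invoiceNumber", "invoiceDate"]],
)
```

Each resulting schema is sent as the json_schema field of a separate request, and the results are merged afterwards. Keep in mind that every call is billed for its own pages and tokens.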
The parse-invoice endpoint says it is processing more pages than the document has.
The parse-invoice endpoint is doing a lot for you, so you don't have to think about it! It is designed to efficiently extract an invoice summary and invoice items from a multi-page document.
In order to do this correctly, it processes each page through a series of extractions in parallel. By leveraging our AI-enabled Data Stewardship algorithm, we can reduce overall hallucinations and errors in the output data.
Concretely, a single call to the API processes twice as many pages as the document contains. To achieve similar results yourself, you would need to make multiple calls to the document-extract endpoint with different schemas.
This also increases the API response time for invoices with complex data or a significant number of items.
Our system currently measures usage as a function of pages and text token usage as follows:
cost_per_image = base_cost * resolution_modifier
price = (input_tokens * cost_per_input_token) + (output_tokens * cost_per_output_token) + (num_pages * cost_per_image)
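The formula can be expressed as a small Python function for rough budgeting. This is a sketch only: the rate values below are placeholders invented for illustration, not the API's actual prices, and the `Rates` class and `estimate_price` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Pricing inputs from the formula above. All values are supplied by the caller."""

    cost_per_input_token: float
    cost_per_output_token: float
    base_cost: float
    resolution_modifier: dict[str, float]


def estimate_price(
    input_tokens: int,
    output_tokens: int,
    num_pages: int,
    resolution: str,
    rates: Rates,
) -> float:
    """Apply the usage formula: token costs plus a per-page image cost."""
    cost_per_image = rates.base_cost * rates.resolution_modifier[resolution]
    return (
        input_tokens * rates.cost_per_input_token
        + output_tokens * rates.cost_per_output_token
        + num_pages * cost_per_image
    )


# Placeholder rates, NOT real prices; substitute the values for your account.
placeholder_rates = Rates(
    cost_per_input_token=0.000001,
    cost_per_output_token=0.000002,
    base_cost=0.01,
    resolution_modifier={"HIGH": 2.0},
)

estimate = estimate_price(1000, 100, 2, "HIGH", placeholder_rates)
```

With these placeholder rates, the page term (2 pages at 0.02 each) dominates the token terms, which illustrates why page count and resolution matter in the table that follows.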
We are working to make the cost structure more "intuitive" over time. In the meantime, the rule of thumb is:
Aspect | Implication | Cost |
---|---|---|
JSON Schema Size | Larger JSON Schema requires more tokens for parsing. | More Tokens -> Higher Cost |
Number of Pages | Each page is processed to extract detailed information. | More Pages -> Higher Cost |
Resolution | High-resolution documents require more detailed processing. | Higher Resolution -> Higher Cost |
The parse-invoice endpoint is taking a long time to return results.
Given the nature of our "opinionated endpoints", the time a call takes is a function of the amount of information being extracted (i.e., the number of output tokens). It is not possible to estimate the duration of a call precisely; however, it can be understood intuitively as a function of document pages and JSON schema complexity.
Aspect | Explanation | Duration Impact |
---|---|---|
Number of Output Tokens | The duration is linked to the amount of information being extracted, represented by the number of output tokens. | More Output Tokens -> Longer Processing Time |
Number of Pages | More pages mean more content to process. | More Pages -> Longer Processing Time |
JSON Schema Complexity | A complex JSON schema requires more detailed and extensive processing. | Higher Complexity -> Longer Processing Time |