Skip to main content

Structured formats import

Structured formats import allows for importing and processing of non-visual documents such as JSON or XML files. It not only correctly extracts the information from these files, but also renders a minimalistic PDF representation for easier manual reviews.

Installation

warning

The support for ingesting XML or JSON files needs to be enabled by Rossum team.

Rossum team info
  1. In Django admin under Organization Group -> Features check the stored_only_mime_types field and if present, remove the application/xml and text/xml values from the list. If the values are not there or the field does not exist, continue.

  2. In Django admin under Queue -> Queue Settings, add the desired mime types to the accepted_mime_types list:

{
"accepted_mime_types": [
"application/xml",
"text/xml"
// …
]
}

Structured formats import is a webhook maintained by Rossum. In order to use it, follow these steps:

  1. Login to your Rossum account.
  2. Navigate to Extensions → My extensions.
  3. Click on Create extension.
  4. Fill the following fields:
    1. Name: Structured formats import
    2. Trigger events: Upload - Created
    3. Extension type: Webhook
    4. URL (see below)
  5. Click Create the webhook.
  6. Fill Configuration field (see Configuration examples)
  7. Assign an API token user
EnvironmentWebhook URL
EU1 Irelandhttps://elis.task-manager.rossum-ext.app/api/v1/tasks/structured-formats-import
EU2 Frankfurthttps://shared-eu2.task-manager.rossum-ext.app/api/v1/tasks/structured-formats-import
US east coasthttps://us.task-manager.rossum-ext.app/api/v1/tasks/structured-formats-import
Japan Tokyohttps://shared-jp.task-manager.rossum-ext.app/api/v1/tasks/structured-formats-import

Basic usage

Note

The extension supports multiple configurations. Even when using a single configuration, make sure it's defined in an array named configurations.

{
// Various independent configurations that can be conditionally
// triggered via `trigger_condition`:
"configurations": [
{
"trigger_condition": {
// supported values: "xml" and "json"
"file_type": "xml"
},

// Fields to be extracted from the source file
// and assigned to given datapoints:
"fields": [
{
"schema_id": "recipient_name",

// If many selectors are specified, they serve as a fallback list.
// Selectors don't need to specify the root element (see sample XML below)
"selectors": ["Header/Recipient/Name"]
},
...
]
}
]
}

Sample source file:

<Invoice>
<Header>
<Recipient>
<Name>Hello world</Name>
</Recipient>
</Header>
</Invoice>

Available configuration options

{
// Various independent configurations that can be conditionally triggered via `trigger_condition`:
"configurations": [
{
"trigger_condition": {
"file_type": "xml"
},

// Optional. Whether the original XML/JSON file should be split into smaller ones:
"split_selectors": ["/RecordLabel/Productions/Production"],

// Fields to be extracted from the source file and assigned to given datapoints:
"fields": [
{
"schema_id": "document_id",

// If many selectors are specified, they serve as a fallback list.
"selectors": ["./Metadata/ID"]
}
],

// Optional specification of the original PDF file that should be extracted from the source
// file (base64 encoded):
"pdf_file": {
"name_selectors": [
"cac:AdditionalDocumentReference/cac:Attachment/cbc:EmbeddedDocumentBinaryObject/@filename"
],

// Content should be base64 encoded:
"content_selectors": [
"cac:AdditionalDocumentReference/cac:Attachment/cbc:EmbeddedDocumentBinaryObject"
]
}
}
// …
]
}