Structured formats import
Structured formats import lets you import and process non-visual documents such as JSON or XML files. It extracts the information from these files and also renders a minimalistic PDF representation for easier manual review.
Installation
Support for ingesting XML or JSON files must be enabled by the Rossum team.
Rossum team info
- In Django admin under Organization Group -> Features, check the `stored_only_mime_types` field and, if present, remove the `application/xml` and `text/xml` values from the list. If the values are not there or the field does not exist, continue.
- In Django admin under Queue -> Queue Settings, add the desired mime types to the `accepted_mime_types` list:
{
"accepted_mime_types": [
"application/xml",
"text/xml"
// …
]
}
Structured formats import is a webhook maintained by Rossum. In order to use it, follow these steps:
- Log in to your Rossum account.
- Navigate to Extensions → My extensions.
- Click on Create extension.
- Fill in the following fields:
  - Name: Structured formats import
  - Trigger events: Upload - Created
  - Extension type: Webhook
  - URL (see below)
- Click Create the webhook.
- Fill in the Configuration field (see Configuration examples).
- Assign an API token user.
Basic usage
The extension supports multiple configurations. Even when using a single configuration, make sure it's defined in an array named `configurations`.
{
// Various independent configurations that can be conditionally
// triggered via `trigger_condition`:
"configurations": [
{
"trigger_condition": {
// supported values: "xml" and "json"
"file_type": "xml"
},
// Fields to be extracted from the source file
// and assigned to given datapoints:
"fields": [
{
"schema_id": "recipient_name",
// If many selectors are specified, they serve as a fallback list.
// Selectors don't need to specify the root element (see sample XML below)
"selectors": ["Header/Recipient/Name"]
},
...
]
}
]
}
Sample source file:
<Invoice>
<Header>
<Recipient>
<Name>Hello world</Name>
</Recipient>
</Header>
</Invoice>
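To check selectors locally before putting them into the configuration, a small script can help. The following is a minimal sketch, not the extension's actual implementation: it assumes the selectors behave like simple relative paths evaluated from the root element, and it uses Python's standard `xml.etree.ElementTree` as a stand-in. The helper name `first_match` and the non-matching selector `Header/Addressee/Name` are hypothetical and only illustrate the fallback behaviour.

```python
import xml.etree.ElementTree as ET

# Same structure as the sample source file above.
SAMPLE_XML = """
<Invoice>
  <Header>
    <Recipient>
      <Name>Hello world</Name>
    </Recipient>
  </Header>
</Invoice>
""".strip()


def first_match(root, selectors):
    """Return the text of the first selector that matches, or None.

    Selectors are tried in order, so later entries act as fallbacks.
    Paths are relative to the root element, so "Invoice" is not part of them.
    """
    for selector in selectors:
        node = root.find(selector)
        if node is not None and node.text is not None:
            return node.text.strip()
    return None


root = ET.fromstring(SAMPLE_XML)

# The first selector does not exist in the sample, so the second one is used.
value = first_match(root, ["Header/Addressee/Name", "Header/Recipient/Name"])
print(value)
```

The extension may support a richer selector syntax than `ElementTree` does, so a selector that fails in this sketch is not necessarily invalid in the real configuration.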
Available configuration options
{
// Various independent configurations that can be conditionally triggered via `trigger_condition`:
"configurations": [
{
"trigger_condition": {
"file_type": "xml"
},
// Optional. Selectors of the elements at which the original XML/JSON file should be split into smaller ones:
"split_selectors": ["/RecordLabel/Productions/Production"],
// Fields to be extracted from the source file and assigned to given datapoints:
"fields": [
{
"schema_id": "document_id",
// If many selectors are specified, they serve as a fallback list.
"selectors": ["./Metadata/ID"]
}
],
// Optional specification of the original PDF file that should be extracted from the source
// file (base64 encoded):
"pdf_file": {
"name_selectors": [
"cac:AdditionalDocumentReference/cac:Attachment/cbc:EmbeddedDocumentBinaryObject/@filename"
],
// Content should be base64 encoded:
"content_selectors": [
"cac:AdditionalDocumentReference/cac:Attachment/cbc:EmbeddedDocumentBinaryObject"
]
}
}
// …
]
}
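To see what the `pdf_file` option is expected to find, it can be useful to reproduce the lookup on a small test document. The sketch below is an illustration under stated assumptions, not the extension's code: the namespace URIs are placeholders, the sample document and the `extract_pdf` helper are hypothetical, and because Python's `xml.etree.ElementTree` cannot select attributes with `/@filename`, the attribute is read from the matched element instead.

```python
import base64
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption); real documents declare their own.
NS = {"cac": "urn:example:cac", "cbc": "urn:example:cbc"}

# Build a tiny sample with a base64-encoded payload standing in for a PDF.
payload = base64.b64encode(b"%PDF-1.4 demo").decode("ascii")
SAMPLE = f"""<Invoice xmlns:cac="urn:example:cac" xmlns:cbc="urn:example:cbc">
  <cac:AdditionalDocumentReference>
    <cac:Attachment>
      <cbc:EmbeddedDocumentBinaryObject filename="invoice.pdf">{payload}</cbc:EmbeddedDocumentBinaryObject>
    </cac:Attachment>
  </cac:AdditionalDocumentReference>
</Invoice>"""

# Same element path as in "content_selectors" above.
CONTENT_PATH = (
    "cac:AdditionalDocumentReference/cac:Attachment/"
    "cbc:EmbeddedDocumentBinaryObject"
)


def extract_pdf(root):
    """Return (file name, decoded bytes) of the embedded file, or None."""
    node = root.find(CONTENT_PATH, NS)
    if node is None or not node.text:
        return None
    # ElementTree has no "/@filename" syntax, so read the attribute directly.
    return node.get("filename"), base64.b64decode(node.text.strip())


root = ET.fromstring(SAMPLE)
name, content = extract_pdf(root)
print(name, len(content))
```

If the decoded bytes do not start with `%PDF`, the content selector is probably pointing at the wrong element or the payload is not base64 encoded.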