Kafka API

Instead of sending documents through the HTTP Ingester, it is also possible to place messages with prepared content directly onto the Kafka topic configured for extract requests. In this case the URI and content need to be prepared in advance and supplied as the event value.

Events placed directly onto Kafka are expected to conform to the following schema constraints. Events that do not meet this schema will either be diverted to the Dead Letter Queue (DLQ) configured for the services, or may cause unpredictable behaviour.

This feature is intended for power users who already have a document preparation pipeline in place in their environment.

Headers

The following Kafka headers are used by Smart Cache Documents; only Security-Label is mandatory. Other headers MAY also be included, but only the following have a specific purpose within the pipeline:

| Header name | Description | Required |
| --- | --- | --- |
| Exec-Path | The name of the service or application that created the message. | No |
| Distribution-Id | The distribution ID for the documents being uploaded. | No |
| Data-Source-Reference | A URL for the original source of the document. | No |
| Policy-Information | The EDH or IDH policy information for the documents being uploaded. | No |
| Request-ID | A unique ID for the request. | No |
| Security-Label | The default security label that applies to the documents being uploaded. | Yes |
| Content-Type | The MIME type of the document. | No |

Kafka header example:

{
  "Owner": "Platform Team",
  "Exec-Path": "smart-cache-documents-http-ingester",
  "Distribution-Id": "13bce3bf-7edb-4efb-a54f-574327458dd7",
  "Policy-Information": "{ \"IDH\": { \"apiVersion\": \"v1alpha\", \"uuid\": \"8d3c946b-b371-4906-b5bf-e2413365506e\", \"creationDate\": \"2025-09-30T13:23:00+01:00\", \"containsPii\": false, \"dataSource\": \"example_dataset\", \"access\": { \"classification\": \"S\", \"allowedOrgs\": [\"Telicent\"], \"allowedNats\": [\"GBR\"], \"groups\": [], }, \"ownership\": {\"originatingOrg\": \"Telicent\"}}",
  "Request-ID": "33002c05-a04e-4915-9bf8-9fdf6e9096e6",
  "Security-Label": "(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))",
  "Content-Type": "application/pdf",
  "Data-Source-Reference": "http://your-originating-system/docs/123.pdf"
}

NB: While the above example headers are shown as if they were JSON, Kafka headers are not stored as JSON internally. This example is intended purely to illustrate a possible set of headers an event might have.

Please refer to the documentation for the Kafka client API, or higher-level library, that you are using to write events to Kafka to determine how to set event headers appropriately.
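
For illustration only, the following sketch uses the kafka-python library (an assumption; any Kafka client will work) against a placeholder broker address and topic name. kafka-python accepts headers as a list of (name, bytes) tuples:

import uuid
from kafka import KafkaProducer

# Sketch only: broker address and topic name are placeholders for this example.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Kafka header values are raw bytes, so each value is UTF-8 encoded here.
headers = [
    ("Exec-Path", b"my-document-preparation-pipeline"),
    ("Request-ID", str(uuid.uuid4()).encode("utf-8")),
    ("Security-Label", b"(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))"),
    ("Content-Type", b"application/pdf"),
]

producer.send(
    "document-extract-requests",   # placeholder topic name
    value=b"{}",                   # see the Value section below for the real payload structure
    headers=headers,
)
producer.flush()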

For more guidance on creating Security Labels please refer to the Data Security Labelling documentation.

Key

The event may optionally have a key. If a key is present it MUST be a valid UUID according to Kafka’s default serialization of UUIDs.

A UUID should, as the name suggests, be unique, i.e. you SHOULD NOT reuse a UUID for more than one event key and SHOULD generate a new UUID key for each event.
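
For example, with the kafka-python library used in the earlier sketch, a key serializer can mimic what we understand Kafka’s bundled UUID serialization to be (the UUID’s canonical string form encoded as UTF-8); verify this against the client or serializer you actually use:

import uuid
from kafka import KafkaProducer

# Assumption: Kafka's default UUID serialization is the UUID's canonical string
# form encoded as UTF-8; the key serializer below mirrors that behaviour.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),
)

# Generate a fresh UUID per event; never reuse a key across events.
producer.send("document-extract-requests", key=uuid.uuid4(), value=b"{}")
producer.flush()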

Value

The value of the event is a JSON object with uri, filename and content fields.

The uri field is mandatory, and should be the unique URI for the original document. It is recommended to use a SHA-256 hash of the document content itself, or some other deterministic calculation based on the input document, to form the local name portion of the URI. This ensures the same document will always have the same URI and will prevent duplication of documents in the Elasticsearch or OpenSearch index should the same document be ingested more than once.

The filename field is optional, if present it encodes the filename from which the document content originated. This is primarily used for logging when provided. However, it may also be used to help determine file format, and thus file parser, when extracting textual content from the document. This is done using both the filename and the Content-Type header on the event (if present).

The content field is mandatory and contains the Base64 encoded byte sequence representing the document content.

For example here’s a possible event value:

{
  "uri": "https://example.org/ns#473287f8298dba7163a897908958f7c0eae733e25d2e027992ea2edc9bed2fa8",
  "filename": "example.pdf",
  "content": "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4g"
}
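
As a sketch of how such a value might be assembled (the file path and URI namespace are illustrative assumptions), with the URI local name derived from a SHA-256 hash of the raw content as recommended above:

import base64
import hashlib
import json

# Illustrative only: the file path and namespace are assumptions for this sketch.
with open("example.pdf", "rb") as f:
    raw_bytes = f.read()

# Deterministic URI: hashing the raw content means the same document always
# maps to the same URI, avoiding duplicates if it is ingested more than once.
digest = hashlib.sha256(raw_bytes).hexdigest()

event_value = {
    "uri": f"https://example.org/ns#{digest}",
    "filename": "example.pdf",
    "content": base64.b64encode(raw_bytes).decode("ascii"),
}

# Serialise to UTF-8 JSON bytes ready for use as the Kafka event value.
value_bytes = json.dumps(event_value).encode("utf-8")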

Handling large documents

Kafka enforces a 1 MB limit on messages as standard, and Base64 encoding can increase payload size by as much as one third. This means that in practice the actual limit on incoming raw documents is closer to 700 KB.
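
As a rough back-of-the-envelope check (ignoring the JSON wrapper and headers, which reduce the figure further):

# Base64 expands data by a factor of roughly 4/3, so the largest raw document
# that fits within Kafka's default 1 MiB message limit is approximately:
KAFKA_DEFAULT_LIMIT = 1_048_576          # bytes (1 MiB)
max_raw_document = KAFKA_DEFAULT_LIMIT * 3 // 4
print(max_raw_document)                  # 786432 bytes; JSON and header overhead bring this closer to 700 KB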

If you expect to be dealing with larger documents, you can configure both Kafka and your producer/consumer clients in the following ways:

  1. Enable compression in the producer. This will reduce network and storage costs but increase processing time, and will allow larger documents to fit within the limits provided they compress well. Given our assumption of text-only documents, these gains could be as much as a 50% reduction. Note: these gains assume the text is compressed before Base64 encoding; otherwise the gains will be low.
  2. Increase producer and consumer limits. The producer validates size before compression, and consumers need larger fetch limits to receive bigger messages.
  3. Increase the broker and topic size limits. This is the only reliable way to support consistently large documents, though it does consume more storage.

Recommended configuration changes (adjust values to your specific environment):

| Location | Setting | Default | Purpose |
| --- | --- | --- | --- |
| Producer | compression.type | none | Enable compression (lz4 is a good default for performance but gzip gives higher compression). |
| Producer | max.request.size | 1048576 | Allow larger uncompressed requests (in bytes). |
| Broker | message.max.bytes | 1048588 | Increase the broker limit for message batches. |
| Broker | replica.fetch.max.bytes | 1048576 | Ensure followers can replicate larger messages. |
| Topic | max.message.bytes | 1048588 | Topic-level override to allow larger messages. |
| Consumer | fetch.max.bytes | 52428800 | Allow larger fetches overall. |
| Consumer | max.partition.fetch.bytes | 1048576 | Allow larger fetches per partition. |
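
For example, a producer and consumer tuned for messages of up to roughly 5 MB might look like the following sketch (sizes, topic name and broker address are assumptions to adjust for your environment):

from kafka import KafkaConsumer, KafkaProducer

# Producer: enable compression and raise the request size limit.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",          # or "lz4" for lower CPU cost
    max_request_size=5_242_880,       # 5 MB, matching the broker/topic limits
)

# Consumer: raise the fetch limits to match.
consumer = KafkaConsumer(
    "document-extract-requests",      # placeholder topic name
    bootstrap_servers="localhost:9092",
    fetch_max_bytes=52_428_800,
    max_partition_fetch_bytes=5_242_880,
)

# Broker and topic limits are changed outside client code, for example (illustrative):
#   kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#     --entity-type topics --entity-name document-extract-requests \
#     --add-config max.message.bytes=5242880
# together with message.max.bytes and replica.fetch.max.bytes in the broker configuration.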

Notes:

Because the configuration affects different components, it is critical that you do not accidentally set conflicting values.

  • Set max.request.size to at least the uncompressed size of the Base64-encoded payload.
  • Topic limits must be less than or equal to the broker message.max.bytes limit.
  • Set replica.fetch.max.bytes to be greater than or equal to the broker message.max.bytes.
  • If documents can be very large or do not compress well, consider splitting content across multiple messages in your own producer/consumer pipeline, as the Telicent pipeline expects a complete document in each event.

[EARLY DRAFT RELEASE] Copyright 2020-2025 Telicent Limited. All rights reserved