Kafka API
Instead of sending documents through the HTTP Ingester, it is also possible to place messages with prepared content directly onto the Kafka topic configured for extract requests. In this case the URI and content need to be prepared in advance and supplied as the event value.
Events placed directly onto Kafka are expected to conform to the following schema constraints. Events that do not meet this schema will either be routed to the Dead Letter Queue (DLQ) configured for the services, or may cause unpredictable behaviour.
This feature is intended for power users who already have some document preparation pipeline in place in their environments.
Headers
The following Kafka headers are used by Smart Cache Documents; only Security-Label is mandatory. Other headers MAY also be included, but only the following have a specific purpose within the pipeline:
| Header name | Description | Required |
|---|---|---|
| Exec-Path | The name of the service or application that created the message. | No |
| Distribution-Id | The distribution ID for the documents being uploaded. | No |
| Data-Source-Reference | A URL for the original source of the document. | No |
| Policy-Information | The EDH or IDH policy information for the documents being uploaded. | No |
| Request-ID | A unique ID for the request. | No |
| Security-Label | The default security label that applies to the documents being uploaded. | Yes |
| Content-Type | The MIME type of the document. | No |
Kafka header example:
{
"Owner": "Platform Team",
"Exec-Path": "smart-cache-documents-http-ingester",
"Distribution-Id": "13bce3bf-7edb-4efb-a54f-574327458dd7",
"Policy-Information": "{ \"IDH\": { \"apiVersion\": \"v1alpha\", \"uuid\": \"8d3c946b-b371-4906-b5bf-e2413365506e\", \"creationDate\": \"2025-09-30T13:23:00+01:00\", \"containsPii\": false, \"dataSource\": \"example_dataset\", \"access\": { \"classification\": \"S\", \"allowedOrgs\": [\"Telicent\"], \"allowedNats\": [\"GBR\"], \"groups\": [], }, \"ownership\": {\"originatingOrg\": \"Telicent\"}}",
"Request-ID": "33002c05-a04e-4915-9bf8-9fdf6e9096e6",
"Security-Label": "(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))",
"Content-Type": "application/pdf",
"Data-Source-Reference": "http://your-originating-system/docs/123.pdf"
}
NB: While the above example headers are shown as if they were JSON, Kafka headers are not stored as JSON internally. This example is intended purely to be illustrative of a possible set of headers an event might have.
Please refer to the documentation of the Kafka client API, or higher-level library, that you are using to write events to Kafka to determine how to set the event headers appropriately.
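For example, using the Apache Kafka Java client, headers can be attached to a ProducerRecord before it is sent. This is a minimal sketch only; the broker address, topic name, label value and placeholder JSON value are illustrative and not part of the Smart Cache Documents configuration:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class HeaderExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Placeholder topic and value; the value format is described in the Value section below
        String topic = "document-extract-requests";
        String jsonValue = "{\"uri\": \"...\", \"content\": \"...\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key is set in this sketch; see the Key section for key requirements
            ProducerRecord<String, String> record = new ProducerRecord<>(topic, jsonValue);
            // Kafka header values are raw bytes; Security-Label is the only mandatory header
            record.headers().add("Security-Label",
                    "(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))"
                            .getBytes(StandardCharsets.UTF_8));
            record.headers().add("Content-Type",
                    "application/pdf".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```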
For more guidance on creating Security Labels please refer to the Data Security Labelling documentation.
Key
The event may optionally have a key. If a key is present it MUST be a valid UUID according to Kafka’s default serialization of UUIDs.
A UUID should, as the name suggests, be unique, i.e. you SHOULD NOT reuse a UUID for more than one event key and SHOULD generate a new UUID key for each event.
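As a sketch with the Java client, Kafka's built-in UUIDSerializer can be used as the key serializer and a fresh UUID generated per event; the class and method names here are illustrative:

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.UUIDSerializer;

import java.util.Properties;
import java.util.UUID;

public class EventKeyExample {
    // Build producer properties that use Kafka's built-in UUID serializer for keys
    // (it serializes the UUID's string representation).
    static Properties keyConfig() {
        Properties props = new Properties();
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, UUIDSerializer.class.getName());
        return props;
    }

    // Generate a fresh key for every event; keys must not be reused across events.
    static UUID newEventKey() {
        return UUID.randomUUID();
    }
}
```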
Value
The value of the event is a JSON object with uri, filename and content fields.
The uri field is mandatory, and should be the unique URI for the original document. It is recommended to use a SHA-256 hash of the document content itself, or some other deterministic calculation based on the input document, to form the local name portion of the URI. This ensures the same document will always have the same URI and will prevent duplication of documents in the Elasticsearch or OpenSearch index should the same document be ingested more than once.
The filename field is optional, if present it encodes the filename from which the document content originated. This is primarily used for logging when provided. However, it may also be used to help determine file format, and thus file parser, when extracting textual content from the document. This is done using both the filename and the Content-Type header on the event (if present).
The content field is mandatory and contains the Base64 encoded byte sequence representing the document content.
For example, here’s a possible event value:
{
"uri": "https://example.org/ns#473287f8298dba7163a897908958f7c0eae733e25d2e027992ea2edc9bed2fa8",
"filename": "example.pdf",
"content": "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4g"
}
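The following sketch shows one way to construct such a value in Java, deriving the URI local name from a SHA-256 hash of the document bytes and Base64 encoding the content. It assumes the Jackson library is available for JSON construction, and the namespace and class name are illustrative:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HexFormat;

public class EventValueBuilder {
    // Build the JSON event value for a document, deriving the URI local name from a
    // SHA-256 hash of the document bytes so re-ingesting the same document always
    // produces the same URI.
    public static String buildValue(Path document) throws Exception {
        byte[] bytes = Files.readAllBytes(document);

        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        String hash = HexFormat.of().formatHex(digest.digest(bytes));

        ObjectNode value = new ObjectMapper().createObjectNode();
        value.put("uri", "https://example.org/ns#" + hash);        // namespace is illustrative
        value.put("filename", document.getFileName().toString());  // optional field
        value.put("content", Base64.getEncoder().encodeToString(bytes));
        return value.toString();
    }
}
```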
Handling large documents
Kafka enforces a 1 MB limit on messages as standard. Base64 encoding represents every 3 bytes of input as 4 bytes of output, so payloads grow by roughly one third when encoded. This means that in practice the actual limit on incoming raw documents is closer to 700 KB.
If you expect to be dealing with larger documents, you can configure both Kafka and your producer/consumer clients in the following ways:
- Enable compression in the producer. This reduces network and storage costs but increases processing time. It allows larger documents to fit within the limits, provided they compress well. Given our assumption of text-only documents, these gains could be as much as a 50% reduction. Note: these gains assume that the text is compressed before Base64 encoding; otherwise the gains will be low.
- Increase producer and consumer limits. The producer validates size before compression, and consumers need larger fetch limits to receive bigger messages.
- Increase the broker and topic size limits. This is the only reliable way to support consistently large documents, although it does consume more storage.
Recommended configuration changes (adjust values to your specific environment):
| Location | Setting | Default | Purpose |
|---|---|---|---|
| Producer | compression.type | none | Enable compression (lz4 is a good default for performance but gzip gives higher compression). |
| Producer | max.request.size | 1048576 | Allow larger uncompressed requests (in bytes). |
| Broker | message.max.bytes | 1048588 | Increase the broker limit for message batches. |
| Broker | replica.fetch.max.bytes | 1048576 | Ensure followers can replicate larger messages. |
| Topic | max.message.bytes | 1048588 | Topic-level override to allow larger messages. |
| Consumer | fetch.max.bytes | 52428800 | Allow larger fetches overall. |
| Consumer | max.partition.fetch.bytes | 1048576 | Allow larger fetches per partition. |
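As an illustration of the producer and consumer rows above, the following sketch (Java client) applies the client-side overrides; the size values are examples only and must stay consistent with whatever broker and topic limits you configure:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class LargeDocumentClientConfig {
    // Producer settings: enable compression and raise the request size limit.
    static Properties producerOverrides() {
        Properties props = new Properties();
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10 * 1024 * 1024);          // 10 MB, example only
        return props;
    }

    // Consumer settings: raise the fetch limits so larger messages can be received.
    static Properties consumerOverrides() {
        Properties props = new Properties();
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 64 * 1024 * 1024);           // 64 MB, example only
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10 * 1024 * 1024); // 10 MB, example only
        return props;
    }
}
```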
Notes:
Given the configuration affects different components, it is critical that you do not accidentally set conflicting values.
- Set max.request.size to at least the uncompressed size of the Base64-encoded payload.
- Topic limits must be less than or equal to the broker message.max.bytes limit.
- Set replica.fetch.max.bytes to be greater than or equal to the broker message.max.bytes.
- If documents can be very large or do not compress well, consider splitting content across multiple messages in your own producer/consumer pipeline, as the Telicent pipeline expects a complete document in each event.
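The topic-level limit can also be raised programmatically with the Kafka Admin client, as in the following sketch; the broker address, topic name and size are placeholders, and the broker-level message.max.bytes and replica.fetch.max.bytes settings must still be changed in the broker configuration itself:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicLimitExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // Raise max.message.bytes on the target topic (topic name and size are placeholders)
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "document-extract-requests");
            AlterConfigOp raiseLimit = new AlterConfigOp(
                    new ConfigEntry("max.message.bytes", "10485760"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(raiseLimit))).all().get();
        }
    }
}
```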