Kafka API

Instead of sending documents through the HTTP Ingester, it is also possible to place messages with prepared content directly onto the Kafka topic configured for extract requests. In this case the URI and content need to be prepared in advance and supplied as the event value.

Events placed directly onto Kafka are expected to conform to the following schema constraints. Events that do not meet this schema will either be diverted to the Dead Letter Queue (DLQ) configured for the services, or may cause unpredictable behaviour.

This feature is intended for power users who already have a document preparation pipeline in place in their environment.

Headers

The following Kafka headers are used by Smart Cache Documents; only Security-Label is mandatory. Other headers MAY also be included, but only the following have a specific purpose within the pipeline:

| Header name | Description | Required |
| --- | --- | --- |
| Exec-Path | The name of the service or application that created the message. | No |
| Distribution-Id | The distribution ID for the documents being uploaded. | No |
| Data-Source-Reference | A URL for the original source of the document. | No |
| Policy-Information | The EDH or IDH policy information for the documents being uploaded. | No |
| Request-ID | A unique ID for the request. | No |
| Security-Label | The default security label that applies to the documents being uploaded. | Yes |
| Content-Type | The MIME type of the document. | No |

Kafka header example:

{
  "Owner": "Platform Team",
  "Exec-Path": "smart-cache-documents-http-ingester",
  "Distribution-Id": "13bce3bf-7edb-4efb-a54f-574327458dd7",
  "Policy-Information": "{ \"IDH\": { \"apiVersion\": \"v1alpha\", \"uuid\": \"8d3c946b-b371-4906-b5bf-e2413365506e\", \"creationDate\": \"2025-09-30T13:23:00+01:00\", \"containsPii\": false, \"dataSource\": \"example_dataset\", \"access\": { \"classification\": \"S\", \"allowedOrgs\": [\"Telicent\"], \"allowedNats\": [\"GBR\"], \"groups\": [], }, \"ownership\": {\"originatingOrg\": \"Telicent\"}}",
  "Request-ID": "33002c05-a04e-4915-9bf8-9fdf6e9096e6",
  "Security-Label": "(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))",
  "Content-Type": "application/pdf",
  "Data-Source-Reference": "http://your-originating-system/docs/123.pdf"
}

NB: While the above example headers are shown as if they were JSON, Kafka headers are not stored as JSON internally. This example is intended purely to illustrate a possible set of headers an event might have.

Please refer to the documentation for the Kafka client API, or higher-level library, that you are using to write events to Kafka to determine how to set event headers appropriately.
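
For illustration only, the following sketch uses the kafka-python library (an assumption; any Kafka client will work) against a placeholder broker address and topic name. kafka-python accepts headers as a list of (name, bytes) tuples:

import uuid
from kafka import KafkaProducer

# Sketch only: broker address and topic name are placeholders for this example.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Kafka header values are raw bytes, so each value is UTF-8 encoded here.
headers = [
    ("Exec-Path", b"my-document-preparation-pipeline"),
    ("Request-ID", str(uuid.uuid4()).encode("utf-8")),
    ("Security-Label", b"(classification=O&(permitted_organisations=Telicent)&(permitted_nationalities=GBR))"),
    ("Content-Type", b"application/pdf"),
]

producer.send(
    "document-extract-requests",   # placeholder topic name
    value=b"{}",                   # see the Value section below for the real payload structure
    headers=headers,
)
producer.flush()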

For more guidance on creating Security Labels please refer to the Data Security Labelling documentation.

Key

The event may optionally have a key. If a key is present it MUST be a valid UUID according to Kafka’s default serialization of UUIDs.

A UUID should, as the name suggests, be unique, i.e. you SHOULD NOT reuse a UUID for more than one event key and SHOULD generate a new UUID key for each event.
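
For example, with the kafka-python library used in the earlier sketch, a key serializer can mimic what we understand Kafka’s bundled UUID serialization to be (the UUID’s canonical string form encoded as UTF-8); verify this against the client or serializer you actually use:

import uuid
from kafka import KafkaProducer

# Assumption: Kafka's default UUID serialization is the UUID's canonical string
# form encoded as UTF-8; the key serializer below mirrors that behaviour.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),
)

# Generate a fresh UUID per event; never reuse a key across events.
producer.send("document-extract-requests", key=uuid.uuid4(), value=b"{}")
producer.flush()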

Value

The value of the event is a JSON object with uri, filename and content fields.

The uri field is mandatory, and should be the unique URI for the original document. It is recommended to use a SHA-256 hash of the document content itself, or some other deterministic calculation based on the input document, to form the local name portion of the URI. This ensures the same document will always have the same URI and will prevent duplication of documents in the Elasticsearch or OpenSearch index should the same document be ingested more than once.

The filename field is optional, if present it encodes the filename from which the document content originated. This is primarily used for logging when provided. However, it may also be used to help determine file format, and thus file parser, when extracting textual content from the document. This is done using both the filename and the Content-Type header on the event (if present).

The content field is mandatory and contains the Base64 encoded byte sequence representing the document content.

For example here’s a possible event value:

{
  "uri": "https://example.org/ns#473287f8298dba7163a897908958f7c0eae733e25d2e027992ea2edc9bed2fa8",
  "filename": "example.pdf",
  "content": "VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4g"
}
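
As a sketch of how such a value might be assembled (the file path and URI namespace are illustrative assumptions), with the URI local name derived from a SHA-256 hash of the raw content as recommended above:

import base64
import hashlib
import json

# Illustrative only: the file path and namespace are assumptions for this sketch.
with open("example.pdf", "rb") as f:
    raw_bytes = f.read()

# Deterministic URI: hashing the raw content means the same document always
# maps to the same URI, avoiding duplicates if it is ingested more than once.
digest = hashlib.sha256(raw_bytes).hexdigest()

event_value = {
    "uri": f"https://example.org/ns#{digest}",
    "filename": "example.pdf",
    "content": base64.b64encode(raw_bytes).decode("ascii"),
}

# Serialise to UTF-8 JSON bytes ready for use as the Kafka event value.
value_bytes = json.dumps(event_value).encode("utf-8")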

Handling large documents

Kafka enforces a 1 MB limit on messages as standard, and Base64 encoding can increase payload size by as much as one third. This means that in practice the actual limit on incoming raw documents is closer to 700 KB.
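
As a rough back-of-the-envelope check (ignoring the JSON wrapper and headers, which reduce the figure further):

# Base64 expands data by a factor of roughly 4/3, so the largest raw document
# that fits within Kafka's default 1 MiB message limit is approximately:
KAFKA_DEFAULT_LIMIT = 1_048_576          # bytes (1 MiB)
max_raw_document = KAFKA_DEFAULT_LIMIT * 3 // 4
print(max_raw_document)                  # 786432 bytes; JSON and header overhead bring this closer to 700 KB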

If you expect to be dealing with larger documents, you can configure both Kafka and your producer/consumer clients in the following ways:

  1. Enable compression in the producer. This will reduce network and storage costs but increase processing time, and will allow larger documents to fit within the limits provided they compress well. Given our assumption of text-only documents, these gains could be as much as a 50% reduction. Note: these gains assume the text is compressed before Base64 encoding; otherwise the gains will be low.
  2. Increase producer and consumer limits. The producer validates size before compression, and consumers need larger fetch limits to receive bigger messages.
  3. Increase the broker and topic size limits. This is the only reliable way to support consistently large documents, though it does consume more storage.

Recommended configuration changes (adjust values to your specific environment):

| Location | Setting | Default | Purpose |
| --- | --- | --- | --- |
| Producer | compression.type | none | Enable compression (lz4 is a good default for performance but gzip gives higher compression). |
| Producer | max.request.size | 1048576 | Allow larger uncompressed requests (in bytes). |
| Broker | message.max.bytes | 1048588 | Increase the broker limit for message batches. |
| Broker | replica.fetch.max.bytes | 1048576 | Ensure followers can replicate larger messages. |
| Topic | max.message.bytes | 1048588 | Topic-level override to allow larger messages. |
| Consumer | fetch.max.bytes | 52428800 | Allow larger fetches overall. |
| Consumer | max.partition.fetch.bytes | 1048576 | Allow larger fetches per partition. |
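
For example, a producer and consumer tuned for messages of up to roughly 5 MB might look like the following sketch (sizes, topic name and broker address are assumptions to adjust for your environment):

from kafka import KafkaConsumer, KafkaProducer

# Producer: enable compression and raise the request size limit.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",          # or "lz4" for lower CPU cost
    max_request_size=5_242_880,       # 5 MB, matching the broker/topic limits
)

# Consumer: raise the fetch limits to match.
consumer = KafkaConsumer(
    "document-extract-requests",      # placeholder topic name
    bootstrap_servers="localhost:9092",
    fetch_max_bytes=52_428_800,
    max_partition_fetch_bytes=5_242_880,
)

# Broker and topic limits are changed outside client code, for example (illustrative):
#   kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#     --entity-type topics --entity-name document-extract-requests \
#     --add-config max.message.bytes=5242880
# together with message.max.bytes and replica.fetch.max.bytes in the broker configuration.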

Notes:

Because the configuration affects different components, it is critical that you do not accidentally set conflicting values.

  • Set max.request.size to at least the uncompressed size of the Base64-encoded payload.
  • Topic limits must be less than or equal to the broker message.max.bytes limit.
  • Set replica.fetch.max.bytes to be greater than or equal to the broker message.max.bytes.
  • If documents can be very large or do not compress well, consider splitting content across multiple messages in your own producer/consumer pipeline, as the Telicent pipeline expects a complete document in each event.

[EARLY DRAFT RELEASE] Copyright 2020-2025 Telicent Limited. All rights reserved