Compaction

Compaction API

What is compaction?

Smart-Cache Graph stores RDF datasets in Apache Jena TDB2 on disk and maintains an ABAC label store (RocksDB) for access control. Over time, especially after heavy ingestion, deletes, or Kafka catch‑up, on-disk data can become fragmented and larger than necessary (“bloat”).

This is partly due to the copy-on-write strategy used by Jena, described in more detail here, which can create duplicate entries so that running queries are not blocked while data is being uploaded.

Compaction rewrites and re-indexes data to:

  • Reclaim disk space
  • Improve I/O locality and query performance
  • Remove obsolete and outdated segments and files

In Smart-Cache Graph, compaction covers both:

  • The TDB2 dataset(s) which back each configured /dataset.
  • The ABAC label store (RocksDB) which stores the security labels for the instance.

If a dataset is not TDB2 on disk, it is skipped.

NOTE: Compaction requires exclusive access to the dataset while it runs. In practice, this means write operations are paused and some reads may be temporarily blocked. Ideally, schedule compaction during low-traffic windows.


Automatic Compaction

On startup, Smart-Cache Graph registers an Initial Compaction module that may run twice:

  1. Before Kafka connectors start (cleans up any previous bloat)
  2. After Kafka catch‑up begins (cleans up any immediate bloat created by the Kafka data load)

To avoid unnecessary churn, Smart-Cache Graph remembers the last compacted size per dataset and skips a second run if the size hasn’t grown.
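
The skip rule above can be sketched as follows. This is an illustrative model only, not the Smart-Cache Graph implementation: the class name CompactionTracker and its methods are hypothetical, and sizes are assumed to be plain byte counts.

```python
class CompactionTracker:
    """Remembers the size each dataset had right after its last compaction.

    Hypothetical helper for illustration; not part of Smart-Cache Graph.
    """

    def __init__(self):
        self._last_compacted_size = {}  # dataset name -> size in bytes

    def should_compact(self, dataset: str, current_size: int) -> bool:
        last = self._last_compacted_size.get(dataset)
        # Compact when there is no record yet, or the dataset has grown since.
        return last is None or current_size > last

    def record(self, dataset: str, compacted_size: int) -> None:
        self._last_compacted_size[dataset] = compacted_size


tracker = CompactionTracker()
print(tracker.should_compact("/myDataset", 1000))  # True: never compacted
tracker.record("/myDataset", 800)
print(tracker.should_compact("/myDataset", 800))   # False: size has not grown
print(tracker.should_compact("/myDataset", 950))   # True: grown since last run
```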

NOTE: Automatic compaction is enabled by default. To disable it, set DISABLE_INITIAL_COMPACTION=true.

You may want to do this in some cases, for example for datasets that will never be updated.


REST Endpoint

A REST endpoint is available for manually triggering compaction within a running Smart-Cache Graph instance.

NOTE: This endpoint is not available to end users. It can only be reached from internal systems by a suitably authorised administrator.

The description below is for information only.

POST /$/compactall

Trigger compaction for all datasets and the ABAC label store, immediately.

  • Method: POST
  • Auth: None - not exposed publicly.
  • Body: None
  • Response: 200 OK on success with an empty body. On error, Smart-Cache Graph returns an error status code (for example, a 4xx/5xx) with a short message.
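
As a minimal sketch of the request described above, the following uses only the Python standard library. The base URL (http://localhost:3030) and the one-hour timeout are assumptions for illustration; substitute the internal address of your own instance. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: host and port of an internally reachable instance.
BASE_URL = "http://localhost:3030"


def build_compact_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request with an empty body for /$/compactall."""
    return urllib.request.Request(
        f"{base_url}/$/compactall", data=b"", method="POST"
    )


def trigger_compaction(base_url: str = BASE_URL, timeout: float = 3600.0) -> int:
    """Send the request and return the HTTP status code (200 on success)."""
    try:
        request = build_compact_request(base_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report the code instead.
        return err.code
```

Calling trigger_compaction() from a host with internal access should return 200 once compaction has finished. The generous timeout reflects that the call is expected to take as long as the compaction itself.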

What gets compacted? Every configured TDB2 dataset is compacted in turn. After each dataset compacts, Smart-Cache Graph also compacts the ABAC label store (RocksDB).

Idempotency: If a subsequent call finds that a dataset's size has not increased since the last compaction, Smart-Cache Graph (SCG) logs that it is already maximally compacted and skips the work. If a call is made while a compaction is already underway, it is ignored.
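
The "ignored while underway" behaviour can be modelled with a non-blocking lock. This is a sketch of the general technique, not SCG's actual code; the function compact_all and its do_compact callback are hypothetical names.

```python
import threading

_compaction_lock = threading.Lock()


def compact_all(do_compact) -> bool:
    """Run do_compact() unless a compaction is already underway.

    Returns True if this call ran the compaction, False if it was ignored.
    """
    # Non-blocking acquire: a second caller gets False immediately
    # instead of queueing behind the running compaction.
    if not _compaction_lock.acquire(blocking=False):
        return False
    try:
        do_compact()
        return True
    finally:
        _compaction_lock.release()


# Simulate a second call arriving while the first is still running.
overlapping = []
first = compact_all(lambda: overlapping.append(compact_all(lambda: None)))
print(first, overlapping)  # True [False]
```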


Operational guidance

  • When to run it:
    • After large ingestion or Kafka catch‑up
    • After significant deletes/updates
    • When monitoring shows unusual disk growth or slower queries
  • How long it takes:
    • Depends on dataset size and storage performance.
    • Smart-Cache Graph logs start/end messages with durations, so typical run times can be established from the logs.
  • Availability impact:
    • Compaction acquires an exclusive lock on the dataset.
    • Plan a maintenance window or run during off‑peak hours.
  • Concurrency:
    • Don’t trigger multiple compactions at once. Calls made while a compaction is running are ignored.
    • Call /$/compactall once and wait for it to return.
  • Restarting:
    • The simplest way to carry out a compaction is to restart the relevant Smart-Cache Graph instance, since compaction runs during startup (unless DISABLE_INITIAL_COMPACTION is set).

Monitoring & logs

Smart-Cache Graph logs compaction messages under the io.telicent.core.FMod_InitialCompaction logger, for example:

[Compaction] >>>> Start compact /myDataset, current size is 123.4 GB (132,497,408,000)
[Compaction] <<<< Finish compact /myDataset. Took 95 seconds.  Compacted size is 97.2 GB (104,389,615,616)
[Compaction] <<<< Start label store compaction.
[Compaction] <<<< Finish label store compaction. Took 3 seconds.

These messages confirm:

  • Which dataset is being compacted
  • Before/after sizes
  • Total time taken
  • Label store compaction status

Consider scraping logs into your observability stack and wiring alerts for abnormal durations or repeated failures.
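
As a starting point for log scraping, the sketch below extracts the dataset name, before/after byte counts and duration from lines shaped like the examples above. It assumes the message format matches those examples exactly; any timestamp or logger prefix your logging configuration adds is ignored because the patterns search within each line. The function and pattern names are hypothetical.

```python
import re

START_RE = re.compile(
    r"Start compact (?P<dataset>\S+), current size is .*\((?P<bytes>[\d,]+)\)"
)
FINISH_RE = re.compile(
    r"Finish compact (?P<dataset>\S+)\. Took (?P<seconds>\d+) seconds\.\s+"
    r"Compacted size is .*\((?P<bytes>[\d,]+)\)"
)


def _to_int(text: str) -> int:
    """Convert '132,497,408,000' to 132497408000."""
    return int(text.replace(",", ""))


def summarise(lines):
    """Return {dataset: {"before": bytes, "after": bytes, "seconds": int}}."""
    summary = {}
    for line in lines:
        if m := START_RE.search(line):
            summary.setdefault(m["dataset"], {})["before"] = _to_int(m["bytes"])
        elif m := FINISH_RE.search(line):
            entry = summary.setdefault(m["dataset"], {})
            entry["after"] = _to_int(m["bytes"])
            entry["seconds"] = int(m["seconds"])
    return summary


SAMPLE = [
    "[Compaction] >>>> Start compact /myDataset, current size is 123.4 GB (132,497,408,000)",
    "[Compaction] <<<< Finish compact /myDataset. Took 95 seconds.  Compacted size is 97.2 GB (104,389,615,616)",
]
stats = summarise(SAMPLE)["/myDataset"]
print(stats["before"] - stats["after"], stats["seconds"])  # 28107792384 95
```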


Troubleshooting

401 Unauthorized / 404 Not Found

  • You are probably calling the endpoint from outside the internal network, which will not work.
  • Call /$/compactall directly on the Smart-Cache Graph instance.

No size change after compaction

  • SCG skips repeated work if the dataset hasn’t grown since the last compaction, and it will log that it’s already maximally compacted.

Long runtimes or timeouts

  • Check storage performance and available free space.
  • Run during off‑peak and avoid other heavy I/O.
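
A simple pre-flight check for free space might look like the sketch below. The headroom rule (free space at least equal to the current dataset size) is an assumed, conservative heuristic rather than a documented Smart-Cache Graph requirement, and the dataset directory path is whatever your deployment uses. The function names are hypothetical.

```python
import os
import shutil


def directory_size(path: str) -> int:
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_headroom(dataset_dir: str, factor: float = 1.0) -> bool:
    """True if the volume holding dataset_dir has at least
    factor x the dataset's current size available as free space."""
    free = shutil.disk_usage(dataset_dir).free
    return free >= factor * directory_size(dataset_dir)
```

For example, has_headroom("/path/to/dataset") could be checked before calling the endpoint, and the compaction postponed if it returns False.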

Summary

  • Compaction reclaims space and improves performance for the underlying TDB2 datasets and the ABAC label store.
  • Smart-Cache Graph runs automatic compaction at startup (up to twice) unless disabled via DISABLE_INITIAL_COMPACTION.
  • Only administrators can trigger compaction on demand, using POST /$/compactall.
  • Compaction should ideally be scheduled during low‑traffic periods.

[EARLY DRAFT RELEASE] Copyright 2020-2025 Telicent Limited. All rights reserved