Tuning
This page describes the available tuning parameters and gives recommendations on JVM tuning.
JVM Tuning
The Indexer should require relatively little JVM tuning. JVM parameters can be customised by setting the JAVA_OPTIONS
environment variable for the indexer deployment.
The main consideration is JVM heap size. All of the indexer's memory usage is on-heap, so the maximum heap size (-Xmx)
should be set relative to the memory limits applied to the indexer deployment, allowing some overhead for OS memory usage.
The two main consumers of heap space are the data structures used to build, and eventually generate, the JSON documents that get indexed into the underlying search index, and the duplicate suppression cache. Note that the latter has a memory footprint that grows linearly with the size of the cache, since the cache stores hashes of previously indexed documents and not the documents themselves.
The documents, and associated data structures, are the main driver of memory usage, and this is somewhat unpredictable as it depends on the data. Data with larger RDF literals, e.g. long descriptions or other string properties, will naturally generate larger documents and require more memory. Choosing the right heap size will therefore typically require observing the JVM metrics emitted by the indexer while it operates on your data, to see actual heap usage. If the heap is constantly at its maximum, consider increasing it.
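As a rough illustration of setting -Xmx relative to the deployment's memory limit, the sketch below derives a heap flag from a container limit. The function name and the 0.75 heap fraction are assumptions made for this example, not recommendations from the indexer itself; the right fraction depends on the heap usage you observe.

```python
def suggest_max_heap_flag(container_limit_mb: int, heap_fraction: float = 0.75) -> str:
    """Return a -Xmx flag that leaves part of the memory limit as overhead.

    heap_fraction is an illustrative assumption, not a documented default.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1 (exclusive)")
    heap_mb = int(container_limit_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


# With a 2048 MB memory limit on the deployment:
print(suggest_max_heap_flag(2048))  # prints -Xmx1536m
```

The resulting flag would then be supplied through the JAVA_OPTIONS environment variable described above.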
Parameters
Parameter | Env Variable | Default | Purpose |
---|---|---|---|
--duplicate-cache-size | | 1000000 | Controls duplicate document cache size |
--index-batch-size | INDEX_BATCH_SIZE | 100 | Controls bulk indexing batch size |
--flush-per-batches | | 10 | Controls how frequently writes are flushed |
--max-idle-time | | 60 | Controls max time between indexing if batch size not reached |
These parameters can be set either via environment variables (where supported) for the indexer deployment, or by customising the args
passed to the indexer container. If both the argument and the environment variable are set then the argument takes precedence.
NB If no environment variable is shown in the above table then the parameter can only be supplied via arguments currently. Future releases will make more of these parameters controllable via environment variable.
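The precedence rule described above (argument first, then environment variable, then default) can be sketched as follows. This is a minimal illustration and not the indexer's actual option parsing; `resolve_setting` and its parameters are hypothetical names.

```python
import os


def resolve_setting(arg_value, env_name, default, environ=None):
    """Resolve an integer setting: argument > environment variable > default.

    env_name may be None for parameters that have no environment variable.
    """
    environ = os.environ if environ is None else environ
    if arg_value is not None:
        return int(arg_value)
    if env_name is not None and env_name in environ:
        return int(environ[env_name])
    return default


env = {"INDEX_BATCH_SIZE": "250"}
print(resolve_setting(None, "INDEX_BATCH_SIZE", 100, env))   # prints 250
print(resolve_setting("500", "INDEX_BATCH_SIZE", 100, env))  # prints 500
```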
--duplicate-cache-size
The --duplicate-cache-size <cache-size> option takes a value indicating the desired size of the duplicate suppression cache. The default value is 1000000, i.e. 1 million documents may be cached in memory. If you know that duplicates will be infrequent for your data then this should be reduced accordingly. Conversely, if your data has a lot of common reference entities, e.g. locations, consider making this larger, provided that the indexer is provisioned with sufficient Java heap.
NB As noted under duplicate suppression, the cache contains document content hashes, not the document contents themselves, so it has a fixed memory footprint for each cache entry. Thus, even if the generated documents are large, there should be no need to reconfigure this setting on that account.
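To show why the per-entry footprint is independent of document size, here is a minimal sketch of a bounded, hash-based duplicate cache. The class name, the use of SHA-256, and the least-recently-used eviction are all assumptions for illustration; the indexer's actual hash function and eviction policy are not described here.

```python
import hashlib
from collections import OrderedDict


class DuplicateCache:
    """Bounded cache of document content hashes (illustrative only)."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._hashes = OrderedDict()  # digest -> None, oldest entry first

    def is_duplicate(self, document: str) -> bool:
        # Each entry stores a fixed 32-byte digest, however large the document is.
        digest = hashlib.sha256(document.encode("utf-8")).digest()
        if digest in self._hashes:
            self._hashes.move_to_end(digest)  # mark as recently seen
            return True
        self._hashes[digest] = None
        if len(self._hashes) > self.max_size:
            self._hashes.popitem(last=False)  # evict the least recently seen hash
        return False


cache = DuplicateCache(max_size=1_000_000)
print(cache.is_duplicate('{"id": "place-1"}'))  # prints False
print(cache.is_duplicate('{"id": "place-1"}'))  # prints True
```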
--index-batch-size
This option controls the batch size for the Bulk Indexing step of the indexing pipeline. It defaults to 100, i.e. the indexer will attempt to index 100 documents in one batch.
NB If the batch size is too large then the underlying index MAY reject the request, and WARN level messages will be generated in the indexer logs indicating this is occurring. In this case the indexer retries with the request split into smaller chunks, using an exponential backoff-retry strategy. It is therefore important not to set this value too high, otherwise the indexer will waste time splitting requests into chunks and retrying them.
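The cost of an oversized batch can be seen in a sketch of a split-and-retry loop with exponential backoff. This is not the indexer's actual implementation: `send_bulk` is a hypothetical callable that returns True when the index accepts a chunk, and halving rejected chunks, the base delay, and the retry limit are all assumed for the example.

```python
import time


def index_with_split_retry(docs, send_bulk, base_delay=1.0, max_retries=5, sleep=time.sleep):
    """Send docs in bulk; on rejection, halve the rejected chunks and retry.

    Returns the number of retry rounds that were needed.
    """
    docs = list(docs)
    if not docs:
        return 0
    pending = [docs]
    retries = 0
    while pending:
        rejected = []
        for chunk in pending:
            if send_bulk(chunk):
                continue
            if len(chunk) <= 1:
                raise RuntimeError("a single document was rejected; cannot split further")
            mid = len(chunk) // 2
            rejected.extend([chunk[:mid], chunk[mid:]])
        if rejected:
            retries += 1
            if retries > max_retries:
                raise RuntimeError("giving up after too many retries")
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (retries - 1))
        pending = rejected
    return retries
```

In this sketch, a batch of 100 documents sent to an index that only accepts 25 at a time needs two retry rounds and two backoff waits before everything is indexed, which is time a suitably sized batch would not have spent.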
--flush-per-batches
This option controls how frequently the indexer explicitly flushes writes to the underlying index. It defaults to 10, meaning that it explicitly flushes writes after every 10 batches. With the default batch size of 100, the indexer would therefore explicitly flush writes every 1000 documents.
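The relationship between the two settings is simple multiplication, shown here as a small helper for checking other combinations. The function name is hypothetical; the defaults mirror those in the parameter table.

```python
def documents_per_flush(index_batch_size: int = 100, flush_per_batches: int = 10) -> int:
    """Number of documents indexed between explicit flushes."""
    return index_batch_size * flush_per_batches


print(documents_per_flush())        # prints 1000
print(documents_per_flush(500, 4))  # prints 2000
```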
NB The underlying index will naturally flush writes itself over time anyway without an explicit flush being needed, but for high volume pipelines explicit flushes can make newly indexed documents visible in search results sooner.
--max-idle-time
The --max-idle-time <idle-seconds> option takes a value indicating the maximum idle time allowed between indexing operations. The default is 60 seconds, i.e. 1 minute.
If this time is exceeded then the indexer will index the current batch of documents into the underlying index regardless of whether the configured batch size has been reached. Thus setting a suitable value for this ensures new documents are regularly indexed regardless of the performance and data flow in the pipeline.
When indexing is completely up to date with the input Kafka topics, this effectively places an upper bound on the delay between an RDF event being read from Kafka and the entities extracted from it being visible in the search index.
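The interaction between batch size and idle time can be expressed as a single decision, sketched below. The function and parameter names are hypothetical, and treating an empty batch as "nothing to index" is an assumption of this example rather than documented behaviour.

```python
def should_index_batch(batch_len: int, batch_size: int,
                       seconds_since_last_index: float,
                       max_idle_seconds: float = 60) -> bool:
    """Decide whether the current batch should be sent to the index now."""
    if batch_len == 0:
        return False  # assumed: nothing buffered, so nothing to index
    if batch_len >= batch_size:
        return True   # configured batch size reached
    # Batch is only partially full: index it once the idle time is exceeded.
    return seconds_since_last_index > max_idle_seconds


print(should_index_batch(100, 100, 0))  # prints True
print(should_index_batch(5, 100, 61))   # prints True
print(should_index_batch(5, 100, 10))   # prints False
```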