Add Some Data

This how-to walks through the onboarding of a dataset into Telicent CORE. You can build data pipelines within Telicent CORE using telicent-lib, our Python SDK for moving data through Telicent CORE. You must install telicent-lib in your Python environment before writing your first adapter, mapper or projector.

The stages outlined below cover getting data into CORE (an adapter) and converting that data to an ontology (in this case IES) using a mapping. There is no need to write a projector here, as CORE already has a Smart-Cache GRAPH which has its own projector. If you’re running the enterprise version of CORE, the mapped data will also be picked up and indexed by CORE. The stages are:

create sample dataset
write an adapter to bring the data into CORE
write a mapper to convert the data and push to the KNOWLEDGE topic

From there, the converted data will be picked up by the Smart-Caches so you can query it.

Sample source data

[
    {
        "id":"1",
        "name":"Professor Dumbledore",
        "worksFor":"Hogwarts School"
    },
    {
        "id":"2",
        "name":"Professor Snape",
        "worksFor":"Hogwarts School"
    },
    {
        "id":"3",
        "name":"Lord Voldemort",
        "worksFor":"Death Eaters",
        "educatedAt":"Hogwarts School"
    },
    ...
]
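To create the sample dataset as a file that an adapter can read, something like the following would do. This is a minimal sketch using only the standard library: the file name `sample_data.json` and the helper names `write_sample_dataset` and `load_sample_dataset` are illustrative assumptions, not part of telicent-lib.

```python
import json
from pathlib import Path

# The three records shown above; extend the list as needed.
SAMPLE_RECORDS = [
    {"id": "1", "name": "Professor Dumbledore", "worksFor": "Hogwarts School"},
    {"id": "2", "name": "Professor Snape", "worksFor": "Hogwarts School"},
    {
        "id": "3",
        "name": "Lord Voldemort",
        "worksFor": "Death Eaters",
        "educatedAt": "Hogwarts School",
    },
]


def write_sample_dataset(path):
    """Write the sample records to `path` as a JSON array."""
    Path(path).write_text(json.dumps(SAMPLE_RECORDS, indent=4), encoding="utf-8")


def load_sample_dataset(path):
    """Read the JSON array back as a list of dictionaries."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    write_sample_dataset("sample_data.json")
    print(len(load_sample_dataset("sample_data.json")), "records written")
```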

Getting data into CORE

Adapters are components that bring data into CORE from external sources, i.e. they import data into a Kafka topic. There are two types of Adapters in telicent-lib: an Adapter and an AutomaticAdapter. The difference between the two is explained here. In the example below, we’ll build an Adapter.

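from time import sleep

# Import paths assume the standard telicent-lib package layout.
from telicent_lib import Adapter, Record
from telicent_lib.config import Configurator
from telicent_lib.sinks import KafkaSink

config = Configurator()

target_topic = config.get("TARGET_TOPIC", required=True,
                    description="Specifies the Kafka topic the adapter pushes its output to", default="example-topic")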

def get_json_data():
    # ... read in and process data
    return serialized_data

def run(adapter):
    adapter.run()
    while True:
        data = get_json_data()
        # ... more work
        record = Record(headers, None, data, None)
        adapter.send(record)
        sleep(10)
    adapter.finished()

sink = KafkaSink(target_topic)
adapter = Adapter(
    target=sink,
    name="Example Adapter"
)

run(adapter)

This basic structure sets up the polling of a data source and releases a Record to the Kafka Sink. The target_topic is the topic the Adapter sends data to. In most cases, this will be a raw topic from which a mapper then picks up the data and processes it further. For simplicity, in this example we’ll look at data going straight into the knowledge topic.

The Knowledge Topic

This topic is reserved within Telicent CORE. It represents the gold standard of data in CORE. By default this data is held within the IES ontology (though other ontologies can be used) and is represented in RDF (whether formatted in Turtle, N-Triples or another valid representation). This is the topic that Smart-Caches collect data from and ingest into their underlying storage. It should be highly controlled and continually validated.
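The mapper stage is what turns source records into RDF for this topic. As a rough illustration of the conversion only, the sketch below turns one sample record into N-Triples using plain string formatting. The `http://example.com/data#` and `http://example.com/vocab#` namespaces and the `Person`, `Organisation`, `name`, `worksFor` and `educatedAt` terms are placeholders invented for this example. A real mapping would use terms and modelling patterns from the IES ontology, and would run inside a telicent-lib mapper rather than as a standalone function.

```python
import json
from urllib.parse import quote

# Placeholder namespaces, assumed for illustration only (these are not IES).
DATA_NS = "http://example.com/data#"
VOCAB_NS = "http://example.com/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(namespace, local_name):
    """Build an N-Triples IRI term, percent-encoding the local name."""
    return f"<{namespace}{quote(local_name, safe='')}>"


def literal(value):
    """Build an N-Triples string literal; json.dumps escapes quotes and backslashes."""
    return json.dumps(value, ensure_ascii=False)


def map_person(record):
    """Convert one source record into a list of N-Triples lines."""
    person = uri(DATA_NS, f"person_{record['id']}")
    triples = [
        f"{person} <{RDF_TYPE}> {uri(VOCAB_NS, 'Person')} .",
        f"{person} {uri(VOCAB_NS, 'name')} {literal(record['name'])} .",
    ]
    # Organisations are identified by name here; educatedAt is optional.
    for key in ("worksFor", "educatedAt"):
        if key in record:
            org = uri(DATA_NS, record[key].replace(" ", "_"))
            triples.append(f"{org} <{RDF_TYPE}> {uri(VOCAB_NS, 'Organisation')} .")
            triples.append(f"{person} {uri(VOCAB_NS, key)} {org} .")
    return triples


if __name__ == "__main__":
    source = {"id": "1", "name": "Professor Dumbledore", "worksFor": "Hogwarts School"}
    print("\n".join(map_person(source)))
```

In a pipeline, the joined lines would form the data of the Record sent on to the knowledge topic, alongside the headers carried over from the incoming Record.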

Registering a Dataset

Protecting Data in CORE

CORE makes use of Attribute Based Access Control (ABAC). This is enforced on every query for data in the platform. As a matter of principle, data brought into CORE requires a Security-Label. This label represents the attributes a user must have to access the data. Each piece of data must carry this label everywhere it goes within the platform. To assist in the creation of the label, we have created an additional module called label-builder.

Using the Information Data Header (IDH) Label

The IDH is a way to describe the handling requirements for a piece of data; more can be read here. Each piece of data in the platform must have an IDH (or a similar label format) and its equivalent Security-Label.

The label-builder module has an IDHModel, which maps the IDH object to a Security-Label. This label will then be enforced within applications. Messages on CORE must contain a Security-Label and should have a policyInformation header.

Applying Headers to Data in CORE

The Adapter component applies a specific set of headers to the Records produced. These include:

Request-Id, a unique identifier for the Record within CORE
Exec-Path, a record of the processes the data has been through
Content-Type, the content type held within the data

From this point, the engineer is responsible for setting the Security-Label and the policyInformation.

If the ingested data contains a data handling label, the label builder can be used to create a model that maps the label from an object to a valid Security-Label. In some cases the data will not have a handling label, so one must be created: every piece of data in CORE must be passed with a Security-Label and, particularly when the federation capability is being used, some policyInformation.

If we go back to our example:

data_source_base_idh = {
    "apiVersion": "v1alpha",
    "containsPii": False,
    "dataSource": "Example Data Source",
    "ownership": {
        "originatingOrg": "Telicent"
    },
    "access": {
        "classification": "O",
        "allowedOrgs": ["Telicent"],
        "allowedNats": ["GBR"],
        "groups": ["urn:telicent:groups:example"]
    }
}


def create_data_idh():
    idh = data_source_base_idh.copy()
    idh["uuid"] = generate_some_uuid()
    idh["creationDate"] = datetime.now()
    return {"idh" : idh}

def run(adapter):
    adapter.run()
    while True:
        # ... get data
        data_label = create_data_idh()
        headers = RecordUtils.to_headers({
            "policyInformation": data_label,
            "Security-Label": IDHModel(**data_label['idh']).build_security_labels()
        })
        record = Record(headers, None, data, None)
        adapter.send(record)
        sleep(10)
    adapter.finished()

An example of the CORE record which this kind of pattern would create is:

record:
  headers:
    - Request-Id: "example-topic:da75703d-811e-4fad-ad90-a5ad988b5cce"
    - Content-Type: "application/json"
    - Exec-Path: "Example Adapter"
    - policyInformation:
        - idh: { ... }
    - Security-Label: "classification=O&(permitted_nationalities=GBR)&(permitted_organisations=Telicent)&(urn:telicent:groups:example:and)"
  data: |
    "<serialised_json>"
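Two details of `create_data_idh` in the adapter example are worth a second look: `dict.copy()` is shallow, so the nested `access` lists are shared with the base dictionary, and a `datetime` object cannot be serialised to JSON as-is. A variant that avoids both, using only the standard library, might look like the sketch below. Here `uuid.uuid4()` stands in for the unspecified `generate_some_uuid()`, and the ISO 8601 string used for `creationDate` is an assumption about the expected format.

```python
import copy
import json
import uuid
from datetime import datetime, timezone

data_source_base_idh = {
    "apiVersion": "v1alpha",
    "containsPii": False,
    "dataSource": "Example Data Source",
    "ownership": {"originatingOrg": "Telicent"},
    "access": {
        "classification": "O",
        "allowedOrgs": ["Telicent"],
        "allowedNats": ["GBR"],
        "groups": ["urn:telicent:groups:example"],
    },
}


def create_data_idh():
    # Deep copy so nested dicts and lists are never shared with the base.
    idh = copy.deepcopy(data_source_base_idh)
    # uuid4 is a stand-in for generate_some_uuid() (assumption).
    idh["uuid"] = str(uuid.uuid4())
    # Assumed format: timezone-aware ISO 8601 string.
    idh["creationDate"] = datetime.now(timezone.utc).isoformat()
    return {"idh": idh}


if __name__ == "__main__":
    print(json.dumps(create_data_idh(), indent=2))
```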

Purpose of Data Labelling

Data labelling is fundamental to Telicent CORE and its ability to do fine-grained data access control. The Security-Label represents a policy which the user must satisfy when requesting access to a piece of data.

Telicent ACCESS is where user entitlements can be stored.
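To make the idea of satisfying a policy concrete, the sketch below checks a user's attributes against the `access` block of an IDH. It is a conceptual illustration only and is not how CORE evaluates a Security-Label. The `UserAttributes` type, the `can_access` function and the rule semantics (nationality and organisation must each appear in the allowed list, and the user must belong to every listed group) are assumptions made for this example. Classification is left out because its ordering is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class UserAttributes:
    """Hypothetical stand-in for the entitlements held for a user."""
    nationality: str
    organisation: str
    groups: set = field(default_factory=set)


def can_access(user, access):
    """Return True only if the user satisfies every assumed rule in `access`."""
    return (
        user.nationality in access["allowedNats"]
        and user.organisation in access["allowedOrgs"]
        and set(access["groups"]) <= user.groups
    )


if __name__ == "__main__":
    access = {
        "allowedOrgs": ["Telicent"],
        "allowedNats": ["GBR"],
        "groups": ["urn:telicent:groups:example"],
    }
    member = UserAttributes("GBR", "Telicent", {"urn:telicent:groups:example"})
    outsider = UserAttributes("GBR", "Telicent")
    print(can_access(member, access), can_access(outsider, access))
```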

A Note on Groups

Groups can be created and Users can be added to Groups within ACCESS. A Group is created with the following attributes:

Name: This is the way the group is identified by an administrator. When a name is provided it is used within the Group URN, which is used as part of the Data Access decision.
Description: This is a description of the group, to explain to administrators its purpose, scope or use within the system.

Example

Creating a Group with:

Name: “telicent_developers”
Description: “A group to represent all developers at Telicent within CORE”

produces the following group:

group:
  - active: true
  - description: "A group to represent all developers at Telicent within CORE"
  - label: "telicent_developers"
  - group_id: "urn:telicent:groups:telicent_developers"

NB: The group_id is the string which should be used as part of the IDH group field.
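Going by the example above, the group URN appears to be the group name appended to the `urn:telicent:groups:` prefix. A small helper can keep IDH group fields consistent with that pattern. The name `group_urn` is ours and is not part of ACCESS or label-builder; treat the value returned by ACCESS as authoritative if the two ever differ.

```python
GROUP_URN_PREFIX = "urn:telicent:groups:"


def group_urn(name):
    """Build a group URN from a group name, following the example pattern."""
    if not name:
        raise ValueError("group name must not be empty")
    return f"{GROUP_URN_PREFIX}{name}"


if __name__ == "__main__":
    # Use the URN in the IDH access block.
    access = {"groups": [group_urn("telicent_developers")]}
    print(access)
```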