Common Graph.Build Architectures

Graph.Build is a simple, scalable platform for getting all your data into any Graph Database. Each of the Transformers is designed to run as part of a larger end-to-end system, with the end result of your data being uploaded into a Semantic Knowledge Graph or Property Graph. As part of this process, Kafka message queues are used to communicate between services.

Each use case requires a different combination of Transformers and Writers; here are some of the most common solutions.

Ingesting XML/CSV/JSON files

Semi-Structured Transformer + Kafka + Graph Writer

Once you have your Semi-Structured Transformer and your Graph Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system and what is required to ingest your semi-structured files (CSV/XML/JSON) into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:

Source File System → Kafka → Semi-Structured Transformer → Kafka → Graph Writer → Graph Database

The first thing to determine is where your source data files are stored. Whether they live locally, remotely, or in an S3 Bucket, the Semi-Structured Transformer must be told where to retrieve these files from. Here we utilise message queues in the form of Apache Kafka. By setting up a Kafka Producer that publishes to the topic specified in the Transformer’s KAFKA_TOPIC_NAME_SOURCE config variable (defaults to “source_urls”), you can send file URLs directly to the Transformer. Once set up, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-data.csv. Additionally, if you are using Kafka and S3 Buckets, you can use our AWS Lambda to automatically push a message to the Kafka Queue whenever a new file is uploaded to your S3 Bucket, which in turn will automatically trigger the Transformer to ingest and transform this data.
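As a concrete illustration, here is a minimal sketch of such a Producer using the kafka-python client. The broker address and the file URL are placeholders for your own environment; “source_urls” is the documented default value of KAFKA_TOPIC_NAME_SOURCE.

```python
from kafka import KafkaProducer

# Minimal sketch: publish a single file URL to the Transformer's source topic.
# "localhost:9092" and the S3 URL below are placeholders for your own setup;
# "source_urls" is the default KAFKA_TOPIC_NAME_SOURCE topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each message consists solely of the URL of one source file.
producer.send("source_urls", value=b"s3://examplebucket/folder/input-data.csv")
producer.flush()
producer.close()
```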

Once the Semi-Structured Transformer has ingested and transformed your input source data, the generated graph data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. The Transformer will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). The Graph Writer can then pick up the newly published message from the Kafka Queue, ingest the generated graph data file (RDF or CSV), and publish it to the Graph Database specified in your Writer’s GRAPH_DATABASE_ENDPOINT config value.
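If you want to confirm that success messages are arriving before wiring up the Graph Writer, a throwaway consumer such as the sketch below can be pointed at the success topic. The broker address is a placeholder, and “success_queue” is the documented default topic name; in production the Graph Writer itself consumes these messages, so this is purely a debugging aid.

```python
from kafka import KafkaConsumer

# Debugging sketch: print whatever the Transformer publishes to the success
# topic. In a running deployment the Graph Writer consumes these messages;
# this is only a quick way to inspect the queue.
consumer = KafkaConsumer(
    "success_queue",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value.decode("utf-8"))
```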

When provenance data is generated in the Semi-Structured Transformer, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Graph Database, an additional Graph Writer is required; it should be configured so that its Kafka topic points to the separate provenance topic specified in the Transformer.
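The sketch below is purely illustrative of how these settings relate to one another when provenance gets its own Writer. The config variable names come from this guide; the values, the topic name “prov_success_queue”, and the database endpoint are placeholders for your own deployment.

```python
# Illustrative only: main output vs. provenance output for the Transformer.
transformer_config = {
    "OUTPUT_DIR_URL": "s3://examplebucket/output/",           # main graph data
    "KAFKA_TOPIC_NAME_SUCCESS": "success_queue",              # main Writer listens here
    "PROV_OUTPUT_DIR_URL": "s3://examplebucket/provenance/",  # provenance RDF
    "PROV_KAFKA_TOPIC_NAME_SUCCESS": "prov_success_queue",    # placeholder topic name
}

# The additional Graph Writer for provenance is pointed (via its own Kafka
# topic setting) at the provenance topic above, and at the separate database.
provenance_writer_config = {
    "GRAPH_DATABASE_ENDPOINT": "http://provenance-db:7200",   # placeholder endpoint
}
```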

Ingesting from a SQL Database

SQL Transformer + Kafka + Graph Writer

Once you have your SQL Transformer and Graph Writer up and running, the example below describes an end-to-end, enterprise-ready, highly scalable system and what is required to ingest data from your Relational SQL Databases into your Knowledge or Property Graph. The intended flow of your data through the system is as follows:

Relational SQL Database → Cron Scheduler / API Endpoint → SQL Transformer → Kafka → Graph Writer → Graph Database

As seen in the SQL Transformer User Guide, the connection to your database lies within the mapping files that you have created. The Transformer can be triggered to start ingesting data from your database in three ways. The first is to trigger the SQL Transformer directly from Graph.Build. Another is to use the exposed API Endpoint: simply send a GET request to the Transformer, for example, http://<Transformer-ip>:<Transformer-port>/process. The final way is to use a Cron Expression to set up a time-based job scheduler, which will schedule the Transformer to ingest your specified data from your database(s) periodically at fixed times, dates, or intervals.
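For the API Endpoint option, the call is an ordinary HTTP GET, as in the minimal sketch below; the host and port are placeholders for wherever your SQL Transformer is running. For the scheduler option, a standard cron expression such as 0 2 * * * (every day at 02:00) describes when the Transformer should run.

```python
import requests

# Minimal sketch: trigger the SQL Transformer's /process endpoint.
# Replace the host and port with your Transformer's actual address.
response = requests.get("http://transformer-host:8080/process")
response.raise_for_status()
print("Transformation triggered, status:", response.status_code)
```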

Once the SQL Transformer has ingested and transformed your input source data, the generated graph data will be uploaded to the output directory location specified in the OUTPUT_DIR_URL config option. The Transformer will also push a success message to the Kafka Topic specified in KAFKA_TOPIC_NAME_SUCCESS (defaults to “success_queue”). The Graph Writer can then pick up the newly published message from the Kafka Queue, ingest the generated graph data file, and publish it to the Graph Database specified in your Writer’s GRAPH_DATABASE_ENDPOINT config value.

When provenance data is generated in the SQL Transformer, you have the option to specify a separate output directory location for the generated provenance RDF and a separate Kafka Success Queue, via the PROV_OUTPUT_DIR_URL and PROV_KAFKA_TOPIC_NAME_SUCCESS config options respectively. If you wish to have your provenance data uploaded to a separate Graph Database, an additional Graph Writer is required; it should be configured so that its Kafka topic points to the separate provenance topic specified in the Transformer.
