Configuring the Transformer

Each Transformer has a wide array of user configuration options, all of which can be set and altered both before a Transformer starts up and while it is running. The former is done through environment variables in your Docker container or ECS Task Definition, and the latter through exposed endpoints, as shown below, or through the Transformer configurations section of Graph.Build Studio. For a breakdown of every configuration option in the SQL Transformer, see the full list here.
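As a minimal sketch of both approaches (the image name, port, and endpoint request shape shown here are assumptions for illustration; substitute your own values):

```bash
# Set configuration at startup via environment variables (Docker);
# the image name and port mapping are placeholders.
docker run -d \
  -e TRANSFORMER_LICENCE="your-licence-key" \
  -e MAPPINGS_DIR_URL="s3://example-bucket/sqltransformer/mapping/" \
  -p 8080:8080 \
  graphbuild/sql-transformer

# Alter configuration on the running Transformer via its exposed
# endpoints (query-parameter form assumed; see the endpoint docs).
curl -X POST "http://localhost:8080/updateConfig?SQL_LIMIT=10000"
```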

Configuration Access

Accessing the Config

Once a Transformer has started and is operational, you can view the current configuration by calling the /config endpoint. This is expanded upon below, including how to request individual config properties.
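For example (host and port are placeholders; the property-query form is an assumption — see the endpoint documentation for the exact shape):

```bash
# Fetch the full current configuration of a running Transformer.
curl "http://localhost:8080/config"

# Request a specific config property (query form assumed).
curl "http://localhost:8080/config?property=SQL_LIMIT"
```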

Editing the Config

As explained below, the configuration on a running Transformer can be edited through the /updateConfig endpoint.
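A minimal sketch of an update call (the key/value query form and HTTP method are assumptions; check the endpoint documentation for the exact request shape):

```bash
# Update a single config property on a running Transformer.
curl -X POST "http://localhost:8080/updateConfig?SQL_LIMIT=5000"
```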

Backup and Restore Config

A useful feature of the Transformer is the ability to back up and restore your configuration. This is particularly beneficial when you’ve made multiple changes to the config on a running Transformer and want to restore them without rerunning any update config commands. To back up your config, simply call the /uploadConfigBackup endpoint, and all changes you’ve made to the config will be uploaded to the storage location specified in your CONFIG_BACKUP env var.

Restoring a configuration must be done at Transformer startup, by setting the CONFIG_BACKUP config option as an environment variable in your startup script / task definition. When running as a task on ECS, this should be a remote directory such as S3. If run locally, ensure the container is mounted to a local volume, otherwise all container storage will be deleted when it is stopped.
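Taken together, a minimal backup-and-restore flow might look like this (host, port, HTTP method, and bucket path are assumptions):

```bash
# 1. Back up the current config of a running Transformer to the
#    directory named in its CONFIG_BACKUP env var.
curl -X POST "http://localhost:8080/uploadConfigBackup"

# 2. Restore at startup by pointing CONFIG_BACKUP at the same remote
#    directory in the new container's environment.
docker run -d \
  -e CONFIG_BACKUP="s3://example-bucket/sqltransformer/config-backup/" \
  graphbuild/sql-transformer
```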

Configuration Categories

Mandatory Configuration (Local Deployment)

  • Transformer License - TRANSFORMER_LICENCE

    • This is the license key required to operate the Transformer. Request your new unique license key here.

Transformer Directories Configuration

  • Transformer Directory - TRANSFORMER_DIRECTORY

    • This is the directory where all Transformer files are stored (assuming the individual file directory config options haven’t been edited). On Transformer startup, if this has been declared, folders will be created at the specified location for mapping, output, yaml-mapping, provenance output, and config backup.

    • By default, this option is set to a local directory within the Docker container (file:///var/local/), so it isn’t mandatory. As with all directories in the Transformer, this can be either local or on a remote S3 bucket - we recommend using S3 when running the Transformer on AWS (for example - s3://example-bucket/sqltransformer/).

  • Mapping Directory URL - MAPPINGS_DIR_URL

    • This is the directory where your mapping file(s) are located. All mapping files within this directory are downloaded and added to the store for processing.

  • Output directory URL - OUTPUT_DIR_URL

    • This is the directory where all generated RDF files are saved to. This also supports local and remote URLs.

  • Provenance Output Directory URL - PROV_OUTPUT_DIR_URL

    • Out of the box, the SQL Transformer supports Provenance, and it is generated by default. Once generated, the Provenance is saved to output files separate from the transformed source data. This option specifies the directory where provenance RDF files are saved, which also supports local and remote URLs.

    • If you do not wish to generate Provenance, you can turn it off by setting the RECORD_PROVO variable to false. In this case, the PROV_OUTPUT_DIR_URL option is no longer required. For more information on Provenance configuration, see below.

  • Config Backup - CONFIG_BACKUP

    • The Transformer supports functionality to back up your configuration for the scenario where you wish to reboot your Transformer. Upon calling the upload config endpoint, your configuration settings will be backed up to the URL directory specified here. It must be a remote directory such as S3 to support rebooting of the Transformer.
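A hedged startup sketch tying these directory options together (the image name, licence value, and bucket layout are assumptions):

```bash
# All directories point at one S3 bucket; each option may instead be a
# local file:// URL as described in the next section.
docker run -d \
  -e TRANSFORMER_LICENCE="your-licence-key" \
  -e TRANSFORMER_DIRECTORY="s3://example-bucket/sqltransformer/" \
  -e MAPPINGS_DIR_URL="s3://example-bucket/sqltransformer/mapping/" \
  -e OUTPUT_DIR_URL="s3://example-bucket/sqltransformer/output/" \
  -e PROV_OUTPUT_DIR_URL="s3://example-bucket/sqltransformer/provenance-output/" \
  -e CONFIG_BACKUP="s3://example-bucket/sqltransformer/config-backup/" \
  graphbuild/sql-transformer
```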

Directories in Transformers

The Transformers are designed to support files and directories from an array of sources. This includes both local URLs and remote URLs including cloud-based technologies such as AWS S3. The location should be expressed as a URL string (Ref. RFC-3986).

  • To use a local URL for directories and files, both the format of file:///var/local/graph-build-output/ and /var/local/graph-build-output/ are supported.

  • To use a remote http(s) URL for files, https://example.com/input-file.csv is supported.

  • To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory you must specify an AWS access key and secret key.

SQL Configuration

The configuration for connecting to your SQL database, including credentials and any SQL queries you may want to execute to retrieve your data, is handled in the mapping file. To aid the performance of the SQL Transformer, additional configuration values have been devised to split up the processing of larger data sets, as shown in the example after this list.

  • SQL Limit - SQL_LIMIT

    • The SQL Limit, when set to a value less than the total number of records in your database, forces the Transformer to execute an iterative processing operation. The max number of records that is processed in any one iteration is determined by this configuration option. The Transformer will batch process the records from the query and output multiple RDF files until the end of the database is reached.

    • This value must be an integer greater than zero. It defaults to zero, meaning that iterative queries are switched off.

  • SQL Offset - SQL_OFFSET

    • The SQL Offset provides the ability to offset the start index of the iterative processing. This defaults to zero.

  • Concurrent Threads - CONCURRENT_THREADS

    • When the Transformer is executing an iterative query, these iterations will be run in parallel. This option allows you to specify the number of threads that will be used for concurrent iterative executions. If you specify a number of threads greater than what is available, the maximum number will be used.

  • Continue on Error - CONTINUE_ON_ERROR

    • If there is an error during an iteration, this determines whether the execution will continue or halt. This defaults to false.
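For illustration, a batched run over a large table might be configured like this (the values and image name are assumptions; tune them to your data set):

```bash
# Process at most 10,000 records per iteration, starting from record 0,
# run 4 iterations in parallel, and continue past per-iteration errors.
docker run -d \
  -e SQL_LIMIT=10000 \
  -e SQL_OFFSET=0 \
  -e CONCURRENT_THREADS=4 \
  -e CONTINUE_ON_ERROR=true \
  graphbuild/sql-transformer
```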

AWS Configuration

When running the Transformer in ECS, these settings are not required, as all credentials are taken directly from the EC2 instance running the Transformer. If you wish to use AWS cloud services while running the Transformer on-prem, you need to specify an AWS access key, secret key, and region. Providing your AWS credentials grants permission for accessing, downloading, and uploading remote files to S3 buckets, and the region option specifies where in AWS your files and services reside. The Transformers utilise the AWS Default Credential Provider Chain, allowing a number of methods to be used; the simplest is setting the environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.
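For example, with placeholder values:

```bash
# Picked up by the AWS Default Credential Provider Chain at startup.
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_REGION="eu-west-2"   # the region where your S3 buckets reside
```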

Kafka Configuration

One of the many ways to interface with the Transformer is through the use of Apache Kafka. With the SQL Transformer, a Kafka Message Queue can be used for managing the output of data from the Transformer, and to trigger a transformation. To properly set up your Kafka Cluster, see the instructions here. Once complete, use the following Kafka configuration variables to connect the cluster with your Transformer. If you wish to use Kafka, you must switch it on by setting the variable TRANSFORMER_RUN_STANDALONE to false.

The Kafka Broker is what tells the Transformer where to look for your Kafka Cluster, so set this property as follows: <kafka-broker>:<kafka-port>. The recommended port is 9092.
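A hedged sketch of the Kafka settings (the KAFKA_BROKER variable name is an assumption for illustration — check the full configuration list for the exact name; the host is a placeholder):

```bash
# Switch Kafka on by disabling standalone mode, then point the
# Transformer at the cluster as <kafka-broker>:<kafka-port>.
docker run -d \
  -e TRANSFORMER_RUN_STANDALONE=false \
  -e KAFKA_BROKER="my-kafka-host:9092" \
  graphbuild/sql-transformer
```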

Provenance Configuration

As previously mentioned, Provenance is generated by default; this can be turned off by setting the RECORD_PROVO variable to false. Otherwise, the provenance output files will be stored at the directory specified by PROV_OUTPUT_DIR_URL. If you wish to store this Provenance remotely in an S3 bucket, you are required to specify your region, access key, and secret key, as explained in the AWS Configuration section above.

If you wish to manage the Provenance output files through Kafka, then you can choose to use the same brokers and topic names as with the previously specified data files, or an entirely different cluster.
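For example, to switch Provenance off entirely (the image name is a placeholder):

```bash
# Disable provenance generation; PROV_OUTPUT_DIR_URL is then no longer
# required.
docker run -d \
  -e RECORD_PROVO=false \
  graphbuild/sql-transformer
```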

Logging Configuration

When running the Transformer locally from the command line using the instructions below, the Transformer will automatically log to your terminal instance. In addition, archived logs are saved within the Docker container at /var/log/graphbuild/text/archive/ and /var/log/graphbuild/json/archive/ for text and JSON logs respectively, and the current logs can be found at /var/log/graphbuild/text/current/ and /var/log/graphbuild/json/current/. By default, a maximum of 7 log files will be archived for each file type, however this can be overridden. If running a Transformer in an AWS cloud environment, connect to your instance via SSH or PuTTY, and the previously outlined logging locations apply. Alternatively, configuring CloudWatch Logs is the easiest way to view your Transformer’s live logging.

By default, the Transformer logs at INFO level. This can be changed by overriding the LOG_LEVEL_TRANSFORMER option; as it can only be set on Transformer startup, changing it requires a reboot.
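For example (the DEBUG value, image name, and container reference are placeholders):

```bash
# Raise log verbosity at startup; this option cannot be changed on a
# running Transformer.
docker run -d \
  -e LOG_LEVEL_TRANSFORMER=DEBUG \
  graphbuild/sql-transformer

# List the current text log files from inside the container.
docker exec <container-id> ls /var/log/graphbuild/text/current/
```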

Optional Configuration

There is also a further selection of optional configurations for given situations, see here for the full list.
