Triggering the Transformer / Ingesting Data
The SS Transformer supports a number of ways to ingest your data files. All five supported file types (CSV, XML, JSON, XLSX, and ODS) are ingested in the same way, although there are some additional parameters you may wish to set for CSV and XML, as detailed below.
Executing From The Studio
The SS Transformer can be executed directly from the Studio. Instructions on how to do this can be found in the getting started tutorial here. Slightly different configuration options are required when executing against an LPG model; these can be found here.
When running on a local machine, it can be confusing to know which path to give for your input file. Instructions on the path to use can be found here.
RESTful API Endpoint
Outside of using the Graph.Build Studio, the easiest way to ingest a file into the SS Transformer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a file to ingest, along with the logicalSource corresponding to your mapping; in return, you will be provided with the URL(s) of the generated Graph data file(s).
The structure and parameters for the GET request are as follows: http://<tran-ip>:<tran-port>/process?inputFileURLs=<input-file-url>&logicalSources=<logical-source>. For example: http://127.0.0.1:8080/process?inputFileURLs=file:///var/local/input-data.csv&logicalSources=sample.csv. The response is a success report in the form of a JSON object.
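To illustrate, the sketch below issues the same request programmatically; it assumes Python with the requests library (any HTTP client will do) and the example host, input file, and logical source shown above.

# A minimal sketch, assuming Python's requests library and the example values above.
import requests

response = requests.get(
    "http://127.0.0.1:8080/process",
    params={
        "inputFileURLs": "file:///var/local/input-data.csv",
        "logicalSources": "sample.csv",
    },
)
# The response body is the JSON success report, including the URL(s)
# of the generated Graph data file(s).
print(response.json())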
If you wish to transform an input source file with a specific mapping file, or with the mapping files within a specific directory, you can also pass the URL of your mapping file or mapping directory as part of the GET request. For example: http://<tran-ip>:<tran-port>/process?inputFileURLs=<input-file-url>&logicalSources=sample.csv&mappingURL=<mapping-url>. If no mapping URL is specified in the request, the Transformer will run as normal, importing all mapping files from the mapping directory URL specified in the configuration.
In addition, you may wish to ingest multiple input source files where there are joins in your mapping, creating links between your data. This is done simply by including multiple inputFileURLs and logicalSources params in your request. For example, ingesting two files would look like the following: /process?inputFileURLs=file:///var/local/employees-123.csv&logicalSources=employees.csv&inputFileURLs=file:///var/local/jobRoles-123.xml&logicalSources=jobRoles.xml. This can be extended further by including a specific mapping in the request, in the same way as the example above.
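As a sketch of the two-file request above, again assuming Python with the requests library: passing the parameters as a list of tuples lets the inputFileURLs and logicalSources keys repeat, with each file URL presumably paired with the logical source that follows it, as in the example.

# A sketch of the two-file example above, assuming Python's requests library.
import requests

params = [
    ("inputFileURLs", "file:///var/local/employees-123.csv"),
    ("logicalSources", "employees.csv"),
    ("inputFileURLs", "file:///var/local/jobRoles-123.xml"),
    ("logicalSources", "jobRoles.xml"),
    # Optionally, a specific mapping can be included in the same way:
    # ("mappingURL", "file:///var/local/mappings/sample-mapping.ttl"),  # hypothetical path
]
response = requests.get("http://127.0.0.1:8080/process", params=params)
print(response.json())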
Kafka
The second, and more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here. In short, to ingest files into the SS Transformer you must set up a Producer and connect it to your cluster by setting the KAFKA_BROKERS variable to your Kafka endpoint, and ensure the TRANSFORMER_RUN_STANDALONE configuration is set to false. The topic to which this Producer publishes must have the same name that you specified in the KAFKA_TOPIC_NAME_SOURCE config option (defaults to “source_urls”). Once set up, each message sent from the Producer must be a JSON object containing the URL(s) of the input file(s) and the correlating logical sources, as well as, optionally, a mapping file or mapping directory if you wish to give a specific mapping per process.
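As an illustration, a Producer can be as small as the sketch below, which assumes Python with the kafka-python client, a broker at localhost:9092 and the default topic name; the full message format is given after the example.

# A minimal Producer sketch, assuming the kafka-python client and the default topic name.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # must match your KAFKA_BROKERS setting
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

message = {
    "input": [
        {"inputFileURL": "file:///var/local/input-data.csv", "logicalSource": "sample.csv"}
    ]
    # A "mappingURL" entry may optionally be added to use a specific mapping or mapping directory.
}
producer.send("source_urls", message)  # KAFKA_TOPIC_NAME_SOURCE default
producer.flush()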
The JSON object must be structured as follows:
{
    "input": [
        {"inputFileURL": "", "logicalSource": ""},
        {"inputFileURL": "", "logicalSource": ""}
    ],
    "mappingURL": ""
}
S3 Lambda
If you wish to use Kafka and you are also using S3 to store your source data, we have developed an AWS Lambda to aid with the ingestion of data into your SS Transformer. The Lambda is designed to monitor a specific Bucket in S3; when a file arrives or is modified in a specific directory, a message is written to a specified Kafka Topic containing the URL of the new or modified file, which the Transformer will then ingest. If you wish to use this Lambda, please contact us for more information.
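For illustration only, a hand-rolled function along the same lines might look like the sketch below; it assumes a Python Lambda handler with the kafka-python client packaged as a dependency, and is not the Graph.Build Lambda itself.

# Illustrative sketch only, not the Graph.Build Lambda: forwards S3 object URLs to Kafka.
import json
from urllib.parse import unquote_plus

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumption: replace with your KAFKA_BROKERS endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    # S3 event notifications carry the bucket name and (URL-encoded) object key.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        message = {
            "input": [
                # Assumptions: the Transformer can resolve this URL scheme, and the
                # logical source is derived from the file name to match your mapping.
                {"inputFileURL": f"s3://{bucket}/{key}", "logicalSource": key.split("/")[-1]}
            ]
        }
        producer.send("source_urls", message)  # assumption: default topic name
    producer.flush()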
CSV Splitting and Validation
While ingesting CSV files is the same as with XML and JSON, there are a couple of points to note. A very large CSV file with a large number of rows will be split into chunks and processed separately, by default every 100,000 lines. This allows for better performance and continuous output of RDF files. When processing via Kafka, messages are continuously pushed to the Success Queue; when using the Process endpoint, however, the response will only be returned once the entire file transformation has been completed. The chunk size can be overridden with the configuration option MAX_CSV_ROWS, or chunking can be turned off entirely by setting it to 0; however, this is not recommended unless your machine / instance has a significant amount of RAM.
In addition, CSV files are validated by default before being processed, and any erroneous lines will be removed and not transformed. This has a negligible effect on performance, but can be turned off by setting VALIDATE_CSV to false. Please note that multiline CSV records are not supported by this validation functionality, so turn this feature off if multiline records are present in your dataset.
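For example, assuming these options are passed as environment variables (an assumption; set them in whichever way your deployment supplies configuration), a larger, hypothetical chunk size with validation disabled for a multiline dataset might look like:

MAX_CSV_ROWS=250000
VALIDATE_CSV=false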
XML Parsing
Ingesting XML files is also the same process as with CSV and JSON files; however, there are currently two different parsing methodologies that can be used. This is explained in more detail when you create your model using our guide, but in short, SAXPath can be used for faster transformation speeds, while XPath can be utilised when complex iterators are required, including those that access parent nodes.