Ingesting Graph Data (RDF and CSV)
The Graph Writer is designed to ingest RDF NQuads (.nq) data files for Semantic Graphs, and CSV Nodes and Edges data files for Property Graphs. This ingestion can be done in a number of ways.
Directories in the Writer
When ingesting files, the Graph Writer is designed to support files from an array of sources. This includes both local URLs and remote URLs, including cloud-based storage such as AWS S3. These locations should always be expressed as a URL string (see RFC 3986).
To use a local URL for directories and files, both the file:///var/local/graphbuild-output/ and /var/local/graphbuild-output/ formats are supported.
To use a remote http(s) URL for files, https://example.com/input-rdf-file.nq is supported.
To use a remote AWS S3 URL for directories and files, s3://example/folder/ is supported, where the format is s3://<bucket-name>/<directory>/<file-name>. If you are using an S3 bucket for any directory, ensure that your Writer has the necessary credentials / permissions to access S3.
The Writer also includes the ability to delete your source data files after they have been ingested into your Graph Database. This is done by setting the DELETE_SOURCE config value to true. Enabling this means that your S3 bucket or local file store will not continuously fill up with the RDF / CSV data generated by your Transformers.
Endpoint
The first, and easiest, way to ingest an RDF file into the Graph Writer is to use the built-in APIs. Using the process GET endpoint, you can specify the URL of a data file to ingest and, in return, you will be provided with the success status of the operation.
The structure and parameters for the GET request are as follows: http://<writer-ip>:<writer-port>/process?inputRdfURL=<input-rdf-file-url>, for example, http://127.0.0.1:8080/process?inputRdfURL=/var/local/input/input-rdf-data.nq. Once an input file has been successfully processed via the Process endpoint, the Writer returns a response in the form of a JSON object. The JSON response includes, among other details, the input data URL and the endpoint of the target Graph Database, for example:
{
"input": "/var/local/input/input-rdf-data.nq",
"graphDatabaseEndpoint": "https://graphdb.example.com:443/repositories/test",
"graphDatabaseProvider": "graphdb",
"databaseType": "SEMANTIC_GRAPH"
}
Now, by logging in to your Graph Database and making the necessary queries, you will be able to see the newly inserted source data.
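The same request can also be made programmatically. The sketch below is a minimal illustration using Python and the requests library, with the Writer host, port, and input file path taken from the placeholder example above; substitute your own values.

import requests

# Placeholder values from the example above; replace with your own
# Writer address and input file URL.
WRITER_PROCESS_URL = "http://127.0.0.1:8080/process"
INPUT_RDF_URL = "/var/local/input/input-rdf-data.nq"

response = requests.get(WRITER_PROCESS_URL, params={"inputRdfURL": INPUT_RDF_URL})
response.raise_for_status()

result = response.json()
print(result["input"])                  # URL of the ingested file
print(result["graphDatabaseEndpoint"])  # endpoint of the target Graph Database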
Kafka
The second, and more versatile and scalable, ingestion method is to use a message queue such as Apache Kafka. To set up a Kafka Cluster, follow the instructions here, but in short, to ingest source files into the Graph Writer you require a Producer. The topic that this Producer publishes to must have the same name that you specified in the KAFKA_TOPIC_NAME_SOURCE config option (defaults to “success_queue”). Please ensure that this is the same as the success queue topic name in the Transformers you wish to ingest transformed data from. Once set up, if manually pushing data to Kafka, each message sent from the Producer must consist solely of the URL of the file, for example, s3://examplebucket/folder/input-rdf-data.nq.
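As an illustration only, a Producer that pushes a single file URL could look like the following sketch. It uses the kafka-python client and assumes a broker at localhost:9092 and the default topic name success_queue; adjust both to match your cluster and your KAFKA_TOPIC_NAME_SOURCE value.

from kafka import KafkaProducer

# Assumed broker address; change to match your Kafka cluster.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Each message consists solely of the URL of the file to ingest.
producer.send("success_queue", value=b"s3://examplebucket/folder/input-rdf-data.nq")
producer.flush()
producer.close()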
Dead Letter Queue
If something goes wrong during the operation of the Writer, it will publish a message to the Dead Letter Queue Kafka topic (defaults to “dead_letter_queue”) explaining what went wrong, along with metadata about that ingestion, allowing the problem to be diagnosed and the data later re-ingested. This message will be a JSON object with the following structure:
{
"name": "Graph-Writer",
"time": "2020-04-21T11:41:28.374",
"type": "Graph-Writer",
"error": "Record '/var/local/basic-rdf.xyz' could not be processed due to: uk.co.graphbuild.writer.exception.GraphWriterExecutionException: Input file is of an invalid file type: (.xyz) - Expected nq or csv.",
"version": "2.0.5.0",
"entity": "/var/local/basic-rdf.xyz"
}
Ingestion Mode
Insert Mode
By default, the ingested data is loaded in full into the final graph store, and different datasets are treated as independent of each other. In this mode, the new dataset adds new values to already existing subjects and predicates. This default behaviour is controlled by the INGESTION_MODE parameter. For example:
Existing data:
:mercury foaf:name "Mercury"@en .
:mercury :distance "0.4"^^xsd:float .
:venus foaf:name "Venus"@en .
:venus :distance "0.7"^^xsd:float .
:earth foaf:name "Earth"@en .
:earth :distance "1"^^xsd:float .
New data:
:mercury foaf:name "Merkury"@pl .
:venus foaf:name "Wenus"@pl .
:earth foaf:name "Ziemia"@pl .
Final data:
:mercury foaf:name "Mercury"@en .
:mercury foaf:name "Merkury"@pl .
:mercury :distance "0.4"^^xsd:float .
:venus foaf:name "Venus"@en .
:venus foaf:name "Wenus"@pl .
:venus :distance "0.7"^^xsd:float .
:earth foaf:name "Earth"@en .
:earth foaf:name "Ziemia"@pl .
:earth :distance "1"^^xsd:float .
Update Mode
In update mode (parameter value update), the ingested data replaces existing data. This mode is fully supported by the RDF standard and built into all Semantic Knowledge Graphs. The update mode is demonstrated in the example below.
Existing data:
:mercury foaf:name "Mercury"@en .
:mercury :distance "0.4"^^xsd:float .
:venus foaf:name "Venus"@en .
:venus :distance "0.7"^^xsd:float .
:earth foaf:name "Earth"@en .
:earth :distance "1"^^xsd:float .
New data:
:mercury foaf:name "Merkury"@pl .
:venus foaf:name "Wenus"@pl .
:earth foaf:name "Ziemia"@pl .
Final data:
:mercury foaf:name "Merkury"@pl .
:mercury :distance "0.4"^^xsd:float .
:venus foaf:name "Wenus"@pl .
:venus :distance "0.7"^^xsd:float .
:earth foaf:name "Ziemia"@pl .
:earth :distance "1"^^xsd:float .
Please note that the above only applies to Semantic Graphs; for Property Graphs, the Writer defaults to an upsert pattern, updating the properties for a given node or edge.
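As a conceptual illustration only (not the Writer's implementation), the difference between the two modes for a single subject/predicate pair can be sketched in Python as follows, using purely hypothetical data structures:

# Values already in the graph and values arriving in the new dataset,
# keyed by (subject, predicate).
existing = {(":mercury", "foaf:name"): ['"Mercury"@en']}
incoming = {(":mercury", "foaf:name"): ['"Merkury"@pl']}

def insert_mode(existing, incoming):
    # Insert mode: new values are added alongside the existing ones.
    merged = {key: list(values) for key, values in existing.items()}
    for key, values in incoming.items():
        merged.setdefault(key, []).extend(values)
    return merged

def update_mode(existing, incoming):
    # Update mode: new values replace the existing ones for the same key.
    merged = {key: list(values) for key, values in existing.items()}
    merged.update(incoming)
    return merged

print(insert_mode(existing, incoming))  # keeps both the @en and @pl names
print(update_mode(existing, incoming))  # keeps only the @pl name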
Provenance Data
Within the Transformers, time-series data is supported as standard: every time a Transformer ingests some data, provenance information is added. This means that you have a full record of the data over time, allowing you to see what state the data was in at any moment. The model used to record provenance information is the W3C standard PROV-O model. Currently, the Graph Writer does not generate its own provenance metadata; however, any provenance previously generated will still be ingested into your Knowledge Graph. If you are using Kafka, ensure that your Kafka source topic is correctly configured if your Transformer provenance is pushed to a separate queue from your generated output data. Having provenance pushed to a separate Kafka topic allows a different Graph Writer to be set up, enabling you to push provenance to a separate Knowledge Graph from the RDF generated from your source data.
For more information on how the provenance is laid out, as well as how to query it from your Knowledge Graph, see the Provenance Guide.