Importing Data into Neo4j via CSV

At GrapheneDB, a question we often get asked by users is how to import data. Sample datasets are good, but loading your own data is even better. This article explains how to import data from a CSV file into Neo4j. After outlining the steps to take, we also list some special considerations for GrapheneDB users.

Starting with version 2.1, Neo4j includes a LOAD CSV Cypher clause for data import, which is a powerful ETL tool:

  • It can load a CSV file from the local filesystem or from a remote URI (e.g. S3, Dropbox, GitHub)
  • It can perform multiple operations in a single statement
  • It can be combined with USING PERIODIC COMMIT to group operations on multiple rows into transactions, making it possible to load large amounts of data
  • Input data is mapped directly into a complex graph structure as outlined by the user
  • It’s possible to manipulate or compute values at runtime
  • It allows merging existing data (nodes, relationships, properties) rather than just adding it to the store
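
To illustrate the last three points, a single LOAD CSV statement can compute values at runtime and merge several nodes and relationships per row. A minimal sketch, assuming a hypothetical purchases.csv with email, sku and quantity columns:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///tmp/purchases.csv" AS row
MERGE (u:User { email: row.email })
MERGE (p:Product { sku: row.sku })
MERGE (u)-[r:PURCHASED]->(p)
SET r.quantity = toInteger(row.quantity);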

Here are the steps to take to successfully import your data to your database via CSV:

Have your graph data model ready

Before running the import process, you will need to know how you want to map your data onto the graph: what the nodes and relationships are, and which properties they will have.
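
For example, the users.csv file used in the examples below could look like this (the rows shown are made up):

email;username;name
jdoe@example.com;jdoe;John Doe
asmith@example.com;asmith;Anna Smith

A natural mapping is one User node per row, with the email, username and name columns stored as node properties.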

Tune cache and heap configuration

Make sure to increase the heap size generously, especially if importing large datasets, and also make sure the file buffer caches fit the entire dataset.

You can estimate the size of your dataset on disk after the import by using the table in the official Neo4j docs.

Let’s assume you are going to store 100K nodes, 1M relationships, and one fixed-size property per node/relationship (e.g. an integer):

Node store: 100,000 × 15 B = 1.5 MB
Relationship store: 1,000,000 × 34 B = 34 MB
Property store: 1,100,000 × 41 B = 45.1 MB

Those are the minimum values you should use in your file buffer cache configuration.
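
On a self-hosted Neo4j 4.x instance, both settings live in neo4j.conf (newer versions rename them under the server.memory namespace). A minimal sketch, with sizes that are purely illustrative:

# neo4j.conf
# Heap for the import transactions
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
# Page cache should at least fit the store sizes estimated above
dbms.memory.pagecache.size=1g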

Set up indexes and constraints

Indexes will make lookups faster during and after the load process. Make sure to include an index for every property used to locate nodes in MERGE queries.

An index can be created with the CREATE INDEX clause. Example:

CREATE INDEX FOR (n:Label) ON (n.property)

If a property must be unique, adding a constraint will also implicitly create an index. For example, if you want to make sure you don’t store any duplicate user nodes, you could use a constraint on the email property.

CREATE CONSTRAINT FOR (u:User) REQUIRE u.email IS UNIQUE;
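
On Neo4j 4.2 or later, you can verify that the indexes and constraints are in place before starting the load:

SHOW INDEXES;
SHOW CONSTRAINTS;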

Loading and mapping data

The easiest way to load data from CSV is to use the LOAD CSV statement. It supports options such as accessing fields via column headers or column indexes and configuring the field terminator character. Please refer to the official Neo4j docs for further details.

To speed up the process, make sure to use USING PERIODIC COMMIT, which will group operations on multiple rows (1000 by default) into transactions and reduce the number of times Neo4j has to hit the disk to commit the changes.

LOAD CSV WITH HEADERS FROM "file:///tmp/users.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (u:User { email: csvLine.email })
ON CREATE SET u.username = csvLine.username, u.name = csvLine.name;
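
The same load can be wrapped in USING PERIODIC COMMIT to commit every 1,000 rows (the batch size is optional):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///tmp/users.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (u:User { email: csvLine.email })
ON CREATE SET u.username = csvLine.username, u.name = csvLine.name;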

Please note that values are read as strings, so make sure you do format conversions where appropriate, e.g. toInteger(csvLine.column) when loading integer numbers.

The load process can be run from the Neo4j shell, either interactively or by loading the Cypher code from a file using the -file filename option.
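
For example, assuming the statements above are saved in a file called import.cql and the instance runs locally without authentication:

neo4j-shell -file import.cql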

Alternatively, the code can be entered manually into the shell or the browser UI.

Considerations for GrapheneDB users

A few considerations when loading data into your GrapheneDB Neo4j instance:

  • Page cache can be configured on DS2 and higher plans, and the heap will be adjusted automatically; both are fixed on the DS1 plan.
  • neo4j-shell does not support authentication and thus it can’t be used to load data into an instance hosted on GrapheneDB or otherwise secured with authentication credentials.
  • When running the command from the browser UI, bear in mind that Neo4j won’t be able to access your local filesystem. You should provide a publicly accessible URL instead, e.g. a file hosted on AWS S3 (see the example after this list).
  • For larger datasets, we recommend running the import process locally and once completed, performing a restore on your GrapheneDB instance.
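
For instance, a load from a publicly accessible S3 URL would look like this (the bucket and file names are made up):

LOAD CSV WITH HEADERS FROM "https://my-bucket.s3.amazonaws.com/users.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (u:User { email: csvLine.email })
ON CREATE SET u.username = csvLine.username, u.name = csvLine.name;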

Please feel free to contact our support team if you are having issues loading data into your GrapheneDB instance; we're happy to help.