I'm trying to load JSON documents with a complex structure into Neo4j, so I cannot use the CSV format.
Can someone please tell me how to load 1000+ JSON documents using APOC?
I know how to load one document; I'm trying to find out how to load the 1000 documents in a loop.
If you want to import from multiple JSON documents serially, you can do something like this (assuming that urls is passed to the Cypher query as a parameter):
UNWIND $urls AS url
CALL apoc.load.json(url) YIELD value
...
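For example, a complete version of that pattern might look like the sketch below. The Document label, the id and title properties, and the example file URLs are placeholders rather than anything from your data:
// Hypothetical parameter value, set in Neo4j Browser before running the query
:param urls => ['file:///doc1.json', 'file:///doc2.json', 'file:///doc3.json']
UNWIND $urls AS url
CALL apoc.load.json(url) YIELD value
// Merge one node per document; adjust the label and properties to match your JSON
MERGE (d:Document {id: value.id})
SET d.title = value.title
RETURN count(d) AS documentsImported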
If it makes sense to perform the imports concurrently (e.g., there is no risk of deadlocks -- because different files would not write-lock the same set of relationships or nodes), you could consider using one of the APOC periodic execution procedures.
Assuming that your files are in the "import" directory, the following Cypher should work:
CALL apoc.load.directory()
YIELD value as files
UNWIND files as file
CALL apoc.periodic.iterate('
CALL apoc.load.json($file)
YIELD value as line
', '
CREATE (pc:SomeNode{type: line.type}) //Process line here
', {batchSize:10000, parallel:true, params: {file: file}})
YIELD batches, total, updateStatistics
RETURN batches, total, updateStatistics
I am performing a daily load of 100k+ JSON files into a Neo4j database, which takes approximately 2 to 3 hours each day.
I would like to know whether Neo4j would run quicker if the files were all rolled into one large file and then iterated through by the database.
I will need to learn how to do this in Python if so, but I would just like to know before embarking on the work.
Here is the current code snippet I use to load the files; the range can change each day because the filenames are generated from IDs in the JSON records.
UNWIND range(215300000,215457000) as id
WITH DISTINCT id+"_20220103.json" as file
CALL apoc.load.json("file:///output/"+file,null, {failOnError:false})
YIELD value
Thank you!
The JSON construction in Python was updated to combine all 150k+ JSON objects into one file, and the Cypher was updated to iterate over that file and run the code against each JSON object. I initially tried a batch size of 1000 and then 100, but they resulted in many lock exceptions where the code must have been attempting to update the same nodes at the same time, so I reduced the batch size to 1. It now loads about 99% of the JSON objects on a first pass in 7 minutes... much better than the initial 2 to 3 hours :-)
Code I am now using:
CALL apoc.periodic.iterate(
'CALL apoc.load.json("file:///20220107.json") YIELD value',
'UNWIND value as item.... perform other actions...
',{ batchSize:1, parallel:true})
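For anyone reusing this pattern, a fuller sketch of the iterate call might look like the following. The SomeNode label and the item.id / item.type properties are placeholders standing in for the processing that was elided above, not the actual statements from my load:
CALL apoc.periodic.iterate(
  'CALL apoc.load.json("file:///20220107.json") YIELD value',
  'UNWIND value AS item
   MERGE (n:SomeNode {id: item.id})   // placeholder node shape
   SET n.type = item.type',
  {batchSize:1, parallel:true})       // batchSize 1 avoided the lock exceptions described above
YIELD batches, total, failedBatches
RETURN batches, total, failedBatches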
I'm trying to use ADF for the following scenario:
a JSON is uploaded to an Azure Storage Blob, containing an array of similar objects
this JSON is read by ADF with a Lookup Activity and uploaded via a Web Activity to an external sink
I cannot use the Copy Activity, because I need to create a JSON payload for the Web Activity, so I have to look up the array and paste it in like this (the payload of the Web Activity):
{
"some field": "value",
"some more fields": "value",
...
"items": #{activity('GetJsonLookupActivity').output.value}
}
The Lookup Activity has a known limitation: an upper limit of 5000 rows at a time. If the JSON is larger, only the top 5000 rows will be read and everything else will be ignored.
I know this, so I have a system that chops payloads into chunks of 5000 rows before uploading to storage. But I'm not the only user, so there's a valid concern that someone else will try uploading bigger files and the pipeline will silently pass with a partial upload, while the user would obviously expect all rows to be uploaded.
I've come up with two concepts for a workaround, but I don't see how to implement either:
Is there any way for me to check if the JSON file is too large and fail the pipeline if so? The Lookup Activity doesn't seem to allow row counting, and the Get Metadata Activity only returns the size in bytes.
Alternatively, the MSDN docs propose a workaround of copying data in a foreach loop. But I cannot figure out how I'd use Lookup to first get rows 1-5000 and then 5001-10000 and so on from a JSON. It's easy enough with SQL using OFFSET N FETCH NEXT 5000 ROWS ONLY, but how do I do it with JSON?
You can't set an index range (1-5000, 5001-10000) when you use the Lookup Activity. In my opinion, the workaround mentioned in the doc doesn't mean you can use the Lookup Activity with pagination.
My workaround is writing an Azure Function to get the total length of the JSON array before the data transfer. Inside the Azure Function, divide the data into separate temporary sub-files, paginated like sub1.json, sub2.json, ... Then output an array containing the file names.
Grab that array with a ForEach Activity and execute the Lookup Activity inside the loop; the file path can be set as a dynamic value. Then run the Web Activity next.
Surely my idea could be improved. For example, if you get the total length of the JSON array and it is under the 5000-row limit, you could just return {"NeedIterate": false}. Evaluate that response with an If Condition Activity to decide which way to go next: if the value is false, execute the Lookup Activity directly. Everything can be split across the branches.
Suppose I have a CSV file with data in the format (Subject, relation, Object).
Is it possible to load this into Neo4j as a graph modeled so that the subject and object become nodes and the relation between them becomes the relationship from the triple?
Essentially, while loading from the CSV, I want to load the subject and object as individual nodes, with the relation as the relationship joining them:
(subject)-[:relation]->(object)
My CSV is in the format:
ent1,state,ent2
a,is,b
.
.
.
Yes, it's possible. You need to install the APOC plugin in Neo4j and then use apoc.merge.relationship.
Refer to the following query to load the data, and add/modify the required details in the query:
LOAD CSV FROM "file:///path-to-file" AS line
MERGE (sub:Subject {name:line[0]})
MERGE (obj:Object {name:line[2]})
WITH sub, obj, line
CALL apoc.merge.relationship(sub,line[1],{},{},obj) YIELD rel
RETURN COUNT(*);
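As a quick sanity check after the load, something like the following should show each triple as a relationship (for the sample row a,is,b above it should return a / is / b):
MATCH (s:Subject)-[r]->(o:Object)
RETURN s.name AS subject, type(r) AS relation, o.name AS object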
I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (x:line.label)
...but I get an invalid syntax error. Is there any way to do this?
bicpence,
First off, this is pretty easy to do with a Java batch import application, and they aren't hard to write. See this batch inserter example. You can use opencsv to read your CSV file.
If you would rather stick with Cypher, and if you have a finite set of labels to work with, then you could do something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (n:load {lab:line.label, prop:line.prop});
CREATE INDEX ON :load(lab);
MATCH (n:load {lab:'label1'})
SET n:label1
REMOVE n:load
REMOVE n.lab;
MATCH (n:load {lab:'label2'})
SET n:label2
REMOVE n:load
REMOVE n.lab;
Grace and peace,
Jim
Unfortunately not, parameterized labels are not supported
Chris
You can do a workaround: create all the nodes with a temporary label, then filter on them to create the desired nodes, and then remove the old nodes.
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (tmp:tmp {label: line.label})
WITH tmp
CREATE (x:Person {name: tmp.label})
WITH tmp
DELETE tmp
Paste this into http://console.neo4j.org to see an example:
LOAD CSV WITH HEADERS FROM "http://docs.neo4j.org/chunked/2.1.2/csv/import/persons.csv" AS csvLine
CREATE (p:tmp { id: toInt(csvLine.id), name: csvLine.name })
WITH p
CREATE (pp:Person { name: labels(p)[0]})
WITH p, pp
DELETE p
RETURN pp
I looked around at a few questions like this and concluded that a nice, concise way to handle this kind of frustration (not being able to easily add dynamic labels through LOAD CSV) is simply to use your favorite programming language to read the CSV lines and produce a text file of Cypher statements that will build the Neo4j node/edge structure you want. You can then also edit the text file directly to customize the commands further.
I personally used Java, since I am most comfortable with Java. I read each line of the CSV into a custom object that represents a row in my CSV file, then printed to a file the Cypher statement I wanted for that row. After that, all I had to do was cut and paste those commands into the Neo4j Browser command line.
This way you can build your commands however you want, and you can completely avoid the limitations of the LOAD CSV command in Cypher.
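For example (purely illustrative, with made-up labels and properties), the generated file could contain one statement per CSV row, with the label baked directly into each statement:
// Hypothetical generator output for two CSV rows whose label column is Person and Company
CREATE (:Person {id: 1, name: 'Alice'});
CREATE (:Company {id: 2, name: 'Acme'});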
Jim Biard's answer works, but it uses PERIODIC COMMIT, which, while useful, is now deprecated.
I was able to write a query that:
Loads from CSV
Uses multiple transactions
Creates nodes
Appends labels
Works for Neo4j 4.5 and onwards
:auto LOAD CSV WITH HEADERS FROM 'file:///nodes_build_ont_small.csv' AS row
CALL {
  WITH row
  CALL apoc.create.node([row.label], {id: row.id}) YIELD node
  RETURN node
} IN TRANSACTIONS OF 100 ROWS
RETURN count(node) AS nodesCreated
It seems that the APOC procedures are more useful than the built-in commands here, since this is not possible (at least in my attempts) with a plain CREATE.
I need to test the same set of URLs against 5 to 10 servers. The URLs are defined in a CSV file; the server names are defined in a User Defined Variables config.
I'm using a While Controller, based on the number of servers, to iterate and execute the URL requests. My current logic is laid out below:
Thread Group
  While Controller (outer, iterates over the servers)
    Counter (defines number of servers)
    While Controller (inner, check "${URL}" != "<EOF>")
      CSV Data Set Config (stop on EOF is true)
      HTTP Sampler (with URL data)
With this logic, my script runs, reads the CSV file once, and stops. It never iterates the outer loop; only the inner loop runs, and then it stops.
Quote from the JMeter manual for the CSV Data Set Config:
By default, the file is only opened once, and each thread will use a different line from the file. However the order in which lines are passed to threads depends on the order in which they execute, which may vary between iterations. Lines are read at the start of each test iteration. The file name and mode are resolved in the first iteration.
Thread Groups cannot be nested, so you have to use the Thread Group to iterate over the CSV and a ForEach Controller to iterate over everything else. The second option is to generate a CSV with all the URL + server variations and simply use a single Thread Group to read that CSV.
The first option works like this: iterate over the URLs in the outer loop and over the servers in the inner loop. You just need a Thread Group with a ForEach Controller inside it.
You can also play with the __CSVRead function if you have time :)