Cypher LOAD CSV - how to create a linked list of nodes ordered by a property? - csv

I'm new to Neo4j and looking for some guidance :-)
Basically I want to create the graph below from the CSV below. The NEXT relationship should be created between Points based on the order of their sequence property. Ideally it should not matter whether the sequence numbers are consecutive. Any ideas?
(s1:Shape)-[:POINTS]->(p1:Point)
(s1:Shape)-[:POINTS]->(p2:Point)
(s1:Shape)-[:POINTS]->(p3:Point)
(p1)-[:NEXT]->(p2)
(p2)-[:NEXT]->(p3)
and so on
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
"1-700-y11-1.1.I","53.42646060879","-6.23930113514121","1","0"
"1-700-y11-1.1.I","53.4268571616632","-6.24059395687542","2","96.6074531286277"
"1-700-y11-1.1.I","53.4269700485041","-6.24093540883784","3","122.549696670773"
"1-700-y11-1.1.I","53.4270439028769","-6.24106779537932","4","134.591291249566"
"1-700-y11-1.1.I","53.4268623569266","-6.24155684094256","5","172.866609667575"
"1-700-y11-1.1.I","53.4268380666968","-6.2417384245122","6","185.235926544428"
"1-700-y11-1.1.I","53.4268874080753","-6.24203735638874","7","205.851454672516"
"1-700-y11-1.1.I","53.427394066848","-6.24287421729846","8","285.060040065768"
"1-700-y11-1.1.I","53.4275257974236","-6.24327509689195","9","315.473852717259"
"1-700-y11-1.2.O","53.277024711771","-6.20739084216546","1","0"
"1-700-y11-1.2.O","53.2777605784999","-6.20671521402849","2","93.4772699644143"
"1-700-y11-1.2.O","53.2780318605927","-6.2068238246152","3","124.525619356934"
"1-700-y11-1.2.O","53.2786209984572","-6.20894363498438","4","280.387737910482"
"1-700-y11-1.2.O","53.2791038678913","-6.21057305710353","5","401.635418300665"
"1-700-y11-1.2.O","53.2790975844245","-6.21075327761739","6","413.677012879457"
"1-700-y11-1.2.O","53.2792296384738","-6.21116766400758","7","444.981964564454"
"1-700-y11-1.2.O","53.2799500357098","-6.21065767664905","8","532.073870043666"
"1-700-y11-1.2.O","53.2800290799386","-6.2105343995296","9","544.115464622458"
"1-700-y11-1.2.O","53.2815594673093","-6.20949562301196","10","727.987702875002"
It is the 3rd part that I can't finish: creating the NEXT relationship!
//1. Create Shape
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
WITH COLLECT(DISTINCT csv.shape_id) AS ids
FOREACH (x IN ids | MERGE (s:Shape {id: x}));
//2. Create Point, and Shape to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
MATCH (s:Shape {id: csv.shape_id})
with s, csv
MERGE (s)-[:POINTS]->(p:Point {id: csv.shape_id,
lat : csv.shape_pt_lat, lon : csv.shape_pt_lon,
sequence : toInt(csv.shape_pt_sequence), dist_travelled : csv.shape_dist_traveled});
//3.Create Point to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
???

You'll want APOC Procedures installed for this one. It has both a means of batch processing and a quick way to link all nodes in a collection together.
Since you already have all the shapes and their points in the db, you don't need to do another LOAD CSV; just use the data you've got.
We'll use apoc.periodic.iterate() to batch process each shape, and apoc.nodes.link() to link all ordered points in the shape by relationships.
CALL apoc.periodic.iterate(
"MATCH (s:Shape) RETURN s",
"WITH {s} as shape
MATCH (shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER BY point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')",
{batchSize:1000, parallel:true}) YIELD batches, total
RETURN batches, total
EDIT
Looks like there may be a bug when using procedure calls within apoc.periodic.iterate(): no mutating operations seem to occur (I attempted this after including a SET operation in the second part of the query to set a property on some nodes, and the property was not added).
Unsure if this is a general case of procedure calls being executed within procedure calls, or if this is specific to apoc.periodic.iterate(), or if this only occurs with both iterate() and link().
I'll file a bug if I can learn more about the cause. In the meantime, if you don't need batching, you can forgo apoc.periodic.iterate():
MATCH (shape:Shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER BY point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')
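If you can't use APOC at all, a plain-Cypher sketch of the same linking step (assuming the Points from step 2 are already loaded) collects the ordered points per shape and pairs each one with its successor:
MATCH (shape:Shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER BY point.sequence ASC
WITH shape, COLLECT(point) AS points
FOREACH (i IN RANGE(0, SIZE(points) - 2) |
  FOREACH (p1 IN [points[i]] | FOREACH (p2 IN [points[i + 1]] |
    MERGE (p1)-[:NEXT]->(p2))))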

Related

Read every nth batch in pyarrow.dataset.Dataset

In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every single one? Seems like this could be something in FragmentScanOptions but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after, but if you are trying to sample your data there are a few choices, though none achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.Dataset.head().
There is a request in place for randomly sampling a dataset, although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
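For example, a minimal sketch (Dataset.head() is available in reasonably recent pyarrow versions):
import pyarrow.dataset as ds

# Load only the first 1,000 rows from disk instead of the whole dataset
dataset = ds.dataset("blah.parquet")
first_rows = dataset.head(1000)  # returns a pyarrow.Table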
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()
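You can then consume the sampled dataset like any other; with the [::2] slice above, only every other row group is scanned. For example:
# Count the rows without materializing the whole sample at once
total_rows = 0
for batch in sampled_dataset.to_batches():
    total_rows += batch.num_rows
print(total_rows)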

Tf-slim: ValueError: Variable vgg_19/conv1/conv1_1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

I am using tf-slim to extract features from several batches of images. The problem is my code works for the first batch; after that I get the error in the title. My code is something like this:
for i in range(0, num_batches):
    #Obtain the starting and ending images number for each batch
    batch_start = i*training_batch_size
    batch_end = min((i+1)*training_batch_size, read_images_number)
    #obtain the images from the batch
    images = preprocessed_images[batch_start: batch_end]
    with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
        _, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
        init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'), slim.get_model_variables('vgg_19'))
        feature_conv_2_2 = end_points['vgg_19/pool5']
So as you can see, in each iteration I select a batch of images and use the vgg-19 model to extract features from the pool5 layer. But after the first iteration I get an error on the line where I am trying to obtain the end points. One solution I found on the internet is to reset the graph each time, but I don't want to do that because later in the code I have some weights in my graph which I train using these extracted features, and I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that: you are trying to build the same graph twice.
So it should be (in pseudocode)
create_graph()
load_checkpoint()
for each batch:
    process_data()
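Concretely, a rough sketch for your case (illustrative TF1-style code; the placeholder shape and session handling are assumptions, and variables like preprocessed_images, checkpoints_dir, num_batches are taken from your snippet):
# assumes: import os, tensorflow as tf, tensorflow.contrib.slim as slim, and the slim vgg module
# Build the graph once, outside the loop
images_ph = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])  # assumed input size
with slim.arg_scope(vgg.vgg_arg_scope()):
    _, end_points = vgg.vgg_19(images_ph, num_classes=1000, is_training=False)
features = end_points['vgg_19/pool5']
init_fn = slim.assign_from_checkpoint_fn(
    os.path.join(checkpoints_dir, 'vgg_19.ckpt'),
    slim.get_model_variables('vgg_19'))

with tf.Session() as sess:
    init_fn(sess)  # load the checkpoint once
    for i in range(num_batches):
        batch_start = i * training_batch_size
        batch_end = min((i + 1) * training_batch_size, read_images_number)
        batch = preprocessed_images[batch_start:batch_end]
        # Reuse the same graph; only the fed data changes per batch
        batch_features = sess.run(features, feed_dict={images_ph: batch})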

Should I use Neo4j's Import Tool or Load Command to Insert Several Million Rows?

I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
neo4j-wrapper.conf changes:
wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000
However my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
//and constraints for other name identifiers as well..
I then use periodic commit...
USING PERIODIC COMMIT 10000
I then LOAD in the CSV where I ignore several fields
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL
I then create the necessary nodes from my data.
WITH line
MERGE(country:Country {name : line.CountryName})
MERGE(city:City {name : line.CityName})
MERGE(neighborhood:Neighborhood {
name : line.NeighborhoodName,
size : toInt(line.NeighborhoodSize),
nickname : coalesce(line.NeighborhoodNN, ""),
... 50 other features
})
MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
//Note that each neighborhood only appears once
Does it make sense to use CREATE UNIQUE rather than applying MERGE to any COUNTRY reference? Would this speed it up?
A ~250,000-line CSV file took over 12 hours to complete, which seems excessively slow. What else can I be doing to speed this up? Or does it just make sense to use the annoying-looking Import Tool?
A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically what it says is that you should add a PROFILE to the start of each of your queries to see if any of them use the Eager operator. If they do, this can really cost you performance-wise, and you should probably split up your queries into separate MERGEs.
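For example, you can prefix one of the import queries with PROFILE and look for Eager in the resulting plan (illustrative, reusing the file path from the question):
PROFILE
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" AS line
MERGE (country:Country {name : line.CountryName});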
Secondly, your neighborhood MERGE contains a lot of properties, and so each time it's trying to match on every single one of those properties before deciding if it should create it or not. I'd suggest something like:
MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
  neighborhood.size = toInt(line.NeighborhoodSize),
  neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
  ... 50 other features

Retrieve row number with supercsv

Is there a way with the super-csv library to find out the number of rows in the file that will be processed?
In other words, before I start to process my rows with a loop:
while ((obj = csvBeanReader.read(obj.getClass(),
        csvModel.getNameMapping(), processors)) != null) {
    //Do some logic here...
}
Can I retrieve, with some library class, the number of rows contained in the CSV file?
No, in order to find out how many rows are in your CSV file, you'll have to read the whole file with Super CSV (this is really the only way, as a CSV row can span multiple lines). You could always do an initial pass over the file using CsvListReader (it doesn't do any bean mapping, so probably a bit more efficient) just to get the row count...
As an aside (it doesn't help in this situation), you can get the current line/row number from the reader as you are reading using the getLineNumber() and getRowNumber() methods.
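For example, inside the loop from the question (illustrative):
while ((obj = csvBeanReader.read(obj.getClass(),
        csvModel.getNameMapping(), processors)) != null) {
    int line = csvBeanReader.getLineNumber(); // physical line in the file
    int row = csvBeanReader.getRowNumber();   // logical CSV row (a row can span multiple lines)
    //Do some logic here...
}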

Counting the number of passes through a CSV file in JMeter

Am I missing an easy way to do this?
I have a CSV file with a number of params in it, and in my test I want to be able to make some of the fields unique across CSV repetitions with a suffix determined by the number of times I've looped through the file.
So suppose my CSV (simplified) had:
abc
def
ghi
I want to generate in the test
abc_1
def_1
ghi_1 <hit EOF>
abc_2
def_2
ghi_2 <hit EOF>
abc_3
def_3
ghi_3
I thought I could set up a counter to run parallel to my CSV loop, but that won't work unless I increment it by 1/n each iteration, where n is the number of lines in my CSV file, which you can't do because counters are integers.
I'm going to go flail around and see if I can come up with a solution, but in case I'm not successful, has anyone got any suggestions?
I've used an EOF marker row (an index column with something like "EOF" or "END", etc.) and an IF controller with either a non-resetting counter or user variables incremented via JavaScript in a BSF element (a BSF assertion or whatever, just a mechanism to run the script).
Unfortunately it's the best solution I've come up with without putting too much effort into it.