Should I use Neo4j's Import Tool or LOAD CSV to Insert Several Million Rows?

I have several CSV files that range from 25-100 MB in size. I have created constraints, created indices, am using periodic commit, and increased the allocated memory in the neo4j-wrapper.conf and neo4j.properties.
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
neo4j-wrapper.conf changes:
wrapper.java.initmemory=5000
wrapper.java.maxmemory=5000
However, my load is still taking a very long time, and I am considering using the recently released Import Tool (http://neo4j.com/docs/milestone/import-tool.html). Before I switch to it, I was wondering whether I could be doing anything else to improve the speed of my imports.
I begin by creating several constraints to make sure that the IDs I'm using are unique:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
//and constraints for other name identifiers as well..
I then use periodic commit...
USING PERIODIC COMMIT 10000
I then LOAD the CSV, ignoring several fields:
LOAD CSV WITH HEADERS FROM "file:/path/to/file/MyFile.csv" as line
WITH line
WHERE line.CountryName IS NOT NULL AND line.CityName IS NOT NULL AND line.NeighborhoodName IS NOT NULL
I then create the necessary nodes from my data.
WITH line
MERGE (country:Country {name : line.CountryName})
MERGE (city:City {name : line.CityName})
MERGE (neighborhood:Neighborhood {
name : line.NeighborhoodName,
size : toInt(line.NeighborhoodSize),
nickname : coalesce(line.NeighborhoodNN, ""),
... 50 other features
})
MERGE (city)-[:IN]->(country)
CREATE (neighborhood)-[:IN]->(city)
//Note that each neighborhood only appears once
Does it make sense to use CREATE UNIQUE rather than MERGE for the Country reference? Would that speed things up?
A ~250,000-line CSV file took over 12 hours to complete, which seems excessively slow. What else can I do to speed this up? Or does it just make sense to use the annoying-looking Import Tool?

A couple of things. Firstly, I would suggest reading Mark Needham's "Avoiding the Eager" blog post:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Basically, it says you should add PROFILE to the start of each of your queries to see whether any of them use the Eager operator. If they do, it can really cost you performance-wise, and you should probably split your query up into separate MERGEs.
Secondly, your neighborhood MERGE contains a lot of properties, so each time it has to match on every single one of those properties before deciding whether to create the node. I'd suggest something like:
MERGE (neighborhood:Neighborhood {name: line.NeighborhoodName})
ON CREATE SET
neighborhood.size = toInt(line.NeighborhoodSize),
neighborhood.nickname = coalesce(line.NeighborhoodNN, ""),
... 50 other features

Related

Splitting a feature collection by system index in Google Earth Engine?

I am trying to export a large feature collection from GEE. I realize that the Python API allows for this more easily than the JavaScript API does, but given a time constraint on my research, I'd like to see if I can extract the feature collection in pieces and then append the separate CSV files once exported.
I tried to use a filtering function to perform the task, one that I've seen used before with image collections. Here is a mini example of what I am trying to do:
Given a feature collection of 10 spatial points called "points" I tried to create a new feature collection that includes only the first five points:
var points_chunk1 = points.filter(ee.Filter.rangeContains('system:index', 0, 5));
When I execute this function, I receive the following error: "An internal server error has occurred"
I am not sure why this code is not executing as expected. If you know more than I do about this issue, please advise on alternative approaches to splitting my sample, or on where the error in my code lurks.
Many thanks!
system:index is actually an ID given by GEE to each feature, and it's not supposed to be used like an array index. I think the JS API should be enough to export a large FeatureCollection, but there is a way to do what you want without relying on system:index, since it might not be consistent.
First, it's a good idea to know the number of features you are dealing with, because calling size().getInfo() on a large feature collection can freeze the UI and sometimes make the tab unresponsive. Here I have defined chunk and collectionSize. They must be defined client-side, because we want to call Export within the loop, which is not possible in server-side loops. Within the loop, you simply create a subset of features starting at different offsets by converting the collection to a list and wrapping the subset back into a feature collection.
var chunk = 1000;
var collectionSize = 10000;
for (var i = 0; i < collectionSize; i = i + chunk) {
  var subset = ee.FeatureCollection(fc.toList(chunk, i));
  // give each chunk its own description and asset id so the export tasks don't collide
  Export.table.toAsset(subset, "description_" + i, "/asset/id_" + i);
}
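Since the question mentions the Python API, here is a minimal sketch of the same chunked export with the earthengine Python client. It is untested, and the fc collection path, chunk sizes, and destination asset path are placeholder assumptions, not values from the question:
import ee

ee.Initialize()

chunk = 1000
collection_size = 10000  # known (or estimated) number of features
fc = ee.FeatureCollection('users/your_username/points')  # hypothetical source asset, substitute your own

for i in range(0, collection_size, chunk):
    # take `chunk` features starting at offset i and rebuild a FeatureCollection
    subset = ee.FeatureCollection(fc.toList(chunk, i))
    task = ee.batch.Export.table.toAsset(
        collection=subset,
        description=f'points_chunk_{i}',
        assetId=f'users/your_username/points_chunk_{i}',  # hypothetical destination
    )
    task.start()
Each task then appears separately in the Tasks list, and the exported pieces can be appended afterwards.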

Copying fits-file data and/or header into a new fits-file

A similar question was asked before, but it was asked in an ambiguous way and used different code.
My problem: I want to make an exact copy of a .fits file's header into a new file. (I need to process a FITS file in a way that changes the data, keeps the header the same, and saves the result in a new file.) Here is a short example code, just demonstrating the tools I use and the discrepancy I arrive at:
data_old, header_old = fits.getdata("input_file.fits", header=True)
fits.writeto('output_file.fits', data_old, header_old, overwrite=True)
I would now expect the two files to be exact copies (headers and data of both being the same). But if I check for differences, e.g. in this way -
fits.printdiff("input_file.fits", "output_file.fits")
I see that the two files are not exact copies of each other. The report says:
...
Files contain different numbers of HDUs:
a: 3
b: 2
Primary HDU:
Headers contain differences:
Headers have different number of cards:
a: 54
b: 4
...
Extension HDU 1:
Headers contain differences:
Keyword GCOUNT has different comments:
...
Why is the copy not exact? How can I make an exact copy of a header (and/or the data)? Is a keyword being dropped? Is there an alternative simple way of copying a FITS file header?
If you just want to update the data array in an existing file while preserving the rest of the structure, have you tried the update function?
The only issue with that is it doesn't appear to have an option to write to a new file rather than update the existing file (maybe it should have this option). However, you can still use it by first copying the existing file, and then updating the copy.
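As a minimal sketch of that copy-then-update approach (the filenames and extension index here are assumptions, not taken from your files):
import shutil
from astropy.io import fits

# copy the original so its full HDU structure and headers are preserved
shutil.copyfile('input_file.fits', 'output_file.fits')

# read the data from the extension you want to change, modify it, then
# write it back into the copy; fits.update() replaces only that HDU's data
data = fits.getdata('input_file.fits', ext=1)
new_data = data * 2  # stand-in for whatever processing you actually do
fits.update('output_file.fits', new_data, ext=1)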
Alternatively, you can do things more directly using the object-oriented API. Something like:
with fits.open(filename) as hdu_list:
    hdu = hdu_list[<name or index of the HDU to update>]
    hdu.data = <new ndarray>
    # or hdu.data[<some index>] = <some value>, i.e. just directly modify the existing array
    hdu.writeto('updated.fits')  # to write just that HDU to a new file, or
    # hdu_list.writeto('updated.fits')  # to write all HDUs, including the updated one, to a new file
There's nothing not "pythonic" about this :)

Cypher LOAD CSV - how to create a linked list of nodes ordered by a property?

I'm new to Neo4j and looking for some guidance :-)
Basically I want to create the graph below from the CSV below. The NEXT relationship is created between Points based on the order of their sequence property. I would like to be able to ignore whether or not the sequence values are consecutive. Any ideas?
(s1:Shape)-[:POINTS]->(p1:Point)
(s1:Shape)-[:POINTS]->(p2:Point)
(s1:Shape)-[:POINTS]->(p3:Point)
(p1)-[:NEXT]->(p2)
(p2)-[:NEXT]->(p3)
and so on
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
"1-700-y11-1.1.I","53.42646060879","-6.23930113514121","1","0"
"1-700-y11-1.1.I","53.4268571616632","-6.24059395687542","2","96.6074531286277"
"1-700-y11-1.1.I","53.4269700485041","-6.24093540883784","3","122.549696670773"
"1-700-y11-1.1.I","53.4270439028769","-6.24106779537932","4","134.591291249566"
"1-700-y11-1.1.I","53.4268623569266","-6.24155684094256","5","172.866609667575"
"1-700-y11-1.1.I","53.4268380666968","-6.2417384245122","6","185.235926544428"
"1-700-y11-1.1.I","53.4268874080753","-6.24203735638874","7","205.851454672516"
"1-700-y11-1.1.I","53.427394066848","-6.24287421729846","8","285.060040065768"
"1-700-y11-1.1.I","53.4275257974236","-6.24327509689195","9","315.473852717259"
"1-700-y11-1.2.O","53.277024711771","-6.20739084216546","1","0"
"1-700-y11-1.2.O","53.2777605784999","-6.20671521402849","2","93.4772699644143"
"1-700-y11-1.2.O","53.2780318605927","-6.2068238246152","3","124.525619356934"
"1-700-y11-1.2.O","53.2786209984572","-6.20894363498438","4","280.387737910482"
"1-700-y11-1.2.O","53.2791038678913","-6.21057305710353","5","401.635418300665"
"1-700-y11-1.2.O","53.2790975844245","-6.21075327761739","6","413.677012879457"
"1-700-y11-1.2.O","53.2792296384738","-6.21116766400758","7","444.981964564454"
"1-700-y11-1.2.O","53.2799500357098","-6.21065767664905","8","532.073870043666"
"1-700-y11-1.2.O","53.2800290799386","-6.2105343995296","9","544.115464622458"
"1-700-y11-1.2.O","53.2815594673093","-6.20949562301196","10","727.987702875002"
It is the 3rd part that I can't finish: creating the NEXT relationship!
//1. Create Shape
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
WITH collect(DISTINCT csv.shape_id) AS ids
FOREACH (x IN ids | MERGE (s:Shape {id: x}));
//2. Create Point, and Shape to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
MATCH (s:Shape {id: csv.shape_id})
with s, csv
MERGE (s)-[:POINTS]->(p:Point {id: csv.shape_id,
lat : csv.shape_pt_lat, lon : csv.shape_pt_lon,
sequence : toInt(csv.shape_pt_sequence), dist_travelled : csv.shape_dist_traveled});
//3.Create Point to Point relationship
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
'file:///D:\\shapes.txt' AS csv
???
You'll want APOC Procedures installed for this one. It has both a means of batch processing, and a quick way to link all nodes in a collection together.
Since you already have all the shapes and the points of each shape in the db, you don't need to do another LOAD CSV; just use the data you've got.
We'll use apoc.periodic.iterate() to batch process each shape, and apoc.nodes.link() to link all ordered points in the shape by relationships.
CALL apoc.periodic.iterate(
"MATCH (s:Shape) RETURN s",
"WITH {s} as shape
MATCH (shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER by point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')",
{batchSize:1000, parallel:true}) YIELD batches, total
RETURN batches, total
EDIT
Looks like there may be a bug when using procedure calls within apoc.periodic.iterate() where no mutating operations occur (I attempted this after including a SET operation in the second part of the query to set a property on some nodes, and the property was not added).
Unsure if this is a general case of procedure calls being executed within procedure calls, or if this is specific to apoc.periodic.iterate(), or if this only occurs with both iterate() and link().
I'll file a bug if I can learn more about the cause. In the meantime, if you don't need batching, you can forgo apoc.periodic.iterate():
MATCH (shape:Shape)-[:POINTS]->(point:Point)
WITH shape, point
ORDER by point.sequence ASC
WITH shape, COLLECT(point) as points
CALL apoc.nodes.link(points,'NEXT')

Loading a pandas Dataframe into a sql database with Django

I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()
for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a way to achieve the same thing with write_frame from pandas? Or maybe by using faster raw SQL?
I don't think the raw SQL route is the first thing you should try (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one individually. It's pretty straightforward: bundle up a number of rows, then create them in batches; try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert them instead of running N queries.
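As a rough sketch of the bulk_create approach, reusing the variables from the question (symbols, prices, weights, portfolio, strategy) and assuming nothing else changes:
import math

# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow(strategy=strategy, symbol=s, time=t)  # not saved yet
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        rows.append(row)

# one INSERT per batch instead of one per row; tune batch_size to your database
StrategyRow.objects.bulk_create(rows, batch_size=500)
Note that bulk_create bypasses the model's save() method and pre/post-save signals, which doesn't matter here but is worth knowing.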

Counting the number of passes through a CSV file in JMeter

Am I missing an easy way to do this?
I have a CSV file with a number of params in it, and in my test I want to be able to make some of the fields unique across CSV repetitions with a suffix determined by the number of times I've looped through the file.
So suppose my CSV (simplified) had:
abc
def
ghi
I want to generate in the test
abc_1
def_1
ghi_1 <hit EOF>
abc_2
def_2
ghi_2 <hit EOF>
abc_3
def_3
ghi_3
I thought I could set up a counter to run in parallel with my CSV loop, but that won't work unless I increment it by 1/n each iteration, where n is the number of lines in my CSV file, which you can't do because counters are integers.
I'm going to go flail around and see if I can come up with a solution, but in case I'm not successful, has anyone got any suggestions?
I've used an EOF marker row (an index column with something like "EOF" or "END") together with an If Controller and either a non-resetting counter or user variables incremented via JavaScript in a BSF element (a BSF Assertion or whatever, just a mechanism to run the script).
Unfortunately, it's the best solution I've come up with without putting too much effort into it.