I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (x:line.label)
...but I get an invalid syntax error. Is there any way to do this?
bicpence,
First off, this is pretty easy to do with a Java batch import application, and they aren't hard to write. See this batch inserter example. You can use opencsv to read your CSV file.
If you would rather stick with Cypher, and if you have a finite set of labels to work with, then you could do something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (n:load {lab:line.label, prop:line.prop});
CREATE INDEX ON :load(lab);
MATCH (n:load {lab:'label1'})
SET n:label1
REMOVE n:load
REMOVE n.lab;
MATCH (n:load {lab:'label2'})
SET n:label2
REMOVE n:load
REMOVE n.lab;
Grace and peace,
Jim
Unfortunately not; parameterized labels are not supported.
Chris
You can use a workaround: create all the nodes, then filter on them and create the desired nodes, then remove the old nodes.
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (tmp:line[1])
WITH tmp
CREATE (x:Person {name: labels(tmp)[0]})
WITH tmp
DELETE tmp
paste this into http://console.neo4j.org to see example:
LOAD CSV
WITH HEADERS FROM "http://docs.neo4j.org/chunked/2.1.2/csv/import/persons.csv" AS csvLine
CREATE (p:tmp { id: toInt(csvLine.id), name: csvLine.name })
WITH p
CREATE (pp:Person { name: labels(p)[0]})
WITH p, pp
DELETE p
RETURN pp
I looked around at a few questions like this and concluded that a concise way to handle the frustration of not being able to add dynamic labels through 'LOAD CSV' is simply to use your favorite programming language to read the CSV lines and produce a text file of Cypher statements that build the Neo4j node/edge structure you want. You can then also edit the text file directly to further customize your commands.
I personally used Java, since I am most comfortable with it. I read each line of the CSV into a custom object that represents a row in my CSV file. I then printed to a file a line that reflects the Cypher statement I wanted. After that, all I had to do was cut and paste those commands into the Neo4j browser command line.
This way you can build your commands however you want, and you completely avoid the limitations of 'LOAD CSV' in Cypher.
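The answer above used Java and does not show its code; the same idea can be sketched in Python. This is a minimal sketch under assumptions: the function name cypher_statements and the column name label are hypothetical, header names are assumed to contain no backticks, and all property values are written as strings.

```python
import csv
import io
import json
import re

# Only plain identifiers are accepted as labels, so a CSV value
# cannot break out of the generated statement.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def cypher_statements(csv_text, label_column="label"):
    """Yield one CREATE statement per CSV row, using a column as the label."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row[label_column]
        if not SAFE_NAME.fullmatch(label):
            raise ValueError(f"unsafe label: {label!r}")
        # json.dumps produces a double-quoted, escaped string literal.
        props = ", ".join(
            f"`{key}`: {json.dumps(value)}"
            for key, value in row.items()
            if key != label_column
        )
        yield f"CREATE (:`{label}` {{{props}}});"


if __name__ == "__main__":
    sample = "label,prop\nlabel1,a\nlabel2,b\n"
    for statement in cypher_statements(sample):
        print(statement)
```

The printed statements can be saved to a text file, reviewed or edited, and then pasted into the Neo4j browser as described.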
Jim Biard's answer works, but it uses PERIODIC COMMIT, which is useful but deprecated.
I was able to write a query that:
Loads from CSV
Uses multiple transactions
Creates nodes
Appends labels
Will work for 4.5 and onwards
:auto LOAD CSV WITH HEADERS FROM 'file:///nodes_build_ont_small.csv' AS row
CALL {
with row
call apoc.create.node([row.label], {id: row.id})
yield node
return null
} IN TRANSACTIONS of 100 rows
return null
It seems that APOC procedures are more useful than the commands themselves, since this is not possible (at least in my attempts) with CREATE.
I am sending the following CSV file to MarkLogic:
id,first_name,last_name,email,country,ip_address
5,Shawn,Grant,sgrant0#51.la,Liberia,37.194.161.124
5,Joshua,Fields,jfields1#godaddy.com,Colombia,54.224.238.176
5,Johnny,Bell,jbell2#t.co,Finland,159.38.61.122
through mlcp, using the following command:
C:\mlcp-9.0.3\bin>mlcp.bat import -host localhost -port 9636 -username admin -password admin -input_file_path D:\test.csv -input_file_type delimited_text -document_type json
What happened?
When I looked in Query Console, I had one JSON document with the following information:
id,first_name,last_name,email,country,ip_address
5,Shawn,Grant,sgrant0#51.la,Liberia,37.194.161.124
What am I expecting?
By default, the first column of the CSV is used when creating the JSON/XML document. Since I am sending 3 rows, the document should hold the latest information (i.e. the 3rd row), right?
My assumption
Since I am sending all three rows at once through mlcp, we can't say which one reaches the ML DB first.
Let me know whether my assumption is right or wrong.
Thanks
MLCP wants to be as fast as possible. In the case of CSV files it will process the rows using many threads (and even shard the document if you pass the split option). Because of this, there is no guarantee that the rows will be processed in any particular order. You may be able to tune some of the settings in MLCP to use one thread and not shard the file to get the results you want, but in that case you are losing some of the power of MLCP.
Beyond that, an observation: from how I interpret your problem statement, you are adding quite a bit of overhead by inserting and then overwriting unneeded documents. Why not sort and filter your initial CSV file down to one record per ID and save your computer from doing more work?
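The pre-filtering suggested above could look something like the following sketch, which keeps only the last row seen for each ID before the file is handed to mlcp. The function name last_row_per_id is hypothetical, and the sketch assumes the file fits in memory and that "latest" means "last in file order".

```python
import csv
import io


def last_row_per_id(csv_text, id_column="id"):
    """Return CSV text with one row per ID, keeping the last occurrence."""
    reader = csv.DictReader(io.StringIO(csv_text))
    latest = {}
    for row in reader:
        # A later row with the same ID overwrites the earlier one.
        latest[row[id_column]] = row
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(latest.values())
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        "id,first_name,last_name,email,country,ip_address\n"
        "5,Shawn,Grant,sgrant0#51.la,Liberia,37.194.161.124\n"
        "5,Joshua,Fields,jfields1#godaddy.com,Colombia,54.224.238.176\n"
        "5,Johnny,Bell,jbell2#t.co,Finland,159.38.61.122\n"
    )
    print(last_row_per_id(sample), end="")
```

With the sample data from the question, only the third row survives, so the import order no longer matters.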
I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file I get all the data on a single line. However, I want my data to be dumped into the file in a format of my own. For example (I had to use a picture to get the point across): My custom JSON format. I've looked up questions on custom JSONEncoders, but they all seem to deal with datatypes that aren't JSON serializable. I never found a solution for my specific need, which is having everything laid out in the manner that I want. Basically, I want each list element to be on a separate row but all of the items of a dict to be on the same row. Do I need to write my own custom encoder, or is there some other approach I need to take? Thanks!
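One possible approach that avoids a custom encoder, sketched under assumptions because the picture of the desired format is not available here: serialize each dict on its own with json.dumps and join the results by hand. The function name dump_rows and the sample records are hypothetical, and the sketch only handles a top-level list of dicts.

```python
import json


def dump_rows(records):
    """Serialize a list so each element sits on its own line.

    Every element (for example a dict) is dumped compactly on one line;
    only the outer list is spread over several lines.
    """
    lines = ["  " + json.dumps(record) for record in records]
    return "[\n" + ",\n".join(lines) + "\n]"


if __name__ == "__main__":
    players = [
        {"name": "A", "score": 1},
        {"name": "B", "score": 2},
    ]
    text = dump_rows(players)
    print(text)
    # The result is still valid JSON and loads back unchanged.
    assert json.loads(text) == players
```

Because the output is still valid JSON, it can be read back with json.load without any special handling.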
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but semicolon delimited file, and the numbers are formatted with German notation, i.e. comma is used as decimal delimiter.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in different Locale or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in Spark code and one nasty thing that seems to be causing issue is in CSVInferSchema.scala line 268 (spark 2.1.0) - the parser enforces US formatting rather than e.g. rely on the Locale set for the JVM, or allowing configuring this somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for an answer, the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
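The conversion function itself is not shown in the answer. The per-field logic could be sketched as follows, in Python rather than Java and independent of Spark. The names parse_german_number and convert_row are hypothetical, and the sketch assumes that a dot in the input is always a thousands separator and a comma is always the decimal separator.

```python
def parse_german_number(text):
    """Parse a number written in German notation, e.g. '1.234,56'."""
    # Assumption: '.' only ever groups thousands and ',' marks decimals.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


def convert_row(fields, numeric_indexes):
    """Return a new row with the fields at numeric_indexes parsed as numbers."""
    return [
        parse_german_number(value) if index in numeric_indexes else value
        for index, value in enumerate(fields)
    ]


if __name__ == "__main__":
    print(convert_row(["widget", "123,01", "1.234,5"], {1, 2}))
```

In the Java workaround, the equivalent logic would run inside the mapped function and return a new Row built with RowFactory.create.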
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
I need to get back CSV output from my Solr queries, so I am using Solr's CSV response writer.
All works fine using wt=csv without changing the default values for CSV output, but I have one requirement: I need tab-separated CSV with no text value quoting at all.
The tab separation is easy, as I can specify a tab as csv.separator in the Solr CSV response writer.
The problem is how to get rid of encapsulation:
The default encapsulator for CSV fields is ".
But setting encapsulator='' or encapsulator=None returns the error Invalid encapsulator.
There seems to be no documentation for this in the Solr Wiki.
How can I suppress encapsulation at all?
You are not going to be able to; the Java source expects an encapsulator that is exactly one character long:
String encapsulator = params.get(CSV_ENCAPSULATOR);
String escape = params.get(CSV_ESCAPE);
if (encapsulator!=null) {
if (encapsulator.length()!=1) throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,"Invalid encapsulator:'"+encapsulator+"'");
strat.setEncapsulator(encapsulator.charAt(0));
}
What you can do:
Write your own custom NoEncapsulatorCSVResponseWriter, by inheriting from CSVResponseWriter probably, and modify the code so it does not use the encapsulator. Not difficult, but mostly a hassle.
Use some unique encapsulator (for example ø) and then add a postprocess step on your client side that just removes it. Easier but you need that extra step...
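The client-side postprocess step in the second option can be very small. A minimal sketch, assuming the chosen encapsulator character (ø here, as in the example above) never occurs in the actual field values; the function name strip_encapsulator is hypothetical.

```python
def strip_encapsulator(tsv_text, encapsulator="ø"):
    """Remove every occurrence of the encapsulator from a Solr TSV response."""
    # Assumption: the character was chosen so it cannot appear in real data.
    return tsv_text.replace(encapsulator, "")


if __name__ == "__main__":
    response = "id\tname\n1\tøsome textø\n"
    print(strip_encapsulator(response), end="")
```

If the character can appear in the data, this simple replacement would corrupt those values, so the choice of encapsulator matters.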
I have a very large CSV file (8000+ items) of URLs that I'm reading with a CSV Data Set Config element. It is populating the path of an HTTP Request sampler and iterating through with a while controller.
This is fine, except that I want each user (thread) to pick a random URL from the CSV URL list. I don't want each thread to use the CSV items sequentially.
I was able to achieve this with a Random Order Controller with multiple HTTP Request samplers , however 8000+ HTTP Samplers really bogged down jmeter to an unusable state. So this is why I put the HTTP Sampler URLs in the CSV file. It doesn't appear that I can use the Random Order Controller with the CSV file data however. So how can I achieve random CSV data item selection per thread?
There is another way to achieve this:
create a separate thread group
depending on what you want to achieve:
add a (random) loop count -> this will set a start offset for the thread group that does the work
add a loop count or forever and a timer and let it loop while the other thread group is running. This thread group will read a 'pseudo' random line
It's not really random, the file is still read sequentially, but your work thread makes jumps in the file. It worked for me ;-)
There's no random selection function when reading csv data. The reason is you would need to read the whole file into memory first to do this and that's a bad idea with a load test tool (any load test tool).
Other commercial tools solve this problem by automatically re-processing the data. In JMeter you can achieve the same manually by simply sorting the data using an arbitrary field. If you sort by, say Surname, then the result is effectively random distribution.
Note. If you ensure the default All Threads is set for the CSV Data Set Config then the data will be unique in the scope of the JMeter process.
The new Random CSV Data Set Config from BlazeMeter plugin should perfectly fit your needs.
As other answers have stated, the reason you're not able to select a line at random is because you would have to read the whole file into memory which is inefficient.
Rather than trying to get JMeter to handle this on the fly, why not just randomise the file order itself before you start the test?
A scripting language such as perl makes short work of this:
cat unrandom.csv | perl -MList::Util=shuffle -e 'print shuffle<STDIN>' > random.csv
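The same pre-shuffling can be sketched in Python for environments without perl. The function names are hypothetical, and like the one-liner above this assumes the file has no header row that must stay on top and that it fits in memory.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return the lines in random order; pass a seed for a repeatable order."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_file(source_path, target_path, seed=None):
    """Read source_path, shuffle its lines, and write them to target_path."""
    with open(source_path, encoding="utf-8") as source:
        lines = source.readlines()
    with open(target_path, "w", encoding="utf-8") as target:
        target.writelines(shuffle_lines(lines, seed))


if __name__ == "__main__":
    print(shuffle_lines(["/a\n", "/b\n", "/c\n"], seed=1))
```

Calling shuffle_file("unrandom.csv", "random.csv") before the test run gives JMeter a file it can still read sequentially.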
For my case:
single column
small dataset
Non-changing CSV
I simply dropped the CSV, followed https://stackoverflow.com/a/22042337/6463291, and used a BeanShell PreProcessor instead, something like this:
String[] query = new String[]{"csv_element1", "csv_element2", "csv_element3"};
Random random = new Random();
int i = random.nextInt(query.length);
vars.put("randomOption",query[i]);
Performance seems OK; if you have the same issue, you can try this out.
I am not sure if this will work, but I will suggest it anyway.
Why not divide your URLs into 100 different CSV files? Then, in each thread, generate a random number and use it to pick the CSV file to read with the __CSVRead function.
http://jmeter.apache.org/usermanual/functions.html#_CSVRead
The only part I am not sure about is whether the __CSVRead function reopens the file every time or shares the same file handle across threads.
You may want to try it. Please share your findings.
A much more straightforward solution:
In the CSV file, add another column (say B).
Apply the =RAND() function in the first cell of column B (say B1). This will create a random float number.
Drag the corner of the cell (say B1) to apply it to all the corresponding URLs.
Sort by column B.
Your URLs will now be sorted randomly.
Delete column B.