I need to get CSV output back from my Solr queries, so I am using Solr's CSV response writer.
Everything works fine using wt=csv with the default CSV output settings, but I have one requirement: I need tab-separated output with no text-value quoting at all.
The tab separation is easy, as I can specify a tab as csv.separator for the Solr CSV response writer.
The problem is how to get rid of encapsulation:
The default encapsulator for CSV fields is ".
But setting encapsulator='' or encapsulator=None returns the error "Invalid encapsulator".
There seems to be no documentation for this in the Solr Wiki.
How can I suppress encapsulation entirely?
You are not going to be able to: the Java source expects a one-character encapsulator:
String encapsulator = params.get(CSV_ENCAPSULATOR);
String escape = params.get(CSV_ESCAPE);
if (encapsulator!=null) {
    if (encapsulator.length()!=1) throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,"Invalid encapsulator:'"+encapsulator+"'");
    strat.setEncapsulator(encapsulator.charAt(0));
}
What you can do:
Write your own custom NoEncapsulatorCSVResponseWriter, probably by inheriting from CSVResponseWriter, and modify the code so that it does not use the encapsulator. Not difficult, but mostly a hassle.
Use some encapsulator character that cannot appear in your data (for example ø) and then add a post-processing step on your client side that simply removes it. Easier, but you need that extra step (sketched below)...
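For instance, a minimal client-side post-processing sketch in Python, using Solr's csv.separator and csv.encapsulator parameters (the Solr URL, core name, query, and output file name are purely illustrative):

import requests  # any HTTP client will do; requests is only used for illustration

# Request tab-separated CSV with an encapsulator that cannot occur in the data,
# then strip that placeholder character on the client side.
params = {
    "q": "*:*",
    "wt": "csv",
    "csv.separator": "\t",
    "csv.encapsulator": "\u00f8",  # the placeholder character ø
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
clean = resp.text.replace("\u00f8", "")  # remove the placeholder encapsulator
with open("export.tsv", "w", encoding="utf-8") as f:
    f.write(clean)

Note that this only works if ø (and, since nothing is quoted any more, the tab separator itself) never appears in your field values.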
Background: I want to store a dict object in JSON format that has, say, two entries:
(1) Some object that describes the data in (2). This is small: mostly definitions, controlling parameters, and other things (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, which is a large chunk and should be machine readable (no need for a human to pore over it when opening the file).
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the indent applies to everything, so the large data gets a lot of whitespace and the file becomes large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects, as sketched below.
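A minimal sketch of those steps, assuming the names small_description and big_chunk from the question (the file name and the placeholder data are illustrative):

import json

small_description = {"version": 1, "units": "seconds"}  # placeholder metadata
big_chunk = list(range(1000))                           # placeholder bulk data

with open("trig_file.json", "w") as f:
    f.write('{"meta_data":\n')
    json.dump(small_description, f, indent=4)        # human-readable part
    f.write(',\n"actual_data":\n')
    json.dump(big_chunk, f, separators=(",", ":"))   # compact machine-readable part
    f.write("\n}")

The result is still one valid JSON document; only the container braces and keys are written by hand.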
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
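Writing that format might look roughly like this (file name and contents are illustrative):

import json

meta = {"description": "human-readable content goes here",
        "split over": "several lines"}
data = ["big", "machine-readable", "unformatted", "content", "here", "....."]

with open("mixed.json", "w") as f:
    # Each json.dump call writes one complete document; the documents are simply concatenated.
    json.dump({"next_item": "meta_data"}, f)
    f.write("\n")
    json.dump(meta, f, indent=4)                   # pretty-printed
    f.write("\n")
    json.dump({"next_item": "actual_data"}, f)
    f.write("\n")
    json.dump(data, f, separators=(",", ":"))      # compact
    f.write("\n")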
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
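Reading it back could then look like this, assuming the ijson library is installed (the file name matches the sketch above):

import ijson  # third-party: pip install ijson

with open("mixed.json", "rb") as f:
    # multiple_values=True makes ijson yield each concatenated top-level document in turn.
    for doc in ijson.items(f, "", multiple_values=True):
        print(doc)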
I am having trouble identifying the proper format to store a JSON request body in a CSV file, and then use the CSV file value in a scenario.
This works properly within a scenario:
And request '{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}'
However, when it is stored in a CSV file as follows (I've tried quite a number of other formatting variations):
'{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}',
and used in scenario as:
And request requestBody
my test returns a "javascript evaluation failed: " error containing the JSON above, followed by ":1:63 Missing close quote ^ in at line number 1 at column number 63".
Can you please identify the correct formatting or the usage errors I am missing? Thanks.
We just use a basic CSV library behind the scenes. I suggest you roll your own Java helper class that does whatever processing / pre-processing you need.
Do read this answer as well: https://stackoverflow.com/a/54593057/143475
I can't make sense of your JSON, but if you are trying to fit JSON into CSV, sorry - that's not a good idea. See this answer: https://stackoverflow.com/a/62449166/143475
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .write()
    .parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "CSV" but as a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal separator.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers with a different Locale, or some other workaround that would let me convert this file without first converting the CSV to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, for example, relying on the Locale set for the JVM or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that - I can't work out how to make it let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. at the parsing step, or will I be forced to do an intermediate conversion and only then convert the result to Parquet?
For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(stringOnlySchema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .javaRDD()
    .map(this::conversionFunction);

sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
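The core of such a conversion function is just normalizing the German number format. For illustration only, here is that piece sketched in Python rather than Java (the function name is made up):

def parse_german_number(s: str) -> float:
    # Drop thousands separators ('.'), then turn the decimal comma into a dot.
    return float(s.replace(".", "").replace(",", "."))

assert parse_german_number("123,01") == 123.01
assert parse_german_number("1.234,56") == 1234.56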
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)
Just to check: the default JSON view, which converts Python objects to JSON, seems to include whitespace between the values, i.e.
"field": [[110468, "Octopus_vulgaris", "common octopus"...
rather than
"field":[[110468,"Octopus_vulgaris","common octopus"...
Is that right? If so, is there an easy way to output the JSON without the extra spaces, and is this for any reason (other than readability) a bad idea?
I'm trying to make some API calls return the fastest and most concise JSON representation, so any other tips are gratefully accepted. For example, I see the view calls from gluon.serializers import json - does that get re-imported every time the view is used, or is Python clever enough to import it only once? I'm hoping the latter.
The generic.json view calls gluon.serializers.json, which ultimately calls json.dumps from the Python standard library. By default, json.dumps inserts spaces after separators. If you want no spaces, you will not be able to use the generic.json view as is. You can instead do:
import json
output = json.dumps(input, separators=(',', ':'))
If input includes some data that are not JSON serializable and you want to take advantage of the special data type conversions implemented in gluon.serializers.json (i.e., datetime objects and various web2py specific objects), you can do the following:
import json
from gluon.serializers import custom_json
output = json.dumps(input, separators=(',', ':'), default=custom_json)
Using the above, you can either edit the generic.json view, create your own custom JSON view, or simply return the JSON directly from the controller.
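For example, a minimal sketch of returning the compact JSON directly from a controller action (the action name and data are illustrative; response is the usual web2py global available in controllers):

import json
from gluon.serializers import custom_json

def api_species():
    # Illustrative controller action that bypasses the generic.json view
    # and returns compact JSON itself.
    data = {"field": [[110468, "Octopus_vulgaris", "common octopus"]]}
    response.headers['Content-Type'] = 'application/json'
    return json.dumps(data, separators=(',', ':'), default=custom_json)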
Also, no need to worry about re-importing modules in Python -- the interpreter only loads the module once.
I’m trying to use LOAD CSV to create nodes with the labels being set to values from the CSV. Is that possible? I’m trying something like:
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (x:line.label)
...but I get an invalid syntax error. Is there any way to do this?
bicpence,
First off, this is pretty easy to do with a Java batch import application, and they aren't hard to write. See this batch inserter example. You can use opencsv to read your CSV file.
If you would rather stick with Cypher, and if you have a finite set of labels to work with, then you could do something like this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (n:load {lab: line.label, prop: line.prop});
CREATE INDEX ON :load(lab);
MATCH (n:load {lab:'label1'})
SET n:label1
REMOVE n:load
REMOVE n.lab;
MATCH (n:load {lab:'label2'})
SET n:label2
REMOVE n:load
REMOVE n.lab;
Grace and peace,
Jim
Unfortunately not; parameterized labels are not supported.
Chris
You can do a workaround: create all the nodes first, then filter on them and create the desired nodes, then remove the old ones.
LOAD CSV WITH HEADERS FROM 'file:///testfile.csv' AS line
CREATE (tmp:tmp {label: line.label})
WITH tmp
CREATE (x:Person {name: tmp.label})
WITH tmp
DELETE tmp
Paste this into http://console.neo4j.org to see an example:
LOAD CSV WITH HEADERS FROM "http://docs.neo4j.org/chunked/2.1.2/csv/import/persons.csv" AS csvLine
CREATE (p:tmp { id: toInt(csvLine.id), name: csvLine.name })
WITH p
CREATE (pp:Person { name: labels(p)[0]})
WITH p, pp
DELETE p
RETURN pp
I looked around at a few questions like this and concluded that a nice, concise way to handle the frustration of not being able to easily add dynamic labels through LOAD CSV is simply to use your favorite programming language to read the CSV lines and produce a text file of Cypher statements that will create the Neo4j node/edge structure you want. You can then also edit that text file directly to customize the commands further.
I personally used Java, since that is what I am most comfortable with. I read each line of the CSV into a custom object representing a row of my CSV file, then printed to a file the Cypher statement I wanted for that row. All I had to do then was cut and paste those commands into the Neo4j browser command line.
This way you can build your commands however you want, and you can completely avoid the limitations of LOAD CSV with Cypher.
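A minimal sketch of that generator approach, here in Python (the column names label and name, the input file, and the output file are placeholders for whatever your CSV actually contains):

import csv

# Read each CSV row and emit a Cypher CREATE statement whose label comes from
# the row itself; the resulting file can be pasted into the Neo4j browser.
with open("testfile.csv", newline="") as src, open("create_nodes.cypher", "w") as out:
    for row in csv.DictReader(src):
        label = row["label"]                     # assumed column holding the node label
        name = row["name"].replace("'", "\\'")   # naive escaping, for the example only
        out.write(f"CREATE (:{label} {{name: '{name}'}});\n")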
Jim Biard's answer works but uses PERIODIC COMMIT, which, while useful, is deprecated.
I was able to write a query that:
Loads from CSV
Uses multiple transactions
Creates nodes
Appends labels
Will work for 4.5 and onwards
:auto LOAD CSV WITH HEADERS FROM 'file:///nodes_build_ont_small.csv' AS row
CALL {
    WITH row
    CALL apoc.create.node([row.label], {id: row.id})
    YIELD node
    RETURN null
} IN TRANSACTIONS OF 100 ROWS
RETURN null
It seems that the APOC procedures are more useful than the built-in commands here, since this is not possible (at least in my attempts) with plain CREATE.