What is the standard way to represent a list/array value in CSV? For example, given this source data in JSON:
[
    {
        "name": "Harry",
        "subjects": ["math", "english", "history"]
    }
]
My guess as to a CSV representation would be:
name,subjects
Harry,["math","english","history"]
However, that doesn't get parsed correctly (with the standard Python CSV parser).
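For concreteness, a minimal sketch using the standard library csv module shows how that row actually gets split:

import csv
import io

raw = 'name,subjects\nHarry,["math","english","history"]\n'
rows = list(csv.reader(io.StringIO(raw)))
# The unquoted brackets do not protect the inner commas, so the "list"
# is broken into several fields:
print(rows[1])   # ['Harry', '["math"', '"english"', '"history"]']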
One option, though this is almost always a hack and should be avoided unless truly necessary, is to choose a delimiter that you know will never show up in your data. For example:
name,subject
Harry,math|english|history
Of course, you will have to handle splitting this string and turning it back into a list yourself. Don't expect existing CSV parsers to do it for you, because the concept of a list value simply does not exist in CSV.
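For illustration, a minimal sketch of that manual round trip (the sample data is inlined so it runs as-is):

import csv
import io

raw = "name,subject\nHarry,math|english|history\n"
for row in csv.DictReader(io.StringIO(raw)):
    subjects = row["subject"].split("|")    # manual split back into a list
    print(row["name"], subjects)            # Harry ['math', 'english', 'history']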
And of course, this does not generalize well - what happens in the future when you need to store a 2D list, or a dict, or you realize you do need that delimiter character after all?
The root problem here is that CSV is a tabular format, whereas JSON is a hierarchical format. Rather than trying to "squeeze" one format into a fundamentally incompatible one, you should instead normalize your data into a tabular representation. One example of how this could look:
name,subject
Harry,math
Harry,english
Harry,history
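For illustration, a minimal Python sketch of producing that normalized form from the JSON above (the output file name is made up):

import csv
import json

data = json.loads('[{"name": "Harry", "subjects": ["math", "english", "history"]}]')

with open("subjects.csv", "w", newline="") as f:    # hypothetical output file
    writer = csv.writer(f)
    writer.writerow(["name", "subject"])
    for record in data:
        for subject in record["subjects"]:
            writer.writerow([record["name"], subject])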
Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data: mostly definitions, control parameters, etc. and things (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, which is a large chunk and only needs to be machine readable (no human needs to look it over when opening the file).
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data also gets a lot of whitespace, making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects.
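A rough sketch of those steps in Python (trig_file, small_description and big_chunk come from the question; the placeholder values are made up):

import json

small_description = {"version": 1, "units": "seconds"}    # placeholder metadata
big_chunk = list(range(1000))                             # placeholder bulk data

with open("trig_file.json", "w") as trig_file:            # hypothetical file name
    trig_file.write('{"meta_data":\n')
    json.dump(small_description, trig_file, indent=4)     # pretty-printed, human readable
    trig_file.write(',\n"actual_data":\n')
    json.dump(big_chunk, trig_file)                        # compact, machine readable
    trig_file.write('\n}')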
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
    "description": "human-readable content goes here",
    "split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
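A hedged sketch of writing and then reading that concatenated-documents format, using the multiple_values argument mentioned above (the file name is made up):

import json
import ijson    # third-party library: pip install ijson

small_description = {"description": "human-readable content", "split over": "several lines"}
big_chunk = ["big", "machine-readable", "unformatted", "content"]

# Write each document separately, choosing the indent per write.
with open("mixed.json", "w") as f:          # hypothetical file name
    json.dump({"next_item": "meta_data"}, f)
    f.write("\n")
    json.dump(small_description, f, indent=4)
    f.write("\n")
    json.dump({"next_item": "actual_data"}, f)
    f.write("\n")
    json.dump(big_chunk, f)
    f.write("\n")

# Read the concatenated documents back one at a time.
with open("mixed.json", "rb") as f:
    for document in ijson.items(f, "", multiple_values=True):
        print(document)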
What's the purpose (not what it becomes) of doing json_encode on this before I put it into the database
rating: {cleanliness: 3, publicFacility: 1, roomFacility: 2, security: 2}
to become this
rating: "{"cleanliness":3,"publicFacility":1,"roomFacility":2,"security":2}"
I see no point in doing this since I need to json_decode it again before serving it back... can anybody clear this up for me?
Do not store JSON-encoded data in the database. You defeat the whole point of a relational database this way and make searching for values an expensive task. In your sample I see the attributes cleanliness, publicFacility, roomFacility and security. Those should be columns in your database so you can search for something like "all entries with a cleanliness higher than 3".
It works with the JSON column type but it is more expensive than using normal columns.
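To make the searching point concrete, here is a small illustration using Python's sqlite3 (the table, columns and values are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rating ("
    "id INTEGER PRIMARY KEY, cleanliness INTEGER, publicFacility INTEGER, "
    "roomFacility INTEGER, security INTEGER)"
)
conn.execute("INSERT INTO rating VALUES (1, 3, 1, 2, 2)")
conn.execute("INSERT INTO rating VALUES (2, 5, 4, 4, 3)")

# With real columns this kind of query is trivial; with a JSON string stuffed
# into a single column it is not.
print(conn.execute("SELECT id FROM rating WHERE cleanliness > 3").fetchall())   # [(2,)]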
Edit: Check the use case for your database entry. If you are sure you will never need to search in or order by the encoded attributes, you can store the data as a JSON string. However, if your database supports a JSON column type, you should use that instead, because it allows searching within the stored JSON (although this is more expensive than searching normal columns).
Second point: the second code snippet (with the quotation marks) looks like invalid JSON syntax.
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that: I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using a UDT...).
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map(this::conversionFunction);
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
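For comparison, a rough PySpark sketch of the same workaround (the column names, the input path and the numeric conversion are assumptions, not the original code):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

string_schema = StructType([
    StructField("name", StringType()),
    StructField("price", StringType()),
])
numeric_schema = StructType([
    StructField("name", StringType()),
    StructField("price", DoubleType()),
])

def convert(row):
    # Swap the German decimal comma for a dot before parsing the number.
    return (row["name"], float(row["price"].replace(",", ".")))

converted = (
    spark.read.format("csv")
    .schema(string_schema)
    .option("delimiter", ";")
    .load("input.csv")          # hypothetical input path
    .rdd
    .map(convert)
)
spark.createDataFrame(converted, numeric_schema).write.parquet("output.parquet")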
What is more correct? Say that JSON is a data structure or a data format?
It's almost certainly ambiguous and depends on your interpretation of the words.
The way I see it:
A data structure, in conventional computer science/programming (i.e. an array, a queue, a binary tree), usually has a specific purpose. JSON can be used for pretty much anything, hence why it's a data format. But both definitions make sense.
In my opinion both are correct.
JSON is both a data format (.json) and a data structure which you can use, for instance, in Java etc.
But the more correct term is data structure.
JSON (canonically pronounced /ˈdʒeɪsən/ jay-sən;[1] sometimes JavaScript Object Notation) is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is the most common data format used for asynchronous browser/server communication (AJAJ), largely replacing XML which is used by AJAX.
Source: Wikipedia
I tried to convert the JSON data
{
    "a": {
        "b": null
    }
}
to XML using an online converter. The response was
<a>
<b />
</a>
Converting this back to JSON using the same converter gave me
{
    "a": {
    }
}
This made me wonder – if you're explicitly given a null value, are you required to preserve it when dealing with JSON? I'm fairly sure that the XML <a><b /></a> is not equivalent to <a></a>, and especially not <a /> (which happens to be what I get when I continue with the same exercise).
In other words, if I'm handed JSON of unknown origin and am supposed to hand it over to an unknown recipient, am I required to preserve the nulls, or can I safely remove them? Conversely, can I rely on my nulls ending up the way I output them when delivered by third-party software?
Here's a similar question: Should JSON include null values – However, the question there is whether the code should output nulls if you define the format yourself, not what you should do if you don't know anything about the original format.
EDIT – Clarification: The way I asked the question was bad and apparently caused confusion. To rephrase it: I do understand that XML and JSON are different formats and are able to carry different kinds of (meta)data. I do know that null is a valid value, as defined by RFC4627. I do understand that there are different ways to convert between XML and JSON since the formats don't have a 1-to-1 relationship. I do understand that the converter I found might be buggy. However, the fact that the same converter didn't provide the same conversion in both directions (no information was lost when converting from "b": null to <b /> and a similar translation in the opposite direction would have been possible) made me wonder something that I couldn't find an answer to despite attempts:
Is it legal, according to the JSON standard, to treat {"a":{"b":null}} and {"a":{}} as one and the same object when transferring them on behalf of other software?
Note that I'm here assuming that it's legal to add or remove whitespace as I see fit (e.g. pretty-printing, which is okay according to RFC4627), and even to rearrange the name/value pairs in a collection (again according to RFC4627). I simply don't know if a null must be preserved in the same way as significant data, or can be dropped in the same way as insignificant whitespace.
Yes, null is a separate value in JSON, and is distinct from not having an attribute, obviously. Also, you can see this question about nulls in XML. The thing to conclude here isn't that there is something wrong with JSON or XML, but simply that the tools you use aren't coded to handle these cases.
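A quick illustration with Python's json module that an explicit null survives parsing and is distinct from a missing key:

import json

with_null = json.loads('{"a": {"b": null}}')
without_b = json.loads('{"a": {}}')

print(with_null)                # {'a': {'b': None}}
print(without_b)                # {'a': {}}
print(with_null == without_b)   # False: an explicit null is not the same as an absent key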
One of the problems in converting JSON to XML is that if you try and make the conversion lossless, you end up with somewhat "unnatural" XML, whereas if you try to create the most natural XML representation, it ends up losing information. That's why there are lots of different converters that all do it in slightly different ways. Choose the one that meets your requirements.