Then I read JSON files from R using rjson it seems that JSON arrays (enclosed with []) are converted to (unnamed) R lists, not vectors.
Therefore I have to first unlist(recuresice=F) each of those lists before I can work with them.
Is there any logic behind this
conversion I am missing? I mean, why
use a list (and not a vector) to
store an array?
Is there any way to
control this behavior of rjson (or
perhaps another recommended JSON
parser for R)?
JSON arrays can store values of different types, so they are equivalent to R unnamed lists.
Related
Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support Array of JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas in each containing hundred of columns each, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page:
from Databricks that supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with Rows of strings that were the JSON Objects. However, I did not find a way to output only the JSON Strings from the data frame rows to an output file by themselves. The always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read json directly from an RDD, including a PipelineRDD. I found a way to remove the unicode byte order header, wrapping array square brackets, split the JSON Objects based on a fortunate delimiter, and have a distributed dataset for more efficient processing. The output dataframe now had columns named after the JSON elements, inferred the schema, and dynamically adapts for other file formats.
Here is the code - hope it helps!:
#...Spark considers arrays of Json objects to be an invalid format
# and unicode files are prefixed with a byteorder marker
#
thanksMoiraRDD = sc.textFile( '/a/valid/file/path', partitions ).map(
lambda x: x.encode('utf-8','ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
I have a sequence file containing multiple json records. I want to send every json record to a function . How can I extract one json record at a time?
Unfortunately there is no standard way to do this.
Unlike YAML which has a well-defined way to allow one file contain multiple YAML "documents", JSON does not have such standards.
One way to solve your problem is to invent your own "object separator". For example, you can use newline characters to separate adjacent JSON objects. You can tell your JSON encoder not to output any newline characters (by forcing escaping it into \ and n). As long as your JSON decoder is sure that it will not see any newline character unless it separates two JSON objects, it can read the stream one line at a time and decode each line.
It has also been suggested that you can use JSON arrays to store multiple JSON objects, but it will no longer be a "stream".
You can read content of your sequence files to RDD[String] and convert it to Spark Dataframe.
val seqFileContent = sc
.sequenceFile[LongWritable, BytesWritable](inputFilename)
.map(x => new String(x._2.getBytes))
val dataframeFromJson = sqlContext.read.json(seqFileContent)
Using Julia, I am trying to read and interpret JSON data, but I get many #undefs. How to obtain an array which excludes the undefs?
using JSON
source = "http://api.herostats.io/heroes/1"
download(source, "1.json")
hdict = JSON.parsefile("1.json")
#Why does hdict have so many #undefs?
hdict.vals
hdict.keys
#And how to remove them?
Julia sometimes lets you do some silly things if you're not careful. In this case, you're viewing the internals of the dictionary (hash map) by accessing hdict.keys and hdict.vals, and accessing the underlying arrays that hold the items.
Try:
values(hdict)
keys(hdict)
I have a bunch of keys in a JSON file that are defined by numbers like 8374829806766627074.
When I try to read them in R I get completely different numbers. For instance, 8374829806766627074 becomes 8374829806766626816.
I am using the fromJSON in the jsonlite package to read a json file from citrix api.
How can I instead adjust the function to read numbers as characters. I seem to fail finding a solution.
Thanks in advance!
I have a data which are object array. It contains object arrays in a tree structure. I use JSON.stringify(myArray) but the data still contain array because I see [] inside the converted data.
In my case, I want all the data to be converted into json object not array regarding I need to used the data on TreeTable of SAPUI5.
Maybe I misunderstand. Please help me clear.
This is the example of the data that I got from JSON.stringify.
[{"value":{"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1","Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas","Parent":"00145E5BB2641EE284F811A7907737A3","Type":"TWB"},
"children":[{"value":{"Id":"00145E5BB2641EE284F811A7907777A3",
"Text":"CHARTING","Parent":"00145E5BB2641EE284F811A7907757A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26631E93",
"Text":"Drill","Parent":"00145E5BB2641EE284F811A7907777A3","Type":"TWB"},
"children":[{"value":{"Id":"001999E0B9081EE28AB706BE26633E93",
"Text":"[AUTO][ACCEPT] Drill on charts DHTML","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_HTML"}},{"value":{"Id":"001999E0B9081EE28AB706BE26635E93",
"Text":"[AUTO][ACCEPT] Drill on charts JAVA","Parent":"001999E0B9081EE28AB706BE26631E93",
"Type":"TWB","Ref":"UT_WEBI_CHARTS_DRILL_JAVA"}}]},...
The output that I want shouldn't be array of object but should be something like...
{{"value":{
"Id":"00145E5BB2641EE284F811A7907717A3",
"Text":"BI-RA Reporting, analysis, and dashboards",
"Parent":"00145E5BB2641EE284F811A79076F7A3","Type":"BMF"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907737A3",
"Text":"WebIntelligence_4.1",
"Parent":"00145E5BB2641EE284F811A7907717A3",
"Type":"TWB"},
"children":{
{"value":{
"Id":"00145E5BB2641EE284F811A7907757A3",
"Text":"Functional Areas",
"Parent":"00145E5BB2641EE284F811A7907737A3",
"Type":"TWB"},...
JSON.stringify merely converts JavaScript data structures to a JSON-formatted string for consumption by other parsers (including JSON.parse). If you want it to stringify to a different value, you must change the source data structures first.
However, it seems that this can't be represented as anything other than an array because you have duplicate keys (i.e. value appears more than once). That would not be valid for a JavaScript object or a JSON representation of such.
I think what you want is
JSON.stringify(data[0]);
or perhaps
JSON.stringify(data[0].value);
where data is the object you passed in the question