Tableau JSON format

In Tableau Public's documentation one can read that:
A single JSON object cannot exceed 128 MB.
When a single object top-level array exceeds 128 MB, you must convert
it to a file where the JSON objects are defined one per line.
When trying to load a (95 MB!) JSON file into Tableau, I get an error:
Tableau encountered a single object larger than 128 MB in the JSON
file. Please try again with a smaller JSON file.
My file is one large array in the following format:
[
{"id":"1389406","updatedDate":"2018-01-31T10:17:31Z","createdDate":"2018-01-31T10:17:31Z","deleted":false,"Ids":["2466958"],"location":{"code":"bibd","name":"Main Library"},"status":{"code":"-","display":"AVAILABLE"},"barcode":"blah blah","callNumber":"TS","itemType":"In-house loan"},
{"id":"1389406","updatedDate":"2018-01-31T10:17:31Z","createdDate":"2018-01-31T10:17:31Z","deleted":false,"Ids":["2466958"],"location":{"code":"bibd","name":"Main Library"},"status":{"code":"-","display":"AVAILABLE"},"barcode":"blah blah","callNumber":"TS","itemType":"In-house loan"}
...
]
What could I do to make Tableau accept this file?

Well, you basically have to do it as the help text stipulates. One JSON object per line. Nothing more. I had tried just that, but I forgot to remove the comma (,) I had at the end of every line.
Your file should look like this:
{"id":"1389406","updatedDate":"2018-01-31T10:17:31Z","createdDate":"2018-01-31T10:17:31Z","deleted":false,"Ids":["2466958"],"location":{"code":"bibd","name":"Main Library"},"status":{"code":"-","display":"AVAILABLE"},"barcode":"blah blah","callNumber":"TS","itemType":"In-house loan"}
{"id":"1389406","updatedDate":"2018-01-31T10:17:31Z","createdDate":"2018-01-31T10:17:31Z","deleted":false,"Ids":["2466958"],"location":{"code":"bibd","name":"Main Library"},"status":{"code":"-","display":"AVAILABLE"},"barcode":"blah blah","callNumber":"TS","itemType":"In-house loan"}

Related

How to read 100GB of Nested json in pyspark on Databricks

There is a nested JSON file with a very deep structure. The file is in json.gz format and is 3.5 GB; once uncompressed it is about 100 GB.
This JSON file is multiline (only when the file is read via spark.read.json with the multiline option do we get to see the proper JSON schema).
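For reference, that multiline read looks roughly like this in PySpark (the path is a placeholder):

# Read the whole nested document as a single multiline JSON record.
df = spark.read.option("multiLine", True).json("/path/to/big_file.json.gz")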
Also, this file has a single record, in which there are two columns of Struct-type arrays, with multilevel nesting.
How should I read this file and extract the information? What kind of cluster / technique should I use to extract the relevant data from this file?
Structure of the JSON (multiline)
This is a single record, and the entire data is present in 2 columns: in_netxxxx and provider_xxxxx.
I was able to achieve this in a slightly different way.
Use the utility Big Text File Splitter (BigTextFileSplitter - Withdata Software, https://www.withdata.com/big-text-file-splitter), as the file was huge and nested over multiple levels. The split record size I kept was 500, which generated around 24 split files of around 3 GB each. The entire process took 30-40 minutes.
Processed the _corrupt_record separately and populated the required information.
Read each split file using the option below; this option removes the _corrupt_record and also removes the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched from each file, we can merge all the files into a single file, as per the standard process.
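For illustration, a rough PySpark sketch of the read-and-merge steps described above; the glob pattern and output path are assumptions, not the actual paths used:

# Read all split files at once; DROPMALFORMED drops the _corrupt_record and null rows.
df = (spark.read
      .option("mode", "DROPMALFORMED")
      .json("/mnt/splits/part_*.json"))

# Merge everything back into a single output, as per the standard process.
df.coalesce(1).write.mode("overwrite").json("/mnt/output/merged")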

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that roughly on file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
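As for the second part of the question (explicitly setting the schema on read), a small sketch reusing my_schema above; the glob path is an assumption:

# Parse every file with the same explicit schema so no type drift can occur.
df_json = (spark.read
           .schema(my_schema)
           .json("/mnt/raw/file_*.json"))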
I'm not sure this is what you are looking for. I hope it helps

JSON variable indent for different entries

Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data: mostly definitions, parameters that control things, etc. (maybe called metadata) that one would like to read before using the actual data in (2). In short, I want good human readability of this portion of the file.
(2) The data itself is a large chunk; it should be machine readable (no need for a human to gaze over it on opening the file).
Problem: How do I apply some custom indent, say 4, to (1) and None to (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data also gets indented, meaning a lot of whitespace and a large file.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects.
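A minimal sketch of that hand-written container, assuming trig_file is an open, writable file object and small_description / big_chunk are the two objects from the question:

import json

trig_file.write('{"meta_data":\n')
json.dump(small_description, trig_file, indent=4)   # human-readable part
trig_file.write(',\n"actual_data":\n')
json.dump(big_chunk, trig_file, indent=None)        # compact, machine-readable part
trig_file.write('\n}')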
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
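For example, reading a file in this concatenated format back with ijson could look roughly like this (the file name is an assumption):

import ijson

# multiple_values=True lets ijson yield each top-level document in the file in turn.
with open('data.json', 'rb') as f:
    for obj in ijson.items(f, '', multiple_values=True):
        print(obj)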

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a large UTF-8 JSON array file, and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110 GB and has ~150 million JSON objects.
HDInsight PySpark does not appear to support the array-of-JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas, each containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping the array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str(ings).
I found a way through the above, and converted to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed them as JSON, extracted the values (the JSON I wanted), then parsed the values as JSON. This appeared to work, but when I tried to save, the RDD was of type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which admittedly I did not understand, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the Unicode byte-order header and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and end up with a distributed dataset for more efficient processing. The resulting dataframe has columns named after the JSON elements, infers the schema, and adapts dynamically to other file formats.
Here is the code - hope it helps!:
# ...Spark considers arrays of JSON objects to be an invalid format,
# and Unicode files are prefixed with a byte-order marker.
# 'partitions' is the desired number of partitions for the resulting RDD.
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)

What is the efficient way of reading and processing very large CSV file in scala (> 1GB)?

In Scala, how do you efficiently (memory consumption + performance) read a very large CSV file? Is it fast enough to just stream it line by line and process each line at each iteration?
What I need to do with the CSV data:
In my application a single line in the CSV file is treated as one single record, and all the records of the CSV file are to be converted into XML elements and JSON format and saved into another file in XML and JSON formats.
So the question here is: while reading the CSV file, is it a good idea to read the file in chunks and provide each chunk to another thread which will convert those CSV records into XML/JSON and write that XML/JSON to a file? If yes, how?
The data in the CSV can be anything; there is no restriction on the type of the data, it can be numeric, big decimal, string or date. Is there an easy way to handle these different data types before saving to XML, or do we not need to take care of the types?
Many Thanks
If this is not a one-time task, create a program that will break this 1 GB file into small files. Then provide those new files as input to separate futures.
Each future will read one file and resolve in the order of the file content: File4 resolves after File3, which resolves after File2, which resolves after File1.
As the file has no key-value pairs or hierarchical data structure, I suggest just reading each line as a string.
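The original question is about Scala, but to keep the examples on this page in one language, here is a rough Python sketch of the same split-and-futures idea (shown only for the JSON conversion); the chunk file names and output path are assumptions:

import csv
import json
from concurrent.futures import ProcessPoolExecutor

def convert_chunk(path):
    # Read one small CSV file and turn each line (record) into a JSON array string.
    with open(path, newline='') as f:
        return [json.dumps(row) for row in csv.reader(f)]

# Files produced by the splitter step, listed in content order.
chunk_paths = ['chunk_1.csv', 'chunk_2.csv', 'chunk_3.csv']

with ProcessPoolExecutor() as pool, open('output.jsonl', 'w') as out:
    # map() returns results in submission order, i.e. in file-content order.
    for records in pool.map(convert_chunk, chunk_paths):
        out.write('\n'.join(records) + '\n')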
Hope this helps.