Hive JSON SerDe selection

I am confused about choosing between the two JSON SerDes described in the link below (OpenX and HCatalog).
https://docs.aws.amazon.com/athena/latest/ug/json.html
My JSON is not nested; it is simple JSON: a file with each record as a JSON object separated by a newline.
Please let me know which would be apt in my case.
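For reference, the layout is one complete JSON object per line (newline-delimited JSON), which is the layout both SerDes on that page expect; a minimal sketch of such a file, with made-up field names, as it could be written from Python:

import json

# Hypothetical "simple" (non-nested) records, one JSON object per line.
records = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

with open("records.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")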

Related

How to prevent adding backslash to JSON string

I would like to read events from Event Hubs using Databricks. The events are in JSON format, but they can have different schemas (this matters because the solutions I have found pass a schema to the from_json(jsonStr, schema) function, which I cannot do in my use case). When I use
.withColumn('Value', col('value').cast(StringType()))
the dataframe returns JSON output with backslashes: "{\"time\": 1432826855000,\"host\":...... .
I found a solution, How to prevent spark sql with kafka from adding backslash to JSON string in dataframe, but in the Delta Live Tables framework we create streaming tables by returning a dataframe, so I can't use that solution.
Should I use non-PySpark functions in the ETL process, such as in How to remove backslash from decoded JSON string?
Will that be efficient when streaming from Event Hubs to bronze?
You shouldn't worry about those backslashes - they are just part of the visual representation of your string when you display data that has the " character embedded in it. Internally, the data is stored without backslashes, like: {"time": 1432826855000,"host":......
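A minimal PySpark sketch of this (the JSON payload and the column name are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A string column holding a JSON document, as you get after casting the
# Event Hubs 'value' field to a string.
df = spark.createDataFrame([('{"time": 1432826855000, "host": "example"}',)], ['Value'])

# Re-serializing the row to JSON (which many UIs do when rendering results)
# escapes the embedded quotes - that is where the backslashes come from:
print(df.toJSON().first())
# {"Value":"{\"time\": 1432826855000, \"host\": \"example\"}"}

# The stored value itself contains no backslashes:
print(df.first()['Value'])
# {"time": 1432826855000, "host": "example"}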

Is the format always JSON when SELECTing from a stage?

Snowflake supports multiple file types via CREATE FILE FORMAT (Avro, JSON, CSV, etc.).
Now I have tested SELECTing from a Snowflake stage (S3) with both:
*.avro files (generated from a NiFi processor batching 10k records from a source Oracle table).
*.json files (one JSON document per line).
When I SELECT $1 FROM @myStg, Snowflake expands as many rows as there are records in the Avro or JSON files (cool), but the $1 variant is in JSON format in both cases, and now I wonder: whatever Snowflake FILE_FORMAT we use, do records always arrive as JSON in the variant $1?
I haven't tested CSV or other Snowflake file formats.
Or I wonder if I get JSON from the Avro files (from the Oracle table) because maybe the NiFi processor creates Avro files that internally use JSON format.
Maybe I am confusing things here. I know Avro files contain both:
an Avro schema (a language similar to JSON key/value pairs).
compressed data (binary).
Thanks,
Emanuel O.
I tried with CSV. With CSV, each record in the file is parsed into separate positional columns rather than into a single value.
With JSON, one complete JSON document is treated as one record, so it is displayed in JSON format.
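A sketch of the difference using the Python connector (the stage and file-format names are placeholders, and the connection details are omitted):

import snowflake.connector  # assumes snowflake-connector-python is installed

# Placeholder connection - fill in a real account/user/password.
conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

# JSON (or Avro) file format: each record arrives as a single VARIANT in $1,
# which the client renders as JSON text.
cur.execute("SELECT $1 FROM @myStg (FILE_FORMAT => 'my_json_format') LIMIT 5")
print(cur.fetchall())

# CSV file format: each field of a record maps to its own positional column,
# so $1 is only the first field of the record, not a whole-record VARIANT.
cur.execute("SELECT $1, $2 FROM @myStg (FILE_FORMAT => 'my_csv_format') LIMIT 5")
print(cur.fetchall())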

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a large UTF-8 JSON array file, and I switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110 GB and has ~150M JSON objects.
HDInsight PySpark does not appear to support the array-of-JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas, each containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the Unicode byte-order marker and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and end up with a distributed dataset for more efficient processing. The output dataframe now has columns named after the JSON elements, the schema is inferred, and it dynamically adapts for other file formats.
Here is the code - hope it helps!:
# ...Spark considers a file that is one big array of JSON objects to be an
# invalid input format, and Unicode files may be prefixed with a byte-order
# marker, so strip the BOM, the wrapping square brackets, and trailing commas
# from each line before handing the records to the JSON reader.
# ('partitions' holds the desired number of input partitions for the large file.)
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
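For what it's worth, Spark 2.2 and later also ship a multiLine option on the JSON reader that can parse an array-of-JSON-objects file directly; a minimal sketch, assuming that option exists in your Spark build (each file is then read as a single record, so it gives up the split-based parallelism of the approach above):

# Alternative for Spark 2.2+: let the JSON reader parse the whole array file.
df = sqlContext.read.option('multiLine', 'true').json('/a/valid/file/path')
df.printSchema()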

How to convert a nested JSON file into CSV in Scala

I want to convert my nested JSON into CSV. I used
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
but it works for normal JSON, not for nested JSON. Is there any way I can convert my nested JSON to CSV? Help will be appreciated, thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because of the simplicity of the CSV format: it just assigns a value to a name. That is why {"name1":"value1", "name2":"value2"...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting a JSON with several levels, so Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If you try to add only a second level to your JSON, it will work, but be careful. It will remove the names of the second level to include only the values in an array.
You can have a look at this link to see an example of working with JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV, you will need to simplify the structure of your data, for example by selecting the nested fields into flat, top-level columns as sketched below.
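For example, nested fields can be pulled up into flat, top-level columns before writing the CSV. A minimal sketch with made-up field names - shown here in PySpark for brevity, though the same select/alias calls exist in the Scala DataFrame API:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical nested records: {"name": ..., "address": {"city": ..., "zip": ...}}
df = spark.read.json("people_nested.json")

# Flatten the second level by selecting each leaf field into its own column.
flat = df.select(
    col("name"),
    col("address.city").alias("address_city"),
    col("address.zip").alias("address_zip"),
)

flat.write.format("com.databricks.spark.csv").option("header", "true").save("people_flat.csv")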
Read json file in spark and create dataframe.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the dataframe using spark-csv
people.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
Source :
read json
save to csv

JSON SerDe for Hive that supports JSON arrays

I have tried the JSON SerDe that Amazon provides for EMR instances, and it works great if you need to address/map JSON dictionary fields to columns. However, I haven't been able to figure out how to do the same with JSON arrays. For example, if there is a JSON array as follows:
[23123.32, "Text Text", { "key1": "value1" } ]
Is there a way to map the first element of the array to a column in a Hive table? What about the embedded dictionary fields?
I was struggling with the same problem until I found this SerDe on GitHub:
https://github.com/rcongiu/Hive-JSON-Serde
Just include it using the 'add jar' command once you start Hive, and it works like a charm.