Identify empty JSON files with Spark 2.4 - json

I want to avoid processing empty JSON files. Some empty JSON files I am getting only contain the open and close square brackets, like: [] . Files containing only that should be understood as empty files.
With Spark 2.2 the following line would return true:
spark.read.json(pathToFile).isEmpty
But with Spark 2.4 it returns false.
How do I go about identifying this type of file as empty when using Spark 2.4?

Look at columns
val stuff = spark.read.json("hdfs:///user/me/empty.json")
scala> stuff.columns
res6: Array[String] = Array()

Related

convert nested json string column into map type column in spark

overall aim
I have data landing into blob storage from an azure service in form of json files where each line in a file is a nested json object. I want to process this with spark and finally store as a delta table with nested struct/map type columns which can later be queried downstream using the dot notation columnName.key
data nesting visualized
{
key1: value1
nestedType1: {
key1: value1
keyN: valueN
}
nestedType2: {
key1: value1
nestedKey: {
key1: value1
keyN: valueN
}
}
keyN: valueN
}
current approach and problem
I am not using the default spark json reader as it is resulting in some incorrect parsing of the files instead I am loading the files as text files and then parsing using udfs by using python's json module ( eg below ) post which I use explode and pivot to get the first level of keys into columns
#udf('MAP<STRING,STRING>' )
def get_key_val(x):
try:
return json.loads(x)
except:
return None
Post this initial transformation I now need to convert the nestedType columns to valid map types as well. Now since the initial function is returning map<string,string> the values in nestedType columns are not valid jsons so I cannot use json.loads, instead I have regex based string operations
#udf('MAP<STRING,STRING>' )
def convert_map(string):
try:
regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
obj = dict([(a.split('=')[0].strip(),(a.split('=')[1])) for a in regex.findall(s)])
return obj
except Exception as e:
return e
this is fine for second level of nesting but if I want to go further that would require another udf and subsequent complications.
question
How can I use a spark udf or native spark functions to parse the nested json data such that it is queryable in columnName.key format.
also there is no restriction of spark version, hopefully I was able to explain this properly. do let me know if you want me to put some sample data and the code for ease. Any help is appreciated.

Combining and loading Json content as python dictionary

Ihave 100 json files (file1- file 100)in my directory.All these 100 have the same fields and my aim is to load allcontents in one dictionary or dataframe.Basically the content of each file (ie file1- file100) willbe a row for my dictionary or dataframe
To test the code first,I wrote a script to load contents from one json file
file2 = open(r"\Users\sbz\file1.txt","w+")
import json
import traceback
def read_json_file(file2):
with open(file2, "r") as f:
try:
return json.load(f)
for combining i wrote this
def combine_dictionaries(dictionary_list):
my_dictionary = {}
for key in dictionary_list:
my_dictionary.update(key)
return my_dictionary
I am unable to load the file or display contents of dictionary using print(file2)
Is there something I am missing? Or is there better wayto loop in all 100 files and load them as a single dictionary?
If json.load isn't working, my guess is that your JSON file is probably formatted incorrectly. Try getting it to work with a simple file like:
{
"test": 0
}
After that works, then try loading one of your 100 files. I copy-pasted your read_json_file function and I'm able to see the data in my file: print(read_json_file("data.json"))
For looping through the files and combining them:
It doesn't look like your combine_dictionaries function is 100% there yet for what you want to do. update doesn't merge the dictionaries into rows as you want; it will replace the keys of one dictionary with the keys of another, and since each file has the same fields the resulting dictionary will be the last one in the list. Technically, a list of dictionaries is already a list of rows which is what you want and you can index the list based on row number, for example, list_of_dictionaries[0] will get the dictionary created from file1 if you fill the list in order of file1 to file100. If you want to go further than file numbers, you can put all of these dictionaries into another dictionary if you can generate a unique key for each dictionary:
def combine_dictionaries(dictionary_list):
my_dictionary = {}
for dictionary in dictionary_list:
my_dictionary[generate_key(dictionary)] = dictionary
return my_dictionary
Where generate_key is a function that will return a key unique to that dictionary. Now combined_dictionary.get(0) will get file1's dictionary, and combined_dictionary.get(0).get("somefield") will get the "somefield" data from file1.

Reading massive JSON files into Spark Dataframe

I have a large nested NDJ (new line delimited JSON) file that I need to read into a single spark dataframe and save to parquet. In an attempt to render the schema I use this function:
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case _ => Array(col(colName))
}
})
}
on the dataframe that is returned by reading by
val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)
I've also switched this to val df = spark.read.json(path) so that this only works with NDJs and not multi-line JSON--same error.
This is causing an out of memory error on the workers
java.lang.OutOfMemoryError: Java heap space.
I've altered the jvm memory options and spark executor/driver options to no avail
Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields from the preceding entires...so those would need to be filled in later.
No work around. The issue was with the JVM object limit. I ended up using a scala json parser and built the dataframe manually.
You can achieve this in multiple ways.
First while reading, you can provide the schema for dataframe to read json or you can allow the spark to infer the schema by itself.
Once the json is in dataframe, you can follow the following ways to flatten it.
a. Using explode() on dataframe - to flatten it.
b. Using spark sql and access the nested fields using . operator. You can find examples here
Lastly, if you want to add new columns to dataframe
a. First option,using withColumn() is one approach. However this will be done for each new column added and for entire data set.
b. Using sql to generate new dataframe from existing - this may be easiest
c. Lastly, using map, then accessing elements, get old schema, add new values, create new schema and finally get the new df - as below
One withColumn will work on entire rdd. So generally its not a good practise to use the method for every column you want to add. There is a way where you work with columns and their data inside a map function. Since one map function is doing the job here, the code to add new column and its data will be done in parallel.
a. you can gather new values based on the calculations
b. Add these new column values to main rdd as below
val newColumns: Seq[Any] = Seq(newcol1,newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here row, is the reference of row in map method
c. Create new schema as below
val newColumnsStructType = StructType{Seq(new StructField("newcolName1",IntegerType),new StructField("newColName2", IntegerType))
d. Add to the old schema
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)

Loading json data into hive using spark sql

I am Unable to push json data into hive Below is the sample json data and my work . Please suggest me the missing one
json Data
{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani#gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani#gmail.com"
},
{
"userId":"thanks",
"jobTitleName":"Program Directory",
"firstName":"Tom",
"lastName":"Hanks",
"preferredFullName":"Tom Hanks",
"employeeCode":"E3",
"region":"CA",
"phoneNumber":"408-2222222",
"emailAddress":"tomhanks#gmail.com"
}
]
}
I tried to use sqlcontext and jsonFile method to load which is failing to parse the json
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
error is : java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried in different way and able to crack and get the schema
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema seperately for each record but I am missing an idea to get all the data and load into hive.
Any help or suggestion is much appreaciated.
Here is the Cracked answer
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
val hc=new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf=hc.jsonRDD(jsonData)
val dfemp=fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df=sql("select * from empdummy")
df.select ("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")
Any suggestion for optimising the above code.
SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
* Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
* It goes through the entire dataset once to determine the schema.
*
* #group specificdata
* #deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
*/
#deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file)
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. Don't think it is that much of priority, but just for record.
You have two options
Convert your json data to the supported format, one object per line
Have one file per JSON object - this will result in too many files.
Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json.
Examples from spark documentation

How to read whole file in one string

I want to read json or xml file in pyspark.lf my file is split in multiple line in
rdd= sc.textFile(json or xml)
Input
{
" employees":
[
{
"firstName":"John",
"lastName":"Doe"
},
{
"firstName":"Anna"
]
}
Input is spread across multiple lines.
Expected Output {"employees:[{"firstName:"John",......]}
How to get the complete file in a single line using pyspark?
There are 3 ways (I invented the 3rd one, the first two are standard built-in Spark functions), solutions here are in PySpark:
textFile, wholeTextFile, and a labeled textFile (key = file, value = 1 line from file. This is kind of a mix between the two given ways to parse files).
1.) textFile
input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: array containing 1 line of file as each entry ie. [line1, line2, ...]
2.) wholeTextFiles
input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: array of tuples, first item is the "key" with the filepath, second item contains 1 file's entire contents ie.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', file2_contents), ...]
3.) "Labeled" textFile
input:
import glob
from pyspark import SparkContext
SparkContext.stop(sc)
sc = SparkContext("local","example") # if running locally
sqlContext = SQLContext(sc)
for filename in glob.glob(Data_File + "/*"):
Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)
output: array with each entry containing a tuple using filename-as-key with value = each line of file. (Technically, using this method you can also use a different key besides the actual filepath name- perhaps a hashing representation to save on memory). ie.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back to single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the filepathing.):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
If your data is not formed on one line as textFile expects, then use wholeTextFiles.
This will give you the whole file so that you can parse it down into whatever format you would like.
This is how you would do in scala
rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t=>println(t._2))
"How to read whole [HDFS] file in one string [in Spark, to use as sql]":
e.g.
// Put file to hdfs from edge-node's shell...
hdfs dfs -put <filename>
// Within spark-shell...
// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2
// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)
Python way
rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json = rdd.collect()[0][1]