How to read whole file in one string - json

I want to read a JSON or XML file in PySpark. My file is split across multiple lines when I load it with:
rdd = sc.textFile(json or xml)
Input:
{
  "employees":
  [
    {
      "firstName":"John",
      "lastName":"Doe"
    },
    {
      "firstName":"Anna"
    }
  ]
}
Input is spread across multiple lines.
Expected output: {"employees":[{"firstName":"John",......]}
How to get the complete file in a single line using pyspark?

There are three ways (the first two are standard built-in Spark functions; I came up with the third). The solutions here are in PySpark:
textFile, wholeTextFiles, and a "labeled" textFile (key = file, value = one line from the file; this is a mix of the other two ways of parsing files).
1.) textFile
input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: an RDD in which each entry is one line of the file, i.e. [line1, line2, ...]
2.) wholeTextFiles
input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: an RDD of tuples; the first item is the "key" holding the file path and the second item is that file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]
3.) "Labeled" textFile
input:
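import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)  # stop the existing context first
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Data_File = '/home/folder_with_text_files'  # folder holding the input files
Spark_Full = sc.emptyRDD()  # start empty, then union in one keyed RDD per file
for filename in glob.glob(Data_File + "/*"):
    # bind filename as a default argument so each lambda keeps its own value
    Spark_Full += sc.textFile(filename).keyBy(lambda line, name=filename: name)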
output: an RDD in which each entry is a tuple with the filename as key and one line of the file as value. (Technically, with this method you can also use a different key besides the actual file path, perhaps a hashed representation to save memory.) For example:
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is close to what you get from wholeTextFiles, but with the lines joined by spaces and with the string "file:" stripped from the file path):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()

If your data is not formed on one line as textFile expects, then use wholeTextFiles.
This will give you the whole file so that you can parse it down into whatever format you would like.
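As a rough sketch of that parsing step: once a whole file is available as one string (for example the value half of each pair returned by wholeTextFiles), the standard json module can re-serialize it onto a single line. The helper name to_single_line and the sample text are assumptions made for illustration, and the Spark call in the trailing comment is not executed here.

```python
import json

def to_single_line(text):
    """Parse a multi-line JSON document and re-serialize it on one line."""
    return json.dumps(json.loads(text), separators=(",", ":"))

# Sample multi-line input, modeled on the file in the question.
multi_line = """{
  "employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna"}
  ]
}"""

single_line = to_single_line(multi_line)
print(single_line)

# With Spark (assumed usage, not run here):
#   sc.wholeTextFiles(path).mapValues(to_single_line)
```

Going through json.loads also validates the input, so a malformed file raises an error instead of silently producing a broken single line.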

This is how you would do it in Scala:
val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))

"How to read whole [HDFS] file in one string [in Spark, to use as sql]":
e.g.
// Put file to hdfs from edge-node's shell...
hdfs dfs -put <filename>
// Within spark-shell...
// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2
// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)

Python way:
rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json_text = rdd.collect()[0][1]  # named json_text so it does not shadow the json module

Related

load a json file containing list of strings

I have a json file containing a list of strings like this:
['Hello\nHow are you?', 'What is your name?\nMy name is john']
I have to read this file and store it as a list of strings, but I am confused about how to read a JSON file like this. I also need to use UTF-8 encoding.
Let's assume you have one or multiple lines as described in the json file. Here is my suggestion (Remember to replace the file name test.json to yours):
import ast
with open("test.json", "r", encoding="utf-8") as input_file:
    line_list = input_file.readlines()
all_texts = [item for sublist in line_list for item in ast.literal_eval(sublist)]
print(all_texts)
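The file you have shown is not in JSON format (JSON strings must use double quotes). Anyway, to read a JSON file you have to do the following:
import json
with open('path/to/file.json', encoding='utf-8') as f:
    jsonObj = json.load(f)
This returns the parsed content (a dictionary for a JSON object, a list for a JSON array) and stores it in jsonObj.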
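To make the difference between the two calls concrete, here is a minimal sketch that writes a valid JSON list of strings to a temporary file and reads it back with UTF-8 encoding. The temporary path and the variable names are assumptions for illustration only.

```python
import json
import os
import tempfile

# Write a small, valid JSON file to read back (JSON uses double-quoted strings).
texts = ["Hello\nHow are you?", "What is your name?\nMy name is john"]
path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(texts, f)

# json.load takes an open file object; json.loads takes a string of JSON text.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded)
```

Because the top-level value in this file is a JSON array, the result is a Python list of strings rather than a dictionary.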

How to convert a multi-dimensional dictionary to json file?

I have uploaded a *.mat file that contains a 'struct' to my jupyter lab using:
from pymatreader import read_mat
data = read_mat(mat_file)
Now I have a multi-dimensional dictionary, for example:
data['Forces']['Ss1']['flap'].keys()
Gives the output:
dict_keys(['lf', 'rf', 'lh', 'rh'])
I want to convert this into a JSON file using exactly the keys that already exist, without doing so manually, because I want to apply it to many *.mat files with varying numbers of keys.
EDIT:
Unfortunately, I no longer have access to MATLAB.
An example for desired output would look something like this:
json_format = {
    "Forces": {
        "Ss1": {
            "flap": {
                "lf": [1, 2, 3, 4],
                "rf": [4, 5, 6, 7],
                "lh": [23, 5, 6, 654, 4],
                "rh": [4, 34, 35, 56, 66]
            }
        }
    }
}
ANOTHER EDIT:
So after making lists of the subkeys (I won't elaborate on it), I did this:
FORCES = []
for ind in individuals:
    for force in forces:
        for wing in wings:
            FORCES.append({
                ind: {
                    force: {
                        wing: data['Forces'][ind][force][wing].tolist()
                    }
                }
            })
Then, to save:
with open(f'{ROOT_PATH}/Forces.json', 'w') as f:
    json.dump(FORCES, f)
That worked, but only because I looked up all of the keys manually... Also, for some reason, I have square brackets at the beginning and at the end of this JSON file.
The json package will output dictionaries to JSON:
import json
with open('filename.json', 'w') as f:
    json.dump(data, f)
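One caveat: json.dump only handles built-in types, and the question's own code calls .tolist() on the leaf values, which suggests they are NumPy arrays. A minimal sketch of a recursive conversion that needs no manual list of keys follows. The helper name to_jsonable is hypothetical, and FakeArray is only a stand-in so the example runs without NumPy installed.

```python
import json

def to_jsonable(obj):
    """Recursively convert nested dicts so that json.dump can handle them."""
    if isinstance(obj, dict):
        return {str(key): to_jsonable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(item) for item in obj]
    if hasattr(obj, "tolist"):  # NumPy arrays and scalars expose tolist()
        return obj.tolist()
    return obj

class FakeArray:
    """Stand-in for a NumPy array so the sketch runs without NumPy."""
    def __init__(self, values):
        self.values = values

    def tolist(self):
        return list(self.values)

data = {"Forces": {"Ss1": {"flap": {"lf": FakeArray([1, 2, 3, 4]),
                                    "rf": FakeArray([4, 5, 6, 7])}}}}
json_text = json.dumps(to_jsonable(data))
print(json_text)
```

Passing the converted dictionary to json.dump writes a file whose top level is an object keyed exactly like the original structure, rather than a list.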
If you are using MATLAB R2016b or later and want to go straight from MATLAB to JSON, check out jsonencode and jsondecode. For your purposes, jsonencode encodes data and returns a character vector in JSON format.
MathWorks Docs
Here is a quick example that assumes your data is in the MATLAB variable test_data and writes it to a file specified in the variable json_file
json_data = jsonencode(test_data);
writematrix(json_data,json_file);
Note: Some MATLAB data formats cannot be translated into JSON data due to limitations in the JSON specification. However, it sounds like your data fits well with the JSON specification.

Combining and loading Json content as python dictionary

I have 100 JSON files (file1 to file100) in my directory. All 100 have the same fields, and my aim is to load all contents into one dictionary or dataframe. Basically, the content of each file (i.e. file1 to file100) will be a row in my dictionary or dataframe.
To test the code first, I wrote a script to load the contents of one JSON file:
file2 = open(r"\Users\sbz\file1.txt","w+")
import json
import traceback
def read_json_file(file2):
    with open(file2, "r") as f:
        try:
            return json.load(f)
        except Exception:
            traceback.print_exc()
For combining, I wrote this:
def combine_dictionaries(dictionary_list):
    my_dictionary = {}
    for key in dictionary_list:
        my_dictionary.update(key)
    return my_dictionary
I am unable to load the file or display the contents of the dictionary using print(file2).
Is there something I am missing? Or is there a better way to loop over all 100 files and load them as a single dictionary?
If json.load isn't working, my guess is that your JSON file is probably formatted incorrectly. Try getting it to work with a simple file like:
{
    "test": 0
}
After that works, then try loading one of your 100 files. I copy-pasted your read_json_file function and I'm able to see the data in my file: print(read_json_file("data.json"))
For looping through the files and combining them:
It doesn't look like your combine_dictionaries function is quite there yet for what you want to do. update doesn't merge the dictionaries into rows as you want; it overwrites the values of keys that already exist, and since each file has the same fields, the resulting dictionary will equal the last one in the list. Technically, a list of dictionaries is already a list of rows, which is what you want, and you can index the list by row number: for example, list_of_dictionaries[0] will get the dictionary created from file1 if you fill the list in order from file1 to file100. If you want to go further than file numbers, you can put all of these dictionaries into another dictionary, provided you can generate a unique key for each one:
def combine_dictionaries(dictionary_list):
    my_dictionary = {}
    for dictionary in dictionary_list:
        my_dictionary[generate_key(dictionary)] = dictionary
    return my_dictionary
Where generate_key is a function that returns a key unique to that dictionary. If combined_dictionary is the result and generate_key returns 0 for file1's dictionary, combined_dictionary.get(0) will get file1's dictionary, and combined_dictionary.get(0).get("somefield") will get the "somefield" data from file1.
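The list-of-rows approach described above can be sketched as follows. The helper name load_rows, the temporary folder, and the three sample files are assumptions that stand in for the real file1 to file100.

```python
import glob
import json
import os
import tempfile

def load_rows(pattern):
    """Load every JSON file matching pattern into a list of dictionaries."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            rows.append(json.load(f))
    return rows

# Create three small files with the same fields to stand in for file1..file100.
folder = tempfile.mkdtemp()
for i in range(1, 4):
    with open(os.path.join(folder, f"file{i}.json"), "w", encoding="utf-8") as f:
        json.dump({"id": i, "somefield": f"value{i}"}, f)

rows = load_rows(os.path.join(folder, "*.json"))
print(rows[0]["somefield"])
```

Note that sorted orders names lexically, so with 100 files file10 would come before file2; a numeric sort key would be needed if the order matters. A list of dictionaries like this can also be passed to pandas.DataFrame to get one row per file.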

Identify empty JSON files with Spark 2.4

I want to avoid processing empty JSON files. Some empty JSON files I am getting only contain the open and close square brackets, like: [] . Files containing only that should be understood as empty files.
With Spark 2.2 the following line would return true:
spark.read.json(pathToFile).isEmpty
But with Spark 2.4 it returns false.
How do I go about identifying this type of file as empty when using Spark 2.4?
Look at the columns: for such a file the resulting dataframe has none, so stuff.columns.isEmpty identifies it.
val stuff = spark.read.json("hdfs:///user/me/empty.json")
scala> stuff.columns
res6: Array[String] = Array()
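A different option, sketched here in plain Python rather than with the columns check above, is to inspect the raw text of each file before handing it to Spark. The helper name is_empty_json is hypothetical, and applying it to the values returned by wholeTextFiles is an assumption about how it would be wired in, not something taken from the answer.

```python
import json

def is_empty_json(text):
    """Return True if the text is blank or parses to an empty list or object."""
    stripped = text.strip()
    if not stripped:
        return True
    parsed = json.loads(stripped)
    return parsed == [] or parsed == {}

print(is_empty_json("[]"))
print(is_empty_json('[{"a": 1}]'))
```

This reads each file fully as a string, so it is only reasonable when the individual files are small enough to fit in memory.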

Reading massive JSON files into Spark Dataframe

I have a large nested NDJ (new line delimited JSON) file that I need to read into a single spark dataframe and save to parquet. In an attempt to render the schema I use this function:
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
on the dataframe that is returned by reading with
val df = sqlCtx.read.json(sparkContext.wholeTextFiles(path).values)
I've also switched this to val df = spark.read.json(path) so that this only works with NDJs and not multi-line JSON--same error.
This is causing an out of memory error on the workers
java.lang.OutOfMemoryError: Java heap space.
I've altered the JVM memory options and the Spark executor/driver options to no avail.
Is there a way to stream the file, flatten the schema, and add to a dataframe incrementally? Some lines of the JSON contain new fields that the preceding entries lack, so those would need to be filled in later.
No workaround. The issue was with the JVM object limit. I ended up using a Scala JSON parser and built the dataframe manually.
You can achieve this in multiple ways.
First, while reading, you can either provide the schema for the dataframe or let Spark infer the schema by itself.
Once the JSON is in a dataframe, you can flatten it in the following ways:
a. Use explode() on the dataframe to flatten it.
b. Use Spark SQL and access the nested fields with the . operator. You can find examples here
Lastly, if you want to add new columns to the dataframe:
a. Using withColumn() is one approach. However, it has to be called once for each new column, and each call runs over the entire data set.
b. Using SQL to generate a new dataframe from the existing one may be the easiest.
c. Using map: access the elements, get the old schema, add the new values, create the new schema, and finally build the new dataframe, as below.
One withColumn call works on the entire RDD, so it is generally not good practice to use it for every column you want to add. Instead, you can work with the columns and their data inside a map function. Since a single map function does the job, the code that adds the new columns and their data runs in parallel.
a. you can gather new values based on the calculations
b. Add these new column values to main rdd as below
val newColumns: Seq[Any] = Seq(newcol1,newcol2)
Row.fromSeq(row.toSeq.init ++ newColumns)
Here, row is the reference to the row inside the map method.
c. Create new schema as below
val newColumnsStructType = StructType(Seq(StructField("newcolName1", IntegerType), StructField("newColName2", IntegerType)))
d. Add to the old schema
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)
e. Create new dataframe with new columns
val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
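For readers who want to see the flattening idea in isolation, here is a plain-Python analogue of the recursive flattenSchema from the question. It works on one parsed JSON record instead of a Spark schema, building the same dotted names for nested fields. The function name flatten and the sample record are assumptions for illustration; this is not Spark code.

```python
def flatten(record, prefix=None):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for name, value in record.items():
        col_name = name if prefix is None else prefix + "." + name
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, col_name))
        else:
            flat[col_name] = value
    return flat

record = {"id": 1, "user": {"name": "a", "address": {"city": "x"}}}
flat = flatten(record)
print(flat)
```

Records that lack a field simply produce fewer keys, which mirrors the situation in the question where later lines introduce fields that earlier ones do not have.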