Combining and loading JSON content as a Python dictionary - json

I have 100 JSON files (file1 - file100) in my directory. All 100 files have the same fields, and my aim is to load all their contents into one dictionary or dataframe. Basically, the content of each file (i.e. file1 - file100) will be a row in my dictionary or dataframe.
To test the code first, I wrote a script to load the contents of one JSON file:
file2 = open(r"\Users\sbz\file1.txt", "w+")

import json
import traceback

def read_json_file(file2):
    with open(file2, "r") as f:
        try:
            return json.load(f)
        except Exception:
            traceback.print_exc()
For combining, I wrote this:
def combine_dictionaries(dictionary_list):
    my_dictionary = {}
    for key in dictionary_list:
        my_dictionary.update(key)
    return my_dictionary
I am unable to load the file or display the contents of the dictionary using print(file2).
Is there something I am missing? Or is there a better way to loop over all 100 files and load them into a single dictionary?

If json.load isn't working, my guess is that your JSON file is probably formatted incorrectly. Try getting it to work with a simple file like:
{
    "test": 0
}
Once that works, try loading one of your 100 files. I copy-pasted your read_json_file function and I'm able to see the data in my file: print(read_json_file("data.json"))
For looping through the files and combining them:
It doesn't look like your combine_dictionaries function is quite there yet for what you want to do. update doesn't merge the dictionaries into rows as you want; it replaces the keys of one dictionary with the keys of another, and since each file has the same fields, the resulting dictionary will simply be the last one in the list.
Technically, a list of dictionaries is already a list of rows, which is what you want, and you can index the list by row number: for example, list_of_dictionaries[0] will get the dictionary created from file1 if you fill the list in order of file1 to file100. If you want to go further than file numbers, you can put all of these dictionaries into another dictionary, provided you can generate a unique key for each one:
def combine_dictionaries(dictionary_list):
    my_dictionary = {}
    for dictionary in dictionary_list:
        my_dictionary[generate_key(dictionary)] = dictionary
    return my_dictionary
Where generate_key is a function that will return a key unique to that dictionary. Now combined_dictionary.get(0) will get file1's dictionary, and combined_dictionary.get(0).get("somefield") will get the "somefield" data from file1.
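For the looping itself, here is a minimal sketch; the file names file1.txt to file100.txt and the pandas step are assumptions for illustration, not part of the original code:

import json
import pandas as pd  # only needed if you want a dataframe

def read_json_file(path):
    with open(path, "r") as f:
        return json.load(f)

# One dictionary per file; each dictionary is one "row".
dictionary_list = [read_json_file(f"file{i}.txt") for i in range(1, 101)]

# As a dataframe: each file becomes a row, each field a column.
df = pd.DataFrame(dictionary_list)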


convert nested json string column into map type column in spark

overall aim
I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns, which can later be queried downstream using the dot notation columnName.key
data nesting visualized
{
    key1: value1
    nestedType1: {
        key1: value1
        keyN: valueN
    }
    nestedType2: {
        key1: value1
        nestedKey: {
            key1: value1
            keyN: valueN
        }
    }
    keyN: valueN
}
current approach and problem
I am not using the default Spark JSON reader, as it results in some incorrect parsing of the files. Instead, I load the files as text files and then parse them with UDFs that use Python's json module (e.g. below), after which I use explode and pivot to get the first level of keys into columns.
@udf('MAP<STRING,STRING>')
def get_key_val(x):
    try:
        return json.loads(x)
    except Exception:
        return None
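A minimal sketch of the explode-and-pivot step mentioned above; the dataframe df_raw and the column names id and value are assumptions for illustration:

from pyspark.sql import functions as F

# df_raw: a row id column "id" plus the raw JSON string column "value";
# get_key_val is the MAP<STRING,STRING> UDF defined above.
df_map = df_raw.withColumn("kv", get_key_val(F.col("value")))

df_first_level = (
    df_map
    .select("id", F.explode("kv").alias("key", "val"))  # one row per map key
    .groupBy("id")
    .pivot("key")
    .agg(F.first("val"))
)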
After this initial transformation, I now need to convert the nestedType columns to valid map types as well. Since the initial function returns map<string,string>, the values in the nestedType columns are no longer valid JSON, so I cannot use json.loads; instead I have regex-based string operations:
@udf('MAP<STRING,STRING>')
def convert_map(string):
    try:
        regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
        obj = dict([(a.split('=')[0].strip(), a.split('=')[1]) for a in regex.findall(string)])
        return obj
    except Exception as e:
        return e
This is fine for the second level of nesting, but going any further would require another UDF and subsequent complications.
question
How can I use a Spark UDF or native Spark functions to parse the nested JSON data such that it is queryable in columnName.key format?
Also, there is no restriction on the Spark version. Hopefully I was able to explain this properly; do let me know if you want me to put up some sample data and the code for ease. Any help is appreciated.
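One native-function route worth sketching, not an authoritative answer: infer a schema from the raw strings and parse each line with from_json, which yields nested struct columns queryable with dot notation. The column name value below is an assumption:

from pyspark.sql import functions as F

# df is assumed to hold the raw JSON lines in a string column named "value".
inferred_schema = spark.read.json(df.rdd.map(lambda row: row.value)).schema

parsed = (
    df.withColumn("parsed", F.from_json(F.col("value"), inferred_schema))
      .select("parsed.*")
)

# Nested fields are now queryable with dot notation, e.g.
# parsed.select("nestedType2.nestedKey.key1")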

Get information out of large JSON file

I am new to JSON files and I'm struggling to get any information out of this one.
The structure of the JSON file is as follows:
Json file Structure
Now what I need is to access the "batches" key, to get the data for each variable.
I did try code (shown below) that I found for reaching deeper keys, but somehow I still didn't get any results.
1.
def safeget(dct, *keys):
    for key in keys:
        try:
            dct = dct[key]
        except KeyError:
            return None
    return dct

safeget(mydata, "batches")
2.
def dict_depth(mydata):
    if isinstance(mydata, dict):
        return 1 + (max(map(dict_depth, mydata.values()))
                    if mydata else 0)
    return 0

print(dict_depth(mydata))
The final goal would then be to create a loop to extract all the information, but that's something for the future.
Any help is highly appreciated, as are any recommendations on how I should ask things here in the future to get the best answers!
As far as I understood, you simply want to extract all the data without any ordering?
Then this should work out:
# Python program to read a JSON file
import json

# Opening the JSON file
f = open('data.json')

# returns the JSON object as a dictionary
data = json.load(f)

# Iterating through the json list
for i in data['emp_details']:
    print(i)

# Closing the file
f.close()
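Applied to the question's file, a minimal sketch might look like the following; since only a screenshot of the structure is given, the assumption here is that "batches" maps to a list of records:

import json

with open('data.json', encoding='utf-8') as f:
    data = json.load(f)

# "batches" and the per-record fields are placeholders for whatever
# the actual file contains.
for batch in data.get('batches', []):
    for variable, value in batch.items():
        print(variable, value)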

load a json file containing list of strings

I have a JSON file containing a list of strings like this:
['Hello\nHow are you?', 'What is your name?\nMy name is john']
I have to read this file and store it as a list of strings, but I am confused about how to read a JSON file like this. Also, I should use UTF-8 encoding.
Let's assume you have one or multiple lines as described in the JSON file. Here is my suggestion (remember to replace the file name test.json with yours):
import ast

with open("test.json", "r", encoding="utf-8") as input_file:
    line_list = input_file.readlines()

all_texts = [item for sublist in line_list for item in ast.literal_eval(sublist)]
print(all_texts)
The file you have shown is not in JSON format. Anyway, to read a JSON file you have to do the following:
import json

with open('path/to/file.json', encoding='utf-8') as f:
    jsonObj = json.load(f)
This will parse the file and store the resulting object in jsonObj.
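For reference, json.load only succeeds if the file uses JSON syntax, i.e. double-quoted strings. If the file instead contained, for example:
["Hello\nHow are you?", "What is your name?\nMy name is john"]
then it could be read directly with UTF-8 encoding like this:

import json

with open("test.json", "r", encoding="utf-8") as f:
    texts = json.load(f)  # a Python list of strings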

How to convert a multi-dimensional dictionary to json file?

I have uploaded a *.mat file that contains a 'struct' to my jupyter lab using:
from pymatreader import read_mat
data = read_mat(mat_file)
Now I have a multi-dimensional dictionary, for example:
data['Forces']['Ss1']['flap'].keys()
Gives the output:
dict_keys(['lf', 'rf', 'lh', 'rh'])
I want to convert this into a JSON file, keyed exactly by the keys that already exist, without doing so manually, because I want to apply this to many *.mat files with varying numbers of keys.
EDIT:
Unfortunately, I no longer have access to MATLAB.
An example of the desired output would look something like this:
json_format = {
    "Forces": {
        "Ss1": {
            "flap": {
                "lf": [1, 2, 3, 4],
                "rf": [4, 5, 6, 7],
                "lh": [23, 5, 6, 654, 4],
                "rh": [4, 34, 35, 56, 66]
            }
        }
    }
}
ANOTHER EDIT:
So after making lists of the subkeys (I won't elaborate on it), I did this:
FORCES = []
for ind in individuals:
    for force in forces:
        for wing in wings:
            FORCES.append({
                ind: {
                    force: {
                        wing: data['Forces'][ind][force][wing].tolist()
                    }
                }
            })
Then, to save:
with open(f'{ROOT_PATH}/Forces.json', 'w') as f:
    json.dump(FORCES, f)
That worked, but only because I looked up all of the keys manually... Also, for some reason, I have square brackets at the beginning and at the end of this JSON file.
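The square brackets appear because FORCES is a list of single-key dictionaries rather than one nested dictionary. A minimal sketch of accumulating into a nested dict instead, reusing the question's individuals/forces/wings lists:

FORCES = {}
for ind in individuals:
    for force in forces:
        for wing in wings:
            # build FORCES[ind][force][wing] without overwriting earlier keys
            FORCES.setdefault(ind, {}).setdefault(force, {})[wing] = \
                data['Forces'][ind][force][wing].tolist()

with open(f'{ROOT_PATH}/Forces.json', 'w') as f:
    json.dump(FORCES, f)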
The json package will output dictionaries to JSON:
import json

with open('filename.json', 'w') as f:
    json.dump(data, f)
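Since read_mat already returns the whole nested dictionary, you can in principle dump it in one go. A minimal sketch, assuming the leaf values are numpy arrays (which the .tolist() calls in the question suggest), using json.dump's default= hook; the file names are placeholders:

import json
import numpy as np
from pymatreader import read_mat

data = read_mat('example.mat')  # hypothetical file name

def to_serializable(obj):
    # Convert numpy arrays/scalars into plain Python types for JSON.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"{type(obj)} is not JSON serializable")

with open('example.json', 'w') as f:
    json.dump(data, f, default=to_serializable)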
If you are using MATLAB R2016b or later and want to go straight from MATLAB to JSON, check out jsonencode and jsondecode. For your purposes, jsonencode
encodes data and returns a character vector in JSON format.
MathWorks Docs
Here is a quick example that assumes your data is in the MATLAB variable test_data and writes it to the file specified in the variable json_file:
json_data = jsonencode(test_data);
writematrix(json_data,json_file);
Note: Some MATLAB data formats cannot be translated into JSON due to limitations in the JSON specification. However, it sounds like your data fits well within the JSON specification.

How to read whole file in one string

I want to read a JSON or XML file in PySpark. If my file is split across multiple lines in
rdd = sc.textFile(json or xml)
Input
{
    "employees":
    [
        {
            "firstName": "John",
            "lastName": "Doe"
        },
        {
            "firstName": "Anna"
        }
    ]
}
Input is spread across multiple lines.
Expected output: {"employees":[{"firstName":"John",......]}
How do I get the complete file onto a single line using PySpark?
There are three ways (I invented the third one; the first two are standard, built-in Spark functions). The solutions here are in PySpark:
textFile, wholeTextFiles, and a labeled textFile (key = file path, value = 1 line from the file; this is kind of a mix between the two standard ways to parse files).
1.) textFile
input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: array containing 1 line of the file as each entry, i.e. [line1, line2, ...]
2.) wholeTextFiles
input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: array of tuples, where the first item is the "key" with the file path and the second item contains 1 file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/', u'file1_contents'), (u'file:/home/folder_with_text_files/', u'file2_contents'), ...]
3.) "Labeled" textFile
input:
import glob
from pyspark import SparkContext
from pyspark.sql import SQLContext

SparkContext.stop(sc)
sc = SparkContext("local", "example")  # if running locally
sqlContext = SQLContext(sc)

Spark_Full = sc.emptyRDD()  # accumulator RDD, starts empty
for filename in glob.glob(Data_File + "/*"):
    # bind filename per iteration so each RDD keeps its own key
    Spark_Full += sc.textFile(filename).keyBy(lambda x, fn=filename: fn)
output: array with each entry containing a tuple using the filename as key and with value = each line of the file. (Technically, using this method you can also use a different key besides the actual file path, perhaps a hashed representation to save on memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine either as a list of lines:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file path):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
If your data is not formed on one line as textFile expects, then use wholeTextFiles.
This will give you the whole file so that you can parse it down into whatever format you would like.
This is how you would do it in Scala:
val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))
"How to read whole [HDFS] file in one string [in Spark, to use as sql]":
e.g.
// Put file to hdfs from edge-node's shell...
hdfs dfs -put <filename>
// Within spark-shell...
// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2
// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)
Python way
rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json = rdd.collect()[0][1]
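Building on the Python snippet above, a minimal sketch of turning that whole-file string into the single-line JSON the question asks for; this assumes the file contents are valid JSON and uses only the standard json module:

import json

rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
whole_file = rdd.collect()[0][1]                    # entire file as one string
doc = json.loads(whole_file)                        # parse the JSON document
one_line = json.dumps(doc, separators=(',', ':'))   # re-serialize on a single line
print(one_line)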