Python: Dump JSON Data Following Custom Format - json

I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file I obviously get all the data on a single line. However, I want my data to be dumped into the file following a format that I specify. For example (I had to use a picture to get the point across):
[Image: my custom JSON format]
I've looked up questions on custom JSONEncoders, but they all seem to deal with data types that aren't JSON serializable. I never found a solution for my specific need, which is having everything laid out in the manner that I want. Basically, I want all of the list elements to be on separate rows, but all of a dict's items to be on the same row. Do I need to write my own custom encoder, or is there some other approach I should take? Thanks!
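A minimal sketch of one way to get that layout without a custom encoder, assuming the data is a list of dicts (the function name, file name and sample data below are just placeholders): write the list brackets yourself and json.dumps each dict onto its own line.
import json

def dump_rows(items, path):
    # Each list element goes on its own line; each dict's items stay on that line.
    with open(path, 'w') as f:
        f.write('[\n')
        f.write(',\n'.join('    ' + json.dumps(item) for item in items))
        f.write('\n]\n')

dump_rows([{'player': 'Ann', 'score': 42}, {'player': 'Bob', 'score': 17}], 'scores.json')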

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API, take a token from the JSON returned, and pass that back to a function that calls the API again to get the next "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file after it has been downloaded into a dataframe, and then unioning that dataframe to a master dataframe, so essentially I'll have one big dataframe with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could another solution be to explicitly set the schema that the JSON files are read with, so that the field is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
Something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# empty dataframe that just carries the fixed schema
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)

df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is exactly what you are looking for, but I hope it helps.
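As a related sketch (my own addition, not part of the answer above): the type flip around file 66 can also be avoided by passing the same explicit schema to every read, so Spark never re-infers the problem field. The field names and the json_paths list here are hypothetical placeholders.
from pyspark.sql.types import StructType, StructField, StringType, LongType

fixed_schema = StructType([
    StructField('id', LongType(), True),
    StructField('flag_field', StringType(), True)   # hypothetical field; always read as a string
])

master = spark.createDataFrame([], fixed_schema)
for path in json_paths:   # hypothetical list of the 88 downloaded files
    df_json = spark.read.schema(fixed_schema).json(path)
    master = master.unionByName(df_json, allowMissingColumns=True)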

Is there a way to get columns names of dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, and all of them are around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data is JSON formatted and I'm reading it using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, with spark.read.json alone you cannot get the column names by reading only the first line.
Still, you can add an extra step first that extracts one line, creates a dataframe from it, and then extracts the column names.
One answer could be the following:
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with limit(1), since we just want one record to get the column names
Convert the dataframe to an RDD and collect it as a list with collect()
Convert the first collected row from a unicode string to a Python dict (since I'm working with JSON formatted data)
The keys of that dict are exactly what we are looking for (the column names as a Python list)
This code worked for me:
from ast import literal_eval

# Read the file as plain text, take only the first line, parse it into a dict,
# and return its keys, which are the column names.
literal_eval(spark.read.text('path').limit(1)
             .rdd.flatMap(lambda x: x)
             .collect()[0]).keys()
The reason this is faster is probably that PySpark won't load the whole dataset with all the field structures when you read it in text format (everything is read as one big string); it's lighter and more efficient for this specific case.
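A small follow-up note (my assumption, not from the answer above): json.loads parses that same first line and also handles JSON literals such as true, false and null, which ast.literal_eval would reject.
import json

# Same idea: read only one line as plain text, then let json.loads expose the keys.
first_line = (spark.read.text('path')
                   .limit(1)
                   .rdd.map(lambda row: row.value)
                   .collect()[0])
columns = list(json.loads(first_line).keys())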

Cannot identify proper format for a json request body stored and used in csv file for use in a karate scenario

I am having trouble identifying the proper format to store a JSON request body in a CSV file, and then use the CSV file value in a scenario.
This works properly within a scenario:
And request '{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}'
However, when it is stored in the CSV file as follows (I've tried quite a number of other formatting variations)
'{"contextURN":"urn:com.myco.here:env:booking:reservation:0987654321","individuals":[{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:12345678","name":{"firstName":"NUNYA","lastName":"BIDNESS"},"dateOfBirth":"1980-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"LANDBRANCH","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"},{"individualURN":"urn:com.myco.here:env:booking:reservation:0987654321:individual:23456789","name":{"firstName":"NUNYA","lastName":"BIZNESS"},"dateOfBirth":"1985-03-01","address":{"streetAddressLine1":"1 Myplace","streetAddressLine2":"","city":"BRANCHLAND","countrySubdivisionCode":"WV","postalCode":"25506","countryCode":"USA"},"objectType":"INDIVIDUAL"}]}',
and used in the scenario as:
And request requestBody
my test returns "javascript evaluation failed: ", followed by the JSON above, and ":1:63 Missing close quote ^ at line number 1 at column number 63".
Can you please identify the correct formatting or the usage error I am missing? Thanks!
We just use a basic CSV library behind the scenes. I suggest you roll your own Java helper class that does whatever processing / pre-processing you need.
Do read this answer as well: https://stackoverflow.com/a/54593057/143475
I can't make sense of your JSON but if you are trying to fit JSON into CSV, sorry - that's not a good idea. See this answer: https://stackoverflow.com/a/62449166/143475

Pentaho Data Integration - Two flows saving into same JSON output

I'm doing a transformation that has two different flows. At the end of the transformation the two flows converge and save the data into the same JSON output file. Verifying a specific column on the result file, the values are strange. They look as follows:
Column
[B#3e8fe299
[B#50b541fb
[B#44b719d4
[B#7dad3c13
[B#6e46a542
[B#170d9515
When I save to different files this doesn't occur and the values stay correct. Does anyone know what could be causing it and how I can solve it?
Thanks.
Looks like you're printing out Java byte array object IDs. Here is a link which shows values similar to yours ([B#...):
Java: Syntax and meaning behind "[B#1ef9157"? Binary/Address?
Can you verify what types your fields are?

Editing JSON - Add Attribute

I have a slew of JSON files I'm getting dumps of, with data from the day/period it was pulled. Most of the JSON files I'm dealing with are a lot larger than this, but I figured a smaller one would be easier to work with.
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820}
What I'm essentially trying to do is add a date attribute to each line of data, so that when I combine multiple JSON files to put through an analytical tool, the right row of data is associated with the correct date. My first thought was to write it as such:
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820,"date":072617}
since the "top" and "total" attributes are showing up on each row of data (with the associated values also showing up on each row) when I put it through an analytical tool like Tableau.
Also, I have been editing and saving files through Brackets, and testing things through this converter (https://konklone.io/json/).
In JavaScript:
var m = JSON.parse(json_string);   // parse the file contents into an object
m["date"] = "20170804";            // add the date attribute
var edited = JSON.stringify(m);    // serialize back to a JSON string
This will work for you, and it's very simple.
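As an extra sketch (my own addition, not part of the answer above): since the goal is to stamp each of many dump files with its own date, the same edit can be scripted in Python; the glob pattern and the hard-coded date below are hypothetical placeholders.
import glob
import json

for path in glob.glob('dumps/*.json'):   # hypothetical location of the dump files
    with open(path) as f:
        data = json.load(f)
    data['date'] = '20170804'            # or derive the date from the file name
    with open(path, 'w') as f:
        json.dump(data, f)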