Pyspark: write json from schema

I need to delete certain entries from nested JSON files. As far as I know, I can't just delete them from the JSON file directly, so my next choice would be to load them into a PySpark DataFrame, delete the entries there, create a new JSON file with the same schema (and preferably the same name), and replace the old JSON file. I have extracted the schema into a JSON file. Is there a way to write the DataFrame back into a JSON file, somehow parsing the extracted schema?
Thanks!

Spark's DataFrameWriter has a mode() method to specify the save mode. In PySpark the argument is one of the strings below; the SaveMode constants exist only in the Scala and Java APIs.
overwrite – overwrites the existing output (SaveMode.Overwrite in Scala/Java).
append – adds the data to the existing output (SaveMode.Append).
ignore – skips the write operation when the output already exists (SaveMode.Ignore).
errorifexists or error – the default; raises an error when the output already exists (SaveMode.ErrorIfExists).
Note that Spark writes a directory of part files at the given path, not a single file.
df2.write.mode("overwrite").json("/tmp/spark_output/zipcodes.json")

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I have a procedure where I make multiple calls to an API and, using a token within the returned JSON, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file into a DataFrame after it has been downloaded, and then attempted to union that DataFrame to a master DataFrame, so that I end up with one big DataFrame containing all 88 JSON files.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then I'm guessing that when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, but a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into, so that it is always the same type?

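If you know the attributes of those files, you can define the schema before reading them and create an empty DataFrame with that schema, so you can do a unionByName with allowMissingColumns=True. Passing the same schema to the reader with spark.read.schema(my_schema) also keeps the field types identical across files.
Something like:
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType
# Schema that every file is expected to follow
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])
# Empty DataFrame that carries the target schema
output = spark.createDataFrame([], my_schema)
df_json = spark.read.[...your JSON file...]
# unionByName returns a new DataFrame, so keep the result
output = output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps.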

JSON variable indent for different entries

Background: I want to store a dict object in json format that has say, 2 entries:
(1) Some object that describes the data in (2). This is small data, mostly definitions, control parameters, and other things (maybe called metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself, which is a large chunk and should be machine readable (no need for a human to look over it on opening the file).
Problem: How can I specify a custom indent, say 4 for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4), where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data will contain a lot of whitespace, making the file large.

Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, and to use your JSON formatter as appropriate for each of the contained objects.
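The steps above can be sketched in Python with the standard json module. This is a minimal sketch, not the only way to do it: the function name write_mixed_indent and the sample values are hypothetical, while small_description and big_chunk follow the names used in the question.

```python
import io
import json


def write_mixed_indent(fp, small_description, big_chunk):
    """Write {"meta_data": ..., "actual_data": ...} with the container formatted by hand."""
    fp.write('{"meta_data":\n')
    # Human-readable part: pretty-printed with a 4-space indent.
    fp.write(json.dumps(small_description, indent=4))
    fp.write(',\n"actual_data":\n')
    # Machine-readable part: most compact form, no extra whitespace.
    fp.write(json.dumps(big_chunk, separators=(",", ":")))
    fp.write("\n}")


# Example with an in-memory buffer; a real file object works the same way.
buffer = io.StringIO()
write_mixed_indent(buffer, {"units": "mV", "rate_hz": 100}, [1, 2, 3, 4])
text = buffer.getvalue()

# The result is still one valid JSON document.
round_trip = json.loads(text)
```

Because only the two inner values go through json.dumps, each can get its own indent setting, and the file still loads with a plain json.load.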

Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
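A file in this concatenated format can also be read back with only the standard library. The sketch below uses json.JSONDecoder.raw_decode instead of the ijson library mentioned above, and it assumes the whole file fits in memory as a string; the helper name iter_json_documents and the sample content are hypothetical.

```python
import json


def iter_json_documents(text):
    """Yield each JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Build a small sample in the interleaved key/value layout shown above.
sample = (
    json.dumps({"next_item": "meta_data"}) + "\n"
    + json.dumps({"description": "human-readable content goes here"}, indent=4) + "\n"
    + json.dumps({"next_item": "actual_data"}) + "\n"
    + json.dumps(["big", "machine-readable"], separators=(",", ":")) + "\n"
)

documents = list(iter_json_documents(sample))
# Pair each {"next_item": key} marker with the document that follows it.
data = {
    marker["next_item"]: value
    for marker, value in zip(documents[0::2], documents[1::2])
}
```

For files too large to hold in memory, a streaming parser such as ijson is likely the better fit.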

How to parse lua table object into json?

I was wondering if there is a way to convert a Lua table into a JavaScript object without using any libraries, i.e. without require("json"). I haven't seen one yet, but if someone knows how, please answer.

If you want to know how to parse Lua tables to JSON strings take a look into the source code of any of the many JSON libraries available for Lua.
http://lua-users.org/wiki/JsonModules
For example:
https://github.com/rxi/json.lua/blob/master/json.lua
or
https://github.com/LuaDist/dkjson/blob/master/dkjson.lua

If you do not want to use any library and want to do it with pure Lua code, the most convenient way for me is to use the table.concat function:
local parts = {}
for key, value in pairs(tableWithData) do
    -- prepare JSON key-value pairs and save them in a separate table;
    -- string values would additionally need quoting and escaping
    table.insert(parts, string.format("\"%s\":%s", tostring(key), tostring(value)))
end
-- join the pairs into a simple JSON object string
local result = "{" .. table.concat(parts, ",") .. "}"
If your table has nested tables, you can do this recursively.

There are a lot of pure-Lua JSON libraries.
I even have one myself.
How to include a pure-Lua module in your script without using require():
Download the Lua JSON module (for example, go to my json.lua, right-click on Raw and select Save Link as in the context menu)
Delete the last line, return json, from this file
Insert the whole file at the beginning of your script
Now you can use local json_as_string = json.encode(your_Lua_table) in your script.

passing variable to json file while matching response in karate

I'm validating my response from a GET call through a .json file
match response == read('match_response.json')
Now I want to reuse this file for various other features as only one field in the .json varies. Let's say this param in the json file is "varyingField"
I'm trying to pass this field every time I match the response, but I'm not able to:
def varyingField = 'VARIATION1'
match response == read('match_response.json') {'varyingField' : '#(varyingField)'}}
In the json file I have
"varyingField": "#(varyingField)"

You are trying to pass an argument to read for a JSON file? Sorry, such a thing is not supported in Karate; please read the docs.
Use this pattern:
create a JSON file that has all your "happy path" values set
use the read() syntax to load the file (which means it is re-usable across multiple tests)
use the set keyword to update only the field for your scenario or negative test
For more details, refer to this answer: https://stackoverflow.com/a/51896522/143475

Using Jmeter, I need to add UUID extracted from JSON in CSV in same column (multiple values of UUID) So to pass in Delete Path

Using JMeter, I need to take the UUID extracted from JSON and add it to a CSV file in the same column (multiple values) to feed into a Delete request (REST). This is to test multiple delete calls, each of which uses a unique UUID generated by a POST call. Or is there any other way I can test multiple delete calls after extracting the UUIDs from POST calls? Let's say 50 POST calls, then 50 Delete calls.

I don't think you need to do anything: as long as the requests reside in the same Thread Group, you should be able to use the same variable for the Delete request.
JMeter Variables are local to a thread, so different threads will have different variable values.
If you are still looking for a file-based solution, be aware that you can write an arbitrary JMeter Variable into a file using Groovy code like this:
Add JSR223 PostProcessor after the JSON Extractor
Make sure you have groovy selected in the "Language" dropdown
Make sure you have Cache compiled script if available box ticked
Put the following code into "Script" area
def csvFile = new File('test.csv')
csvFile << vars.get('your_variable')
csvFile << System.getProperty('line.separator')
This way, every extracted UUID is written to the test.csv file in JMeter's "bin" folder, and you can use that file in the CSV Data Set Config for your Delete request.
More information:
Groovy Goodness: Working with Files
Apache Groovy - Why and How You Should Use It
