append data to an existing json file

I'd appreciate it if someone could point me in the right direction here, I'm a bit new to Python :)
I have a json file that looks like this:
[
    {
        "user":"user5",
        "games":"game1"
    },
    {
        "user":"user6",
        "games":"game2"
    },
    {
        "user":"user5",
        "games":"game3"
    },
    {
        "user":"user6",
        "games":"game4"
    }
]
And I have a small CSV file that looks like this:
module_a,module_b
10,20
15,16
1,11
2,6
I am trying to append the CSV data into the above-mentioned JSON so it looks like this, keeping the order as it is:
[
    {
        "user":"user5",
        "module_a":"10",
        "games":"game1",
        "module_b":"20"
    },
    {
        "user":"user6",
        "module_a":"15",
        "games":"game2",
        "module_b":"16"
    },
    {
        "user":"user5",
        "module_a":"1",
        "games":"game3",
        "module_b":"11"
    },
    {
        "user":"user6",
        "module_a":"2",
        "games":"game4",
        "module_b":"6"
    }
]
What would be the best approach to achieve this while keeping the output order as it is?
Appreciate any guidance.

The JSON specification doesn't prescribe ordering, and it won't be enforced by any JSON parser (unless it's the default mode of operation of the underlying platform), so going a long way just to keep the order when processing JSON files is usually pointless. To quote:
An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array.
...
JSON parsing libraries have been observed to differ as to whether or
not they make the ordering of object members visible to calling
software. Implementations whose behavior does not depend on member
ordering will be interoperable in the sense that they will not be
affected by these differences.
That being said, if you really insist on order, you can parse your JSON into a collections.OrderedDict (and write it back from it) which will allow you to inject data at specific places while keeping the overall order. So, first load your JSON as:
import json
from collections import OrderedDict
with open("input_file.json", "r") as f: # open the JSON file for reading
json_data = json.load(f, object_pairs_hook=OrderedDict) # read & parse it
Now that you have your JSON, you can go ahead and load up your CSV, and since there's not much else to do with the data you can immediately apply it to json_data. One caveat, though - since there is no direct mapping between the CSV and the JSON, one has to assume the index is the mapping (i.e. the first CSV row applies to the first JSON element, etc.), so we'll use enumerate() to track the current index. There is also no info on where to insert the individual values, so we'll assume that the first CSV column goes after the first JSON object entry, the second goes after the second entry, and so on; since they can have different lengths we'll use itertools.izip_longest() to interleave them. So:
import csv
from itertools import izip_longest # use zip_longest on Python 3.x
with open("input_file.csv", "rb") as f: # open the CSV file for reading
reader = csv.reader(f) # build a CSV reader
header = next(reader) # lets store the header so we can get the key values later
for index, row in enumerate(reader): # enumerate and iterate over the rest
if index >= len(json_data): # there are more CSV rows than we have elements in JSO
break
row = [(header[i], v) for i, v in enumerate(row)] # turn the row into element tuples
# since collections.OrderedDict doesn't support random access by index we'll have to
# rebuild it by mixing in the CSV elements with the existing JSON elements
# use json_data[i].items() on Python 3.x
data = (v for p in izip_longest(json_data[index].iteritems(), row) for v in p)
# then finally overwrite the current element in json_data with a new OrderedDict
json_data[index] = OrderedDict(data)
And with our CSV data nicely inserted into the json_data, all that's left is to write back the JSON (you may overwrite the original file if you wish):
with open("output_file.json", "w") as f: # open the output JSON file for writing
json.dump(json_data, f, indent=2) # finally, write back the modified JSON
This will produce the result you're after. It even respects the names in the CSV header so you can replace them with bob and fred and it will insert those keys in your JSON. You can even add more of them if you need more elements added to your JSON.
Still, just because it's possible, you really shouldn't rely on JSON ordering. If it's user readability you're after, there are far more suitable formats with optional ordering, like YAML.
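For completeness, here is a rough Python 3 sketch of the same idea, assuming Python 3.7+ (where plain dicts already preserve insertion order) and the same hypothetical input_file.json / input_file.csv / output_file.json names; it mirrors the code above rather than being a drop-in replacement:
import csv
import json
from itertools import zip_longest

with open("input_file.json", "r") as f:
    json_data = json.load(f)  # plain dicts keep insertion order on Python 3.7+

with open("input_file.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for index, row in enumerate(reader):
        if index >= len(json_data):  # more CSV rows than JSON elements
            break
        csv_items = list(zip(header, row))  # (key, value) tuples from the CSV row
        # interleave the existing JSON items with the CSV items, skipping fill values
        merged = (item
                  for pair in zip_longest(json_data[index].items(), csv_items)
                  for item in pair
                  if item is not None)
        json_data[index] = dict(merged)

with open("output_file.json", "w") as f:
    json.dump(json_data, f, indent=2)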

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is then that the unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps
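For what it's worth, a rough sketch of how that could extend to looping over all 88 files, applying the schema at read time so a stray boolean can't flip the column type. json_file_paths is a hypothetical list of the downloaded paths, and allowMissingColumns needs Spark 3.1+:
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True),
])

output = spark.createDataFrame([], my_schema)              # empty, typed starting point
for path in json_file_paths:                               # hypothetical list of the 88 files
    df_json = spark.read.schema(my_schema).json(path)      # force every file onto the same schema
    output = output.unionByName(df_json, allowMissingColumns=True)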

JSON variable indent for different entries

Background: I want to store a dict object in json format that has say, 2 entries:
(1) Some object that describes the data in (2). This is small data, mostly definitions, controlling parameters, etc., and things (maybe called metadata) that one would like to read before using the actual data in (2). In short, I want good human readability for this portion of the file.
(2) The data itself is a large chunk - it should be more machine-readable (no need for a human to gaze over it when opening the file).
Problem: How do I specify some custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data will also be indented, adding a lot of whitespace and making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, using your JSON formatter as appropriate for each of the contained objects.
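A minimal sketch of those steps, with stand-in values for the metadata and the big chunk (the output path is hypothetical):
import json

small_description = {"version": 1, "units": "seconds"}  # stand-in metadata
big_chunk = list(range(1000))                           # stand-in bulk data

with open("output.json", "w") as f:
    f.write('{"meta_data":\n')
    json.dump(small_description, f, indent=4)  # human-readable part
    f.write(',\n"actual_data":\n')
    json.dump(big_chunk, f)                    # compact, machine-readable part
    f.write('\n}')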
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
    "description": "human-readable content goes here",
    "split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
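A rough sketch of both sides of that format - each document gets its own json.dump call (so the indent can differ per document), and ijson reads them back; this assumes the third-party ijson package is installed and the file name is a placeholder:
import json
import ijson  # third-party: pip install ijson

# writing: each top-level document is dumped separately with its own indent setting
with open("mixed.json", "w") as f:
    json.dump({"next_item": "meta_data"}, f); f.write("\n")
    json.dump({"description": "human-readable content"}, f, indent=4); f.write("\n")
    json.dump({"next_item": "actual_data"}, f); f.write("\n")
    json.dump(["big", "unformatted", "content"], f); f.write("\n")

# reading: ijson yields each concatenated top-level document in turn
with open("mixed.json", "rb") as f:
    for doc in ijson.items(f, "", multiple_values=True):
        print(doc)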

How to parse json efficiently in Snowpipe with ON_ERROR=CONTINUE

I'm setting up a Snowpipe to load data from an S3 bucket into a Snowflake schema.
S3 contains files in NDJSON format. One file can contain multiple records, and I want to process all of them, even if one record is broken.
To do so, I need to add the on_error='continue' option to the pipe creation and use the CSV file format, as stated in the official Snowflake docs here.
That way I receive raw strings of JSON that I need to parse to access the data. And since Snowpipes do not support nested selects, the only way to do that is to parse it separately for each column.
Resulting in this copy statement:
copy into MY_TABLE
from (select parse_json($1):id, parse_json($1):name, parse_json($1):status
from @MY_STAGE_CONNECTED_TO_S3)
on_error = 'continue'
This code needs to parse json 3 times for each row.
I have a table with ~40 columns, and for it this query runs ~5 times slower than a more straightforward solution that uses the file format option to parse the JSON but unfortunately does not support the on_error=continue option.
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT
from (select $1:id, $1:name, $1:status
from @HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE_NN)
file_format = (type = 'json')
What was tried
Using a nested select like this is not supported:
copy into HQO_DEVELOPMENT.PUBLIC.DIM_TENANT from
(select $1:id, $1:name, $1:status from (
select parse_json($1) from
@HQO_DEVELOPMENT.PUBLIC.DIM_TENANT_STAGE))
on_error = 'continue'
Using the JSON type on the stage and omitting it on the pipe: won't help the issue
Is there a way to have the benefit of on_error=continue and don't parse JSON for every column?
Snowflake documentation says: Both CSV and semi-structured file types are supported; however, even when loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). You can use the corresponding file format (e.g. JSON), but any error in the transformation will stop the COPY operation, even if you set the ON_ERROR option to continue or skip the file.
https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
On the other hand, I see the on_error option works for NDJSON files, at least when you set the file type at the stage level. For testing, I created the following NDJSON file:
{ "id":1, "name":"Gokhan", "location":"Netherlands" }
{ "id":2, "name":"Hans", location:Germany -- broken json #1 }
{ "id":3, "name":"Joe", "location":"UK" }
{ broken json #2 }
{ "id":4, "name":"Mike", "location":"US" }
I created a file format object, and created a stage using this file format (you can alter your existing stage and set a file format):
CREATE FILE FORMAT myformat TYPE = json;
CREATE STAGE mystage FILE_FORMAT = (FORMAT_NAME = myformat);
I uploaded my sample NDJSON file to this stage, created a snowpipe to load it:
CREATE PIPE mypipe AS
COPY INTO mytable
FROM (SELECT $1:id, $1:name, $1:location FROM @mystage)
ON_ERROR = CONTINUE;
When I refreshed the pipe, it successfully loaded the three "valid" records from the file.

How to add entries to a JSON array/list

I'm trying to set up a Discord bot that only lets people on a list in a JSON file use it. I am wondering how to add data to the JSON array/list, but I'm not sure how to move forward and I have had no real luck looking for answers elsewhere.
This is an example of how the JSON file looks:
{
    IDs: [
        "2359835092385",
        "4634637576835",
        "3454574836835"
    ]
}
Now, what I am looking to do is add a new ID to "IDs" and not have it completely break, and I wish to be able to have other entries in the JSON file as well, so I can make something like "AdminIDs" for people that can do more stuff with the bot.
Yes, I know I can do this role-based in guilds/servers, but I would like to be able to use the bot in DMs as well as in guilds/servers.
What I want/need is a short and simple-to-manipulate script that I can easily put into a new command, so I can add new people to the bot without having to open and edit the JSON file manually.
If you haven't parsed your data already via the json package, you can do the following to parse it (note it's json.loads that turns JSON text into a Python object; json.dumps goes the other way):
import json

json_text = '{"IDs": ["2359835092385", "4634637576835", "3454574836835"]}'  # e.g. read from your file
parsed_json = json.loads(json_text)  # parse the JSON string into a dict
print(parsed_json['IDs'])
Then you can simply use this data like a normal list and append data to it.
All keys must be strings (surrounded by quotes).
In this case the key is "IDs", its value is the list, and the values of the list are the items inside it.
import json
data = {
    "IDs": [
        "2359835092385",
        "4634637576835",
        "3454574836835"
    ]
}
Let's say your JSON data comes from a file; to load it so you can manipulate it, do the following:
with open('filename.json', encoding='utf-8') as raw_json_data:
    j_data = json.load(raw_json_data)  # now j_data is basically the same as data, just with a different name
print(j_data)
# >> {'IDs': ['2359835092385', '4634637576835', '3454574836835']}
To add things inside the list IDs you use the append method
data['IDs'].append('adding something') #or j_data['IDs'].append("SOMEthing")
print(data)
# >> {'IDs': ['2359835092385', '4634637576835', '3454574836835', 'adding something']}
To add a new key
data['Names']=['Jack','Nick','Alice','Nancy']
print(data)
# >> {'IDs': ['2359835092385', '4634637576835', '3454574836835', 'adding something'], 'Names': ['Jack', 'Nick', 'Alice', 'Nancy']}
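And to persist the change without hand-editing the file (which is what the bot command ultimately needs), a minimal sketch - the file name and the new ID are placeholders:
import json

def add_id(new_id, path='filename.json'):
    # read the current data, append the new ID, and write the whole file back
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    data['IDs'].append(str(new_id))
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2)

add_id('9999999999999')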

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a Large UTF-8 JSON Array file and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110G and has ~150m JSON Objects.
HDInsight PySpark does not appear to support the Array-of-JSON file format for input, so I'm stuck. Also, I have "many" such files with different schemas, each containing hundreds of columns, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading to an RDD and other open methods, but PySpark appears to support only the JSONLines JSON file format, and I have the Array of JSON Objects due to ADLA's requirement for that file format.
I tried reading in as a text file, stripping Array characters, splitting on the JSON object boundaries and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str (ings).
I found a way through the above, and converted to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed as JSON, extracted the values (JSON I wanted), then parsed the values as JSON. This appeared to work, but when I would try to save, the RDD was type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which I did not understand admittedly, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the unicode byte-order marker and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and get a distributed dataset for more efficient processing. The output dataframe now had columns named after the JSON elements, inferred the schema, and dynamically adapts to other file formats.
Here is the code - hope it helps!:
#...Spark considers arrays of Json objects to be an invalid format
# and unicode files are prefixed with a byteorder marker
#
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)
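For reference, a rough Python 3 sketch of the same cleanup (the path and partition count are placeholders); it assumes, as above, that each JSON object sits on its own line so textFile's newline splitting acts as the delimiter:
raw = sc.textFile('/a/valid/file/path', 100).map(
    lambda line: line.strip(",\r\n[]\ufeff")  # drop the BOM, array brackets and trailing commas
)
df = spark.read.json(raw)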