Parse a JSON file with ISODate in Python - json

I have a JSON file with some lines like:
"updatedAt" : ISODate("2018-11-20T09:32:16.732+0000"),
I tried json.loads but it has an error json.decoder.JSONDecodeError: Expecting value: line 2 column 13 (char 15).
I believe that the problem is at ISODate () but how could I handle that with Python?
Many thanks

This is not valid JSON, to begin with. I guess the ISODATE("...") is generated from MongoDB, maybe dumping the ISODate() helper directly instead of its string representation into the JSON?
In any case, you could use a regex on the whole JSON-string to get rid of the ISODate("..."), retrieve the date as a string and then use python-dateutil to parse the value to a datetime.datetime.
Something to the tune of
import json
import dateutil.parse
import re
json_str = ....
clean_json = re.compile('ISODate\(("[^"]+")\)').sub('\\1', json_str)
json_obj = json.loads(clean_json)
# use dateutil.parser.parse(s) to parse each date into a datetime.datetime

Related

json conversion within pyspark: Cannot parse the schema in JSON format: Failed to convert the JSON string (big JSON string) to a data type

I'm having trouble with json conversion within pyspark working with complex nested-struct columns. The schema for the from_json doesn't seem to behave. Example:
import pyspark.sql.functions as f
df = spark.createDataFrame([[1,'a'],[2,'b'],[3,'c']], ['rownum','rowchar'])\
.withColumn('struct', f.expr("transform(array(1,2,3), i -> named_struct('a1',rownum*i,'a2',rownum*i*2))"))
df.display()
df.withColumn('struct',f.to_json('struct')).withColumn('struct',f.from_json('struct',df.schema['struct'])).display()
df.withColumn('struct',f.to_json('struct')).withColumn('struct',f.from_json('struct',df.select('struct').schema)).display()
fails with
Cannot parse the schema in JSON format: Failed to convert the JSON string (big JSON string) to a data type
Not sure if this is a syntax error on my end, an edge case that's failing, the wrong way to do things, or something else.
You're not passing the correct schema to from_json. Try with this instead:
df.withColumn('struct', f.to_json('struct')) \
.withColumn('struct', f.from_json('struct', df.schema["struct"].dataType)) \
.display()

Convert Array[Byte] to JSON format using Spark Scala

I'm reading an .avro file where the data of a particular column is in binary format. I'm currently converting the binary format to string format with the help of UDF for a readable purpose and then finally i will need to convert it into JSON format for further parsing the data. Is there a way i can convert string object to JSON format using Spark Scala code.
Any help would be much appreciated.
val avroDF = spark.read.format("com.databricks.spark.avro").
load("file:///C:/46.avro")
import org.apache.spark.sql.functions.udf
// Convert byte object to String format
val toStringDF = udf((x: Array[Byte]) => new String(x))
val newDF = avroDF.withColumn("BODY",
toStringDF(avroDF("body"))).select("BODY")
Output of newDF is shown below:
BODY |
+---------------------------------------------------------------------------------------------------------------+
|{"VIN":"FU74HZ501740XXXXX","MSG_TYPE":"SIGNAL","TT":0,"RPM":[{"E":1566800008672,"V":1073.75},{"E":1566800002538,"V":1003.625},{"E":1566800004084,"V":1121.75}
My desired output should be like below:
I do not know if you want a generic solution but in your particular case, you can code something like this:
spark.read.json(newDF.as[String])
.withColumn("RPM", explode(col("RPM")))
.withColumn("E", col("RPM.E"))
.withColumn("V", col("RPM.V"))
.drop("RPM")
.show()

Parse String in Restify

For the following JSON file
{
"_id":"1",
"Time":"03:20:22"
}
I want to add hours such that "Time" + 2 = "05:20:22";
Using Restify, I do not know how to parse this JSON into caculatable format because if I do data.Time.split(":"); restify does not recognize this command.
How should I parse Strings like "03:20:11" so that I can add number of hours to it?
You can use following sample code: (Make sure data.Time is a string while you retrieve object from restify. Because split function works only on strings)
x= "03:20:22";
y=x.split(':')
z=int(y[0])+2
y[0]=str(z)
print y

Error parsing JSON file in python 3.4

I am trying to load a Json file from a url and parse it on Python3.4 but i get a few errors and I've no idea what they are pointing to. I did verify the json file on the url from jsonlint.com and the file seems fine. The data.read() is returning 'byte' file and i've type casted it. The code is
import urllib.request
import json
inp = input("enter url :")
if len(inp)<1: inp ='http://python-data.dr-chuck.net/comments_42.json'
data=urllib.request.urlopen(inp)
data_str = str(data.read())
print(type(data_str))
parse_data = json.loads(data_str)
print(type(parse_data))
The error that i'm getting is:
The expression str(data.read()) doesn't "cast" your bytes into a string, it just produces a string representation of them. This can be seen if you print data_str: it's a str beginning with b'.
To actually decode the JSON, you need to do data_str = data.read().decode('utf=8')

Parsing HTTP Response in Python

I want to manipulate the information at THIS url. I can successfully open it and read its contents. But what I really want to do is throw out all the stuff I don't want, and to manipulate the stuff I want to keep.
Is there a way to convert the string into a dict so I can iterate over it? Or do I just have to parse it as is (str type)?
from urllib.request import urlopen
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = urlopen(url)
print(response.read()) # returns string with info
When I printed response.read() I noticed that b was preprended to the string (e.g. b'{"a":1,..). The "b" stands for bytes and serves as a declaration for the type of the object you're handling. Since, I knew that a string could be converted to a dict by using json.loads('string'), I just had to convert the byte type to a string type. I did this by decoding the response to utf-8 decode('utf-8'). Once it was in a string type my problem was solved and I was easily able to iterate over the dict.
I don't know if this is the fastest or most 'pythonic' way of writing this but it works and theres always time later of optimization and improvement! Full code for my solution:
from urllib.request import urlopen
import json
# Get the dataset
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = urlopen(url)
# Convert bytes to string type and string type to dict
string = response.read().decode('utf-8')
json_obj = json.loads(string)
print(json_obj['source_name']) # prints the string with 'source_name' key
You can also use python's requests library instead.
import requests
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = requests.get(url)
dict = response.json()
Now you can manipulate the "dict" like a python dictionary.
json works with Unicode text in Python 3 (JSON format itself is defined only in terms of Unicode text) and therefore you need to decode bytes received in HTTP response. r.headers.get_content_charset('utf-8') gets your the character encoding:
#!/usr/bin/env python3
import io
import json
from urllib.request import urlopen
with urlopen('https://httpbin.org/get') as r, \
io.TextIOWrapper(r, encoding=r.headers.get_content_charset('utf-8')) as file:
result = json.load(file)
print(result['headers']['User-Agent'])
It is not necessary to use io.TextIOWrapper here:
#!/usr/bin/env python3
import json
from urllib.request import urlopen
with urlopen('https://httpbin.org/get') as r:
result = json.loads(r.read().decode(r.headers.get_content_charset('utf-8')))
print(result['headers']['User-Agent'])
TL&DR: When you typically get data from a server, it is sent in bytes. The rationale is that these bytes will need to be 'decoded' by the recipient, who should know how to use the data. You should decode the binary upon arrival to not get 'b' (bytes) but instead a string.
Use case:
import requests
def get_data_from_url(url):
response = requests.get(url_to_visit)
response_data_split_by_line = response.content.decode('utf-8').splitlines()
return response_data_split_by_line
In this example, I decode the content that I received into UTF-8. For my purposes, I then split it by line, so I can loop through each line with a for loop.
I guess things have changed in python 3.4. This worked for me:
print("resp:" + json.dumps(resp.json()))