I am passing a JSON object through a POST request into a Flask app. The goal is to convert it into a single-row pandas DataFrame and pass it on for further processing.
The JSON payload is as follows:
{
"ABC": "123",
"DATE": "2020-01-01",
"AMOUNT": "100",
"IDENTIFIER": "12345"
}
The output of data=flask.request.get_json() and print(data) is
{'ABC': '123', 'DATE': '2020-01-01', 'AMOUNT': '100','IDENTIFIER': '12345'}
But when I do a pd.read_json(data) on it I get an error
ValueError: Invalid file path or buffer object type: <class 'dict'>
Any ideas on how to handle this? I need the output to be
ABC DATE AMOUNT IDENTIFIER
123 2020-01-01 100 12345
Thanks!
Try this:
import pandas as pd
# one row: the dict's values as the data, the dict's keys as the column names
df = pd.DataFrame([data.values()], columns=data.keys())
print(df)
Output:
ABC DATE AMOUNT IDENTIFIER
0 123 2020-01-01 100 12345
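If you prefer not to pass the values and keys separately, wrapping the dict in a list (or using pd.json_normalize) gives the same single-row result; a minimal sketch, assuming data is the flat dict returned by flask.request.get_json():
import pandas as pd
# each dict in the list becomes one row; its keys become the columns
df = pd.DataFrame([data])
# or, equivalently for a flat dict (this one also flattens nested keys if present):
df = pd.json_normalize(data)
print(df)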
I have a df_movies with a genres column that looks like JSON:
|genres |
[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 37, 'name': 'Western'}]
How can I extract the first 'name' value?
way #1
df_movies.withColumn("genres_extract",
  regexp_extract(col("genres"), """ 'name': (\w+)""", 1)).show(false)
way #2
df_movies.withColumn("genres_extract",
  regexp_extract(col("genres"), """[{'id':\s\d,\s 'name':\s(\w+)""", 1))
Expected: Action
You can use the get_json_object function:
Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
.toDF("genres")
.withColumn("genres_extract", get_json_object(col("genres"), "$[0].name" ))
.show()
+--------------------+--------------+
| genres|genres_extract|
+--------------------+--------------+
|[{"id": 28, "name...| Action|
+--------------------+--------------+
Another possibility is using the from_json function together with a self-defined schema. This allows you to "unwrap" the JSON structure into a DataFrame with all of the data in there, so that you can use it however you want!
Something like the following:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
Seq("""[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 37, "name": "Western"}]""")
.toDF("genres")
// Creating the necessary schema for the from_json function
val moviesSchema = ArrayType(
  new StructType()
    .add("id", StringType)
    .add("name", StringType)
)
// Parsing the json string into our schema, exploding the column to make one row
// per json object in the array and then selecting the wanted columns,
// unwrapping the parsedMovies column into separate columns
val parsedDf = df
  .withColumn("parsedMovies", explode(from_json(col("genres"), moviesSchema)))
  .select("parsedMovies.*")
parsedDf.show(false)
+---+---------+
| id| name|
+---+---------+
| 28| Action|
| 12|Adventure|
| 37| Western|
+---+---------+
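If you are working in PySpark rather than Scala, the same two approaches carry over; a rough sketch, assuming df_movies has a string column genres containing valid JSON (double quotes, as in the examples above):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructType

# option 1: pull only the first genre name out of the JSON array
df_movies.withColumn("genres_extract", F.get_json_object(F.col("genres"), "$[0].name")).show()

# option 2: parse the whole array and explode it into one row per genre
movies_schema = ArrayType(StructType().add("id", StringType()).add("name", StringType()))
(df_movies
    .withColumn("parsedMovies", F.explode(F.from_json(F.col("genres"), movies_schema)))
    .select("parsedMovies.*")
    .show(truncate=False))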
Here is my JSON example. When I convert the JSON to a CSV file, it creates separate columns for each object of the reviews array, with column names like serial, name.0, rating.0, _id.0, name.1, rating.1, _id.1. How can I convert it to a CSV file where serial, name, rating and _id are the only column names, and every object of reviews goes into a separate row?
[{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
},
],
}]
I am trying to load the CSV file into pandas. If that is not possible, is there any way to solve the problem using the pandas package in Python?
I suggest you use pandas for the CSV export only and flatten the JSON data structure first, so that the result can then be easily loaded into a pandas DataFrame.
Try:
data_python = [{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
},
],
}]
from collections import defaultdict
from pprint import pprint
import pandas as pd
dct_flat = defaultdict(list)
for dct in data_python:
    for dct_reviews in dct["reviews"]:
        dct_flat['serial'].append(dct['serial'])
        for key, value in dct_reviews.items():
            dct_flat[key].append(value)
#pprint(data_python)
#pprint(dct_flat)
df = pd.DataFrame(dct_flat)
print(df)
df.to_csv("data.csv")
which gives:
serial name rating _id
0 63708940a8d291c502be815f shadman 4 6373d4eb50cff661989f3d83
1 63708940a8d291c502be815f niloy1 3 6373d59450cff661989f3db8
and
,serial,name,rating,_id
0,63708940a8d291c502be815f,shadman,4,6373d4eb50cff661989f3d83
1,63708940a8d291c502be815f,niloy1,3,6373d59450cff661989f3db8
as CSV file content.
Notice that the JSON you provided in your question can't be loaded from a file or a string in Python, either with the json module or with pandas, because it is not valid JSON (it has trailing commas). See below for corrected, valid JSON data:
valid_json_data='''\
[{
"serial": "63708940a8d291c502be815f",
"reviews": [
{
"name": "shadman",
"rating": 4,
"_id":"6373d4eb50cff661989f3d83"
},
{
"name": "niloy1",
"rating": 3,
"_id": "6373d59450cff661989f3db8"
}
]
}]
'''
and code for loading this data from a JSON file:
import json
json_file = "data.json"
with open(json_file) as f:
    data_json = f.read()
data_python = json.loads(data_json)
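As an alternative to building the flattened dict by hand, pandas' json_normalize can do the same flattening in one call; a sketch, assuming data_python is the (valid) parsed list from above:
import pandas as pd

# one row per review, repeating the parent 'serial' on each row
df = pd.json_normalize(data_python, record_path="reviews", meta=["serial"])
df = df[["serial", "name", "rating", "_id"]]  # put serial first
print(df)
df.to_csv("data.csv", index=False)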
I am working with the following marketing JSON file:
{
"request_id": "xx",
"timeseries_stats": [
{
"timeseries_stat": {
"id": "xx",
"timeseries": [
{
"start_time": "xx",
"end_time": "xx",
"stats": {
"impressions": xx,
"swipes": xx,
"view_completion": xx,
"spend": xx
}
},
{
"start_time": "xx",
"end_time": "xx",
"stats": {
"impressions": xx,
"swipes": xx,
"view_completion": xx,
"spend": xx
}
}
]
}
}
]
}
I can parse this using pandas very easily and obtain the desired dataframe in the format
start_time end_time impressions swipes view_completion spend
xx xx xx xx xx xx
xx xx xx xx xx xx
but need to do it in spark on AWS Glue.
After creating an initial spark dataframe (df) using
rdd = sc.parallelize(JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries'])
df = rdd.toDF()
I tried expanding the stats key as follows
df_expanded = df.select("start_time","end_time","stats.*")
Error:
AnalysisException: 'Can only star expand struct data types.
Attribute: `ArrayBuffer(stats)`;'
&
from pyspark.sql.functions import explode
df_expanded = df.select("start_time","end_time").withColumn("stats", explode(df.stats))
Error:
AnalysisException: 'The number of aliases supplied in the AS clause does not match the
number of columns output by the UDTF expected 2 aliases but got stats ;
I'm pretty new to Spark; any help would be much appreciated with either of the two approaches!
It's a pretty similar problem to the one in:
parse array of dictionaries from JSON with Spark
except that I need to flatten this additional stats key.
When you explode a map column, it produces two columns (key and value), so .withColumn does not work here. Use explode inside a select statement instead.
from pyspark.sql import functions as f
# explode the map into key/value rows, then pivot the keys back into columns
df.select('start_time', 'end_time', f.explode('stats')) \
    .groupBy('start_time', 'end_time').pivot('key').agg(f.first('value')).show()
+----------+--------+-----------+-----+------+---------------+
|start_time|end_time|impressions|spend|swipes|view_completion|
+----------+--------+-----------+-----+------+---------------+
| yy| yy| yy| yy| yy| yy|
| xx| xx| xx| xx| xx| xx|
+----------+--------+-----------+-----+------+---------------+
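If you would rather keep the stats.* star expansion from your first attempt, another option is to let Spark parse the JSON itself so that stats is inferred as a struct rather than a map; a sketch, assuming JSON_resp is the parsed API response and spark/sc are your SparkSession and SparkContext:
import json

rows = JSON_resp['timeseries_stats'][0]['timeseries_stat']['timeseries']

# reading JSON strings lets Spark infer 'stats' as a struct, so star expansion works
df = spark.read.json(sc.parallelize([json.dumps(r) for r in rows]))
df.select("start_time", "end_time", "stats.*").show()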
I have a Python 3.8.5 script that gets JSON from an API, saves it to disk, and reads the JSON into a DataFrame. It works.
df = pd.io.json.read_json('json_file', orient='records')
I want to try an in-memory buffer instead so I don't have to read from/write to disk, but I am getting an error. The code is like this:
import json
import pandas as pd
from io import StringIO

io = StringIO()
json_out = []
# some code to append API results to json_out
json.dump(json_out, io)
df = pd.io.json.read_json(io.getvalue())
On that last line I get the error
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 199, in wrapper
return func(*args, **kwargs)
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\io\json\_json.py", line 618, in read_json
result = json_reader.read()
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\io\json\_json.py", line 755, in read
obj = self._get_object_parser(self.data)
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\io\json\_json.py", line 777, in _get_object_parser
obj = FrameParser(json, **kwargs).parse()
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\io\json\_json.py", line 886, in parse
self._parse_no_numpy()
File "C:\Users\chap\Anaconda3\lib\site-packages\pandas\io\json\_json.py", line 1119, in _parse_no_numpy
loads(json, precise_float=self.precise_float), dtype=None
ValueError: Trailing data
The JSON is in a list format. This is not the actual JSON, but it looks like this when I write it to disk:
json = [
{"state": "North Dakota",
"address": "123 30th st E #206",
"account": "123"
},
{"state": "North Dakota",
"address": "456 30th st E #206",
"account": "456"
}
]
Given that it worked in the first case (write/read from disk), I don't know how to troubleshoot. How do I troubleshoot something in the buffer? The actual data is mostly text but has some number fields.
I don't know what's going wrong for you; this works for me:
import json
import pandas as pd
from io import StringIO
json_out = [
{"state": "North Dakota",
"address": "123 30th st E #206",
"account": "123"
},
{"state": "North Dakota",
"address": "456 30th st E #206",
"account": "456"
}
]
io = StringIO()
json.dump(json_out, io)
df = pd.io.json.read_json(io.getvalue())
print(df)
This leads me to believe there's something wrong with the code that appends the API data...
However, if you have a list of dictionaries, you don't need the IO step. You can just do:
pd.DataFrame(json_out)
EDIT: I think I remember getting this error when there was a trailing comma at the end of my JSON, like so:
[
{
"hello":"world",
},
]
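For what it's worth, the same ValueError: Trailing data also appears when the buffer ends up holding more than one JSON document, e.g. if the loop that appends API results dumps into the buffer more than once instead of building one list first; a small sketch that reproduces it (the field values are just placeholders):
import json
import pandas as pd
from io import StringIO

buf = StringIO()
json.dump([{"account": "123"}], buf)
json.dump([{"account": "456"}], buf)  # a second document right after the first

try:
    pd.read_json(StringIO(buf.getvalue()))
except ValueError as err:
    print(err)  # Trailing data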
I have the following type of JSON document which I need to insert into a MongoDB collection with pymongo:
json={
"resource": "/items/6791111",
"user_id": 123456789,
"topic": "items",
"application_id":001,
"attempts": 1,
"sent": "2020-07-22T15:53:06.000-04:00",
"received":"2020-07-22T15:53:06.000-04:00"
}
The fields sent and received are strings, so if I run:
collection.insert_one(json)
they will be saved as strings in the database. How can I store them directly as dates?
I tried something like this:
from dateutil.parser import parse
json['sent']=parse(json['sent'])
collection.insert_one(json)
but that doesn't seem like a very good solution to me, because some documents have several date fields and in some cases a date field is null (for example, in an order the delivered field may be null until the order is delivered),
something like this:
json2={
"resource": "/items/6791111",
"user_id": 123456789,
"topic": "items",
"application_id":001,
"attempts": 1,
"sent": "2020-07-22T15:53:06.000-04:00",
"received":Null
}
Right now I'm parsing the dates by hand using a function, but it's really not practical. And I need to have the date fields stored as dates so I can filter by time.
You can attempt isoparse on each field, which will convert any valid date strings to datetime objects, and these will be stored in MongoDB as the BSON date type. Nulls will be unaffected.
from dateutil.parser import isoparse
for k, v in json.items():
    try:
        json[k] = isoparse(v)
    except Exception:
        pass
Full worked example:
from pymongo import MongoClient
from dateutil.parser import isoparse
import pprint
collection = MongoClient()['mydatabase'].collection
json = {
    "resource": "/items/6791111",
    "user_id": 123456789,
    "topic": "items",
    "application_id": 1,
    "attempts": 1,
    "sent": "2020-07-22T15:53:06.000-04:00",
    "received": "2020-07-22T15:53:06.000-04:00",
    "nulldate": None,
}
for k, v in json.items():
    try:
        json[k] = isoparse(v)
    except Exception:
        pass
collection.insert_one(json)
pprint.pprint(collection.find_one(), indent=4)
gives:
{ '_id': ObjectId('5fde015e794ced49eeaa7a65'),
'application_id': 1,
'attempts': 1,
'nulldate': None,
'received': datetime.datetime(2020, 7, 22, 19, 53, 6),
'resource': '/items/6791111',
'sent': datetime.datetime(2020, 7, 22, 19, 53, 6),
'topic': 'items',
'user_id': 123456789}
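Once the fields are stored as BSON dates, filtering by time works with plain datetime objects; a quick usage sketch (the date range is just a placeholder):
from datetime import datetime

cursor = collection.find({"sent": {"$gte": datetime(2020, 7, 1), "$lt": datetime(2020, 8, 1)}})
for doc in cursor:
    print(doc["resource"], doc["sent"])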