I would appreciate some help on this: I am making a request to get a JSON file with data, but I cannot print the final result; it says "name 'collapse_columns' is not defined".
This is my code taken from: https://medium.com/geekculture/how-to-execute-a-rest-api-call-on-apache-spark-the-right-way-in-python-4367f2740e78
import requests
import json
from pyspark.sql.functions import udf, col, explode
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
from pyspark.sql import Row

# Declare a function that will execute our REST API call
def executeRestApi(verb, url, headers, body):
    headers = {
        'content-type': "application/json"
    }
    res = None
    # Make API request, get response object back, create dataframe from above schema.
    try:
        if verb == "get":
            res = requests.get(url, data=body, headers=headers)
        else:
            res = requests.post(url, data=body, headers=headers)
    except Exception as e:
        return e
    if res is not None and res.status_code == 200:
        return json.loads(res.text)
    return None
# Define the response schema and the UDF
schema = StructType([
    StructField("Count", IntegerType(), True),
    StructField("Message", StringType(), True),
    StructField("SearchCriteria", StringType(), True),
    StructField("Results", ArrayType(
        StructType([
            StructField("Make_ID", IntegerType()),
            StructField("Make_Name", StringType())
        ])
    ))
])
# ensure that the new column, which is used to execute the UDF, will eventually contain data as a structured object rather than plain JSON
udf_executeRestApi = udf(executeRestApi, schema)
# Create the Request DataFrame and Execute
headers = {
    'content-type': "application/json"
}
body = json.dumps({})

RestApiRequestRow = Row("verb", "url", "headers", "body")
request_df = spark.createDataFrame([
    RestApiRequestRow("get", "https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json", headers, body)
])
#Finally we can use withColumn on the Dataframe to execute the UDF and REST API.
result_df = request_df.withColumn("result", udf_executeRestApi(col("verb"), col("url"), col("headers"), col("body")))
#print
df = result_df.select(explode(col("result.Results")).alias("results"))
df.select(collapse_columns(df.schema)).show()
By printing result_df or request_df I get the output below, but how could I access the JSON data?
+----+--------------------+--------------------+----+------+
|verb| url| headers|body|result|
+----+--------------------+--------------------+----+------+
| get|https://vpic.nhts...|[content-type -> ...| {}| [,,,]|
+----+--------------------+--------------------+----+------+
+----+--------------------+--------------------+----+
|verb| url| headers|body|
+----+--------------------+--------------------+----+
| get|https://vpic.nhts...|[content-type -> ...| {}|
+----+--------------------+--------------------+----+
Many thanks!
If you read the description of the sample solution, it says that it uses a user-defined function for the collapse, which can be found here:
https://github.com/jamesshocking/collapse-spark-dataframe
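If you'd rather not pull in that helper, a minimal alternative for this particular schema (a sketch, assuming the result_df from your code) is to flatten the exploded struct directly with a star expansion:

from pyspark.sql.functions import col, explode

# Explode the Results array into one struct per row, then expand the struct
# fields into ordinary columns.
df = result_df.select(explode(col("result.Results")).alias("results"))
df.select("results.*").show()  # columns: Make_ID, Make_Name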
Related
I want to read nested data from a JSON file. I have created a .proto file based on the JSON, but I am still not able to read the nested data from it.
nested.proto --> compiled using protoc --python_out=$PWD nested.proto
syntax = "proto2";

message Employee {
    required int32 EMPLOYEE_ID = 1;
    message ListItems {
        required string FULLADDRESS = 1;
    }
    repeated ListItems EMPLOYEE_ADDRESS = 2;
}
nested.json
{
    "EMPLOYEE_ID": 5044,
    "EMPLOYEE_ADDRESS": [
        {
            "FULLADDRESS": "Suite 762"
        }
    ]
}
parse.py
#!/usr/bin/env python3
import json
from google.protobuf.json_format import Parse
import nested_pb2 as np

input_file = "nested.json"

if __name__ == "__main__":
    # reading json file
    f = open(input_file, 'rb')
    content = json.load(f)
    # initialize emp_table here
    emp_table = np.Employee()
    employee = Parse(json.dumps(content), emp_table, True)
    print(employee.EMPLOYEE_ID)  # output: 5044

    emp_table = np.Employee().ListItems()
    items = Parse(json.dumps(content), emp_table, True)
    print(items.FULLADDRESS)  # output: NO OUTPUT (WHY?)
A couple of things:
- The type is ListItems, but the field name is EMPLOYEE_ADDRESS
- Python is awkward (!) with repeated fields
- You're writing more code than you need
I recommend adhering to the style guide if you can.
Try:
#!/usr/bin/env python3
import json
from google.protobuf.json_format import Parse
import nested_pb2 as np

input_file = "nested.json"

if __name__ == "__main__":
    # reading json file
    f = open(input_file, 'rb')
    content = json.load(f)
    # initialize emp_table here
    emp_table = np.Employee()
    employee = Parse(json.dumps(content), emp_table, True)
    print(employee.EMPLOYEE_ID)  # output: 5044
    for item in employee.EMPLOYEE_ADDRESS:
        print(item)
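If you want just the address string rather than the whole message, you can index the repeated field directly (a sketch, assuming the message definitions above):

# EMPLOYEE_ADDRESS is a repeated field, so index it to reach one ListItems message.
print(employee.EMPLOYEE_ADDRESS[0].FULLADDRESS)  # output: Suite 762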
I am trying to bring JIRA data into Foundry using an external API. When it comes in via Magritte, the data gets stored in AVRO, and there is a column called response. The response column has data that looks like this...
[{"id":"customfield_5","name":"test","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["cf[5]","test"],"schema":{"type":"user","custom":"com.atlassian.jira.plugin.system.customfieldtypes:userpicker","customId":5}},{"id":"customfield_2","name":"test2","custom":true,"orderable":true,"navigable":true,"searchable":true,"clauseNames":["test2","cf[2]"],"schema":{"type":"option","custom":"com.atlassian.jira.plugin.system.customfieldtypes:select","customId":2}}]
Because this imports as AVRO, the Foundry documentation on converting this data doesn't work. How can I convert this data into individual columns and rows?
Here is the code that I've attempted to use:
from transforms.api import transform_df, Input, Output
from pyspark import SparkContext as sc
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
import json
import pyspark.sql.types as T


@transform_df(
    Output("json output"),
    json_raw=Input("json input"),
)
def my_compute_function(json_raw, ctx):
    sqlContext = SQLContext(sc)
    source = json_raw.select('response').collect()  # noqa
    # Read the list into data frame
    df = sqlContext.read.json(sc.parallelize(source))

    json_schema = T.StructType([
        T.StructField("id", T.StringType(), False),
        T.StructField("name", T.StringType(), False),
        T.StructField("custom", T.StringType(), False),
        T.StructField("orderable", T.StringType(), False),
        T.StructField("navigable", T.StringType(), False),
        T.StructField("searchable", T.StringType(), False),
        T.StructField("clauseNames", T.StringType(), False),
        T.StructField("schema", T.StringType(), False)
    ])

    udf_parse_json = udf(lambda str: parse_json(str), json_schema)
    df_new = df.select(udf_parse_json(df.response).alias("response"))
    return df_new


# Function to convert JSON array string to a list
def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["a"], item["b"])
Parsing JSON in a string column into a struct column (and then into separate columns) can be done easily using the F.from_json function.
In your case, you need to do:
df = df.withColumn("response_parsed", F.from_json("response", json_schema))
Then you can do this or similar to get the contents into different columns:
df = df.select("response_parsed.*")
However, this won't work as-is because your schema is incorrect: each row actually contains a list of JSON structs, not just one, so you need a T.ArrayType(your_schema) wrapped around the whole thing. You'll also need to apply F.explode before selecting, to get each array element into its own row.
An additional useful function is F.get_json_object, which allows you to extract one JSON object from a JSON string.
Using a UDF like you've done could work, but UDFs are generally much less performant than native Spark functions.
Additionally, all the AVRO file format does in this case is merge multiple JSON files into one big file, with each file in its own row, so the example under "Rest API Plugin" - "Processing JSON in Foundry" should work as long as you skip the 'put this schema on the raw dataset' step.
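Putting that together, a minimal sketch (assuming a trimmed-down version of the fields in your sample response) might look like this:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Schema for one element of the JSON array held in the response column.
element_schema = T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("name", T.StringType()),
    T.StructField("custom", T.BooleanType()),
])

# The column holds an array of these structs, so wrap the schema in ArrayType.
df = df.withColumn("response_parsed",
                   F.from_json("response", T.ArrayType(element_schema)))

# One row per array element, then one column per struct field.
df = df.withColumn("response_parsed", F.explode("response_parsed"))
df = df.select("response_parsed.*")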
I used the magritte-rest connector to walk through the paged results from search:
type: rest-source-adapter2
restCalls:
  - type: magritte-paging-inc-param-call
    method: GET
    path: search
    paramToIncrease: startAt
    increaseBy: 50
    initValue: 0
    parameters:
      startAt: '{%startAt%}'
    extractor:
      - type: json
        assign:
          issues: /issues
        allowNull: true
    condition:
      type: magritte-rest-non-empty-condition
      var: issues
    maxIterationsAllowed: 4096
cacheToDisk: false
oneFilePerResponse: false
That yielded a dataset with one JSON response per row. Once I had that, the following expanded and parsed the returned JSON issues into a properly-typed DataFrame, with fields holding the inner structure of each issue as a (very complex) struct:
import json
from pyspark.sql import Row
from pyspark.sql.functions import explode


def issues_enumerated(All_Issues_Paged):

    def generate_issue_row(input_row: Row) -> Row:
        """
        Generates a dataframe of each response's issue array as a single array record per-Row
        """
        d = input_row.asDict()
        resp_json = d['response']
        resp_obj = json.loads(resp_json)
        issues = list(map(json.dumps, resp_obj['issues']))
        return Row(issues=issues)

    # array-per-record
    unexploded_df = All_Issues_Paged.rdd.map(generate_issue_row).toDF()
    # row-per-record
    row_per_record_df = unexploded_df.select(explode(unexploded_df.issues))
    # raw JSON string per-record RDD
    issue_json_strings_rdd = row_per_record_df.rdd.map(lambda _: _.col)
    # JSON object dataframe
    issues_df = spark.read.json(issue_json_strings_rdd)
    issues_df.printSchema()
    return issues_df
I am fairly new to python and would like some assistance with a problem.
I have a SQL select query that returns a column with values. I would like to pass those records in a request to a REST API. The issue is that the API call expects the data in a specific JSON format.
How can I convert the data returned by the query to the specific JSON format shown below?
Query from SQL returns:
InfoId
------
1
2
3
4
5
6
I need to pass these values to a REST API as JSON in the following format:
{
    "InfoId": [
        1, 2, 3, 4, 5, 6
    ]
}
I have tried a couple of options to solve this problem.
First, I tried converting the data into JSON using the pandas DataFrame.to_json method with the various orient parameters, but none of them return the desired format shown above.
import requests
import json
import pyodbc
import pandas as pd

conn = pyodbc.connect('Driver={SQL SERVER};'
                      'Server=myServer;'
                      'Database=TestDb;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()
sql_query = pd.read_sql_query('SELECT InfoId FROM tbl_info', conn)
# print(sql_query)
print(sql_query.to_json(orient='values', index=False))

url = "http://ldapiserver:5000/patterns/v1/swirl?idType=Info"

# sample payload
# payload = "{\r\n  \"InfoId\": [\r\n    1,2,3,4,5,6\r\n  ]\r\n}"
payload = sql_query.to_json(orient='records')

headers = {
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=json.dumps(payload, indent=4))

resp_body = response.json()
print(resp_body)
print(response.elapsed.total_seconds())
The second method I tried was to convert the rows from the SQL query into a list object and then form the JSON string. It works that way, but I would like to automate it so that it can form the JSON string irrespective of the query.
import requests
import json
import pyodbc

conn = pyodbc.connect('Driver={SQL SERVER};'
                      'Server=myServer;'
                      'Database=TestDb;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()
cursor.execute("""
SELECT InfoId FROM tbl_info
""")
rows = cursor.fetchall()

# Convert query to row arrays
rowarray_list = []
for row in rows:
    t = (row.InfoId)
    rowarray_list.append(t)

j = json.dumps(rowarray_list)
conn.close()

txt = '{"InfoId": ', j, '}'
# print(txt)
payload = txt[0] + txt[1] + txt[2]

url = "http://ldapiserver:5000/patterns/v1/swirl?idType=Info"
# payload = "{\r\n  \"InfoId\": [\r\n    72,74\r\n  ]\r\n}"
# print(json.dumps(payload, indent=4))

headers = {
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

resp_body = response.json()
print(resp_body)
print(response.elapsed.total_seconds())
Appreciate any help with this.
Thank you.
To convert your SQL query to JSON,
.
.
.
rows = cursor.fetchall()
# convert to list
json_rows = [dict(zip([key[0] for key in cursor.description], row)) for row in rows]
Then you can return your response as you like
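Since the API in the question expects a single "InfoId" array rather than one object per row, a minimal sketch (assuming the same cursor as above) could collapse the column into a list and let json.dumps build the payload:

import json

rows = cursor.fetchall()

# Collapse the single InfoId column into one list, then serialize it.
payload = json.dumps({"InfoId": [row.InfoId for row in rows]})
# payload == '{"InfoId": [1, 2, 3, 4, 5, 6]}'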
I am trying to live-stream data into Power BI from Python. However, I am encountering the error
TypeError: Object of type 'bytes' is not JSON serializable
I have put my code below, please indicate what I am doing wrong as I don't quite understand what the issue is.
import pandas as pd
from datetime import datetime
from datetime import timedelta
import requests
import json
import time
import random


# function for data_generation
def data_generation():
    surr_id = random.randint(1, 3)
    speed = random.randint(20, 200)
    date = datetime.today().strftime("%Y-%m-%d")
    time = datetime.now().isoformat()
    return [surr_id, speed, date, time]


if __name__ == '__main__':
    REST_API_URL = 'api_url'
    while True:
        data_raw = []
        for j in range(1):
            row = data_generation()
            data_raw.append(row)
        print("Raw data - ", data_raw)
        # set the header record
        HEADER = ["surr_id", "speed", "date", "time"]
        data_df = pd.DataFrame(data_raw, columns=HEADER)
        data_json = bytes(data_df.to_json(orient='records'), encoding='utf-8')
        print("JSON dataset", data_json)
        # Post the data on the Power BI API
        try:
            req = requests.post(REST_API_URL, data=json.dumps(
                data_json), headers=HEADER, timeout=5)
            print("Data posted in Power BI API")
        except requests.exceptions.ConnectionError as e:
            req = "No response"
        print(req)
        time.sleep(3)
Solved. I just changed req = requests.post(REST_API_URL, data=json.dumps(data_json), headers=HEADER, timeout=5) to req = requests.post(url=REST_API_URL, data=data_json).
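For context, data_json is already a UTF-8-encoded bytes payload, so json.dumps cannot serialize it again (hence the TypeError), and HEADER is a list of column names rather than an HTTP header dict. A sketch of the corrected call, using the same variables as above:

# data_json is already JSON-encoded bytes, so send it as the request body directly.
req = requests.post(url=REST_API_URL, data=data_json, timeout=5)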
I am new to Python and Django. I am an IT professional who deploys software that monitors computers. The API outputs JSON. I want to create a Django app that reads the API and outputs the data to an HTML page. Where do I get started? I think the idea is to write the JSON feed to a Django model. Any help/advice is greatly appreciated.
Here's a simple single file to extract the JSON data:
import urllib2
import json

def printResults(data):
    theJSON = json.loads(data)
    for i in theJSON[""]:
        print i

def main():
    urlData = ""
    webUrl = urllib2.urlopen(urlData)
    if (webUrl.getcode() == 200):
        data = webUrl.read()
        printResults(data)
    else:
        print "Received error"

if __name__ == '__main__':
    main()
If you have a URL returning JSON as the response, you could try this:
import requests
import json
url = 'http://....' # Your api url
response = requests.get(url)
json_response = response.json()
Now json_response is a list containing dicts. Let's suppose you have this structure:
[
    {
        'code': 'ABC',
        'avg': 14.5,
        'max': 30
    },
    {
        'code': 'XYZ',
        'avg': 11.6,
        'max': 21
    },
    ...
]
You can iterate over the list and take every dict into a model.
from yourmodels import CurrentModel

...

for obj in json_response:
    cm = CurrentModel()
    cm.avg = obj['avg']
    cm.max = obj['max']
    cm.code = obj['code']
    cm.save()
Or you could use a bulk method, but keep in mind that bulk_create does not trigger the save() method.
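A sketch of the bulk variant, using the same hypothetical CurrentModel and json_response as above:

from yourmodels import CurrentModel

# Build unsaved instances in memory, then insert them all in a single query.
# Note: bulk_create skips CurrentModel.save() and its signals.
CurrentModel.objects.bulk_create([
    CurrentModel(avg=obj['avg'], max=obj['max'], code=obj['code'])
    for obj in json_response
])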