I am running a SQL query via the google.cloud.bigquery Client.query() method in an AWS Lambda (Python 2.7 runtime). The native BQ object returned for each result row is a BigQuery Row(), i.e.,
Row((u'exampleEmail#gmail.com', u'XXX1234XXX'), {u'email': 0, u'email_id': 1})
I need to convert this to JSON, i.e.,
[{'email_id': 'XXX1234XXX', 'email': 'exampleEmail#gmail.com'}]
When running locally, I am able to just call the Python dict() function on the row to transform it, i.e.,
client = bigquery.Client()
queryJob = client.query(sql)
rows = []
for row in queryJob.result():
    # at this point row is the BQ sample Row object shown above
    tmp = dict(row)
    rows.append(tmp)
but when I load this into AWS Lambda it throws the error:
ValueError: dictionary update sequence element #0 has length 22; 2 is required
I have tried forcing it in different ways and breaking it out into sections, etc., but cannot get it into the desired JSON format.
I took a brief dive into the rabbit hole of transforming the QueryJob result into a Pandas dataframe and then from there into a JSON object, which also works locally but runs into numpy package errors in AWS Lambda, which seems to be a bit of a known issue.
I feel like this should have an easy solution but just haven't found it yet.
Try doing it like this:
L = []
sql = "..."  # your SQL statement
query_job = client.query(sql)  # API request
query_job.result()
for row in query_job:
    email_id = row.get('email_id')
    email = row.get('email')
    L.append({'email_id': email_id, 'email': email})
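If you then need an actual JSON string (for example to return it from the Lambda handler), json.dumps on that list should produce the format from the question; a minimal sketch:
import json

payload = json.dumps(L)
# e.g. '[{"email_id": "XXX1234XXX", "email": "exampleEmail#gmail.com"}]'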
I am currently struggling to consume an API for a script I'm writing; I'm not that familiar with Python or JSON. Basically I am sending a couple of GET requests and want to access a value that is held in an unnamed array, but I am getting the following error:
Traceback (most recent call last):
File "request_radius.py", line 15, in <module>
node = jdata[0]['name']
IndexError: list index out of range
The code that I am writing is below (values have been changed). I am trying to pull out the value MANCHESTER from name, which I thought would be entry 0 in the array, but I guess I'm missing something or need to approach this from another angle.
import requests
import json

LIST = ['1', '2']
data = {}
for i in LIST:
    api = 'http://foo.bar.com/values/%s/nodes' % (i)
    r = requests.get(api)
    r.raise_for_status()
    jdata = r.json()
    # value returned: [{'name': 'MANCHESTER'}]
    node = jdata[0]['name']
Thanks for looking :D
UPDATE:
The API call was returning blank values; I tried again once it was working as expected and the code ran fine.
You are getting an IndexError because jdata is empty. My guess is that the response body is an empty JSON array.
Are you sure your comment about the value returned is accurate?
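A minimal defensive version of the loop from the question (same endpoint and variables assumed) that skips empty responses instead of indexing into them:
for i in LIST:
    api = 'http://foo.bar.com/values/%s/nodes' % (i)
    r = requests.get(api)
    r.raise_for_status()
    jdata = r.json()
    if not jdata:
        # empty JSON body, nothing to index into
        print('No nodes returned for %s' % api)
        continue
    node = jdata[0]['name']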
I am trying to write a pyspark dataframe to CSV. I have Spark 1.6, and I am trying things such as the lines df.write.format('com.intelli.spark.csv').save('mycsv.csv') and df.write.format('com.databricks.spark.csv').save(PATH).
These always give an error along the lines of java.lang.ClassNotFoundException: Failed to find data source: com.intelli.spark.csv. Please find packages at http://spark-packages.org.
I have tried downloading spark-csv_2.10-0.1.jar and using it in the --jars argument of spark-submit, but that also leads to a similar error. I have also tried spark-shell --packages com.databricks:spark-csv_2.10:1.5.0, but it gives server access errors.
Try this way. In Spark 1.6, you will have to convert it to an RDD and write that out.
def toCSVLine(data):
    return ','.join(str(d) for d in data)

rdd1 = df.rdd.map(toCSVLine)
rdd1.saveAsTextFile('output_dir')
Edit: Try adding this in your Spark code after passing the --py-files argument.
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
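For reference, once the spark-csv package (and its dependencies) is actually available on the classpath, the DataFrame writer syntax from the question should work along these lines; this is only a sketch assuming the com.databricks:spark-csv_2.10 package resolves, e.g. via spark-submit --packages:
(df.write
    .format('com.databricks.spark.csv')
    .option('header', 'true')
    .save('mycsv_dir'))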
I have an Azure Event Hub, which is streaming data (in JSON format).
I read it as a Spark dataframe and parse the incoming "body" with from_json(col("body"), schema), where schema is pre-defined. In code, it looks like:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import *

schema = StructType().add(...)  # define the incoming JSON schema

df_stream_input = (spark
    .readStream
    .format("eventhubs")
    .options(**ehConfInput)
    .load()
    .select(from_json(col("body").cast("string"), schema))
)
And now, if there is some inconsistency between the incoming JSON's schema and the defined schema (e.g. the source Event Hub starts sending data in a new format without notice), the from_json() function will not throw an error; instead, it will put NULL into the fields that are present in my schema definition but not in the JSON the Event Hub sends.
I want to capture this information and log it somewhere (Spark's log4j, Azure Monitor, warning email, ...).
My question is: what is the best way to achieve this?
Some of my thoughts:
The first thing I can think of is a UDF which checks for the NULLs and raises an exception if there is any problem. I believe it is not possible to send logs to log4j via PySpark from inside the UDF, as the "spark" context cannot be initiated within the UDF (on the workers), and one wants to use the default:
log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger('PySpark Logger')
The second thing I can think of is to use the "foreach/foreachBatch" functions and put this check logic there.
But I feel both these approaches are too custom; I was hoping that Spark has something built-in for these purposes.
tl;dr You have to do this check logic yourself using foreach or foreachBatch operators.
It turns out I was mistaken in thinking that the columnNameOfCorruptRecord option could be an answer. It will not work.
Firstly, it won't work due to this:
case _: BadRecordException => null
And secondly, due to this, which simply disables any other parsing modes (incl. PERMISSIVE, which seems to be used alongside the columnNameOfCorruptRecord option):
new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
In other words, your only option is to use the 2nd item in your list, i.e. foreach or foreachBatch and handle corrupted records yourself.
A solution could use from_json while keeping the initial body column. Any record with incorrect JSON would end up with the result column being null, and foreach* would catch it, e.g.
def handleCorruptRecords(row):
    # if row.json is None, the body was corrupt
    # handle it (log, alert, ...)
    pass

df_stream_input = (spark
    .readStream
    .format("eventhubs")
    .options(**ehConfInput)
    .load()
    .select("body", from_json(col("body").cast("string"), schema).alias("json"))
)

query = (df_stream_input
    .writeStream
    .foreach(handleCorruptRecords)
    .start())
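If you are on Spark 2.4 or later, foreachBatch hands you a whole micro-batch DataFrame to inspect, which makes the logging side easier; a minimal sketch under that assumption (reusing col from the imports above, with the print call standing in for log4j / Azure Monitor / alerting):
def log_corrupt_records(batch_df, batch_id):
    corrupt_count = batch_df.filter(col("json").isNull()).count()
    if corrupt_count > 0:
        # replace with your preferred logging / alerting mechanism
        print("batch %s: %s records failed to parse" % (batch_id, corrupt_count))

query = (df_stream_input
    .writeStream
    .foreachBatch(log_corrupt_records)
    .start())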
My Python app calls an MSSQL 2017 function through pyodbc, and the function's query response is JSON formatted on the server side.
cursor.execute("SELECT dbo.fnInvoiceJSON(%s,%s);" % (posted,vat))
row = cursor.fetchone()
The response that comes back in the app is a <class 'pyodbc.Row'>, like it should be.
I can pass this response through requests.post to another API call if I convert it to a string.
Is there any way to convert this response to an accessible Python dict / JSON object?
As noted in the question, under pyodbc fetchone() returns a pyodbc.Row object and fetchall() returns an array of pyodbc.Row objects.
However, pyodbc also supports a fetchval() method that will return the value in the first column of the first row in the result set (similar to the ExecuteScalar method in C#). fetchval() returns the value as the corresponding Python type (e.g., string, None, etc.).
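A minimal sketch of that approach (the connection string is a placeholder; posted and vat come from the question):
import json
import pyodbc

cnxn = pyodbc.connect(CONN_STR)  # hypothetical connection string
cursor = cnxn.cursor()
cursor.execute("SELECT dbo.fnInvoiceJSON(?, ?);", posted, vat)
json_text = cursor.fetchval()    # first column of first row, as a Python str
invoice = json.loads(json_text)  # now a regular dict / list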
I am trying to format the output from a Lambda function into JSON. The Lambda function queries my Amazon Aurora RDS instance and returns an array of rows in the following format:
[[name,age,town,postcode]]
which gives an example output of:
[["James", 23, "Maidenhead","sl72qw"]]
I understand that mapping templates are designed to translate one format to another, but I don't understand how I can take the output above and map it into a JSON format using these mapping templates.
I have checked the documentation and it only covers converting one JSON to another.
Without seeing the code you're specifically using, it's difficult to give you a definitively correct answer, but I suspect what you're after is returning the data from Python as a dictionary and then converting that to JSON.
It looks like this thread contains the relevant details on how to do that.
More specifically, use the DictCursor:
cursor = connection.cursor(pymysql.cursors.DictCursor)
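Putting that together, a minimal Lambda handler sketch (the DB_HOST/DB_USER/DB_PASS/DB_NAME values and the people table are placeholders, not from the question):
import json
import pymysql

def lambda_handler(event, context):
    # hypothetical connection details for the Aurora instance
    connection = pymysql.connect(host=DB_HOST, user=DB_USER,
                                 password=DB_PASS, db=DB_NAME)
    with connection.cursor(pymysql.cursors.DictCursor) as cursor:
        cursor.execute("SELECT name, age, town, postcode FROM people")
        rows = cursor.fetchall()  # e.g. [{'name': 'James', 'age': 23, ...}]
    return {
        'statusCode': 200,
        'body': json.dumps(rows)
    }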