PyFlink Kafka connector deserializes received JSON data to null

I am creating a stream processor using PyFlink. When I connect Kafka to Flink, everything works fine. But when I send JSON data to Kafka, PyFlink receives it but the deserializer converts it to null. The PyFlink code is:
from pyflink.common.serialization import Encoder
from pyflink.datastream.connectors import StreamingFileSink
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.common import Row


# Starting Flink app
def start_flink_app():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)
    env.add_jars(
        "file:///Users/samprabin/Documents/xealei_fall_detector/dataProcessorAndClassifier/jar/flink-sql-connector-kafka_2.11-1.12.3.jar")

    deserialization_schema = JsonRowDeserializationSchema.builder() \
        .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

    kafka_consumer = FlinkKafkaConsumer(
        topics='quickstart-events',
        deserialization_schema=deserialization_schema,
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

    ds = env.add_source(kafka_consumer)
    ds.print()
    env.execute("tutorial_job1")


if __name__ == "__main__":
    print('Main program started...')
    start_flink_app()
And the Kafka producer code is:
from kafka import KafkaProducer
from json import dumps

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda x: dumps(x).encode('utf-8'))
data = {"name": "tom"}
producer.send('quickstart-events', value=data)
Please let me know how I can receive the JSON data in PyFlink.

I faced the same problem. What I did was use the same serializer/deserializer given in the Flink Kafka producer example and generate output in a topic. I found that the message has to use the following format:
{"f0": 123, "f1": "ddd"}
With that format it works as expected, instead of the null it was giving earlier.

Issue resolved by using Types.ROW_NAMED instead of Types.ROW and then providing your field names:
deserialization_schema = JsonRowDeserializationSchema.builder().type_info(
    type_info=Types.ROW_NAMED(
        ["abc", "xyz"], [Types.STRING(), Types.STRING()])).build()

Related

Converting JSON into a DataFrame within FastAPI app

I am trying to create an API for customer churn at a bank. I have completed the model and now want to create the API using FastAPI. My problem is converting the JSON data passed in to a DataFrame so I can run it through the model. Here is the code:
from fastapi import FastAPI
from starlette.middleware.cors import CORSMiddleware
from pycaret.classification import *
import pandas as pd
import uvicorn  # ASGI
import pickle
import pydantic
from pydantic import BaseModel


class customer_input(BaseModel):
    CLIENTNUM: int
    Customer_Age: int
    Gender: str
    Dependent_count: int
    Education_Level: str
    Marital_Status: str
    Income_Category: str
    Card_Category: str
    Months_on_book: int
    Total_Relationship_Count: int
    Months_Inactive_12_mon: int
    Contacts_Count_12_mon: int
    Credit_Limit: float
    Total_Revolving_Bal: int
    Avg_Open_To_Buy: float
    Total_Amt_Chng_Q4_Q1: float
    Total_Trans_Amt: int
    Total_Trans_Ct: int
    Total_Ct_Chng_Q4_Q1: float
    Avg_Utilization_Ratio: float


app = FastAPI()

# Loading the saved model from pycaret
model = load_model('BankChurnersCatboostModel25thDec2020')

origins = [
    '*'
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=['GET', 'POST'],
    allow_headers=['Content-Type', 'application/xml', 'application/json'],
)


@app.get("/")
def index():
    return {"Nothing to see here"}


@app.post("/predict")
def predict(data: customer_input):
    # Convert input data into a dictionary
    data = data.dict()
    # Convert the dictionary into a dataframe
    my_data = pd.DataFrame([data])
    # Predicting using pycaret
    prediction = predict_model(model, my_data)
    return prediction


# Only use below 2 lines when testing on localhost -- remove when deploying
if __name__ == '__main__':
    uvicorn.run(app, host='127.0.0.1', port=8000)
When I test this out I get an Internal Server Error from the OpenAPI interface, so I check my cmd and the error says:
ValueError: [TypeError("'numpy.int64' object is not iterable"), TypeError('vars() argument must have __dict__ attribute')]
How can I have the data that is passed into the predict function successfully converted into a DataFrame? Thank you.
OK, so I fixed this by changing the customer_input class: any int types I changed to float, and that fixed it. I don't understand why, though. Can anyone explain?
Fundamentally those values are only meant to be integers because they are all discrete values (e.g. the number of dependents at a bank), but I guess I could put a constraint on the front end.
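An alternative that keeps the int fields (a sketch, not from the original thread): the error most likely comes from FastAPI trying to serialize the pandas DataFrame returned by predict_model, so converting the result to plain Python types before returning it also avoids the numpy.int64 issue. The snippet reuses the imports, model and customer_input class from the question; only the return step changes.

import json

@app.post("/predict")
def predict(data: customer_input):
    my_data = pd.DataFrame([data.dict()])
    prediction = predict_model(model, my_data)
    # DataFrame.to_json() serializes numpy.int64 / numpy.float64 values to
    # plain JSON numbers, so FastAPI receives a list of ordinary dicts.
    return json.loads(prediction.to_json(orient="records"))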

DStream JSON object to SQLite

Stack Overflow community, I have the following question:
I am using Spark Streaming and KafkaUtils to read from a Kafka topic, and then I transform the DStream to JSON. What I want is to save this JSON object to an SQLite database in a column/row format.
A sample of the code I run in Spark Streaming:
import sys
import json

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == '__main__':
    conf = SparkConf().setAppName("PythonStreamingDirectKafka").setMaster("spark://spark-master:7077")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")

    ssc = StreamingContext(sc, 20)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {'metadata.broker.list': brokers})
    message = kvs.map(lambda x: x[1])
    message.pprint()

    # Functions
    json_object = message.map(lambda s: eval(s))
    temperatures_object = json_object.map(lambda s: s["temperature_value"])

    # Aggregations
    json_object.pprint()
    temperatures_object.pprint()

    ssc.start()
    ssc.awaitTermination()
The output of the DStream: [screenshot: DStream output]
SQLite schema: [screenshot: database schema]
Do you have any idea how to achieve this? It is not clear to me how to get JSON data from Spark Streaming into SQLite using PySpark. I appreciate any help in advance!
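No answer was posted here, so the following is only a sketch of one possible approach, with an assumed table name, column set and database path: collect each micro-batch on the driver inside foreachRDD and insert the records into SQLite with the standard sqlite3 module. This is only reasonable while the batches are small enough to collect; for larger volumes the inserts would have to run on the executors.

import sqlite3

def save_to_sqlite(time, rdd):
    # Collect the micro-batch on the driver (assumes small batches).
    records = rdd.collect()
    if not records:
        return
    conn = sqlite3.connect('/data/readings.db')  # hypothetical path
    conn.execute("CREATE TABLE IF NOT EXISTS readings "
                 "(event_time TEXT, temperature_value REAL)")
    conn.executemany(
        "INSERT INTO readings (event_time, temperature_value) VALUES (?, ?)",
        [(str(time), rec["temperature_value"]) for rec in records])
    conn.commit()
    conn.close()

# json_object is the DStream of parsed dicts from the code above.
json_object.foreachRDD(save_to_sqlite)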

How to convert a JSON file into a table structure in Redshift using Python

How can I convert a JSON file into a table structure in Redshift? I tried the Python code below.
import boto3
import json
import os
import sys
import psycopg2
import csv
from collections import defaultdict


def jsonfile(path):
    session = boto3.Session(
        aws_access_key_id='dfjfkgj',
        aws_secret_access_key='sdfg',
        region_name='us-west-2')
    s3 = session.resource('s3')
    bucket = s3.Bucket('ag-redshift-poc')
    with open(path, 'rb') as data:
        res = json.load(data)
        f = open('data.csv', 'wb')
        output = csv.writer(f)
        output.writerow(res[0].keys())
        for row in res:
            output.writerow(row.values())
        bucket.put_object(Key=('C:\Python27\data.csv'), Body=res)
        print 'success'


def redshift():
    co = psycopg2.connect(dbname='redshiftpoc', host='shdjf',
                          port='5439', user='admin', password='snd')
    curr = co.cursor()
    curr.execute("""copy sample from 's3://ag-redshift-poc/testfile/json.txt'
        CREDENTIALS 'aws_access_key_id=fdfd;aws_secret_access_key=sxhd'
        """)
    co.commit()
    print 'success'
    curr.close()
    co.close()


jsonfile('C:\Python27\json.txt')
redshift()
Redshift can directly ingest JSON via COPY into your table (though not very efficiently). In your case, modify the COPY query to:
COPY sample FROM 's3://<bucket_name>/<path_to_json>'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
JSON 'auto' ACCEPTINVCHARS;
Please note the JSON 'auto' option in the query. It maps every column in the table to the matching key in the JSON file.
More details are in the COPY examples in the Redshift documentation.
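Applied to the redshift() function from the question, only the SQL passed to curr.execute changes (credentials, bucket and connection details are placeholders):

def redshift():
    co = psycopg2.connect(dbname='redshiftpoc', host='shdjf',
                          port='5439', user='admin', password='snd')
    curr = co.cursor()
    # JSON 'auto' makes Redshift map the JSON keys onto the columns of "sample".
    curr.execute("""COPY sample FROM 's3://ag-redshift-poc/testfile/json.txt'
        CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
        JSON 'auto' ACCEPTINVCHARS;""")
    co.commit()
    curr.close()
    co.close()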

Spark Streaming + Kafka - appending to Parquet

I am streaming meter-reading records as JSON from Kafka 2.11-1 into Spark 2.1. I don't understand how to convert the streamed object into a DataFrame before saving it to a Parquet file. I want the Scala script to infer the schema from the JSON, so that a new Parquet file format is generated automatically when the JSON format of the streaming source data changes (I'll figure out later how to detect this and start a new file whenever a format change occurs). For now, I am unable to write the Parquet file.
import org.apache.spark
import org.apache.spark.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode, SparkSession}

val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
val sqlContext = new SQLContext(sc)
ssc.checkpoint("_checkpoint")

// Connect to Kafka
import org.apache.spark.streaming.kafka.KafkaUtils
import _root_.kafka.serializer.StringDecoder

val kafkaParams = Map("metadata.broker.list" -> "xx.xx.xx.xx:9092")
val kafkaTopics = Set("test")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, kafkaTopics)

messages.print()

messages.foreachRDD(rdd => {
  val part1 = rdd.map(_._1)
  val part2 = rdd.map(_._2)  // this has the json
  print("%%%% part1 is : " + part1)
  print("%%%% part2 is : " + part2)
  // here: infer the schema from json and append the streamed data to a parquet file on hdfs
})

ssc.start()
ssc.awaitTermination()
The JSON looks like this:
-------------------------------------------
Time: 1513155855000 ms
-------------------------------------------
(null,{"customer_id":"customer_51","customer_acct_id":"cusaccid_1197","serv_acct_id":"service_1957","installed_service_id":"instserv_946","meter_id":"meter_319","channel_number":"156","interval_read_date":"2013-06-16 11:26:04","interval_received":"5","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N", "provisioned_meter_ind:"N"})
(null,{"customer_id":"customer_25","customer_acct_id":"cusaccid_1303","serv_acct_id":"service_844","installed_service_id":"instserv_1636","meter_id":"meter_663","channel_number":"1564","interval_read_date":"2014-02-13 12:52:34","interval_received":"8","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1955","customer_acct_id":"cusaccid_1793","serv_acct_id":"service_577","installed_service_id":"instserv_1971","meter_id":"meter_1459","channel_number":"1312","interval_read_date":"2017-05-23 07:32:13","interval_received":"11","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1833","customer_acct_id":"cusaccid_1381","serv_acct_id":"service_461","installed_service_id":"instserv_477","meter_id":"meter_1373","channel_number":"1769","interval_read_date":"2011-12-13 10:12:20","interval_received":"15","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
(null,{"customer_id":"customer_1597","customer_acct_id":"cusaccid_1753","serv_acct_id":"service_379","installed_service_id":"instserv_1061","meter_id":"meter_1759","channel_number":"632","interval_read_date":"2013-07-22 05:49:55","interval_received":"7","interval_measure":"hour","interval_expected":"1","received_3days":"1","expected_3days":"1","received_30days":"1","expected_30days":"1","meter_exclusion_ind":"N","provisioned_meter_ind":"N"})
2017-12-13 09:04:15,626 INFO org.apache.spark.streaming.scheduler.JobGenerator (Logging.scala:logInfo(54)) - Checkpointing graph for time 1513155855000 ms
I'm testing this using spark-shell:
spark-shell --jars /opt/alti-spark-2.1.1/external/kafka-0-8/target/spark-streaming-kafka-0-8_2.11-2.1.1.jar --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.0
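No answer was posted in this thread either. For what it's worth, here is a rough sketch of the usual pattern, written in PySpark for consistency with the other snippets on this page even though the question uses Scala: inside foreachRDD, let spark.read.json infer the schema of the JSON strings in each micro-batch, then append the resulting DataFrame to a Parquet path (the HDFS path is an assumption):

def write_batch_to_parquet(rdd):
    if rdd.isEmpty():
        return
    # spark.read.json on an RDD of JSON strings infers the schema
    # from the current micro-batch (Spark 2.x).
    df = spark.read.json(rdd)
    df.write.mode("append").parquet("hdfs:///data/meter_readings")  # hypothetical path

# "messages" is the (key, value) stream from the question; the value holds the JSON.
messages.map(lambda kv: kv[1]).foreachRDD(write_batch_to_parquet)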

What are SparkSession Config Options

I am trying to use SparkSession to convert a file's JSON data to an RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("config.key.here", configValueHere)
  .enableHiveSupport()
  .getOrCreate()

val jread = spark.read.json("search-results1.json")
I am very new to Spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper-level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure whether you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest//api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
    SparkSession
    .builder
    .appName("Your App Name")
    .config("spark.some.config.option1", "some-value")
    .config("spark.some.config.option2", "some-value")
    .getOrCreate())

sc = spark.sparkContext
SparkContext
spark_conf = (
    SparkConf()
    .setAppName("Your App Name")
    .set("spark.some.config.option1", "some-value")
    .set("spark.some.config.option2", "some-value"))

sc = SparkContext(conf=spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
    --properties-file path/to/your/spark-defaults.conf \
    --name "Your App Name" \
    --py-files path/to/your/supporting/pyspark_files.zip \
    --class Main path/to/your/pyspark_main.py
This is how it worked for me to add Spark or Hive settings in my Scala code:
{
  val spark = SparkSession
    .builder()
    .appName("StructStreaming")
    .master("yarn")
    .config("hive.merge.mapfiles", "false")
    .config("hive.merge.tezfiles", "false")
    .config("parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("hive.merge.smallfiles.avgsize", "160000000")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.sql.orc.impl", "native")
    .config("spark.sql.parquet.binaryAsString", "true")
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    //.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
    .getOrCreate()
}
The easiest way to set some config:
spark.conf.set("spark.sql.shuffle.partitions", 500)
where spark refers to a SparkSession. That way you can set configs at runtime, which is really useful when you want to change configs again and again to tune some Spark parameters for specific queries.
In simple terms, values set with the config method are automatically propagated to both the SparkConf and the SparkSession's own configuration.
For example, you can refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn about this API, you can refer to: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at runtime as in your example above, or through the config file given to spark-submit.
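Tying this back to the snippet in the question: any real key/value pair from the configuration page above can replace config.key.here / configValueHere. A minimal PySpark example (the key chosen here is just an illustration):

from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName("jsonReaderApp")
    # "spark.sql.shuffle.partitions" is one concrete example of a valid key.
    .config("spark.sql.shuffle.partitions", "50")
    .getOrCreate())

jread = spark.read.json("search-results1.json")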