Writing a Spark DataFrame to Kafka as comma-separated JSON objects

I am not able to send a DataFrame as a comma-separated JSON object for larger data sets.
Working code for a smaller data set:
df.selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value") \
.write.format("kafka")\
.option("compression", "gzip")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "JsonFormat") \
.option("kafka.request.timeout.ms", 120000) \
.option("kafka.linger.ms", 10) \
.option("compression", "gzip")\
.option("kafka.retries", 3) \
.save()
spark.stop()
Output:
[{
    "firstname": "James",
    "middlename": "",
    "lastname": "Smith",
    "id": "36636",
    "gender": "M",
    "salary": 3000
}, {
    "firstname": "Michael",
    "middlename": "Rose",
    "lastname": "",
    "id": "40288",
    "gender": "M",
    "salary": 4000
}, {
    "firstname": "Robert",
    "middlename": "",
    "lastname": "Williams",
    "id": "42114",
    "gender": "M",
    "salary": 4000
}, {
    "firstname": "Maria",
    "middlename": "Anne",
    "lastname": "Jones",
    "id": "39192",
    "gender": "F",
    "salary": 4000
}, {
    "firstname": "Satish",
    "middlename": "Anjaneyapp",
    "lastname": "Brown",
    "id": "",
    "gender": "F",
    "salary": -1
}]
Actual Problem
For larger data sets, collect_list(to_json(struct(*))) tries to collect the whole result and send it through Kafka as a single message. We are getting the below error:
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 51312082 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
Limitation:
I can send only 1 MB per message through Kafka.
Is there a way we can break the message into chunks of up to 1 MB and send the comma-separated JSON objects?
Tried the below configurations, but no luck:
kafka.linger.ms
batch.size

Don't comma-separate your JSON objects: the records then won't be valid JSON. You also shouldn't break the payload into "1 MB chunks", because then you'll have incomplete strings being sent to different partitions, with no easy way to determine the ordering needed to reassemble them in a consumer.
Remove the collect_list call and instead ensure your DataFrame has a value string column holding one valid JSON object per row. The Kafka writer will then write each row as a separate message.
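A minimal sketch of that approach, reusing the topic and server names from the question:
# One JSON object per row: the Kafka sink writes each row's "value" column
# as its own message, so no single record grows with the size of the dataset.
df.selectExpr("to_json(struct(*)) AS value") \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "JsonFormat") \
    .save()
A consumer that needs an array can accumulate the individual messages itself; per-partition ordering is preserved, which is not true for arbitrary 1 MB chunks.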

Related

AWS Glue - Crawl JSON file and insert into Redshift

Hi, I am trying to use AWS Glue to load an S3 file into Redshift. When I try to crawl a JSON file from my S3 bucket into a table, it doesn't seem to work: the result is a table with a single array column. I have already tried using a JSON classifier with the path "$[*]", but that doesn't seem to work either. Any ideas?
The structure of the JSON file is as below:
[
    {
        "firstname": "andrew",
        "lastname": "johnson",
        "subject": "Mathematics",
        "mark": 49
    },
    {
        "firstname": "mary",
        "lastname": "james",
        "subject": "Physics",
        "mark": ""
    },
    {
        "firstname": "Peter",
        "lastname": "Lloyd",
        "subject": "Soc. Studies",
        "mark": 89
    }
]
The resulting table is a single array column, which can't be mapped to the table in Redshift (screenshot omitted).

Error reading a JSON file with a schema in Scala

I get an error while reading the ArrayType values (phoneNumbers); without the ArrayType values, I can read the rest.
{
    "firstName": "Rack",
    "lastName": "Jackon",
    "gender": "man",
    "age": 24,
    "address": {
        "streetAddress": 126,
        "city": "San Jone",
        "state": "CA",
        "postalCode": 394221
    },
    "phoneNumbers": [
        { "type": "home", "number": 7383627627 }
    ]
}
My schema ->
val schema = StructType(List(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("gender", StringType),
  StructField("age", IntegerType),
  StructField("address", StructType(List(
    StructField("streetAddress", StringType),
    StructField("city", StringType),
    StructField("state", StringType),
    StructField("postalCode", IntegerType)))),
  StructField("phoneNumbers", ArrayType(StructType(List(
    StructField("type", StringType),
    StructField("number", IntegerType)))))
))
json_df.selectExpr("firstName", "lastName",
  "gender", "age", "address.streetAddress", "address.city",
  "address.state", "address.postalCode",
  "explode(phoneNumbers) as phone", "phone.type", "phone.number").drop("phone").show()
When I do .show, it shows only the column names and no values, but when I don't select the phoneNumbers array, it works fine.
IntegerType represents 4-byte signed integers and has a maximum of 2147483647, which cannot hold phone numbers. Use either LongType or StringType for phone numbers.
You got no results from your select query because you're exploding an empty array of phone numbers, which returns 0 rows. The array is empty because the phone numbers cannot be stored in an IntegerType column.
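For illustration, a minimal sketch of the corrected schema, written here in PySpark (the Scala fix is the same one-word change, IntegerType to LongType; the address fields are omitted for brevity):
from pyspark.sql.types import (ArrayType, IntegerType, LongType,
                               StringType, StructField, StructType)

# LongType holds values like 7383627627, which overflow a 4-byte IntegerType.
schema = StructType([
    StructField("firstName", StringType()),
    StructField("lastName", StringType()),
    StructField("gender", StringType()),
    StructField("age", IntegerType()),
    StructField("phoneNumbers", ArrayType(StructType([
        StructField("type", StringType()),
        StructField("number", LongType()),
    ]))),
])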

Extract data from a JSON file using Python

Say I have a JSON entry as follows (the JSON file was generated by fetching data from a Firebase DB):
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
How do I extract the content corresponding to 'value', which is present inside the 'bills' column? Is there any way to do this?
My Python code is as follows. With this I was only able to get the data within the bills column, but I need only the entry corresponding to 'value' inside bills.
import json

filedata = open('firebase-dataset.json', 'r')
data = json.load(filedata)
listoffields = []  # to collect each user's bills into a list
for dic in data:
    try:
        listoffields.append(dic['bills'])  # only non-essential bill categories.
    except KeyError:
        pass
print(listoffields)
The JSON you posted contains misplaced quotes.
I think you are trying to extract the value of the 'value' key within bills.
Try this:
print(listoffields[0][0]['value'])
which will print 100.0 as a str; use float() to use it in calculations.
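For example:
bill_value = float(listoffields[0][0]['value'])  # 100.0 as a float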
---edit---
Say the JSON you have contains many JSON objects separated by commas, as in
[{ first-entry },{ second-entry },{ third.. }, ....and so on]
and you want to find the value of each bill in each JSON object; the code below may work:
bill_value_list = []  # to store the 'value' of each bill
for bill_list in listoffields:
    bill_value_list.append(float(bill_list[0]['value']))  # bill_list[0] contains the complete bill dictionary.
print(bill_value_list)
print(sum(bill_value_list))  # do something useful
Paste it after the code you posted (no changes to your code, since it always works :-) ).
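Note that bill_list[0] only reads the first bill of each user. If a user can have several bills, a small sketch that sums the 'value' of every bill:
# Iterate every bill of every user, not just the first one per user.
total = sum(
    float(bill['value'])
    for bills in listoffields  # each element is one user's list of bills
    for bill in bills
)
print(total)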

How can we structure an array of dictionaries in the Firebase Realtime Database?

I want to create the below JSON structure in the Firebase Realtime Database:
{
    "doner": [
        {
            "firstName": "Sandesh",
            "lastName": "Sardar",
            "location": [50.11, 8.68],
            "mobile": "100",
            "age": 21
        },
        {
            "firstName": "Akash",
            "lastName": "saw",
            "location": [50.85, 4.35],
            "mobile": "1200",
            "age": 22
        },
        {
            "firstName": "Sahil",
            "lastName": "abc",
            "location": [48.85, 2.35],
            "mobile": "325846",
            "age": 23
        },
        {
            "firstName": "ram",
            "lastName": "abc",
            "location": [46.2039, 6.1400],
            "mobile": "3257673",
            "age": 34
        }
    ]
}
But when I imported the file into the Firebase Realtime Database, it turned into a different structure (screenshot omitted).
I believe this is not an array of dictionaries, but a dictionary of multiple dictionaries.
Is there any way to structure an array of dictionaries in Firebase?
The Firebase Realtime Database doesn't natively store arrays in the format you want. It instead stores arrays as key-value pairs, with the key being the (string representation of the) index of each item in the array.
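For illustration (an assumed sketch of the stored shape, not actual Firebase output), the first two entries of the doner array above end up stored roughly like this, with the nested location arrays indexed the same way:
{
  "doner": {
    "0": { "firstName": "Sandesh", "lastName": "Sardar", "location": { "0": 50.11, "1": 8.68 }, "mobile": "100", "age": 21 },
    "1": { "firstName": "Akash", "lastName": "saw", "location": { "0": 50.85, "1": 4.35 }, "mobile": "1200", "age": 22 }
  }
}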
When you read the data from Firebase (either through an SDK, or through the REST API), it converts this map back into an array.
So what you're seeing is the expected behavior, and there's no way to change it.
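If you do end up reading the raw index-keyed map and want a plain list back, a small hedged Python sketch:
def to_list(index_map):
    # Rebuild a list from Firebase's string-index keys ("0", "1", ...).
    return [index_map[k] for k in sorted(index_map, key=int)]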
If you'd like to learn more about how Firebase deals with arrays, and why, I recommend checking out Kato's blog post here: Best Practices: Arrays in Firebase.

How can we take the keys from a JSON file in the original order?

I have a JSON file, and I want to process that JSON data as key-value pairs.
Here is my JSON file:
```"users" : {
"abc": {
"ip": "-------------",
"username": "users#gmail.com",
"password": "---------",
"displayname": "-------",
"Mode": "-----",
"phonenumber": "1********1",
"pstndisplay": "+1 *******5"
},
"efg": {
"ip": "-------------",
"username": "user1#gmail.com",
"password": "---------",
"displayname": "-------",
"Mode": "-----",
"phonenumber": "1********1",
"pstndisplay": "+1 *******5"
},
"xyz": {
"ip": "-------------",
"username": "user2#gmail.com",
"password": "---------",
"displayname": "-------",
"Mode": "-----",
"phonenumber": "1********1",
"pstndisplay": "+1 *******5"```
Here is how I tried to get the JSON data:
```
${the file as string}=    Get File    ${users_json_path}
${parsed}=    Evaluate    json.loads("""${the file as string}""")    json
${properties}=    Set Variable    ${parsed["users"]}
Log    ${properties}
:FOR    ${key}    IN    @{properties}
\    ${sub dict}=    Get From Dictionary    ${properties}    ${key}
\    Log    ${sub dict}
\    Signin    ${sub dict}[ip]    ${sub dict}[username]    ${sub dict}[password]    ${sub dict}[Mode]
\    Log    ${key} is successfully signed in.
```
Expected behavior: the keys I parse should come out in the same sequence as in the JSON file. For example, abc signs in first, then efg, then xyz.
${key} = abc
${key} = efg
${key} = xyz
Below are the questions:
1) How can we take the users from the JSON in sequence? Right now they are taken in random order.
2) What is the best logic to achieve that?
I see you tagged the question with python 2.7, where Bryan Oakely's comment fully holds true: the elements come back in arbitrary order.
If you upgrade to Python 3, though, from v3.6 onwards dictionaries are guaranteed to preserve insertion order. Thus, on parsing with the json library, the result will have the same order as the source string/file.
Alternatively, in v2 you can use OrderedDict to accomplish the same, by specifying the object_pairs_hook argument to the JSON decoder; with it the result will be an OrderedDict:
${parsed}=    Evaluate    json.loads("""${the file as string}""", object_pairs_hook=collections.OrderedDict)    json, collections
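The same idea in plain Python (the file name here is hypothetical):
import collections
import json

with open('users.json') as f:  # hypothetical path, for illustration
    json_text = f.read()
# Python 2: force ordered parsing. On Python 3.6+ a plain json.loads(json_text)
# already preserves the key order from the file.
parsed = json.loads(json_text, object_pairs_hook=collections.OrderedDict)
for key in parsed['users']:
    print(key)  # abc, efg, xyz, in file order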
#!/usr/bin/env python3
import json

def main():
    json_str = """
    {
        "users" : {
            "abc": {
                "ip": "-------------",
                "username": "users@gmail.com",
                "password": "---------",
                "displayname": "-------",
                "Mode": "-----",
                "phonenumber": "1********1",
                "pstndisplay": "+1 *******5"
            },
            "efg": {
                "ip": "-------------",
                "username": "user1@gmail.com",
                "password": "---------",
                "displayname": "-------",
                "Mode": "-----",
                "phonenumber": "1********1",
                "pstndisplay": "+1 *******5"
            }
        }
    }
    """
    json_object = json.loads(json_str)
    # Walk the raw text line by line to recover the keys in file order,
    # then look each one up in the parsed object.
    for line in json_str.split('\n'):
        if '"' in line and ':' in line and '{' in line and '"users"' not in line:
            key = line.split('"')[1]
            print(key, json_object['users'][key])

if __name__ == '__main__':
    main()