AWS Glue - Crawl JSON file and insert into Redshift

Hi, I am trying to use AWS Glue to load an S3 file into Redshift. When I try to crawl a JSON file from my S3 bucket into a table, it doesn't seem to work: the result is a table with a single array column, as seen in the screenshot below. I have already tried using a JSON classifier with the path "$[*]", but that doesn't seem to work either. Any ideas?
The structure of the JSON file is as follows:
[
  {
    "firstname": "andrew",
    "lastname": "johnson",
    "subject": "Mathematics",
    "mark": 49
  },
  {
    "firstname": "mary",
    "lastname": "james",
    "subject": "Physics",
    "mark": ""
  },
  {
    "firstname": "Peter",
    "lastname": "Lloyd",
    "subject": "Soc. Studies",
    "mark": 89
  }
]
The screenshot below shows the resulting table: a single array column that can't be mapped to the table in Redshift.

Related

Writing Spark DataFrame to Kafka as comma-separated JSON objects

I am not able to send the DataFrame as a comma-separated JSON object for a larger data set.
Working code for a smaller data set:
df.selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value") \
    .write.format("kafka") \
    .option("compression", "gzip") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "JsonFormat") \
    .option("kafka.request.timeout.ms", 120000) \
    .option("kafka.linger.ms", 10) \
    .option("kafka.retries", 3) \
    .save()
spark.stop()
Output:
[{
  "firstname": "James",
  "middlename": "",
  "lastname": "Smith",
  "id": "36636",
  "gender": "M",
  "salary": 3000
}, {
  "firstname": "Michael",
  "middlename": "Rose",
  "lastname": "",
  "id": "40288",
  "gender": "M",
  "salary": 4000
}, {
  "firstname": "Robert",
  "middlename": "",
  "lastname": "Williams",
  "id": "42114",
  "gender": "M",
  "salary": 4000
}, {
  "firstname": "Maria",
  "middlename": "Anne",
  "lastname": "Jones",
  "id": "39192",
  "gender": "F",
  "salary": 4000
}, {
  "firstname": "Satish",
  "middlename": "Anjaneyapp",
  "lastname": "Brown",
  "id": "",
  "gender": "F",
  "salary": -1
}]
Actual Problem
For a larger data set, collect_list(to_json(struct(*))) tries to collect a huge amount of data and send it through Kafka as a single message. We are getting the error below:
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 51312082 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
Limitation:
I can send only 1 MB per message through Kafka.
Is there a way we can break the message into chunks of up to 1 MB and send the comma-separated JSON objects?
I tried the configurations below, but no luck:
kafka.linger.ms
batch.size
Don't comma-separate your JSON objects: the resulting records won't be valid JSON. You also shouldn't break the message into "1 MB chunks", because then you'll have incomplete strings sent to different partitions, with no easy way to determine the ordering needed to reassemble them in a consumer.
Remove the collect_list call and instead ensure your DataFrame has a value string column containing one valid JSON object per row. The Kafka writer will then write each row as a new message.
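For example, a minimal sketch of that approach (assuming the same df, broker, and topic as in the question):
from pyspark.sql import functions as F

# One JSON object per row becomes one Kafka message per row,
# so no single message grows beyond the broker's size limit.
df.select(F.to_json(F.struct(*df.columns)).alias("value")) \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "JsonFormat") \
    .save()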

How to parse nested JSON and write it to Redshift?

I have the following JSON structure:
{
  "firstname": "A",
  "lastname": "B",
  "age": 24,
  "address": {
    "streetAddress": "123",
    "city": "San Jone",
    "state": "CA",
    "postalCode": "394221"
  },
  "phonenumbers": [
    { "type": "home", "number": "123456789" },
    { "type": "mobile", "number": "987654321" }
  ]
}
I need to copy this JSON from S3 to a Redshift table.
I am currently using the COPY command with a jsonpaths file, but it loads the array as a single column.
I want the nested array to be parsed so that the table looks like this:
firstname | lastname | age | streetaddress | city    | state | postalcode | type   | number
----------|----------|-----|---------------|---------|-------|------------|--------|----------
A         | B        | 24  | 123           | SanJose | CA    | 394221     | home   | 123456789
A         | B        | 24  | 123           | SanJose | CA    | 394221     | mobile | 987654321
Is there a way to do that?
You can use nested JSON paths by making use of jsonpaths files. However, this does not work with the multiple phone number types.
If you can modify the dataset to have multiple records (one for mobile, one for home), then your jsonpaths file would look similar to the one below.
{
  "jsonpaths": [
    "$.firstname",
    "$.lastname",
    "$.age",
    "$.address.streetAddress",
    "$.address.city",
    "$.address.state",
    "$.address.postalCode",
    "$.phonenumbers[0].type",
    "$.phonenumbers[0].number"
  ]
}
If you are unable to change the format, you will need to perform an ETL step on load before the data can be consumed by Redshift. For this you could use an S3 object-created event to trigger a Lambda function that performs the ETL for you before the data is loaded into Redshift.
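As a rough sketch of that flattening step (a hypothetical helper, not a full Lambda; the field names are taken from the JSON above):
import json

def flatten(record):
    # Merge the nested address into the top level, then emit one flat
    # dict per phone number so Redshift gets one row per number.
    base = {
        "firstname": record["firstname"],
        "lastname": record["lastname"],
        "age": record["age"],
        **record["address"],
    }
    return [{**base, **phone} for phone in record["phonenumbers"]]

# rows can then be written back to S3 as JSON lines for a plain COPY.
rows = flatten(json.loads(raw_json))  # raw_json: the object body read from S3 (assumption)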

How can we structure an array of dictionaries in the Firebase Realtime Database?

I want to create the JSON structure below in the Firebase Realtime Database:
{
  "doner": [
    {
      "firstName": "Sandesh",
      "lastName": "Sardar",
      "location": [50.11, 8.68],
      "mobile": "100",
      "age": 21
    },
    {
      "firstName": "Akash",
      "lastName": "saw",
      "location": [50.85, 4.35],
      "mobile": "1200",
      "age": 22
    },
    {
      "firstName": "Sahil",
      "lastName": "abc",
      "location": [48.85, 2.35],
      "mobile": "325846",
      "age": 23
    },
    {
      "firstName": "ram",
      "lastName": "abc",
      "location": [46.2039, 6.1400],
      "mobile": "3257673",
      "age": 34
    }
  ]
}
But when I imported the file into the Firebase Realtime Database, it turned into the structure shown in the screenshot.
I believe this is not an array of dictionaries, but a dictionary of multiple dictionaries.
Is there any way to structure an array of dictionaries in Firebase?
The Firebase Realtime Database doesn't natively store arrays in the format you want. It instead stores arrays as key-value pairs, with the key being the string representation of each item's index in the array.
When you read the data from Firebase (either through an SDK, or through the REST API), it converts this map back into an array.
So what you're seeing is the expected behavior, and there's no way to change it.
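For illustration, the "doner" array above ends up stored roughly like this (a sketch of the index-keyed layout, abbreviated; not copied from the console):
"doner": {
  "0": { "firstName": "Sandesh", "lastName": "Sardar", ... },
  "1": { "firstName": "Akash", "lastName": "saw", ... },
  ...
}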
If you'd like to learn more about how Firebase deals with arrays, and why, I recommend checking out Kato's blog post here: Best Practices: Arrays in Firebase.

PySpark - Getting a list of dicts and converting their keys/values to columns

I have the following JSON (located on my local file system at path_json):
[
  {
    "name": "John",
    "email": "john#hisemail.com",
    "gender": "Male",
    "dict_of_columns": [
      {
        "column_name": "hobbie",
        "columns_value": "guitar"
      },
      {
        "column_name": "book",
        "columns_value": "1984"
      }
    ]
  },
  {
    "name": "Mary",
    "email": "mary#heremail.com",
    "gender": "Female",
    "dict_of_columns": [
      {
        "column_name": "language",
        "columns_value": "Python"
      },
      {
        "column_name": "job",
        "columns_value": "analyst"
      }
    ]
  }
]
As you can see, this is nested JSON.
I am reading it with the following command:
df = spark.read.option("multiline", "true").json(path_json)
OK. Now, it produces the following DataFrame:
+------------------------------------+-------------------+------+----+
|dict_of_columns |email |gender|name|
+------------------------------------+-------------------+------+----+
|[[hobbie, guitar], [book, 1984]] |john#hisemail.com |Male |John|
|[[language, Python], [job, analyst]]|mary#heremail.com |Female|Mary|
+------------------------------------+-------------------+------+----+
I want to know if there is a way to produce the following DataFrame:
+----+-----------------+------+------+-------+--------+----+
|book|email |gender|hobbie|job |language|name|
+----+-----------------+------+------+-------+--------+----+
|1984|john#hisemail.com|Male |guitar|null |null |John|
|null|mary#heremail.com|Female|null |analyst|Python |Mary|
+----+-----------------+------+------+-------+--------+----+
A few comments:
My real data has thousands and thousands of lines.
I don't know all the column_name values in my dataset (there are many of them).
email is unique for each line, so it can be used as a key if a join is necessary. I tried this approach before: create a main DataFrame with columns [name, gender, email] and other DataFrames for each row containing the dictionaries. But without success (and it doesn't have good performance).
Thank you so much!
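One possible approach (a sketch, not from the original post; it assumes the df read above and uses explode plus pivot, where pivot has to determine the distinct column_name values first):
from pyspark.sql import functions as F

# One row per (column_name, columns_value) pair per person.
exploded = df.select(
    "name", "email", "gender",
    F.explode("dict_of_columns").alias("kv")
).select(
    "name", "email", "gender",
    F.col("kv.column_name").alias("k"),
    F.col("kv.columns_value").alias("v"),
)

# Turn each distinct column_name into its own column, one value per cell.
pivoted = exploded.groupBy("name", "email", "gender").pivot("k").agg(F.first("v"))
pivoted.show(truncate=False)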

How to store data in an SQLite database from an existing JSON file in Django?

Suppose I have a JSON file:
{Key1:value,
key2:value,
key3:value}
{Key1:value,
key2:value,
key3:value}
Now I want to store this data in an SQLite database, which should have one table with the fields Key1, Key2, and Key3.
How can I save this data into the database using a script?
What you are looking for is Django fixtures.
First, make a Django model that represents the table you want to create; Key1, Key2, etc. are the fields in your model.
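For example, a rough sketch of such a model (app, model, and field names/types here are assumptions, not from the question):
# myapp/models.py
from django.db import models

class Record(models.Model):
    # One field per JSON key; CharField is an assumed type.
    key1 = models.CharField(max_length=255)
    key2 = models.CharField(max_length=255)
    key3 = models.CharField(max_length=255)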
Then, all you need to do is get your data into the correct JSON format, like the one below.
[
  {
    "model": "myapp.person",
    "pk": 1,
    "fields": {
      "first_name": "John",
      "last_name": "Lennon"
    }
  },
  {
    "model": "myapp.person",
    "pk": 2,
    "fields": {
      "first_name": "Paul",
      "last_name": "McCartney"
    }
  }
]
Finally, run python manage.py loaddata <json filepath>. And you're done!