Error in reading JSON file with schema in Scala

Getting an error while reading the ArrayType values (phoneNumbers); without the ArrayType values, I can read the rest of the values.
{
  "firstName": "Rack",
  "lastName": "Jackon",
  "gender": "man",
  "age": 24,
  "address": {
    "streetAddress": 126,
    "city": "San Jone",
    "state": "CA",
    "postalCode": 394221
  },
  "phoneNumbers": [
    { "type": "home", "number": 7383627627 }
  ]
}
My schema ->
val schema = StructType(List(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("gender", StringType),
  StructField("age", IntegerType),
  StructField("address", StructType(List(
    StructField("streetAddress", StringType),
    StructField("city", StringType),
    StructField("state", StringType),
    StructField("postalCode", IntegerType)))),
  StructField("phoneNumbers", ArrayType(StructType(List(
    StructField("type", StringType),
    StructField("number", IntegerType)))))
))
json_df.selectExpr("firstName","lastName",
"gender","age","address.streetAddress","address.city",
"address.state","address.postalCode",
"explode(phoneNumbers) as phone","phone.type","phone.number").drop("phone").show()
When I call .show, it prints only the column names and no rows, but when I leave out the "phoneNumbers" array, it works fine.

IntegerType represents 4-byte signed integers and has a maximum value of 2147483647, which cannot hold a ten-digit phone number. Use LongType or StringType for phone numbers instead.
You got no results from your select query because you're exploding an empty (null) phoneNumbers array, and explode returns 0 rows for it. The array comes back empty because the phone numbers cannot be stored in an IntegerType column.
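For example, a minimal sketch of the corrected schema; only the phone number's type changes (StringType would work just as well, and would also preserve leading zeros):

import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("gender", StringType),
  StructField("age", IntegerType),
  StructField("address", StructType(List(
    StructField("streetAddress", StringType),
    StructField("city", StringType),
    StructField("state", StringType),
    StructField("postalCode", IntegerType)))),
  StructField("phoneNumbers", ArrayType(StructType(List(
    StructField("type", StringType),
    StructField("number", LongType)))))  // LongType instead of IntegerType
))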

Related

Writing Spark DataFrame to Kafka as comma-separated JSON objects

I am not able to send a DataFrame as comma-separated JSON objects for a larger data set.
Working code for a smaller data set:
df.selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value") \
.write.format("kafka")\
.option("compression", "gzip")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "JsonFormat") \
.option("kafka.request.timeout.ms", 120000) \
.option("kafka.linger.ms", 10) \
.option("compression", "gzip")\
.option("kafka.retries", 3) \
.save()
spark.stop()
output
[{
  "firstname": "James",
  "middlename": "",
  "lastname": "Smith",
  "id": "36636",
  "gender": "M",
  "salary": 3000
}, {
  "firstname": "Michael",
  "middlename": "Rose",
  "lastname": "",
  "id": "40288",
  "gender": "M",
  "salary": 4000
}, {
  "firstname": "Robert",
  "middlename": "",
  "lastname": "Williams",
  "id": "42114",
  "gender": "M",
  "salary": 4000
}, {
  "firstname": "Maria",
  "middlename": "Anne",
  "lastname": "Jones",
  "id": "39192",
  "gender": "F",
  "salary": 4000
}, {
  "firstname": "Satish",
  "middlename": "Anjaneyapp",
  "lastname": "Brown",
  "id": "",
  "gender": "F",
  "salary": -1
}]
Actual problem
For a larger data set, CAST(collect_list(to_json(struct(*))) AS STRING) tries to collect a huge amount of data and send it through Kafka as a single message. We get the error below:
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 51312082 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
Limitation:
I can send only 1 MB per message through Kafka.
Is there a way to break the message into chunks of at most 1 MB and send the comma-separated JSON objects?
I tried the configurations below, but no luck:
kafka.linger.ms
batch.size
Don't comma-separate your JSON objects: the individual records then won't be valid JSON. You also shouldn't break the payload into "1 MB chunks", because then you'll have incomplete strings being sent to different partitions, with no easy way to determine the ordering to reassemble them in a consumer.
Remove the collect_list call and instead make sure your DataFrame has a value string column holding one valid JSON object per row. The Kafka writer will then write each row as a separate message.
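A minimal sketch of that approach, reusing the broker and topic from the question; to_json(struct(*)) yields one JSON string per row, so every Kafka message stays far below the 1 MB limit:

df.selectExpr("to_json(struct(*)) AS value") \
    .write.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "JsonFormat") \
    .save()

A consumer that needs the old array shape can collect the individual objects and wrap them in [...] itself.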

Select part of a JSON array based on a value from another column in PostgreSQL

I have a table where I store country information in one column and JSON data in another.
I'd like to select only part of the JSON data: find the country value inside the JSON, and return the values of the "animals" key that sits closest to (and to the left of) the matched country.
This is the table "myanimals":
Country | Metadata
--------|---------
US      | { "a": 1, "b": 2, "animals": ["dog","cat","mouse"], "region": {"country": "china"}, "animals": ["horse","bear","eagle"], "region": { "country": "us" } }
India   | { "a": 20, "b": 40, "animals": ["fish","cat","rat","hamster"], "region": {"country": "india"}, "animals": ["dog","rabbit","fox","fish"], "region": { "country": "poland" } }
Metadata is in json and NOT jsonb.
Using Postgres, I want a query that adds a new column, something like "animals_in_country", showing only the values of the "animals" key that is closest to (and located to the left of) the matched country, as follows:
Country | Metadata | animals_in_country
--------|----------|-------------------
US      | { "a": 1, "b": 2, "animals": ["dog","cat","mouse"], "region": {"country": "china"}, "animals": ["horse","bear","eagle"], "region": { "country": "us" } } | ["horse","bear","eagle"]
India   | { "a": 20, "b": 40, "animals": ["fish","cat","rat","hamster"], "region": {"country": "india"}, "animals": ["dog","rabbit","fox","fish"], "region": { "country": "poland" } } | ["fish","cat","rat","hamster"]
Here's some pseudo code of what I am trying to achieve (please refer to the table shown above)
- Take the value in "Country", "US", and find the location of the same value in the JSON column
- location found, now search before this key 'country' for the key 'animals'
- Return whole array of values from 'animals'
- should be ["horse","bear","eagle"]
- shouldn't be ["dog","cat","mouse"] (as this one is part of "china" country in the JSON)
NOTE: Although this is dummy data, this is more or less the issue I am solving. And yes, the JSON is showing the same key more than once.
In case you are looking for the first animal in the array, this is your answer.
select a.data -> 'animals' -> 0 as first_animal
from myanimals a
where a.data -> 'region' ->> 'country' = 'us'
or without filter (for all countries)
select a.data -> 'region' ->> 'country' as country,
       a.data -> 'animals' -> 0 as first_animal
from myanimals a
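Since Metadata is json (not jsonb), the duplicate keys survive, and json_each returns every key/value pair in document order. Building on that, here is a sketch of the "closest animals to the left" logic; it assumes Postgres 9.4+ (for WITH ORDINALITY), the table/column names from the question, and that each "animals" key immediately precedes its matching "region" key, as in the sample data:

select m.country,
       a.value as animals_in_country
from myanimals m
cross join lateral json_each(m.metadata) with ordinality as r(key, value, ord)
cross join lateral json_each(m.metadata) with ordinality as a(key, value, ord)
where r.key = 'region'
  and lower(r.value ->> 'country') = lower(m.country)  -- match the Country column
  and a.key = 'animals'
  and a.ord = r.ord - 1;                                -- the pair just before it
-- US    -> ["horse","bear","eagle"]
-- India -> ["fish","cat","rat","hamster"]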

Reshape a nested JSON file with jq and make a CSV

I've been struggling with this one for the whole day: a JSON which I want to turn into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the JSON to contain just 2 objects inside "items".
What I would like to get is a csv like this
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are two extra complications: there are two types of "officers", some are people and some are companies, so not all keys of one are present in the other and vice versa. I'd like the missing entries to be null. The second complication is nested objects like "name", which contains a comma, or "address", which contains several sub-objects (which I guess I could flatten in pandas, though).
{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}
What I've been doing is creating new JSON objects, extracting the fields I need, like this:
{officer_address: .items[]?.address, appointed_on: .items[]?.appointed_on, country_of_residence: .items[]?.country_of_residence, officer_role: .items[]?.officer_role, officer_dob: .items[]?.date_of_birth, officer_nationality: .items[]?.nationality, officer_occupation: .items[]?.occupation}
But the query runs for hours - and I am sure there is a quicker way.
Right now I am trying a new approach: creating a JSON whose root key is the company number and whose value is the list of its officers.
{(.links.self | split("/")[2]): .items[]}
Using jq, it's easiest to first extract the values from the top-level object that will be shared by all rows, then generate the desired rows. You'll want to limit the number of passes over the items to at most one.
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| @csv
' input.json
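With the truncated sample above, this should print something like:

"OC418979","England","llp-designated-member","2017-09-15"
"OC418979",,"corporate-llp-designated-member","2017-09-15"

@csv quotes string fields and renders null/missing values as empty fields, which covers the keys that exist only for people or only for companies (and would keep the commas inside "name" safe if you add that column).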
OK, you want to scan the list of officers, extract some fields from each one if they are present, and write that in CSV format.
The first part is to extract the data from the JSON. Assuming you loaded it into a Python object named data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
      data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')))
The get method on a dictionary lets you supply a default value (here the empty string) if the key is not present, and the csv module ensures that if a field contains a comma, it will be enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15

PostgreSQL, get JSON object field based on the value of a parallel attribute

Suppose we are dealing with a JSON object where there can be multiple child nodes with the same structure, and we want to get the values of attributes B, C, D, etc. where attribute A equals a specific value. Below is an example.
{
  "addresses": [{
    "type": "home",
    "address": "123 fake street",
    "zip": "24301"
  }, {
    "type": "work",
    "address": "346 Main street",
    "zip": "24352"
  }, {
    "type": "PO Box",
    "address": "PO BOX 132, New York, NY",
    "zip": "10001"
  }, {
    "type": "second",
    "address": "1600 Pennsylvania Ave.",
    "zip": "90210"
  }]
}
Is there a JSON operator in PostgreSQL with which I can get the zip code where the address type is "work" or "home"? I am looking at https://www.postgresql.org/docs/current/static/functions-json.html and not finding what I'm looking for.
You need to "unnest" (i.e. normalize) the data, then you can apply a WHERE condition on it:
select t.adr ->> 'zip', t.adr ->> 'address'
from the_table
cross join lateral jsonb_array_elements(the_column -> 'addresses') as t(adr)
where t.adr ->> 'type' in ('work', 'home');
Online example: http://rextester.com/TDB99535
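With the sample document above, that should return only the home and work rows, something like this (column labels added for readability):

 zip   | address
-------+-----------------
 24301 | 123 fake street
 24352 | 346 Main street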

Access deeper elements of a JSON using PostgreSQL 9.4

I want to be able to access deeper elements stored in a JSON column (called data below) in a PostgreSQL database. For example, I would like to be able to access the elements that traverse the path states->events->time from the JSON provided below. Here is the PostgreSQL query I'm using:
SELECT
  data #>> '{userId}' as "user",
  data #>> '{region}' as region,
  data #>> '{priorTimeSpentInApp}' as priorTimeSpentInApp,
  data #>> '{userAttributes, "Total Friends"}' as totalFriends
FROM game_json
WHERE game_name LIKE 'myNewGame'
LIMIT 1000
and here is an example record from the json field
{
  "region": "oh",
  "deviceModel": "inHouseDevice",
  "states": [
    {
      "events": [
        {
          "time": 1430247045.176,
          "name": "Session Start",
          "value": 0,
          "parameters": {
            "Balance": "40"
          },
          "info": ""
        },
        {
          "time": 1430247293.501,
          "name": "Mission1",
          "value": 1,
          "parameters": {
            "Result": "Win ",
            "Replay": "no",
            "Attempt Number": "1"
          },
          "info": ""
        }
      ]
    }
  ],
  "priorTimeSpentInApp": 28989.41467999999,
  "country": "CA",
  "city": "vancouver",
  "isDeveloper": true,
  "time": 1430247044.414,
  "duration": 411.53,
  "timezone": "America/Cleveland",
  "priorSessions": 47,
  "experiments": [],
  "systemVersion": "3.8.1",
  "appVersion": "14312",
  "userId": "ef617d7ad4c6982e2cb7f6902801eb8a",
  "isSession": true,
  "firstRun": 1429572011.15,
  "priorEvents": 69,
  "userAttributes": {
    "Total Friends": "0",
    "Device Type": "Tablet",
    "Social Connection": "None",
    "Item Slots Owned": "12",
    "Total Levels Played": "0",
    "Retention Cohort": "Day 0",
    "Player Progression": "0",
    "Characters Owned": "1"
  },
  "deviceId": "ef617d7ad4c6982e2cb7f6902801eb8a"
}
That SQL query works, except that it doesn't return any values for totalFriends (e.g. data #>> '{userAttributes, "Total Friends"}' as totalFriends). I assume part of the problem is that events falls within square brackets (I don't know what those indicate in the JSON format) as opposed to curly braces, but I'm also unable to extract values from the userAttributes key.
I would appreciate it if anyone could help me.
I'm sorry if this question has been asked elsewhere. I'm so new to postgresql and even json that I'm having trouble coming up with the proper terminology to find the answers to this (and related) questions.
You should definitely familiarize yourself with the basics of JSON and with the JSON functions and operators in Postgres.
In the second source pay attention to the operators -> and ->>.
General rule: use -> to get a json object, ->> to get a json value as text.
Using these operators you can rewrite your query in a way that returns the correct value of 'Total Friends':
select
  data ->> 'userId' as "user",
  data ->> 'region' as region,
  data ->> 'priorTimeSpentInApp' as priorTimeSpentInApp,
  data -> 'userAttributes' ->> 'Total Friends' as totalFriends
from game_json
where game_name like 'myNewGame';
JSON objects in square brackets are elements of a JSON array. An array may have many elements, and they are accessed by index; JSON arrays are indexed from 0 (the first element of an array has index 0).
Example:
select data -> 'states' -> 0 -> 'events' -> 1 ->> 'name'
from game_json
where game_name like 'myNewGame';
-- returns "Mission1"
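If you need a value from every event instead of one element by index, you can unnest both arrays. A minimal sketch, assuming PostgreSQL 9.4's json_array_elements and the table/column names above:

select ev ->> 'time' as event_time
from game_json,
     json_array_elements(data -> 'states') as st,
     json_array_elements(st -> 'events') as ev
where game_name like 'myNewGame';
-- one row per event: 1430247045.176 and 1430247293.501 for the sample record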