I am using Spark Jobserver https://github.com/spark-jobserver/spark-jobserver and Apache Spark for some analytic processing.
I receive the following structure back from the jobserver when a job finishes:
"status": "OK",
"result": [
"[17799.91015625,null,hello there how areyou?]",
"[50000.0,null,Hi, im fine]",
"[0.0,null,All good]"
]
The result doesn't contain valid JSON, as explained here:
https://github.com/spark-jobserver/spark-jobserver/issues/176
So I'm trying to convert the returned structure into valid JSON. However, I can't simply insert quotes around values by splitting on the comma delimiter, because the values themselves sometimes contain commas.
How can I convert a Spark SQL Row into a JSON object in this situation?
I actually found a better way in the end: from Spark 1.3.0 onwards you can use .toJSON on a DataFrame to convert it to JSON:
df.toJSON.collect()
To output a DataFrame's schema as JSON you can use:
df.schema.json
Related
Athena (Trino SQL) parsing JSON document (table column called document 1 in Athena) using fields (dot notation)
If the underlying json (table column called document 1 in Athena) is in the form of {a={b ...
I can parse it in Athena (Trino SQL) using
document1.a.b
However, if the JSON contains {a={"text": value1 ...
the quote marks will not parse correctly.
Is there a way to do JSON parsing of a 'field' with quotes?
If not, is there an elegant way of parsing the "text" and obtain the string in value 1? [Please see my comment below].
I cannot change the quotes in the json and its Athena "table" so I would need something that works in Trino SQL syntax.
The error message is in the form of: SQL Error [100071] [HY000]: [Simba][AthenaJDBC](100071) An error has been thrown from the AWS Athena client. SYNTAX_ERROR: Expression [redacted] is not of type ROW
NOTE: This is not a duplicate of Oracle Dot Notation Question
Dot notation works only for columns typed as struct<…>. You can do that for JSON data, but judging from the error and your description this seems not to be the case here. I assume your column is of type string.
If you have JSON data in a string column, you can use JSON functions to parse it and extract parts of it with JSONPath.
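For example, a minimal sketch assuming document1 is a string column containing JSON like {"a": {"text": "value1"}} (the table name my_table is hypothetical):
SELECT json_extract_scalar(document1, '$.a.text') AS text_value
FROM my_table;
json_extract_scalar returns the matched value as a varchar, while json_extract returns a JSON value you can keep navigating.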
I created a table in Athena with the structure below:
CREATE EXTERNAL TABLE s3_json_objects (
devId string,
type string,
status string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://mybucket/folder1/data/athena_test/';
The S3 bucket objects contain a JSON structure like this:
{
"devId": "00abcdef1122334401",
"type": "lora",
"status": "huihuhukiyg"
}
However, the SQL below works correctly and returns the right result, but only for the count:
SELECT count(*) as total_s3_objects FROM "athena_db"."s3_json_objects"
But whenever I run the SELECT statements below to fetch the JSON values from S3, the result sets come back with empty values for the columns:
SELECT devid FROM "athena_db"."s3_json_objects"
SELECT json_extract(devid , '$.devid') as Id FROM "athena_db"."s3_json_objects"
SELECT * FROM "athena_db"."s3_json_objects"
I also reviewed these links and the AWS Athena docs before posting this question on Stack Overflow:
Can't read json file via Amazon Athena
AWS Athena json_extract query from string field returns empty values
Any comments or suggestions would be much appreciated.
The JSON must be on a single line, as mentioned in this page of the AWS Athena documentation. You can have multiple JSON objects on separate lines, but each complete object must span only one line.
Example (this could all be in one S3 object):
{"devId": "a1", "type": "b1", "status": "c1"}
{"devId": "a2", "type": "b2", "status": "c2"}
Glue can read multi-line JSON objects because it has a Spark engine under the hood. One workaround, if you can't easily put each JSON object on a single line, is to transform the JSON to Parquet using Glue.
Use jsonlines to convert the JSON to JSON Lines, and then Athena will be able to fetch all rows.
I have searched a lot but not found an exact solution.
I have a REST service whose response contains rows, each row being a JSON object, as given below:
{"event":"click1","properties":{ "time":"2 dec 2018","clicks":29,"parent":"jbar","isLast":"NO"}}
{"event":"click2","properties":{ "time":"2 dec 2018","clicks":35,"parent":"jbar3","isLast":"NO"}}
{"event":"click3","properties":{ "time":"2 dec 2018","clicks":10,"parent":"jbar2","isLast":"NO"}}
{"event":"click4","properties":{ "time":"2 dec 2018","clicks":9,"parent":"jbar1","isLast":"YES"}}
Each row is a JSON object (all similar to each other). I have a database table with all those fields as columns. I want to loop through the rows and load all the data using Talend. What I have tried so far:
tRestClient--tNormalize--tExtractJsonFields--tOracleOutput
and provided the loop criteria and mapping in the tExtractJsonFields component, but it is not working and throws the error "json can not be null or empty".
I need help with this.
Since your web service returns multiple JSON objects in the response, the response is not one valid JSON document but rather a stream of JSON objects.
You need to break it into individual JSON objects.
You can add a tNormalize between tRESTClient and tExtractJsonFields and normalize the JSON document on the "\n" character.
The error "json can not be null or empty" is due to an error in your Jsonpath queries. You have to set the loop query to "$", and reference the json properties using "event", "properties.time"
Could you try this: in your tExtractJsonFields, set the "Read By" property to JsonPath without loop.
Can I get help creating a table on AWS Athena?
Here is a sample of the data:
[{"lts": 150}]
AWS Glue generates the schema as:
array (array<struct<lts:int>>)
When I try to preview the table created by AWS Glue, I get this error:
HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray
The error message is clear, but I can't find the source of the problem!
Hive running under AWS Athena uses Hive-JSON-Serde to serialize/deserialize JSON. For some reason, it doesn't support arbitrary standard JSON: it requires one record per line, without an enclosing array. In their words:
The following example will work.
{ "key" : 10 }
{ "key" : 20 }
But this won't:
{
"key" : 20,
}
Nor this:
[{"key" : 20}]
You should create a JSON classifier to convert the array into a list of objects instead of a single array object. Use the JSON path $[*] in your classifier and then set up the crawler to use it:
Edit crawler
Expand 'Description and classifiers'
Click 'Add' on the left pane to associate your classifier with the crawler
After that, remove the previously created table and re-run the crawler. It will create a table with the proper schema, but I think Athena will still complain when you try to query it. However, you can now read from that table using a Glue ETL job and process single-record objects instead of array objects.
This JSON, [{"lts": 150}], would work like a charm with the query below:
SELECT n.lts FROM table_name
CROSS JOIN UNNEST(table_name."array") AS t (n)
The output is a single row with the value 150.
But I have faced a challenge with JSON like [{"lts": 150},{"lts": 250},{"lts": 350}]:
even though there are 3 elements in the array, the query returns only the first one. This may be because of the limitation listed by @artikas.
Of course, we can reshape the JSON as below to make it work:
{"lts": 150}
{"lts": 250}
{"lts": 350}
Please post if anyone has a better solution.
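One possible workaround, assuming the raw document is also available as a string column (json_col here is a hypothetical name), is to parse the array explicitly and unnest it; this is a sketch, not a tested solution:
SELECT json_extract_scalar(element, '$.lts') AS lts
FROM table_name
CROSS JOIN UNNEST(CAST(json_parse(json_col) AS ARRAY(JSON))) AS t (element);
This produces one row per array element, regardless of how many objects the array holds.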
The current PostgreSQL version (9.4) supports the json and jsonb data types, as described in http://www.postgresql.org/docs/9.4/static/datatype-json.html
For instance, JSON data stored as jsonb can be queried via SQL query:
SELECT jdoc->'guid', jdoc->'name'
FROM api
WHERE jdoc @> '{"company": "Magnafone"}';
As a Spark user, is it possible to send this query to PostgreSQL via JDBC and receive the result as a DataFrame?
What I have tried so far:
val url = "jdbc:postgresql://localhost:5432/mydb?user=foo&password=bar"
val df = sqlContext.load("jdbc",
Map("url"->url,"dbtable"->"mydb", "driver"->"org.postgresql.Driver"))
df.registerTempTable("table")
sqlContext.sql("SELECT data->'myid' FROM table")
But sqlContext.sql() was unable to understand the data->'myid' part in the SQL.
It is not possible to query json/jsonb fields dynamically with the Spark DataFrame API. Once the data is fetched into Spark it is converted to a string and is no longer a queryable structure (see SPARK-7869).
As you've already discovered, you can use the dbtable / table argument to pass a subquery directly to the source and use it to extract the fields of interest. Pretty much the same rule applies to any non-standard type, calling stored procedures, or any other extensions.
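For example (a sketch reusing the api table and jdoc column from the question): pass an aliased subquery as the dbtable value, so the jsonb operators are evaluated inside PostgreSQL before the rows reach Spark (the alias tmp is arbitrary, but Spark requires one):
(SELECT jdoc->>'guid' AS guid, jdoc->>'name' AS name
 FROM api
 WHERE jdoc @> '{"company": "Magnafone"}') AS tmp
Spark wraps whatever you pass as dbtable in a SELECT * FROM ... query, so the DataFrame receives plain text columns and never has to understand the -> / ->> syntax itself.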