SQLite query by json object property within an array

I've recently started using SQLite with the JSON1 extension which allows me to store and query dynamic/json data.
Let's take the following table and data structure as an example:
# documents table:
--------------------------------------------
id | json
---|----------------------------------------
1  | [{"id": 1}, {"id": 2}, {"id": 3}]
2  | [{"id": 11}, {"id": 12}, {"id": 13}]
The problem I stumbled on is that there doesn't seem to be an easy way to query objects within an array without specifying an index. In other words, consider the following pseudo query:
SELECT *
FROM documents
WHERE json_extract(json, '$[*].id') > 1
# expect to return all rows that have json.*.id greater than 1
The above doesn't work because, instead of [*], you have to specify a concrete array index.
One workaround is to use json_each or json_tree, but that can get out of hand pretty quickly if you have to handle nested array objects, e.g. sub1.*.sub2.*.sub3.id.
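For the flat case, that workaround looks roughly like this (a sketch against the documents table above; DISTINCT guards against a row matching on several array elements):
SELECT DISTINCT d.id, d.json
FROM documents AS d, json_each(d.json) AS elem
WHERE json_extract(elem.value, '$.id') > 1;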
I found that the MySQL JSON data type supports [*], but I wasn't able to find anything similar for SQLite.
Is there some "hidden" syntax to specify [*] in JSON path queries for SQLite that I'm missing, or is this a limitation of the JSON1 extension?

Related

How to query json file located at s3 using presto

I have a JSON file stored in an Amazon S3 location and I want to query it using Presto. How can I achieve this?
Option 1 - Presto on EMR with json_extract built-in function
I am assuming that you have already launched Presto on EMR.
The easiest way to do this would be to use the json_extract function that comes with Presto by default.
So imagine you have a JSON file on S3 like this:
{"a": "a_value1", "b": { "bb": "bb_value1" }, "c": "c_value1"}
{"a": "a_value2", "b": { "bb": "bb_value2" }, "c": "c_value2"}
{"a": "a_value3", "b": { "bb": "bb_value3" }, "c": "c_value3"}
...
...
Each row represents a JSON object.
So you can simply define a table in Presto with a single string-typed field, and then easily query it with json_extract.
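As a hedged sketch (the table name, column name, and S3 path are assumptions), such a table can be declared through the Hive connector like this:
CREATE TABLE my_table (json_field varchar)
WITH (
    format = 'TEXTFILE',
    external_location = 's3://my-bucket/json-data/'
);
With that table defined, the query is simply: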
SELECT json_extract(json_field, '$.b.bb') as extract
FROM my_table
The result would be something like:
| extract   |
|-----------|
| bb_value1 |
| bb_value2 |
| bb_value3 |
This can be a fast and easy way to read a JSON file using Presto, but unfortunately it doesn't scale well to big JSON files.
Presto docs on json_extract: https://prestodb.github.io/docs/current/functions/json.html#json_extract
Option 2 - Presto on EMR with a specific Serde for json files
You can also customize Presto in the bootstrap phase of your EMR cluster by adding custom plugins or SerDe libraries.
So you just have to choose one of the available JSON SerDe libraries (e.g. org.openx.data.jsonserde.JsonSerDe) and follow its guide to define a table that matches the structure of the JSON file (a sketch follows after the list below).
You will be able to access the fields of the JSON file in a way similar to json_extract (using dotted notation), and it should be faster and scale well on big files. Unfortunately, this method has two main problems:
1) Defining a table for complex files is like being in hell.
2) You may hit internal Java cast exceptions, because the data in the JSON can't always be cast cleanly by the SerDe library.
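For the sample file above, a table definition with that SerDe might look like this (a sketch; the table name and S3 path are assumptions):
CREATE EXTERNAL TABLE my_json_table (
  a string,
  b struct<bb: string>,
  c string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json-data/';
You can then select b.bb directly with the dotted notation.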
Option 3 - Athena Built-In JSON Serde
https://docs.aws.amazon.com/athena/latest/ug/json.html
It seems that Athena also has some JSON SerDes built in. I have personally never tried them, but they are managed by AWS, so everything should be easier to set up.
Rather than installing and running your own Presto service, there are some other options you can try:
Amazon Athena is a fully-managed Presto service. You can use it to query large datastores in Amazon S3, including compressed and partitioned data.
Amazon S3 Select allows you to run a query on a single object stored in Amazon S3. This is possibly simpler for your particular use-case.
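For example, S3 Select addresses the queried object under the fixed alias S3Object; against the sample JSON-lines objects above, a query might look like this (a sketch):
SELECT s.b.bb
FROM S3Object s
WHERE s.a = 'a_value1';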

JSON parsing using String Condition with JMESPATH

I am facing an issue with parsing JSON using JMESPath based on a string condition. I want to get the values of one key based on a condition on the string value of another key. Both keys are at the same level of a given dictionary; however, the parent key of the dictionary is variable. I am familiar with jq, but JMESPath is new to me. I am using it because the current project depends on it, so I don't have much choice about changing the parser.
Sample Json below :-
{
  "people": {
    "a": {"First": "James", "last": "1"},
    "b": {"First": "Jacob", "last": "2"},
    "c": {"First": "Jayden", "last": "3"},
    "d": {"First": "different", "last": "4"}
  }
}
I want to get the values of last where the value of First starts with "J".
I have tried referring to the articles on the official site at http://jmespath.org/tutorial.html. However, most of them concentrate on a fixed key structure and say little about dictionaries with variable keys, so I have been unable to write a JMESPath query for this JSON.
The jq equivalent for achieving the intended result is:
.people | .[] | select (.First | startswith("J")) | .last
The closest JMESPath query that I could logically arrive at, based on my understanding, is:
people.*[?starts_with(First,`J`)].last
However, the above query returns an empty result.
The expected output is
"1", "2", "3"
I am unable to understand where I am going wrong.
It would be nice if someone could point me to a good article or help me find the solution to this issue.
Thanks a lot.
UPDATE:
The solution is to use values(@).
Reference link:
https://github.com/jmespath/jmespath.site/issues/24
So one possible solution for the above ask is
people.values(@)[?starts_with(First, 'J')].last
Breaking it down against the sample document: people.values(@) yields the four inner objects, the filter [?starts_with(First, 'J')] keeps the three whose First begins with "J", and .last projects their last values, giving ["1", "2", "3"].
So for any variable key, we can use values(@) to filter projections further down the structure.

How can I validate Json schema in spark 2.X?

I am using Spark Streaming (written in Scala) to read messages from Kafka.
The messages are all Strings in JSON format.
I define the expected schema in a local variable expectedSchema,
then parse the Strings in the RDD to JSON:
spark.sqlContext.read.schema(expectedSchema).json(rdd.toDS())
The problem: Spark will process all the records/rows as long as they have some of the fields I try to read, even if the actual JSON format (i.e. schema) of the input row (a String) doesn't match my expectedSchema.
Assume the expected schema looks like this (in JSON): {"a": 1, "b": 2, "c": 3}
and the input row looks like this: {"a": 1, "c": 3}.
Spark will process the input without failing.
I tried using the solution described here: How do I apply schema with nullable = false to json reading,
but assert(readJson.schema == expectedSchema) never fails, even when I deliberately send input rows with the wrong JSON schema.
Is there a way for me to verify that the actual schema of a given input row matches my expected schema?
Is there a way for me to insert null values to "fill" the fields missing from a "corrupt" row?

Postgres json type inner Query [duplicate]

I am looking for some docs and/or examples for the new JSON functions in PostgreSQL 9.2.
Specifically, given a series of JSON records:
[
  {"name": "Toby", "occupation": "Software Engineer"},
  {"name": "Zaphod", "occupation": "Galactic President"}
]
How would I write the SQL to find a record by name?
In vanilla SQL this would be something like:
SELECT * FROM json_data WHERE name = 'Toby'
The official dev manual is quite sparse:
http://www.postgresql.org/docs/devel/static/datatype-json.html
http://www.postgresql.org/docs/devel/static/functions-json.html
Update I
I've put together a gist detailing what is currently possible with PostgreSQL 9.2.
Using some custom functions, it is possible to do things like:
SELECT id, json_string(data,'name') FROM things
WHERE json_string(data,'name') LIKE 'G%';
Update II
I've now moved my JSON functions into their own project:
PostSQL - a set of functions for transforming PostgreSQL and PL/v8 into a totally awesome JSON document store
Postgres 9.2
I quote Andrew Dunstan on the pgsql-hackers list:
At some stage there will possibly be some json-processing (as opposed
to json-producing) functions, but not in 9.2.
Doesn't prevent him from providing an example implementation in PLV8 that should solve your problem. (Link is dead now, see modern PLV8 instead.)
Postgres 9.3
Offers an arsenal of new functions and operators to add "json-processing".
The manual on new JSON functionality.
The Postgres Wiki on new features in pg 9.3.
The answer to the original question in Postgres 9.3:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
Advanced example:
Query combinations with nested array of records in JSON datatype
For bigger tables you may want to add an expression index to increase performance:
Index for finding an element in a JSON array
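For illustration, a minimal expression index for the name lookups used throughout this answer (a sketch, reusing the things table from the gist above and assuming a json column data):
CREATE INDEX things_name_idx ON things ((data->>'name'));
-- a btree index on the extracted text, usable by: WHERE data->>'name' = 'Toby'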
Postgres 9.4
Adds jsonb (b for "binary", values are stored as native Postgres types) and yet more functionality for both types. In addition to expression indexes mentioned above, jsonb also supports GIN, btree and hash indexes, GIN being the most potent of these.
The manual on json and jsonb data types and functions.
The Postgres Wiki on JSONB in pg 9.4
The manual goes as far as suggesting:
In general, most applications should prefer to store JSON data as
jsonb, unless there are quite specialized needs, such as legacy
assumptions about ordering of object keys.
Bold emphasis mine.
Performance benefits from general improvements to GIN indexes.
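A minimal sketch of a GIN index on a hypothetical jsonb column data, together with a containment query that can use it (jsonb_path_ops supports only the @> operator, but is smaller and faster for it):
CREATE INDEX things_data_gin_idx ON things USING GIN (data jsonb_path_ops);
SELECT * FROM things WHERE data @> '{"name": "Toby"}';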
Postgres 9.5
Completes jsonb functions and operators, adding more functions to manipulate jsonb in place and for display.
Major good news in the release notes of Postgres 9.5.
With Postgres 9.3+, just use the -> operator. For example,
SELECT data->'images'->'thumbnail'->'url' AS thumb FROM instagram;
see http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/ for some nice examples and a tutorial.
With Postgres 9.3, use -> for object access. For example:
seed.rb
se = SmartElement.new
se.data = {
  params: [
    {
      type: 1,
      code: 1,
      value: 2012,
      description: 'year of production'
    },
    {
      type: 1,
      code: 2,
      value: 30,
      description: 'length'
    }
  ]
}
se.save
rails c
SELECT data->'params'->0 as data FROM smart_elements;
returns
data
----------------------------------------------------------------------
{"type":1,"code":1,"value":2012,"description":"year of producction"}
(1 row)
You can continue nesting
SELECT data->'params'->0->'type' as data FROM smart_elements;
returns
data
------
1
(1 row)

Postgresql from Json Object to array

How can I convert JSON of this form, stored in PostgreSQL,
{"Kategorie": [{"ID": "environment"}, {"ID": "economy"}]}
to get ["environment", "economy"], using only PostgreSQL's JSON-flavoured syntax? The array in the stored source has two elements here, but may contain more (or only one), and the resulting array should contain all the value elements.
This may give you something to work with:
SELECT ARRAY(
    SELECT json_extract_path_text(x, 'ID')
    FROM json_array_elements(
        '{"Kategorie": [{"ID": "environment"}, {"ID": "economy"}]}'::json -> 'Kategorie'
    ) AS x
);
The result is a text array:
{environment,economy}
It is entirely possible that there's a cleaner way to do this :)
The JSON operators documentation has the details. (This is 9.3+ only, 9.2 had very few utility functions.)
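Applied to a table instead of a literal, the same pattern might look like this (a sketch; the table and column names are assumptions):
SELECT ARRAY(
    SELECT json_extract_path_text(elem, 'ID')
    FROM json_array_elements(t.data -> 'Kategorie') AS elem
) AS kategorie_values
FROM my_table AS t;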