How to speed up nested JSON query in PostgreSQL?

For our development of a flight retail engine we store orders as JSON documents in a PostgreSQL database.
The order table is defined as:
CREATE TABLE IF NOT EXISTS orders (
    id SERIAL PRIMARY KEY,
    order_data JSONB NOT NULL
);
A simplified version of a typical order document looks like this:
{
  "orderID": "ORD000001",
  "invalid": false,
  "creationDate": "2017-11-19T15:49:53.897",
  "orderItems": [
    {
      "orderItemID": "ITEM000001",
      "flight": {
        "id": "FL000001",
        "segments": [
          {
            "origin": "FRA",
            "destination": "LHR",
            "departure": "2018-05-12T14:00:00",
            "arrival": "2018-05-12T14:40:00",
            "marketingCarrier": "LH",
            "marketingFlightNumber": "LH908"
          }
        ]
      },
      "passenger": {
        "lastName": "Test",
        "firstName": "Thomas",
        "passengerTypeCode": "ADT"
      }
    },
    {
      "orderItemID": "ITEM000002",
      "flight": {
        "id": "FL000002",
        "segments": [
          {
            "origin": "LHR",
            "destination": "FRA",
            "departure": "2018-05-17T11:30:00",
            "arrival": "2018-05-17T14:05:00",
            "marketingCarrier": "LH",
            "marketingFlightNumber": "LH905"
          }
        ]
      },
      "passenger": {
        "lastName": "Test",
        "firstName": "Thomas",
        "passengerTypeCode": "ADT"
      }
    }
  ]
}
The number of entries in this table can grow rather large (up to more than 100 million).
Creating a GIN index on "orderID" works fine and, as expected, significantly speeds up queries for orders with a specific ID.
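For illustration, such an index can look like this (the exact definition is not essential here):
-- Example only: a GIN index over the whole document supports containment
-- queries such as  order_data @> '{"orderID": "ORD000001"}'
CREATE INDEX ix_orders_data ON orders USING gin (order_data jsonb_path_ops);

-- Alternatively, a plain B-tree expression index for exact orderID lookups:
CREATE INDEX ix_orders_order_id ON orders ((order_data->>'orderID'));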
But we also require a fast execution time for much more complex requests like searching for orders with a specific flight segment.
Thanks to this thread I was able to write a request like
SELECT *
FROM orders,
     jsonb_array_elements(order_data->'orderItems') orderItems,
     jsonb_array_elements(orderItems->'flight'->'segments') segments
WHERE order_data->>'invalid' = 'false'
  AND segments->>'origin' = 'LHR'
  AND ( (segments->>'marketingCarrier' = 'LH' AND segments->>'marketingFlightNumber' = 'LH905')
     OR (segments->>'operatingCarrier' = 'LH' AND segments->>'operatingFlightNumber' = 'LH905') )
  AND segments->>'departure' BETWEEN '2018-05-17T10:00:00' AND '2018-05-17T18:00:00'
This works fine, but is too slow for our requirements.
What is the best way to speed up such a query?
Creating a materialized view like
CREATE MATERIALIZED VIEW order_segments AS
SELECT id,
       order_data->>'orderID' AS orderID,
       segments->>'origin' AS origin,
       segments->>'marketingCarrier' AS marketingCarrier,
       segments->>'marketingFlightNumber' AS marketingFlightNumber,
       segments->>'operatingCarrier' AS operatingCarrier,
       segments->>'operatingFlightNumber' AS operatingFlightNumber,
       segments->>'departure' AS departure
FROM orders,
     jsonb_array_elements(order_data->'orderItems') orderItems,
     jsonb_array_elements(orderItems->'flight'->'segments') segments
WHERE order_data->>'invalid' = 'false';
works, but has the disadvantage of not being updated automatically.
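One way to mitigate this is to refresh the view periodically, for example from a scheduler. A sketch; note that REFRESH ... CONCURRENTLY requires a unique index on the view, and whether the column combination below really identifies a row uniquely would have to be checked against the actual data:
-- Hypothetical unique index; only valid if these columns are unique per row.
CREATE UNIQUE INDEX ix_order_segments_uq
    ON order_segments (id, departure, origin, marketingflightnumber);

-- Re-populate the view without blocking concurrent reads.
REFRESH MATERIALIZED VIEW CONCURRENTLY order_segments;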
So, how would I define indices on the orders table to achieve fast execution times? Or is there an entirely different solution?

Finally found an answer to my own question:
Setting an index
CREATE INDEX ix_order_items ON orders USING gin ((order_data->'orderItems') jsonb_path_ops);
and using the request
SELECT DISTINCT id, order_data
FROM orders,
     jsonb_array_elements(order_data->'orderItems') orderItems,
     jsonb_array_elements(orderItems->'flight'->'segments') segments
WHERE id IN
      ( SELECT id
        FROM orders
        WHERE order_data->'orderItems' @> '[{"flight": {"segments": [{"origin": "LHR"}]}}]'
          AND (
                order_data->'orderItems' @> '[{"flight": {"segments": [{"marketingCarrier": "LH", "marketingFlightNumber": "LH905"}]}}]'
                OR
                order_data->'orderItems' @> '[{"flight": {"segments": [{"operatingCarrier": "LH", "operatingFlightNumber": "LH905"}]}}]'
              )
      )
  AND order_data @> '{"invalid": false}'
  AND segments->>'departure' BETWEEN '2018-05-17T10:00:00' AND '2018-05-17T18:00:00'
speeds up the request from several seconds to a few milliseconds.
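To check that the index is actually picked up, the plan of the inner containment query can be inspected, for example:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM orders
WHERE order_data->'orderItems' @> '[{"flight": {"segments": [{"origin": "LHR"}]}}]';
-- The plan should contain a "Bitmap Index Scan" on ix_order_items.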

Related

MySQL - top product exported by each state based on the sum of individual export_values

I am trying to write a query in Peewee with a MySQL database to return the top product exported by each state based on the sum of the export_values for a given product-state pair. I am wondering what the most optimal SQL query to achieve that would be. Here's the schema of the table I care about:
Trade
- product (varchar)
- state (varchar)
- export_value (numeric)
...
The fields I need to select are: state, product, and total_export_value.
Any guidance on how to design this query? If possible, also on how to translate it into Peewee (I am very new to it).
EDIT:
Here's the Peewee query I've tried:
subquery = (
    models.Trade.select(
        models.Trade.state.alias("state_1"),
        models.Trade.product.alias("product_1"),
        fn.SUM(models.Trade.export_value).alias("export_value_1")
    ).where(
        models.Trade.origin_country == origin_country,
        models.Trade.year == args["year"]
    ).group_by(
        models.Trade.state,
        models.Trade.product
    ).alias("subquery")
)

query = (
    models.Trade.select(
        models.Trade.state,
        models.Trade.product,
        fn.MAX(subquery.c.export_value_1).alias("export_value")
    ).join(
        subquery, on=(
            (models.Trade.state == subquery.c.state_1) &
            (models.Trade.product == subquery.c.product_1)
        )
    ).group_by(
        models.Trade.state
    )
)
It's not working for my needs because MySQL does not select the appropriate product, although the state and total_export_value come out just fine. I suspect this is because of the way the two queries are joined, and because product is not part of the GROUP BY of the outer query.
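For reference, a plain-SQL sketch of the usual greatest-n-per-group pattern (names follow the schema above; the origin_country/year filters from the Peewee code are omitted, and ties within a state would return multiple rows):
-- Sum export_value per (state, product), then keep only the rows whose total
-- equals the per-state maximum.
SELECT t.state, t.product, t.total_export_value
FROM (
    SELECT state, product, SUM(export_value) AS total_export_value
    FROM Trade
    GROUP BY state, product
) t
JOIN (
    SELECT state, MAX(total_export_value) AS max_value
    FROM (
        SELECT state, product, SUM(export_value) AS total_export_value
        FROM Trade
        GROUP BY state, product
    ) s
    GROUP BY state
) m ON m.state = t.state AND m.max_value = t.total_export_value;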

mySQL JSON : search array of objects where property value in list

I have a JSON column, manifest, containing an array of objects.
I need to return all table rows where any of the objects in their array have a slide_id that is present in a sub select.
The structure of the JSON field is:
{
  "matrix": [
    {
      "row": 1,
      "col": 1,
      "slide_id": 1
    },
    {
      "row": 1,
      "col": 2,
      "slide_id": 5
    }
  ]
}
So I want to run something like this....
SELECT id FROM presentation WHERE manifest->'$.matrix[*].slide_id' IN ( (SELECT id from slides WHERE date_deleted IS NOT NULL) );
But this doesn't work as manifest->'$.matrix[*].slide_id' returns a JSON array for each row.
I have managed to get this to work, but it's amazingly slow because it scans the whole table...
SELECT p.id
FROM (
    SELECT id,
           manifest->'$.matrix[*].slide_id' AS slide_ids
    FROM `presentation`
) p
INNER JOIN `pp_slides` s
    ON JSON_CONTAINS(p.slide_ids, CAST(s.id AS JSON), '$')
WHERE s.date_deleted IS NOT NULL
If I filter it down to an individual presentation ID, then it's not too bad, but it still takes 700 ms for a presentation with a couple of hundred slides in it. Is there a cleaner way to do this?
I suppose the best way would be to refactor it to store the matrix as a relational table....
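If MySQL 8.0+ is an option, JSON_TABLE can unnest the matrix array into rows, so a regular join replaces the JSON_CONTAINS scan over the whole table. A sketch, using the table and column names from the query above:
-- One row per (presentation, slide) pair, joined against deleted slides.
SELECT DISTINCT p.id
FROM presentation p,
     JSON_TABLE(
         p.manifest, '$.matrix[*]'
         COLUMNS (slide_id INT PATH '$.slide_id')
     ) m,
     pp_slides s
WHERE s.id = m.slide_id
  AND s.date_deleted IS NOT NULL;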

N1QL Query ARRAY_CONTAINS speed

I have the documents of the following form that I need to query:
{
  "id": "-KWiJ1LlYbXSSRUmocwK",
  "ownerID": "72f16d9d-b905-498c-a7ff-9702cdcae996",
  "orgID": "20071513",
  "teams": [
    "5818f7a75f84c800079186a8",
    "5818cbb25f84c800079186a7"
  ]
}
And I'll want to be able to query based on ownerID and the teams array. My query currently looks like so:
SELECT id
FROM default AS p
WHERE p.ownerID = $1
   OR ARRAY_CONTAINS(p.teams, $2)
ORDER BY id
So I can get documents with the expected ownerID as well as documents that have a specific team id in the teams array. This query does work, but I'm concerned about performance when I have a lot of documents, and possibly some documents have up to 20 teams assigned.
Am I on the right track?
EDIT: Couchbase ver 4.1
Couchbase 4.5 introduced array indexing. This allows you to index individual elements of an array, in your case the teams array. This will be essential for the performance of your query. With 4.5.1 or 4.6, you will do:
CREATE INDEX idx_owner ON default( ownerID );
CREATE INDEX idx_teams ON default( DISTINCT ARRAY t FOR t IN teams END );
SELECT id
FROM default AS p
WHERE p.ownerID = $1
UNION
SELECT id
FROM default AS p
WHERE ANY t IN p.teams SATISFIES t = $2 END;

SQLAlchemy foreign keys mapped to list of ids, not entities

In the usual Customer with Orders example, this kind of SQLAlchemy code...
data = db.query(Customer)\
    .join(Order, Customer.id == Order.cst_id)\
    .filter(Order.amount > 1000)
...would provide instances of the Customer model that are associated with e.g. large orders (amount > 1000). The resulting Customer instances would also include a list of their orders, since in this example we used backref for that reason:
class Order:
    ...
    customer = relationship("customers", backref=backref('orders'))
The problem with this, is that iterating over Customer.orders means that the DB will return complete instances of Order - basically doing a 'select *' on all the columns of Order.
What if, for performance reasons, one wants to e.g. read only 1 field from Order (e.g. the id) and have the .orders field inside Customer instances be a simple list of IDs?
customers = db.query(Customer)....
...
pdb> print customers[0].orders
[2,4,7]
Is that possible with SQLAlchemy?
What you could do is make a query this way:
(
    session.query(Customer.id, Order.id)
    .select_from(Customer)
    .join(Customer.orders)
    .filter(Order.amount > 1000)
)
It doesn't produce the exact result as what you have asked, but it gives you a list of tuples which looks like [(customer_id, order_id), ...].
I am not entirely sure whether you can eagerly load only the order IDs into the Customer object, but I think it should be possible; you might want to look at joinedload, subqueryload, and perhaps go through the relationship-loading docs if that helps.
If that works in your case, you could write it as:
(
    session.query(Customer)
    .select_from(Customer)
    .join(Customer.orders)
    .options(db.joinedload(Customer.orders))
    .filter(Order.amount > 1000)
)
and also use noload to avoid loading other columns.
I ended up doing this optimally - with array aggregation:
data = db.query(Customer).with_entities(
    Customer,
    func.ARRAY_AGG(
        Order.id,
        type_=ARRAY(Integer, as_tuple=True)).label('order_ids')
).outerjoin(
    Order, Customer.id == Order.cst_id
).group_by(
    Customer.id
)
This returns tuples of (CustomerEntity, list) - which is exactly what I wanted.

How to SQL Select a one to many relation and merge the output

This has been dramatically updated since I got closer to the solution.
I guess the title is not the best but I did not know how to explain it better.
I have locations with coordinates in two related tables. Table locations(id, name, description, created_at) and locations_coordinates(location_id, lat, lng, coordinates_order).
I am storing an unknown number of coordinates (a polygon), which is why I use two tables.
Now I am running the following query
SELECT l.id,
       l.name,
       l.description,
       l.created_at,
       GROUP_CONCAT(CONCAT(c.lat, ":", c.lng)
                    ORDER BY c.coordinates_order ASC
                    SEPARATOR ', ') AS coordinates
FROM locations l
LEFT JOIN locations_coordinates c
       ON l.id = c.location_id
WHERE l.id = ' . $id . '
GROUP BY l.id,
         l.name,
         l.description,
         l.created_at ASC
So I get the following output (using id = 3):
[
  {
    "id": "3",
    "name": "Stadthalle",
    "description": "Die Wiener Stadthalle",
    "created_at": "2012-01-07 14:22:06",
    "coordinates": "48.201187:16.334213, 48.200665:16.331606, 48.202989:16.331091, 48.203075:16.334192"
  }
]
What I would like to get is the latitude and longitude pairs together, something like:
[
  {
    "id": "3",
    "name": "Stadthalle",
    "description": "Die Wiener Stadthalle",
    "created_at": "2012-01-07 14:22:06",
    "coordinates": [
      ["48.201187,16.334213"],
      ["48.200665,16.331606"],
      ["48.202989,16.331091"],
      ["48.203075,16.334192"]
    ]
  }
]
So my question is: is there a way to get the needed output with SQL only? If not, can my query be improved so that it becomes easier to finish the job in application code (PHP in my case)?
UPDATE:
I'm using MyISAM, and in the locations_coordinates table, location_id and coordinates_order together form the PRIMARY key, with coordinates_order set to AUTO_INCREMENT, so it always starts a new series of order numbers on insert. (I need this so I can select the coordinates in the right order later.)
As long as you set up foreign keys (multiple if need be) and make sure you index those keys for speed, you can do multiple joins like you mention above, provided locations and coordinates share the foreign key (user.id in your example).
I actually answered a similar question a few days ago (funny enough, it also had to do with locations/coordinates); unfortunately I can't link to it tonight, so maybe browse my profile.
I resolve similar problems this way:
First, get all the location records I need and put them in a hash map with the ID as key.
Second, get all the coordinate records whose location_id is among those location IDs, e.g.: select * from coordinate where location_id in (?, ?, ...)
Third, iterate over the coordinate records and attach the lat and lng values to the corresponding location record.
If the number of location records is not too large, this needs only two SQL statements, which gives better performance.
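Expressed as SQL (using the table and column names from the question, and location id 3 as in the example output), the two statements would look roughly like this:
-- 1) Fetch the location rows and keep them in a map keyed by id.
SELECT id, name, description, created_at
FROM locations
WHERE id IN (3);

-- 2) Fetch all coordinates for those ids in one round trip, in display order;
--    the application then attaches each row to its location by location_id.
SELECT location_id, lat, lng
FROM locations_coordinates
WHERE location_id IN (3)
ORDER BY location_id, coordinates_order;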