N1QL Query ARRAY_CONTAINS speed - couchbase

I have the documents of the following form that I need to query:
{
"id": "-KWiJ1LlYbXSSRUmocwK",
"ownerID": "72f16d9d-b905-498c-a7ff-9702cdcae996",
"orgID": "20071513",
"teams": [
"5818f7a75f84c800079186a8",
"5818cbb25f84c800079186a7"
]
}
And I'll want to be able to query based on ownerID and the teams array. My query currently looks like so:
SELECT id FROM
default AS p
WHERE p.ownerID = $1
OR ARRAY_CONTAINS(p.teams, $2)
ORDER BY id
So I can get documents with the expected ownerID as well as documents that have a specific team id in the teams array. This query works, but I'm concerned about performance once there are a lot of documents, and some documents may have up to 20 teams assigned.
Am I on the right track?
EDIT: Couchbase ver 4.1

Couchbase 4.5 introduced array indexing. This allows you to index individual elements of an array, in your case the teams array, and it is essential for the performance of your query (so you would need to upgrade from 4.1). On 4.5.1 or 4.6, you can do:
CREATE INDEX idx_owner ON default( ownerID );
CREATE INDEX idx_teams ON default( DISTINCT ARRAY t FOR t IN teams END );
Then rewrite the OR as a UNION, so that each branch of the query can use its own index:
SELECT id
FROM default AS p
WHERE p.ownerID = $1
UNION
SELECT id
FROM default AS p
WHERE ANY t IN p.teams SATISFIES t = $2 END;
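To confirm the indexes are actually being used (rather than a primary scan), you can inspect the query plan with EXPLAIN. A quick check, assuming the idx_teams index above has been built:

```sql
EXPLAIN SELECT id
FROM default AS p
WHERE ANY t IN p.teams SATISFIES t = $2 END;
-- the plan should show an IndexScan on idx_teams, not a PrimaryScan
```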

Related

json_set returns desired result (in SQLITE Browser) but does not update the table

This is my table:
CREATE TABLE orders(
id integer primary key,
order_uid text unique,
created_at text,
updated_at text,
created_by text,
updated_by text,
client text,
phone text,
device text,
items json,
comments text
)
'items' is a list of dictionaries - valid json.
This is what 'items' looks like:
[
{
"am_pm": "AM",
"brand_": "EEE",
"quantity": 8,
"code": "1-936331-67-5",
"delivery_date": "2020-04-19",
"supplier": "XXX",
"part_uid": "645039eb-82f4-4eed-b5f9-115b09679c66",
"name": "WWWWWW",
"price": 657,
"status": "Not delivered"
},
{
"am_pm": "AM",
"brand_": "DDDDDDD",
...
},
...
]
This is what I'm running (in the 'Execute SQL' tab of sqlitebrowser v3.11.2, SQLite 3.31.1). It appears to return the desired result, but the change is not reflected in the actual table; the table is not updated:
select json_set(value, "$.am_pm", "Tequilla") from orders, json_each(orders.items, '$')
where orders.id=2 and json_extract(value, '$.part_uid') = '35f81391-392b-4d5d-94b4-a5639bba8591'
I also ran
update orders
set items = (select json_set(orders.items, '$.am_pm', "Tequilla") from orders, json_each(orders.items, '$'))
where orders.id=2
The result: it deleted the list of dicts and replaced it with a single dict whose 'am_pm' field was updated.
What is the correct sql statement, so I can update a single (or several) object/s in 'items'?
After much fiddling, and posting on the SQLite forums as well, the optimal solution seems to be:
update orders
set items = (select json_set(items, fullkey||'.brand_', 'Teq')
from orders, json_each(items, '$')
where json_extract(value, '$.supplier') = 'XXX' and orders.id = 1)
where orders.id = 1
This updates only a single element of the JSON array, even if multiple elements match the criteria.
It would be helpful if someone more experienced with SQLite could come up with a way to update multiple elements of a JSON array at once.
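One robust way to update several elements at once is to sidestep json_set's single-path limitation and rewrite the whole array in application code. A minimal, self-contained sketch using Python's sqlite3 module and a two-column stand-in for the orders table (the column names come from the question; the data is invented):

```python
import json
import sqlite3

# In-memory stand-in for the orders table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, items TEXT)")
conn.execute(
    "INSERT INTO orders (id, items) VALUES (1, ?)",
    (json.dumps([
        {"supplier": "XXX", "brand_": "EEE"},
        {"supplier": "YYY", "brand_": "DDD"},
        {"supplier": "XXX", "brand_": "FFF"},
    ]),),
)

# Read the array, update EVERY matching element, write it back.
items = json.loads(conn.execute(
    "SELECT items FROM orders WHERE id = 1").fetchone()[0])
for item in items:
    if item["supplier"] == "XXX":
        item["brand_"] = "Teq"
conn.execute("UPDATE orders SET items = ? WHERE id = 1",
             (json.dumps(items),))
conn.commit()

updated = json.loads(conn.execute(
    "SELECT items FROM orders WHERE id = 1").fetchone()[0])
print([i["brand_"] for i in updated])  # ['Teq', 'DDD', 'Teq']
```

The round trip through the application loses the ability to update purely server-side, but it handles any number of matching elements and keeps the rest of the array intact.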

How to speed up nested JSON query in PostgreSQL?

For our development of a flight retail engine we store orders as JSON documents in a PostgreSQL database.
The order table is defined as:
CREATE TABLE IF NOT EXISTS orders (
id SERIAL PRIMARY KEY,
order_data JSONB NOT NULL
);
A simplified version of a typical order document looks like this:
{
"orderID":"ORD000001",
"invalid":false,
"creationDate":"2017-11-19T15:49:53.897",
"orderItems":[
{
"orderItemID":"ITEM000001",
"flight":{
"id":"FL000001",
"segments":[
{
"origin":"FRA",
"destination":"LHR",
"departure":"2018-05-12T14:00:00",
"arrival":"2018-05-12T14:40:00",
"marketingCarrier":"LH",
"marketingFlightNumber":"LH908"
}
]
},
"passenger":{
"lastName":"Test",
"firstName":"Thomas",
"passengerTypeCode":"ADT"
}
},
{
"orderItemID":"ITEM000002",
"flight":{
"id":"FL000002",
"segments":[
{
"origin":"LHR",
"destination":"FRA",
"departure":"2018-05-17T11:30:00",
"arrival":"2018-05-17T14:05:00",
"marketingCarrier":"LH",
"marketingFlightNumber":"LH905"
}
]
},
"passenger":{
"lastName":"Test",
"firstName":"Thomas",
"passengerTypeCode":"ADT"
}
}
]
}
The number of entries in this table can grow rather large (up to over 100 million).
Creating a GIN index on "orderID" works fine and, as expected, significantly speeds up queries for orders with a specific ID.
But we also require a fast execution time for much more complex requests like searching for orders with a specific flight segment.
Thanks to this thread I was able to write a request like
SELECT *
FROM orders,
jsonb_array_elements(order_data->'orderItems') orderItems,
jsonb_array_elements(orderItems->'flight'->'segments') segments
WHERE order_data->>'invalid'='false'
AND segments->>'origin'='LHR'
AND ( (segments->>'marketingCarrier'='LH' AND segments->>'marketingFlightNumber'='LH905') OR (segments->>'operatingCarrier'='LH' AND segments->>'operatingFlightNumber'='LH905') )
AND segments->>'departure' BETWEEN '2018-05-17T10:00:00' AND '2018-05-17T18:00:00'
This works fine, but is too slow for our requirements.
What is the best way to speed up such a query?
Creating a materialized view like
CREATE MATERIALIZED VIEW order_segments AS
SELECT id,
order_data->>'orderID' AS orderID,
segments->>'origin' AS origin,
segments->>'marketingCarrier' AS marketingCarrier,
segments->>'marketingFlightNumber' AS marketingFlightNumber,
segments->>'operatingCarrier' AS operatingCarrier,
segments->>'operatingFlightNumber' AS operatingFlightNumber,
segments->>'departure' AS departure
FROM orders,
jsonb_array_elements(order_data -> 'orderItems') orderItems,
jsonb_array_elements(orderItems -> 'flight'->'segments') segments
WHERE order_data->>'invalid'='false';
works, but has the disadvantage of not being updated automatically.
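If the materialized-view route is otherwise attractive, it does not have to stay stale: it can be refreshed on demand, and with a unique index it can even be refreshed without locking out readers. A sketch, assuming (id, departure, marketingFlightNumber) uniquely identifies a row of the view; adjust the column list to whatever is actually unique:

```sql
-- One-time setup: CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ix_order_segments_uniq
    ON order_segments (id, departure, marketingFlightNumber);

-- After each batch of order changes:
REFRESH MATERIALIZED VIEW CONCURRENTLY order_segments;
```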
So, how would I define indices on the orders table to achieve fast execution times? Or is there an entirely different solution?
Finally found an answer to my own question:
Setting an index
CREATE INDEX ix_order_items ON orders USING gin (((order_data->'orderItems')) jsonb_path_ops)
and using the query
SELECT DISTINCT id, order_data
FROM orders,
jsonb_array_elements(order_data -> 'orderItems') orderItems,
jsonb_array_elements(orderItems -> 'flight'->'segments') segments
WHERE id IN
( SELECT id
FROM orders
WHERE order_data->'orderItems' @> '[{"flight": {"segments": [{"origin":"LHR"}]}}]'
AND (
order_data->'orderItems' @> '[{"flight": {"segments": [{"marketingCarrier":"LH","marketingFlightNumber":"LH905"}]}}]'
OR
order_data->'orderItems' @> '[{"flight": {"segments": [{"operatingCarrier":"LH","operatingFlightNumber":"LH905"}]}}]'
)
)
AND order_data @> '{"invalid": false}'
AND segments->>'departure' BETWEEN '2018-05-17T10:00:00' AND '2018-05-17T18:00:00'
speeds up the request from several seconds to a few milliseconds.

mySQL JSON : search array of objects where property value in list

I have a JSON column, manifest, containing an array of objects.
I need to return all table rows where any of the objects in their array have a slide_id that is present in a sub select.
The structure of the JSON field is..
{
"matrix": [
{
"row": 1,
"col": 1,
"slide_id": 1
},
{
"row": 1,
"col": 2,
"slide_id": 5
}
]
}
So I want to run something like this....
SELECT id FROM presentation WHERE manifest->'$.matrix[*].slide_id' IN ( (SELECT id from slides WHERE date_deleted IS NOT NULL) );
But this doesn't work as manifest->'$.matrix[*].slide_id' returns a JSON array for each row.
I have managed to get this to work, but it's amazingly slow as it scans the whole table...
SELECT
p.id
FROM
(
SELECT id,
manifest->'$.matrix[*].slide_id' as slide_ids
FROM `presentation`
) p
INNER JOIN `pp_slides` s
ON JSON_CONTAINS(p.slide_ids, CAST(s.id as json), '$')
WHERE s.date_deleted IS NOT NULL
If I filter it down to an individual presentation ID, then it's not too bad, but it still takes 700 ms for a presentation with a couple of hundred slides in it. Is there a cleaner way to do this?
I suppose the best way would be to refactor it to store the matrix as a relational table....
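Short of a relational refactor, MySQL 8.0's JSON_TABLE() can unnest the array server-side, turning the per-row JSON array into ordinary rows to join or filter on. A sketch, assuming MySQL 8.0+ and the slides table from the question (note it still cannot use an index on the JSON contents, so it is a tidier query rather than an indexed one):

```sql
SELECT DISTINCT p.id
FROM presentation p,
     JSON_TABLE(
       p.manifest, '$.matrix[*]'
       COLUMNS (slide_id INT PATH '$.slide_id')
     ) AS m
WHERE m.slide_id IN (SELECT id FROM slides WHERE date_deleted IS NOT NULL);
```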

Active Record (MYSQL) - Select distinct ids from multiple columns

Ver 14.14 Distrib 5.1.73
activerecord (4.1.14)
I have a trade model that belongs to a lender and borrower. I want to find all uniq counterparties to an institution's trades in one SQL query. The query below works, but only because I flatten & unique-ify the array after the SQL query:
Trade.where("borrower_id = :id OR lender_id = :id", id: institution.id).uniq.pluck(:lender_id, :borrower_id).flatten.uniq
(I know this includes the institution itself, so we normalize after with [1,2,3,4] - [1])
But what I'd like to do is use a Group By clause or something so that my SQL query handles the flatten.uniq part.
The below does not work because it returns a nested array of unique combinations of lender_id and borrower_id:
Trade.where("borrower_id = :id OR lender_id = :id", id: institution.id).group(:lender_id, :borrower_id).uniq.pluck(:lender_id, :borrower_id)
=> [[1,2], [1,3], [2,3]]
I want a flat array of JUST unique ids: [1,2,3]
Any ideas? Thanks!
I don't understand what you're trying to do, or why you'd want to include a GROUP BY clause in the absence of any aggregate functions.
FWIW, a valid query might look like this...
SELECT DISTINCT t.lender_id
, t.borrower_id
FROM trades t
WHERE 28 IN (t.borrower_id, t.lender_id);
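The flatten-and-uniquify step can also be done entirely in SQL: UNION (unlike UNION ALL) removes duplicates, so selecting each id column in its own branch yields one flat, de-duplicated column. A sketch against the same trades table, using the literal 28 for the institution id as in the query above:

```sql
SELECT t.lender_id AS counterparty_id
FROM trades t
WHERE 28 IN (t.borrower_id, t.lender_id)
UNION
SELECT t.borrower_id
FROM trades t
WHERE 28 IN (t.borrower_id, t.lender_id);
```

In Rails this could be run via Trade.find_by_sql; removing the institution's own id from the result afterwards still applies.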

Django ORM - Grouped aggregates with different select clauses

Imagine we have the Django ORM model Meetup with the following definition:
class Meetup(models.Model):
language = models.CharField()
speaker = models.CharField()
date = models.DateField(auto_now=True)
I'd like to use a single query to fetch the language, speaker and date for the
latest event for each language.
>>> Meetup.objects.create(language='python', speaker='mike')
<Meetup: Meetup object>
>>> Meetup.objects.create(language='python', speaker='ryan')
<Meetup: Meetup object>
>>> Meetup.objects.create(language='node', speaker='noah')
<Meetup: Meetup object>
>>> Meetup.objects.create(language='node', speaker='shawn')
<Meetup: Meetup object>
>>> Meetup.objects.values("language").annotate(latest_date=models.Max("date")).values("language", "speaker", "latest_date")
[
{'speaker': u'mike', 'language': u'python', 'latest_date': ...},
{'speaker': u'ryan', 'language': u'python', 'latest_date': ...},
{'speaker': u'noah', 'language': u'node', 'latest_date': ...},
{'speaker': u'shawn', 'language': u'node', 'latest_date': ...},
]
D'oh! We're getting the latest event, but for the wrong grouping!
It seems like I need a way to GROUP BY the language but SELECT on a different
set of fields?
Update - this sort of query seems fairly easy to express in SQL:
SELECT language, speaker, MAX(date)
FROM app_meetup
GROUP BY language;
I'd love a way to do this without using Django's raw() - is it possible?
Update 2 - after much searching, it seems there are similar questions on SO:
Django Query that gets the most recent objects
How can I do a greatest n per group query in Django
MySQL calls this sort of query a group-wise maximum of a certain column.
Update 3 - in the end, with #danihp's help, it seems the best you can do
is two queries. I've used the following approach:
# Abuse the fact that the latest Meetup always has a higher PK to build
# a ValuesList of the latest Meetups grouped by "language".
latest_meetup_pks = (Meetup.objects.values("language")
.annotate(latest_pk=Max("pk"))
.values_list("latest_pk", flat=True))
# Use a second query to grab those latest Meetups!
Meetup.objects.filter(pk__in=latest_meetup_pks)
This question is a follow up to my previous question:
Django ORM - Get latest record for group
This is the kind of query that is easy to explain but hard to write. If this were SQL, I would suggest a CTE with a row rank over a window partitioned by language and ordered by date (descending).
But this is not SQL, this is the Django query API. The easy way is to run one query per language:
languages = (Meetup.objects
             .values_list("language", flat=True)
             .distinct()
             .order_by())
last_by_language = [Meetup.objects.filter(language=l).latest('date')
                    for l in languages]
This crashes (raises Meetup.DoesNotExist) if some language has no meetups.
The other approach is to get the max date for each language, then OR the pairs together:
from functools import reduce  # needed on Python 3
from django.db.models import Q

last_dates = (Meetup
              .objects
              .values("language")
              .annotate(ldate=models.Max("date"))
              .order_by())
q = reduce(lambda q, meetup:
           q | (Q(language=meetup["language"]) & Q(date=meetup["ldate"])),
           last_dates, Q())
your_query = Meetup.objects.filter(q)
Perhaps someone can explain how to do it in a single query without raw sql.
Edited due to OP comment
You are looking for:
"SELECT language, speaker, MAX(date) FROM app_meetup GROUP BY language"
Not all RDBMSs support this expression, because every field in the SELECT clause that is not wrapped in an aggregate function should appear in the GROUP BY clause. In your case, speaker is in the SELECT clause (without an aggregate function) but does not appear in the GROUP BY.
In MySQL there is no guarantee that the speaker shown is the one matching the max date. Because of this, we are not facing an easy query.
Quoting MySQL docs:
In standard SQL, a query that includes a GROUP BY clause cannot refer
to nonaggregated columns in the select list that are not named in the
GROUP BY clause...However, this is useful primarily when all values
in each nonaggregated column not named in the GROUP BY are the same
for each group.
The closest query to your requirements is:
results = (Meetup
           .objects
           .values("language", "speaker")
           .annotate(ldate=models.Max("date"))
           .order_by())
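For completeness: on PostgreSQL, Django's QuerySet.distinct(*fields), which emits DISTINCT ON, can express the group-wise maximum in a single query. A sketch only; distinct() with field arguments is PostgreSQL-specific and is not available on MySQL or SQLite:

```
# Latest Meetup per language (PostgreSQL only):
latest_per_language = (Meetup.objects
                       .order_by('language', '-date')
                       .distinct('language'))
```

The order_by must start with the same field(s) passed to distinct(), so the row kept per language is the one with the greatest date.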