Optimize query from view with UNION ALL - MySQL

So I'm facing a difficult scenario: I have a legacy app, badly written and badly designed, with a table, t_booking. The app has a calendar view which, for every hall and every day of the month, shows the hall's reservation status using this query:
SELECT mr1b.id, mr1b.idreserva, mr1b.idhotel, mr1b.idhall, mr1b.idtiporeserva, mr1b.date, mr1b.ampm, mr1b.observaciones, mr1b.observaciones_bookingarea, mr1b.tipo_de_navegacion, mr1b.portal, r.estado
FROM t_booking mr1b
LEFT JOIN a_reservations r ON mr1b.idreserva = r.id
WHERE mr1b.idhotel = '$sIdHotel' AND mr1b.idhall = '$hall' AND mr1b.date = '$iAnyo-$iMes-$iDia'
AND IF (r.comidacena IS NULL OR r.comidacena = '', mr1b.ampm = 'AM', r.comidacena = 'AM' AND mr1b.ampm = 'AM')
AND (r.estado <> 'Cancelled' OR r.estado IS NULL OR r.estado = '')
LIMIT 1;
(at first there was also an ORDER BY r.estado DESC, which I took out)
This query, after proper (I think) indexing, takes 0.004 seconds per execution, and the overall calendar view renders in a reasonable time. There are indexes on idhotel, idhall, and date.
Now I have a new module, well written ;-), which stores reservations in another table, but I must present both types of reservations in the same calendar view. My first approach was to create a view joining the content of both tables, and to select the calendar data from this view instead of from t_booking.
The view is defined like this:
CREATE OR REPLACE VIEW
t_booking_hall_reservation
AS
SELECT id,
idreserva,
idhotel,
idhall,
idtiporeserva,
date,
ampm,
observaciones,
observaciones_bookingarea,
tipo_de_navegacion, portal
FROM t_booking
UNION ALL
SELECT HR.id,
HR.convention_id as idreserva,
H.id_hotel as idhotel,
HR.hall_id as idhall,
99 as idtiporeserva,
date,
session as ampm,
observations as observaciones,
'new module' as observaciones_bookingarea,
'events' as tipo_de_navegacion,
'new module' as portal
FROM new_hall_reservation HR
JOIN a_halls H on H.id = HR.hall_id
;
(table new_hall_reservation has the same indexes)
I tried UNION ALL instead of UNION, as I read it is much more efficient.
Well, the former query, with t_booking replaced by t_booking_hall_reservation, takes 1.5 seconds, multiplied by every hall and every day, which makes the calendar view impossible to finish.
The app is spaghetti code, so looping twice, once over t_booking and then over new_hall_reservation, and combining the results is somewhat difficult.
Is it possible to tune the view to make this query fast enough? Another approach?
Thanks
PS: the less I modify the original query, the less I'll need to modify the legacy app, which is, at the least, risky to modify

This is too long for a comment.
A view is (almost) never going to help performance. Yes, they make queries simpler. Yes, they incorporate important logic. But no, they don't help performance.
One key problem is how the view gets executed -- MySQL generally does not push the outer query's filters down into the view (although the most recent versions of MySQL are better at this).
One suggestion -- which might be a fair bit of work -- is to materialize the view as a table. When the underlying tables change, you update t_booking_hall_reservation using triggers. Then you can create indexes on the table to achieve your performance goals.
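Roughly, the mechanics would look like this (table, index, and trigger names are illustrative; a real version also needs matching UPDATE/DELETE triggers plus triggers on t_booking):

-- Build the materialized table once from the existing view:
CREATE TABLE t_booking_hall_reservation_mat AS
SELECT * FROM t_booking_hall_reservation;

-- Index it the way the calendar query filters:
ALTER TABLE t_booking_hall_reservation_mat
    ADD INDEX idx_hotel_hall_date (idhotel, idhall, date);

-- Keep it in sync; this is only the INSERT trigger for the new module's table:
CREATE TRIGGER trg_new_hall_reservation_ai
AFTER INSERT ON new_hall_reservation
FOR EACH ROW
    INSERT INTO t_booking_hall_reservation_mat
    SELECT NEW.id, NEW.convention_id, H.id_hotel, NEW.hall_id, 99,
           NEW.date, NEW.session, NEW.observations,
           'new module', 'events', 'new module'
    FROM a_halls H
    WHERE H.id = NEW.hall_id;

With that in place, the legacy query can point at the materialized table unchanged, which matters given how risky the legacy app is to modify.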

t_booking, unless it is a VIEW, needs
INDEX(idhotel, idhall, date)
VIEWs are syntactic sugar; they do not enhance performance; sometimes they are slower than the equivalent SELECT.
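In DDL form, that suggestion might look like this (index names are made up; the second statement assumes the UNION ALL branch reaches new_hall_reservation through hall_id and date, with idhotel coming from the join, per the view definition):

ALTER TABLE t_booking
    ADD INDEX idx_hotel_hall_date (idhotel, idhall, date);

ALTER TABLE new_hall_reservation
    ADD INDEX idx_hall_date (hall_id, date);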

Related

Speed up a MySQL query that times out

I am trying to get data from multiple tables, consolidated for reporting: all the details of every header that is complete after a certain date. I must also check for the 0 date because of bad data.
I am working with legacy tables that I, sadly, did not create and cannot change. This statement times out in my application and takes about 40 seconds to run against a database on a local virtual machine. Is it possible to refactor the query for better performance? Any help is appreciated!
SELECT detail.*
FROM detail
JOIN header
ON detail.invoice = header.invoice
WHERE detail.dateapply != '0000-00-00'
AND header.dateapply >= '2017-09-17'
AND header.orderstatus IN ('complete', 'backorder')
ORDER BY detail.DateApply;
For header:
INDEX(orderstatus, dateapply)
INDEX(invoice, orderstatus, dateapply)
For detail:
INDEX(invoice, dateapply)
INDEX(dateapply, invoice)
The ordering of the columns is deliberate. Since I can't tell which table the Optimizer will start with, I provided indexes that should be optimal either way.
If the two dateapply columns are in sync, then there are probably further optimizations.
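Spelled out as DDL, those suggestions would be (index names are illustrative):

ALTER TABLE header
    ADD INDEX idx_status_date (orderstatus, dateapply),
    ADD INDEX idx_invoice_status_date (invoice, orderstatus, dateapply);

ALTER TABLE detail
    ADD INDEX idx_invoice_date (invoice, dateapply),
    ADD INDEX idx_date_invoice (dateapply, invoice);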
Between my coworker and me, we came up with the query below, which returns results more quickly. The GROUP BY eliminated the duplicate records. I would like to thank everyone for taking the time to help me look at my issue, and I look forward to the possibility of using indexes in the future.
SELECT d.*
FROM detail d
JOIN header h
ON d.invoice = h.invoice
WHERE d.qbposted = 0
AND d.dateapply != '0000-00-00'
AND h.dateapply >= '2017-09-17'
AND h.orderstatus IN ('complete', 'backorder')
GROUP BY d.rec ORDER BY d.dateapply;
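If the GROUP BY is only there to collapse duplicates produced by multiple matching header rows, a semi-join expresses that intent directly and may avoid the grouping step; a sketch:

SELECT d.*
FROM detail d
WHERE d.qbposted = 0
  AND d.dateapply != '0000-00-00'
  AND EXISTS (
        SELECT 1
        FROM header h
        WHERE h.invoice = d.invoice
          AND h.dateapply >= '2017-09-17'
          AND h.orderstatus IN ('complete', 'backorder'))
ORDER BY d.dateapply;

Each detail row appears at most once no matter how many headers match, so no deduplication pass is needed.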

Right way to phrase MySQL query across many (possibly empty) tables

I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table, but that doesn't bear on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. One option would be to sort out the empty tables with CASE/COUNT(*) or CASE/LIMIT combinations in the queries; however, such conditional constructs cost extra time.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
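For instance, the OR across temp tables can be rewritten as a UNION, so each branch is a plain join, and an empty temp table merely empties its own branch instead of wiping out the whole result (a sketch, assuming source is unique within each temp table):

SELECT m.*
FROM Moment m
JOIN Temp1 t1 ON m.source = t1.source
JOIN Temp2 t2 ON m.source = t2.source
WHERE m.time > '2017-01-01'
UNION    -- not UNION ALL: a row matching both branches should appear once
SELECT m.*
FROM Moment m
JOIN Temp1 t1 ON m.source = t1.source
JOIN Temp3 t3 ON m.source = t3.source
WHERE m.time > '2016-11-15';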
You said that an API is involved; maybe you can handle the temporary "no data" tables differently, or not store them at all?
Another option would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html

Low-maintenance alternatives to indexing a view that can't be indexed in SQL Server?

I'm trying to index my views since the data is relatively static and it could increase performance.
I cannot index the view because it contains a "ranking or aggregate window function". Is there a workaround for that?
SELECT r.Id, r.Value, r.TestSessionId, t.Type AS TestType, r.StudentId, ROW_NUMBER() OVER (partition BY r.StudentId, r.TestSessionId ORDER BY r.Id) AS AttemptNumber
FROM dbo.Responses r
INNER JOIN dbo.TestSessions ts ON r.TestSessionId = ts.Id
INNER JOIN dbo.Tests t ON ts.TestId = t.Id
This view just adds an attempt number to student responses to questions, and I thought this would be a perfect scenario for an indexed view, but SQL Server doesn't support indexes on views with window functions.
I could generate a cache table manually, but I want this to be low maintenance so I don't have to remember to do something like that:
For example, perhaps I could create some kind of trigger (I'm not familiar with triggers) that writes the view's rows into a cache table when a base table changes... which is basically what an index on a view is supposed to do under the hood (although more efficiently, because it can update the index rather than completely rebuilding it when the base table data changes).
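For illustration, such a trigger-maintained cache might look like this in T-SQL (the cache table, column types, and trigger name are all assumptions, not a tested implementation; Responses changes are handled here, and TestSessions/Tests changes would need similar triggers):

CREATE TABLE dbo.ResponseAttemptsCache (
    Id            int PRIMARY KEY,
    Value         int,
    TestSessionId int,
    TestType      varchar(50),
    StudentId     int,
    AttemptNumber int
);
GO
CREATE TRIGGER dbo.trg_Responses_RefreshCache
ON dbo.Responses
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Collect the (student, session) groups touched by this statement.
    DECLARE @affected TABLE (StudentId int, TestSessionId int);
    INSERT INTO @affected
    SELECT StudentId, TestSessionId FROM inserted
    UNION
    SELECT StudentId, TestSessionId FROM deleted;

    -- ROW_NUMBER is computed per group, so only those groups need rebuilding.
    DELETE c
    FROM dbo.ResponseAttemptsCache AS c
    JOIN @affected AS a ON a.StudentId = c.StudentId
                       AND a.TestSessionId = c.TestSessionId;

    INSERT INTO dbo.ResponseAttemptsCache
        (Id, Value, TestSessionId, TestType, StudentId, AttemptNumber)
    SELECT r.Id, r.Value, r.TestSessionId, t.Type, r.StudentId,
           ROW_NUMBER() OVER (PARTITION BY r.StudentId, r.TestSessionId
                              ORDER BY r.Id)
    FROM dbo.Responses r
    JOIN dbo.TestSessions ts ON r.TestSessionId = ts.Id
    JOIN dbo.Tests t ON ts.TestId = t.Id
    JOIN @affected a ON a.StudentId = r.StudentId
                    AND a.TestSessionId = r.TestSessionId;
END;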

Is it a bad idea to keep a subtotal field in database

I have a MySQL table that represents a list of orders and a related child table that represents the shipments associated with each order (some orders have more than one shipment, but most have just one).
Each shipment has a number of costs, for example:
ItemCost
ShippingCost
HandlingCost
TaxCost
There are many places in the application where I need to get consolidated information for the order such as:
TotalItemCost
TotalShippingCost
TotalHandlingCost
TotalTaxCost
TotalCost
TotalPaid
TotalProfit
All those fields are dependent on the aggregated values in the related shipment table. This information is used in other queries, reports, screens, etc., some of which have to return a result on tens of thousands of records quickly for a user.
As I see it, there are a few basic ways to go with this:
Use a subquery to calculate these items from the shipment table whenever they are needed. This complicates things quite a bit for all the queries that need all or part of this information. It is also slow.
Create a view that exposes the subqueries as simple fields. This keeps the reports that need them simple.
Add these fields to the order table. This would give me the performance I am looking for, at the expense of duplicating data and having to recalculate it whenever I change the shipment records.
One other thing: I am using a business layer that exposes functions to get this data (for example GetOrders(filter)), and I don't need the subtotals every time (or I need only some of them, some of the time), so generating a subquery each time (even from a view) is probably a bad idea.
Are there any best practices that anybody can point me to help me decide what the best design for this is?
Incidentally, I ended up doing #3 primarily for performance and query simplicity reasons.
Update:
Got lots of great feedback pretty quickly; thank you all. To give a bit more background, one of the places the information is shown is the admin console, where I have a potentially very long list of orders and need to show TotalCost, TotalPaid, and TotalProfit for each.
There's absolutely nothing wrong with doing rollups of your statistical data and storing them to enhance application performance. Just keep in mind that you will probably need a set of triggers or jobs to keep the rollups in sync with your source data.
I would probably go about this by caching the subtotals in the database for fastest query performance if most of the time you're doing reads instead of writes. Create an update trigger to recalculate the subtotal when the row changes.
I would only use a view to calculate them on SELECT if the number of rows was typically pretty small and access somewhat infrequent. Performance will be much better if you cache them.
Option 3 is the fastest
If and when you are running into performance issues and if you cannot solve these any other way, option #3 is the way to go.
Use triggers to do the updating
You should use triggers after insert, update and delete to keep the subtotals in your order table in sync with the underlying data.
Take special care when retrospectively changing prices and the like, as this will require a full recalculation of all subtotals.
So you will need a lot of triggers that usually don't do much most of the time.
Note that if a tax rate changes, it changes in the future, for orders that you don't have yet.
If the triggers take a lot of time, make sure you do these updates in off-peak hours.
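For example, the insert trigger might look like this (table and column names are guesses based on the question; the UPDATE and DELETE triggers would apply the NEW/OLD delta or subtract the OLD values instead):

CREATE TRIGGER trg_shipments_ai
AFTER INSERT ON shipments
FOR EACH ROW
    -- Fold the new shipment's costs into the parent order's cached totals.
    UPDATE orders
    SET TotalItemCost     = TotalItemCost     + NEW.ItemCost,
        TotalShippingCost = TotalShippingCost + NEW.ShippingCost,
        TotalHandlingCost = TotalHandlingCost + NEW.HandlingCost,
        TotalTaxCost      = TotalTaxCost      + NEW.TaxCost,
        TotalCost         = TotalCost + NEW.ItemCost + NEW.ShippingCost
                            + NEW.HandlingCost + NEW.TaxCost
    WHERE id = NEW.order_id;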
Run an automatic check periodically to make sure the cached values are correct
You may also want to keep a golden subquery in place that calculates all the values and checks them against the stored values in the order table.
Run this query every night and have it report any abnormalities, so that you can see when the denormalized values are out-of-sync.
Do not do any invoicing on orders that have not been processed by the validation query
Add an extra date field to the order table, called timeoflastsuccessfulvalidation, and set it to NULL if the validation was unsuccessful.
Only invoice items whose timeoflastsuccessfulvalidation is less than 24 hours old.
Of course you don't need to check orders that are fully processed, only orders that are pending.
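The nightly check could be a single comparison of stored versus recomputed totals, something along these lines (hypothetical names, including the status column):

SELECT o.id
FROM orders o
JOIN (SELECT order_id,
             SUM(ItemCost)     AS item_total,
             SUM(ShippingCost) AS shipping_total,
             SUM(HandlingCost) AS handling_total,
             SUM(TaxCost)      AS tax_total
      FROM shipments
      GROUP BY order_id) s ON s.order_id = o.id
WHERE o.status = 'pending'        -- skip fully processed orders
  AND (   o.TotalItemCost     <> s.item_total
       OR o.TotalShippingCost <> s.shipping_total
       OR o.TotalHandlingCost <> s.handling_total
       OR o.TotalTaxCost      <> s.tax_total);

Any rows this returns are out of sync; leave their timeoflastsuccessfulvalidation NULL so they are not invoiced.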
Option 1 may be fast enough
With regards to #1
It is also slow.
That depends a lot on how you query the DB.
You mention subselects; in the mostly complete skeleton query below, I don't see the need for many subselects, so you have me puzzled there a bit.
SELECT field1,field2,field3
, oifield1,oifield2,oifield3
, NettItemCost * (1+taxrate) as TotalItemCost
, TotalShippingCost
, TotalHandlingCost
, NettItemCost * taxRate as TotalTaxCost
, (NettItemCost * (1+taxrate)) + TotalShippingCost + TotalHandlingCost as TotalCost
, TotalPaid
, somethingorother as TotalProfit
FROM (
SELECT o.field1,o.field2, o.field3
, oi.field1 as oifield1, oi.field2 as oifield2, oi.field3 as oifield3
, SUM(c.productprice * oi.qty) as NettItemCost
, SUM(IFNULL(sc.shippingperkg,0) * oi.qty * p.WeightInKg) as TotalShippingCost
, SUM(IFNULL(hc.handlingperwhatever,0) * oi.qty) as TotalHandlingCost
, t.taxrate as TaxRate
, IFNULL(paid.amountpaid,0) as TotalPaid
FROM orders o
INNER JOIN orderitem oi ON (oi.order_id = o.id)
INNER JOIN products p ON (p.id = oi.product_id)
INNER JOIN prices c ON (c.product_id = p.id
AND o.orderdate BETWEEN c.validfrom AND c.validuntil)
INNER JOIN taxes t ON (p.tax_id = t.tax_id
AND o.orderdate BETWEEN t.validfrom AND t.validuntil)
LEFT JOIN shippingcosts sc ON (o.country = sc.country
AND o.orderdate BETWEEN sc.validfrom AND sc.validuntil)
LEFT JOIN handlingcost hc ON (hc.id = oi.handlingcost_id
AND o.orderdate BETWEEN hc.validfrom AND hc.validuntil)
LEFT JOIN (SELECT pay.order_id, SUM(pay.payment) as amountpaid
FROM payment pay
GROUP BY pay.order_id) paid ON (paid.order_id = o.id)
WHERE o.id BETWEEN '1245' AND '1299'
GROUP BY o.id DESC, oi.id DESC ) AS sub
Thinking about it, you would need to split this query up between the stuff that's relevant per order and per order item, but I'm too lazy to do that now.
Speed tips
Make sure you have indexes on all fields involved in the join-criteria.
Use a MEMORY table for the smaller tables, like taxes and shippingcosts, and use a hash index on the ids in the memory tables.
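For example (a sketch; columns are guessed from the skeleton query above, and the index is non-unique because one tax_id can have several validity periods):

CREATE TABLE taxes_mem (
    tax_id     INT NOT NULL,
    taxrate    DECIMAL(6,4) NOT NULL,
    validfrom  DATE,
    validuntil DATE,
    KEY idx_tax (tax_id) USING HASH   -- hash index: fast equality lookups on the join column
) ENGINE=MEMORY;

INSERT INTO taxes_mem
SELECT tax_id, taxrate, validfrom, validuntil FROM taxes;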
I would avoid #3 as much as I can, for several reasons:
It's too hard to discuss performance without measurements. Imagine the user shopping around, adding items to an order; every time an item is added, you would need to update the order record, which may not be necessary (some sites only show the order total when you click the shopping cart, ready to check out).
Having a duplicated column is asking for bugs - you cannot expect every future developer/maintainer to be aware of this extra column. Triggers can help but I think triggers should only be used as a last resort to address a bad database design.
A different database schema can be used for reporting purpose. The reporting database can be highly de-normalized for performance purpose without complicating the main application.
I tend to put the actual logic for computing subtotals in the application layer, because a subtotal is actually an overloaded thing related to different contexts: sometimes you want the "raw" subtotal, sometimes you want the subtotal after applying a discount. You just cannot keep adding columns to the order table for every scenario.
It's not a bad idea; unfortunately, MySQL lacks some features that would make this really easy: computed columns and indexed (materialized) views. You can probably simulate them with a trigger.

MySQL Triple Join Performance Help, Copying to Tmp Table

I'm working on a query for a news site, which will find FeaturedContent for display on the main homepage. Content marked this way is tagged as 'FeaturedContent', and ordered in a featured table by 'homepage'. I currently have the desired output, but the query runs in over 3 seconds, which I need to cut down on. How does one optimize a query like the one which follows?
EDIT: Materialized the view every minute as suggested, down to 0.4 seconds:
SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
FROM live.search_all s
INNER JOIN live.tags t
ON s.item_id = t.item_id AND s.item_type = t.item_type AND t.tag = 'FeaturedContent'
LEFT OUTER JOIN live.featured f
ON s.item_id = f.item_id AND s.item_type = f.item_type AND f.feature_type = 'homepage'
ORDER BY position IS NULL, position ASC, date
This returns all the homepage features in order, followed by other featured content ordered by date.
The explain looks like this:
| id | select_type | table | type   | possible_keys        | key       | key_len | ref                                       | rows | Extra                                                     |
|----|-------------|-------|--------|----------------------|-----------|---------|-------------------------------------------|------|-----------------------------------------------------------|
| 1  | SIMPLE      | t2    | ref    | PRIMARY,tag_index    | tag_index | 303     | const                                     | 2    | Using where; Using index; Using temporary; Using filesort |
| 1  | SIMPLE      | t     | ref    | PRIMARY              | PRIMARY   | 4       | newswires.t2.id                           | 1974 | Using index                                               |
| 1  | SIMPLE      | s     | eq_ref | PRIMARY,search_index | PRIMARY   | 124     | newswires.t.item_id,newswires.t.item_type | 1    |                                                           |
| 1  | SIMPLE      | f     | index  | NULL                 | PRIMARY   | 190     | NULL                                      | 13   | Using index                                               |
And the Profile is as follows:
| Status               | Time     |
|----------------------|----------|
| starting             | 0.000091 |
| Opening tables       | 0.000756 |
| System lock          | 0.000005 |
| Table lock           | 0.000008 |
| init                 | 0.000004 |
| checking permissions | 0.000001 |
| checking permissions | 0.000001 |
| checking permissions | 0.000043 |
| optimizing           | 0.000019 |
| statistics           | 0.000127 |
| preparing            | 0.000023 |
| Creating tmp table   | 0.001802 |
| executing            | 0.000001 |
| Copying to tmp table | 0.311445 |
| Sorting result       | 0.014819 |
| Sending data         | 0.000227 |
| end                  | 0.000002 |
| removing tmp table   | 0.002010 |
| end                  | 0.000005 |
| query end            | 0.000001 |
| freeing items        | 0.000296 |
| logging slow query   | 0.000001 |
| cleaning up          | 0.000007 |
I'm new to reading the EXPLAIN output, so I'm unsure if I have a better ordering available, or anything rather simple that could be done to speed these along.
The search_all table is the materialized view table which is periodically updated, while the tags and featured tables are views. These views are not optional, and cannot be worked around.
The tags view combines tags and a relational table to get back a listing of tags according to item_type and item_id, but the other views are all simple views of one table.
EDIT: With the materialized view, the biggest bottleneck seems to be the 'Copying to tmp table' step. Without ordering the output, it takes 0.0025 seconds (much better!), but the final output does need to be ordered. Is there any way to enhance the performance of that step, or work around it?
Sorry if the formatting is difficult to read, I'm new and unsure how it is regularly done.
Thanks for your help! If anything else is needed, please let me know!
EDIT: Table sizes, for reference:
Tag Relations: 197,411
Tags: 16,897
Stories: 51,801
Images: 28,383
Videos: 2,408
Featured: 13
I think optimizing your query alone won't be very useful. My first thought is that joining against a subquery which is itself made of UNIONs is, by itself, a double bottleneck for performance.
If you have the option to change your database structure, then I would suggest merging the 3 tables stories, images and videos into one, since they look very similar (adding a type ENUM('story', 'image', 'video') column to differentiate the records); this would remove both the subquery and the union.
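Schematically, the merged table could look like this (columns guessed from the query in the question):

CREATE TABLE content (
    item_id   INT UNSIGNED NOT NULL,
    item_type ENUM('story', 'image', 'video') NOT NULL,
    title     VARCHAR(255),
    caption   TEXT,
    date      DATETIME,
    PRIMARY KEY (item_id, item_type),   -- matches the (item_id, item_type) lookups in the query
    KEY idx_date (date)                 -- supports the ORDER BY ... date
);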
Also, it looks like your views on stories and videos, are not using an indexed field to filter content. Are you querying an indexed column?
It's a pretty tricky problem without knowing your full table structure and the distribution of your data!
Another option, which would not involve bringing modifications to your existing database (especially if it is already in production), would be to "cache" this information into another table, which would be periodically refreshed by a cron job.
The caching can be done at different levels, either on the full query, or on subparts of it (independent views, or the 3 unions merged into a single cache table, etc.)
The viability of this option depends on whether it is acceptable to display slightly outdated data, or not. It might be acceptable for just some parts of your data, which may imply that you will cache just a subset of the tables/views involved in the query.
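As a sketch of that cache-table idea, using MySQL's event scheduler in place of an external cron job (the table name is illustrative, and the event scheduler must be enabled):

-- Build the cache table once from the slow query:
CREATE TABLE featured_cache AS
SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
FROM live.search_all s
INNER JOIN live.tags t
    ON s.item_id = t.item_id AND s.item_type = t.item_type
    AND t.tag = 'FeaturedContent'
LEFT OUTER JOIN live.featured f
    ON s.item_id = f.item_id AND s.item_type = f.item_type
    AND f.feature_type = 'homepage';

-- Refresh it every minute; the same two statements could run from cron instead:
DELIMITER //
CREATE EVENT refresh_featured_cache
ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
    DELETE FROM featured_cache;
    INSERT INTO featured_cache
    SELECT f.position, s.item_id, s.item_type, s.title, s.caption, s.date
    FROM live.search_all s
    INNER JOIN live.tags t
        ON s.item_id = t.item_id AND s.item_type = t.item_type
        AND t.tag = 'FeaturedContent'
    LEFT OUTER JOIN live.featured f
        ON s.item_id = f.item_id AND s.item_type = f.item_type
        AND f.feature_type = 'homepage';
END//
DELIMITER ;

The homepage then reads from featured_cache with the ORDER BY applied, accepting up to a minute of staleness.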