Is it a bad idea to keep a subtotal field in the database? (MySQL)

I have a MySQL table that represents a list of orders and a related child table that represents the shipments associated with each order (some orders have more than one shipment, but most have just one).
Each shipment has a number of costs, for example:
ItemCost
ShippingCost
HandlingCost
TaxCost
There are many places in the application where I need to get consolidated information for the order such as:
TotalItemCost
TotalShippingCost
TotalHandlingCost
TotalTaxCost
TotalCost
TotalPaid
TotalProfit
All those fields are dependent on the aggregated values in the related shipment table. This information is used in other queries, reports, screens, etc., some of which have to return a result on tens of thousands of records quickly for a user.
As I see it, there are a few basic ways to go with this:
Use a subquery to calculate these items from the shipment table whenever they are needed. This complicates things quite a bit for all the queries that need all or part of this information. It is also slow.
Create a view that exposes the subqueries as simple fields. This keeps the reports that need them simple.
Add these fields to the order table. This would give me the performance I am looking for, at the expense of having to duplicate data and recalculate it whenever I change the shipment records.
One other thing: I am using a business layer that exposes functions to get this data (for example, GetOrders(filter)), and I don't need the subtotals every time (or need only some of them some of the time), so generating a subquery each time (even from a view) is probably a bad idea.
Are there any best practices that anybody can point me to help me decide what the best design for this is?
Incidentally, I ended up doing #3 primarily for performance and query simplicity reasons.
Update:
Got lots of great feedback pretty quickly, thank you all. To give a bit more background, one of the places the information is shown is the admin console, where I have a potentially very long list of orders and need to show TotalCost, TotalPaid, and TotalProfit for each.

There's absolutely nothing wrong with doing rollups of your statistical data and storing them to enhance application performance. Just keep in mind that you will probably need to create a set of triggers or jobs to keep the rollups in sync with your source data.

I would probably go about this by caching the subtotals in the database for fastest query performance if most of the time you're doing reads instead of writes. Create an update trigger to recalculate the subtotal when the row changes.
I would only use a view to calculate them on SELECT if the number of rows was typically pretty small and access somewhat infrequent. Performance will be much better if you cache them.
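For example, a minimal sketch of such a trigger, assuming a shipment child table with an order_id column and matching Total* columns on the orders table (the exact names are illustrative, not taken from the question); it applies only the delta from the changed row instead of re-aggregating the whole child table:
DELIMITER $$
CREATE TRIGGER shipment_after_update
AFTER UPDATE ON shipment
FOR EACH ROW
BEGIN
  # apply only the difference introduced by this row
  UPDATE orders
     SET TotalItemCost     = TotalItemCost     - OLD.ItemCost     + NEW.ItemCost,
         TotalShippingCost = TotalShippingCost - OLD.ShippingCost + NEW.ShippingCost,
         TotalHandlingCost = TotalHandlingCost - OLD.HandlingCost + NEW.HandlingCost,
         TotalTaxCost      = TotalTaxCost      - OLD.TaxCost      + NEW.TaxCost
   WHERE id = NEW.order_id;
END$$
DELIMITER ;
Similar AFTER INSERT and AFTER DELETE triggers would add or subtract the full row values.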

Option 3 is the fastest
If and when you are running into performance issues and if you cannot solve these any other way, option #3 is the way to go.
Use triggers to do the updating
You should use triggers after insert, update and delete to keep the subtotals in your order table in sync with the underlying data.
Take special care when retrospectively changing prices and stuff as this will require a full recalc of all subtotals.
So you will need a lot of triggers that usually don't do much most of the time.
If a tax rate changes, it changes going forward, for orders that you don't have yet, so existing subtotals are unaffected.
If the triggers take a lot of time, make sure you do these updates in off-peak hours.
Run an automatic check periodically to make sure the cached values are correct
You may also want to keep a golden subquery in place that calculates all the values and checks them against the stored values in the order table.
Run this query every night and have it report any abnormalities, so that you can see when the denormalized values are out-of-sync.
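A sketch of such a nightly check, using the same illustrative orders/shipment naming as above; it lists the orders whose cached total has drifted from the recomputed value:
SELECT o.id,
       o.TotalItemCost AS cached_item_cost,
       s.calculated_item_cost
  FROM orders o
  JOIN (SELECT order_id, SUM(ItemCost) AS calculated_item_cost
          FROM shipment
         GROUP BY order_id) s ON s.order_id = o.id
 WHERE o.TotalItemCost <> s.calculated_item_cost;
The same pattern extends to the other cached columns.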
Do not do any invoicing on orders that have not been processed by the validation query
Add an extra date field to the order table called timeoflastsuccessfulvalidation and set it to NULL if the validation was unsuccessful.
Only invoice orders whose timeoflastsuccessfulvalidation is less than 24 hours old.
Of course you don't need to check orders that are fully processed, only orders that are pending.
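As a sketch (again with illustrative names), the validation job could stamp the orders whose cached values match, and invoicing could then filter on that stamp:
# stamp orders whose cached totals match the recomputed values
UPDATE orders o
  JOIN (SELECT order_id, SUM(ItemCost) AS calculated_item_cost
          FROM shipment
         GROUP BY order_id) s ON s.order_id = o.id
   SET o.timeoflastsuccessfulvalidation = NOW()
 WHERE o.TotalItemCost = s.calculated_item_cost;

# only consider orders validated within the last 24 hours for invoicing
SELECT id
  FROM orders
 WHERE timeoflastsuccessfulvalidation >= NOW() - INTERVAL 24 HOUR;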
Option 1 may be fast enough
With regards to #1
It is also slow.
That depends a lot on how you query the DB.
You mention subselects; in the mostly complete skeleton query below I don't see the need for many subselects, so you have me a bit puzzled there.
SELECT field1,field2,field3
, oifield1,oifield2,oifield3
, NettItemCost * (1+taxrate) as TotalItemCost
, TotalShippingCost
, TotalHandlingCost
, NettItemCost * taxRate as TotalTaxCost
, (NettItemCost * (1+taxrate)) + TotalShippingCost + TotalHandlingCost as TotalCost
, TotalPaid
, somethingorother as TotalProfit
FROM (
SELECT o.field1,o.field2, o.field3
, oi.field1 as oifield1, oi.field2 as oifield2, oi.field3 as oifield3
, SUM(c.productprice * oi.qty) as NettItemCost
, SUM(IFNULL(sc.shippingperkg,0) * oi.qty * p.WeightInKg) as TotalShippingCost
, SUM(IFNULL(hc.handlingperwhatever,0) * oi.qty) as TotalHandlingCost
, t.taxrate as TaxRate
, IFNULL(pay.amountpaid,0) as TotalPaid
FROM orders o
INNER JOIN orderitem oi ON (oi.order_id = o.id)
INNER JOIN products p ON (p.id = oi.product_id)
INNER JOIN prices c ON (c.product_id = p.id
AND o.orderdate BETWEEN c.validfrom AND c.validuntil)
INNER JOIN taxes t ON (p.tax_id = t.tax_id
AND o.orderdate BETWEEN t.validfrom AND t.validuntil)
LEFT JOIN shippingcosts sc ON (o.country = sc.country
AND o.orderdate BETWEEN sc.validfrom AND sc.validuntil)
LEFT JOIN handlingcost hc ON (hc.id = oi.handlingcost_id
AND o.orderdate BETWEEN hc.validfrom AND hc.validuntil)
LEFT JOIN (SELECT order_id, SUM(payment) as amountpaid FROM payment
GROUP BY order_id) pay ON (pay.order_id = o.id)
WHERE o.id BETWEEN '1245' AND '1299'
GROUP BY o.id DESC, oi.id DESC ) AS sub
Thinking about it, you would need to split this query up for stuff that's relevant per order and per order_item, but I'm too lazy to do that now.
Speed tips
Make sure you have indexes on all fields involved in the join-criteria.
Use a MEMORY table for the smaller tables, like taxes and shippingcosts, and use a hash index for the ids in the memory tables.
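For illustration, using the table names from the skeleton query above (the exact column lists are assumptions):
# index the join/filter columns of the child table
ALTER TABLE orderitem ADD INDEX idx_orderitem_order (order_id, product_id);

# copy a small lookup table into memory and hash-index its join key
CREATE TABLE taxes_mem ENGINE=MEMORY AS SELECT * FROM taxes;
ALTER TABLE taxes_mem ADD INDEX idx_taxes_mem_id (tax_id) USING HASH;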

I would avoid #3 as much as I can. I prefer to avoid it for several reasons:
It's too hard to discuss performance without measurement. Imagine the user is shopping around, adding order items to an order; every time an item is added, you would need to update the order record, which may not be necessary (some sites only show the order total when you click the shopping cart and are ready to check out).
Having a duplicated column is asking for bugs - you cannot expect every future developer/maintainer to be aware of this extra column. Triggers can help but I think triggers should only be used as a last resort to address a bad database design.
A different database schema can be used for reporting purpose. The reporting database can be highly de-normalized for performance purpose without complicating the main application.
I tend to put the actual logic for computing the subtotal in the application layer, because the subtotal is really an overloaded thing that depends on context: sometimes you want the "raw subtotal", sometimes you want the subtotal after applying a discount. You just cannot keep adding columns to the order table for every scenario.

It's not a bad idea; unfortunately MySQL doesn't have some features that would make this really easy, namely computed columns and indexed (materialized) views. You can probably simulate it with a trigger.
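A minimal sketch of simulating a stored computed column with a trigger, assuming a per-shipment TotalCost column alongside the cost columns from the question (the table name is illustrative):
CREATE TRIGGER shipment_totalcost_bi
BEFORE INSERT ON shipment
FOR EACH ROW
  SET NEW.TotalCost = NEW.ItemCost + NEW.ShippingCost + NEW.HandlingCost + NEW.TaxCost;
A matching BEFORE UPDATE trigger keeps the value correct when a row is edited.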

Related

Optimize query from view with UNION ALL

So I'm facing a difficult scenario: I have a legacy app, badly written and designed, with a table, t_booking. This app has a calendar view which, for every hall and for every day in the month, shows its reservation status, with this query:
SELECT mr1b.id, mr1b.idreserva, mr1b.idhotel, mr1b.idhall, mr1b.idtiporeserva, mr1b.date, mr1b.ampm, mr1b.observaciones, mr1b.observaciones_bookingarea, mr1b.tipo_de_navegacion, mr1b.portal, r.estado
FROM t_booking mr1b
LEFT JOIN a_reservations r ON mr1b.idreserva = r.id
WHERE mr1b.idhotel = '$sIdHotel' AND mr1b.idhall = '$hall' AND mr1b.date = '$iAnyo-$iMes-$iDia'
AND IF (r.comidacena IS NULL OR r.comidacena = '', mr1b.ampm = 'AM', r.comidacena = 'AM' AND mr1b.ampm = 'AM')
AND (r.estado <> 'Cancelled' OR r.estado IS NULL OR r.estado = '')
LIMIT 1;
(at first there was also an ORDER BY r.estado DESC which I took out)
This query, after proper (I think) indexing, takes 0.004 seconds each, and the overall calendar view is presented in a reasonable time. There are indexes over idhotel, idhall, and date.
Now, I have a new module, well written ;-), which does reservations in another table, but I must present both types of reservations in the same calendar view. My first approach was to create a view joining the content of both tables, and selecting data for the calendar view from this view instead of t_booking.
The view is defined like this:
CREATE OR REPLACE VIEW
t_booking_hall_reservation
AS
SELECT id,
idreserva,
idhotel,
idhall,
idtiporeserva,
date,
ampm,
observaciones,
observaciones_bookingarea,
tipo_de_navegacion, portal
FROM t_booking
UNION ALL
SELECT HR.id,
HR.convention_id as idreserva,
H.id_hotel as idhotel,
HR.hall_id as idhall,
99 as idtiporeserva,
date,
session as ampm,
observations as observaciones,
'new module' as observaciones_bookingarea,
'events' as tipo_de_navegacion,
'new module' as portal
FROM new_hall_reservation HR
JOIN a_halls H on H.id = HR.hall_id
;
(table new_hall_reservation has the same indexes)
I tried UNION ALL instead of UNION as I read this is much more efficient.
Well, the former query, swapping t_booking for t_booking_hall_reservation, takes 1.5 seconds, multiplied for each hall and each day, which makes the calendar view impossible to finish.
The app is spaghetti code, so looping twice, once over t_booking and then over new_hall_reservation, and combining the results is somewhat difficult.
Is it possible to tune the view to make this query fast enough? Another approach?
Thanks
PS: the less I modify the original query, the less I'll need to modify the legacy app, which is, at the least, risky to modify
This is too long for a comment.
A view is (almost) never going to help performance. Yes, they make queries simpler. Yes, they incorporate important logic. But no, they don't help performance.
One key problem is the execution of the view -- it doesn't generally take the filters in the overall query into account (although the most recent versions of MySQL are better at this).
One suggestion -- which might be a fair bit of work -- is to materialize the view as a table. When the underlying tables change, you need to change t_booking_hall_reservation using triggers. Then you can create indexes on the table to achieve your performance goals.
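A rough sketch of the materialization step, reusing the view already defined in the question (keeping the table current on writes would be the job of the triggers mentioned above):
# snapshot the view into a real, indexable table
CREATE TABLE t_booking_hall_reservation_mat AS
SELECT * FROM t_booking_hall_reservation;

ALTER TABLE t_booking_hall_reservation_mat
  ADD INDEX idx_calendar (idhotel, idhall, date);
The legacy query would then read from t_booking_hall_reservation_mat and keep its original shape.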
t_booking, unless it is a VIEW, needs
INDEX(idhotel, idhall, date)
VIEWs are syntactic sugar; they do not enhance performance; sometimes they are slower than the equivalent SELECT.

Right way to phrase MySQL query across many (possibly empty) tables

I'm trying to do what I think is a set of simple set operations on a database table: several intersections and one union. But I don't seem to be able to express that in a simple way.
I have a MySQL table called Moment, which has many millions of rows. (It happens to be a time-series table but that doesn't impact on my problem here; however, these data have a column 'source' and a column 'time', both indexed.) Queries to pull data out of this table are created dynamically (coming in from an API), and ultimately boil down to a small pile of temporary tables indicating which 'source's we care about, and maybe the 'time' ranges we care about.
Let's say we're looking for
(source in Temp1) AND (
((source in Temp2) AND (time > '2017-01-01')) OR
((source in Temp3) AND (time > '2016-11-15'))
)
Just for excitement, let's say Temp2 is empty --- that part of the API request was valid but happened to include 'no actual sources'.
If I then do
SELECT m.* from Moment as m,Temp1,Temp2,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... I get a heaping mound of nothing, because the empty Temp2 gives an empty Cartesian product before we get to the WHERE clause.
Okay, I can do
SELECT m.* from Moment as m
LEFT JOIN Temp1 on m.source=Temp1.source
LEFT JOIN Temp2 on m.source=Temp2.source
LEFT JOIN Temp3 on m.source=Temp3.source
WHERE (m.source = Temp1.source) AND (
((m.source = Temp2.source) AND (m.time > '2017-01-01')) OR
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... but this takes >70ms even on my relatively small development database.
If I manually eliminate the empty table,
SELECT m.* from Moment as m,Temp1,Temp3
WHERE (m.source = Temp1.source) AND (
((m.source = Temp3.source) AND (m.time > '2016-11-15'))
)
... it finishes in 10ms. That's the kind of time I'd expect.
I've also tried putting a single unmatchable row in the empty table and doing SELECT DISTINCT, and it splits the difference at ~40ms. Seems an odd solution though.
This really feels like I'm just conceptualizing the query wrong, that I'm asking the database to do more work than it needs to. What is the Right Way to ask the database this question?
Thanks!
--UPDATE--
I did some actual benchmarks on my actual database, and came up with some really unexpected results.
For the scenario above, all tables indexed on the columns being compared, with an empty table,
doing it with left joins took 3.5 minutes (!!!)
doing it without joins (just 'FROM...WHERE') and adding a null row to the empty table, took 3.5 seconds
even more striking, when there wasn't an empty table, but rather ~1000 rows in each of the temporary tables,
doing the whole thing in one query took 28 minutes (!!!!!), but,
doing each of the three AND clauses separately and then doing the final combination in the code took less than a second.
I still feel I'm expressing the query in some foolish way, since again, all I'm trying to do is one set union (OR) and a few set intersections. It really seems like the DB is making this gigantic Cartesian product when it seriously doesn't need to. All in all, as pointed out in the answer below, keeping some of the intelligence up in the code seems to be the better approach here.
There are various ways to tackle the problem. Needless to say it depends on
how many queries are sent to the database,
the amount of data you are processing in a time interval,
how the database backend is configured to manage it.
For your use case, a little more information would be helpful. One option would be to optimize your query with CASE/COUNT(*) or CASE/LIMIT combinations to sort out empty tables. However, such conditional queries cost more time.
You could split the SQL code to downgrade the scaling of the problem from 1*N^x to y*N^z, where z should be smaller than x.
You said that an API is involved; maybe you are able to handle the temporary "no data" tables differently, or even avoid storing them?
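For example, the layer that builds the dynamic query could check each temporary table first and simply drop the branch for an empty one (a sketch, not specific to your code):
# if this returns 0, omit the "(source in Temp2) AND (time > '2017-01-01')" branch entirely
SELECT COUNT(*) FROM Temp2;
That reproduces programmatically the manual elimination that already brought your query down to 10ms.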
Another option would be to enable query caching:
https://dev.mysql.com/doc/refman/5.5/en/query-cache-configuration.html
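Illustrative my.cnf settings only (sizes depend on your workload, and note the query cache was removed entirely in MySQL 8.0):
[mysqld]
query_cache_type = 1
query_cache_size = 64M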

Stock management database design

I'm creating an Intranet for my company, and we want to have a stock management in it. We sell and rent alarm systems, and we want to have a good overview of what product is still in our offices, what has been rented or sold, at what time, etc.
At the moment I thought about this database design:
Every time we create a new contract, this contract is about a rental or a sale of an item. So we have a Product table (which is the type of product: alarms, alarm watches, etc.), and an Item table, which is the item itself, with its unique serial number. I thought about doing this because I'll need to have a trace of where a specific item is, whether it's at a client's house (rented), whether it's sold, etc. Products are related to a specific supplier, with whom we can place orders. But here I have a problem: shouldn't the order table be related to Product?
The main concern here is the link between Stock, Item, and Movement_stock. I wanted a design where I'd be able to see when a specific Item is pulled out of our stock and when it enters the stock, with the date. That's why I thought about a Movement_stock table. The Type_Movement is either In / Out.
But I'm a bit lost here, I really don't know how to do it nicely. That's why I'm asking for a bit of help.
I have the same need, and here is how I tackled your stock movement issue (which became my issue too).
In order to model stock movement (+/-), I have my supplying and my order tables. Supplying acts as my +stock, and my orders as my -stock.
If we stopped there, we could compute our actual stock, which would translate into this SQL query:
SELECT
id,
name,
sup.length - ord.length AS 'stock'
FROM
product
# Computes the number of items arrived
INNER JOIN (
SELECT
productId,
SUM(quantity) AS 'length'
FROM
supplying
WHERE
arrived IS TRUE
GROUP BY
productId
) AS sup ON sup.productId = product.id
# Computes the number of order
INNER JOIN (
SELECT
productId,
SUM(quantity) AS 'length'
FROM
product_order
GROUP BY
productId
) AS ord ON ord.productId = product.id
Which would give something like:
id name stock
=========================
1 ASUS Vivobook 3
2 HP Spectre 10
3 ASUS Zenbook 0
...
While this could save you one table, you will not be able to scale with it, which is why most designs (imho) use an intermediate stock table, mostly for performance concerns.
One of the downsides is data duplication, because you will need to rerun the query above to update your stock (see the updatedAt column).
The good side is client performance. You will deliver faster responses through your API.
I think another downside could appear if you are managing a high-traffic store. You could imagine creating another table that records the fact that a stock is being recomputed, and making the user wait until the recomputation is finished (push request or long polling) in order to check that every one of his/her items is still available (stock >= user demand). But that is another deal...
Anyway, even though the stock recomputation query uses anonymous subqueries, it should actually be fast enough in most relatively medium-sized stores.
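A sketch of that recomputation written as a single refresh statement, assuming a stock cache table with (productId, quantity, updatedAt) and productId as its primary key; the LEFT JOINs also keep products with no supplies or orders from disappearing:
REPLACE INTO stock (productId, quantity, updatedAt)
SELECT
    p.id,
    IFNULL(sup.length, 0) - IFNULL(ord.length, 0),
    NOW()
FROM
    product p
    # quantity arrived so far
    LEFT JOIN (
        SELECT productId, SUM(quantity) AS 'length'
        FROM supplying
        WHERE arrived IS TRUE
        GROUP BY productId
    ) AS sup ON sup.productId = p.id
    # quantity ordered so far
    LEFT JOIN (
        SELECT productId, SUM(quantity) AS 'length'
        FROM product_order
        GROUP BY productId
    ) AS ord ON ord.productId = p.id;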
Note
You see that in product_order I duplicated the price and the VAT. This is for reliability reasons: to freeze the price at the moment of the purchase, and to be able to recompute the total with a lot of decimals (without losing cents along the way).
Hope it helps someone passing by.
Edit
In practice, I use it with Laravel, through a console command which computes my product stock in batch (with an optional parameter to compute only a certain product id), so my stock is always correct (relative to the query above), and I never manually update the stock table.
This is an interesting discussion and one that also could be augmented with stock availability as of a certain date...
This means storing:
Planned Orders for the Product on a certain date
Confirmed Orders as of a certain date
Orders Delivered
Orders Returned (especially if this is a hire product)
Each one of these product movements could be from and to a location
The user queries would then include:
What is my overall stock on hand
What is due to be delivered on a certain date
What will the stock on hand be as of a date overall
What will the stock on hand be as of a date for a location
The inventory design MUST take into account the queries and use cases of the users to determine the design, and also break normalisation rules where needed to provide adequate performance at the right time.
Lots to consider and it all depends on the software use cases.
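As a sketch of the "stock on hand as of a date" query from that list, assuming a movement table along the lines of the question's Movement_stock (the column names and the date are illustrative):
SELECT product_id,
       SUM(CASE WHEN type_movement = 'In' THEN quantity ELSE -quantity END) AS stock_on_hand
  FROM movement_stock
 WHERE movement_date <= '2018-06-30'
 GROUP BY product_id;
Adding a location column to the WHERE clause and the grouping answers the per-location variant.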

Best way to do a query with a large number of possible joins

On the project I'm working on we have an activity table and each activity can be linked to one of about 20 different "activity details" tables...
e.g. If the activity was of type "work", then it would have a corresponding activity_details_work record, if it was of type "sick leave" then it would have a corresponding activity_details_sickleave record and so on.
Currently we are loading the activities and then for each activity we have a separate query to go fetch the activity details from the relevant table. This obviously doesn't scale well if you have thousands of activities.
So my initial thought was to have a single query which fetches the activities and joins the details in one go e.g.
SELECT * FROM activity
LEFT JOIN activity_details_1_work ON ...
LEFT JOIN activity_details_2_sickleave ON ...
LEFT JOIN activity_details_3_travelwork ON ...
...etc...
LEFT JOIN activity_details_20_yearleave ON ...
But this will result in each record having hundreds of fields, most of which are empty, and that feels nasty.
Lazy-loading the details isn't really an option either as the details are almost always requested in the core logic, at least for the main types anyway.
Is there a super clever way of doing this that I'm not thinking of?
Thanks in advance
My suggestion is to define a view for each ActivityType, that is tailored specifically to that activity.
Then add an index on the Activity table led by the ActivityType field. Cluster said index unless there is an overwhelming need for some other index to be clustered (or performance benchmarking shows some other clustering selection to be more performant).
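A sketch of one such per-type view; the detail columns, the join key and the type value are assumptions about your schema, not known from the question:
CREATE OR REPLACE VIEW activity_work_v AS
SELECT a.*,
       w.hours_worked,    # assumed detail column
       w.project_code     # assumed detail column
  FROM activity a
  INNER JOIN activity_details_1_work w ON w.activity_id = a.id
 WHERE a.activity_type = 'work';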
Is there a particular reason why this degree of denormalization was designed in? Is that reason well known?
Chances are your activity tables are like (date_from, date_to, with_who, descr) or something to that effect. As Pieter suggested, consider tossing a type varchar or enum field in there, so as to deal with a single details table.
If there are rational reasons to keep the tables apart, consider adding triggers that maintain boolean/tinyint fields (has_work, has_sickleave, etc.), or a bit string (has_activities_of_type, where the first position amounts to has_work, the next to has_sickleave, etc.).
Either way, you'll probably be better off by fetching the activity's details in one or more separate queries -- if only to avoid field name collisions.
I don't think enum is the way to go, because, as you say, there might be thousands of activities, and then altering your activity table would become an issue.
There is no point doing a left join on a large number of tables either.
So the options that you have are :
See this. The first comment might be useful.
I am guessing that your activity table has a field called activity_type_id.
Build a table called activity_types containing the fields activity_type_id, activity_name, activity_details_table_name. First query in the following way:
SELECT a.*, at.activity_details_table_name
FROM activity a
INNER JOIN activity_types at USING (activity_type_id)
This query gives you the table name on which to query for the details.
This way you can add any new activity type just by adding a row in the activity_types table.

What is more efficient (speed/memory): a join or multiple selects?

I have the following tables:
users
userId|name
items
itemId|userId|description
What I want to achieve: I want to read from the database all users and their items (a user can have multiple items). I want all this data stored in a structure like the following:
User {
id
name
array<Item>
}
where Item is
Item {
itemId
userId
description
}
My first option would be to do a SELECT * from users, partially fill an array with users, and after that, for each user, do a SELECT * from items where userId=wantedId and complete the array of items.
Is this approach correct, or should I use a join for this?
A reason that I don't want to use a join is that I have a lot of redundant data:
userId1|name1|ItemId11|description11
userId1|name1|ItemId12|description12
userId1|name1|ItemId13|description13
userId1|name1|ItemId14|description14
userId2|name2|ItemId21|description21
userId2|name2|ItemId22|description22
userId2|name2|ItemId23|description23
userId2|name2|ItemId24|description24
by redundant I mean: userId1,name1 and userId2,name2
Is my reason justified?
LATER EDIT: I added to the title speed or memory when talking about efficiency
You're trading off network roundtrips for bytes on the wire and in RAM. Network latency is usually the bigger problem, since memory is cheap and networks have gotten faster. It gets worse as the size of the first result set grows - Google for "(n+1) query problem".
I'd prefer the JOIN. Don't write it using SELECT *; that's a bad idea in almost every case. You should spell out precisely what columns you want.
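A minimal version of that JOIN against the users/items schema in the question, with the columns spelled out:
SELECT u.userId, u.name, i.itemId, i.description
  FROM users u
  LEFT JOIN items i ON i.userId = u.userId
 ORDER BY u.userId;
The application then groups consecutive rows with the same userId into one User and its array of Items; the LEFT JOIN keeps users that have no items.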
A join is the best-performing way. It reduces overhead and you can use the related indexes. You can test, but I'm sure that joins are faster and better optimized than multiple selects.
The answer is: it depends.
Multiple SELECT:
If you end up issuing lots of queries to populate the descriptions, then you have to take into account that you'll end up with a lot of round trips to the database.
Using a JOIN:
Yes, you'll be returning more data, but you've only got one round trip.
You've mentioned that you'll partially fill an array with users. Do you know in advance how many users you'll want to fill? In that case I would use the following (I'm using Oracle here):
select *
from item a,
(select * from
(select *
from user
order by user_id)
where rownum <= 10) b
where a.user_id = b.user_id
order by a.user_id
That would return all the items for the first 10 users only (that way most of the work is done on the database itself, rather than getting all the users back, discarding all but the first ten...)