How to speed up SQL queries? Indexes? - MySQL

I have the following database structure :
create table Accounting
(
Channel,
Account
)
create table ChannelMapper
(
AccountingChannel,
ShipmentsMarketPlace,
ShipmentsChannel
)
create table AccountMapper
(
AccountingAccount,
ShipmentsComponent
)
create table Shipments
(
MarketPlace,
Component,
ProductGroup,
ShipmentChannel,
Amount
)
I have the following query running on these tables, and I'm trying to optimize it to run as fast as possible:
select Accounting.Channel, Accounting.Account, Shipments.MarketPlace
from Accounting join ChannelMapper on Accounting.Channel = ChannelMapper.AccountingChannel
join AccountMapper on Accounting.Account = AccountMapper.AccountingAccount
join Shipments on
(
ChannelMapper.ShipmentsMarketPlace = Shipments.MarketPlace
and ChannelMapper.AccountingChannel = Shipments.ShipmentChannel
and AccountMapper.ShipmentsComponent = Shipments.Component
)
join (select Component, sum(amount) from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
How do I make this query run as fast as possible? Should I use indexes? If so, which columns of which tables should I index?
Here is a picture of my query plan:
Thanks,

Indexes are essential to any database.
Speaking in "layman" terms, indexes are... well, precisely that. You can think of an index as a second, hidden, table that stores two things: The sorted data and a pointer to its position in the table.
Some thumb rules on creating indexes:
Create indexes on every field that is (or will be) used in joins.
Create indexes on every field on which you want to perform frequent where conditions.
Avoid creating indexes on everything. Create index on the relevant fields of every table, and use relations to retrieve the desired data.
Avoid creating indexes on double fields, unless it is absolutely necessary.
Avoid creating indexes on varchar fields, unless it is absolutely necesary.
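For the schema in the question, a minimal sketch of the single-column join indexes might look like this (the index names are illustrative; verify each one against your own workload with EXPLAIN):
CREATE INDEX idx_accounting_channel ON Accounting (Channel);
CREATE INDEX idx_channelmapper_channel ON ChannelMapper (AccountingChannel);
CREATE INDEX idx_accountmapper_account ON AccountMapper (AccountingAccount);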
I recommend reading this: http://dev.mysql.com/doc/refman/5.5/en/using-explain.html

Your JOINS should be the first place to look. The two most obvious candidates for indexes are AccountMapper.AccountingAccount and ChannelMapper.AccountingChannel.
You should consider indexing Shipments.MarketPlace, Shipments.ShipmentChannel and Shipments.Component as well.
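One way to do that is a single composite index covering all three join columns (a sketch; the best column order depends on the selectivity of your data, so test it):
ALTER TABLE Shipments
ADD INDEX idx_shipments_join (MarketPlace, ShipmentChannel, Component);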
However, adding indexes increases the workload in maintaining them. While they might give you a performance boost on this query, you might find that updating the tables becomes unacceptably slow. In any case, the MySQL optimiser might decide that a full scan of the table is quicker than accessing it by index.
Really the only way to do this is to set up the indexes that would appear to give you the best result and then benchmark the system to make sure you're getting the results you want here, whilst not compromising the performance elsewhere. Make good use of the EXPLAIN statement to find out what's going on, and remember that optimisations made by yourself or the optimiser on small tables may not be the same optimisations you'd need on larger ones.
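For instance, prefixing the query with EXPLAIN shows whether an index is actually being used (a sketch on a trimmed-down version of the question's query):
EXPLAIN
select Accounting.Channel, Accounting.Account, Shipments.MarketPlace
from Accounting
join ChannelMapper on Accounting.Channel = ChannelMapper.AccountingChannel
join Shipments on ChannelMapper.ShipmentsMarketPlace = Shipments.MarketPlace;
-- a NULL in the `key` column of the output means no index was used for that table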

The other three answers seem to have indexes covered, so this is in addition to indexes. You have no WHERE clause, which means you are always selecting the whole darn database. In fact, your database design doesn't have anything useful in this regard, such as a shipping date. Think about that.
You also have this:
join (select Component, sum(amount) from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
That's all well and good, but you don't select anything from this subquery, so why do you have it? If you did want to select something, such as the sum(amount), you would have to give it an alias to make it available in the select clause.
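For example, the subquery could expose the total like this (a sketch; TotalAmount is a name I've made up):
join (select Component, sum(Amount) as TotalAmount
      from Shipments group by Component) as Totals
on Shipments.Component = Totals.Component
You would then add Totals.TotalAmount to the outer select list.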

Related

Join 10 tables on a single join id called session_id that's stored in a session table. Is this good/bad practice?

There are 10 tables, all with a session_id column, and a single session table. The goal is to join them all to the session table. I get the feeling that this is a major code smell. Is this good or bad practice?
What problems could occur?
Whether this is a good design or not depends deeply on what you are trying to represent with it. So, it might be OK or it might not be... there's no way to tell just from your question in its current form.
That being said, there are a couple of ways to speed up a join:
Use indexes.
Use covering indexes (see the sketch after this list).
Under the right DBMS, you could use a materialized view to store pre-joined rows. You should be able to simulate that under MySQL by maintaining a special table via triggers (or even manually).
Don't join a table unless you actually need its fields. List only the fields you need in the SELECT list (instead of blindly using *). The fastest operation is the one you don't have to do!
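Here is a minimal sketch of a covering index for one of the joined tables (the orders table and its columns are hypothetical, since the question doesn't show the schema):
-- If queries join orders via session_id and read only order_total, this
-- index covers them: both the join and the read are satisfied from the
-- index alone, with no lookups into the table rows.
CREATE INDEX idx_orders_session_total ON orders (session_id, order_total);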
And above all, measure on representative amounts of data! Possible results:
It's lightning fast. Yay!
It's slow, but it doesn't matter that it's slow (i.e. rarely used / not important).
It's slow and it matters that it's slow. Strap in, you have work to do!
Please post the query with its 11 joins, and its EXPLAIN output, in the original question when it is available. And be kind to your community: for every table involved, also post SHOW CREATE TABLE tblname and SHOW INDEX FROM tblname, to avoid additional requests for these 11 tables. Then we will know the scope of data and the cardinality involved for each indexed column.
Of course more joins kill performance, but it depends! If your data model is like that, then you can't help yourself here unless a complete data-model redesign happens.
Is it an online (real-time transactional) DB or an offline DB (data warehouse)?
If online, it is better to maintain a single table: keep the data in one table and let the columns grow in number.
If offline, it's better to maintain separate tables, because you are not going to need all the columns all the time.

MySQL temporary indexes for user-defined queries

I am building an analytics platform where users can create reports and such against a MySQL database. Some of the tables in this database are pretty huge (billions of rows), so for all of the features so far I have indexes built to speed up each query.
However, the next feature is to add the ability for a user to define their own query so that they can analyze data in ways that we haven't pre-defined. They have full read permission to the relevant database, so basically any SELECT query is a valid query for them to enter. This creates problems, however, if a query is defined that filters or joins on a column we haven't currently indexed - sometimes to the point of taking over a minute for a simple query to execute - something as basic as:
SELECT tbl1.a, tbl2.b, SUM(tbl3.c)
FROM
tbl1
JOIN tbl2 ON tbl1.id = tbl2.id
JOIN tbl3 ON tbl1.id = tbl3.id
WHERE
tbl1.d > 0
GROUP BY
tbl1.a, tbl1.b, tbl3.c, tbl1.d
Now, assume that we've only created indexes on columns not appearing in this query so far. Also, we don't want too many indexes slowing down inserts, updates, and deletes (otherwise the simple solution would be to build an index on every column accessible by the users).
My question is, what is the best way to handle this? Currently, I'm thinking that we should scan the query, build indexes on anything appearing in a WHERE or JOIN that isn't already indexed, execute the query, and then drop the indexes that were built afterwards. However, the main things I'm unsure about are a) is there already some best practice for this sort of use case that I don't know about? and b) would the overhead of building these indexes be enough that it would negate any performance gains the indexes provide?
If this strategy doesn't work, the next option I can see working is to collect statistics on what types of queries the users run, and have some regular job periodically check what commonly used columns are missing indexes and create them.
If you are using MyISAM, then performing an ALTER statement on large tables (billions of rows) in order to add an index will take a considerable amount of time, probably far longer than the one minute you've quoted for the statement above (and you'll need another ALTER to drop the index afterwards). During that time, the table will be locked, meaning other users can't execute their own queries.
If your tables use the InnoDB engine and you're running MySQL 5.1+, then CREATE / DROP index statements shouldn't lock the table, but it still may take some time to execute.
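For example (a sketch against the hypothetical tbl1 from the question):
-- On InnoDB with MySQL 5.1+, adding and dropping a secondary index does
-- not rebuild the whole table, though it can still take a while:
CREATE INDEX idx_tbl1_d ON tbl1 (d);
DROP INDEX idx_tbl1_d ON tbl1;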
There's a good rundown of the history of ALTER TABLE [here][1].
I'd also suggest that automated query analysis to identify and build indexes would be quite difficult to get right. For example, what about cases such as selecting by foo.a but ordering by foo.b? This kind of query often needs a covering index over multiple columns, otherwise you may find your server attempts a filesort on a huge result set, which can cause big problems.
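A sketch of the kind of composite index that example needs:
-- Filtering on foo.a and ordering by foo.b can both use this index,
-- letting MySQL avoid the filesort:
CREATE INDEX idx_foo_a_b ON foo (a, b);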
Giving your users an "explain query" option would be a good first step. If they know enough SQL to perform custom queries then they should be able to analyse EXPLAIN in order to best execute their query (or at least realise that a given query will take ages).
So, going further with my idea, I propose you segment your data into well-identified views. You used abstract names so I can't reuse your business model, but I'll take a made-up example.
Say you have 3 tables:
customer (gender, social category, date of birth, ...)
invoice (date, amount, ...)
product (price, date of creation, ...)
you would create some sort of materialized view for specific segments. It's like adding a business layer on top of the raw data-representation layer.
For example, we could identify the following segments:
seniors having at least 2 invoices
invoices of 2013 with more than 1 product
How to do that? And how to do that efficiently? Regular views won't help your problem because they will have poor explain plans on random queries. What we need is a real physical representation of these segments. We could do something like this:
CREATE TABLE MV_SENIORS_WITH_2_INVOICES AS
SELECT ... /* select from the existing tables */
;
/* add indexes: */
ALTER TABLE MV_SENIORS_WITH_2_INVOICES ADD CONSTRAINT...
... etc.
So now, your users just have to query MV_SENIORS_WITH_2_INVOICES instead of the original tables. Since there are fewer records, and probably more indexes, performance will be better.
We're done! Oh wait, no :-)
We need to refresh this data, a bit like a FAST REFRESH in Oracle. MySQL does not have (not that I know of... someone correct me?) a similar system, so we have to create some triggers for that.
CREATE TRIGGER ... AFTER INSERT ON `invoice`
FOR EACH ROW
... /* insert the data in MV_SENIORS_WITH_2_INVOICES if it matches the segment */
END;
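Concretely, such a trigger might look like this. This is only a sketch: the invoice.customer_id and invoice_count columns are my assumptions, the MV table is assumed to have a unique key on customer_id, and "senior" is taken as 65+ purely for illustration:
DELIMITER //
CREATE TRIGGER trg_refresh_seniors AFTER INSERT ON invoice
FOR EACH ROW
BEGIN
  -- Upsert the customer into the segment table once they qualify.
  INSERT INTO MV_SENIORS_WITH_2_INVOICES (customer_id, invoice_count)
  SELECT c.id, COUNT(i.id)
  FROM customer c
  JOIN invoice i ON i.customer_id = c.id
  WHERE c.id = NEW.customer_id
    AND c.date_of_birth <= DATE_SUB(CURDATE(), INTERVAL 65 YEAR)
  GROUP BY c.id
  HAVING COUNT(i.id) >= 2
  ON DUPLICATE KEY UPDATE invoice_count = VALUES(invoice_count);
END//
DELIMITER ;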
Now we're done!

Retrieve min and max values from different tables with the same structure

I have several log tables with the same structure. Each table is related to a site and contains billions of entries. The reason for this split is to keep queries quick and efficient, because 99.99% of queries are related to a single site.
But now I would like to retrieve the min and max values of a column across these tables.
I can't manage to write the SQL query. Should I use UNION?
I am just looking for the concept of the query, not the final SQL.
You could use a UNION, yes. Something like this should do:
SELECT MAX(PartialMax) AS TotalMax
FROM
( SELECT MAX(YourColumn) AS PartialMax FROM FirstTable
  UNION ALL
  SELECT MAX(YourColumn) AS PartialMax FROM SecondTable
) AS X;
If you have an index over the column you want to find a MAX inside, you should have very good performance as the query should seek to the end of the index on that column to find the maximum value very rapidly. Without an index on that column, the query has to scan the whole table to find the maximum value since nothing inherently orders it.
Added some details to address a concern about "enormous queries".
I'm not sure what you mean by "enormous". You could create a VIEW that does the UNIONs for you; then, you use the view and it will make the query very small:
SELECT MAX(YourColumn) FROM YourView;
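where YourView might be defined like this (a sketch using the same names as above):
CREATE VIEW YourView AS
SELECT YourColumn FROM FirstTable
UNION ALL
SELECT YourColumn FROM SecondTable;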
but that just optimizes for the size of your query's text. Why do you believe it is important to optimize for that? The VIEW can be helpful for maintenance -- if you add or remove a partition, just fix the view appropriately. But a long query text shouldn't really be a problem.
Or by "enormous", are you worried about the amount of I/O the query will do? Nothing can help that much, aside from making sure each table has an index on YourColumn so that maximum value on each partition can be found very quickly.

Why is MySQL so slow at processing "AND" statements

When I enter the query:
SELECT A.T1, B.T2 from A, B where A.T3 = B.T3 and A.T4 = B.T4
MySQL hangs. However, when I get rid of one of the constraints:
SELECT A.T1, B.T2 from A, B where A.T3 = B.T3
mysql returns the result almost immediately. Why is this?
You're doing this in a terribly slow and inefficient way. It has nothing to do with the AND clause itself, but the way your tables are designed and the way you are trying to perform your query.
Try something along the lines of:
SELECT A.T1, B.T2
FROM A
JOIN B ON B.T3 = A.T3 AND A.T4 = B.T4
Also, to further increase performance, put the table with the most columns in the FROM clause rather than in the JOIN.
Also, since it's such a large set of tables, you should consider indexing them.
CREATE INDEX index_name
ON A (T4);

CREATE INDEX index_name2
ON B (T4);
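Since the join filters on both T3 and T4, composite indexes are also worth trying (a sketch; check with EXPLAIN which variant your data favours):
CREATE INDEX idx_a_t3_t4 ON A (T3, T4);
CREATE INDEX idx_b_t3_t4 ON B (T3, T4);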
You only need to create an index once! You do not need to do this every time you run your query; generally, once you create the index you don't need to worry about it again.
More about indexing: http://en.wikipedia.org/wiki/Database_index
Most frequently when this happens it has nothing to do with the constraints themselves, but with the relationship of the constraints to the indexes. Take a look at the WHERE clause with respect to the defined indexes on the tables. The order of the WHERE clause can make a difference also.
If you can, try moving away from old SQL-89-style comma joins to the more readable and flexible SQL-92-style explicit JOINs. See madburn's answer.

Sorting items filtered by tag

I want to implement a very common feature - filtering some items by tag. There are many tutorials on the internet with examples of how to do it. The query is quite simple and fast (assuming proper indexes exist).
But usually the filtered items need to be sorted by some field. For example, when you filter questions by tag on SO, you get your results sorted.
To accomplish this task (assuming we need to sort by rating), one could write:
SELECT item.id FROM item
INNER JOIN taggeditem ON taggeditem.item_id = item.id
WHERE
taggeditem.tag_id = 1234
ORDER BY item.rating DESC
We have indexes (taggeditem.tag_id), (item.id), (item.rating)
The problem with this query is that MySQL can't use the index on item.rating, because the key used to fetch the rows is not the same as the one used in the ORDER BY (MySQL: ORDER BY Optimization). This leads to using a temporary table and a filesort, which in turn leads to slow execution times.
The solution I came up with is to denormalize sort field to the taggeditem table, so that I could create index (tag_id, item_rating) on taggeditem.
I've searched for similar questions at SO, and found only this one: Mysql slow query: INNER JOIN + ORDER BY causes filesort. The solution was the same.
So, I want to ask, is this a common solution to this problem? Is it a good practice to denormalize a bunch of sort fields to taggeditem, such as created, rating? At SO you can sort using 4 different parameters (newest, hot, votes, active) - does it mean that they denormalized fields which are used to sort results?
Are there any alternatives to this solution?
There is a standard alternative - change server system variables.
For example, you can experiment with the sort_buffer_size value (default 2MB).
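For example (a sketch; the right value depends on your memory and workload):
-- Experiment per-session so the whole server isn't affected:
SET SESSION sort_buffer_size = 8388608; -- 8 MB
-- then re-run the query and compare the timing and EXPLAIN output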
As soon as you use a JOIN and filter on the joined table, you're stuck with bad performance.
As you said, the only way to avoid this is to create a denormalized table.
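Following the question's own plan, the denormalization might look like this (a sketch; keeping item_rating in sync afterwards, e.g. with triggers, is left to you):
ALTER TABLE taggeditem ADD COLUMN item_rating INT;
UPDATE taggeditem
JOIN item ON item.id = taggeditem.item_id
SET taggeditem.item_rating = item.rating;
CREATE INDEX idx_tag_rating ON taggeditem (tag_id, item_rating);
-- now the filter and the sort are served by a single index:
SELECT item_id FROM taggeditem
WHERE tag_id = 1234
ORDER BY item_rating DESC;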
For SO's sorts, I think they have no such issue: they just have to sort answers by a column of the answers table (something like SELECT * FROM answers WHERE question_id = 1234 ORDER BY answer_date, with an index on (question_id, answer_date)).
I'm also looking for solutions like this for multi-valued columns, and that's really difficult (the denormalized data would be huge, as it would need to cross all values in the multi-valued columns).