So I have a 560 MB db with the largest table at 500 MB (over 10 million rows).
My query has to join 5 tables and takes about 10 seconds to finish...
SELECT DISTINCT trips.tripid AS tripid,
stops.stopdescrption AS "perron",
Date_format(segments.segmentstart, "%H:%i") AS "time",
Date_format(trips.tripend, "%H:%i") AS "arrival",
Upper(routes.routepublicidentifier) AS "lijn",
plcend.placedescrption AS "destination"
FROM calendar
JOIN trips
ON calendar.vsid = trips.vsid
JOIN routes
ON routes.routeid = trips.routeid
JOIN places plcstart
ON plcstart.placeid = trips.placeidstart
JOIN places plcend
ON plcend.placeid = trips.placeidend
JOIN segments
ON segments.tripid = trips.tripid
JOIN stops
ON segments.stopid = stops.stopid
WHERE stops.stopid IN ( 43914, 23899, 23925, 23908,
23913, 19899, 23871, 43902,
23876, 25563, 18956, 19912,
23889, 23861, 23879, 23884,
23856, 19920, 19898, 23916,
23894, 20985, 23930, 20932,
20986, 22434, 20021, 19893,
19903, 19707, 19935 )
AND calendar.vscdate = Str_to_date('25-10-2011', "%e-%c-%Y")
AND segments.segmentstart >= Str_to_date('15:56', "%H:%i")
AND routes.routeservicetype = 0
AND segments.segmentstart > "00:00:00"
ORDER BY segments.segmentstart
What are things I can do to speed this up? Any tips are welcome, I'm pretty new to SQL...
But I can't change the structure of the db because it's not mine...
Use EXPLAIN to find the bottlenecks: http://dev.mysql.com/doc/refman/5.0/en/explain.html
Then perhaps, add indexes.
If you don't need to select ALL rows, use LIMIT to limit the returned row count.
Just looking at the query, I would say that you should make sure that you have indexes on trips.vsid, calendar.vscdate, segments.segmentstart and routes.routeservicetype. I assume that there are already indexes on all the primary keys in the tables.
Using EXPLAIN as Briedis suggested would show you how well the indexes work.
You might want to add covering indexes for some tables, for example an index on trips.vsid that also includes tripid and routeid. That way the database can answer the query from the index alone and doesn't have to read from the actual table.
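For example, a sketch of such a covering index (the index name is just illustrative):
ALTER TABLE trips ADD INDEX idx_trips_vsid_cover (vsid, tripid, routeid);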
Edit:
The execution plan tells you that it successfully uses indexes for everything except the segments table, where it does a table scan and filters by the WHERE condition. You should try to make a covering index on segments.segmentstart that includes tripid and stopid.
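As DDL, that suggestion would look something like this (the index name is illustrative):
ALTER TABLE segments ADD INDEX idx_seg_start_cover (segmentstart, tripid, stopid);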
Try adding a clustered index to the routes table on both routeservicetype and routeid.
Depending on the distribution of values in the routeservicetype field, you may get an improvement by shrinking the amount of data being compared in the join to the trips table.
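Note that in MySQL only the InnoDB primary key is truly clustered; as an ordinary secondary index the suggestion would be something like this sketch:
ALTER TABLE routes ADD INDEX idx_routes_svc_id (routeservicetype, routeid);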
Looking at the explain plan, you may also want to force the sequence of the table usage by using STRAIGHT_JOIN instead of JOIN (or INNER JOIN), as I've had real improvements with this technique.
Essentially, put the table with the smallest row count of extracted data at the beginning of the query and the table with the largest row count at the end (in this case possibly the segments table?), with the exception of simple lookups (e.g. for descriptions).
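Trimmed to the first two joins of the query above, the per-join form looks roughly like this sketch (MySQL also accepts STRAIGHT_JOIN as a modifier directly after SELECT, which forces the listed order for every join in the statement):
SELECT DISTINCT trips.tripid
FROM calendar
STRAIGHT_JOIN trips
    ON calendar.vsid = trips.vsid
STRAIGHT_JOIN routes
    ON routes.routeid = trips.routeid
-- remaining joins, WHERE clause and ORDER BY unchanged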
You may also consider altering the WHERE clause to filter the segments table on stopid instead of the stops table, and creating a clustered index on the segments table on (stopid, tripid, segmentstart) - this index will effectively be able to satisfy two joins and two WHERE clauses from a single index...
To build the index...
ALTER TABLE segments ADD INDEX idx_qry_helper ( stopid, tripid, segmentstart );
And the altered WHERE clause...
WHERE segments.stopid IN ( 43914, 23899, 23925, 23908,
23913, 19899, 23871, 43902,
23876, 25563, 18956, 19912,
23889, 23861, 23879, 23884,
23856, 19920, 19898, 23916,
23894, 20985, 23930, 20932,
20986, 22434, 20021, 19893,
19903, 19707, 19935 )
At the end of the day, a 10-second response for what appears to be a complex query on a fairly large dataset isn't all that bad!
I have data for 15 million customers in 3 tables, with an index on each:
customers
Index ix_disabled_customer_id on zen_customers(customers_id, disabled);
customer_attribute
Index ix_attribute_id_and_name on zen_customers(attribute_id, attribute_name);
customer_attribute_value
Index ix_attribute_id_and_customer_id on `zen_customers`(customers_id, attribute_id);
I am trying to filter the customers by Gender and it takes too long to return the results.
Following is the query:
SELECT tcav.customers_id AS customers_id
FROM customer_attribute_value tcav
JOIN customer_attribute tca
JOIN customers zc
WHERE tcav.attribute_id = tca.attribute_id
AND tca.attribute_name = "Gender"
AND tcav.attribute_value = "M"
AND zc.customers_id = tcav.customers_id
AND zc.disabled = 0;
Image added for the EXPLAIN EXTENDED plan.
It would be really appreciated if I could get ideas to optimize this filtering. Thanks
First, using ON clauses instead of the WHERE clause for joining tables is recommended. It is unlikely to have any effect on performance, but it makes it much easier to see which columns relate which tables.
SELECT tcav.customers_id AS customers_id
FROM tulip_customer_attribute_value tcav
JOIN tulip_customer_attribute tca
ON tcav.attribute_id = tca.attribute_id
JOIN zen_customers zc
ON zc.customers_id = tcav.customers_id
WHERE tca.attribute_name = "Gender"
AND tcav.attribute_value = "M"
AND zc.disabled = 0
Add the following indexes:
tulip_customer_attribute (attribute_name,attribute_id)
tulip_customer_attribute_value (attribute_id,attribute_value,customers_id)
The order of the columns in the indexes is important.
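In DDL form those two suggestions would look something like this (index names are illustrative):
ALTER TABLE tulip_customer_attribute
    ADD INDEX idx_attr_name_id (attribute_name, attribute_id);
ALTER TABLE tulip_customer_attribute_value
    ADD INDEX idx_attrval_id_value_cust (attribute_id, attribute_value, customers_id);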
An EAV schema has many problems. In this case, you are expending a lot of space and time looking up the 'gender' when it could more efficiently be simply a column in the main table.
Your schema has made it immensely worse by normalizing out the values rather than putting them in the attribute table.
Follow the tag [entity-attribute-value] for further enlightenment.
Until you seriously revamp the schema, the performance will go from bad to terrible as the data grows.
I have a problem with a query for a web site. This is the situation:
I have 3 table:
articoli = all the articles
clasart = all the matches between article code and class code - 32314 rows
classificazioni = all the matches between class code and class name - 2401 rows
and this is the query:
SELECT a.clar_classi , b.CLA_DESCRI
FROM clasart a JOIN (
SELECT art.AI_CAPOCODI, art.ai_codirest
FROM (select * from clasart where clar_azienda = 'SRL') a
JOIN (
SELECT AI_CAPOCODI, AI_CODIREST,AI_DT_CREAZ,
AI_DESCRIZI, AI_CATEMERC, concat(AI_CAPOCODI, AI_CODIREST) as codice, aI_grupscon
FROM articoli
WHERE AI_AZIENDA = 'SRL' AND AI_CATEMERC LIKE '0101______' AND AI_FLAG_NOW = 0 AND AI_CAPOCODI <> 'zzz'
) art ON trim(a.CLAR_ARTICO) = art.AI_CODIREST
JOIN classificazioni b ON a.CLAR_CLASSI = b.CLA_CODICE
WHERE b.CLA_CODICE LIKE 'AA51__'
group by CLAR_ARTICO) art ON trim(CLAR_ARTICO) = concat(art.AI_CAPOCODI, art.ai_codirest)
JOIN classificazioni b ON a.CLAR_CLASSI = b.CLA_CODICE
WHERE CLAR_AZIENDA = 'SRL' AND CLAR_CLASSI like 'CO____'
The query takes 16 seconds to run; the time goes up to 16 seconds when joining with classificazioni.
Can you help me? Thanks
Introduce the following indexes using the queries below; after that the query should run within a second or two:
ALTER TABLE articoli ADD INDEX idx_artc_az_cat_flg_cap (AI_AZIENDA, AI_FLAG_NOW, AI_CAPOCODI, AI_CATEMERC);
The above query introduces a multi-column index on the articoli table. An index works somewhat like a hash table or array keys, letting the engine identify the matching row(s) directly; a multi-column index means fewer rows have to be compared.
Do not use trim(a.CLAR_ARTICO): make sure the values are trimmed before insertion, not at join time. Wrapping the column in a function prevents the index from being used, and the join comparison becomes expensive this way.
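If the data allows it, a one-time cleanup along these lines (a sketch, assuming leading/trailing spaces in CLAR_ARTICO carry no meaning) removes the need for trim() at join time:
UPDATE clasart SET CLAR_ARTICO = TRIM(CLAR_ARTICO);
After that the join can compare the columns directly (ON a.CLAR_ARTICO = art.AI_CODIREST) and use the index.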
Let's move to next steps:
Introduce index on clar_azienda using following query:
ALTER TABLE clasart ADD INDEX idx_cls_az (clar_azienda);
If art.AI_CODIREST is not a primary/foreign key, you'll need to introduce an index on it as well. The same goes for the join column on classificazioni:
ALTER TABLE classificazioni ADD INDEX idx_clsi_cd (CLA_CODICE);
We are almost done: you'll just need to index CLAR_CLASSI as well, the same way I indexed the columns above. Let me also spell out what is what in an index statement so you can write your own.
ALTER TABLE <tableName> ADD INDEX <indexName> (<column to be indexed>);
Let me know if you still have issues; remember you can run these queries after selecting your database in phpMyAdmin (SQL tab) or on the mysql console.
How do I figure out which columns to index?
SELECT a.ORD_ID AS Manual_Added_Orders,
a.ORD_poOrdID_List AS Auto_Added_Orders,
a.ORDPOITEM_ModelNumber,
a.ORDPO_Number,
a.ORDPOITEM_ID,
(SELECT sum(ORDPOITEM_Qty) AS ORDPOITEM_Qty
FROM orderpoitems
WHERE ORDPOITEM_ModelNumber = a.ORDPOITEM_ModelNumber
AND ORDPO_Number = 123007)
AS ORDPOITEM_Qty,
a.ORDPO_TrackingNumber,
a.ORDPOITEM_Received,
a.ORDPOITEM_ReceivedQty,
a.ORDPOITEM_ReceivedBy,
b.ORDPO_ID
FROM orderpoitems a
LEFT JOIN orderpo b ON (a.ORDPO_Number = b.ORDPO_Number)
WHERE a.ORDPO_Number = 123007
GROUP BY a.ORDPOITEM_ModelNumber
ORDER BY a.ORD_poOrdID_List, a.ORD_ID
I did the EXPLAIN; that is how I am getting these pictures... I added a few indexes... still not looking good.
Well firstly your query could be simplified to:
SELECT a.ORD_ID AS Manual_Added_Orders,
a.ORD_poOrdID_List AS Auto_Added_Orders,
a.ORDPOITEM_ModelNumber,
a.ORDPO_Number,
a.ORDPOITEM_ID,
SUM(ORDPOITEM_Qty) AS ORDPOITEM_Qty,
a.ORDPO_TrackingNumber,
a.ORDPOITEM_Received,
a.ORDPOITEM_ReceivedQty,
a.ORDPOITEM_ReceivedBy,
b.ORDPO_ID
FROM orderpoitems a
LEFT JOIN orderpo b ON (a.ORDPO_Number = b.ORDPO_Number)
WHERE a.ORDPO_Number = 123007
GROUP BY a.ORDPOITEM_ModelNumber
ORDER BY a.ORD_poOrdID_List, a.ORD_ID
Secondly, I would start by creating an index on orderpoitems.ORDPO_Number and orderpo.ORDPO_Number.
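As a sketch, those indexes could be created like this (index names are illustrative):
ALTER TABLE orderpoitems ADD INDEX idx_poitems_ponumber (ORDPO_Number);
ALTER TABLE orderpo ADD INDEX idx_po_ponumber (ORDPO_Number);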
Bit hard to say without the table structures.
Read up on indexes and covering indexes.
From what you have, start with the columns in your WHERE clause AND the join criteria to other tables. Also include, if possible and practical, the columns used in GROUP BY / ORDER BY, as ORDER BY is typically a killer when finishing a query.
That said, I would have an index on your OrderPOItems table on
( ordpo_number, orderpoitem_ModelNumber, ord_poordid_list, ord_id )
That way, the FIRST element hits your WHERE clause, the next column matches your data grouping, and the final columns serve your ORDER BY. The joins and qualifying components can then be "covered" from the index alone, without having to go to the raw data pages for the rest of the columns being returned. Hopefully a good jump start specific to your scenario.
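In DDL form, that suggestion would be something like this sketch (the index name is illustrative):
ALTER TABLE orderpoitems
    ADD INDEX idx_poitems_cover (ordpo_number, orderpoitem_ModelNumber, ord_poordid_list, ord_id);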
I have a query on a fact table "foo_success" in a star schema, which has about 6 million rows. This table holds (integer) references to dimension tables and nothing else. We use MyISAM as the storage engine.
The query:
SELECT
hierarchy.level0name,
hierarchy.level1name,
hierarchy.level0,
hierarchy.level1,
date.date,
address.city,
user.emailAddress,
foo_object.name,
foo_object.type,
user_group.groupId,
COUNT(user.id) AS count_user_id,
SUM(foo_object_statistic.passes) AS sum_foo_object_statistic_passes,
SUM(foo_object_statistic.starts) AS sum_foo_object_statistic_starts,
SUM(foo_object_statistic.calls) AS sum_foo_object_statistic_calls
FROM
foo_success,
user,
user_group,
address,
hierarchy,
foo_object,
foo_object_statistic,
date
WHERE (foo_success.userDimensionId = user.id)
AND (foo_success.userGroupDimensionId = user_group.id)
AND (foo_success.addressDimensionId = address.id)
AND (foo_success.hierarchyDimensionId = hierarchy.id)
AND (foo_success.fooObjectDimensionId = foo_object.id)
AND (foo_success.fooObjectStatisticDimensionId = foo_object_statistic.id)
AND (foo_success.dateDimensionId=date.id)
AND hierarchy.level0 = 'XYZ'
AND hierarchy.level1 IS NOT NULL
AND hierarchy.level2 IS NOT NULL
AND hierarchy.level3 IS NOT NULL
AND hierarchy.level4 IS NOT NULL
AND hierarchy.level5 IS NOT NULL
AND hierarchy.level6 IS NULL
AND hierarchy.level7 IS NULL
GROUP BY hierarchy.level0, foo_object.fooObjectId
LIMIT 0, 25;
What I've tried so far:
This is the simple comma-join version, which equals the INNER JOIN alternative in speed.
There are indices on all fields which are joined or which are part of a condition.
I did use EXPLAIN on this query and found that the query cost (# of processed rows) is 128596 for the table user and 77 for the table foo_success.
I tried to remove the dependency on the user table, which leads to a # of processed rows of over 6 million in the fact table foo_success.
It takes about 1.5 minutes to finish this query, which is far off my expectations for a data warehouse star schema optimized for read speed. Is there any way I can optimize this monster?
The inefficiency of the query mostly comes from transferring a lot of data you do not actually use: the fields hierarchy.level1name, hierarchy.level0name, hierarchy.level1, date.date, address.city, user.emailAddress, foo_object.name, foo_object.type and user_group.groupId are not included in the GROUP BY clause, which means the information is retrieved for each row, loaded into memory and then just discarded.
What I would recommend is to concentrate the retrieval of all the necessary ids and the aggregation results in a subquery, and then join the remaining tables to it, so that each join produces no more than a single row (you can even move the LIMIT clause into the subquery to minimize the subsequent JOIN operations). After that, you may discover that you are missing some useful indexes.
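A rough sketch of that shape, trimmed to the hierarchy dimension to keep it short (aliases are illustrative, and the remaining dimensions would join to the subquery in the same way):
SELECT h.level0name,
       h.level1name,
       agg.count_user_id,
       agg.sum_passes,
       agg.sum_starts,
       agg.sum_calls
FROM (
    -- aggregate on the fact table first, carrying only the ids needed later
    SELECT fs.hierarchyDimensionId,
           COUNT(fs.userDimensionId) AS count_user_id,
           SUM(fos.passes)           AS sum_passes,
           SUM(fos.starts)           AS sum_starts,
           SUM(fos.calls)            AS sum_calls
    FROM foo_success fs
    JOIN foo_object_statistic fos ON fs.fooObjectStatisticDimensionId = fos.id
    JOIN foo_object fo            ON fs.fooObjectDimensionId = fo.id
    JOIN hierarchy hi             ON fs.hierarchyDimensionId = hi.id
    WHERE hi.level0 = 'XYZ'
    GROUP BY hi.level0, fo.fooObjectId
    LIMIT 0, 25
) agg
JOIN hierarchy h ON h.id = agg.hierarchyDimensionId;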
How would I make this query run faster...?
SELECT account_id,
account_name,
account_update,
account_sold,
account_mds,
ftp_url,
ftp_livestatus,
number_digits,
number_cw,
client_name,
ppc_status,
user_name
FROM
Accounts,
FTPDetails,
SiteNumbers,
Clients,
PPC,
Users
WHERE Accounts.account_id = FTPDetails.ftp_accountid
AND Accounts.account_id = SiteNumbers.number_accountid
AND Accounts.account_client = Clients.client_id
AND Accounts.account_id = PPC.ppc_accountid
AND Accounts.account_designer = Users.user_id
AND Accounts.account_active = 'active'
AND FTPDetails.ftp_active = 'active'
AND SiteNumbers.number_active = 'active'
AND Clients.client_active = 'active'
AND PPC.ppc_active = 'active'
AND Users.user_active = 'active'
ORDER BY
Accounts.account_update DESC
Thanks in advance :)
EXPLAIN query results (posted as an image):
I don't really have any foreign keys set up... I was trying to avoid making alterations to the database as it will have to undergo a complete overhaul soon.
The only primary keys are the ids of each table, e.g. account_id, ftp_id, ppc_id...
Indexes
You need - at least - an index on every field that is used in a JOIN condition.
Indexes on the fields that appear in WHERE or GROUP BY or ORDER BY clauses are useful most of the time, too.
When two or more fields of a table are used in JOINs (or WHERE or GROUP BY or ORDER BY), a compound (combined) index on these (two or more) fields may be better than separate indexes. For example, in the SiteNumbers table, possible indexes are the compound (number_accountid, number_active) or (number_active, number_accountid).
Conditions on fields that are Boolean (ON/OFF, active/inactive) sometimes slow queries down, as the indexes are not selective and thus not very helpful. Restructuring (further normalizing) the tables is an option in that case, but you can probably avoid the added complexity.
Besides the usual advice (examine the EXPLAIN plan, add indexes where needed, test variations of the query),
I notice that in your query there is a partial Cartesian product. The table Accounts has a one-to-many relationship to each of the three tables FTPDetails, SiteNumbers and PPC. This has the effect that if you have, for example, 1000 accounts, and every account is related to, say, 10 FTPDetails, 20 SiteNumbers and 3 PPCs, the query will return 600 rows for every account (the product of 10x20x3), in total 600K rows in which much of the data is duplicated.
You could instead split the query into three plus one for the base data (Accounts and the remaining lookup tables). That way, only 34K rows of data (with smaller row length) would be transferred:
Accounts JOIN Clients JOIN Users
(with all fields needed from these tables)
1K rows
Accounts JOIN FTPDetails
(with Accounts.account_id and all fields from FTPDetails)
10K rows
Accounts JOIN SiteNumbers
(with Accounts.account_id and all fields from SiteNumbers)
20K rows
Accounts JOIN PPC
(with Accounts.account_id and all fields from PPC)
3K rows
and then combine the data from the 4 queries on the client side to show the full picture.
I would add the following indexes (DDL sketched after the list):
Table Accounts
index on (account_designer)
index on (account_client)
index on (account_active, account_id)
index on (account_update)
Table FTPDetails
index on (ftp_active, ftp_accountid)
Table SiteNumbers
index on (number_active, number_accountid)
Table PPC
index on (ppc_active, ppc_accountid)
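In DDL form (index names are illustrative):
ALTER TABLE Accounts ADD INDEX idx_acc_designer (account_designer);
ALTER TABLE Accounts ADD INDEX idx_acc_client (account_client);
ALTER TABLE Accounts ADD INDEX idx_acc_active_id (account_active, account_id);
ALTER TABLE Accounts ADD INDEX idx_acc_update (account_update);
ALTER TABLE FTPDetails ADD INDEX idx_ftp_active_acc (ftp_active, ftp_accountid);
ALTER TABLE SiteNumbers ADD INDEX idx_num_active_acc (number_active, number_accountid);
ALTER TABLE PPC ADD INDEX idx_ppc_active_acc (ppc_active, ppc_accountid);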
Use EXPLAIN to find out which index could be used and which index is actually used, and create an appropriate index if necessary.
If FTPDetails.ftp_active only has the two valid entries 'active' and 'inactive', use BOOL as the data type.
As a side note: I strongly suggest using explicit joins instead of implicit ones:
SELECT
account_id, account_name, account_update, account_sold, account_mds,
ftp_url, ftp_livestatus,
number_digits, number_cw,
client_name,
ppc_status,
user_name
FROM Accounts
INNER JOIN FTPDetails
ON Accounts.account_id = FTPDetails.ftp_accountid
AND FTPDetails.ftp_active = 'active'
INNER JOIN SiteNumbers
ON Accounts.account_id = SiteNumbers.number_accountid
AND SiteNumbers.number_active = 'active'
INNER JOIN Clients
ON Accounts.account_client = Clients.client_id
AND Clients.client_active = 'active'
INNER JOIN PPC
ON Accounts.account_id = PPC.ppc_accountid
AND PPC.ppc_active = 'active'
INNER JOIN Users
ON Accounts.account_designer = Users.user_id
AND Users.user_active = 'active'
WHERE Accounts.account_active = 'active'
ORDER BY Accounts.account_update DESC
This makes the query much more readable because the join condition is close to the name of the table that is being joined.
EXPLAIN, then benchmark different options. For starters, I'm sure that several smaller queries will be faster than this monster. First, because the query optimiser will spend a lot of time examining what join order is best (5! = 120 possibilities). And second, queries like SELECT ... WHERE ...active = 'active' will be cached (though this depends on the amount of data changes).
One of your main problems is here: x.y_active = 'active'
Problem: low cardinality
The active field is a boolean field with 2 possible values; as such it has very low cardinality.
MySQL (or any SQL engine, for that matter) will not use an index when 30% or more of the rows have the same value.
Forcing the index is useless because it will make your query slower, not faster.
Solution: partition your tables
A solution is to partition your tables on the active column.
This will exclude all non-active rows from consideration and will make the select act as if you actually had a working index on the xxx_active fields.
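A minimal sketch, assuming the flag is stored as a TINYINT (1 = active, 0 = inactive) and that ftp_active is part of every unique key on the table, which MySQL partitioning requires:
ALTER TABLE FTPDetails
PARTITION BY LIST (ftp_active) (
    PARTITION p_active VALUES IN (1),
    PARTITION p_inactive VALUES IN (0)
);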
Sidenote
Please don't ever use implicit WHERE joins; they are much too error-prone and confusing to be useful.
Use a syntax like Oswald's answer instead.
Links:
Cardinality: http://en.wikipedia.org/wiki/Cardinality_(SQL_statements)
Cardinality and indexes: http://www.bennadel.com/blog/1424-Exploring-The-Cardinality-And-Selectivity-Of-SQL-Conditions.htm
MySQL partitioning: http://dev.mysql.com/doc/refman/5.5/en/partitioning.html