I have a query on a fact table "foo_success" in a star schema, which has about 6 million rows. The table holds only (integer) references to dimension tables. We use MyISAM as the storage engine.
The query:
SELECT
hierarchy.level0name,
hierarchy.level1name,
hierarchy.level0,
hierarchy.level1,
date.date,
address.city,
user.emailAddress,
foo_object.name,
foo_object.type,
user_group.groupId,
COUNT(user.id) AS count_user_id,
SUM(foo_object_statistic.passes) AS sum_foo_object_statistic_passes,
SUM(foo_object_statistic.starts) AS sum_foo_object_statistic_starts,
SUM(foo_object_statistic.calls) AS sum_foo_object_statistic_calls
FROM
foo_success,
user,
user_group,
address,
hierarchy,
foo_object,
foo_object_statistic,
date
WHERE (foo_success.userDimensionId = user.id)
AND (foo_success.userGroupDimensionId = user_group.id)
AND (foo_success.addressDimensionId = address.id)
AND (foo_success.hierarchyDimensionId = hierarchy.id)
AND (foo_success.fooObjectDimensionId = foo_object.id)
AND (foo_success.fooObjectStatisticDimensionId = foo_object_statistic.id)
AND (foo_success.dateDimensionId=date.id)
AND hierarchy.level0 = 'XYZ'
AND hierarchy.level1 IS NOT NULL
AND hierarchy.level2 IS NOT NULL
AND hierarchy.level3 IS NOT NULL
AND hierarchy.level4 IS NOT NULL
AND hierarchy.level5 IS NOT NULL
AND hierarchy.level6 IS NULL
AND hierarchy.level7 IS NULL
GROUP BY hierarchy.level0, foo_object.fooObjectId
LIMIT 0, 25;
What I've tried so far:
This is the implicit-join version; the explicit INNER JOIN alternative runs at the same speed.
There are indices on all fields which are joined or which are part of a condition.
I ran EXPLAIN on this query and found that the query cost (number of processed rows) is 128,596 for the table user and 77 for the table foo_success.
I tried to remove the dependency on the user table, which pushes the number of processed rows in the fact table foo_success to over 6 million.
The query takes about 1.5 minutes to finish, which falls far short of my expectations for a data warehouse star schema optimized for read speed. Is there any way I can optimize this monster?
Most of the query's inefficiency comes from transferring a lot of data you never actually use: the fields hierarchy.level0name, hierarchy.level1name, hierarchy.level1, date.date, address.city, user.emailAddress, foo_object.name, foo_object.type and user_group.groupId are not in the GROUP BY clause, so their values are retrieved for every row, loaded into memory, and then simply discarded.
What I would recommend is to gather all the required ids and the aggregation results in a subquery first, and only then join to the remaining tables, so that each join produces no more than a single row per result row (you can even move the LIMIT clause into the subquery to minimize the subsequent JOIN work). After doing that, you may discover that you are missing some useful indexes.
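A sketch of that restructuring, using the table and column names from the query above (untested; it preserves the MySQL-specific loose GROUP BY behaviour of the original, and the exact join list may need adjusting for your schema):

```sql
-- Aggregate over the narrow fact table first, so the LIMIT
-- applies before any wide dimension columns are fetched.
SELECT
    h.level0name, h.level1name, h.level0, h.level1,
    d.date, a.city, u.emailAddress,
    fo.name, fo.type, ug.groupId,
    agg.count_user_id,
    agg.sum_passes, agg.sum_starts, agg.sum_calls
FROM (
    SELECT
        fs.userDimensionId, fs.userGroupDimensionId,
        fs.addressDimensionId, fs.hierarchyDimensionId,
        fs.fooObjectDimensionId, fs.dateDimensionId,
        COUNT(fs.userDimensionId) AS count_user_id,
        SUM(fos.passes) AS sum_passes,
        SUM(fos.starts) AS sum_starts,
        SUM(fos.calls)  AS sum_calls
    FROM foo_success fs
    JOIN hierarchy h
      ON fs.hierarchyDimensionId = h.id
    JOIN foo_object fo
      ON fs.fooObjectDimensionId = fo.id
    JOIN foo_object_statistic fos
      ON fs.fooObjectStatisticDimensionId = fos.id
    WHERE h.level0 = 'XYZ'
      -- ... plus the remaining hierarchy.levelN IS [NOT] NULL
      --     conditions exactly as in the original query
    GROUP BY h.level0, fo.fooObjectId
    LIMIT 0, 25
) AS agg
JOIN hierarchy  h  ON agg.hierarchyDimensionId = h.id
JOIN user       u  ON agg.userDimensionId      = u.id
JOIN user_group ug ON agg.userGroupDimensionId = ug.id
JOIN address    a  ON agg.addressDimensionId   = a.id
JOIN foo_object fo ON agg.fooObjectDimensionId = fo.id
JOIN date       d  ON agg.dateDimensionId      = d.id;
```

Because the subquery groups and limits to 25 rows first, each outer join then touches at most 25 rows instead of millions.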
Related
I have data for 15 million customers spread over 3 tables, each with an index:
customers
Index ix_disabled_customer_id on zen_customers(customers_id, disabled);
customer_attribute
Index ix_attribute_id_and_name on zen_customers (attribute_id, attribute_name);
customer_attribute_value
Index ix_attribute_id_and_customer_id on zen_customers (customers_id, attribute_id);
I am trying to filter the customers using Gender and it takes too long to return the results.
Following is the query
SELECT tcav.customers_id AS customers_id
FROM customer_attribute_value tcav
JOIN customer_attribute tca
JOIN customers zc
WHERE tcav.attribute_id = tca.attribute_id
AND tca.attribute_name = "Gender"
AND tcav.attribute_value = "M"
AND zc.customers_id = tcav.customers_id
AND zc.disabled = 0;
(An image of the EXPLAIN EXTENDED plan was attached.)
I would really appreciate any ideas to optimize this filtering. Thanks
First, it is recommended to use ON clauses rather than the WHERE clause for the join conditions. This is unlikely to have any effect on performance, but it makes it much easier to see which columns relate which tables.
SELECT tcav.customers_id AS customers_id
FROM tulip_customer_attribute_value tcav
JOIN tulip_customer_attribute tca
ON tcav.attribute_id = tca.attribute_id
JOIN zen_customers zc
ON zc.customers_id = tcav.customers_id
WHERE tca.attribute_name = "Gender"
AND tcav.attribute_value = "M"
AND zc.disabled = 0
Add the following indexes:
tulip_customer_attribute (attribute_name,attribute_id)
tulip_customer_attribute_value (attribute_id,attribute_value,customers_id)
The order of the columns in the indexes is important.
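In DDL form, the suggested indexes could be created like this (index names are illustrative; the table names follow the rewritten query above):

```sql
ALTER TABLE tulip_customer_attribute
  ADD INDEX ix_name_id (attribute_name, attribute_id);

ALTER TABLE tulip_customer_attribute_value
  ADD INDEX ix_id_value_customer (attribute_id, attribute_value, customers_id);
```

The second index is covering for this query: MySQL can resolve the attribute_id/attribute_value filter and return customers_id straight from the index, without touching the table rows.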
EAV schema has many problems. In this case, you are expending a lot of space and time looking up the 'gender' when it could more efficiently be simply in the main table.
Your schema has made it immensely worse by normalizing out the values, rather than putting them in the attribute table.
Follow the tag [entity-attribute-value] for further enlightenment.
Until you seriously revamp the schema, the performance will go from bad to terrible as the data grows.
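For illustration, here is a hypothetical sketch of moving gender into the main table (column and index names are assumptions; the backfill runs once, after which the filter becomes a single indexed lookup):

```sql
-- One-time schema change plus backfill from the EAV tables.
ALTER TABLE zen_customers
  ADD COLUMN gender CHAR(1) NULL,
  ADD INDEX ix_gender_disabled (gender, disabled);

UPDATE zen_customers zc
JOIN customer_attribute_value tcav
  ON tcav.customers_id = zc.customers_id
JOIN customer_attribute tca
  ON tca.attribute_id = tcav.attribute_id
 AND tca.attribute_name = 'Gender'
SET zc.gender = tcav.attribute_value;

-- The original three-table filter collapses to:
SELECT customers_id
FROM zen_customers
WHERE gender = 'M' AND disabled = 0;
```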
I have a database that holds the details of various different events and all of the odds that bookmakers are offering on those events. I have the following query which I am using to get the best odds for each different type of bet for each event:
SELECT
eo1.id,
eo1.event_id,
eo1.market_id,
IF(markets.display_name IS NULL, markets.name, markets.display_name) AS market_name,
IF(market_values.display_name IS NULL, market_values.name, market_values.display_name) AS market_value_name,
eo2.bookmaker_id,
eo2.value
FROM event_odds AS eo1
JOIN markets ON eo1.market_id = markets.id AND markets.enabled = 1
JOIN market_values on eo1.market_value_id = market_values.id
JOIN bookmakers on eo1.bookmaker_id = bookmakers.id AND bookmakers.enabled = 1
JOIN event_odds AS eo2
ON
eo1.event_id = eo2.event_id
AND eo1.market_id = eo2.market_id
AND eo1.market_value_id = eo2.market_value_id
AND eo2.value = (
SELECT MAX(value)
FROM event_odds
WHERE event_odds.event_id = eo1.event_id
AND event_odds.market_id = eo1.market_id
AND event_odds.market_value_id = eo1.market_value_id
)
WHERE eo1.`event_id` = 6708
AND markets.name != '-'
GROUP BY eo1.market_id, eo1.market_value_id
ORDER BY markets.sort_order, market_name, market_values.id
This returns exactly what I want however since the database has grown in size it's started to be very slow. I currently have just over 500,000 records in the event odds table and the query takes almost 2 minutes to run. The hardware is decent spec, all of the columns are indexed correctly and the table engine being used is MyISAM for all tables. How can I optimise this query so it runs quicker?
For this query, you want to be sure you have an index on event_odds(event_id, market_id, market_value_id, value).
In addition, you want indexes on:
markets(id, enabled, name)
bookmakers(id, enabled)
Note that composite indexes are quite different from multiple indexes with one column each.
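As DDL, those composite indexes would look something like this (index names are illustrative):

```sql
ALTER TABLE event_odds
  ADD INDEX ix_event_market_value
      (event_id, market_id, market_value_id, value);

ALTER TABLE markets
  ADD INDEX ix_id_enabled_name (id, enabled, name);

ALTER TABLE bookmakers
  ADD INDEX ix_id_enabled (id, enabled);
```

The first index matters most here: it lets both the self-join and the correlated MAX(value) subquery be answered from the index alone, since value is the last column after the three equality columns.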
Create a MySQL view for this SQL and fetch the data from that view instead; this can increase speed and reduce complexity. Use pagination (LIMIT) for listings, which will also reduce the load on the server. Also try adding indexes on the commonly queried columns.
I am trying to optimize this query as much as possible, but I am still getting query locks because of it. Can anyone provide some suggestions for improving it? The query fetches the last day's entries from the table.
The QUERY:
SELECT CR.id,
CR.servicecode,
CR.leadtime,
CR.redirecturl,
CRE.custemail,
CRE.custlname,
CRE.custfname,
CRE.duration,
CR.userid,
AA.lpintrotimearr,
AA.lpintrotimedep,
AA.landdatetimearr,
AA.landdatetimedep,
CR.newcustid,
CRE.custmobilephone,
CRE.brandname
FROM response CR
LEFT JOIN agreement AA
ON CR.id = AA.id
LEFT JOIN request CRE
ON CRE.id = CR.id
WHERE CR.id > '20120617145243'
AND CR.approved = 1
AND CR.chlapproved != 0
AND CR.chlapproved IS NOT NULL
AND AA.id IS NOT NULL
AND ( AA.stdsign != 'on'
OR AA.stdsign IS NULL )
AND ( AA.ivaflag = 0
OR AA.ivaflag IS NULL )
AND ( AA.opt IS NULL
OR AA.opt = 0 );
The EXPLAIN:
One way is to index all three columns (AA.stdsign, AA.ivaflag and AA.opt), but each of these flags can take only three different values. Will indexing them reduce the query's run time?
All the ids are of the VARCHAR(60) data type.
There isn't much to be improved on the query itself.
On the other hand, setting an index on AA.stdsign, AA.ivaflag and AA.opt should help a lot.
As your EXPLAIN indicates, no suitable key is found for your AA table and all 534956 rows must be scanned to satisfy the WHERE clause.
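A sketch of such an index (the name is illustrative; a single composite index is usually preferable to three single-column ones, though the OR ... IS NULL conditions will limit how many of its columns MySQL can actually use):

```sql
ALTER TABLE agreement
  ADD INDEX ix_agreement_flags (stdsign, ivaflag, opt);
```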
[edit]
One last tip: using large column types (such as VARCHAR(60)) for your primary keys is probably sub-optimal.
First reason: every time you need to reference a row (e.g. in a foreign key), you need another VARCHAR(60).
Second reason: comparisons on strings are slower than on integers (hence it may render a JOIN slower than necessary).
You may want to add an INT column to your tables, and use it as primary key.
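A hypothetical migration sketch for one table (index name is an assumption; the same pattern applies to the other tables, after which foreign-key columns can be switched over to the new INT ids):

```sql
-- Add an INT surrogate key alongside the existing VARCHAR(60) id.
-- AUTO_INCREMENT requires the column to be indexed, so the unique
-- key is added in the same statement.
ALTER TABLE response
  ADD COLUMN int_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD UNIQUE KEY ux_response_int_id (int_id);
```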
So I have a 560 MB database whose largest table is 500 MB (over 10 million rows).
My query has to join 5 tables and takes about 10 seconds to finish.
SELECT DISTINCT trips.tripid AS tripid,
stops.stopdescrption AS "perron",
Date_format(segments.segmentstart, "%H:%i") AS "time",
Date_format(trips.tripend, "%H:%i") AS "arrival",
Upper(routes.routepublicidentifier) AS "lijn",
plcend.placedescrption AS "destination"
FROM calendar
JOIN trips
ON calendar.vsid = trips.vsid
JOIN routes
ON routes.routeid = trips.routeid
JOIN places plcstart
ON plcstart.placeid = trips.placeidstart
JOIN places plcend
ON plcend.placeid = trips.placeidend
JOIN segments
ON segments.tripid = trips.tripid
JOIN stops
ON segments.stopid = stops.stopid
WHERE stops.stopid IN ( 43914, 23899, 23925, 23908,
23913, 19899, 23871, 43902,
23876, 25563, 18956, 19912,
23889, 23861, 23879, 23884,
23856, 19920, 19898, 23916,
23894, 20985, 23930, 20932,
20986, 22434, 20021, 19893,
19903, 19707, 19935 )
AND calendar.vscdate = Str_to_date('25-10-2011', "%e-%c-%Y")
AND segments.segmentstart >= Str_to_date('15:56', "%H:%i")
AND routes.routeservicetype = 0
AND segments.segmentstart > "00:00:00"
ORDER BY segments.segmentstart
What can I do to speed this up? Any tips are welcome; I'm pretty new to SQL.
I can't change the structure of the database, though, because it's not mine.
Use EXPLAIN to find the bottlenecks: http://dev.mysql.com/doc/refman/5.0/en/explain.html
Then perhaps, add indexes.
If you don't need to select ALL rows, use LIMIT to limit returned result count.
Just looking at the query, I would say that you should make sure that you have indexes on trips.vsid, calendar.vscdate, segments.segmentstart and routes.routeservicetype. I assume that there is already indexes on all the primary keys in the tables.
Using explain as Briedis suggested would show you how well the indexes work.
You might want to add covering indexes for some tables, like for example an index on trips.vsid where tripid and routeid are included. That way the database can use only the index for the data that is needed from the table, and not read from the actual table.
Edit:
The execution plan tells you that it successfully uses indexes for everything except the segments table, where it does a table scan and filters by the where condition. You should try to make a covering index for segments.segmentstart by including tripid and stopid.
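As DDL, that covering index might look like this (the name is illustrative; since segmentstart is a range predicate, it is worth experimenting with the column order as well):

```sql
ALTER TABLE segments
  ADD INDEX ix_segments_covering (segmentstart, tripid, stopid);
```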
Try adding a clustered index to the routes table on both routeservicetype and routeid.
Depending on the frequency of the data within the routeservicetype field, you may get an improvement by shrinking the amount of data being compared in the join to the trips table.
Looking at the explain plan, you may also want to force the sequence of the table usage by using STRAIGHT_JOIN instead of JOIN (or INNER JOIN), as I've had real improvements with this technique.
Essentially, put the table with the smallest row-count of extracted data at the beginning of the query, and the largest row count table at the end (in this case possibly the segments table?), with the exception of simple lookups (eg. for descriptions).
You may also consider altering the WHERE clause to filter the segments table on stopid instead of the stops table, and creating a clustered index on the segments table on (stopid, tripid and segmentstart) - this index will be effectively able to satisfy two joins and two where clauses from a single index...
To build the index...
ALTER TABLE segments ADD INDEX idx_qry_helper ( stopid, tripid, segmentstart );
And the altered WHERE clause...
WHERE segments.stopid IN ( 43914, 23899, 23925, 23908,
23913, 19899, 23871, 43902,
23876, 25563, 18956, 19912,
23889, 23861, 23879, 23884,
23856, 19920, 19898, 23916,
23894, 20985, 23930, 20932,
20986, 22434, 20021, 19893,
19903, 19707, 19935 )
:
:
At the end of the day, a 10 second response for what appears to be a complex query on a fairly large dataset, isn't all that bad!
How would I make this query run faster...?
SELECT account_id,
account_name,
account_update,
account_sold,
account_mds,
ftp_url,
ftp_livestatus,
number_digits,
number_cw,
client_name,
ppc_status,
user_name
FROM
Accounts,
FTPDetails,
SiteNumbers,
Clients,
PPC,
Users
WHERE Accounts.account_id = FTPDetails.ftp_accountid
AND Accounts.account_id = SiteNumbers.number_accountid
AND Accounts.account_client = Clients.client_id
AND Accounts.account_id = PPC.ppc_accountid
AND Accounts.account_designer = Users.user_id
AND Accounts.account_active = 'active'
AND FTPDetails.ftp_active = 'active'
AND SiteNumbers.number_active = 'active'
AND Clients.client_active = 'active'
AND PPC.ppc_active = 'active'
AND Users.user_active = 'active'
ORDER BY
Accounts.account_update DESC
Thanks in advance :)
EXPLAIN query results:
I don't really have any foreign keys set up... I was trying to avoid making alterations to the database, as it will have to get a complete overhaul soon.
The only primary keys are the ids of each table, e.g. account_id, ftp_id, ppc_id ...
Indexes
You need - at least - an index on every field that is used in a JOIN condition.
Indexes on the fields that appear in WHERE or GROUP BY or ORDER BY clauses are most of the time useful, too.
When two or more fields of a table are used in JOINs (or in WHERE, GROUP BY or ORDER BY), a compound (combined) index on these fields may be better than separate indexes. For example, in the SiteNumbers table, possible indexes are the compound (number_accountid, number_active) or (number_active, number_accountid).
Conditions on fields that are Boolean (ON/OFF, active/inactive) sometimes slow queries down (as the indexes are not selective and thus not very helpful). Restructuring (further normalizing) the tables is an option in that case, but you can probably avoid the added complexity.
Besides the usual advice (examine the EXPLAIN plan, add indexes where needed, test variations of the query),
I notice that in your query there is a partial Cartesian Product. The table Accounts has a one-to-many relationships to three tables FTPDetails, SiteNumbers and PPC. This has the effect that if you have for example 1000 accounts, and every account is related to, say, 10 FTPDetails, 20 SiteNumbers and 3 PPCs, the query will return for every account 600 rows (the product of 10x20x3). In total 600K rows where many data are duplicated.
You could instead split the query into three, plus one for the base data (Accounts and the remaining tables). That way, only 34K rows of (smaller) data would be transferred:
Accounts JOIN Clients JOIN Users
(with all fields needed from these tables)
1K rows
Accounts JOIN FTPDetails
(with Accounts.account_id and all fields from FTPDetails)
10K rows
Accounts JOIN SiteNumbers
(with Accounts.account_id and all fields from SiteNumbers)
20K rows
Accounts JOIN PPC
(with Accounts.account_id and all fields from PPC)
3K rows
and then use the data from the 4 queries in the client side to show combined info.
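As a sketch, the four queries might look like this (column lists abbreviated to the ones used in the original query; assuming the join conditions from the question):

```sql
-- 1. Base data: one row per account (~1K rows)
SELECT a.account_id, a.account_name, a.account_update,
       a.account_sold, a.account_mds, c.client_name, u.user_name
FROM Accounts a
JOIN Clients c ON a.account_client   = c.client_id AND c.client_active = 'active'
JOIN Users   u ON a.account_designer = u.user_id   AND u.user_active   = 'active'
WHERE a.account_active = 'active';

-- 2. FTP details per account (~10K rows)
SELECT a.account_id, f.ftp_url, f.ftp_livestatus
FROM Accounts a
JOIN FTPDetails f ON a.account_id = f.ftp_accountid AND f.ftp_active = 'active'
WHERE a.account_active = 'active';

-- 3. Site numbers per account (~20K rows)
SELECT a.account_id, s.number_digits, s.number_cw
FROM Accounts a
JOIN SiteNumbers s ON a.account_id = s.number_accountid AND s.number_active = 'active'
WHERE a.account_active = 'active';

-- 4. PPC rows per account (~3K rows)
SELECT a.account_id, p.ppc_status
FROM Accounts a
JOIN PPC p ON a.account_id = p.ppc_accountid AND p.ppc_active = 'active'
WHERE a.account_active = 'active';
```

The client then merges the three detail result sets into the base rows by account_id, avoiding the row-multiplication of the single big join.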
I would add the following indexes:
Table Accounts
index on (account_designer)
index on (account_client)
index on (account_active, account_id)
index on (account_update)
Table FTPDetails
index on (ftp_active, ftp_accountid)
Table SiteNumbers
index on (number_active, number_accountid)
Table PPC
index on (ppc_active, ppc_accountid)
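In DDL form (index names are illustrative):

```sql
ALTER TABLE Accounts
  ADD INDEX ix_designer  (account_designer),
  ADD INDEX ix_client    (account_client),
  ADD INDEX ix_active_id (account_active, account_id),
  ADD INDEX ix_update    (account_update);

ALTER TABLE FTPDetails  ADD INDEX ix_active_account (ftp_active, ftp_accountid);
ALTER TABLE SiteNumbers ADD INDEX ix_active_account (number_active, number_accountid);
ALTER TABLE PPC         ADD INDEX ix_active_account (ppc_active, ppc_accountid);
```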
Use EXPLAIN to find out which index could be used and which index is actually used. Create an appropriate index if necessary.
If FTPDetails.ftp_active only has the two valid entries 'active' and 'inactive', use BOOL as data type.
As a side note: I strongly suggest using explicit joins instead of implicit ones:
SELECT
account_id, account_name, account_update, account_sold, account_mds,
ftp_url, ftp_livestatus,
number_digits, number_cw,
client_name,
ppc_status,
user_name
FROM Accounts
INNER JOIN FTPDetails
ON Accounts.account_id = FTPDetails.ftp_accountid
AND FTPDetails.ftp_active = 'active'
INNER JOIN SiteNumbers
ON Accounts.account_id = SiteNumbers.number_accountid
AND SiteNumbers.number_active = 'active'
INNER JOIN Clients
ON Accounts.account_client = Clients.client_id
AND Clients.client_active = 'active'
INNER JOIN PPC
ON Accounts.account_id = PPC.ppc_accountid
AND PPC.ppc_active = 'active'
INNER JOIN Users
ON Accounts.account_designer = Users.user_id
AND Users.user_active = 'active'
WHERE Accounts.account_active = 'active'
ORDER BY Accounts.account_update DESC
This makes the query much more readable because the join condition is close to the name of the table that is being joined.
Use EXPLAIN and benchmark different options. For starters, I'm sure that several smaller queries will be faster than this monster: first, because the query optimiser will spend a lot of time examining which join order is best (5! = 120 possibilities); and second, because queries like SELECT ... WHERE ...active = 'active' will be cached (though that depends on how often the data changes).
One of your main problems is here: x.y_active = 'active'
Problem: low cardinality
The active field is a Boolean field with 2 possible values; as such it has very low cardinality.
MySQL (or any SQL database, for that matter) will not use an index when 30% or more of the rows have the same value.
Forcing the index is useless because it would make your query slower, not faster.
Solution: partition your tables
A solution is to partition your tables on the active columns.
This will exclude all non-active rows from consideration and make the select act as if you actually had a working index on the xxx_active fields.
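A sketch of list-partitioning one of the tables on its active flag (assuming the column holds the strings 'active'/'inactive'; note that in MySQL the partitioning column must be part of every unique key on the table, including the primary key):

```sql
ALTER TABLE FTPDetails
  PARTITION BY LIST COLUMNS (ftp_active) (
    PARTITION p_active   VALUES IN ('active'),
    PARTITION p_inactive VALUES IN ('inactive')
  );
```

With this in place, queries filtering on ftp_active = 'active' only ever scan the p_active partition (partition pruning).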
Sidenote
Please don't ever use implicit WHERE joins; they are much too error-prone and confusing to be useful.
Use a syntax like Oswald's answer instead.
Links:
Cardinality: http://en.wikipedia.org/wiki/Cardinality_(SQL_statements)
Cardinality and indexes: http://www.bennadel.com/blog/1424-Exploring-The-Cardinality-And-Selectivity-Of-SQL-Conditions.htm
MySQL partitioning: http://dev.mysql.com/doc/refman/5.5/en/partitioning.html