Joining two large tables in MySQL giving server timeout - mysql

This query is inefficient and fails to complete. The track and desiredspeed tables each have almost a million records. After this we want to self-join the track table for further processing. Any efficient approach to executing the query below is appreciated.
select
    t_id,
    route_id,
    t.timestamp,
    s_lat,
    s_long,
    longitude,
    latitude,
    SQRT(POW((latitude - d_lat), 2) + POW((longitude - d_long), 2)) as dst,
    SUM(speed * 18 / 5) / count(*) as speed,
    '20' as actual_speed,
    ((20 - (speed * 18 / 5)) / (speed * 18 / 5)) * 100 as speed_variation
from
    track t,
    desiredspeed s
WHERE
    LEFT(s_lat, 6) = LEFT(latitude, 6)
    AND LEFT(s_long, 6) = LEFT(longitude, 6)
    AND t_id > 53445
group by
    route_id,
    s_lat,
    s_long
order by
    t_id asc

Firstly, you are using Sybase-style (comma) join syntax; I would change that to explicit JOIN ... ON syntax.
You are also performing two computations per joined row across large datasets, which is likely to be inefficient.
The join will not be able to use an index because you are performing a computation on the columns. Either store the data precomputed, or alternatively add a computed column based on the rule applied above, and index accordingly.
Finally, it may be quicker if you used temp tables or common table expressions (although I do not know MySQL too well here).
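As a rough illustration of the precomputed-column idea (a minimal sketch; the prefix column and index names are assumptions, not part of the original schema):

-- Store the truncated coordinates once, so the join can use an index
-- instead of calling LEFT() per row.
ALTER TABLE track
    ADD COLUMN lat6 VARCHAR(6),
    ADD COLUMN long6 VARCHAR(6);
UPDATE track SET lat6 = LEFT(latitude, 6), long6 = LEFT(longitude, 6);

ALTER TABLE desiredspeed
    ADD COLUMN s_lat6 VARCHAR(6),
    ADD COLUMN s_long6 VARCHAR(6);
UPDATE desiredspeed SET s_lat6 = LEFT(s_lat, 6), s_long6 = LEFT(s_long, 6);

CREATE INDEX idx_track_coords ON track (lat6, long6);
CREATE INDEX idx_desired_coords ON desiredspeed (s_lat6, s_long6);

-- The join condition then becomes index-friendly:
-- FROM track t JOIN desiredspeed s
--   ON s.s_lat6 = t.lat6 AND s.s_long6 = t.long6
-- WHERE t.t_id > 53445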

Related

How to improve query performance with order by, group by and joins

I had a problem with ORDER BY when joining multiple tables that have millions of rows. I found a solution in the following question: instead of a JOIN with DISTINCT, using EXISTS improves performance.
How to improve order by performance with joins in mysql
SELECT
    `tracked_twitter`.*,
    COUNT(*) AS twitterContentCount,
    retweet_count + favourite_count + reply_count AS engagement
FROM
    `tracked_twitter`
INNER JOIN
    `twitter_content`
    ON `tracked_twitter`.`id` = `twitter_content`.`tracked_twitter_id`
INNER JOIN
    `tracker_twitter_content`
    ON `twitter_content`.`id` = `tracker_twitter_content`.`twitter_content_id`
WHERE
    `tracker_twitter_content`.`tracker_id` = '88'
GROUP BY
    `tracked_twitter`.`id`
ORDER BY
    twitterContentCount DESC
LIMIT 20 OFFSET 0
But that method works if I only need the result set from the parent table. What if I want to execute grouped counts and other math functions on tables other than the parent table? I wrote a query that meets my criteria, but it takes 20 seconds to execute. How can I optimize it?
Thanks in advance
Given the query is already fairly simple, the options I'd look into are ...
Execution plan (to find any missing indexes you could add; see the EXPLAIN sketch at the end of this answer)
Caching (to ensure the database server already has all the data in RAM)
De-normalisation (to turn the query into a flat select)
Cache the data in the application (so you could use something like PLINQ on it)
Use a RAM-based store (Redis, Elastic)
File group adjustments (physically move the db to faster discs)
Partition your tables (to spread the raw data over multiple physical discs)
The further you go down this list, the more involved the solutions become.
I guess it depends how fast you need the query to be and how much you need your solution to scale.
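For the execution-plan step, a minimal sketch in MySQL (EXPLAIN itself is standard; the interpretation notes are general guidance, not output from this particular schema):

-- Prefix the query with EXPLAIN to see how MySQL plans to execute it.
EXPLAIN
SELECT `tracked_twitter`.*, COUNT(*) AS twitterContentCount
FROM `tracked_twitter`
INNER JOIN `twitter_content`
    ON `tracked_twitter`.`id` = `twitter_content`.`tracked_twitter_id`
GROUP BY `tracked_twitter`.`id`;
-- Look for type = ALL (full table scan) and NULL in possible_keys:
-- those rows usually indicate a missing index on the join column.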

Can you index subqueries?

I have a table and a query that looks like below. For a working example, see this SQL Fiddle.
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN
(
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B
) w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
With my real data, this query takes 27 seconds. However, if I first create w as a temporary table and index property_B, it altogether takes ~1 second.
CREATE TEMPORARY TABLE w AS
SELECT o.property_B, SUM(o.score2) AS score FROM o GROUP BY property_B;
ALTER TABLE w ADD INDEX `property_B_idx` (property_B);
SELECT o.property_B, SUM(o.score1), w.score
FROM o
INNER JOIN w ON w.property_B = o.property_B
WHERE o.property_A = 'specific_A'
GROUP BY property_B;
DROP TABLE IF EXISTS w;
Is there a way to combine the best of these two queries? I.e. a single query with the speed advantages of the indexing in the subquery?
EDIT
After Mehran's answer below, I read this piece of explanation in the MySQL documentation:
As of MySQL 5.6.3, the optimizer more efficiently handles subqueries in the FROM clause (that is, derived tables):
...
For cases when materialization is required for a subquery in the FROM clause, the optimizer may speed up access to the result by adding an index to the materialized table. If such an index would permit ref access to the table, it can greatly reduce the amount of data that must be read during query execution. Consider the following query:
SELECT * FROM t1
JOIN (SELECT * FROM t2) AS derived_t2 ON t1.f1=derived_t2.f1;
The optimizer constructs an index over column f1 from derived_t2 if doing so would permit the use of ref access for the lowest cost execution plan. After adding the index, the optimizer can treat the materialized derived table the same as a usual table with an index, and it benefits similarly from the generated index. The overhead of index creation is negligible compared to the cost of query execution without the index. If ref access would result in higher cost than some other access method, no index is created and the optimizer loses nothing.
First of all, you need to know that creating a temporary table is absolutely a feasible solution, but only for cases where no other choice is applicable, which is not true here!
In your case, you can easily boost your query, as FrankPl pointed out, because your sub-query and main query are both grouping by the same field, so you don't need any sub-queries at all. I'm going to copy and paste FrankPl's solution for the sake of completeness:
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
Yet it doesn't mean it's impossible to come across a scenario in which you wish you could index a sub-query. In such cases you've got two choices. The first is using a temporary table, as you pointed out yourself, to hold the results of the sub-query. This solution is advantageous since it has been supported by MySQL for a long time; it's just not feasible if there's a huge amount of data involved.
The second solution is using MySQL version 5.6 or above. In recent versions of MySQL, new algorithms are incorporated, so an index defined on a table used within a sub-query can also be used outside of the sub-query.
[UPDATE]
For the edited version of the question I would recommend the following solution:
SELECT o.property_B, SUM(IF(o.property_A = 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B
HAVING SUM(IF(o.property_A = 'specific_A', o.score1, 0)) > 0;
But you need to work on the HAVING part; you might need to change it according to your actual problem.
I am not really that familiar with MySQL; I mostly worked with Oracle.
If you want a where-clause inside the SUM, you can use DECODE or CASE.
It would look something like this:
SELECT o.property_B, SUM(decode(property_A, 'specific_A', o.score1, 0)), SUM(o.score2)
FROM o
GROUP BY property_B;
or with CASE:
SELECT o.property_B,
    SUM(CASE
            WHEN property_A = 'specific_A' THEN o.score1
            ELSE 0
        END),
    SUM(o.score2)
FROM o
GROUP BY property_B;
I do not see why you would need the join at all. I would assume that
SELECT o.property_B, SUM(o.score1), SUM(o.score2)
FROM o
GROUP BY property_B;
should give what you want, but with a much simpler and hence easier-to-optimize statement.
It should be the duty of MySQL to optimise your query, and I don't think there is a way to create an index on the fly. However, you can try to force the use of an index on o.property_B (if you have one). See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
Also, you can merge the create and alter statements, if you prefer.
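A minimal sketch of such an index hint (the index name property_B_idx is borrowed from the temporary-table example above; on a permanent table you would create and name the index yourself):

SELECT o.property_B, SUM(o.score1)
FROM o FORCE INDEX (property_B_idx)
WHERE o.property_A = 'specific_A'
GROUP BY property_B;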

Reduce number of database IO or size of data operation?

To make the system more efficient, should we reduce the number of database IO operations or reduce the size of each data operation?
More specifically, suppose I want to get the objects ranked 60-70.
1st approach:
By joining several tables, I get a huge table. Then I sort that table based on some attributes and return the top 70 objects with all their attributes, even though I only use objects 60-70.
2nd approach:
By joining fewer tables and sorting them, I get the top 70 objects' ids, and then I do a second lookup for objects 60-70 based on their ids.
So which one is better in terms of efficiency, especially for MySQL?
It will depend on how you designed your query.
Usually JOIN operations are more efficient than using IN (group) or nested SELECTs, but when joining 3 or more tables you have to choose the order carefully to optimize it.
And of course, every table bind should involve a PRIMARY KEY.
If the query remains too slow despite your efforts, then you should use a cache: a new table, or even a file, that stores the results of this query up to a given expiration time, after which it should be updated.
This is a common practice when the results of a heavy query are needed frequently in the system.
You can always count on MySQL Workbench to measure the speed of your queries and play with your options.
Ordinarily, the best way to take advantage of query optimization is to combine the two approaches you present.
SELECT col, col, col, col, etc
FROM tab1
JOIN tab2 ON col = col
JOIN tab3 ON col = col
WHERE tab1.id IN
    ( SELECT DISTINCT tab1.id
      FROM whatever
      JOIN whatever ON col = col
      WHERE whatever
      ORDER BY col DESC
      LIMIT 70
    )
See how that goes? You make a subquery to select the IDs, then use it in the main query.
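One caveat worth knowing (a general MySQL limitation, not something from this question): MySQL has historically rejected LIMIT inside an IN (...) subquery with error 1235, "This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'". The usual workaround is to join against a derived table instead; a sketch using the same placeholder names:

SELECT col, col, col
FROM tab1
JOIN tab2 ON col = col
JOIN ( SELECT DISTINCT tab1.id
       FROM whatever
       JOIN whatever ON col = col
       WHERE whatever
       ORDER BY col DESC
       LIMIT 70
     ) ids ON ids.id = tab1.id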

SORTING OUT MULTIPLE SIMILAR TABLES IN A MYSQL VIEW

Hi all,
I have 2 similar, very LARGE tables (1M rows each) with the same layout. I would like to UNION them and sort by a common column, start; I would also like to put a condition on start, i.e. start > X.
The problem is that the view doesn't take start's index into account, so the complexity rises sharply: a simple query takes about 15 seconds, and adding a LIMIT doesn't fix it because the results are cut off first.
CREATE VIEW CDR AS
(SELECT start, duration, clid FROM cdr_md ORDER BY start LIMIT 1000)
UNION ALL
(SELECT start, duration, clid FROM cdr_1025 ORDER BY start LIMIT 1000)
ORDER BY start ;
A query such as:
SELECT * FROM CDR WHERE start>10
doesn't return the expected results because the LIMIT keyword cuts off results first.
The expected results would be those of a query like this:
CREATE VIEW CDR AS
(SELECT start, duration, clid FROM cdr_md WHERE start>X ORDER BY start LIMIT 1000)
UNION ALL
(SELECT start, duration, clid FROM cdr_1025 WHERE start>X ORDER BY start LIMIT 1000)
ORDER BY start ;
Is there a way to avoid this problem?
Thanks all,
Fabrizio
I have 2 similar tables ... with the same layout
This is contrary to the Principle of Orthogonal Design.
Don't do it. At least not without very good reason—with suitable indexes, 1 million records per table is easily enough for MySQL to handle without any need for partitioning; and even if one did need to partition the data, there are better ways than this manual kludge (which can give rise to ambiguous, potentially inconsistent data and lead to redundancy and complexity in your data manipulation code).
Instead, consider combining your tables into a single one with suitable columns to distinguish the records' differences. For example:
CREATE TABLE cdr_combined AS
SELECT *, 'md' AS orig FROM cdr_md
UNION ALL
SELECT *, '1025' AS orig FROM cdr_1025
;
DROP TABLE cdr_md, cdr_1025;
If you will always be viewing your data along the previously "partitioned" axis, include the distinguishing columns as index prefixes and performance will generally improve versus having separate tables.
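For instance, a minimal sketch of such an index (the index name is illustrative, and it assumes the orig column from the combined-table example above):

ALTER TABLE cdr_combined ADD INDEX idx_orig_start (orig, start);

-- Queries that filter on orig (the old "partition") and on start can then
-- use this one index, e.g.:
-- SELECT start, duration, clid FROM cdr_combined
-- WHERE orig = 'md' AND start > 10 ORDER BY start;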
You then won't need to perform any UNION and your VIEW definition effectively becomes:
CREATE VIEW CDR AS
SELECT start, duration, clid FROM cdr_combined ORDER BY start
However, be aware that queries on views may not always perform as well as using the underlying tables directly. As documented under Restrictions on Views:
View processing is not optimized:
It is not possible to create an index on a view.
Indexes can be used for views processed using the merge algorithm. However, a view that is processed with the temptable algorithm is unable to take advantage of indexes on its underlying tables (although indexes can be used during generation of the temporary tables).
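If the merge algorithm applies to your view, you can request it explicitly when creating it; a small sketch (MERGE works here because the definition is a simple single-table SELECT; the ORDER BY is left to the outer query):

CREATE ALGORITHM = MERGE VIEW CDR AS
SELECT start, duration, clid FROM cdr_combined;

-- With MERGE, MySQL rewrites queries against the view into queries against
-- cdr_combined, so indexes on cdr_combined remain usable.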

How to optimize a JOIN and AVG statement for a ratings table

I basically have two tables, a server table and a server_ratings table. I need to optimize my current query (it works, but it takes around 4 seconds). Is there any way I can do this better?
SELECT ROUND(AVG(server_ratings.rating), 0), server.id, server.name
FROM server LEFT JOIN server_ratings ON server.id = server_ratings.server_id
GROUP BY server.id;
Query looks OK, but make sure you have proper indexes:
on the id column in the server table - probably the primary key,
on the server_id column in the server_ratings table.
If that does not help, then add a rating column to the server table and calculate it on a regular basis (see this answer about Cron jobs; a sketch of such a recalculation follows this answer). This way you save the time spent on calculations. They can be made separately, e.g. every minute, but probably some less frequent recalculation is enough (depending on how dynamic your data is).
Also make sure you query the proper table - in the question you mentioned a servers table, but in the code there is a reference to a server table. Probably a typo :)
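A minimal sketch of that precomputation (it assumes a new server.rating column; the column and statement are illustrative, not from the original schema):

-- Run periodically, e.g. from a cron job:
UPDATE server s
JOIN (
    SELECT server_id, ROUND(AVG(rating), 0) AS avg_rating
    FROM server_ratings
    GROUP BY server_id
) r ON r.server_id = s.id
SET s.rating = r.avg_rating;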
This should be slightly faster, because the aggregate function is executed first, resulting in fewer JOIN operations.
SELECT s.id, s.name, r.avg_rating
FROM server s
LEFT JOIN (
SELECT server_id, ROUND(AVG(rating), 0) AS avg_rating
FROM server_ratings
GROUP BY server_id
) r ON r.server_id = s.id
But the major point is matching indexes. Primary keys are indexed automatically; make sure you have an index on server_ratings.server_id, too.
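A one-statement sketch of that index (the index name is illustrative):

ALTER TABLE server_ratings ADD INDEX idx_server_id (server_id);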