Can this MySQL subquery be optimised?

I have two tables, news and news_views. Every time an article is viewed, the news id, IP address and date are recorded in news_views.
I'm using a query with a subquery to fetch the most viewed titles from news, by getting the total count of views in the last 24 hours for each one.
It works fine except that it takes between 5 and 10 seconds to run, presumably because there are hundreds of thousands of rows in news_views and it has to go through the entire table before it can finish. The query is as follows. Is there any way it can be improved?
SELECT n.title, nv.views
FROM news n
LEFT JOIN (
    SELECT news_id, COUNT(DISTINCT ip) AS views
    FROM news_views
    WHERE datetime >= SUBDATE(NOW(), INTERVAL 24 HOUR)
    GROUP BY news_id
) AS nv ON nv.news_id = n.id
ORDER BY views DESC
LIMIT 15

I don't think you need to calculate the count of views as a derived table. Note that the date filter belongs in the ON clause: in the WHERE clause it would discard articles with no recent views and turn the LEFT JOIN into an inner join.
SELECT n.id, n.title, COUNT(DISTINCT nv.ip) AS views
FROM news n
LEFT JOIN news_views nv
ON nv.news_id = n.id
AND nv.datetime >= SUBDATE(NOW(), INTERVAL 24 HOUR)
GROUP BY n.id, n.title
ORDER BY views DESC
LIMIT 15
The best advice here is to run these queries through EXPLAIN to see what the query will actually do: index scans, table scans, estimated costs, etc. Avoid full table scans.
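
With a query of this shape, it is worth checking how a filter on the right-hand table of a LEFT JOIN behaves in WHERE compared with ON. The following is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows, the viewed_at column name and the fixed CUTOFF value are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE news_views (news_id INTEGER, ip TEXT, viewed_at TEXT);
    INSERT INTO news VALUES (1, 'fresh'), (2, 'stale'), (3, 'unread');
    INSERT INTO news_views VALUES
        (1, '10.0.0.1', '2024-01-02 09:00:00'),
        (1, '10.0.0.1', '2024-01-02 10:00:00'),  -- same IP again, counted once
        (1, '10.0.0.2', '2024-01-02 11:00:00'),
        (2, '10.0.0.3', '2023-12-01 08:00:00');  -- older than the cutoff
""")

# Hypothetical fixed cutoff standing in for SUBDATE(NOW(), INTERVAL 24 HOUR).
CUTOFF = "2024-01-01 12:00:00"

# Filter in WHERE: rows where nv.* is NULL or too old are thrown away,
# so articles without recent views disappear from the result.
filter_in_where = conn.execute("""
    SELECT n.title, COUNT(DISTINCT nv.ip) AS views
    FROM news n
    LEFT JOIN news_views nv ON nv.news_id = n.id
    WHERE nv.viewed_at >= ?
    GROUP BY n.id, n.title
    ORDER BY views DESC, n.title
""", (CUTOFF,)).fetchall()

# Filter in ON: every article is kept, with 0 views when nothing matches.
filter_in_on = conn.execute("""
    SELECT n.title, COUNT(DISTINCT nv.ip) AS views
    FROM news n
    LEFT JOIN news_views nv ON nv.news_id = n.id AND nv.viewed_at >= ?
    GROUP BY n.id, n.title
    ORDER BY views DESC, n.title
""", (CUTOFF,)).fetchall()

print(filter_in_where)
print(filter_in_on)
```

For a top-15 list the two forms usually agree on the leading rows; they differ only in whether articles without recent views show up at the bottom.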

Related

Correlated subquery "unknown column"

I'm getting an Unknown column 'sites.id' in 'where clause' error on the following query:
SELECT id, COUNT( returning_visitors.per_ip ) as readers, AVG( returning_visitors.per_ip ) as avg_visits_pr
FROM sites
JOIN (
SELECT COUNT( * ) AS per_ip
FROM site_hits_unique
WHERE site_id = sites.id
AND date >= CURDATE( ) - INTERVAL 30 DAY
GROUP BY site_id, ip
HAVING per_ip > 1
) AS returning_visitors
WHERE id IN (162888, 42705, 11412)
I want to run the inner query for every sites.id (the example just uses a few IDs for testing purposes).
The correlated subquery is only one level deep, so I'm not quite sure why it's not getting sites.id.
Any ideas how to fix?

I found the reason why from http://dev.mysql.com/doc/refman/5.6/en/subquery-restrictions.html:
Subqueries in the FROM clause cannot be correlated subqueries. They
are materialized in whole (evaluated to produce a result set) during
query execution, so they cannot be evaluated per row of the outer
query. Before MySQL 5.6.3, materialization takes place before
evaluation of the outer query. As of 5.6.3, the optimizer delays
materialization until the result is needed, which may permit
materialization to be avoided. See Section 8.2.1.18.3, “Optimizing
Derived Tables (Subqueries) in the FROM Clause”.
I still need to figure out how to rewrite my query so that it works the way I want it to, though. Is a function necessary or feasible here?

Rewrite the query so that the derived table returns site_id and is joined on it, and group the outer query by id so that you get one row per site:
SELECT id, COUNT( returning_visitors.per_ip ) AS readers, AVG( returning_visitors.per_ip ) AS avg_visits_pr
FROM sites
JOIN (
SELECT COUNT( * ) AS per_ip, site_id
FROM site_hits_unique
WHERE site_id IN (162888, 42705, 11412)
AND date >= CURDATE( ) - INTERVAL 30 DAY
GROUP BY site_id, ip
HAVING per_ip > 1
) AS returning_visitors
ON id = returning_visitors.site_id
WHERE id IN (162888, 42705, 11412)
GROUP BY id
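
The idea of exposing the join key from the derived table, instead of referencing sites.id from inside it, can be tried on a small dataset. This is only a sketch: it runs on SQLite via Python's sqlite3 module rather than MySQL, leaves out the 30-day filter, and the site IDs, IPs and the hit_date column name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (id INTEGER PRIMARY KEY);
    CREATE TABLE site_hits_unique (site_id INTEGER, ip TEXT, hit_date TEXT);
    INSERT INTO sites VALUES (1), (2), (3);
""")

# Hypothetical sample data: one row per (site, ip, day).
hits = (
    [(1, "a", f"2024-01-0{d}") for d in (1, 2, 3)]        # ip a: 3 visits to site 1
    + [(1, "b", f"2024-01-0{d}") for d in (1, 2)]         # ip b: 2 visits to site 1
    + [(1, "c", "2024-01-01")]                            # single visit, not returning
    + [(2, "a", "2024-01-01")]                            # site 2: no returning visitor
    + [(3, "d", f"2024-01-0{d}") for d in (1, 2, 3, 4)]   # ip d: 4 visits to site 3
)
conn.executemany("INSERT INTO site_hits_unique VALUES (?, ?, ?)", hits)

# The derived table is evaluated on its own (it never mentions sites.id),
# returns site_id, and is then joined to sites on that column.
rows = conn.execute("""
    SELECT s.id,
           COUNT(rv.per_ip) AS readers,
           AVG(rv.per_ip)   AS avg_visits_pr
    FROM sites s
    JOIN (
        SELECT site_id, COUNT(*) AS per_ip
        FROM site_hits_unique
        GROUP BY site_id, ip
        HAVING COUNT(*) > 1
    ) AS rv ON rv.site_id = s.id
    GROUP BY s.id
    ORDER BY s.id
""").fetchall()

print(rows)
```

Site 2 is missing from the result because the inner join drops sites that have no returning visitors; a LEFT JOIN would keep it with a count of 0.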

Mysql: Get records from last date

I want to get all records which are not "older" than 20 days. If there are no records within 20 days, I want all records from the most recent day. I'm doing this:
SELECT COUNT(DISTINCT t.id) FROM t
WHERE
(DATEDIFF(NOW(), t.created) <= 20
OR
(date(t.created) >= (SELECT max(date(created)) FROM t)));
This works so far, but it is awfully slow. created is a datetime, so it might be due to the conversion to a date. Any ideas how to speed this up?

SELECT COUNT(*) FROM (
SELECT * FROM t WHERE datediff(now(),created) between 0 and 20
UNION
SELECT * FROM (SELECT * FROM t WHERE created<now() ORDER BY created DESC LIMIT 1) last1
) last20d
I used the BETWEEN clause in case there are dates in the future in the table; these will be excluded. The ORDER BY created DESC makes sure that LIMIT 1 picks the most recent record. If you only need the count, you can simplify the select to:
SELECT COUNT(*) FROM (
SELECT id FROM t WHERE datediff(now(),created) between 0 and 20
UNION
SELECT id FROM (SELECT id FROM t WHERE created<now() ORDER BY created DESC LIMIT 1) last1
) last20d
Otherwise, with the first version you can leave out the outer select if you want all the data of the chosen records. The UNION makes sure that duplicates are excluded (in other cases I always use UNION ALL since it is faster).
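
The effect of UNION versus UNION ALL, and the fallback to the latest record, can be checked with a small sketch. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL. Because SQLite has no DATEDIFF, the 20-day window is written as a plain range on created, which is a substitution and not the query above (a range on the bare column also lets an index on created be used). The count_recent helper and the sample dates are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, "2024-01-01 10:00:00"),
    (2, "2024-02-20 08:00:00"),
    (3, "2024-03-01 09:00:00"),
    (4, "2024-04-01 09:00:00"),
])

def count_recent(now, union="UNION"):
    """Count ids from the last 20 days plus the latest id up to `now`."""
    lo = now - timedelta(days=20)
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT id FROM t WHERE created BETWEEN :lo AND :now
            {union}
            SELECT id FROM (
                SELECT id FROM t WHERE created <= :now
                ORDER BY created DESC LIMIT 1
            ) AS last1
        ) AS last20d
    """
    params = {"lo": lo.isoformat(sep=" "), "now": now.isoformat(sep=" ")}
    return conn.execute(sql, params).fetchone()[0]

march = datetime(2024, 3, 5, 12, 0, 0)     # ids 2 and 3 fall inside the window
february = datetime(2024, 2, 10, 12, 0, 0)  # nothing inside the window

print(count_recent(march))               # latest row is deduplicated by UNION
print(count_recent(march, "UNION ALL"))  # latest row is counted twice
print(count_recent(february))            # only the fallback row remains
```

Note that the fallback branch returns a single row, not every row from the most recent day.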

Top-10 mysql query

I'm in need of a better way of retrieving top 10 distinct UID from some tables I have.
The setup:
Table user_view_tracker
Contains pairs of {user id (uid), timestamp (ts)}
Is growing every day (today it's 41k entries)
My goal:
To produce a top 10 of the most viewed user IDs in the table user_view_tracker
My current code is working, but killing the database slowly:
select
distinct uvt.uid as UID,
(select count(*) from user_view_tracker temp where temp.uid=uvt.uid and temp.ts>date_sub(now(),interval 1 month)) as CLICK
from user_view_tracker uvt
order by CLICK
limit 10
It's quite obvious that a different data structure would help. But I can't do that as of now.

First of all, get rid of that subquery. This should be enough ;)
select
uvt.uid as UID
,count(*) as CLICK
from
user_view_tracker uvt
where
uvt.ts > date_sub(now(),interval 1 month)
group by
uvt.uid
order by CLICK DESC
limit 10

Try:
select uid, count(*) as num_stamps
from user_view_tracker
where ts > date_sub(now(), interval 1 month)
group by uid
order by 2 desc limit 10
I kept your criteria as far as getting the count for just the past month. You can remove that line if you want to count all.
The removal of DISTINCT should improve performance. It is not necessary if you aggregate in your outer query and group by uid, as that will aggregate the data to one row per uid with the count.
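
To see that the grouped query returns the same top rows as the DISTINCT plus correlated-subquery version, here is a small sketch on SQLite via Python's sqlite3 module (a stand-in for MySQL). The sample rows and the fixed SINCE cutoff are assumptions, and both queries sort descending, with uid as a tie-breaker so the comparison is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_view_tracker (uid INTEGER, ts TEXT)")
conn.executemany("INSERT INTO user_view_tracker VALUES (?, ?)", [
    (1, "2024-03-01"), (1, "2024-03-02"), (1, "2024-03-03"),
    (2, "2024-03-01"), (2, "2024-01-01"), (2, "2024-01-02"),
    (3, "2024-03-01"), (3, "2024-03-02"),
])

# Hypothetical cutoff standing in for DATE_SUB(NOW(), INTERVAL 1 MONTH).
SINCE = "2024-02-15"

# Original shape: one count subquery is evaluated per row, then DISTINCT
# collapses the duplicates.
correlated = conn.execute("""
    SELECT DISTINCT uvt.uid,
           (SELECT COUNT(*) FROM user_view_tracker tmp
            WHERE tmp.uid = uvt.uid AND tmp.ts > ?) AS clicks
    FROM user_view_tracker uvt
    ORDER BY clicks DESC, uvt.uid
    LIMIT 2
""", (SINCE,)).fetchall()

# Aggregated shape: filter once, group once, one row per uid.
grouped = conn.execute("""
    SELECT uid, COUNT(*) AS clicks
    FROM user_view_tracker
    WHERE ts > ?
    GROUP BY uid
    ORDER BY clicks DESC, uid
    LIMIT 2
""", (SINCE,)).fetchall()

print(correlated)
print(grouped)
```

One difference remains: the grouped form omits uids with no views after the cutoff, while the correlated form lists them with a count of 0. For a top 10 this rarely matters.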

You should use Aggregate functions in MySQL
SELECT UID, COUNT(ts) as Number_Of_Views FROM user_view_tracker
GROUP BY UID
ORDER BY Number_Of_Views DESC
LIMIT 10
A simple demo which selects the top 10 UID viewed
http://sqlfiddle.com/#!2/907c10/3

Need help optimizing 4 heavy queries on one webpage

I have four queries that run on one web page. I use them for statistics and they are taking too long to load.
Here are my current configurations
use the text wrapping button on pastebin to make it easier to read.
I have a lot of RAM dedicated to mysql but it still takes a long time. I have also indexed most of the columns.
I'm just trying to see what other options I have.
I put "show create table" and total count(*) in here. I'm going to rename everything and paste in SO. I agree that someone in the future may use it.

QUERY ONE
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsresults
WHERE
DID = 28
AND ActionTypeID = 1
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
For this one, I would create a covering index on your key columns so the engine does not have to go back to the raw data. Based on this and your following queries, I would put the DID column in the leading index position, such as
StatisticsResults -- index ( DID, ActionTypeID, DateActioned )
The ORDER BY on YEAR() descending and MONTH() descending does the same thing as your hard-coded references to FIND the field in the list.
QUERY TWO
-- 381.812
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsdivision
WHERE
DID = 28
AND ActionTypeID = 9
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
On this one, I changed DID = '28' to DID = 28. If the column is numeric, don't confuse the engine by making it convert one type to the other. The same indexes from query 1 apply here too.
QUERY THREE
-- 33.899
SELECT SQL_NO_CACHE DISTINCT
AID,
COUNT(*) AS acount
FROM
db.statisticsresults
JOIN db.division_id USING(AID)
WHERE
DID = 28
GROUP BY
AID
ORDER BY
count(*) DESC
LIMIT
19
This one looks like a bit of a waste. You are joining to the division table based on an "AID" column in the stats table. Why do the join unless you actually expect some invalid "AID" values that are not in the division table? Again, change your "DID" comparison to 28 instead of '28'. Ensure your division table has an index on "AID" for the join. The SECOND index from query 1 appears to be your better option.
QUERY FOUR
-- 21.403
SELECT SQL_NO_CACHE DISTINCT
TID,
tax,
agent,
COUNT(*) AS t_count
FROM
db.statisticsresults sr
JOIN db.tax_id USING(TID)
JOIN db.agent_id ai ON(ai.AID = sr.AID)
WHERE
DID = 28
GROUP BY
TID,
sr.AID
ORDER BY
COUNT(*) DESC
LIMIT 19
Again, change the "DID" comparison from '28' to 28.
For your TAX_ID table, have a covering index on that too, so it can handle the join to the agent table without going to the raw page data:
Tax_ID -- index ( tid, aid )
Finally, if you are dealing with your original list finding things only from Jan 2012 to Dec 2013, you can simplify querying the ENTIRE table of stats by adding to your WHERE clause...
AND DateActioned >= '2012-01-01'
So you completely skip over anything prior to 2012 (old data I presume?)

how to stop or limit select after selecting top 100 row?

I have a query like this -
SELECT c.msisdn,SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate>='2013-10-29'
GROUP BY c.msisdn
ORDER BY datasum DESC;
This one takes 4 minutes. I have an index on eveDate.
The cdr table contains 2400000 records for each day from '2013-10-01' to '2013-10-30', but I only want to select the first 100 records. How am I supposed to optimize this query?
I have used a LIMIT clause, but there is no benefit from it.
Please let me know how I can optimize this query.
Thank you.

Just put LIMIT 100 after ORDER BY datasum DESC, like this:
.... ORDER BY datasum DESC LIMIT 100;

If records are distributed evenly, one day would have 80k rows. A GROUP BY over 80k rows should not take 4 minutes (I guess).
I'm not sure whether you have the following index:
INDEX(eveDate, msisdn)
With this index, records are sorted by eveDate and msisdn, so the GROUP BY operation is optimized, i.e. rows with the same msisdn are located in the same block. I guess the following query is faster than yours.
Q1
SELECT x.msisdn, SUM(datasum)
FROM
(
SELECT c.msisdn AS msisdn,
SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate>='2013-10-29'
GROUP BY eveDate, c.msisdn
) x
GROUP BY x.msisdn
ORDER BY SUM(datasum) DESC
LIMIT 100;
or something like this.
Q2
SELECT c.msisdn, SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate>='2013-10-29'
GROUP BY c.msisdn
ORDER BY datasum DESC
LIMIT 100;
The above query is simpler, but the same msisdn can appear under another eveDate, so the benefit from INDEX(eveDate, msisdn) is small. If your disk has plenty of free space, the following index lets the query run as an index-only scan: everything required is in the index, so the table data is never read.
INDEX(eveDate, msisdn, dataVolumeDownLink, dataVolumeUpLink)
UPDATED
Hmm, if the data is append-only and appended data is never changed, I wonder about making a summary table for every day. (The column types below are assumptions, adjust them to match cdr. I am also assuming eveDate is a DATE column.)
CREATE TABLE summary (eveDate DATE, msisdn VARCHAR(32), datasum BIGINT, INDEX (eveDate, msisdn));
and run the following query every night via a cron job
INSERT INTO summary
SELECT CURDATE(), c.msisdn, SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate = CURDATE()
GROUP BY c.msisdn
Then your query would be very simple.
SELECT msisdn, SUM(datasum) AS datasum
FROM summary
WHERE eveDate BETWEEN ? AND ?
GROUP BY msisdn
ORDER BY datasum DESC
LIMIT 100
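
A quick way to convince yourself that a daily summary table gives the same totals as the raw table is to try it on a few rows. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for MySQL. The sample rows, the roll_up helper and the index name are hypothetical, and the roll-up is run once per day in a loop instead of from a nightly cron job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cdr (eveDate TEXT, msisdn TEXT,
                      dataVolumeDownLink INTEGER, dataVolumeUpLink INTEGER);
    CREATE TABLE summary (eveDate TEXT, msisdn TEXT, datasum INTEGER);
    CREATE INDEX summary_day_msisdn ON summary (eveDate, msisdn);
    INSERT INTO cdr VALUES
        ('2013-10-28', 'B', 100, 100),
        ('2013-10-29', 'A', 10, 5),
        ('2013-10-29', 'A', 1, 1),
        ('2013-10-29', 'B', 7, 0),
        ('2013-10-30', 'A', 2, 2),
        ('2013-10-30', 'C', 50, 50);
""")

def roll_up(day):
    """Aggregate one day of cdr rows into one summary row per msisdn."""
    conn.execute("""
        INSERT INTO summary
        SELECT eveDate, msisdn, SUM(dataVolumeDownLink + dataVolumeUpLink)
        FROM cdr
        WHERE eveDate = ?
        GROUP BY eveDate, msisdn
    """, (day,))

for day in ("2013-10-28", "2013-10-29", "2013-10-30"):
    roll_up(day)

summary_rows = conn.execute("SELECT COUNT(*) FROM summary").fetchone()[0]

# Top talkers computed from the raw rows...
from_raw = conn.execute("""
    SELECT msisdn, SUM(dataVolumeDownLink + dataVolumeUpLink) AS total
    FROM cdr
    WHERE eveDate >= ?
    GROUP BY msisdn
    ORDER BY total DESC, msisdn
    LIMIT 100
""", ("2013-10-29",)).fetchall()

# ...and from the much smaller summary table.
from_summary = conn.execute("""
    SELECT msisdn, SUM(datasum) AS total
    FROM summary
    WHERE eveDate >= ?
    GROUP BY msisdn
    ORDER BY total DESC, msisdn
    LIMIT 100
""", ("2013-10-29",)).fetchall()

print(from_raw)
print(from_summary)
```

The saving comes from the summary table holding one row per msisdn per day, so the final GROUP BY touches far fewer rows than the raw cdr table would.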

SELECT c.msisdn,SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate>='2013-10-29'
GROUP BY c.msisdn
ORDER BY datasum DESC
LIMIT 0, 100;

SELECT c.msisdn,SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM
(select * from cdr where eveDate>='2013-10-29' limit 100) as c
GROUP BY c.msisdn
ORDER BY datasum DESC;
A small change from Larry's answer.
I'm not completely sure I understood the question correctly.
This takes the first 100 records and does the calculation on those,
so the final result may have fewer than 100 rows, depending on the GROUP BY clause.
EDIT:
As per your clarification, you will need to add an index on c.msisdn and add a LIMIT clause at the end. Remove the ORDER BY clause from the inner query and wrap it in an outer query just to have the records ordered:
SELECT a.* FROM (
SELECT c.msisdn,SUM(c.dataVolumeDownLink+c.dataVolumeUpLink) AS datasum
FROM cdr c
WHERE c.eveDate>='2013-10-29'
GROUP BY c.msisdn limit 100) a
ORDER BY a.datasum DESC;
add an index on c.msisdn
