I have a simple query across a 20 million record table, and I need an index that will improve the SELECT statement for the following query:
SELECT count(item_id), count(distinct user_id)
FROM activity
INNER JOIN item on item.item_id = activity.item_id
WHERE item.item_id = 3839 and activity.created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
I have indexes on:
activity - activity_id (PRIMARY), item_id, created_at - all single-column indexes
item - item_id (PRIMARY)
For items that have a lot of activity (around 600k rows), the query takes 4-5 seconds to run.
Any advice?
If there is a FOREIGN KEY constraint from activity to item (and assuming that user_id is in table activity), then your query is equivalent to:
SELECT COUNT(*), COUNT(DISTINCT user_id)
FROM activity
WHERE item_id = 3839
AND created_at >= DATE_SUB(NOW(), INTERVAL 30 DAY) ;
which should be more efficient as it has to get data from only one table or just one index. An index on (item_id, created_at, user_id) would be useful.
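If it helps, the suggested index could be created like this (a sketch assuming MySQL and the column names above; the index name is arbitrary):

```sql
-- Compound index: MySQL can seek to item_id = 3839, range-scan
-- created_at for the last 30 days, and read user_id straight from
-- the index (a covering index: no table lookups for this query).
ALTER TABLE activity
  ADD INDEX idx_item_created_user (item_id, created_at, user_id);
```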
Related
I'm trying to optimize this query. It returns multiple rows from building_rent_prices and building_weather, then groups them and calculates the average of their fields. So far the tables are all under a million rows, yet it takes several seconds. Does anyone know how I could optimize this, either with composite indexes or by rewriting the query? I'm assuming it should be possible to get this to 100ms or quicker, but so far it seems like it can't.
SELECT b.*
, AVG(r.rent)
, AVG(w.high_temp)
FROM buildings b
LEFT
JOIN building_rent_prices r
ON r.building_id = b.building_id
LEFT
JOIN building_weather w
ON w.building_id = b.building_id
WHERE w.date BETWEEN CURDATE() AND CURDATE + INTERVAL 4 DAY
AND r.date BETWEEN CURDATE() AND CURDATE + INTERVAL 10 day
GROUP
BY b.building_id
ORDER
BY AVG(r.rent) / b.square_feet DESC
LIMIT 10;
Explain said the following:
id | select_type | table                | type   | Extra
---|-------------|----------------------|--------|-----------------------------------------------------------
 1 | SIMPLE      | building_rent_prices | range  | Using where; Using index; Using temporary; Using filesort
 1 | SIMPLE      | buildings            | eq_ref | Using where
 1 | SIMPLE      | building_weather     | ref    | Using where; Using index
I'm working on some test data; here's the CREATE TABLE:
CREATE TABLE building(
building_id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(255),
square_feet INT
);
CREATE TABLE building_weather(
building_weather_id INT PRIMARY KEY AUTO_INCREMENT,
building_id INT,
weather_date DATE,
high_temp INT
);
CREATE TABLE building_rates(
building_rate_id INT PRIMARY KEY AUTO_INCREMENT,
building_id INT,
weather_date DATE,
rate double
);
ALTER TABLE building_rates ADD INDEX(building_id);
ALTER TABLE buildings ADD INDEX(building_id);
ALTER TABLE building_weather ADD INDEX(building_id);
This seems to work in under 1 second based on DRapp's answer, without indexes (I still need to test that it's valid):
select
B.*,
BRP.avgRent,
BW.avgTemp
from
( select building_id,
AVG( rent ) avgRent
from
building_rent_prices
where
date BETWEEN CURDATE() AND CURDATE() + 10
group by
building_id
order by
building_id ) BRP
JOIN buildings B
on BRP.building_id = B.building_id
left join ( select building_id,
AVG( hi_temp ) avgTemp
from building_weather
where date BETWEEN CURDATE() AND CURDATE() + 10
group by building_id) BW
on BRP.building_id = BW.building_id
GROUP BY BRP.building_id
ORDER BY BRP.avgRent / 1 DESC
LIMIT 10;
Let's take a look at this query in detail. You want to report two different kinds of averages for each building. You need to compute those in separate subqueries; if you don't, you'll get a Cartesian combinatorial explosion.
One is an average of eleven days' worth of rent prices. You get that data with this subquery:
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
This subquery can be optimized by a compound covering index on building_rent_prices, consisting of (date, building_id, rent).
The next is an average of five days' worth of temperature.
SELECT building_id, AVG(high_temp) high_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
This can be optimized by a compound covering index on building_weather, consisting of (date, building_id, high_temp).
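The two covering indexes described above might be created like this (a sketch; the index names are arbitrary):

```sql
-- date first (for the WHERE range), then building_id (for the GROUP BY),
-- then the aggregated column, so each index covers its whole subquery.
ALTER TABLE building_rent_prices
  ADD INDEX idx_date_bldg_rent (date, building_id, rent);
ALTER TABLE building_weather
  ADD INDEX idx_date_bldg_temp (date, building_id, high_temp);
```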
Finally, you need to join these two subqueries to your buildings table to generate the final result set.
SELECT buildings.*, a.rent, b.high_temp
FROM buildings
LEFT JOIN (
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
) AS a ON buildings.building_id = a.building_id
LEFT JOIN (
SELECT building_id, AVG(high_temp) high_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
) AS b ON buildings.building_id = b.building_id
ORDER BY a.rent / buildings.square_feet DESC
LIMIT 10
Once the two subqueries are optimized, this one doesn't need anything except the building_id primary key.
In summary, to speed up this query, create the two compound indexes mentioned on the building_rent_prices and building_weather queries.
Don't use CURDATE + 4:
mysql> select CURDATE(), CURDATE() + 30, CURDATE() + INTERVAL 30 DAY;
+------------+----------------+-----------------------------+
| CURDATE() | CURDATE() + 30 | CURDATE() + INTERVAL 30 DAY |
+------------+----------------+-----------------------------+
| 2015-03-15 | 20150345 | 2015-04-14 |
+------------+----------------+-----------------------------+
Add INDEX(building_id) to the second and third tables.
If those don't fix it, come back with a revised query and schema, and I will look deeper.
First, your query against the WEATHER table covers only 4 days, while the RENT PRICES table covers 10 days. Since you don't have any join correlation between the two, you will get a Cartesian result of 40 records per building ID. Was that intentional, or just an unnoticed oops?
Second, I would adjust the query as I have below, but I have also adjusted BOTH the WEATHER and RENT PRICES subqueries to reflect the same date-range period. I start with a subquery of just the prices grouped by building and date, then join to buildings, then another subquery of weather grouped by building and date. But here, I join from the rent-prices subquery to the weather subquery on both building ID AND date, so it will at most retain a 1:1 ratio. I don't know why weather is even a consideration spanning date ranges.
However, to help with indexes, I would suggest the following:
Table                  Index on
buildings              (building_id)  <-- probably already exists as the PK
building_rent_prices   (date, building_id, rent)
building_weather       (date, building_id, hi_temp)
The purpose of each index is to take advantage of the WHERE clause (date first), THEN the GROUP BY (building_id), and to be a COVERING INDEX (it includes the rent column). The same reasoning applies to the building_weather table.
select
B.*,
BRP.avgRent,
BW.avgTemp
from
( select building_id,
AVG( rent ) avgRent
from
building_rent_prices
where
date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
group by
building_id
order by
building_id ) BRP
JOIN buildings B
on BRP.building_id = B.building_id
left join ( select building_id,
AVG( hi_temp ) avgTemp
from
building_weather
where
date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
group by
building_id ) BW
on BRP.building_id = BW.building_id
GROUP BY
BRP.building_id
ORDER BY
BRP.avgRent / B.square_feet DESC
LIMIT 10;
CLARIFICATION...
I can't guarantee the execution order, but in essence the two subqueries (the BRP and BW aliases) would be executed quickly before any join takes place. Since you wanted the average across (in my example) 10 days rather than a per-day join, I have removed "date" as a component of the GROUP BY, so each subquery returns at most one row per building.
Now, joining to the buildings table at a 1:1:1 ratio will limit the records in the final result set. This should take care of your concern about the average over the days in question.
For anyone who has an issue similar to mine: the solution is to GROUP each table you want to join by building_id, so that you join one-to-one with each average. Ollie Jones's query with JOIN rather than LEFT JOIN is the closest answer if you do not want results that lack data in any of the tables. The main issue I had was that I forgot to include the column behind AVG(low_temp) in the index. What I learned from this is that if you use an aggregate function in your SELECT, that column belongs in your indexes too. I added low_temp to it:
building_weather (date, building_id, hi_temp, low_temp), as suggested by Ollie and DRapp
ALTER TABLE building_weather ADD INDEX(date, building_id, hi_temp, low_temp);
SELECT buildings.*, a.rent, b.high_temp, b.low_temp
FROM buildings
JOIN (
SELECT building_id, AVG(rent) rent
FROM building_rent_prices
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 10 DAY
GROUP BY building_id
) AS a ON buildings.building_id = a.building_id
JOIN (
SELECT building_id, AVG(high_temp) high_temp, AVG(low_temp) low_temp
FROM building_weather
WHERE date BETWEEN CURDATE() AND CURDATE() + INTERVAL 4 DAY
GROUP BY building_id
) AS b ON buildings.building_id = b.building_id
ORDER BY a.rent / buildings.square_feet DESC
LIMIT 10
I have a table that has a unique key each time a user creates a case:
id | doctor_id | created_dt
---|-----------|--------------
 1 | 23        | datetimestamp
 2 | 23        | datetimestamp
 3 | 17        | datetimestamp
How can I select and return the average number of entries a user has per month?
I have tried this:
SELECT avg (id)
FROM `cases`
WHERE created_dt BETWEEN DATE_SUB(CURDATE(),INTERVAL 90 DAY) AND CURDATE()
and doctor_id = 17
But this returns a ridiculously large value that cannot be true.
To clarify: I am trying to get something like doctor id 17 has an average of 2 entries per month into this table.
I think you were thrown off by the idea of "averaging". You don't want the average id or the average doctor_id; you want the average number of entries in the table, so you would use COUNT():
SELECT doctor_id, COUNT(id)/3 AS AverageMonthlyCases
FROM `cases`
WHERE created_dt BETWEEN DATE_SUB(CURDATE(), INTERVAL 90 DAY) AND CURDATE()
GROUP BY doctor_id
Since you have a 90 day interval, you want to count the number of rows per 30 days, or the count/3.
SELECT doctor_id, AVG(cnt) AS avg_entries
FROM (
    SELECT doctor_id, COUNT(id) AS cnt
    FROM cases
    WHERE created_dt BETWEEN <yourDateInterval>
    GROUP BY doctor_id, YEAR(created_dt), MONTH(created_dt)
) t
GROUP BY doctor_id
Since you need the average number of entries, the AVG function is not directly applicable to a column: AVG is SUM()/COUNT(), and you obviously do not need the SUM of ids.
You need something like this
SELECT
    doctor_id,
    DATE_FORMAT(created_dt,'%m-%Y') AS month,
    COUNT(id) AS visits
FROM `cases`
GROUP BY
    `doctor_id`,
    DATE_FORMAT(created_dt,'%m-%Y')
ORDER BY
    `doctor_id` ASC,
    DATE_FORMAT(created_dt,'%m-%Y') ASC
To get visits per month per doctor. If you want to average it, you can then use something like
SELECT
    doctor_id,
    SUM(visits)/COUNT(month) AS `average`
FROM (
    SELECT
        doctor_id,
        DATE_FORMAT(created_dt,'%m-%Y') AS month,
        COUNT(id) AS visits
    FROM `cases`
    GROUP BY
        `doctor_id`,
        DATE_FORMAT(created_dt,'%m-%Y')
) t1
GROUP BY
    doctor_id
Obviously you can add your WHERE clauses; this query also works correctly across multiple years (i.e. it will not count January 2013 and January 2014 as one month).
It also takes into account a doctor's "blank" months with no patients: those months are simply not counted (zeros would drag down the average).
Use this; it groups each doctor's total entries by month:
SELECT MONTHNAME(created_dt) AS month, doctor_id, COUNT(id) AS total
FROM cases
GROUP BY 1, 2
ORDER BY 1;
You can also use GROUP_CONCAT() in a nested query to build a pivot-like table, where each column is a doctor_id.
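As a rough sketch of that pivot idea (column names assumed from the question's schema), GROUP_CONCAT can pack each month's per-doctor counts into a single row:

```sql
-- One row per month; each row lists doctor_id=count pairs.
SELECT month,
       GROUP_CONCAT(CONCAT(doctor_id, '=', visits) ORDER BY doctor_id) AS per_doctor
FROM (
    SELECT doctor_id,
           DATE_FORMAT(created_dt, '%Y-%m') AS month,
           COUNT(id) AS visits
    FROM cases
    GROUP BY doctor_id, DATE_FORMAT(created_dt, '%Y-%m')
) t
GROUP BY month;
```

A true one-column-per-doctor pivot needs dynamic SQL in MySQL, since the column list must be known when the query is parsed.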
I'm new to MySQL, and I'm running this query,
SELECT item_id,amount FROM db.invoice_line WHERE item_id = 'xxx'
OR item_id = 'yyy'
...
AND invoice_id IN
(SELECT id_invoices FROM db.invoices
WHERE customer = 'zzzz'
AND transaction_date > DATE_SUB(NOW(), INTERVAL 6 MONTH)
AND sales_rep = 'aaa') ORDER BY item_id;
That is, select some columns from a table where a foreign key is found in another table.
The issue is that I would like to also have, in the results, the customer name. However, the customer name is not found in the invoice line table, it is found in the invoice table.
While I could naively duplicate that data at table creation and on every insert, I was wondering if there is a SQL way to select the proper row from the invoice table and include it in the result set.
Is the performance better if I just duplicate data?
Thanks,
Dane
How about something like this?
SELECT
invoice_line.item_id,
invoice_line.amount,
invoices.customer_name
FROM db.invoice_line
INNER JOIN db.invoices
ON invoice_line.invoice_id = invoices.id_invoices
WHERE invoices.customer = 'zzzz'
AND invoices.transaction_date > DATE_SUB(CURRENT_DATE, INTERVAL 6 MONTH)
AND invoices.sales_rep = 'aaa'
AND (invoice_line.item_id = 'xxx' OR invoice_line.item_id = 'yyy')
ORDER BY invoice_line.item_id;
Use a JOIN between the tables to achieve this result.
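As for the performance question: indexes on the join and filter columns are usually preferable to duplicating data. A sketch, assuming the column names from the query (the index names are arbitrary):

```sql
-- Lets MySQL filter invoices by customer/sales_rep/date from the index,
-- and find an invoice's lines without scanning all of invoice_line.
ALTER TABLE invoices
  ADD INDEX idx_cust_rep_date (customer, sales_rep, transaction_date);
ALTER TABLE invoice_line
  ADD INDEX idx_invoice_item (invoice_id, item_id);
```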
All I want is to count entries based on date (i.e. entries with the same date).
My table is
You can see the 5th and 6th entries have the same date.
Now, the real problem, I think, is that the same-date entries have different times, so I am not getting what I want.
I am using this SQL:
SELECT COUNT( created_at ) AS entries, created_at
FROM wp_frm_items
WHERE user_id =1
GROUP BY created_at
LIMIT 0 , 30
What I am getting is this.
I want entries to be 2 for the date 2012-02-22.
The reason you get what you get is because you also compare the time, down to a second apart. So any entries created the same second will be grouped together.
To achieve what you actually want, you need to apply a date function to the created_at column:
SELECT COUNT(1) AS entries, DATE(created_at) as date
FROM wp_frm_items
WHERE user_id =1
GROUP BY DATE(created_at)
LIMIT 0 , 30
This would remove the time part from the column field, and so group together any entries created on the same day. You could take this further by removing the day part to group entries created on the same month of the same year etc.
To restrict the query to entries created in the current month, you add a WHERE-clause to the query to only select entries that satisfy that condition. Here's an example:
SELECT COUNT(1) AS entries, DATE(created_at) as date
FROM wp_frm_items
WHERE user_id = 1
AND created_at >= DATE_FORMAT(CURDATE(),'%Y-%m-01')
GROUP BY DATE(created_at)
Note: The COUNT(1)-part of the query simply means Count each row, and you could just as well have written COUNT(*), COUNT(id) or any other field. Historically, the most efficient approach was to count the primary key, since that is always available in whatever index the query engine could utilize. COUNT(*) used to have to leave the index and retrieve the corresponding row in the table, which was sometimes inefficient. In more modern query planners this is probably no longer the case. COUNT(1) is another variant of this that didn't force the query planner to retrieve the rows from the table.
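One concrete semantic difference is worth adding: COUNT(*) and COUNT(1) count every row, but COUNT(column) skips rows where that column is NULL. A sketch (the table and values are made up for illustration):

```sql
-- Given a hypothetical table `demo` containing:
-- id | note
--  1 | 'a'
--  2 | NULL
SELECT COUNT(*)    AS all_rows,   -- 2
       COUNT(1)    AS also_all,   -- 2
       COUNT(note) AS non_null    -- 1: COUNT(column) ignores NULLs
FROM demo;
```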
Edit: The query to group by month can be created in a number of different ways. Here is an example:
SELECT COUNT(1) AS entries, DATE_FORMAT(created_at,'%Y-%c') as month
FROM wp_frm_items
WHERE user_id =1
GROUP BY DATE_FORMAT(created_at,'%Y-%c')
You must eliminate the time portion with GROUP BY:
SELECT COUNT(*) AS entries, DATE(created_at) AS created_date
FROM wp_frm_items
WHERE user_id = 1
GROUP BY DATE(created_at)
LIMIT 0, 30
Oops, misread it.
Use GROUP BY DATE(created_at)
Try:
SELECT COUNT(created_at) AS entries, DATE(created_at) AS created_date
FROM wp_frm_items
WHERE user_id = 1
GROUP BY DATE(created_at)
LIMIT 0, 30
I have two tables, news and news_views. Every time an article is viewed, the news id, IP address and date is recorded in news_views.
I'm using a query with a subquery to fetch the most viewed titles from news, by getting the total count of views in the last 24 hours for each one.
It works fine, except that it takes 5-10 seconds to run, presumably because there are hundreds of thousands of rows in news_views and the query has to go through the entire table before it can finish. The query is as follows; is there any way at all it can be improved?
SELECT n.title
, nv.views
FROM news n
LEFT
JOIN (
SELECT news_id
, count( DISTINCT ip ) AS views
FROM news_views
WHERE datetime >= SUBDATE(now(), INTERVAL 24 HOUR)
GROUP
BY news_id
) AS nv
ON nv.news_id = n.id
ORDER
BY views DESC
LIMIT 15
I don't think you need to calculate the count of views as a derived table:
SELECT n.id, n.title, count( DISTINCT nv.ip ) AS views
FROM news n
LEFT JOIN news_views nv
ON nv.news_id = n.id
WHERE nv.datetime >= SUBDATE(now(), INTERVAL 24 HOUR)
GROUP BY n.id, n.title
ORDER BY views DESC LIMIT 15
The best advice here is to run these queries through EXPLAIN (or whatever mysql's equivalent is) to see what the query will actually do - index scans, table scans, estimated costs, etc. Avoid full table scans.
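For this particular query, a compound covering index on news_views would likely help the derived table the most. A sketch under those assumptions (the index name is arbitrary):

```sql
-- datetime first for the 24-hour range scan, then news_id for the
-- GROUP BY, then ip so COUNT(DISTINCT ip) is answered from the index.
ALTER TABLE news_views
  ADD INDEX idx_dt_news_ip (datetime, news_id, ip);

-- Then inspect the plan of the inner aggregation:
EXPLAIN
SELECT news_id, COUNT(DISTINCT ip) AS views
FROM news_views
WHERE datetime >= SUBDATE(NOW(), INTERVAL 24 HOUR)
GROUP BY news_id;
```

With this index the range scan reads only the last 24 hours of index entries, and Extra should show "Using index" rather than a full table scan.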