Improve performance in sum across multiple rows for multiple fields? - mysql

I have a MySQL query that selects multiple rows with 15 columns, joining two tables: one has 400,000 records and the other 9,000 records.
The join filters on a unique id in both tables, and the WHERE clause uses a date filter with BETWEEN ... AND ...; the date column mostly contains NULL values. Adding an index on both tables' columns reduced the time from 28 seconds to 17 seconds.
The SELECT statement returns 140 rows. Summing across multiple rows for multiple fields is what takes the most time. How can I improve the performance of this query? If anyone has experience with this, please share it.
SELECT A.xs_id,
       A.unique_id,
       SUM(A.xs_type) AS type,
       SUM(A.xs_item_type) AS item_type,
       SUM(A.xs_counterno) AS counterno,
       r.modified_date,
       SUM(A.sent) AS sent,
       SUM(A.sent_amt) AS amount,
       (SUM(A.sent_amt) + SUM(rec_amt)) AS total
FROM xs_data A
JOIN r_data r ON r.unique_id = A.unique_id
    AND r.summ_id = 1
    AND r.modified_date IS NOT NULL
WHERE DATE(r.modified_date) BETWEEN '2012-02-12' AND '2013-01-22'
GROUP BY DATE(r.modified_date),
         A.xs_id,
         A.unique_id;

If you have an index on modified_date, then the BETWEEN query can use it. But once you wrap the column in a function, as in DATE(modified_date), MySQL can no longer use the column's index, so it has to go through all the rows (a full table scan).
It could be helpful to use
WHERE `modified_date` >= '2012-02-12 00:00:00'
AND `modified_date` < '2013-01-23 00:00:00'
In any case, EXPLAIN SELECT will tell you more.
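Putting that together, a sketch of the same query with a date range the index can use (everything else is kept from the question; the upper bound assumes you want to include the whole of 2013-01-22, and the IS NOT NULL check is dropped because the range already excludes NULLs):
SELECT A.xs_id,
       A.unique_id,
       SUM(A.xs_type) AS type,
       SUM(A.xs_item_type) AS item_type,
       SUM(A.xs_counterno) AS counterno,
       r.modified_date,
       SUM(A.sent) AS sent,
       SUM(A.sent_amt) AS amount,
       (SUM(A.sent_amt) + SUM(rec_amt)) AS total
FROM xs_data A
JOIN r_data r ON r.unique_id = A.unique_id
    AND r.summ_id = 1
WHERE r.modified_date >= '2012-02-12 00:00:00'
  AND r.modified_date < '2013-01-23 00:00:00'
GROUP BY DATE(r.modified_date),
         A.xs_id,
         A.unique_id;
Only the WHERE clause changes; keeping DATE() in the GROUP BY does not stop the range filter from using an index on modified_date.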

Find the pain....
Delete the GROUP BY, delete the columns, and focus on the join. Your join is on unique_id AND summ_id AND modified_date, so you need an index with those three fields, the most selective field first; that is probably unique_id.
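A sketch of that index (the index name is made up, and it assumes summ_id and modified_date live in r_data, as the query suggests):
CREATE INDEX idx_unique_summ_date ON r_data (unique_id, summ_id, modified_date);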
To help you focus on the join only, write the result of the join into a temporary table, like:
CREATE TEMPORARY TABLE temp AS
SELECT A.unique_id
FROM ...
That way, if the join spits out 100k records, you don't get confronted with side effects (like sending all those records to your query tool).

Related

Make mysql query faster

I have this query, which takes around 29 seconds to run, and I need to make it faster. I have created an index on the aggregate_date column with still no real improvement. Each aggregate_date has almost 26k rows in the table.
One more thing: the query will run for dates starting from 1/1/2018 up to yesterday's date.
select MAX(os.aggregate_date) as lastMonthDay,
os.totalYTD
from (
SELECT aggregate_date,
Sum(YTD) AS totalYTD
FROM tbl_aggregated_tables
WHERE subscription_type = 'Subcription Income'
GROUP BY aggregate_date
) as os
GROUP by MONTH(os.aggregate_date), YEAR(os.aggregate_date);
I used EXPLAIN SELECT ... and checked the output.
Update:
Most of the query time is consumed by the inner query, so as scaisEdge suggested below, I tested the inner query on its own and the time is reduced to almost 8s.
The inner query looks like this:
select agt.aggregate_date,SUM(YTD)
from tbl_aggregated_tables as agt
FORCE INDEX(idx_aggregatedate_subtype_YTD)
WHERE agt.subscription_type = 'Subcription Income'
GROUP by agt.aggregate_date
I have noticed that the comparison WHERE agt.subscription_type = 'Subcription Income' takes most of the time. Is there any way to improve that? Note that the subscription_type column only has 2 values, 'Subcription Income' and 'Subcription Unit'.
The index on the aggregate_date column is not useful for performance because it is not involved in the WHERE condition.
Looking at your code, a useful index should be on the column subscription_type.
You could try a covering index, adding also the columns involved in the SELECT clause (so the query can get all its data from the index, avoiding access to the table). Your index could be:
CREATE INDEX idx1 ON tbl_aggregated_tables (subscription_type, aggregate_date, YTD);
Also, the meaning of the last GROUP BY does not seem coherent with the SELECT clause.
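The outer query groups by month and year but selects os.totalYTD without aggregating it, so MySQL picks an arbitrary value per month. If the intent is "the YTD total on the last aggregate_date of each month" (an assumption on my part; table and column names are taken from the question), a sketch of a coherent version could be:
SELECT os.aggregate_date AS lastMonthDay,
       os.totalYTD
FROM (
    SELECT aggregate_date, SUM(YTD) AS totalYTD
    FROM tbl_aggregated_tables
    WHERE subscription_type = 'Subcription Income'
    GROUP BY aggregate_date
) AS os
JOIN (
    SELECT MAX(aggregate_date) AS last_day
    FROM tbl_aggregated_tables
    WHERE subscription_type = 'Subcription Income'
    GROUP BY YEAR(aggregate_date), MONTH(aggregate_date)
) AS last_days ON os.aggregate_date = last_days.last_day;
This returns one row per month, with totalYTD taken from that month's last day, instead of relying on whichever totalYTD value MySQL happens to pick for the non-aggregated column.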

What SQL indexes to put for big table

I have two big tables from which I mostly select, but complex queries with 2 joins are extremely slow.
The first table is GameHistory, in which I store a record for every finished game (I have 15 games in a separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store a record for every player who participated in a certain game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
First table has 1.5 million records and the second one 3.5 million.
What indexes should I add? (I tried some and they were all slow.)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, because you apply a function to the field.
You should have an index on both fields game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
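A sketch of that composite index (the index name is made up):
CREATE INDEX idx_game_date ON GameHistory (game_id, date_end);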
It would even be better to have an index on date_end's date part rather than on the full datetime value. This is not possible in MySQL, however. So consider adding another column, trunc_date_end, for the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time:
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
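A minimal sketch of that setup (the column, trigger, and index names are my own placeholders):
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;

-- keep the new column in sync on insert
CREATE TRIGGER trg_gamehistory_trunc_date
BEFORE INSERT ON GameHistory
FOR EACH ROW
SET NEW.trunc_date_end = DATE(NEW.date_end);

CREATE INDEX idx_trunc_date_game ON GameHistory (trunc_date_end, game_id);
Existing rows would need a one-off backfill, e.g. UPDATE GameHistory SET trunc_date_end = DATE(date_end).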
Add the EXPLAIN keyword at the beginning of your query, then run it in a database viewer (e.g. SQLyog) and you will see the details about the query. Look at the 'rows' column and you will see different integer values. Then index the table columns indicated in the EXPLAIN result that show large row counts.
(I think my explanation is kind of messy; feel free to ask for clarification.)

MySQL, return all measurements and results within X last hours

This question is very much related to my previous question, MySQL, return all results within X last hours, although with an additional significant constraint:
Now I have 2 tables, one for measurements and one for classified results for part of the measurements.
Measurements arrive constantly, and so do results, which are added after new measurements are classified.
Results will not necessarily be stored in the same order in which the measurements arrived and were stored!
I am only interested in presenting the latest results. By latest I mean: take the max time (the time is part of the measurement structure) of the last available result, call it Y, and a range of X seconds, and present the measurements together with the available results in the range between Y-X and Y.
The following are the structure of 2 tables:
event table:
CREATE TABLE `event_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`Feature` char(256) NOT NULL,
`UnixTimeStamp` int(10) unsigned NOT NULL,
`Value` double NOT NULL,
KEY `ix_filter` (`Feature`),
KEY `ix_time` (`UnixTimeStamp`),
KEY `id_index` (`id`)
) ENGINE=MyISAM
classified results table:
CREATE TABLE `event_results` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`level` enum('NORMAL','SUSPICIOUS') DEFAULT NULL,
`score` double DEFAULT NULL,
`eventId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `eventId_index` (`eventId`)
) ENGINE=MyISAM
I can't simply query for the last measurement's timestamp, because I want to present measurements for which results are currently available, and since measurements arrive constantly, results may not yet exist for the newest measurements.
Therefore I thought of joining the two tables using
event_results.eventId = event_data.id and then selecting the max of event_data.UnixTimeStamp as maxTime. After I have maxTime, I need to do the same operation again (joining the 2 tables) and adding a condition in the WHERE clause:
WHERE event_data.UnixTimeStamp >= maxTime + INTERVAL -X SECOND
It seems inefficient to execute 2 joins just to achieve this. Do you have a more efficient suggestion?
From my understanding, you are using an aggregate function, MAX. This produces a record set of size one, the highest time, which you will then work from. Therefore, it needs to be broken out into a subquery (as you say, a nested select). You HAVE to do 2 queries at some point. (The answer to your last question has 2 queries in it, in the form of subqueries/nested selects.)
The main time subqueries cause problems is when you put the subquery in the SELECT part of the query, as it is executed for every row, which makes the query run dramatically slower as the result set grows. Let's take the answer to your last question and write it in a horrible, inefficient way:
SELECT timeStart,
       (SELECT MAX(timeStart) FROM events) AS maxTime
FROM events
WHERE timeStart > (SELECT MAX(timeStart) FROM events) + INTERVAL -1 SECOND
This will run a select for the maximum timeStart for every events row. It produces the same result, but it is slow. This is where the fear of subqueries comes from.
It also evaluates the aggregate function MAX for each row, even though it returns the same answer every time. So instead, you perform that subquery ONCE, rather than once per row.
In the answer to your last question, the MAX subquery part is run once and used to filter in the WHERE clause, and that outer select is run once. So, in total, 2 queries are run.
Two fast queries run one after the other are faster than one very slow query.
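For illustration, a minimal sketch of the efficient shape being described, with the MAX subquery only in the WHERE clause (table and column names follow the example above, so treat them as assumptions):
-- the uncorrelated MAX subquery is evaluated once and feeds the filter
SELECT timeStart
FROM events
WHERE timeStart > (SELECT MAX(timeStart) FROM events) + INTERVAL -1 SECOND;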
I'm not entirely sure what resultset you want returned, so I am going to make some assumptions. Please feel free to correct any assumptions I've made.
It sounds (to me) like you want ALL rows from event_data that are within an hour (or however many seconds) of the absolute "latest" timestamp, and along with those rows, you also want to return any related rows from event_results, if any matching rows are available.
If that's the case, then using an inline view to retrieve the maximum value of timestamp is the way to go. (That operation will be very efficient, since the query will be returning a single row, and it can be efficiently retrieved from an existing index.)
Since you want all rows from a specified period of time (from the "latest time" back to "latest time minus X seconds"), we can go ahead and calculate the starting timestamp of the period in that same query. Here we assume you want to "go back" one hour (=60*60 seconds):
SELECT MAX(UnixTimeStamp) - 3600 FROM event_data
NOTE: the expression in the SELECT list above is based on UnixTimeStamp column defined as integer type, rather than as a DATETIME or TIMESTAMP datatype. If the column were defined as DATETIME or TIMESTAMP datatype, we would likely express that with something like this:
SELECT MAX(mydatetime) + INTERVAL -3600 SECOND
(We could specify the interval units in minutes, hours, etc.)
We can use the result from that query in another query. To do that in the same query text, we simply wrap that query in parentheses, and reference it as a rowsource, as if that query were an actual table. This allows us to get all the rows from event_data that are within in the specified time period, like this:
SELECT d.id
     , d.Feature
     , d.UnixTimeStamp
     , d.Value
  FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
           FROM event_data l
       ) m
  JOIN event_data d
    ON d.UnixTimeStamp >= m.from_unixtimestamp
In this particular case, there's no need for an upper bound predicate on UnixTimeStamp column in the outer query. This is because we already know there are no values of UnixTimeStamp that are greater than the MAX(UnixTimeStamp), which is the upper bound of the period we are interested in.
(We could add an expression to the SELECT list of the inline view, to return MAX(l.UnixTimeStamp) AS to_unixtimestamp, and then include a predicate like AND d.UnixTimeStamp <= m.to_unixtimestamp in the outer query, but that would be unnecessarily redundant.)
You also specified a requirement to return information from the event_results table.
I believe you said that you wanted any related rows that are "available". This suggests (to me) that if no matching row is "available" from event_results, you still want to return the row from the event_data table.
We can use a LEFT JOIN operation to get that to happen:
SELECT d.id
     , d.Feature
     , d.UnixTimeStamp
     , d.Value
     , r.id
     , r.level
     , r.score
     , r.eventId
  FROM ( SELECT MAX(l.UnixTimeStamp) - 3600 AS from_unixtimestamp
           FROM event_data l
       ) m
  JOIN event_data d
    ON d.UnixTimeStamp >= m.from_unixtimestamp
  LEFT
  JOIN event_results r
    ON r.eventId = d.id
Since there is no unique constraint on the eventID column in the event_results table, there is a possibility that more than one "matching" row from event_results will be found. Whenever that happens, the row from event_data table will be repeated, once for each matching row from event_results.
If there is no matching row from event_results, then the row from event_data will still be returned, but with the columns from the event_results table set to NULL.
For performance, remove any columns from the SELECT list that you don't need returned, and be judicious in your choice of expressions in an ORDER BY clause. (The addition of a covering index may improve performance.)
For the statement as written above, MySQL is likely to use the ix_time index on the event_data table, and the eventId_index index on the event_results table.
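If you do try a covering index on event_results, a sketch might look like this (the index name is made up; id is listed explicitly because MyISAM secondary indexes do not implicitly carry the primary key):
-- lets the r.id, r.level and r.score columns be read from the index alone
CREATE INDEX ix_event_results_cover ON event_results (eventId, level, score, id);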

My SQL Query is very slow with 50K+ records

I have a table called links with 50,000+ records, where three of the fields have indexes, in the order link_id, org_id, data_id.
The problem is that when I use a GROUP BY SQL query, it takes a long time to load.
The query is like this:
SELECT DISTINCT `links`.*
FROM `links`
WHERE `links`.`org_id` = 2 AND (link !="")
GROUP BY link
The table has 20+ columns.
Is there any way to speed up this query?
Build an index on (org_id, link)
org_id optimizes your WHERE clause, and link serves your GROUP BY (and is also part of the WHERE).
Having link_id in the first position of your existing index is probably what is holding your query back.
create index LinksByOrgAndLink on Links ( Org_ID, Link );
MySQL Create Index syntax
The problem is in your DISTINCT `links`.*.
The GROUP BY is already doing the work of DISTINCT, so you are de-duplicating twice: once for the SELECT DISTINCT and once for the GROUP BY.
Try this:
SELECT *
FROM `links`
WHERE `org_id` = 2 AND (link != "")
GROUP BY link
I guess adding an index on your link column would improve the result.
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
Only select the columns that you need.
Why is there a distinct for links.*?
Do you have some rows exactly doubled in your table?
On the other hand, changing the value "" to NULL could improve your SELECT statement, but I am not sure about this.

Get last distinct set of records

I have a database table containing the following columns:
id code value datetime timestamp
In this table the only unique values reside in id i.e. primary key.
I want to retrieve the last distinct set of records in this table based on the datetime value. For example, let's say below is my table
id  code  value  datetime             timestamp
1   1023  23.56  2011-04-05 14:54:52  1234223421
2   1024  23.56  2011-04-05 14:55:52  1234223423
3   1025  23.56  2011-04-05 14:56:52  1234223424
4   1023  23.56  2011-04-05 14:57:52  1234223425
5   1025  23.56  2011-04-05 14:58:52  1234223426
6   1025  23.56  2011-04-05 14:59:52  1234223427
7   1024  23.56  2011-04-05 15:00:12  1234223428
8   1026  23.56  2011-04-05 15:01:14  1234223429
9   1025  23.56  2011-04-05 15:02:22  1234223430
I want to retrieve the records with IDs 4, 7, 8, and 9 i.e. the last set of records with distinct codes (based on datetime value). What I have highlighted is simply an example of what I'm trying to achieve, as this table is going to eventually contain millions of records, and hundreds of individual code values.
What SQL statement can I use to achieve this? I can't seem to get it done with a single SQL statement. My database is MySQL 5.
This should work for you.
SELECT *
FROM [tableName]
WHERE id IN (SELECT MAX(id) FROM [tableName] GROUP BY code)
If id is AUTO_INCREMENT, there's no need to worry about the datetime which is far more expensive to compute, as the most recent datetime will also have the highest id.
Update: From a performance standpoint, make sure the id and code columns are indexed when dealing with a large number of records. If id is the primary key, this is built in, but you may need to add a non-clustered index covering code and id.
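A sketch of such an index (the table name is a placeholder, since the question only gives [tableName]):
-- lets MAX(id) ... GROUP BY code be answered from the index alone
CREATE INDEX idx_code_id ON your_table (code, id);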
Try this:
SELECT *
FROM <YOUR_TABLE>
WHERE (code, datetime, timestamp) IN
(
SELECT code, MAX(datetime), MAX(timestamp)
FROM <YOUR_TABLE>
GROUP BY code
)
It's an old post, but testing @smdrager's answer with large tables was very slow. My fix was to use an INNER JOIN instead of WHERE ... IN.
SELECT *
FROM [tableName] as t1
INNER JOIN (SELECT MAX(id) as id FROM [tableName] GROUP BY code) as t2
ON t1.id = t2.id
This worked really fast.
I'd try something like this:
select * from table
where id in (
select id
from table
group by code
having datetime = max(datetime)
)
(disclaimer: this is not tested)
If the row with the biggest datetime also has the biggest id, the solution proposed by smdrager is quicker.
It looks like all the existing answers suggest doing GROUP BY code on the whole table. While that is logically correct, in reality such a query goes through the whole(!) table (use EXPLAIN to make sure). In my case, I have less than 500k rows in the table, and executing ... GROUP BY code takes 0.3 seconds, which is absolutely not acceptable.
However, I can use knowledge of my data here (read it as "show the last comments for posts"):
I need to select just the top 20 records
The number of records with the same code across the last X records is relatively small (roughly uniform distribution of comments across posts; there is no "viral" post that got all the recent comments)
Total number of records >> number of available codes >> number of "top" records you want to get
By experimenting with the numbers, I found out that I can always find 20 different codes if I select just the last 50 records. In this case the following query works (keeping in mind @smdrager's point about using id instead of datetime):
SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50
Selecting just the last 50 entries is super quick, because it doesn't need to scan the whole table. The rest is to select the top 20 with distinct code out of those 50 entries.
Obviously, queries over a set of 50 (or 100, or 500) elements are significantly faster than over the whole table with hundreds of thousands of entries.
Raw SQL "Postprocessing"
SELECT MAX(id) as id, code FROM
(SELECT id, code
FROM tablename
ORDER BY id DESC
LIMIT 50) AS nested
GROUP BY code
ORDER BY id DESC
LIMIT 20
This gives you the list of ids really quickly, and if you want to perform additional JOINs, put this query as yet another nested query and perform all the joins on it.
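A sketch of that nesting (assuming you want the full rows back from tablename; the alias names are mine):
SELECT t.*
FROM (
    SELECT MAX(id) AS id, code
    FROM (SELECT id, code
          FROM tablename
          ORDER BY id DESC
          LIMIT 50) AS nested
    GROUP BY code
    ORDER BY id DESC
    LIMIT 20
) AS top_ids
JOIN tablename AS t ON t.id = top_ids.id;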
Backend-side "Postprocessing"
After that, you process the data in your programming language to include in the final set only the records with a distinct code.
Some kind of Python pseudocode:
# select_simple_top_records(50) stands for running the "last 50 rows"
# query shown above and returning the rows as dicts
records = select_simple_top_records(50)

added_codes = set()
top_records = []
for record in records:
    # Skip this row if we already kept a record for this code
    # (added_codes is a set, so membership checks are O(1))
    if record['code'] in added_codes:
        continue
    # Keep the record
    top_records.append(record)
    added_codes.add(record['code'])
    # Stop once we have the 20 records we need
    if len(top_records) >= 20:
        break