MySQL value difference optimization - mysql

Hello guys,
I'm running a very large database (currently more than 5 million rows). My database stores custom generated numbers (what they are composed of doesn't really matter here) together with the corresponding date. In addition, an ID is stored for every product (meaning one product can have multiple entries for different dates, so the primary key is composite). Now I want to SELECT the top 10 IDs with the largest difference in their numbers between the last two days. Currently I try to achieve this using JOINs, but since I have that many rows this approach is far too slow. How could I speed up the whole operation?
SELECT
d1.place,d2.place,d1.ID
FROM
daily
INNER JOIN
daily AS d1 ON d1.date = CURDATE()
INNER JOIN
daily as d2 ON d2.date = DATE_ADD(CURDATE(), INTERVAL -1 DAY)
ORDER BY
d2.code-d1.code LIMIT 10
EDIT: This is what my structure looks like:
CREATE TABLE IF NOT EXISTS `daily` (
`ID` bigint(40) NOT NULL,
`source` char(20) NOT NULL,
`date` date NOT NULL,
`code` int(11) NOT NULL,
`cc` char(2) NOT NULL,
PRIMARY KEY (`ID`,`source`,`date`,`cc`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the output of the EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE d1 ALL PRIMARY NULL NULL NULL 5150350 Using where; Using temporary; Using filesort
1 SIMPLE d2 ref PRIMARY PRIMARY 8 mytable.d1.ID 52 Using where

How about this?
SELECT
d1.ID, d1.place, d2.place
FROM
daily AS d1
CROSS JOIN
daily AS d2
USING (ID)
WHERE
d1.date = CURDATE()
AND d2.date = CURDATE() - INTERVAL 1 DAY
ORDER BY
d2.code - d1.code DESC
LIMIT
10
Some thoughts about your table structure.
`ID` bigint(40) NOT NULL,
Why BIGINT? You would need to be doing 136 inserts per second, 24 hours a day, 7 days a week, for a year to exhaust the range of an UNSIGNED INT. And before you get halfway there, your application will probably need a professional DBA anyway.
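(As a quick sanity check on that number: an UNSIGNED INT tops out at 4,294,967,295, and a year has 365 × 24 × 3,600 = 31,536,000 seconds, so 4,294,967,295 / 31,536,000 ≈ 136 inserts per second.)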
Remember, a smaller primary index leads to faster lookups, which brings us to:
PRIMARY KEY (`ID`,`source`,`date`,`cc`)
Why? A single-column PK on the ID column should be enough. If you need indexes on other columns, create additional indexes (and do it wisely). As it is, you basically have a covering index for the entire table... which is like having the entire table in the index.
Last but not least: where is the place column? You've used it in your query (and so did I in mine), but it's nowhere to be seen.
Proposed table structure:
CREATE TABLE IF NOT EXISTS `daily` (
`ID` int(10) UNSIGNED NOT NULL, -- usually AUTO_INCREMENT is used as well
`source` char(20) NOT NULL,
`date` date NOT NULL,
`code` int(11) NOT NULL,
`cc` char(2) NOT NULL,
PRIMARY KEY (`ID`),
KEY `ID_date` (`ID`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Related

Very slow query on mysql table with 35 million rows

I am trying to figure out why a query is so slow on my MySQL database. I've read various content about MySQL performance and various SO questions, but this remains a riddle to me.
I am using MySQL 5.6.23-log - MySQL Community Server (GPL)
I have a table with roughly 35 million rows.
Rows are inserted into this table about 5 times per second.
The table structure is shown in the EDIT below. I have indexes on all the columns except for answer_text.
The query I'm running is:
SELECT answer_id, COUNT(1)
FROM answers_onsite a
WHERE a.screen_id=384
AND a.timestamp BETWEEN 1462670000000 AND 1463374800000
GROUP BY a.answer_id
This query takes roughly 20-30 seconds and then returns its result set.
Any insights?
EDIT
As asked, here is my SHOW CREATE TABLE:
CREATE TABLE `answers_onsite` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`device_id` bigint(20) unsigned NOT NULL,
`survey_id` bigint(20) unsigned NOT NULL,
`answer_set_group` varchar(255) NOT NULL,
`timestamp` bigint(20) unsigned NOT NULL,
`screen_id` bigint(20) unsigned NOT NULL,
`answer_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`answer_text` text,
PRIMARY KEY (`id`),
KEY `device_id` (`device_id`),
KEY `survey_id` (`survey_id`),
KEY `answer_set_group` (`answer_set_group`),
KEY `timestamp` (`timestamp`),
KEY `screen_id` (`screen_id`),
KEY `answer_id` (`answer_id`)
) ENGINE=InnoDB AUTO_INCREMENT=35716605 DEFAULT CHARSET=utf8
ALTER TABLE answers_onsite ADD key complex_index (screen_id,`timestamp`,answer_id);
You can use MySQL partitioning like this:
alter table answers_onsite drop primary key;
alter table answers_onsite add primary key (id, timestamp) partition by HASH(id) partitions 500;
Running the above may take a while depending on the size of your table.
Look at your WHERE clause:
WHERE a.screen_id=384
AND a.timestamp BETWEEN 1462670000000 AND 1463374800000
GROUP BY a.answer_id
I would create a composite index (screen_id, answer_id, timestamp) and run some tests.
You could also try (screen_id, timestamp, answer_id) to see if it performs better.
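For reference, a minimal sketch of those two candidate indexes (the index names here are placeholders):
CREATE INDEX idx_screen_answer_ts ON answers_onsite (screen_id, answer_id, `timestamp`);
CREATE INDEX idx_screen_ts_answer ON answers_onsite (screen_id, `timestamp`, answer_id);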
The BETWEEN clause is known to be slow though, as is any range query. So is COUNT over millions of rows. I would count once a day and save the result to a 'Stats' table, which you can query whenever you need it... obviously only if you do not need live data.
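A minimal sketch of such a 'Stats' table and its nightly refresh (the table and column names are assumptions, and the range math assumes the timestamp column stores milliseconds, as the sample values suggest):
CREATE TABLE answer_stats (
`stat_date` date NOT NULL,
`screen_id` bigint(20) unsigned NOT NULL,
`answer_id` bigint(20) unsigned NOT NULL,
`answer_count` int unsigned NOT NULL,
PRIMARY KEY (`stat_date`,`screen_id`,`answer_id`)
) ENGINE=InnoDB;
-- nightly job: aggregate yesterday's rows into one stats row per screen/answer
INSERT INTO answer_stats (stat_date, screen_id, answer_id, answer_count)
SELECT CURDATE() - INTERVAL 1 DAY, screen_id, answer_id, COUNT(*)
FROM answers_onsite
WHERE `timestamp` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY) * 1000
AND `timestamp` < UNIX_TIMESTAMP(CURDATE()) * 1000
GROUP BY screen_id, answer_id;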

MySQL: SUM/MAX/MIN GROUP BY query optimize

I have a table of bitcoin transactions:
CREATE TABLE `transactions` (
`trans_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`trans_exchange` int(10) unsigned DEFAULT NULL,
`trans_currency_base` int(10) unsigned DEFAULT NULL,
`trans_currency_counter` int(10) unsigned DEFAULT NULL,
`trans_tid` varchar(20) DEFAULT NULL,
`trans_type` tinyint(4) DEFAULT NULL,
`trans_price` decimal(15,4) DEFAULT NULL,
`trans_amount` decimal(15,8) DEFAULT NULL,
`trans_datetime` datetime DEFAULT NULL,
`trans_sid` bigint(20) DEFAULT NULL,
`trans_timestamp` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`trans_id`),
KEY `trans_tid` (`trans_tid`),
KEY `trans_datetime` (`trans_datetime`),
KEY `trans_timestmp` (`trans_timestamp`),
KEY `trans_price` (`trans_price`),
KEY `trans_amount` (`trans_amount`)
) ENGINE=MyISAM AUTO_INCREMENT=6162559 DEFAULT CHARSET=utf8;
As you can see from the AUTO_INCREMENT value, the table has over 6 million entries. There will eventually be many more.
I would like to query the table to obtain max price, min price, volume and total amount traded during arbitrary time intervals. To accomplish this, I'm using a query like this:
SELECT
DATE_FORMAT( MIN(transactions.trans_datetime),
'%Y/%m/%d %H:%i:00'
) AS trans_datetime,
SUM(transactions.trans_amount) as trans_volume,
MAX(transactions.trans_price) as trans_max_price,
MIN(transactions.trans_price) as trans_min_price,
COUNT(transactions.trans_id) AS trans_count
FROM
transactions
WHERE
transactions.trans_datetime BETWEEN '2014-09-14 00:00:00' AND '2015-09-13 23:59:00'
GROUP BY
transactions.trans_timestamp DIV 86400
That should select transactions made over a year period, grouped by day (86,400 seconds).
The idea is that the timestamp field, which contains the same value as the datetime column but as a Unix timestamp (I found this faster than using UNIX_TIMESTAMP(trans_datetime)), is divided by the number of seconds I want in each time interval.
The problem: the query is slow. I'm getting 4+ seconds processing time. Here is the result of EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE transactions ALL trans_datetime,trans_timestmp NULL NULL NULL 6162558 Using where; Using temporary; Using filesort
The question: is it possible to optimize this better? Is this structure or approach flawed? I have tried several approaches, and have only succeeded in making modest millisecond-type gains.
Most of the data in the table is for the last 12 months? So you need to touch most of the table? Then there is no way to speed that query up. However, you can get the same output orders of magnitude faster...
Create a summary table. It would have a DATE as the PRIMARY KEY, and its columns would effectively be the fields mentioned in your SELECT.
Once you have initially populated the summary table, then maintain it by adding a new row each night for the day's transactions. More in my blog.
Then the query to get the desired output would hit this summary table (with only a few hundred rows), not the table with millions of rows.
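A minimal sketch of such a summary table and its nightly load (the table name and column types are assumptions):
CREATE TABLE transactions_daily (
`trans_date` date NOT NULL,
`trans_volume` decimal(20,8) DEFAULT NULL,
`trans_max_price` decimal(15,4) DEFAULT NULL,
`trans_min_price` decimal(15,4) DEFAULT NULL,
`trans_count` int unsigned DEFAULT NULL,
PRIMARY KEY (`trans_date`)
);
-- nightly job: add yesterday's row
INSERT INTO transactions_daily
SELECT DATE(trans_datetime), SUM(trans_amount), MAX(trans_price), MIN(trans_price), COUNT(*)
FROM transactions
WHERE trans_datetime >= CURDATE() - INTERVAL 1 DAY
AND trans_datetime < CURDATE()
GROUP BY DATE(trans_datetime);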

Real-time aggregation on a table with millions of records

I'm dealing with an ever-growing table which contains about 5 million records at the moment. About 100,000 new records are added daily.
The table contains information about ad campaigns and is joined in the query with another table:
CREATE TABLE `statistics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range_id` int(11) DEFAULT NULL,
`campaign_id` int(11) DEFAULT NULL,
`payout` decimal(5,2) DEFAULT NULL,
`is_converted` tinyint(1) unsigned NOT NULL DEFAULT '0',
`converted` datetime DEFAULT NULL,
`created` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `created` (`created`),
KEY `converted` (`converted`),
KEY `campaign_id` (`campaign_id`),
KEY `ip_range_id` (`ip_range_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The other table contains IP ranges:
CREATE TABLE `ip_ranges` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range` varchar(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ip_range` (`ip_range`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The aggregation query is as follows:
SELECT
SUM(`payout`) AS `revenue`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id`) AS `clicks`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id` AND `is_converted` = 1) AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20
The query takes about 20 seconds to complete.
This is what EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY ip_range index PRIMARY PRIMARY 4 NULL 306552 Using index; Using temporary; Using filesort
1 PRIMARY statistic ref ip_range_id ip_range_id 5 db.ip_range.id 8 Using where
3 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where
2 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where; Using index
Caching the clicks and conversions in the ip_ranges table as extra columns is not an option, because I need to be able to also filter on the campaign_id column (and possibly other columns in the future). So these aggregations need to be somewhat real-time.
What is the best strategy for doing aggregation over large tables on multiple dimensions in near real-time?
Note that I'm not necessarily looking to just make this query better; I'm also interested in strategies that might involve other database systems (NoSQL) and/or distributing the data over different servers, etc.
Your query looks overly complicated. There is no need to query the same table again and again:
select
sum(payout) as revenue,
count(*) as clicks,
sum(s.is_converted = 1) as conversions
from ip_ranges r
inner join statistics s on r.id = s.ip_range_id
group by r.id
order by clicks desc
limit 20;
EDIT (after acceptance): As to your actual question on how to deal with a task like this:
You want to look at all the data in your table, and you want your result to be up to date. Then there is no option other than reading all the data (full table scans). If the tables are wide (i.e. have many columns), you may want to create covering indexes (i.e. indexes that contain all the columns involved), so that the index is read instead of the table. Well, what else? For full table scans it is advisable to use parallel access, which MySQL doesn't provide, as far as I know. So you might want to switch to another DBMS. Then see what else that DBMS offers. Maybe parallel querying would benefit from partitioning the tables. The last thing that comes to mind is hardware, i.e. more CPUs, faster drives, etc.
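As a sketch of that covering-index idea for the simplified query above (the index name is a placeholder): an index containing ip_range_id, is_converted and payout lets the aggregation be answered from the index alone.
ALTER TABLE statistics ADD KEY ip_range_covering (ip_range_id, is_converted, payout);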
Another option might be to remove old data from your tables. Say you need the details of the current year, but only the aggregated data for previous years. Then have another table old_statistics holding only the sums and counts needed, e.g.
CREATE TABLE old_statistics
(
ip_range_id int(11) NOT NULL, -- column types here are illustrative
revenue decimal(15,2) DEFAULT NULL,
conversions int unsigned DEFAULT NULL,
PRIMARY KEY (ip_range_id)
);
Then you'd aggregate the data from statistics, which would be much smaller by then because it would hold only data for the current year, and add in old_statistics to get the results.
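A minimal sketch of that combination, assuming old_statistics holds one pre-aggregated row per ip_range_id (a clicks count could be carried along the same way):
SELECT ip_range_id,
SUM(revenue) AS revenue,
SUM(conversions) AS conversions
FROM (
SELECT ip_range_id, SUM(payout) AS revenue, SUM(is_converted = 1) AS conversions
FROM statistics
GROUP BY ip_range_id
UNION ALL
SELECT ip_range_id, revenue, conversions
FROM old_statistics
) AS combined
GROUP BY ip_range_id;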
Try this
SELECT
SUM(`payout`) AS `revenue`,
SUM(case when `ip_range_id` = `IpRange`.`id` then 1 else 0 end) AS `clicks`,
SUM(case when `ip_range_id` = `IpRange`.`id` and `is_converted` = 1 then 1 else 0 end)
AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20

MySQL Simple Moving Average Calculation

The following MySQL UPDATE statement seems to take an excessive amount of time to execute for the recordset provided (~5000 records). The update statement below takes on average 12 seconds to execute. I currently plan to run this calculation for 5 different periods and about 500 different stock symbols. That translates into 12 s * 5 calculations * 500 symbols = 30,000 seconds, or about 8.33 hours.
Update Statement:
UPDATE tblStockDataMovingAverages_AAPL JOIN
(SELECT t1.Sequence,
(
SELECT AVG(t2.Close)
FROM tblStockDataMovingAverages_AAPL AS t2
WHERE (t1.Sequence - t2.Sequence) BETWEEN 0 AND 7
)AS "8SMA"
FROM tblStockDataMovingAverages_AAPL AS t1
ORDER BY t1.Sequence) AS ma_query
ON tblStockDataMovingAverages_AAPL.Sequence = ma_query.Sequence
SET tblStockDataMovingAverages_AAPL.8MA_Price = ma_query.8SMA
Table Design:
CREATE TABLE `tblStockDataMovingAverages_AAPL` (
`Symbol` char(6) NOT NULL DEFAULT '',
`TradeDate` date NOT NULL DEFAULT '0000-00-00',
`Sequence` int(11) DEFAULT NULL,
`Close` decimal(18,5) DEFAULT NULL,
`200MA_Price` decimal(18,5) DEFAULT NULL,
`100MA_Price` decimal(18,5) DEFAULT NULL,
`50MA_Price` decimal(18,5) DEFAULT NULL,
`20MA_Price` decimal(18,5) DEFAULT NULL,
`8MA_Price` decimal(18,5) DEFAULT NULL,
`50_200_Cross` int(5) DEFAULT NULL,
PRIMARY KEY (`Symbol`,`Sequence`),
KEY `idxSequnce` (`Sequence`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1$$
Any help on speeding up the process would be greatly appreciated.
Output of Select Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY t1 index NULL idxSymbol_Sequnce 11 NULL 5205 Using index; Using filesort
2 DEPENDENT SUBQUERY t2 ALL NULL NULL NULL NULL 5271 Using where
This should be a little better:
update tblStockDataMovingAverages_AAPL
join (
select t1.sequence as sequence, avg(t2.close) as av
from tblStockDataMovingAverages_AAPL t1
join tblStockDataMovingAverages_AAPL t2
on t2.sequence BETWEEN t1.sequence-7 AND t1.sequence
group by t1.sequence
) t1 on tblStockDataMovingAverages_AAPL.sequence = t1.sequence
set 8MA_Price = t1.av
With regard to my BETWEEN condition: field1 OPERATOR expression(field2) is easier to optimise than expression(field1, field2) OPERATOR expression in the ON condition. I think this holds for BETWEEN as well.
It looks like the ORDER BY in your query is unnecessary, and removing it might speed your query up a ton.
If the data for the other stock symbols lives in the same table, stick them all into a single update query (different periods won't work that way, though); this would likely be far faster than running it once per symbol (see the sketch below).
As already suggested, adding an index on Close may help.
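A minimal sketch of that single-query variant, assuming all symbols share this one table and that (Symbol, Sequence) identifies a row, as in the posted table definition:
-- 8-period simple moving average for every symbol in one pass
UPDATE tblStockDataMovingAverages_AAPL AS base
JOIN (
SELECT t1.Symbol, t1.Sequence, AVG(t2.`Close`) AS av
FROM tblStockDataMovingAverages_AAPL AS t1
JOIN tblStockDataMovingAverages_AAPL AS t2
ON t2.Symbol = t1.Symbol
AND t2.Sequence BETWEEN t1.Sequence - 7 AND t1.Sequence
GROUP BY t1.Symbol, t1.Sequence
) AS ma ON base.Symbol = ma.Symbol AND base.Sequence = ma.Sequence
SET base.`8MA_Price` = ma.av;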
You can optimize it slightly by adding an index on the Close field, which should make the AVG function more efficient. Please share a dump of your dataset so we can look at it more closely.

mysql where + group by very slow

One question that I should be able to answer myself, but I can't, and I also don't find any answer on Google:
I have a table that contains 5 million rows with this structure:
CREATE TABLE IF NOT EXISTS `files_history2` (
`FILES_ID` int(10) unsigned DEFAULT NULL,
`DATE_FROM` date DEFAULT NULL,
`DATE_TO` date DEFAULT NULL,
`CAMPAIGN_ID` int(10) unsigned DEFAULT NULL,
`CAMPAIGN_STATUS_ID` int(10) unsigned DEFAULT NULL,
`ON_HOLD` decimal(1,0) DEFAULT NULL,
`DIVISION_ID` int(11) DEFAULT NULL,
KEY `DATE_FROM` (`DATE_FROM`),
KEY `FILES_ID` (`FILES_ID`),
KEY `CAMPAIGN_ID` (`CAMPAIGN_ID`),
KEY `CAMP_DATE` (`CAMPAIGN_ID`,`DATE_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When I execute
SELECT files_id, min( date_from )
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
the query sat with status "Sending data" for more than eight hours (then I killed the process).
Here the explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE files_history2 ALL CAMPAIGN_ID,CAMP_DATE NULL NULL NULL 5073254 Using where; Using temporary; Using filesort
I assume that I created the necessary keys, but then the query shouldn't take that long, should it?
I would suggest a different index... an index on (Files_ID, Date_From, Campaign_ID).
Since your GROUP BY is on Files_ID, you want THOSE grouped. Then the MIN(Date_From), so that goes in second position... Then FINALLY the Campaign_ID to qualify the NOT NULL condition, and here's why...
If you put all your campaign IDs first, great, you get all the NULLs out of the way... but now, when you have 1,000 campaigns and the Files_ID spans MANY campaigns and they also span many dates, you are going to choke.
With the index I'm proposing, with Files_ID first, each files_id is already ordered to match your GROUP BY. Then, within that, all the earliest dates are at the top of the indexed list... great, almost there; then, the campaign ID. Skip over whatever NULLs may be there and you are done, on to the next Files_ID.
Hope this makes sense -- unless you have TONs of entries with NULL-value campaigns.
Also, since all 3 parts of the index match the criteria and output columns of your query, it never has to go back to the raw data file for the data; it gets it all directly from the index.
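A minimal sketch of that suggested index (the index name is a placeholder):
ALTER TABLE files_history2 ADD KEY idx_files_date_campaign (FILES_ID, DATE_FROM, CAMPAIGN_ID);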
I'd create a covering index (CAMPAIGN_ID, files_id, date_from) and check that performance. I suspect your issue is due to the grouping and date_from not being able to use the same index.
CREATE INDEX your_index_name ON files_history2 (CAMPAIGN_ID, files_id, date_from);
If this works, you could drop the single-column index CAMPAIGN_ID, as it's included in the composite index.
The query is slow due to the aggregation (the MIN function) combined with the grouping.
One solution is to move the aggregation into a derived table in the FROM clause and join back to it, which should be a lot faster than the approach you are using.
Try the following:
SELECT f.files_id, f1.datefrom
FROM files_history2 AS f
JOIN (
SELECT files_id, MIN(date_from) AS datefrom
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
) AS f1 ON f.files_id = f1.files_id AND f.date_from = f1.datefrom;
This should give a lot better performance; if it doesn't, a temporary table would be the only way to go.