I have a table of bitcoin transactions:
CREATE TABLE `transactions` (
`trans_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`trans_exchange` int(10) unsigned DEFAULT NULL,
`trans_currency_base` int(10) unsigned DEFAULT NULL,
`trans_currency_counter` int(10) unsigned DEFAULT NULL,
`trans_tid` varchar(20) DEFAULT NULL,
`trans_type` tinyint(4) DEFAULT NULL,
`trans_price` decimal(15,4) DEFAULT NULL,
`trans_amount` decimal(15,8) DEFAULT NULL,
`trans_datetime` datetime DEFAULT NULL,
`trans_sid` bigint(20) DEFAULT NULL,
`trans_timestamp` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`trans_id`),
KEY `trans_tid` (`trans_tid`),
KEY `trans_datetime` (`trans_datetime`),
KEY `trans_timestmp` (`trans_timestamp`),
KEY `trans_price` (`trans_price`),
KEY `trans_amount` (`trans_amount`)
) ENGINE=MyISAM AUTO_INCREMENT=6162559 DEFAULT CHARSET=utf8;
As you can see from the AUTO_INCREMENT value, the table has over 6 million entries. There will eventually be many more.
I would like to query the table to obtain max price, min price, volume and total amount traded during arbitrary time intervals. To accomplish this, I'm using a query like this:
SELECT
DATE_FORMAT( MIN(transactions.trans_datetime),
'%Y/%m/%d %H:%i:00'
) AS trans_datetime,
SUM(transactions.trans_amount) as trans_volume,
MAX(transactions.trans_price) as trans_max_price,
MIN(transactions.trans_price) as trans_min_price,
COUNT(transactions.trans_id) AS trans_count
FROM
transactions
WHERE
transactions.trans_datetime BETWEEN '2014-09-14 00:00:00' AND '2015-09-13 23:59:00'
GROUP BY
transactions.trans_timestamp DIV 86400
That should select transactions made over a year period, grouped by day (86,400 seconds).
The idea is that the trans_timestamp field contains the same value as trans_datetime, but as a Unix timestamp (I found this faster than calling UNIX_TIMESTAMP(trans_datetime)); it is divided by the number of seconds I want in each time interval.
The problem: the query is slow. I'm getting 4+ seconds processing time. Here is the result of EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE transactions ALL trans_datetime,trans_timestmp NULL NULL NULL 6162558 Using where; Using temporary; Using filesort
The question: is it possible to optimize this better? Is this structure or approach flawed? I have tried several approaches, and have only succeeded in making modest millisecond-type gains.
Most of the data in the table is for the last 12 months? So you need to touch most of the table? Then there is no way to speed that query up. However, you can get the same output orders of magnitude faster...
Create a summary table. It would have a DATE as the PRIMARY KEY, and the columns would be effectively the fields mentioned in your SELECT.
Once you have initially populated the summary table, then maintain it by adding a new row each night for the day's transactions. More in my blog.
Then the query to get the desired output would hit this Summary Table (with only a few hundred rows), not the table with millions of rows.
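A minimal sketch of what that could look like for this table (the names, column types, and the nightly INSERT are placeholders, not a drop-in solution):
CREATE TABLE `transactions_daily` (
`trans_date` date NOT NULL,
`trans_volume` decimal(20,8) NOT NULL,
`trans_max_price` decimal(15,4) NOT NULL,
`trans_min_price` decimal(15,4) NOT NULL,
`trans_count` int unsigned NOT NULL,
PRIMARY KEY (`trans_date`)
) ENGINE=InnoDB;

-- nightly job: add yesterday's row
INSERT INTO transactions_daily
SELECT DATE(trans_datetime),
SUM(trans_amount),
MAX(trans_price),
MIN(trans_price),
COUNT(*)
FROM transactions
WHERE trans_datetime >= CURDATE() - INTERVAL 1 DAY
AND trans_datetime < CURDATE()
GROUP BY DATE(trans_datetime);

-- the reporting query then reads a few hundred rows instead of millions
SELECT trans_date, trans_volume, trans_max_price, trans_min_price, trans_count
FROM transactions_daily
WHERE trans_date BETWEEN '2014-09-14' AND '2015-09-13';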
Related
I'm building an online tool for collecting feedback.
Right now I'm building a visual summary of all answers per question, with each answer's occurrence count next to it. I use this query:
SELECT
feedback_answer,
feedback_qtype,
COUNT(feedback_answer) as occurence
FROM acc_data_1005
WHERE (feedback_qtype=5 or feedback_qtype=4 or feedback_qtype=12 or feedback_qtype=13 or feedback_qtype=1 or feedback_qtype=2)
and survey_id=205283
GROUP BY feedback_answer ORDER BY feedback_qtype DESC, COUNT(feedback_answer) DESC
DB table:
CREATE TABLE `acc_data_1005` (
`id` int UNSIGNED NOT NULL,
`survey_id` int UNSIGNED NOT NULL,
`feedback_id` int UNSIGNED NOT NULL,
`date_registered` date NOT NULL,
`feedback_qid` int UNSIGNED NOT NULL,
`feedback_question` varchar(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`feedback_qtype` tinyint UNSIGNED NOT NULL COMMENT 'nps, text, input etc',
`data_type` tinyint UNSIGNED NOT NULL COMMENT '0 till 10 are sensitive data options (first name, last name, email etc.)',
`feedback_answer` varchar(1500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`additional_data` varchar(500) COLLATE utf8mb4_unicode_ci NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC;
ALTER TABLE `acc_data_1005`
ADD PRIMARY KEY (`id`),
ADD KEY `date_registered` (`date_registered`),
ADD KEY `feedback_qid` (`feedback_qid`,`feedback_question`) USING BTREE,
ADD KEY `feedback_id` (`feedback_id`),
ADD KEY `survey_id` (`survey_id`);
ALTER TABLE `acc_data_1005` ADD FULLTEXT KEY `feedback_answer` (`feedback_answer`);
ALTER TABLE `acc_data_1005`
MODIFY `id` int UNSIGNED NOT NULL AUTO_INCREMENT, AUTO_INCREMENT=2020001;
COMMIT;
The table has around 2 million rows and for this test, they all have the same survey_id.
Profiling says that execution takes up 96% of the time; here is the EXPLAIN result:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE acc_data_1005 NULL ref survey_id,feedback_answer survey_id 4 const 998375 46.86 Using where; Using temporary; Using filesort
This query takes around 22-30 seconds for just 11 rows.
If I remove the survey_id (which is important), the query takes around 2-4 seconds (still way too much).
I've been at it for hours but can't find why this query is so slow.
If it helps I can dump the rows in a SQL file (around 400-600MB).
The GROUP BY is slow because it scans 2 million rows and groups on feedback_answer, a long character column (which also carries a FULLTEXT index).
I created another table, "analytic_stats", and a cron job that runs this query every month (for only that month's data) and stores the result in the stats table.
When the customer wants the data for a full year (2 million+ rows, which is too slow to group on the fly), I just fetch a few rows from the stats table and run the GROUP BY query only for the current month. That has to group only around 10,000-20,000 rows instead of 2 million, which is practically instant.
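Roughly, the setup looks like this (the table layout and index names here are illustrative, not the exact schema):
CREATE TABLE `analytic_stats` (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT,
`survey_id` int UNSIGNED NOT NULL,
`stats_month` date NOT NULL,
`feedback_qtype` tinyint UNSIGNED NOT NULL,
`feedback_answer` varchar(1500) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`occurence` int UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
KEY `survey_month` (`survey_id`,`stats_month`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- monthly cron: aggregate last month's rows into the stats table
INSERT INTO analytic_stats (survey_id, stats_month, feedback_qtype, feedback_answer, occurence)
SELECT survey_id,
DATE_FORMAT(date_registered, '%Y-%m-01'),
feedback_qtype,
feedback_answer,
COUNT(*)
FROM acc_data_1005
WHERE feedback_qtype IN (1,2,4,5,12,13)
AND date_registered >= DATE_FORMAT(CURDATE() - INTERVAL 1 MONTH, '%Y-%m-01')
AND date_registered < DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY survey_id, DATE_FORMAT(date_registered, '%Y-%m-01'), feedback_qtype, feedback_answer;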
Maybe not the most efficient way, but it works for me ;)
Hope it might help someone with a similar problem.
I have one simple query but, on the other hand, a relatively big table.
Here it is:
select `stats_ad_groups`.`ad_group_id`,
sum(stats_ad_groups.earned) / 1000000 as earned
from `stats_ad_groups`
where `stats_ad_groups`.`day` between '2018-01-01' and '2018-05-31'
group by `ad_group_id` order by earned asc
limit 10
And here is table structure:
CREATE TABLE `stats_ad_groups` (
`campaign_id` int(11) NOT NULL,
`ad_group_id` int(11) NOT NULL,
`impressions` int(11) NOT NULL,
`clicks` int(11) NOT NULL,
`avg_position` double(3,1) NOT NULL,
`cost` int(11) NOT NULL,
`profiles` int(11) NOT NULL DEFAULT 0,
`upgrades` int(11) NOT NULL DEFAULT 0,
`earned` int(11) NOT NULL DEFAULT 0,
`day` date NOT NULL,
PRIMARY KEY (`ad_group_id`,`day`,`campaign_id`)
)
There are also partitions by range on this table, but I've excluded them so as not to waste space :)
The query above executes in about 9 seconds. Do you know some way to improve it?
If I exclude the LIMIT/ORDER BY, it executes in about 200 ms.
To sum up:
I need to order by a sum on a big table, if possible with LIMIT and OFFSET.
INDEX(day, ad_group_id, earned)
handles the WHERE and is 'covering'.
Is your partitioning PARTITION BY RANGE(TO_DAYS(day)) with daily partitions? If so, you could leave day off that index.
With that index, PARTITIONing provides no extra performance for this query.
For significant speedup, build and maintain a summary table that has day, ad_group_id, SUM(earned). More
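A sketch of both suggestions (the summary table, its name, and the nightly job are assumptions):
ALTER TABLE stats_ad_groups ADD INDEX day_group_earned (`day`, ad_group_id, earned);

-- hypothetical daily summary table
CREATE TABLE stats_ad_groups_daily (
`day` date NOT NULL,
`ad_group_id` int NOT NULL,
`earned` bigint NOT NULL,
PRIMARY KEY (`day`,`ad_group_id`)
) ENGINE=InnoDB;

-- nightly: roll up yesterday's rows (summing over campaigns)
INSERT INTO stats_ad_groups_daily
SELECT `day`, ad_group_id, SUM(earned)
FROM stats_ad_groups
WHERE `day` = CURDATE() - INTERVAL 1 DAY
GROUP BY `day`, ad_group_id;

-- the report then orders by a sum over a much smaller table
SELECT ad_group_id, SUM(earned) / 1000000 AS earned
FROM stats_ad_groups_daily
WHERE `day` BETWEEN '2018-01-01' AND '2018-05-31'
GROUP BY ad_group_id
ORDER BY earned ASC
LIMIT 10;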
Don't use (m,n) on DOUBLE or FLOAT.
I have a setup like this:
using MySQL 5.5
a building has a set of measurement equipment
measurements are stored in the measurements table
there are multiple different types
users can have access to either a whole building, or a set of equipment
a few million measurements
The table creation looks like this:
CREATE TABLE `measurements` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`building_id` INT(10) UNSIGNED NOT NULL,
`equipment_id` INT(10) UNSIGNED NOT NULL,
`state` ENUM('normal','high','low','alarm','error') NOT NULL DEFAULT 'normal',
`measurement_type` VARCHAR(50) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `building_id` (`building_id`),
INDEX `equipment_id` (`equipment_id`),
INDEX `state` (`state`),
INDEX `timestamp` (`timestamp`),
INDEX `measurement_type` (`measurement_type`),
INDEX `building_timestamp_type` (`building_id`, `timestamp`, `measurement_type`),
INDEX `building_type_state` (`building_id`, `measurement_type`, `state`),
INDEX `equipment_type_state` (`equipment_id`, `measurement_type`, `state`),
INDEX `equipment_type_state_stamp` (`equipment_id`, `measurement_type`, `state`, `timestamp`)
) COLLATE='utf8_unicode_ci' ENGINE=InnoDB;
Now I need to query for the last 50 measurements of certain types that the user has access to. If the user has access to a whole building, the query runs very, very fast. However, if the user only has access to individual pieces of equipment, the query execution time seems to grow linearly with the number of equipment_ids. For example, I tested having only 2 equipment_ids in the IN query and the execution time was around 10ms. At 130 equipment_ids, it took 2.5s. The query I'm using looks like this:
SELECT *
FROM `measurements`
WHERE
`measurements`.`state` IN ('high','low','alarm')
AND `measurements`.`equipment_id` IN (
SELECT `equipment_users`.`equipment_id`
FROM `equipment_users`
WHERE `equipment_users`.`user_id` = 1
)
AND (`measurements`.`measurement_type` IN ('temperature','luminosity','humidity'))
ORDER BY `measurements`.`timestamp` DESC
LIMIT 50
The query seems to favor the measurement_type index, which makes it take 15 seconds; forcing it to use the equipment_type_state_stamp index drops that down to the numbers listed above. Still, why does the execution time grow linearly with the number of IDs, and is there something I could do to prevent that?
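For reference, the forced-index variant mentioned above looks like this:
SELECT *
FROM `measurements` FORCE INDEX (`equipment_type_state_stamp`)
WHERE `measurements`.`state` IN ('high','low','alarm')
AND `measurements`.`equipment_id` IN (
SELECT `equipment_users`.`equipment_id`
FROM `equipment_users`
WHERE `equipment_users`.`user_id` = 1
)
AND `measurements`.`measurement_type` IN ('temperature','luminosity','humidity')
ORDER BY `measurements`.`timestamp` DESC
LIMIT 50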
I have a MYSQL DB with table definition like this:
CREATE TABLE `minute_data` (
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`open` decimal(10,2) DEFAULT NULL,
`high` decimal(10,2) DEFAULT NULL,
`low` decimal(10,2) DEFAULT NULL,
`close` decimal(10,2) DEFAULT NULL,
`volume` decimal(10,2) DEFAULT NULL,
`adj_close` varchar(45) DEFAULT NULL,
`symbol` varchar(10) NOT NULL DEFAULT '',
PRIMARY KEY (`symbol`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
It stores 1 minute data points from the stock market. The primary key is a combination of the symbol and date columns. This way I always have only 1 data point for each symbol at any time.
I am wondering why the following query takes so long that I can't even wait for it to finish:
select distinct date from test.minute_data where date >= "2013-01-01"
order by date asc limit 100;
However I can select count(*) from minute_data; and that finishes very quickly.
I know that it must have something to do with the fact that there are over 374 million rows of data in the table, and my desktop computer is pretty far from a super computer.
Does anyone know something I can try to speed up with query? Do I need to abandon all hope of using a MySQL table this big??
Thanks a lot!
When you have a composite index on 2 columns, like your (symbol, date) primary key, searching and grouping by a prefix of the key will be fast. But searching for something that doesn't include the first column in the index requires scanning all rows or using some other index.
You can either change your primary key to (date, symbol) if you don't usually need to search for symbol without date. Or you can add an additional index on date:
alter table minute_data add index (date)
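The first option, swapping the primary key around, would look roughly like this (only do this if nothing else depends on the current key order; rebuilding the key on 374 million rows will take a long time):
ALTER TABLE minute_data
DROP PRIMARY KEY,
ADD PRIMARY KEY (`date`, `symbol`);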
I am trying to run a grouping query on a large table (more than 8 million rows). However, I can reduce the amount of data that needs to be grouped by filtering on date: I have a view that captures the dates I require, and this limits the query, but it's not much better.
Finally I need to join to another table to pick up a field.
I am showing the query, the create on the main table and the query explain below.
Main Query:
SELECT pgi_raw_data.wsp_channel,
'IOM' AS wsp,
pgi_raw_data.dated,
pgi_accounts.`master`,
pgi_raw_data.event_id,
pgi_raw_data.breed,
Sum(pgi_raw_data.handle),
Sum(pgi_raw_data.payout),
Sum(pgi_raw_data.rebate),
Sum(pgi_raw_data.profit)
FROM pgi_raw_data
INNER JOIN summary_max
ON pgi_raw_data.wsp_channel = summary_max.wsp_channel
AND pgi_raw_data.dated > summary_max.race_date
INNER JOIN pgi_accounts
ON pgi_raw_data.account = pgi_accounts.account
GROUP BY pgi_raw_data.event_id
ORDER BY NULL
The create table:
CREATE TABLE `pgi_raw_data` (
`event_id` char(25) NOT NULL DEFAULT '',
`wsp_channel` varchar(5) NOT NULL,
`dated` date NOT NULL,
`time` time DEFAULT NULL,
`program` varchar(5) NOT NULL,
`track` varchar(25) NOT NULL,
`raceno` tinyint(2) NOT NULL,
`detail` varchar(30) DEFAULT NULL,
`ticket` varchar(20) NOT NULL DEFAULT '',
`breed` varchar(12) NOT NULL,
`pool` varchar(10) NOT NULL,
`gross` decimal(11,2) NOT NULL,
`refunds` decimal(11,2) NOT NULL,
`handle` decimal(11,2) NOT NULL,
`payout` decimal(11,4) NOT NULL,
`rebate` decimal(11,4) NOT NULL,
`profit` decimal(11,4) NOT NULL,
`account` mediumint(10) NOT NULL,
PRIMARY KEY (`event_id`,`ticket`),
KEY `idx_account` (`account`),
KEY `idx_wspchannel` (`wsp_channel`,`dated`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
This is my view for summary_max:
CREATE ALGORITHM=UNDEFINED DEFINER=`root`@`localhost` SQL SECURITY DEFINER VIEW
`summary_max` AS select `pgi_summary_tbl`.`wsp_channel` AS
`wsp_channel`,max(`pgi_summary_tbl`.`race_date`) AS `race_date`
from `pgi_summary_tbl` group by `pgi_summary_tbl`.`wsp_channel`
And here is the EXPLAIN output for the query:
1 PRIMARY <derived2> ALL 6 Using temporary
1 PRIMARY pgi_raw_data ref idx_account,idx_wspchannel idx_wspchannel 7 summary_max.wsp_channel 470690 Using where
1 PRIMARY pgi_accounts ref PRIMARY PRIMARY 3 gf3data_momutech.pgi_raw_data.account 29 Using index
2 DERIVED pgi_summary_tbl ALL 42282 Using temporary; Using filesort
Any help on indexing would help.
At a minimum you need indexes on these fields:
pgi_raw_data.wsp_channel,
pgi_raw_data.dated,
pgi_raw_data.account
pgi_raw_data.event_id,
summary_max.wsp_channel,
summary_max.race_date,
pgi_accounts.account
The general (though not absolute) rule is that anything you sort, group, filter, or join on should have an index.
Also: pgi_summary_tbl.wsp_channel
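As a sketch (index names are made up; account and wsp_channel/dated on pgi_raw_data are already covered by idx_account and idx_wspchannel, and event_id is the leading column of the primary key):
ALTER TABLE pgi_summary_tbl ADD INDEX idx_wsp_race (wsp_channel, race_date);
ALTER TABLE pgi_accounts ADD INDEX idx_account (account); -- skip if account is already the primary key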
Also, why the order by null?
The first thing is to be sure that you have indexes on pgi_summary_tbl(wsp_channel, race_date) and pgi_accounts(account). For this query, you don't need indexes on these columns in the raw data.
MySQL has a tendency to use indexes even when they are not the most efficient path. I would start by looking at the performance of the "full" query, without the joins:
SELECT pgi_raw_data.wsp_channel,
'IOM' AS wsp,
pgi_raw_data.dated,
-- pgi_accounts.`master`,
pgi_raw_data.event_id,
pgi_raw_data.breed,
Sum(pgi_raw_data.handle),
Sum(pgi_raw_data.payout),
Sum(pgi_raw_data.rebate),
Sum(pgi_raw_data.profit)
FROM pgi_raw_data
GROUP BY pgi_raw_data.event_id
If this has better performance, you may have a situation where the indexes are working against you. The specific problem is called "thrashing". It occurs when a table is too big to fit into memory. Often, the fastest way to deal with such a table is to just read the whole thing. Accessing the table through an index can result in an extra I/O operation for most of the rows.
If this works, then do the joins after the aggregate. Also, consider getting more memory, so the whole table will fit into memory.
Second, if you have to deal with this type of data, then partitioning the table by date may prove to be a very useful option. This will allow you to significantly reduce the overhead of reading the large table. You do have to be sure that the summary table can be read the same way.
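A rough sketch of date partitioning (the boundaries are placeholders; note that MySQL requires the partitioning column to be part of every unique key, so the primary key would have to include dated first):
ALTER TABLE pgi_raw_data
DROP PRIMARY KEY,
ADD PRIMARY KEY (event_id, ticket, dated);

ALTER TABLE pgi_raw_data
PARTITION BY RANGE (TO_DAYS(dated)) (
PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
PARTITION pmax VALUES LESS THAN MAXVALUE
);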