Having issues updating a table so that only values that have increased are kept.
My scenario:
Imagine you have a database structured like the one below, which represents each programme's peak audience values, split by channel and device.
Every 5 minutes you push new data into this table,
where the goal is to update a value only when the programme's peak views have increased,
e.g. a higher "event_peak_views" for a unique combination of "platform_id AND channel_epg_id AND event_start".
My current approach is to quickly insert the new values every 5 minutes, then select the lower peak values for any given programme and delete them.
There MUST be a better way, given the data sets are rather large at several million rows. Has anyone got a better suggestion than my own shoddy approach?
Current Table Layout
`entry_id` INT(11) NOT NULL AUTO_INCREMENT,
`platform_id` INT(11) NOT NULL,
`channel_epg_id` INT(11) NOT NULL,
`event_start` DATETIME NOT NULL,
`event_peak_views` INT(11) NOT NULL,
`last_update` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`entry_id`),
INDEX `platform_id` (`platform_id`),
INDEX `channel_epg_id` (`channel_epg_id`),
INDEX `event_start` (`event_start`)
My SQL to select and delete the lowest values:
DELETE victim
FROM
    Livescrape_data_live_historical_events AS victim,
    Livescrape_data_live_historical_events AS comparison
WHERE
    victim.entry_id <> comparison.entry_id
    AND victim.event_start = comparison.event_start
    AND victim.platform_id = comparison.platform_id
    AND victim.channel_epg_id = comparison.channel_epg_id
    AND (
        (victim.event_peak_views < comparison.event_peak_views)
        OR (
            victim.event_peak_views = comparison.event_peak_views
            AND victim.entry_id > comparison.entry_id
        )
    );
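A commonly suggested alternative to the insert-then-delete cycle is a single upsert that only ever raises the peak. This is a sketch, not the asker's code: it assumes you can first add a unique key over the natural key (which may require de-duplicating existing rows), and it uses example values in the VALUES list.

```sql
-- Assumption: the table can gain a unique key over its natural key.
ALTER TABLE Livescrape_data_live_historical_events
  ADD UNIQUE KEY uniq_event (platform_id, channel_epg_id, event_start);

-- Each 5-minute load then becomes one statement per row (or a multi-row
-- VALUES list): new rows are inserted, existing rows keep the larger peak.
INSERT INTO Livescrape_data_live_historical_events
  (platform_id, channel_epg_id, event_start, event_peak_views)
VALUES
  (1, 42, '2017-07-18 20:00:00', 1500)   -- example values
ON DUPLICATE KEY UPDATE
  event_peak_views = GREATEST(event_peak_views, VALUES(event_peak_views));
```

This avoids the full self-join delete pass entirely; `last_update` still refreshes via its ON UPDATE clause whenever the peak actually changes.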
Related
I need to run a process against unique combinations of 3 to 8 values from a list of 25 integers. This creates a table with approximately 6 million unique records/options (assuming I limit queries as in the following chart). The formula for unique combinations (here in Excel form) for combinations of 8 items from a 25-item list is =FACT(25)/(FACT(8)*FACT(25-8)) ...
See this explanation here ... https://www.khanacademy.org/math/precalculus/x9e81a4f98389efdf:prob-comb/x9e81a4f98389efdf:combinations/v/handshaking-combinations
As I want combinations of 3, 4, 5, 6, 7, or 8 - some records will have NULL entries for their fourth through eighth values.
As multiple computers and processors might be working on this simultaneously, I need to store the list of "jobs" in MySQL and update or delete each record when processed.
Using @eggyal's solution, I can create a table for my 25 values and generate all unique combinations in MySQL. This list of 25 to 50 values changes occasionally ...
CREATE TABLE UserContacts
(`contact_id` int)
;
INSERT INTO UserContacts
(`contact_id`)
VALUES
(1),
(5),
(6)
;
To get all combinations, I can run ...
SELECT a.contact_id a, b.contact_id b, ... (I need up to 6 of these)
FROM UserContacts a
JOIN UserContacts b ON b.contact_id > a.contact_id
See it on sqlfiddle (change to MySQL 5.6 db for it to work).
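For illustration, the same pattern extended to three-value combinations looks like this (a sketch; each additional JOIN adds one column, always with the "greater than" condition so every combination appears exactly once, in ascending order):

```sql
-- 3-value combinations: the inequality chain b > a, c > b guarantees
-- each unordered combination is emitted exactly once.
SELECT a.contact_id AS a, b.contact_id AS b, c.contact_id AS c
FROM UserContacts a
JOIN UserContacts b ON b.contact_id > a.contact_id
JOIN UserContacts c ON c.contact_id > b.contact_id;
```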
My question is:
How best to add this data to a new MySQL table that indexes these combinations properly, so I can find and update records in this table of millions of combinations?
Is it faster to create a compound primary key PRIMARY KEY (firstval, secondval, thirdval ...)? If so, how would a SELECT or WHERE query look for the compound solution? etc.
CREATE TABLE `combinations` (
`combo_id` INT NOT NULL,
`firstval` SMALLINT(5) NULL,
`secondval` SMALLINT(5) NULL,
`thirdval` SMALLINT(5) NULL,
`fourthval` SMALLINT(5) NULL,
`fifthval` SMALLINT(5) NULL,
`sixthval` SMALLINT(5) NULL,
`seventhval` SMALLINT(5) NULL,
`eighthval` SMALLINT(5) NULL,
`started` TINYINT(1) NULL,
`score` TINYINT(3) NULL,
`added` DATETIME NULL,
`processed` DATETIME NULL,
PRIMARY KEY (`combo_id`)
)
;
If you are building a queue with millions of rows, you may find that the overhead of enqueuing and dequeuing adds complexity and slows down the processing.
Since you know exactly what combinations need to be processed, I suggest that you simply tell each 'worker' what set of combinations to work on.
Ways to split
Let's say you decide to have 10 'worker' processes.
If there is an id that is mostly compact, compute ranges of id to provide to each worker.
Assign 0 through 9 to the 10 workers, then have each worker check id % 10 to see whether a task is one it should work on.
A worker grabs, say, 20 unassigned ids and assigns those rows to itself. This is handy when there is no useful pattern in id and the list of tasks is continually being augmented. A drawback is that if a worker dies before finishing its 20, a separate "reaper" process is needed to put those rows back into the pool of tasks waiting for assignment.
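The "grab 20 unassigned ids" step might be sketched like this; the worker_id value and the claimed_by/claimed_at columns are assumptions, not part of the schema shown in the question:

```sql
-- Each UPDATE statement is atomic, so concurrent workers end up
-- claiming disjoint sets of rows.
UPDATE combinations
SET claimed_by = 7,          -- this worker's id (assumed column)
    claimed_at = NOW()
WHERE claimed_by IS NULL
LIMIT 20;

-- The worker then fetches whatever it just claimed:
SELECT * FROM combinations
WHERE claimed_by = 7 AND processed IS NULL;
```

A reaper process would simply reset claimed_by to NULL for rows whose claimed_at is older than some timeout and whose processed column is still NULL.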
For storing combinations in a MySQL database, the solution that appears to work best is to create two indexes:
A PRIMARY KEY
A second UNIQUE index across all combination values - as follows:
CREATE TABLE combo (
`combo_id` INT NOT NULL,
`one` SMALLINT(5) NULL,
`two` SMALLINT(5) NULL,
`three` SMALLINT(5) NULL,
`four` SMALLINT(5) NULL,
`five` SMALLINT(5) NULL,
`six` SMALLINT(5) NULL,
`seven` SMALLINT(5) NULL,
`eight` SMALLINT(5) NULL,
`processed` DATETIME NULL,
PRIMARY KEY (`combo_id`),
UNIQUE KEY(`one`, `two`, `three`, `four`, `five`, `six`, `seven`, `eight`)
)
;
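With that UNIQUE index in place, finding and updating a single combination is a point lookup. A sketch (values are illustrative; note that MySQL permits multiple rows whose unique-key columns contain NULL, so the uniqueness guarantee is weaker for the shorter combinations):

```sql
-- Look up a 3-value combination; trailing columns are NULL by convention.
SELECT combo_id, processed
FROM combo
WHERE one = 3 AND two = 7 AND three = 11
  AND four  IS NULL AND five  IS NULL AND six IS NULL
  AND seven IS NULL AND eight IS NULL;

-- Mark it done once processed:
UPDATE combo SET processed = NOW() WHERE combo_id = 12345;
```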
A simple solution that works well...
I have two tables, each with 65.5 million rows:
1)
CREATE TABLE RawData1 (
cdasite varchar(45) COLLATE utf8_unicode_ci NOT NULL,
id int(20) NOT NULL DEFAULT '0',
timedate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
type int(11) NOT NULL DEFAULT '0',
status int(11) NOT NULL DEFAULT '0',
branch_id int(20) DEFAULT NULL,
branch_idString varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (id,cdasite,timedate),
KEY idx_timedate (timedate,cdasite)
) ENGINE=InnoDB;
2)
Same table with partition (call it RawData2)
PARTITION BY RANGE ( TO_DAYS(timedate))
(PARTITION p20140101 VALUES LESS THAN (735599) ENGINE = InnoDB,
PARTITION p20140401 VALUES LESS THAN (735689) ENGINE = InnoDB,
...
PARTITION p20201001 VALUES LESS THAN (738064) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
I'm using the same query:
SELECT count(id) FROM RawData1
where timedate BETWEEN DATE_FORMAT(date_sub(now(),INTERVAL 2 YEAR),'%Y-%m-01') AND now();
Two problems:
1. Why does the partitioned table run longer than the regular table?
2. The regular table returns 36380217 in 17.094 sec. Is that normal? All the R&D leads think it is not fast enough; it needs to return in ~2 sec.
What do I need to check / do / change?
Is it realistic to scan 35732495 rows and return a count of 36380217 in less than 3-4 sec?
You have found one example of why PARTITIONing is not a performance panacea.
Where does id come from?
How many different values are there for cdasite? If thousands, not millions, build a table mapping cdasite <=> id and switch from a bulky VARCHAR(45) to a MEDIUMINT UNSIGNED (or whatever is appropriate). This item may help the most, but perhaps not enough.
Ditto for status, but probably using TINYINT UNSIGNED. Or think about ENUM. Either is 1 byte, not 4.
The (20) on INT(20) means nothing. You get a 4-byte integer with a limit of about 2 billion.
Are you sure there are no duplicate timedates?
branch_id and branch_idString -- this smells like a pair that needs to be in another table, leaving only the id here?
Smaller -> faster.
COUNT(*) is the same as COUNT(id) since id is NOT NULL.
Do not include future partitions before they are needed; it slows things down. (And don't use partitioning at all.)
To get that query even faster, build and maintain a Summary Table. It would have at least a DATE in the PRIMARY KEY and at least COUNT(*) as a column. Then the query would fetch from that table. More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
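A minimal sketch of such a summary table (table and column names are illustrative, not from the question):

```sql
-- One row per day; PRIMARY KEY on the date keeps lookups trivial.
CREATE TABLE daily_counts (
  dy        DATE NOT NULL,
  row_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (dy)
) ENGINE=InnoDB;

-- Maintained nightly from the raw table:
INSERT INTO daily_counts (dy, row_count)
SELECT DATE(timedate), COUNT(*)
FROM RawData1
WHERE timedate >= CURDATE() - INTERVAL 1 DAY
  AND timedate <  CURDATE()
GROUP BY DATE(timedate);

-- The 2-year count then reads a few hundred summary rows
-- instead of ~36M raw rows:
SELECT SUM(row_count)
FROM daily_counts
WHERE dy >= DATE_FORMAT(CURDATE() - INTERVAL 2 YEAR, '%Y-%m-01');
```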
I have a setup like such:
using mysql 5.5
a building has a set of measurement equipment
measurements are stored in measurements -table
there are multiple different types
users can have access to either a whole building, or a set of equipment
a few million measurements
The table creation looks like this:
CREATE TABLE `measurements` (
`id` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`building_id` INT(10) UNSIGNED NOT NULL,
`equipment_id` INT(10) UNSIGNED NOT NULL,
`state` ENUM('normal','high','low','alarm','error') NOT NULL DEFAULT 'normal',
`measurement_type` VARCHAR(50) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `building_id` (`building_id`),
INDEX `equipment_id` (`equipment_id`),
INDEX `state` (`state`),
INDEX `timestamp` (`timestamp`),
INDEX `measurement_type` (`measurement_type`),
INDEX `building_timestamp_type` (`building_id`, `timestamp`, `measurement_type`),
INDEX `building_type_state` (`building_id`, `measurement_type`, `state`),
INDEX `equipment_type_state` (`equipment_id`, `measurement_type`, `state`),
INDEX `equipment_type_state_stamp` (`equipment_id`, `measurement_type`, `state`, `timestamp`)
) COLLATE='utf8_unicode_ci' ENGINE=InnoDB;
Now I need to query for the last 50 measurements of certain types that the user has access to. If the user has access to a whole building, the query runs very, very fast. However, if the user only has access to individual pieces of equipment, the query execution time seems to grow linearly with the number of equipment_ids. For example, with only 2 equipment_ids in the IN query the execution time was around 10 ms; at 130 equipment_ids, it took 2.5 s. The query I'm using looks like this:
SELECT *
FROM `measurements`
WHERE
`measurements`.`state` IN ('high','low','alarm')
AND `measurements`.`equipment_id` IN (
SELECT `equipment_users`.`equipment_id`
FROM `equipment_users`
WHERE `equipment_users`.`user_id` = 1
)
AND (`measurements`.`measurement_type` IN ('temperature','luminosity','humidity'))
ORDER BY `measurements`.`timestamp` DESC
LIMIT 50
The query seems to favor the measurement_type index, which makes it take 15 seconds; forcing it to use the equipment_type_state_stamp index drops that down to the numbers listed above. Still, why does the execution time grow linearly with the number of ids, and is there something I could do to prevent that?
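One variant worth benchmarking on this setup (a sketch, not a guaranteed fix): MySQL 5.5 is known to handle IN (SELECT ...) subqueries poorly, often re-executing them as dependent subqueries, so an explicit JOIN can produce a different plan. This assumes equipment_users has at most one row per (user_id, equipment_id) pair, so the JOIN does not duplicate measurements:

```sql
SELECT m.*
FROM measurements m
JOIN equipment_users eu
  ON eu.equipment_id = m.equipment_id
WHERE eu.user_id = 1
  AND m.state IN ('high','low','alarm')
  AND m.measurement_type IN ('temperature','luminosity','humidity')
ORDER BY m.timestamp DESC
LIMIT 50;
```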
I have a table for storing stats. Currently it is populated with about 10 million rows by the end of the day, then copied to a daily stats table and deleted. For this reason I can't have an auto-incrementing primary key.
This is the table structure:
CREATE TABLE `stats` (
`shop_id` int(11) NOT NULL,
`title` varchar(255) CHARACTER SET latin1 NOT NULL,
`created` datetime NOT NULL,
`mobile` tinyint(1) NOT NULL DEFAULT '0',
`click` tinyint(1) NOT NULL DEFAULT '0',
`conversion` tinyint(1) NOT NULL DEFAULT '0',
`ip` varchar(20) CHARACTER SET latin1 NOT NULL,
KEY `shop_id` (`shop_id`,`created`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have a key on shop_id, created, ip, but I'm not sure which columns I should use to create the optimal index to increase lookup speed any further.
The query below takes about 12 seconds with no key and about 1.5 seconds using the index above:
SELECT DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS `views`
FROM `stats`
WHERE `created` <= '2017-07-18 09:59:59'
AND `shop_id` = '17515021'
AND `click` != 1
AND `conversion` != 1
GROUP BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'))
ORDER BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'));
If there is no column (or combination of columns) that is guaranteed unique, then do have an AUTO_INCREMENT id. Don't worry about truncating/deleting. (However, if the id does not reset, you probably need to use BIGINT, not INT UNSIGNED to avoid overflow.)
Don't use id as the primary key, instead, PRIMARY KEY(shop_id, created, id), INDEX(id).
That unconventional PK will help with performance in 2 ways, while being unique (due to the addition of id). The INDEX(id) is to keep AUTO_INCREMENT happy. (Whether you DELETE hourly or daily is a separate issue.)
Build a Summary table based on each hour (or minute). It will contain the count for such -- 400K/hour or 7K/minute. Augment it each hour (or minute) so that you don't have to do all the work at the end of the day.
The summary table can also filter on click and/or conversion. Or it could keep both, if you need them.
If click/conversion have only two states (0 & 1), don't say != 1, say = 0; the optimizer is much better with = than with !=.
If they are 2-state and you change to =, then this index becomes viable and much better: INDEX(shop_id, click, conversion, created) -- created must be last.
Don't bother with TZ when summarizing into the Summary table; apply the conversion later.
Better yet, don't use DATETIME, use TIMESTAMP so that you won't need to convert (assuming you have TZ set correctly).
After all that, if you still have issues, start over on the Question; there may be further tweaks.
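Putting the index and query changes above together might look like this (a sketch, assuming click and conversion really are 2-state):

```sql
-- '=' on the two flag columns lets the optimizer use all four index parts;
-- the range column (created) comes last.
ALTER TABLE stats
  ADD INDEX shop_click_conv_created (shop_id, click, conversion, created);

SELECT DATE(CONVERT_TZ(created, 'UTC', 'Australia/Brisbane')) AS `date`,
       COUNT(*) AS views
FROM stats
WHERE shop_id = 17515021
  AND click = 0              -- '= 0' instead of '!= 1'
  AND conversion = 0
  AND created <= '2017-07-18 09:59:59'
GROUP BY `date`;
```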
In your WHERE clause, put first the column that returns the smallest set of results, and so on, then create the index in the same order.
You have
WHERE created <= '2017-07-18 09:59:59'
AND shop_id = '17515021'
AND click != 1
AND conversion != 1
If created returns the smallest set compared to the other 3 columns, then you are good; otherwise move the most selective column to the first position in your WHERE clause, pick the second column by the same reasoning, and create the index to match your WHERE clause.
If you think the order is fine, then create an index:
KEY created_shopid_click_conversion (created, shop_id, click, conversion);
I have a table of bitcoin transactions:
CREATE TABLE `transactions` (
`trans_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`trans_exchange` int(10) unsigned DEFAULT NULL,
`trans_currency_base` int(10) unsigned DEFAULT NULL,
`trans_currency_counter` int(10) unsigned DEFAULT NULL,
`trans_tid` varchar(20) DEFAULT NULL,
`trans_type` tinyint(4) DEFAULT NULL,
`trans_price` decimal(15,4) DEFAULT NULL,
`trans_amount` decimal(15,8) DEFAULT NULL,
`trans_datetime` datetime DEFAULT NULL,
`trans_sid` bigint(20) DEFAULT NULL,
`trans_timestamp` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`trans_id`),
KEY `trans_tid` (`trans_tid`),
KEY `trans_datetime` (`trans_datetime`),
KEY `trans_timestmp` (`trans_timestamp`),
KEY `trans_price` (`trans_price`),
KEY `trans_amount` (`trans_amount`)
) ENGINE=MyISAM AUTO_INCREMENT=6162559 DEFAULT CHARSET=utf8;
As you can see from the AUTO_INCREMENT value, the table has over 6 million entries. There will eventually be many more.
I would like to query the table to obtain max price, min price, volume and total amount traded during arbitrary time intervals. To accomplish this, I'm using a query like this:
SELECT
DATE_FORMAT( MIN(transactions.trans_datetime),
'%Y/%m/%d %H:%i:00'
) AS trans_datetime,
SUM(transactions.trans_amount) as trans_volume,
MAX(transactions.trans_price) as trans_max_price,
MIN(transactions.trans_price) as trans_min_price,
COUNT(transactions.trans_id) AS trans_count
FROM
transactions
WHERE
transactions.trans_datetime BETWEEN '2014-09-14 00:00:00' AND '2015-09-13 23:59:00'
GROUP BY
transactions.trans_timestamp DIV 86400
That should select transactions made over a one-year period, grouped by day (86,400 seconds).
The idea is that the timestamp field, which holds the same value as the datetime column but as a Unix timestamp (I found this faster than calling UNIX_TIMESTAMP(trans_datetime)), is divided by the number of seconds I want in each interval.
The problem: the query is slow. I'm getting 4+ seconds processing time. Here is the result of EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE transactions ALL trans_datetime,trans_timestmp NULL NULL NULL 6162558 Using where; Using temporary; Using filesort
The question: is it possible to optimize this better? Is this structure or approach flawed? I have tried several approaches, and have only succeeded in making modest millisecond-type gains.
Most of the data in the table is from the last 12 months? So you need to touch most of the table? Then there is no way to speed up that query as written. However, you can get the same output orders of magnitude faster...
Create a summary table. It would have a DATE as the PRIMARY KEY, and the columns would be effectively the fields mentioned in your SELECT.
Once you have initially populated the summary table, then maintain it by adding a new row each night for the day's transactions. More in my blog.
Then the query to get the desired output would hit this summary table (with only a few hundred rows), not the table with millions of rows.
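An illustrative sketch of such a summary table for the transactions case (names and column widths are assumptions):

```sql
-- One row per day, keyed by date.
CREATE TABLE transactions_daily (
  dy              DATE NOT NULL,
  trans_volume    DECIMAL(20,8) NOT NULL,
  trans_max_price DECIMAL(15,4) NOT NULL,
  trans_min_price DECIMAL(15,4) NOT NULL,
  trans_count     INT UNSIGNED NOT NULL,
  PRIMARY KEY (dy)
) ENGINE=InnoDB;

-- Nightly maintenance for yesterday's rows:
INSERT INTO transactions_daily
SELECT DATE(trans_datetime),
       SUM(trans_amount), MAX(trans_price), MIN(trans_price), COUNT(*)
FROM transactions
WHERE trans_datetime >= CURDATE() - INTERVAL 1 DAY
  AND trans_datetime <  CURDATE()
GROUP BY DATE(trans_datetime);

-- The year-long report then scans ~365 rows instead of millions:
SELECT *
FROM transactions_daily
WHERE dy BETWEEN '2014-09-14' AND '2015-09-13';
```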