MySQL booking site: query/db optimization - mysql

I have very bad performance in most of my queries. I've read a lot on Stack Overflow but still have some questions; maybe someone could help or give me some hints?
Basically, I am working on a booking website that has, among others, the following tables:
objects
+----+---------+--------+---------+------------+-------------+----------+----------+-------------+------------+-------+-------------+------+-----------+----------+-----+-----+
| id | user_id | status | type_id | privacy_id | location_id | address1 | address2 | object_name | short_name | price | currency_id | size | no_people | min_stay | lat | lng |
+----+---------+--------+---------+------------+-------------+----------+----------+-------------+------------+-------+-------------+------+-----------+----------+-----+-----+
OR in MySQL:
CREATE TABLE IF NOT EXISTS `objects` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT COMMENT 'object_id',
`user_id` int(11) unsigned DEFAULT NULL,
`status` tinyint(2) unsigned NOT NULL,
`type_id` tinyint(3) unsigned DEFAULT NULL COMMENT 'type of object, from object_type id',
`privacy_id` tinyint(11) unsigned NOT NULL COMMENT 'id from privacy',
`location_id` int(11) unsigned DEFAULT NULL,
`address1` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`address2` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`object_name` varchar(35) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'given name by user',
`short_name` varchar(12) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'short name, selected by user',
`price` int(6) unsigned DEFAULT NULL,
`currency_id` tinyint(3) unsigned DEFAULT NULL,
`size` int(4) unsigned DEFAULT NULL COMMENT 'size rounded and in m2',
`no_people` tinyint(3) unsigned DEFAULT NULL COMMENT 'number of people',
`min_stay` tinyint(2) unsigned DEFAULT NULL COMMENT '0=no min stay;else # nights',
`lat` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
`lng` varchar(32) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1451046 ;
reservations
+----+------------+-----------+-----------+---------+--------+
| id | by_user_id | object_id | from_date | to_date | status |
+----+------------+-----------+-----------+---------+--------+
OR in MySQL:
CREATE TABLE IF NOT EXISTS `reservations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`by_user_id` int(11) NOT NULL COMMENT 'user_id of guest',
`object_id` int(11) NOT NULL COMMENT 'id of object',
`from_date` date NOT NULL COMMENT 'start date of reservation',
`to_date` date NOT NULL COMMENT 'end date of reservation',
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=890729 ;
There are a few questions:
1 - I have not set any additional keys (except the primary keys). Which keys should I add, and on which columns?
2 - I have read about MyISAM vs InnoDB; my conclusion was that MyISAM is faster for read-only workloads, whereas InnoDB is designed for tables that receive UPDATEs or INSERTs more frequently. So, currently objects uses MyISAM and reservations uses InnoDB. Is it a good idea to mix them? Is there a better choice?
3 - I need to query the objects that are available in a certain period (between from_date and to_date). I have read (among others) this post on Stack Overflow: MySQL select rows where date not between date
However, when I use the suggested solution the query times out before returning any results (so it is really slow):
SELECT DISTINCT o.id FROM objects o LEFT JOIN reservations r ON(r.object_id=o.id) WHERE
COALESCE('2012-04-05' NOT BETWEEN r.from_date AND r.to_date, TRUE)
AND COALESCE('2012-04-08' NOT BETWEEN r.from_date AND r.to_date, TRUE)
AND o.location_id=201
LIMIT 20
What am I doing wrong? What is the best solution for doing such a query? How do other sites do it? Is my database structure not the best for this or is it only the query?
I would have some more questions, but I would be really grateful for getting any help on this! Thank you very much in advance for any hint or suggestion!

It appears you are looking for any "objects" that do NOT have a reservation conflict based on the from/to dates provided. Using coalesce() to always include those that are never found in reservations is an OK choice. However, since this is a left join, I would try left joining where there IS a date found, and ignoring any objects FOUND. Something like
SELECT DISTINCT
o.id
FROM
objects o
LEFT JOIN reservations r
ON o.id = r.object_id
AND ( r.from_date between '2012-04-05' and '2012-04-08'
OR r.to_date between '2012-04-05' and '2012-04-08' )
WHERE
o.location_id = 201
AND r.object_id IS NULL
LIMIT 20
I would ensure an index on the reservations table on (object_id, from_date) and another on (object_id, to_date). By explicitly using the from_date BETWEEN range (and the to_date one as well), you are specifically looking FOR a reservation occupying this time period. If one IS found, the object is not available, hence the WHERE clause looking for "r.object_id IS NULL" (i.e. nothing is in conflict within the date range you've provided).
Expanding on my previous answer, and given two distinct indexes on (object_id, from_date) and (object_id, to_date), you MIGHT get better performance by joining on reservations once for each index and expecting NULL in BOTH reservation sets...
SELECT DISTINCT
o.id
FROM
objects o
LEFT JOIN reservations r
ON o.id = r.object_id
AND r.from_date between '2012-04-05' and '2012-04-08'
LEFT JOIN reservations r2
ON o.id = r2.object_id
AND r2.to_date between '2012-04-05' and '2012-04-08'
WHERE
o.location_id = 201
AND r.object_id IS NULL
AND r2.object_id IS NULL
LIMIT 20

I wouldn't mix InnoDB and MyISAM tables; I would define all the tables as InnoDB (for foreign key support). Generally, all the columns with the _id suffix should be foreign keys referring to the appropriate table (object_id => objects etc).
You don't have to define an index on a foreign key, as it is created automatically (since MySQL 4.1.2), but you can define additional indexes on the reservations.from_date and reservations.to_date columns for faster comparison.

I know this is a year old, but if you've tried the solution above, the logic isn't complete. It misses reservations that start before the query start AND end after the query end. Also, BETWEEN doesn't cope with reservations that start and end at the same time.
This worked better for me:
SELECT venues.id
FROM venues LEFT JOIN reservations r
ON venues.id = r.venue_id && (r.date_end >':start' and r.date_start <':end')
WHERE r.venue_id IS NULL
ORDER BY venues.id
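The overlap test in that query (a reservation conflicts when it ends after the requested start and starts before the requested end) can be tried on a toy data set. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows, the index name and the helper name available_venues are made up for illustration.

```python
import sqlite3

def available_venues(conn, start, end):
    """Return ids of venues with no reservation overlapping (start, end)."""
    rows = conn.execute(
        """
        SELECT v.id
        FROM venues v
        LEFT JOIN reservations r
               ON v.id = r.venue_id
              AND r.date_end > ?      -- reservation ends after the requested start
              AND r.date_start < ?    -- and starts before the requested end
        WHERE r.venue_id IS NULL      -- keep only venues with no conflicting row
        ORDER BY v.id
        """,
        (start, end),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE venues (id INTEGER PRIMARY KEY);
    CREATE TABLE reservations (
        id INTEGER PRIMARY KEY,
        venue_id INTEGER NOT NULL,
        date_start TEXT NOT NULL,   -- ISO dates compare correctly as text
        date_end TEXT NOT NULL
    );
    CREATE INDEX idx_res_venue_dates ON reservations (venue_id, date_start, date_end);
    INSERT INTO venues (id) VALUES (1), (2), (3), (4);
    INSERT INTO reservations (venue_id, date_start, date_end) VALUES
        (1, '2012-04-01', '2012-04-10'),  -- spans the whole requested period
        (2, '2012-04-06', '2012-04-07'),  -- lies inside the requested period
        (3, '2012-04-08', '2012-04-12');  -- starts on the requested end date
    """
)

free = available_venues(conn, "2012-04-05", "2012-04-08")
print(free)
```

Venue 1 is the case the BETWEEN-based queries miss: its reservation starts before and ends after the requested period, so neither of its dates falls inside the range, yet the strict overlap test still catches it.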

Related

Improve query speed suggestions

For self-education I am developing an invoicing system for an electricity company. I have multiple time series tables with different intervals. One table represents consumption, two others represent prices. A third price table still has to be incorporated. Now I am running calculation queries, but the queries are slow. I would like to improve the query speed, especially since these are only the first calculations and the queries will only become more complicated. Also, please note that this is the first database I have created and these are my first exercises with one. A simplified explanation is preferred. Thanks for any help provided.
I have indexed DATE, PERIOD_FROM and PERIOD_UNTIL in each table. This sped up the process from 60 seconds to 5 seconds.
The structure of the tables is the following:
CREATE TABLE `apxprice` (
`APX_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`PRICE` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`APX_id`)
) ENGINE=MyISAM AUTO_INCREMENT=28728 DEFAULT CHARSET=latin1
CREATE TABLE `imbalanceprice` (
`imbalanceprice_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PTU` tinyint(3) DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`UPWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`DOWNWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`UPWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`DOWNWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`INCENTIVE_COMPONENT` decimal(10,2) DEFAULT NULL,
`TAKE_FROM_SYSTEM` decimal(10,2) DEFAULT NULL,
`FEED_INTO_SYSTEM` decimal(10,2) DEFAULT NULL,
`REGULATION_STATE` tinyint(1) DEFAULT NULL,
`HOUR` int(2) DEFAULT NULL,
PRIMARY KEY (`imbalanceprice_id`),
KEY `DATE` (`DATE`,`PERIOD_FROM`,`PERIOD_UNTIL`)
) ENGINE=MyISAM AUTO_INCREMENT=117427 DEFAULT CHARSET=latin1
CREATE TABLE `powerload` (
`powerload_id` int(11) NOT NULL AUTO_INCREMENT,
`EAN` varchar(18) DEFAULT NULL,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`POWERLOAD` int(11) DEFAULT NULL,
PRIMARY KEY (`powerload_id`)
) ENGINE=MyISAM AUTO_INCREMENT=61039 DEFAULT CHARSET=latin1
Now when running this query:
SELECT i.DATE, i.PERIOD_FROM, i.TAKE_FROM_SYSTEM, i.FEED_INTO_SYSTEM,
a.PRICE, p.POWERLOAD, sum(a.PRICE * p.POWERLOAD)
FROM imbalanceprice i, apxprice a, powerload p
WHERE i.DATE = a.DATE
and i.DATE = p.DATE
AND i.PERIOD_FROM >= a.PERIOD_FROM
and i.PERIOD_FROM = p.PERIOD_FROM
AND i.PERIOD_FROM < a.PERIOD_UNTIL
AND i.DATE >= '2018-01-01'
AND i.DATE <= '2018-01-31'
group by i.DATE
I have run the query with EXPLAIN and get the following result:
select_type: SIMPLE (all tables)
partitions: NULL (all tables)
possible_keys: a, p = NULL; i = DATE
key: a, p = NULL; i = DATE
key_len: a, p = NULL; i = 8
ref: a, p = NULL; i = timeseries.a.DATE,timeseries.p.PERIOD_FROM
rows: a = 28727; p = 61038; i = 1
filtered: a = 100; p = 10; i = 100
Extra (a): Using where; Using temporary; Using filesort
Extra (b): Using where; Using join buffer (Block Nested Loop)
Extra (c): NULL
Ideally I would run a more complicated query for a whole year, grouped by month for example, with all price tables incorporated. However, this would be too slow. I have indexed DATE, PERIOD_FROM and PERIOD_UNTIL in each table. The calculation result must not change; in this case it is the quarter-hourly consumption of two meters multiplied by hourly prices.
"Categorically speaking," the first thing you should look at is indexes.
Your clauses such as WHERE i.DATE = a.DATE ... are categorically known as INNER JOINs, and the SQL engine needs to have the ability to locate the matching rows "instantly." (That is to say, without looking through the entire table!)
FYI: Just like any index in real-life – here I would be talking about "library card catalogs" if we still had such a thing – indexes will assist both "equal to" and "less/greater than" queries. The index takes the computer directly to a particular point in the data, whether that's a "hit" or a "near miss."
Finally, the EXPLAIN verb is very useful: put that word in front of your query, and the SQL engine should "explain to you" exactly how it intends to carry out your query. (The SQL engine looks at the structure of the database to make that decision.) Although the EXPLAIN output is ... (heh) ... "not exactly standardized," it will help you to see if the computer thinks that it needs to do something very time-wasting in order to deliver your answer.

MARIADB: Index not used for a select with join on a range

I have a first table containing my ips stored as integer (500k rows), and a second one containing ranges of black listed ips and the reason of black listing (10M rows)
here is the table structure :
CREATE TABLE `black_lists` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`ip_start` INT(11) UNSIGNED NOT NULL,
`ip_end` INT(11) UNSIGNED NULL DEFAULT NULL,
`reason` VARCHAR(3) NOT NULL,
`excluded` TINYINT(1) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `ip_range` (`ip_end`, `ip_start`),
INDEX `ip_start` ( `ip_start`),
INDEX `ip_end` (`ip_end`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=10747741
;
CREATE TABLE `ips` (
`id` INT(11) NOT NULL AUTO_INCREMENT COMMENT 'Id ips',
`idhost` INT(11) NOT NULL COMMENT 'Id Host',
`ip` VARCHAR(45) NULL DEFAULT NULL COMMENT 'Ip',
`ipint` INT(11) UNSIGNED NULL DEFAULT NULL COMMENT 'Int ip',
`type` VARCHAR(45) NULL DEFAULT NULL COMMENT 'Type',
PRIMARY KEY (`id`),
INDEX `host` (`idhost`),
INDEX `index3` (`ip`),
INDEX `index4` (`idhost`, `ip`),
INDEX `ipsin` (`ipint`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=675651;
my problem is when I try to run this query no index is used and it takes an eternity to finish :
select i.ip,s1.reason
from ips i
left join black_lists s1 on i.ipint BETWEEN s1.ip_start and s1.ip_end;
I'm using MariaDB 10.0.16
True.
The optimizer has no knowledge that start..end values are non overlapping, nor anything else obvious about them. So, the best it can do is decide between
s1.ip_start <= i.ipint -- and use INDEX(ip_start), or
s1.ip_end >= i.ipint -- and use INDEX(ip_end)
Either of those could result in upwards of half the table being scanned.
In 2 steps you could achieve the desired goal for one ip; let's say @ip:
SELECT ip_start, reason
FROM black_lists
WHERE ip_start <= @ip
ORDER BY ip_start DESC
LIMIT 1
But after that, you need to see if the ip_end corresponding to that ip_start is >= @ip before deciding whether you have a black-listed item.
SELECT b.reason
FROM ( ... ) a -- fill in the above query
JOIN black_lists b USING(ip_start)
WHERE b.ip_end >= @ip
That will either return the reason or no rows.
In spite of the complexity, it will be very fast. But, you seem to have a set of IPs to check. That makes it more complex.
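The two-step lookup can be sketched for a set of IPs as follows. This is an illustration only: it uses Python's built-in sqlite3 module instead of MariaDB, it does the second step (checking that the address does not lie past ip_end) in Python rather than in a second SQL query, and it assumes the ranges never overlap. The sample rows and the helper name blacklist_reason are made up.

```python
import sqlite3

def blacklist_reason(conn, ip):
    """Return the reason if ip falls inside a blacklisted range, else None.

    Assumes the ranges do not overlap, which the optimizer cannot know.
    """
    row = conn.execute(
        """
        SELECT ip_end, reason
        FROM black_lists
        WHERE ip_start <= ?
        ORDER BY ip_start DESC   -- closest range start at or below ip
        LIMIT 1
        """,
        (ip,),
    ).fetchone()
    if row is None:
        return None              # every range starts above ip
    ip_end, reason = row
    # Second step: the candidate range only matches if ip is not past its end.
    if ip_end is not None and ip <= ip_end:
        return reason
    return None

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE black_lists (
        ip_start INTEGER NOT NULL,
        ip_end   INTEGER NOT NULL,
        reason   TEXT NOT NULL,
        PRIMARY KEY (ip_start, ip_end)
    );
    INSERT INTO black_lists (ip_start, ip_end, reason) VALUES
        (100, 199, 'SPM'),
        (500, 599, 'BOT');
    """
)

for ip in (150, 300, 50, 599):
    print(ip, blacklist_reason(conn, ip))
```

Each lookup touches a single index entry, so checking a whole set of addresses costs one cheap probe per address instead of one range scan per address.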
For black_lists, there seems to be no need for id. Suggest you replace the 4 indexes with only 2:
PRIMARY KEY(ip_start, ip_end),
INDEX(ip_end)
In ips, isn't ip unique? If so, get rid of id and change 5 indexes to 3:
PRIMARY KEY(ipint),
INDEX(idhost, ip),
INDEX(ip)
You have allowed more than enough in the VARCHAR for IPv6, but not in INT UNSIGNED.
More discussion.

How to create indexes efficiently

I wish to know how I can create indexes in my database according to my data structure. Most of my queries fetch data by ID and by name, joining two or three tables, with pagination. Please advise how to create indexes for the queries below.
Query:1
SELECT DISTINCT topic, type FROM books where type like 'Tutor-Books' order by topic
Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE books range faith faith 102 NULL 132 Using index condition; Using temporary; Using filesort
Query:2
SELECT books.name, books.name2, books.id, books.image, books.faith,
books.topic, books.downloaded, books.viewed, books.language,
books.size, books.author as author_id, authors.name as author_name,
authors.aid
from books
LEFT JOIN authors ON books.author = authors.aid
WHERE books.id = '".$id."'
AND status = 1
Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE books const PRIMARY PRIMARY 4 const 1 NULL
1 SIMPLE authors const aid aid 4 const 1 NULL
Can I use indexes for pagination with OFFSET, where the same query also returns the total?
SELECT SQL_CALC_FOUND_ROWS books.name, books.name2, books.id,
books.image, books.topic, books.author as author_id,
authors.name as author_name, authors.aid
from books
LEFT JOIN authors ON books.author = authors.aid
WHERE books.author = '$pid'
AND status = 1
ORDER BY books.name
LIMIT $limit OFFSET $offset
Do I need to update my queries after creating indexes? Please also suggest what the table format should be.
SHOW CREATE TABLE books:
Table Create Table
books CREATE TABLE `books` (
`name` varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`name2` varchar(150) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`author` int(100) NOT NULL,
`translator` int(120) NOT NULL,
`publisher` int(100) NOT NULL,
`pages` int(50) NOT NULL,
`date` varchar(50) CHARACTER SET latin1 NOT NULL,
`downloaded` int(100) NOT NULL,
`alt_lnk` text NOT NULL,
`viewed` int(100) NOT NULL,
`language` varchar(100) CHARACTER SET latin1 NOT NULL,
`image` varchar(200) CHARACTER SET latin1 NOT NULL,
`faith` varchar(100) CHARACTER SET latin1 NOT NULL,
`id` int(100) NOT NULL AUTO_INCREMENT,
`sid` varchar(1200) CHARACTER SET latin1 DEFAULT NULL,
`topic` varchar(100) CHARACTER SET latin1 NOT NULL,
`last_viewed` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`size` double NOT NULL,
`status` int(2) NOT NULL DEFAULT '0',
`is_scroll` int(2) NOT NULL,
`is_downloaded` int(2) NOT NULL,
`pdf_not_found` int(2) NOT NULL,
PRIMARY KEY (`id`),
KEY `name` (`name`),
KEY `downloaded` (`downloaded`),
KEY `name2` (`name2`),
KEY `topic` (`topic`),
KEY `faith` (`faith`)
) ENGINE=InnoDB AUTO_INCREMENT=12962 DEFAULT CHARSET=utf8
where type like 'Tutor-Books' order by topic (or:)
where type = 'Tutor-Books' order by topic
--> INDEX(type, topic)
where type like '%Tutor-Books' order by topic
--> INDEX(topic) -- the leading % prevents indexing
LEFT JOIN authors ON books.author = authors.aid
--> PRIMARY KEY(aid)
Do you really need LEFT JOIN? If you can change it to JOIN, the optimizer might be able to start with authors. If it does, then
--> INDEX(author) -- in `books`
My cookbook for building indexes.
Other tips:
INT(100) and INT(2) are identical -- each is a 4-byte signed integer. Read about TINYINT UNSIGNED for numbers 0..255, etc. Use that for your flags (status, is_scroll, etc)
DATE is a datatype; using a VARCHAR is problematic if you ever want to compare or order.
Learn about composite indexes, such as my first example.
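To see the effect of the composite index from the first example, here is a small sketch. It runs on Python's built-in sqlite3 module rather than MySQL, so the plan text differs from MySQL's EXPLAIN, and it assumes a type column exists in books (it does not appear in the SHOW CREATE TABLE output). The index name and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (
        id    INTEGER PRIMARY KEY,
        type  TEXT NOT NULL,    -- assumed column; it is absent from SHOW CREATE TABLE
        topic TEXT NOT NULL
    );
    -- Equality column first, then the ORDER BY column.
    CREATE INDEX idx_books_type_topic ON books (type, topic);
    INSERT INTO books (type, topic) VALUES
        ('Tutor-Books', 'Physics'),
        ('Tutor-Books', 'Algebra'),
        ('Tutor-Books', 'Algebra'),
        ('Novels', 'Drama');
    """
)

QUERY = """
    SELECT DISTINCT topic, type
    FROM books
    WHERE type = 'Tutor-Books'
    ORDER BY topic
"""

topics = [row[0] for row in conn.execute(QUERY)]
# The last column of each EXPLAIN QUERY PLAN row holds the readable plan step.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY))
print(topics)
print(plan)
```

Because the index starts with the equality column and continues with the sort column, the engine can read the matching rows already ordered by topic instead of sorting them afterwards.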
Your display widths are a little funky, but that won't cause a problem.
Query 1:
You're using the LIKE operator without a wildcard search %. You can likely swap this with an = operator.
I don't see the column type in your SHOW CREATE TABLE -- but it seems you don't have an index here, unless you renamed it to faith.
Do you need type to be a string? Could it be abstracted to a types table and then joined against using an integer? Or, if you have a fixed number of types that's unlikely to change, could you use an enum?
Query 2:
You don't need to quote the id, since it is an integer; the query as written is also probably vulnerable to SQL injection. Use ='.intval($id).' instead.
Make sure you have an index on authors.aid and that books.author and authors.aid are of the same type.

Never ending MySQL query during data import

I'm working on a data import routine from a set of CSV files into my main database and am stuck with this particular set of data. I've used LOAD DATA LOCAL INFILE to dump the CSV data into my table, feed_hcp_leasenote:
CREATE TABLE `feed_hcp_leasenote` (
`BLDGID` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`LEASID` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`NOTEDATE` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`REF1` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`REF2` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`LASTDATE` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`USERID` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`NOTETEXT` varchar(1000) COLLATE utf8_unicode_ci DEFAULT NULL,
`tempid` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`tempid`),
KEY `BLDGID` (`BLDGID`),
KEY `LEASID` (`LEASID`),
KEY `REF1` (`REF1`),
KEY `NOTEDATE` (`NOTEDATE`)
) ENGINE=MyISAM AUTO_INCREMENT=65002 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
I'm trying to import this data into two tables, lease_notes and customfield_data. lease_notes only stores a unique ID value, the note itself, and the lid which links it to the lease table. customfield_data stores a variety of data for system- and user-created fields, with each record linked to another table via the linkid field. Here's the lease_notes table:
CREATE TABLE `lease_notes` (
`lnid` int(11) NOT NULL AUTO_INCREMENT,
`notetext` longtext COLLATE utf8_unicode_ci NOT NULL,
`lid` int(11) NOT NULL COMMENT 'Lease ID',
PRIMARY KEY (`lnid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
And the customfield_data table:
CREATE TABLE `customfield_data` (
`cfdid` int(11) NOT NULL AUTO_INCREMENT,
`data_int` int(11) DEFAULT NULL,
`data_date` datetime DEFAULT NULL,
`data_smtext` varchar(1000) COLLATE utf8_unicode_ci DEFAULT NULL,
`data_lgtext` longtext COLLATE utf8_unicode_ci,
`data_numeric` decimal(20,2) DEFAULT NULL,
`linkid` int(11) DEFAULT NULL COMMENT 'ID value of specific item',
`cfid` int(11) NOT NULL COMMENT 'Custom field ID',
PRIMARY KEY (`cfdid`),
KEY `data_smtext` (`data_smtext`(333)),
KEY `linkid` (`linkid`),
KEY `cfid` (`cfid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The query that is getting stuck is as follows:
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, feed_hcp_leasenote.NOTETEXT, leases.lid, lease_notes.lnid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = ?
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.cfid = ? AND coid.data_smtext = feed_hcp_leasenote.BLDGID
JOIN customfield_data status ON leases.lid = status.linkid AND status.cfid = ? AND status.data_smtext <> ?
LEFT JOIN lease_notes ON leases.lid = lease_notes.lid
LEFT JOIN customfield_data notedate ON lease_notes.lnid = notedate.linkid AND notedate.data_date = feed_hcp_leasenote.NOTEDATE AND notedate.cfid = ?
LEFT JOIN customfield_data ref1 ON lease_notes.lnid = ref1.linkid AND ref1.data_smtext = feed_hcp_leasenote.REF1 AND ref1.cfid = ?
My goal with this is to return all records in feed_hcp_leasenote and, depending on whether or not lease_notes.lnid is null, insert or update the records as needed (nulls would be inserts, non-nulls would be updates). The problem is that the provided data uses a combination of 4 fields to determine uniqueness: BLDGID, LEASID, NOTEDATE, and REF1. A note will not exist without a proper BLDGID and LEASID (translated in my query to a valid lid). It can match an existing record with a valid lid, NOTEDATE, and REF1, but if those don't match then I can assume it's a new record.
If I chop off all of the LEFT JOINs and the lease_notes.lnid from the SELECT, it executes properly and gives me all records. Since I couldn't get my original query to work I played with the idea of cycling all results and performing another SELECT to see if the notedate and ref1 matched. If not, I INSERTed, otherwise UPDATE. While this approach works it can only process about 20 records per second which is a problem when I'm dealing with 30,000 at a crack.
Since I got asked about it in a previous question, here's an EXPLAIN of my query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE status ref data_smtext,linkid,cfid cfid 4 const 934 Using where
1 SIMPLE mrileaseid ref data_smtext,linkid,cfid linkid 5 rl_hpsi.status.linkid 19 Using where
1 SIMPLE leases eq_ref PRIMARY,sid PRIMARY 4 rl_hpsi.mrileaseid.linkid 1 Using where
1 SIMPLE suites eq_ref PRIMARY,fid PRIMARY 4 rl_hpsi.leases.sid 1
1 SIMPLE floors eq_ref PRIMARY,bid PRIMARY 4 rl_hpsi.suites.fid 1
1 SIMPLE feed_hcp_leasenote ref BLDGID,LEASID LEASID 153 rl_hpsi.mrileaseid.data_smtext 19 Using where
1 SIMPLE coid ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.BLDGID 10 Using where
1 SIMPLE lease_notes ALL NULL NULL NULL NULL 15000
1 SIMPLE notedate ref linkid,cfid linkid 5 rl_hpsi.lease_notes.lnid 24
1 SIMPLE ref1 ref data_smtext,linkid,cfid data_smtext 1002 rl_hpsi.feed_hcp_leasenote.REF1 10
Can anyone point me in the right direction? Thanks!
From our comments:
The answer is to add the columns that make an entry unique to your destination table and create a compound unique key on them. Then when inserting to that table use INSERT ON DUPLICATE KEY UPDATE to prevent duplicate data. When the insert is complete you can drop those columns if they are no longer necessary, to prevent storing data in multiple tables.
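A minimal sketch of that approach follows. It runs on Python's built-in sqlite3 module, where INSERT ... ON CONFLICT ... DO UPDATE (available in SQLite 3.24 and later) plays the role of MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The notedate and ref1 columns added to lease_notes, the sample rows and the helper name import_notes are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lease_notes (
        lnid     INTEGER PRIMARY KEY AUTOINCREMENT,
        notetext TEXT NOT NULL,
        lid      INTEGER NOT NULL,
        notedate TEXT NOT NULL,     -- hypothetical column copied from the feed
        ref1     TEXT NOT NULL,     -- hypothetical column copied from the feed
        UNIQUE (lid, notedate, ref1)
    );
    """
)

# SQLite spelling of the upsert; MySQL would use ON DUPLICATE KEY UPDATE.
UPSERT = """
    INSERT INTO lease_notes (notetext, lid, notedate, ref1)
    VALUES (?, ?, ?, ?)
    ON CONFLICT (lid, notedate, ref1)
    DO UPDATE SET notetext = excluded.notetext
"""

def import_notes(conn, rows):
    """Insert new notes and update existing ones in a single pass."""
    with conn:  # one transaction for the whole batch
        conn.executemany(UPSERT, rows)

import_notes(conn, [
    ("first note", 7, "2012-01-01", "A"),
    ("second note", 7, "2012-01-02", "A"),
])
import_notes(conn, [
    ("first note, revised", 7, "2012-01-01", "A"),  # same key: updated
    ("third note", 8, "2012-01-01", "A"),            # new key: inserted
])

notes = conn.execute(
    "SELECT lid, notedate, ref1, notetext FROM lease_notes ORDER BY lid, notedate"
).fetchall()
print(notes)
```

The unique key does the matching that the LEFT JOINs were trying to do, so the import no longer needs a separate SELECT per record to decide between INSERT and UPDATE.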

Optimization Mysql Query Left Join

We want to map the entries of calibration_data to the calibration table with the following query, but the query takes far too long in my opinion (>24h).
Is there any optimization possible?
For testing we added more indexes than we need right now, but it didn't have any impact on the duration.
[Edit]
The hardware shouldn't be the biggest bottleneck
128 GB RAM
1TB SSD RAID 5
32 cores
EXPLAIN result
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | cal | NULL | ALL | NULL | NULL | NULL | NULL | 2009 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | m | NULL | ALL | visit | NULL | NULL | NULL | 3082466 | 100.00 | Range checked for each record (index map: 0x1) |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
Query which takes too long:
Insert into knn_data (SELECT cal.X AS X,
cal.Y AS Y,
cal.BeginTime AS BeginTime,
cal.EndTime AS EndTime,
avg(m.dbm_ant) AS avg_dbm_ant,
m.ant_id AS ant_id,
avg(m.location) avg_location,
count(*) AS count,
m.visit
FROM calibration cal
LEFT join calibration_data m
ON m.visit BETWEEN cal.BeginTime AND cal.EndTime
GROUP BY cal.X,
cal.Y,
cal.BeginTime,
cal.BeaconId,
m.ant_id,
m.macHash,
m.visit);
Table knn_data:
CREATE TABLE `knn_data` (
`X` int(11) NOT NULL,
`Y` int(11) NOT NULL,
`BeginTime` datetime NOT NULL,
`EndTIme` datetime NOT NULL,
`avg_dbm_ant` float DEFAULT NULL,
`ant_id` int(11) NOT NULL,
`avg_location` float DEFAULT NULL,
`count` int(11) DEFAULT NULL,
`visit` datetime NOT NULL,
PRIMARY KEY (`ant_id`,`visit`,`X`,`Y`,`BeginTime`,`EndTIme`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table calibration
BeaconId, X, Y, BeginTime, EndTime
41791, 1698, 3944, 2016-11-12 22:44:00, 2016-11-12 22:49:00
CREATE TABLE `calibration` (
`BeaconId` int(11) DEFAULT NULL,
`X` int(11) DEFAULT NULL,
`Y` int(11) DEFAULT NULL,
`BeginTime` datetime DEFAULT NULL,
`EndTime` datetime DEFAULT NULL,
KEY `x,y` (`X`,`Y`),
KEY `x` (`X`),
KEY `y` (`Y`),
KEY `BID` (`BeaconId`),
KEY `beginTime` (`BeginTime`),
KEY `x,y,beg,bid` (`X`,`Y`,`BeginTime`,`BeaconId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table calibration_data
macHash, visit, dbm_ant, ant_id, mac, isRand, posX, posY, sources, ip, dayOfMonth, location, am, ar
'f5:dc:7d:73:2d:e9', '2016-11-12 22:44:00', '-87', '381', 'f5:dc:7d:73:2d:e9', NULL, NULL, NULL, NULL, NULL, '12', '18.077636300207715', 'inradius_41791', NULL
CREATE TABLE `calibration_data` (
`macHash` varchar(100) COLLATE utf8_bin NOT NULL,
`visit` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`dbm_ant` int(3) NOT NULL,
`ant_id` int(11) NOT NULL,
`mac` char(17) COLLATE utf8_bin DEFAULT NULL,
`isRand` tinyint(4) DEFAULT NULL,
`posX` double DEFAULT NULL,
`posY` double DEFAULT NULL,
`sources` int(2) DEFAULT NULL,
`ip` int(10) unsigned DEFAULT NULL,
`dayOfMonth` int(11) DEFAULT NULL,
`location` varchar(80) COLLATE utf8_bin DEFAULT NULL,
`am` varchar(300) COLLATE utf8_bin DEFAULT NULL,
`ar` varchar(300) COLLATE utf8_bin DEFAULT NULL,
KEY `visit` (`visit`),
KEY `macHash` (`macHash`),
KEY `ant, time` (`dbm_ant`,`visit`),
KEY `beacon` (`am`),
KEY `ant_id` (`ant_id`),
KEY `ant,mH,visit` (`ant_id`,`macHash`,`visit`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Onetime task? Then it does not matter? After getting this data loaded, will you incrementally update the "summary table" each day?
Shrink datatypes -- bulky data takes longer to process. Example: a 4-byte INT DayOfMonth could be a 1-byte TINYINT UNSIGNED.
You are moving a TIMESTAMP into a DATETIME. This may or may not work as you expect.
INT UNSIGNED is OK for IPv4, but you can't fit IPv6 in it.
COUNT(*) probably does not need a 4-byte INT; see the smaller variants.
Use UNSIGNED where appropriate.
A mac-address takes 19 bytes the way you have it; it could easily be converted to/from a 6-byte BINARY(6). See REPLACE(), UNHEX(), HEX(), etc.
What is the setting of innodb_buffer_pool_size? It could be about 100G for the big RAM you have.
Do the time ranges overlap? If not, take advantage of that. Also, don't include unnecessary columns in the PRIMARY KEY, such as EndTime.
Have the GROUP BY columns in the same order as the PRIMARY KEY of knn_data; this will avoid a lot of block splits during the INSERT.
The big problem is that there is no useful index in calibration_data, so the JOIN has to do a full table scan again and again! An estimated 2K scans of 3M rows! Let me focus on that problem...
There is no good way to do WHERE x BETWEEN start AND end because MySQL does not know whether the datetime ranges overlap. There is no real cure for that in this context, so let me approach it differently...
Are start and end 'regular'? Like every hour? If so, we can do some sort of computation instead of the BETWEEN. Let me know if this is the case; I will continue my thoughts.
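For the MAC address tip above, the conversion to and from a 6-byte value can be sketched in Python. The two helper names are made up; mac_to_bytes mirrors what UNHEX(REPLACE(mac, ':', '')) would produce in MySQL, and bytes_to_mac goes the other way.

```python
def mac_to_bytes(mac: str) -> bytes:
    """Pack 'f5:dc:7d:73:2d:e9' into 6 raw bytes, like UNHEX(REPLACE(mac, ':', ''))."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw

def bytes_to_mac(raw: bytes) -> str:
    """Inverse: 6 raw bytes back to the colon-separated lowercase form."""
    return ":".join(f"{b:02x}" for b in raw)

packed = mac_to_bytes("f5:dc:7d:73:2d:e9")
print(len(packed), bytes_to_mac(packed))
```

Storing the packed form in a BINARY(6) column shrinks every row and every index entry that contains the address.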
That's a nasty and classic problem with "range" queries: the optimiser doesn't use your indexes and ends up doing a full table scan. In your explain plan you can see this in the column type=ALL.
Ideally you should have type=range and something in the key column.
Some ideas:
I doubt that changing your join condition from
ON m.visit BETWEEN cal.BeginTime AND cal.EndTime
to
ON m.visit >= cal.BeginTime AND m.visit <= cal.EndTime
will work, but still give it a try.
Do trigger an ANALYZE TABLE on both tables. This will update the statistics on your tables and might help the optimiser make the right decision (i.e. use the indexes).
Changing the query to this might also help to force the optimiser to use indexes:
Insert into knn_data (SELECT cal.X AS X,
cal.Y AS Y,
cal.BeginTime AS BeginTime,
cal.EndTime AS EndTime,
avg(m.dbm_ant) AS avg_dbm_ant,
m.ant_id AS ant_id,
avg(m.location) avg_location,
count(*) AS count,
m.visit
FROM calibration cal
LEFT join calibration_data m
ON m.visit >= cal.BeginTime
WHERE m.visit <= cal.EndTime
GROUP BY cal.X,
cal.Y,
cal.BeginTime,
cal.BeaconId,
m.ant_id,
m.macHash,
m.visit);
That's all I can think of...