MySQL join on a range of time values (no exact relation)

I am evaluating log files for a research project and have inserted them into a MySQL database.
Now I have a query where I need to join data from other tables without an exactly matching value.
The "logdata" table contains the data of the mobile units I have to analyze; "basepositions" holds the GPS coordinates of the base stations. Two of the data fields of "logdata" log the sender position of the corresponding base station. The problem is that the logged position of a base station varies slightly over time (GPS fluctuation, only a small fraction of a degree), so I have to look for the right entry using the BETWEEN operator, as seen in the query below. This is not perfect, but there are only about 100 base stations, so the cost is tolerable here.
The same problem exists in the second join, where I have to get a validity flag out of another table. Both logs are written approximately every second but are not synchronized, so I have to scan for the corresponding row, again using BETWEEN with a time range of 1 second.
Because of the number of rows, this second scan makes my execution time explode.
I think the inexact correlation is the problem here.
The two tables both have the indexes given in the overview below.
Is there a way to speed up the query? In my database setup it currently takes 30 hours to return around 20000 rows.
I appreciate any help.
logdata (~ 300.000.000 entries):
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| unit | tinytext | YES | MUL | NULL | |
| timestamp | bigint(20) | YES | | NULL | |
| logid | int(11) | YES | | NULL | |
| d1 | bigint(20) | YES | | NULL | |
| d2 | bigint(20) | YES | | NULL | |
| d3 | bigint(20) | YES | | NULL | |
| d4 | bigint(20) | YES | | NULL | |
| d5 | bigint(20) | YES | | NULL | |
| d6 | bigint(20) | YES | | NULL | |
| d7 | bigint(20) | YES | | NULL | |
| d8 | bigint(20) | YES | | NULL | |
| d9 | bigint(20) | YES | | NULL | |
| d10 | bigint(20) | YES | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
basepositions (~100 entries):
+----------------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------------------+--------------+------+-----+---------+-------+
| ID | int(11) | NO | PRI | NULL | |
| GPSLONGITUDE | varchar(50) | YES | | NULL | |
| LOCATION | varchar(100) | YES | | NULL | |
| GPSLATITUDE | varchar(50) | YES | | NULL | |
| GPSALTITUDE | varchar(50) | YES | | NULL | |
| ISUNDERTEST | tinyint(1) | YES | | 0 | |
+----------------------------+--------------+------+-----+---------+-------+
validity (~200.000.000 entries):
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| unit | tinytext | YES | MUL | NULL | |
| timestamp | bigint(20) | YES | | NULL | |
| logid | int(11) | YES | | NULL | |
| d1 | bigint(20) | YES | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
my query so far:
SELECT
logdata.unit,
logdata.timestamp,
logdata.d1,
logdata.d2,
cast(logdata.d3/10000000 as decimal(15, 10)),
cast(logdata.d4/10000000 as decimal(15, 10)),
logdata.d5,
logdata.d6,
logdata.d7,
logdata.d8,
cast(logdata.d9/10000000 as decimal(15, 10)),
cast(logdata.d10/10000000 as decimal(15, 10)),
BASEID,
validity.d1
FROM
logdata
JOIN
basepositions
ON
cast(GPSLATITUDE / 10000000 as decimal(15,10)) BETWEEN cast(d3 / 10000000 as decimal(15,10)) - 0.0001 AND cast(d3 / 10000000 as decimal(15,10)) + 0.0001
AND
cast(GPSLONGITUDE / 10000000 as decimal(15,10)) BETWEEN cast(d4 / 10000000 as decimal(15,10)) - 0.0001 AND cast(d4 / 10000000 as decimal(15,10)) + 0.0001
JOIN
validity
ON
validity.unit = logdata.unit
AND
validity.logid = 12345
AND
validity.timestamp BETWEEN logdata.timestamp - 500 AND logdata.timestamp + 499
WHERE
logdata.unit = "IVS${IVS}"
AND
logdata.logid = 111222
AND
BASEID = 012;
Indexes:
+-------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| logdata | 0 | PRIMARY | 1 | id | A | 301433830 | NULL | NULL | | BTREE | | |
| logdata | 1 | unit_logid_timestamp | 1 | unit | A | 18 | 6 | NULL | YES | BTREE | | |
| logdata | 1 | unit_logid_timestamp | 2 | logid | A | 18 | NULL | NULL | YES | BTREE | | |
| logdata | 1 | unit_logid_timestamp | 3 | timestamp | A | 301433830 | NULL | NULL | YES | BTREE | | |
+-------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
EDIT (the comment field was too small):
I think the problem is the join that is constructed. EXPLAIN EXTENDED shows that the query optimizer is joining all three tables together, which means 300.000.000 * 200.000.000 * 100 rows to look through.
When I rewrite the join with "validity" as a subquery, MySQL just joins "logdata" and "basepositions".
I think data type changes could be a factor in later optimization, but first I think I have to bring the runtime down by a few orders of magnitude by optimizing the query plan.
I'm not experienced enough to know what I can do to further optimize this query.
The single query for a timestamp on "validity" returns in no time at all.
The single query for the base station position is also very fast.
I don't know how I can convince MySQL to filter first and then join.
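For illustration, the subquery form mentioned in the edit above could look roughly like the following. This is a sketch under assumptions, not the asker's actual rewrite: the unit value 'IVS1' is a placeholder, the join with "basepositions" is left out for brevity, and the aliases l and v are made up. Note that it is not strictly equivalent to the inner join: a logdata row without a matching validity row yields NULL instead of being dropped, and at most one validity row is picked per logdata row.

```sql
-- Sketch only: look up the validity flag per logdata row with a
-- correlated subquery instead of a three-way join.
SELECT
    l.unit,
    l.timestamp,
    l.d1,
    l.d2,
    (SELECT v.d1
       FROM validity AS v
      WHERE v.unit = l.unit
        AND v.logid = 12345
        AND v.timestamp BETWEEN l.timestamp - 500 AND l.timestamp + 499
      ORDER BY v.timestamp
      LIMIT 1) AS validity_d1        -- NULL when no row falls in the window
FROM logdata AS l
WHERE l.unit = 'IVS1'                -- placeholder unit name (assumption)
  AND l.logid = 111222;
```

Running EXPLAIN on this form shows whether the subquery is resolved as a narrow range on unit_logid_timestamp for each outer row, which is the access pattern the fast single-timestamp lookup suggests.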
EDIT 2:
Here are the indexes you asked for. I got them using "SHOW INDEXES FROM".
indexes for "validity":
+-------------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| validity | 0 | PRIMARY | 1 | id | A | 194863653 | NULL | NULL | | BTREE | | |
| validity | 1 | unit_logid_timestamp | 1 | unit | A | 18 | 6 | NULL | YES | BTREE | | |
| validity | 1 | unit_logid_timestamp | 2 | logid | A | 18 | NULL | NULL | YES | BTREE | | |
| validity | 1 | unit_logid_timestamp | 3 | timestamp | A | 194863653 | NULL | NULL | YES | BTREE | | |
+-------------------------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
indexes for "basepositions":
+----------------------+------------+---------------------------------------+--------------+----------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------------------+------------+---------------------------------------+--------------+----------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| basepositions | 0 | PRIMARY | 1 | ID | A | 109 | NULL | NULL | | BTREE | | |
+----------------------+------------+---------------------------------------+--------------+----------------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
EXPLAIN of the query above:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE basepositions const PRIMARY PRIMARY 4 const 1 100.00
1 SIMPLE logdata ref unit_logid_timestamp unit_logid_timestamp 14 const,const 4150932 100.00 Using where
1 SIMPLE validity ref unit_logid_timestamp unit_logid_timestamp 14 const,const 3294136 100.00 Using where
EXPLAIN (after adding indexes for lat/lon):
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE basepositions const PRIMARY,lat_lon,lat,lon PRIMARY 4 const 1 100.00
1 SIMPLE logdata ref unit_logid_timestamp unit_logid_timestamp 14 const,const 4150932 100.00 Using where
1 SIMPLE validity ref unit_logid_timestamp unit_logid_timestamp 14 const,const 3294136 100.00 Using where

Related

MySQL query optimization with compound index and persistent column

The following query runs on MariaDB 10.0.28 and takes ~17 seconds. I'm looking to speed it up substantially.
select series_id,delivery_date,delivery_he,forecast_date,forecast_he,value
from forecast where forecast_he=8
AND series_id in (12142594,20735627,632287496,1146453088,1206342447,1154376340,2095084238,2445233529,2495523920,2541234725,2904312523,3564421486)
AND delivery_date >= '2016-07-13'
AND delivery_date < '2018-06-27'
and DATEDIFF(delivery_date,forecast_date)=1
The first attempt to speed it up was to create a persistent column as (datediff(delivery_date,forecast_date)), rebuild the index using the persistent column, and modify the query, replacing the datediff calc with forecast_delivery_delta=1
> describe forecast;
+-------------------------+------------------+------+-----+---------+------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------+------------------+------+-----+---------+------------+
| series_id | int(10) unsigned | NO | PRI | 0 | |
| delivery_date | date | NO | PRI | NULL | |
| delivery_he | int(11) | NO | PRI | NULL | |
| forecast_date | date | NO | PRI | NULL | |
| forecast_he | int(11) | NO | PRI | NULL | |
| value | float | NO | | NULL | |
| forecast_delivery_delta | tinyint(4) | YES | | NULL | PERSISTENT |
+-------------------------+------------------+------+-----+---------+------------+
> show index from forecast;
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| forecast | 0 | PRIMARY | 1 | series_id | A | 35081 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 2 | delivery_date | A | 130472 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 3 | delivery_he | A | 1290223 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 4 | forecast_date | A | 2322401 | NULL | NULL | | BTREE | | |
| forecast | 0 | PRIMARY | 5 | forecast_he | A | 23224016 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 1 | forecast_he | A | 29812 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 2 | series_id | A | 74198 | NULL | NULL | | BTREE | | |
| forecast | 1 | he_series_delta_date | 3 | delivery_date | A | 774133 | NULL | NULL | | BTREE | | |
+----------+------------+----------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
This seems to have taken ~2 seconds off the runtime, but I'm wondering if there are better ways to speed this up substantially. I looked into adjusting the buffer size but it seemed not to be wildly misconfigured.
>show variables like '%innodb_buffer_pool_size%';
+-------------------------+-----------+
| Variable_name | Value |
+-------------------------+-----------+
| innodb_buffer_pool_size | 134217728 |
+-------------------------+-----------+
Total table size:
+----------+------------+
| Table | Size in MB |
+----------+------------+
| forecast | 1547.00 |
+----------+------------+
EXPLAIN:
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
| 1 | SIMPLE | forecast | range | PRIMARY,he_series_delta_date | he_series_delta_date | 11 | NULL | 832016 | Using index condition |
+------+-------------+----------+-------+------------------------------+----------------------+---------+------+--------+-----------------------+
If you are going to say
AND forecast_delivery_delta=1
then the optimal index is one starting with the two = columns:
(forecast_he, forecast_delivery_delta, -- in either order
series_id, -- an IN might work ok next
delivery_date) -- finally a range
It is generally useless to put a column (delivery_date) tested via a range anywhere other than last.
But note that this index will not work very well if you say forecast_delivery_delta <= 2. Now it is a "range", and nothing after it in the index will be used for filtering. Still, it may be worth having a small number of different indexes, just in case you turn = into a range or vice versa.
And increase innodb_buffer_pool_size to be about 70% of RAM (assuming you have over 4GB of RAM).
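As a rough sketch, the index described above could be created and used like this. The index name he_delta_series_date is made up for illustration, the IN list is shortened, and the sketch assumes the persistent column forecast_delivery_delta exists as shown in the DESCRIBE output.

```sql
-- Equality columns first, then the IN column, then the range column last.
ALTER TABLE forecast
    ADD INDEX he_delta_series_date
        (forecast_he, forecast_delivery_delta, series_id, delivery_date);

-- The query rewritten to test the persistent column instead of DATEDIFF().
SELECT series_id, delivery_date, delivery_he, forecast_date, forecast_he, value
FROM forecast
WHERE forecast_he = 8
  AND forecast_delivery_delta = 1
  AND series_id IN (12142594, 20735627)   -- shortened list, for illustration
  AND delivery_date >= '2016-07-13'
  AND delivery_date <  '2018-06-27';
```

Running EXPLAIN on the rewritten query afterwards shows whether the optimizer actually picks the new index.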

What does cardinality mean in MySQL when using a composite index?

I'm new to MySQL and a little confused about what cardinality means. I read that it means the number of unique rows, but I'd like to know what it means in this case. This is my table definition:
+-------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| revisado | varchar(10) | YES | MUL | NULL | |
| total | int(11) | NO | MUL | NULL | |
| busqueda | varchar(300) | NO | MUL | NULL | |
| clave | bigint(15) | NO | | NULL | |
| producto_servicio | varchar(300) | NO | | NULL | |
+-------------------+--------------+------+-----+---------+----------------+
The total number of records right now is 13621.
I have this query:
SELECT clave, producto_servicio FROM buscador_claves2 WHERE busqueda = 'FERRETERIA' AND total = 2 AND revisado = 'APROBADO'
And this is the index definition of the table:
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| buscador_claves2 | 0 | PRIMARY | 1 | id | A | 14309 | NULL | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_busqueda | 1 | busqueda | A | 14309 | 255 | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_total | 1 | total | A | 3 | NULL | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_revisado | 1 | revisado | A | 1 | NULL | NULL | YES | BTREE | |
| buscador_claves2 | 1 | idx_compuesto1 | 1 | revisado | A | 1 | NULL | NULL | YES | BTREE | |
| buscador_claves2 | 1 | idx_compuesto1 | 2 | total | A | 105 | NULL | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_compuesto1 | 3 | busqueda | A | 14309 | 255 | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_compuesto2 | 1 | busqueda | A | 14309 | 255 | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_compuesto2 | 2 | total | A | 14309 | NULL | NULL | | BTREE | |
| buscador_claves2 | 1 | idx_compuesto2 | 3 | revisado | A | 14309 | NULL | NULL | YES | BTREE | |
+------------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
The query uses idx_compuesto1 to find the data. What does the cardinality mean in this case for the revisado, total and busqueda columns as part of idx_compuesto1? And why does it pick idx_compuesto1 instead of idx_compuesto2? I can see the cardinality is different in the two indexes.
This is the EXPLAIN output of the query:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: buscador_claves2
type: ref
possible_keys: idx_busqueda,idx_total,idx_revisado,idx_compuesto1,idx_compuesto2
key: idx_compuesto1
key_len: 804
ref: const,const,const
rows: 1
Extra: Using where
I hope you can help me to understand better this info, thank you.
In MySQL, the value of the index cardinality column is the storage engine estimate for the number of unique values in that index. It is used to determine how well this index can be used during joins. Generally MySQL optimizer prefers the index with a higher cardinality, because it usually means it is able to filter down to fewer rows.
The ideal scenario is for the value of cardinality to be always equal to SELECT COUNT(DISTINCT the_key)..., but in practice it is usually off by some relatively small margin due to the difficulty of accurately computing this during normal database operations in an efficient manner that does not disrupt database performance. The value will be more accurate immediately after ANALYZE TABLE.
Being off on cardinality begins to matter when the optimizer can choose more than one key for a particular join, it makes a huge difference in performance which one gets chosen, and the cardinality estimates for those keys are sufficiently off to cause the optimizer to choose the wrong key. Those situations are relatively rare, but do happen. In that case, the problem can be solved either with ANALYZE TABLE or - if you are always 100% sure which key is better for the join - by explicitly making the optimizer use it with FORCE KEY in the query.
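A minimal sketch of the steps mentioned above, using the table from the question. The column aliases are made up, and choosing idx_compuesto2 in the hint is only for illustration, not a recommendation. As far as I understand, for a composite index the cardinality at each Seq_in_index position is the estimate for the column prefix up to that position, which is what the second statement compares against.

```sql
-- Refresh the stored index statistics.
ANALYZE TABLE buscador_claves2;

-- Exact distinct counts for the first one and two columns of idx_compuesto1
-- (COUNT(DISTINCT ...) skips rows where any listed column is NULL).
SELECT COUNT(DISTINCT revisado)        AS revisado_values,
       COUNT(DISTINCT revisado, total) AS revisado_total_pairs
FROM buscador_claves2;

-- Override the optimizer's choice with an index hint.
SELECT clave, producto_servicio
FROM buscador_claves2 FORCE INDEX (idx_compuesto2)
WHERE busqueda = 'FERRETERIA'
  AND total = 2
  AND revisado = 'APROBADO';
```

After ANALYZE TABLE, running SHOW INDEX FROM buscador_claves2 again lets you compare the stored estimates with the exact counts.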

Simple MySQL query with performance issues

I have the following simple MySQL query:
SELECT SQL_NO_CACHE mainID
FROM tableName
WHERE otherID3=19
AND dateStartCol >= '2012-08-01'
AND dateStartCol <= '2012-08-31';
When I run this it takes 0.29 seconds to bring back 36074 results. When I increase my date period to bring back more results (65703) it runs in 0.56. When I run other similar SQL queries on the same server but on different tables (some tables are larger) the results come back in approximately 0.01 seconds.
Although 0.29 seconds isn't slow, this is a basic building block of a more complex query, and this timing means that it will not scale.
See below for the table definition and indexes.
I know it's not server load as I have the same issue on a development server which has very little usage.
+---------------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------------+--------------+------+-----+---------+----------------+
| mainID | int(11) | NO | PRI | NULL | auto_increment |
| otherID1 | int(11) | NO | MUL | NULL | |
| otherID2 | int(11) | NO | MUL | NULL | |
| otherID3 | int(11) | NO | MUL | NULL | |
| keyword | varchar(200) | NO | MUL | NULL | |
| dateStartCol | date | NO | MUL | NULL | |
| timeStartCol | time | NO | MUL | NULL | |
| dateEndCol | date | NO | MUL | NULL | |
| timeEndCol | time | NO | MUL | NULL | |
| statusCode | int(1) | NO | MUL | NULL | |
| uRL | text | NO | | NULL | |
| hostname | varchar(200) | YES | MUL | NULL | |
| IPAddress | varchar(25) | YES | | NULL | |
| cookieVal | varchar(100) | NO | | NULL | |
| keywordVal | varchar(60) | NO | | NULL | |
| dateTimeCol | datetime | NO | MUL | NULL | |
+---------------------------+--------------+------+-----+---------+----------------+
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
| tableName | 0 | PRIMARY | 1 | mainID | A | 661990 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID1 | 1 | otherID1 | A | 330995 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID2 | 1 | otherID2 | A | 25 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_otherID3 | 1 | otherID3 | A | 48 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_dateStartCol | 1 | dateStartCol | A | 187 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_timeStartCol | 1 | timeStartCol | A | 73554 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_dateEndCol | 1 | dateEndCol | A | 188 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_timeEndCol | 1 | timeEndCol | A | 73554 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_keyword | 1 | keyword | A | 82748 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_hostname | 1 | hostname | A | 2955 | NULL | NULL | YES | BTREE | |
| tableName | 1 | idx_dateTimeCol | 1 | dateTimeCol | A | 220663 | NULL | NULL | | BTREE | |
| tableName | 1 | idx_statusCode | 1 | statusCode | A | 2 | NULL | NULL | | BTREE | |
+--------------------+------------+-------------------------------+--------------+---------------------------+-----------+-------------+----------+--------+------+------------+---------+
Explain Output:
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | tableName | range | idx_otherID3,idx_dateStartCol | idx_dateStartCol | 3 | NULL | 66875 | 75.00 | Using where |
+----+-------------+-----------+-------+----------------------------------+-------------------+---------+------+-------+----------+-------------+
If that is really your query (and not a simplified version of same), then this ought to achieve best results:
CREATE INDEX table_ndx on tableName( otherID3, dateStartCol, mainID);
The first index entry means that the first match in the WHERE is very fast; the same also applies with dateStartCol. The third field is very small and does not slow the index appreciably, but allows for the datum you require to be found immediately in the index with no table access at all.
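One way to check whether the index behaves as described is to run EXPLAIN on the query. This sketch assumes table_ndx from the statement above has already been created.

```sql
-- With table_ndx (otherID3, dateStartCol, mainID) in place, check the plan:
--   key   should name table_ndx if the optimizer picks it
--   Extra normally contains "Using index" when the index covers the query
EXPLAIN
SELECT mainID
FROM tableName
WHERE otherID3 = 19
  AND dateStartCol >= '2012-08-01'
  AND dateStartCol <= '2012-08-31';
```

If key still shows idx_dateStartCol, the optimizer preferred the single-column index, and its statistics may need refreshing with ANALYZE TABLE.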
It is important that the keys are in the same index. In the index list you posted, each key is in an index of its own, so even if MySQL chooses the best index, the performance will not be optimal. I'd try to use fewer indexes, for they also have a cost (shameless plug: Can Indices actually decrease SELECT performance? ).
First, try to add the right key. It seems like dateStartCol is more selective than otherID3:
ALTER TABLE tableName ADD KEY idx_dates(dateStartCol, otherID3)
Second, please make sure you select only the rows you need by adding a LIMIT clause to the SELECT. This should speed up the query. Try it like this:
SELECT SQL_NO_CACHE mainID
FROM tableName
WHERE otherID3=19
AND dateStartCol >= '2012-08-01'
AND dateStartCol <= '2012-08-31'
LIMIT 10;
Please also make sure that your MySQL is tuned properly. You may want to check key_buffer_size and innodb_buffer_pool_size as described in http://astellar.com/2011/12/why-is-stock-mysql-slow/
If this is a recurrent or important query then create a multiple column index:
CREATE INDEX index_name ON tableName (otherID3, dateStartCol)
Delete the unused indexes, as they make table changes more expensive.
BTW you don't need two separate columns for date and time. You can combine them in a datetime or timestamp type. One less column and one less index.
The explain output shows it chose the dateStartCol index, so you could try the opposite of what I suggested above:
CREATE INDEX index_name ON tableName (dateStartCol, otherID3)
Notice that the query's dateStartCol condition will still get 75% of the rows so not much improvement, if any, in using that single index.
How unique is otherID3? If there are not many repeated otherID3 you can hint the engine to use it.

MySQL optimization on join tables with range criteria

I want to join two tables by matching a single position in one table against a range (represented by two columns) in the other table.
However, the query is too slow: it takes about 20 minutes.
I have tried adding indexes to the tables and changing the query,
but the performance is still poor.
So I am asking how to optimize the speed of the join.
The following is the query to MySQL.
mysql> SELECT `inVar`.chrom, `inVar`.pos, `openChrom_K562`.score
-> FROM `inVar`
-> LEFT JOIN `openChrom_K562`
-> ON (
-> `inVar`.chrom=`openChrom_K562`.chrom AND
-> `inVar`.pos BETWEEN `openChrom_K562`.chromStart AND `openChrom_K562`.chromEnd
-> );
inVar and openChrom_K562 are the tables I used.
inVar stores the single position in each row.
openChrom_K562 stores the range information indicated by chromStart and chromEnd.
inVar contains 57902 rows and openChrom_K562 has 137373 rows respectively.
Fields on the tables.
mysql> DESCRIBE inVar;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| chrom | varchar(31) | NO | PRI | NULL | |
| pos | int(10) | NO | PRI | NULL | |
+-------+-------------+------+-----+---------+-------+
mysql> DESCRIBE openChrom_K562;
+------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| chrom | varchar(31) | NO | MUL | NULL | |
| chromStart | int(10) | NO | MUL | NULL | |
| chromEnd | int(10) | NO | | NULL | |
| score | int(10) | NO | | NULL | |
+------------+-------------+------+-----+---------+-------+
Index built in the tables
mysql> SHOW INDEX FROM inVar;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| inVar | 0 | PRIMARY | 1 | chrom | A | NULL | NULL | NULL | | BTREE | |
| inVar | 0 | PRIMARY | 2 | pos | A | 57902 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
mysql> SHOW INDEX FROM openChrom_K562;
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| openChrom_K562 | 1 | start_end | 1 | chromStart | A | 137373 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | start_end | 2 | chromEnd | A | 137373 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | chrom_only | 1 | chrom | A | 22 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | chrom_start | 1 | chrom | A | 22 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | chrom_start | 2 | chromStart | A | 137373 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | chrom_end | 1 | chrom | A | 22 | NULL | NULL | | BTREE | |
| openChrom_K562 | 1 | chrom_end | 2 | chromEnd | A | 137373 | NULL | NULL | | BTREE | |
+----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Execution plan on MySQL
mysql> EXPLAIN SELECT `inVar`.chrom, `inVar`.pos, score FROM `inVar` LEFT JOIN `openChrom_K562` ON ( inVar.chrom=openChrom_K562.chrom AND `inVar`.pos BETWEEN chromStart AND chromEnd );
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+
| 1 | SIMPLE | inVar | index | NULL | PRIMARY | 37 | NULL | 57902 | Using index |
| 1 | SIMPLE | openChrom_K562 | ref | start_end,chrom_only,chrom_start,chrom_end | chrom_only | 33 | tmp.inVar.chrom | 5973 | |
+----+-------------+----------------+-------+--------------------------------------------+------------+---------+-----------------+-------+-------------+
It seems MySQL only optimizes by matching chrom between the two tables and then does a brute-force comparison of the positions.
Is there any way to optimize this further, such as indexing on the position?
(It is my first time posting the question, sorry for the poor posting quality.)
chrom_only is likely to be a bad index choice for your join, as you only have 22 distinct chrom values.
If I have interpreted this correctly, the query should be faster when using start_end:
SELECT `inVar`.chrom, `inVar`.pos, `openChrom_K562`.score
FROM `inVar`
LEFT JOIN `openChrom_K562`
USE INDEX (`start_end`)
ON (
`inVar`.chrom=`openChrom_K562`.chrom AND
`inVar`.pos BETWEEN `openChrom_K562`.chromStart AND `openChrom_K562`.chromEnd
)

MySQL primary key returning fewer results than compound composite index

I have inherited a database schema which has some design issues.
Note that there are another 9 keys on the table which I haven't listed below. The keys in question look like this:
+-------+------------+----------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+----------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| users | 0 | PRIMARY | 1 | userid | A | 604 | NULL | NULL | | BTREE | | |
| users | 1 | userid_2 | 1 | userid | A | 604 | NULL | NULL | | BTREE | | |
| users | 1 | userid_2 | 2 | age | A | 604 | NULL | NULL | YES | BTREE | | |
| users | 1 | userid_2 | 3 | image | A | 604 | 255 | NULL | YES | BTREE | | |
| users | 1 | userid_2 | 4 | gender | A | 604 | NULL | NULL | YES | BTREE | | |
| users | 1 | userid_2 | 5 | last_login | A | 604 | NULL | NULL | YES | BTREE | | |
| users | 1 | userid_2 | 6 | latitude | A | 604 | NULL | NULL | YES | BTREE | | |
| users | 1 | userid_2 | 7 | longitude | A | 604 | NULL | NULL | YES | BTREE | | |
+-------+------------+----------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
In a table with the following fields.
+--------------------------------+---------------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------------+---------------------+------+-----+-------------------+----------------+
| userid | int(11) | NO | PRI | NULL | auto_increment |
| age | int(11) | YES | | NULL | |
| image | varchar(500) | YES | | | |
| gender | varchar(10) | YES | | NULL | |
| last_login | timestamp | YES | MUL | NULL | |
| latitude | varchar(20) | YES | MUL | NULL | |
| longitude | varchar(20) | YES | | NULL | |
+--------------------------------+---------------------+------+-----+-------------------+----------------+
Running an explain statement and forcing it to use userid_2, it uses 522 rows
describe SELECT userid, age FROM users USE INDEX(userid_2) WHERE `userid` >=100 and age >27 limit 10 ;
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | users | index | userid_2 | userid_2 | 941 | NULL | 522 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------+---------+------+------+--------------------------+
1 row in set (0.02 sec)
If I don't force it to use the index, it just uses the primary key, which consists only of userid, and examines only 261 rows:
mysql> describe SELECT userid, age FROM users WHERE userid >=100 and age >27 limit 10 ;
+----+-------------+-------+-------+--------------------------------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+--------------------------------------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | users | range | PRIMARY,users_user_ids_key,userid,userid_2 | PRIMARY | 4 | NULL | 261 | Using where |
+----+-------------+-------+-------+--------------------------------------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
Questions
Why is it examining more rows when it uses the compound composite index?
Why isn't the query using the userid_2 index if it's not specified in the query?
That row count is only an estimate based on indexed value distribution.
You have two options:
Execute ANALYZE TABLE mytable to recalculate the distributions and then retry the DESCRIBE.
Don't worry about stuff that doesn't matter: rows is just an estimate anyway.
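The first option could be tried as follows. This is a sketch that substitutes the users table from the question for mytable. The row estimates may or may not change after the statistics are refreshed.

```sql
-- Recalculate the index statistics for the table.
ANALYZE TABLE users;

-- Re-run both plans and compare the "rows" estimates.
EXPLAIN SELECT userid, age
FROM users
WHERE userid >= 100 AND age > 27
LIMIT 10;

EXPLAIN SELECT userid, age
FROM users USE INDEX (userid_2)
WHERE userid >= 100 AND age > 27
LIMIT 10;
```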