I've looked around a bit and found quite a few people seeking to order a table of points by distance to a set point, but I'm curious how one would go about efficiently joining two tables on the minimum distance between two points. In my case, consider the tables nodes and centroids.
CREATE TABLE nodes (
node_id VARCHAR(255),
pt POINT
);
CREATE TABLE centroids (
centroid_id MEDIUMINT UNSIGNED,
temperature FLOAT,
pt POINT
);
I have approximately 300k nodes and 15k centroids, and I want to get the closest centroid to each node so I can assign each node a temperature. So far I have created spatial indexes on pt on both tables and tried running the following query:
SELECT
nodes.node_id,
MIN(ST_DISTANCE(nodes.pt, centroids.pt))
FROM nodes
INNER JOIN centroids
ON ST_DISTANCE(nodes.pt, centroids.pt) <= 4810
GROUP BY
nodes.node_id
LIMIT 10;
Clearly, this query is not going to solve my problem; it does not retrieve temperature, assumes that the closest centroid is within 4810, and only evaluates 10 nodes. However, even with these simplifications, this query is very poorly optimized, and is still running as I type this. When I have MySQL give details about the query, it says no indexes are being used and none of the spatial indexes are listed as possible keys.
How could I build a query that can actually return the data I want joined efficiently utilizing spatial indexes?
I think a good approach would be partitioning (numerically, not db partitioning) the data into cells. I don't know how well spatial indexes apply here, but the high-level logic is to bin each node and centroid point into square regions and find matches between all the node-centroid pairs in the same square, then make sure that there isn't a closer match in an 8-adjacent square (e.g. using the same nodes in the original square). The closest matches can then be used to compute and save the temperature. All subsequent queries should ignore nodes with the temperature set.
There will still be nodes whose centroids aren't within the same or 8-adjacent squares; you would then expand the search, perhaps using squares with double the width and height. I can see this working with plain indexes on just the x and y coordinates of the points. I don't know how spatial indexes can further improve this.
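The binning idea above can be sketched outside the database. This is only a minimal Python illustration, assuming planar coordinates and a hypothetical cell size; the function names build_grid and nearest_centroid are made up for the example. It checks the node's own cell first, then ever wider rings of cells, and stops once no wider ring can hold a closer centroid.

```python
import math
from collections import defaultdict

def build_grid(centroids, cell):
    """Bin (centroid_id, x, y, temperature) tuples into square cells of side `cell`."""
    grid = defaultdict(list)
    for cid, x, y, temp in centroids:
        grid[(math.floor(x / cell), math.floor(y / cell))].append((cid, x, y, temp))
    return grid

def nearest_centroid(grid, cell, x, y, max_ring=100):
    """Return (centroid_id, temperature, distance) of the closest centroid, or None."""
    cx, cy = math.floor(x / cell), math.floor(y / cell)
    best = None
    for ring in range(max_ring + 1):
        # Every point in ring `ring` is more than (ring - 1) * cell away, so once the
        # best match is at least that close, wider rings cannot improve on it.
        if best is not None and best[2] <= (ring - 1) * cell:
            break
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue  # visit only the outer border of the block of cells
                for cid, px, py, temp in grid.get((cx + dx, cy + dy), ()):
                    d = math.hypot(px - x, py - y)
                    if best is None or d < best[2]:
                        best = (cid, temp, d)
    return best
```

Note that ring 1 (the 8-adjacent squares) is always searched unless an exact hit was found, which mirrors the "make sure there isn't a closer match next door" step.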
There are many ways to solve this least-n-per-group problem.
One method uses an anti-join written as a self left join (this allows ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
cross join centroids c
left join centroids c1
on c1.centroid_id <> c.centroid_id
and st_distance(n.pt, c1.pt) < st_distance(n.pt, c.pt)
where c1.centroid_id is null
The same logic can be expressed with a not exists condition.
Another option is to use a correlated subquery for filtering (this does not allow ties):
select
n.node_id,
c.centroid_id,
st_distance(n.pt, c.pt) dist,
c.temperature
from nodes n
inner join centroids c
on c.centroid_id = (
select c1.centroid_id
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
)
Finally: if all you want is the temperature of the closest centroid, then a simple subquery should be a good choice:
select
n.node_id,
(
select c1.temperature
from centroids c1
order by st_distance(n.pt, c1.pt)
limit 1
) temperature
from nodes n
I am using a MySQL DB, and have the following table:
CREATE TABLE SomeTable (
PrimaryKeyCol BIGINT(20) NOT NULL,
A BIGINT(20) NOT NULL,
FirstX INT(11) NOT NULL,
LastX INT(11) NOT NULL,
P INT(11) NOT NULL,
Y INT(11) NOT NULL,
Z INT(11) NOT NULL,
B BIGINT(20) DEFAULT NULL,
PRIMARY KEY (PrimaryKeyCol),
UNIQUE KEY FirstLastXPriority_Index (FirstX,LastX,P)
) ENGINE=InnoDB;
The table contains 4.3 million rows, and never changes once initialized.
The important columns of this table are FirstX, LastX, Y, Z and P.
As you can see, I have a unique index on the columns FirstX, LastX and P.
The columns FirstX and LastX define a range of integers.
The query I need to run on this table fetches for a given X all the rows having FirstX <= X <= LastX (i.e. all the rows whose range contains the input number X).
For example, if the table contains the rows (I'm including only the relevant columns):
FirstX LastX P Y Z
100000 500000 1 111 222
150000 220000 2 333 444
180000 190000 3 555 666
550000 660000 4 777 888
700000 900000 5 999 111
750000 850000 6 222 333
and I need, for example, the rows whose range contains the value 185000, then the first 3 rows should be returned.
The query I tried, which should be using the index, is:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
Even without the LIMIT, this query should return a small number of records (less than 50) for any given X.
This query was executed by a Java application for 120000 values of X. To my surprise, it took over 10 hours (!) and the average time per query was 0.3 seconds.
This is not acceptable, not even near acceptable. It should be much faster.
I examined a single query that took 0.563 seconds to make sure the index was being used. The query I tried (the same as the query above with a specific integer value instead of ?) returned 2 rows.
I used EXPLAIN to find out what was happening:
id 1
select_type SIMPLE
table SomeTable
type range
possible_keys FirstLastXPriority_Index
key FirstLastXPriority_Index
key_len 4
ref NULL
rows 2104820
Extra Using index condition
As you can see, the execution involved 2104820 rows (nearly 50% of the rows of the table), even though only 2 rows satisfy the conditions, so half of the index is examined in order to return just 2 rows.
Is there something wrong with the query or the index? Can you suggest an improvement to the query or the index?
EDIT:
Some answers suggested that I run the query in batches for multiple values of X. I can't do that, since I run this query in real time, as inputs arrive to my application. Each time an input X arrives, I must execute the query for X and perform some processing on the output of the query.
I found a solution that relies on properties of the data in the table. I would rather have a more general solution that doesn't depend on the current data, but for the time being that's the best I have.
The problem with the original query:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
is that the execution may require scanning a large percentage of the entries in the FirstX,LastX,P index when the first condition FirstX <= ? is satisfied by a large percentage of the rows.
What I did to reduce the execution time is observe that LastX-FirstX is relatively small.
I ran the query:
SELECT MAX(LastX-FirstX) FROM SomeTable;
and got 4200000.
This means that FirstX >= LastX - 4200000 for all the rows in the table.
So in order to satisfy LastX >= ?, we must also satisfy FirstX >= ? - 4200000.
So we can add a condition to the query as follows:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - 4200000 AND LastX >= ? LIMIT 10;
In the example I tested in the question, the number of index entries processed was reduced from 2104820 to 18 and the running time was reduced from 0.563 seconds to 0.0003 seconds.
I tested the new query with the same 120000 values of X. The output was identical to the old query. The time went down from over 10 hours to 5.5 minutes, which is over 100 times faster.
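To illustrate why the extra lower bound helps, here is a rough Python sketch that mimics an index ordered by FirstX with a sorted list and the bisect module. It is not what MySQL does internally; the name containing_rows is made up, and the sample rows and the maximum length of 400000 come from the small example table in the question rather than from the real data.

```python
from bisect import bisect_left, bisect_right

def containing_rows(rows, first_keys, max_len, x):
    """rows: (FirstX, LastX, P) tuples sorted by FirstX; first_keys: their FirstX values.

    Returns the matching rows and the number of entries that had to be examined.
    """
    lo = bisect_left(first_keys, x - max_len)  # first entry with FirstX >= x - max_len
    hi = bisect_right(first_keys, x)           # one past the last entry with FirstX <= x
    scanned = rows[lo:hi]
    return [r for r in scanned if r[1] >= x], len(scanned)

rows = sorted([
    (100000, 500000, 1), (150000, 220000, 2), (180000, 190000, 3),
    (550000, 660000, 4), (700000, 900000, 5), (750000, 850000, 6),
])
first_keys = [r[0] for r in rows]
max_len = max(last - first for first, last, _ in rows)  # 400000 for this sample

matches, examined = containing_rows(rows, first_keys, max_len, 800000)
```

For X = 800000 only the entries with FirstX between 400000 and 800000 are examined, instead of every entry with FirstX <= 800000.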
WHERE col1 < ... AND ... < col2 is virtually impossible to optimize.
Any useful query will involve a "range" on either col1 or col2. Two ranges (on two different columns) cannot be used in a single INDEX.
Therefore, any index you try has the risk of checking a lot of the table:
INDEX(col1, ...) will scan from the start to where col1 hits .... Similarly for col2 and scanning until the end.
To add to your woes, the ranges are overlapping. So, you can't pull a fast one and add ORDER BY ... LIMIT 1 to stop quickly. And if you say LIMIT 10, but there are only 9, it won't stop until the start/end of the table.
One simple thing you can do (but it won't speed things up by much) is to swap the PRIMARY KEY and the UNIQUE. This could help because InnoDB "clusters" the PK with the data.
If the ranges did not overlap, I would point you at http://mysql.rjweb.org/doc.php/ipranges .
So, what can be done?? How "even" and "small" are the ranges? If they are reasonably 'nice', then the following would take some code, but should be a lot faster. (In your example, 100000 500000 is pretty ugly, as you will see in a minute.)
Define buckets to be, say, floor(number/100). Then build a table that correlates buckets and ranges. Samples:
FirstX LastX Bucket
123411 123488 1234
222222 222444 2222
222222 222444 2223
222222 222444 2224
222411 222477 2224
Notice how some ranges 'belong' to multiple buckets.
Then, the search is first on the bucket(s) in the query, then on the details. Looking for X=222433 would find two rows with bucket=2224, then decide that both are OK. But for X=222466, two rows have the bucket, but only one matches with firstX and lastX.
WHERE bucket = FLOOR(X/100)
AND firstX <= X
AND X <= lastX
with
INDEX(bucket, firstX)
But... with 100000 500000, there would be 4001 rows because this range is in that many 'buckets'.
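The bucket table and lookup can be sketched in a few lines of Python. This is only an illustration of the idea, with a dict standing in for the INDEX(bucket, firstX) table and made-up function names; the sample ranges are the ones listed above.

```python
from collections import defaultdict

BUCKET_SIZE = 100  # bucket = floor(number / 100), as in the text

def build_bucket_table(ranges):
    """Map each bucket number to every (FirstX, LastX) range that touches it."""
    table = defaultdict(list)
    for first, last in ranges:
        for bucket in range(first // BUCKET_SIZE, last // BUCKET_SIZE + 1):
            table[bucket].append((first, last))
    return table

def lookup(table, x):
    """Narrow down by bucket first, then apply the exact range check."""
    return [(f, l) for f, l in table.get(x // BUCKET_SIZE, []) if f <= x <= l]

table = build_bucket_table([(123411, 123488), (222222, 222444), (222411, 222477)])
both = lookup(table, 222433)  # two rows share bucket 2224 and both contain X
one = lookup(table, 222466)   # two rows share the bucket, only one contains X
```

A wide range such as 100000 to 500000 lands in 4001 buckets under this scheme, which is the blow-up mentioned above.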
Plan B (to tackle the wide ranges)
Segregate the ranges into wide and narrow. Do the wide ranges by a simple table scan, do the narrow ranges via my bucket method. UNION ALL the results together. Hopefully the "wide" table would be much smaller than the "narrow" table.
You need to add another index on LastX.
The unique index FirstLastXPriority_Index (FirstX,LastX,P) represents the concatenation of these values, so it will be useless with the 'AND LastX >= ?' part of your WHERE clause.
It seems that the only way to make the query fast is to reduce the number of fetched and compared fields. Here is the idea.
We can declare a new indexed field (for instance BIGINT UNSIGNED) and store both values FirstX and LastX in it using an offset for one of the fields.
For example:
FirstX LastX CombinedX
100000 500000 100000500000
150000 220000 150000220000
180000 190000 180000190000
550000 660000 550000660000
70000 90000 070000090000
75 85 000075000085
An alternative is to declare the field as DECIMAL and store FirstX + LastX / MAX(LastX) in it.
Later look for the values satisfying the conditions comparing the values with a single field CombinedX.
APPENDED
And then you can fetch the rows checking only one field, for example with param1 = 160000:
SELECT * FROM new_table
WHERE
(CombinedX < (160000 + 1) * 1000000) AND
(CombinedX % 1000000 >= 160000);
The upper bound uses 160000 + 1 so that rows whose FirstX is exactly 160000 are not cut off. Here I assume that FirstX < LastX for all rows. Of course, you can calculate param1*offset in advance and store it in a variable against which the further comparisons will be done. You can also consider bitwise shifts instead of decimal offsets. Decimal offsets were chosen for the sample because they are easier for a human to read.
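A small Python sketch of the packing and the two comparisons, assuming a decimal offset of 1000000 and that every LastX is smaller than that offset; the helper names are made up. The upper bound is written as (x + 1) * OFFSET so that a row whose FirstX equals x still passes the first test.

```python
OFFSET = 1_000_000  # assumption: every LastX is smaller than this

def combine(first_x, last_x):
    """Pack FirstX into the high decimal digits and LastX into the low ones."""
    return first_x * OFFSET + last_x

def matches(combined_x, x):
    # FirstX <= x  is equivalent to  combined_x < (x + 1) * OFFSET
    # LastX >= x   is equivalent to  combined_x % OFFSET >= x
    return combined_x < (x + 1) * OFFSET and combined_x % OFFSET >= x

sample = [combine(100000, 500000), combine(150000, 220000), combine(180000, 190000)]
hits = [c for c in sample if matches(c, 160000)]
```

Only the first comparison is a plain range on the stored value; the modulo test still has to be evaluated for each candidate row.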
Eran, I believe the solution you found yourself is the best in terms of minimum cost. It is normal to take the distribution properties of the data into account when optimizing. Moreover, in large systems it is usually impossible to achieve satisfactory performance if the nature of the data is not taken into account.
However, this solution also has drawbacks, and the need to change the configuration parameter with every data change is the least of them. More important may be the following. Suppose that one day a very large range appears in the table, for example one whose length covers half of all possible values. I do not know the nature of your data, so I cannot know whether such a range can ever appear; this is just an assumption. As far as the result is concerned, that's fine: it just means that about every second query will now return one more record. But even a single such interval will completely kill your optimization, because the condition FirstX <= ? AND FirstX >= ? - [MAX(LastX-FirstX)] will no longer cut off enough records.
Therefore, if you have no assurance that overly long ranges will never appear, I would suggest keeping the same idea but approaching it from the other side.
I propose that, when loading new data into the table, you break all long ranges into smaller ones whose length does not exceed a certain value. You wrote that the important columns of this table are FirstX, LastX, Y, Z and P. So you can choose some number N once, and every time you load data into the table, replace any range with LastX-FirstX > N with several rows:
FirstX; FirstX + N
FirstX + N; FirstX + 2N
...
FirstX + kN; LastX
and for each row, keep the same values of Y, Z and P.
For the data prepared that way, your query will always be the same:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND FirstX >= ? - N AND LastX >= ?
and will always be equally effective.
Now, how to choose the best value for N? I would run some experiments with different values and see which works better. It is possible for the optimum to be less than the current maximum interval length of 4200000. That may be surprising at first, because decreasing N makes the table grow, so it can become much larger than 4.3 million rows. But in fact, the size of the table is not a problem when your query uses the index well enough. And in this case, as N decreases, the index will be used more and more efficiently.
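The splitting step at load time might look like the following Python sketch. The function name split_range is made up, and the pieces share endpoints exactly as in the scheme listed above, so a query for an X that sits on a piece boundary will match two adjacent pieces and the caller may want to deduplicate on P.

```python
def split_range(first_x, last_x, n):
    """Yield (FirstX, LastX) pieces of length at most n that together cover the range."""
    start = first_x
    while last_x - start > n:
        yield (start, start + n)
        start += n
    yield (start, last_x)

pieces = list(split_range(0, 10, 4))
```

Each piece would be stored as its own row with the Y, Z and P of the original range.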
Indexes will not help you in this scenario, except for a small percentage of all possible values of X.
Let's say for example that:
FirstX contains values from 1 to 1000 evenly distributed
LastX contains values from 1 to 1042 evenly distributed
And you have following indexes:
FirstX, LastX, <covering columns>
LastX, FirstX, <covering columns>
Now:
If X is 50, the clause FirstX <= 50 matches approximately 5% of the rows while LastX >= 50 matches approximately 95% of the rows. MySQL will use the first index.
If X is 990, the clause FirstX <= 990 matches approximately 99% of the rows while LastX >= 990 matches approximately 5% of the rows. MySQL will use the second index.
Any X between these two will cause MySQL to not use either index (I don't know the exact threshold, but 5% worked in my tests). Even if MySQL uses the index, there are just too many matches, and the index will most likely be used for covering instead of seeking.
Your solution is the best. What you are doing is defining upper and lower bound of "range" search:
WHERE FirstX <= 500 -- 500 is the middle (worst case) value
AND FirstX >= 500 - 42 -- range matches approximately 4.3% rows
AND ...
In theory, this should work even if you search FirstX for values in the middle. Having said that, you got lucky with the 4200000 value, possibly because the maximum difference between first and last is a small percentage of the overall range.
If it helps, you can do the following after loading the data:
ALTER TABLE testdata ADD COLUMN delta INT NOT NULL;
UPDATE testdata SET delta = LastX - FirstX;
ALTER TABLE testdata ADD INDEX delta (delta);
This makes selecting MAX(LastX - FirstX) easier.
I tested MySQL SPATIAL INDEXES which could be used in this scenario. Unfortunately I found that spatial indexes were slower and have many constraints.
Edit: Idea #2
Do you have control over the Java app? Because, honestly, 0.3 seconds for an index scan is not bad. Your problem is that you're trying to get a query, run 120,000 times, to have a reasonable end time.
If you do have control over the Java app, you could either have it submit all the X values at once - and let SQL not have to do an index scan 120k times. Or you could even just program the logic on the Java side, since it would be relatively easy to optimize.
Original Idea:
Have you tried creating a Multiple-Column index?
The problem with having multiple indexes is that each index is only going to narrow it down to ~50% of the records - it has to then match those ~2 million rows of Index A against ~2 million rows of Index B.
Instead, if you get both columns in the same index, the SQL engine can first do a Seek operation to get to the start of the records, and then do a single Index Scan to get the list of records it needs. No matching one index against another.
I'd suggest not making this the Clustered Index, though. The reason for that? You're not expecting many results, so matching the Index Scan's results against the table isn't going to be time consuming. Instead, you want to make the Index as small as possible, so that the Index Scan goes as fast as possible. Clustered Indexes are the table - so a Clustered Index is going to have the same Scan speed as the table itself. Along the same lines, you probably don't want any other fields other than FirstX and LastX in your index - make that Index as tiny as you can, so that the scan flies along.
Finally, like you're doing now, you're going to need to clue the engine in that you're not expecting a large set of data back from the search - you want to make sure it's using that compact Index for its scan (instead of it saying, "Eh, I'd be better off just doing a full table scan.")
One way might be to partition the table by different ranges and then only query the partitions that fit a given range, making the amount of data it needs to check much smaller. This might not work since the Java side may be slower, but it might put less stress on the database.
There might also be a way to not query the database so many times and use a more inclusive SQL statement (you might be able to send a list of values and have the SQL send it to a different table).
Suppose you got the execution time down to 0.1 seconds. Would the resulting 3 hours, twenty minutes be acceptable?
The simple fact is that thousands of calls to the same query is incredibly inefficient. Quite aside from what the database has to endure, there is network traffic to think of, disk seek times and all kinds of processing overhead.
Supposing that you don't already have the 120,000 values for x in a table, that's where I would start. I would insert them into a table in batches of 500 or so at a time:
insert into xvalues (x)
select 14 union all
select 18 union all
select 42 /* and so on */
Then, change your query to join to xvalues.
I reckon that optimisation alone will get your run-time down to minutes or seconds instead of hours (based on many such optimisations I have done through the years).
It also opens up the door for further optimisations. If the x values are likely to have at least some duplicates (say, at least 20% of values occur more than once) it may be worth investigating a solution where you only run the query for unique values and do the insert into SomeTable for every x with the matching value.
As a rule: anything you can do in bulk is likely to exponentially outperform anything you do row by row.
PS:
You referred to a query, but a stored procedure can also work with an input table. In some RDBMSs you can pass a table as parameter. I don't think that works in MySQL, but you can create a temporary table that the calling code fills in and the stored procedure joins to. Or a permanent table used in the same way. The major drawback of not using a temp table, is that you may need to concern yourself with session management or discarding stale data. Only you will know if that is applicable to your case.
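As a rough illustration of the batch approach in this answer, here is a Python sketch that uses an in-memory SQLite database as a stand-in for MySQL (an assumption made only so the example is self-contained). The xvalues table name comes from the answer; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SomeTable (FirstX INT, LastX INT, P INT, Y INT, Z INT);
    CREATE TABLE xvalues (x INT);
""")
conn.executemany(
    "INSERT INTO SomeTable VALUES (?, ?, ?, ?, ?)",
    [
        (100000, 500000, 1, 111, 222), (150000, 220000, 2, 333, 444),
        (180000, 190000, 3, 555, 666), (550000, 660000, 4, 777, 888),
        (700000, 900000, 5, 999, 111), (750000, 850000, 6, 222, 333),
    ],
)
# Load a whole batch of inputs at once instead of issuing one query per value.
conn.executemany("INSERT INTO xvalues (x) VALUES (?)", [(185000,), (800000,), (10,)])

# One join answers every X in the batch.
rows = conn.execute("""
    SELECT v.x, t.P, t.Y, t.Z
    FROM xvalues v
    JOIN SomeTable t ON t.FirstX <= v.x AND t.LastX >= v.x
    ORDER BY v.x, t.P
""").fetchall()
```

The value 10 falls in no range, so it simply produces no rows in the result.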
So, I don't have enough data to be sure of the run time. This will only work if column P is unique? In order to get two indexes working, I created two indexes and the following query...
Index A - FirstX, P, Y, Z
Index B - P, LastX
This is the query
select A.P, A.Y, A.Z
from
(select P, Y, Z from asdf A where A.firstx <= 185000 ) A
join
(select P from asdf A where A.LastX >= 185000 ) B
ON A.P = B.P
For some reason this seemed faster than
select A.P, A.Y, A.Z
from asdf A join asdf B on A.P = B.P
where A.firstx <= 185000 and B.LastX >= 185000
To optimize this query:
SELECT P, Y, Z FROM SomeTable WHERE FirstX <= ? AND LastX >= ? LIMIT 10;
Here are two resources you can use:
descending indexes
spatial indexes
Descending indexes:
One option is to use an index that is descending on FirstX and ascending on LastX.
https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html
something like:
CREATE INDEX SomeIndex on SomeTable (FirstX DESC, LastX);
Conversely, you could create instead the index (LastX, FirstX DESC).
Spatial indexes:
Another option is to use a SPATIAL INDEX with (FirstX, LastX). If you think of FirstX and LastX as 2D spatial coordinates, then your search selects the points in a contiguous area delimited by the lines FirstX <= X and LastX >= X (with FirstX >= 0 and FirstX <= LastX).
Here's a link on spatial indexes (not specific to MySQL, but with drawings):
https://learn.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview
Another approach is to precalculate the solutions, if that number isn't too big.
CREATE TABLE SomeTableLookUp (
X INT NOT NULL,
PrimaryKeyCol BIGINT NOT NULL,
PRIMARY KEY(X, PrimaryKeyCol)
);
And now you just pre-populate your constant table.
INSERT INTO SomeTableLookUp
SELECT X, PrimaryKeyCol
FROM SomeTable
JOIN (
SELECT DISTINCT X FROM SomeTable
) XS
WHERE XS.X BETWEEN FirstX AND LastX
And now you can SELECT your answers directly.
SELECT SomeTable.*
FROM SomeTableLookUp
JOIN SomeTable
ON SomeTableLookUp.PrimaryKeyCol = SomeTable.PrimaryKeyCol
WHERE SomeTableLookUp.X = ?
LIMIT 10
SO,
The problem
I have a very simple - at first glance - problem. Assume that I have a data set with two meaningful columns: from and till. This data set isn't in a DB yet. I need to search through this data set and, for some X, find the rows where the condition from < X < till is true. For example, I have these rows (id is added just to identify rows; it doesn't mean that the rows are in a DB):
id from till
------------
1 100 200
2 120 200
3 1000 1050
4 1100 1500
and I want to find rows for X = 125. That will be rows #1 and #2. Intervals may intersect, but they are always valid (from is always less than till). Also, a strict condition is that all three of from, till and X are unsigned integers. Besides, with high probability, intervals will not be nested too heavily: if there are intersections, it will not be the case that, for example, some interval contains all the others (in practice this means that a given interval is a selective condition that will not match the whole table).
Moving on. My data set could be huge (around 500.000.000 rows), and I need to store it somehow in a DB. There are no restrictions on the DB structure; it can be anything, and I'm free to choose a proper solution (that is why my data set is not in a DB yet). So the problem is: how do I store it in a DB so that querying rows for a given X is as fast as possible?
My approach
At first glance, it's very simple. We just create columns for from and till, fill them with our data set, and there we are. Really? No. Why? Because such a table structure will not allow building any good index for the query. If we create an index on the two columns (from, till), it will make no sense in terms of our problem, and if we create two separate indexes on from and till, they will both have low selectivity. Why? Imagine that we have a row with from = 100.000.000 and till = 100.000.200. Then querying WHERE 100.000.000 < X AND X < 100.000.200 will not use an index, because with split indexes each condition produces a near full scan of its index. And that's the tricky part: obviously, the combined condition specifies a very narrow part of the table (i.e. logically, it is good), but taken as separate conditions it's crap, because each of them is a near full scan.
My next thought was to create some function that takes two arguments and maps them bijectively onto some linear set of numbers. Since my from and till are integers (and, importantly, positive integers), and from < till always, an example of such a function would be from^2 + till^2. So, OK, we'll translate our intervals to some numbers. But, unfortunately, to compare these numbers with X we'd still have to rely on the original from and till, so this idea doesn't seem to apply. But maybe I'm missing something?
The question
Currently, I have no complete, clear idea of how to implement this. So, again, I'm free to choose any architecture, but it should fit the requirement of fast querying for the needed rows by X. And the question is: what table structure (columns, indexes, etc.) could be proposed here? We are also free to store additional tables (although it would be good if their sizes were not too large). Of course, since we're free to define the table structure, we can change the query for X too (i.e. if some structure needs an extra condition in that query, that is OK; the only need is to achieve the final goal).
You want to reduce the impact of running the comparison over every row to find out whether the row's span contains X.
As you have outlined, a common index is not of much use here because of the sheer ratio of numbers to rows.
This is where I would start. Why not reduce the resolution and use that as an index?
Also how large do the spans get? You have so far 100, 80, 50, 400.
Assuming that the size of a span is normally a small fraction of the full value range rather than approaching all of it (e.g. max 1 000 against a range of 500 000 000 values), why not index from at a lower resolution, e.g. divided by 1 000?
That will greatly reduce the index space, to 500 000 entries for such a low-resolution helper column. You can then use standard math in the WHERE part of the query to use that index to find a superset of possibly matching rows. The more expensive comparisons (the exact BETWEEN) can then be deferred to only these candidate rows.
This perhaps is not such an academic solution to the problem but might give you the performance you're looking for.
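Here is a minimal Python sketch of the low-resolution helper column, with a dict standing in for the index. The names RESOLUTION, MAX_SPAN, build_index and find are made up for the illustration, and MAX_SPAN is an assumed upper bound on till - from, which is what makes it safe to look at only a few coarse keys.

```python
from collections import defaultdict

RESOLUTION = 1000  # helper column = from divided by 1 000
MAX_SPAN = 1000    # assumption: no interval is longer than this

def build_index(rows):
    """Group (id, from, till) rows by the coarse value of `from`."""
    index = defaultdict(list)
    for row_id, frm, till in rows:
        index[frm // RESOLUTION].append((row_id, frm, till))
    return index

def find(index, x):
    """Collect candidates from the few coarse keys that can match, then check exactly."""
    lo = max(x - MAX_SPAN, 0) // RESOLUTION
    hi = x // RESOLUTION
    hits = []
    for key in range(lo, hi + 1):
        for row_id, frm, till in index.get(key, []):
            if frm < x < till:  # the exact, more expensive comparison
                hits.append(row_id)
    return sorted(hits)

index = build_index([(1, 100, 200), (2, 120, 200), (3, 1000, 1050), (4, 1100, 1500)])
found = find(index, 125)
```

Any matching row has from greater than X - MAX_SPAN, so its coarse key cannot fall below the lower key that is scanned.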
Edit: As @NikiC kindly pointed out and for the academic solution, there is a paper by Hans-Peter Kriegel, Marco Pötke and Thomas Seidl:
The Relational Interval Tree: Manage Interval Data Efficiently in Your Relational Database (PDF)
One option here is to partition your table. Specifically using range partitioning. This coupled with indexes on your from and till columns should give you an acceptable level of performance.
Here is a basic example:
CREATE TABLE myTable (
`id` INT NOT NULL,
`from` bigint unsigned not null,
`till` bigint unsigned not null,
PRIMARY KEY (`from`,`till`),
INDEX myTableIdx1 (`from`),
INDEX myTableIdx2 (`till`)
)
PARTITION BY RANGE (`from`) (
PARTITION p0 VALUES LESS THAN (200000),
PARTITION p1 VALUES LESS THAN (400000),
PARTITION p2 VALUES LESS THAN (600000),
PARTITION p3 VALUES LESS THAN (800000),
PARTITION p4 VALUES LESS THAN (1000000),
PARTITION p5 VALUES LESS THAN (1200000),
PARTITION p6 VALUES LESS THAN (1400000),
PARTITION p7 VALUES LESS THAN (1600000),
PARTITION p8 VALUES LESS THAN (1800000),
PARTITION p9 VALUES LESS THAN (2000000),
-- etc etc
PARTITION pEnd VALUES LESS THAN MAXVALUE
);
This approach does make the assumption that your version of MySQL supports partitioning and that you can divide your table into meaningful partitions based on the data!
PS You may want to choose a different column name other than from....
Option 1
I think this is what you need.
But a full index scan is still needed for the 125 case; the 2001 case will trigger a better range scan.
SELECT
data.id
, data.`from`
, data.`till`
FROM
data
WHERE
`from` < 125 and 125 < `till`
see demo http://sqlfiddle.com/#!2/208ca/20
Option 2
DERIVED table to filter out the non matches
SET @x = 125;
SELECT
data.id
, data.`from`
, data.`till`
FROM (
SELECT
id
, `till`
FROM
data
WHERE
`from` < @x -- from should always be smaller than @x
) from_filter
INNER JOIN
data
ON
from_filter.id = data.id
AND
@x < from_filter.`till` -- @x should always be smaller than till
;
see demo http://sqlfiddle.com/#!2/208ca/27
Option 3
R tree indexing may be the best option
I'm facing some issues with a table that is growing at an increasing rate (currently 4 million rows, 300k inserts a day). I hope I can get some ideas and advice here to improve my setup and squeeze the last bit out of my box before it takes down my website in the near future.
The setup:
Intel i7 720
8GB RAM
2x750GB SATA RAID 0
CentOS
MySQL 5.5.10
Node.js + node-lib_mysql-client
The table definition:
CREATE TABLE IF NOT EXISTS `canvas` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`x1` int(11) NOT NULL,
`y1` int(11) NOT NULL,
`x2` int(11) NOT NULL,
`y2` int(11) NOT NULL,
`c` int(4) unsigned NOT NULL,
`s` int(3) unsigned NOT NULL,
`m` bigint(20) unsigned NOT NULL,
`r` varchar(32) NOT NULL,
PRIMARY KEY (`id`,`x1`,`y1`) KEY_BLOCK_SIZE=1024,
KEY `x1` (`x1`,`y1`) KEY_BLOCK_SIZE=1024,
KEY `x2` (`x2`,`y2`) KEY_BLOCK_SIZE=1024
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT KEY_BLOCK_SIZE=4
/*!50100 PARTITION BY HASH ( (
(
x1 MOD 10000
)
) + y1 MOD 10000)
PARTITIONS 10 */ AUTO_INCREMENT=13168904 ;
The query:
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE 1 AND ((
x1 >= 0
AND x1 <= 400
AND y1 >= 0
AND y1 <= 400
) OR (
x2 >= 0
AND x2 <= 400
AND y2 >= 0
AND y2 <= 400
) )
ORDER BY id desc
That's the only query I'm executing, except for the fact that the values for x1,y1,x2 and y2 change per query. It's a 2D canvas and each row represents a line on the canvas. Guess it's also important to know that the maximum range selected for 1 field is never bigger than 1200 (pixels).
A few weeks ago I upgraded to MySQL 5.5.10 and started using partitions. The 'x1 % 10000' hash was my first, naive attempt at getting into partitioning. It already gave me a decent boost in SELECT speed, but I'm sure there's still room for optimization.
Oh, and before you ask... I'm aware of the fact that I'm using a MyISAM table. A friend of mine suggested InnoDB, but I tried it already and the result was a table twice as big and a big drop in SELECT performance. I don't need fancy transactions and stuff... all I need is the best possible SELECT performance and decent performance with INSERTs.
What would you change? Could I perhaps tweak my indexes somehow? Does my partition setup make any sense at all? Should I perhaps increase the number of partition files?
All suggestions are welcome... I also discussed a local replication into a memory table with a friend, but I'm sure it's only a matter of time until the table size would exceed my RAM, and a swapping box is a fairly ugly thing to see.
When you think about my issue, please keep in mind that it's growing rapidly and unpredictably. In case it goes viral somewhere for some reason, I expect to see more than 1 million INSERTs a day.
Thank you for reading and thinking about it. :)
EDIT: The requested EXPLAIN result
select_type table type possible_keys key key_len ref rows Extra
SIMPLE canvas index_merge x1,x2 x1,x2 8,8 NULL 133532 Using sort_union(x1,x2); Using where; Using fileso...
EDIT2: The requested my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
innodb_buffer_pool_size = 1G
sort_buffer_size = 4M
read_buffer_size = 1M
read_rnd_buffer_size = 16M
innodb_file_format = Barracuda
query_cache_type = 1
query_cache_size = 100M
# http://dev.mysql.com/doc/refman/5.5/en/performance-schema.html
;performance_schema
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
The InnoDB values are from my InnoDB attempt... I guess they are not necessary anymore. The server runs 4 other websites as well, but they are rather small and not really worth mentioning. I'm going to move this project to a dedicated box soon anyway. Your ideas can be radical - I don't mind experiments.
EDIT3 - BENCHMARKS WITH INDEXES
OK guys... I've run some benchmarks with different indexes and the results are pretty good so far. For this benchmark I was selecting all rows within a box of 2000x2000 pixels.
SELECT SQL_NO_CACHE x1,y1,x2,y2,s,c FROM canvas_test WHERE 1 AND (( x1 BETWEEN -6728 AND -4328 AND y1 BETWEEN -6040 AND -4440 ) OR ( x2 BETWEEN -6728 AND -4328 AND y2 BETWEEN -6040 AND -4440 ) ) ORDER BY id asc
Using the table/index definition I posted above, the average query time was: 1740ms
Then I dropped all indexes, except for the primary key -> 1900ms
added one index for x1 -> 1800ms
added one index for y1 -> 1700ms
added one index for x2 -> 1500ms
added one index for y2 -> 900ms!
That's quite astonishing so far... for some reason I was thinking making combined indexes for x1/y1 and x2/y2 would make sense somehow, but actually it looks like I was wrong.
EXPLAIN now returns this:
id: 1
select_type: SIMPLE
table: canvas_test
type: index_merge
possible_keys: x1,y1,x2,y2
key: y1,y2
key_len: 4,4
ref: NULL
rows: 263998
Extra: Using sort_union(y1,y2); Using where; Using fileso..
Now I'm wondering why it's using y1/y2 as keys and not all four?
However, I'm still looking for more ideas and advice, especially regarding partitions and proper hashing.
First, I'd modify the SELECT as
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE
x1 BETWEEN 0 AND 400 AND y1 BETWEEN 0 AND 400 OR
x2 BETWEEN 0 AND 400 AND y2 BETWEEN 0 AND 400
ORDER BY id desc
And also be sure to have an index on that expression:
CREATE INDEX canvas400 ON canvas(
x1 BETWEEN 0 AND 400 AND y1 BETWEEN 0 AND 400 OR
x2 BETWEEN 0 AND 400 AND y2 BETWEEN 0 AND 400
)
How much memory is your server currently utilizing?
Is this the only database/table on the server?
Are you using MyISAM exclusively?
MyISAM is okay to use, so long as you're not updating your rows. When you update a row on a MyISAM table, MySQL locks the entire table, blocking any SELECTs and INSERTs from executing until the UPDATE is complete. UPDATE has precedence over SELECT, so if you have a lot of UPDATEs running, your SELECTs will wait until they're all complete before they return any rows.
If that is okay with you, then move to your server configuration. What does your my.cnf file look like? You'll want to optimize this file to maximize the amount of memory you can use for indexes. If these SELECTs are slowing down, it's because your table indexes are not fitting in memory. If MySQL can't fit your table indexes into memory, then it has to go to disk and do a table scan to fetch your data. This will kill performance.
EDIT 5/18/2011 9:30PM EST
After looking at your my.cnf, I notice you have zero MyISAM optimizations in place. Your starting place is going to be the key_buffer_size variable. This variable is, as a rule of thumb, set somewhere between 25% and 50% of the total available memory on your system. Your system has 8GB memory available, so somewhere around 3GB is a minimum starting point, I'd say. However, you can estimate how much you will need and optimize it as needed if you know you have control over the other variables on the system.
What you should do is cd to your mysql data dir (typically /var/lib/mysql) which is where all your data files are located. A quick way to tell how much index data you have is to do
sudo du -hc `find . -type f -name "*.MYI"`
This command will look at the size of all your MyISAM Index files and tell you their total size. If you have enough memory, you want to make your key_buffer_size in your my.cnf BIGGER than the total size of all your MYI files. This will ensure that your MyISAM indexes are in memory, so MySQL won't have to hit the disk for the index data.
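If you'd rather do that comparison in a script, here is a minimal sketch of the same idea in Python. The function names and the idea of passing the key buffer size in bytes are my own choices for illustration, not anything MySQL provides; it just walks the data directory and adds up the *.MYI files the way the du/find command does.

```python
import os


def total_myi_bytes(data_dir):
    """Sum the sizes (in bytes) of all MyISAM index files (*.MYI) under data_dir."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            # Same case-sensitive match as: find . -type f -name "*.MYI"
            if name.endswith(".MYI"):
                total += os.path.getsize(os.path.join(root, name))
    return total


def fits_in_key_buffer(data_dir, key_buffer_bytes):
    """True if all .MYI files together are smaller than the given key buffer size."""
    return total_myi_bytes(data_dir) < key_buffer_bytes
```

You would call it as something like fits_in_key_buffer("/var/lib/mysql", 3 * 1024**3) to check the index files against a 3GB key_buffer_size. Reading that directory usually needs the same privileges as the sudo command above.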
A quick note, don't go increasing your key_buffer_size willy nilly. This is just one area of MySQL that needs memory, there are other moving parts that you need to balance memory usage with. MySQL connections take up memory, and different table engines use different memory pools for their indexes, and MySQL uses other bits of memory for different things. If you run out of memory because you set the key_buffer_size too large, your server could start paging (using virtual memory, which will KILL performance even MORE) or worse, crash. Start with smaller values if you're unsure, check your memory usage, and increase it until you're satisfied with the performance, and your server isn't crashing.
Remember that MySQL will normally use only one index per table per query; the exception is the index_merge optimization, which your EXPLAIN output shows combining two single-column indexes with sort_union. Even so, you might find that it's more efficient to UNION two SELECT queries together so that each one can use the appropriate index, e.g.:
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE
x1 >= 0
AND x1 <= 400
AND y1 >= 0
AND y1 <= 400
UNION
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE
x2 >= 0
AND x2 <= 400
AND y2 >= 0
AND y2 <= 400
;
or you could use BETWEEN like one of the other replies suggested, eg:
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE x1 BETWEEN 0 AND 400 AND y1 BETWEEN 0 AND 400
UNION
SELECT x1,y1,x2,y2,s,c,r,m FROM canvas
WHERE x2 BETWEEN 0 AND 400 AND y2 BETWEEN 0 AND 400
;
It's a while since I've used a UNION so I'm not sure where you'd put your ORDER BY clause but you can experiment with that.
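On the ORDER BY question: in standard SQL a single ORDER BY after the last SELECT sorts the combined UNION result, and the sort column has to be in the select list. Below is a small sketch that shows this using Python's built-in sqlite3 module rather than MySQL, so treat it as an illustration of the query shape only. The cut-down canvas table, the sample rows and the rows_in_box helper are made up for the example, and id is added to the select list so it can be sorted on. On MySQL I'd check the UNION documentation, which suggests parenthesizing the individual SELECTs when sorting the whole result.

```python
import sqlite3


def rows_in_box(conn, lo, hi):
    """Rows with either endpoint inside the square [lo, hi] x [lo, hi], newest id first."""
    sql = """
        SELECT id, x1, y1, x2, y2 FROM canvas
        WHERE x1 BETWEEN :lo AND :hi AND y1 BETWEEN :lo AND :hi
        UNION
        SELECT id, x1, y1, x2, y2 FROM canvas
        WHERE x2 BETWEEN :lo AND :hi AND y2 BETWEEN :lo AND :hi
        ORDER BY id DESC  -- one ORDER BY at the end sorts the whole UNION
    """
    return conn.execute(sql, {"lo": lo, "hi": hi}).fetchall()


# Hypothetical, cut-down version of the canvas table with a few sample lines.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE canvas (id INTEGER PRIMARY KEY, x1 INT, y1 INT, x2 INT, y2 INT)"
)
conn.executemany(
    "INSERT INTO canvas VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 10, 500, 500),    # first endpoint inside the box
        (2, 900, 900, 20, 30),    # second endpoint inside the box
        (3, 50, 60, 70, 80),      # both endpoints inside: UNION returns it once
        (4, 900, 900, 950, 950),  # neither endpoint inside
    ],
)

for row in rows_in_box(conn, 0, 400):
    print(row)
```

Note that UNION (as opposed to UNION ALL) removes the duplicate you would otherwise get for a line with both endpoints inside the box.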
As one of the other replies mentioned, use EXPLAIN to see how many rows MySQL will have to consider in order to satisfy the queries.
It might also be worth looking at an RTREE index, though I've not played with those myself.
What kind of speeds are you getting? Since you don't need any relational stuff, you should consider moving your data to Redis; it should easily do 100k+ inserts or reads per second on your machine.
I have a large dataset (around 1.9 million rows) of 3D points that I'm selecting from. The statement I use most often is similar to:
SELECT * FROM points
WHERE x > 100 AND x < 200
AND y > 100 AND y < 200
AND z > 100 AND z < 200
AND otherParameter > 10
I have indexes on x, y, and z as well as on otherParameter. I've also tried adding a multi-part index on x,y,z but that hasn't helped.
Any advice on how to make this SELECT query quicker?
B-Tree indexes won't help much for such a query.
What you need is an R-Tree index and a minimal bounding parallelepiped query over it.
Unfortunately, MySQL does not support R-Tree indexes over 3d points, only 2d. However, you may create an index over, say, X and Y together, which will be more selective than any of the B-Tree indexes on X and Y alone:
ALTER TABLE points ADD xy POINT;
UPDATE points
SET xy = Point(x, y);
ALTER TABLE points MODIFY xy POINT NOT NULL;
CREATE SPATIAL INDEX sx_points_xy ON points (xy);
SELECT *
FROM points
WHERE MBRContains(LineString(Point(100, 100), Point(200, 200)), xy)
AND z BETWEEN 100 AND 200
AND otherParameter > 10;
This is only possible if your table is MyISAM.
I don't have MySQL to test, but I'm curious how efficient its INTERSECT is:
select points.*
from points
join
(
select id from points where x > 100 AND x < 200
intersect
select id from points where y > 100 AND y < 200
intersect
select id from points where z > 100 AND z < 200
) as keyset
on points.id = keyset.id
Not necessarily recommending this -- but it's something to try, especially if you have separate indexes on x, y, and z.
EDIT: Since MySQL doesn't support INTERSECT, the query above could be rewritten using JOINs of inline views. Each view would contain a keyset, and each view would have the advantage of the separate indexes you have placed on x, y, and z. The performance would depend on the number of keys returned and on the intersect/join algorithm.
I first tested the intersect approach (in SQLite) to see if there were ways to improve performance in spatial queries short of using their R-Tree module. INTERSECT was actually slower than using a single non-composite index on one of the spatial values and then scanning the subset of the base table to get the other spatial values. But the results can vary depending on the size of the database. After the table has reached gargantuan size and disk I/O becomes more important as a performance factor, it may be more efficient to intersect discrete keysets, each of which has been instantiated from an index, than to do a scan of the base table subsequent to an initial fetch-from-index.
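For what it's worth, here is a rough sketch of what that JOIN rewrite could look like. It runs against SQLite through Python's sqlite3 module, not MySQL, and the table contents, the aliases kx/ky/kz and the points_in_cube helper are invented for the example. The otherParameter filter from the question is left out to keep it short.

```python
import sqlite3

# Each inline view is one keyset; joining them on id keeps only the ids
# present in all three, which is what the INTERSECT version computes.
JOIN_SQL = """
    SELECT p.*
    FROM points AS p
    JOIN (SELECT id FROM points WHERE x > 100 AND x < 200) AS kx ON kx.id = p.id
    JOIN (SELECT id FROM points WHERE y > 100 AND y < 200) AS ky ON ky.id = p.id
    JOIN (SELECT id FROM points WHERE z > 100 AND z < 200) AS kz ON kz.id = p.id
    ORDER BY p.id
"""


def points_in_cube(conn):
    """Return the rows whose x, y and z all lie strictly between 100 and 200."""
    return conn.execute(JOIN_SQL).fetchall()


# Hypothetical sample data standing in for the 1.9 million row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, x REAL, y REAL, z REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?, ?)",
    [
        (1, 150, 150, 150),  # inside on all three axes
        (2, 150, 150, 250),  # z out of range
        (3, 50, 150, 150),   # x out of range
        (4, 101, 199, 120),  # inside, close to the edges
        (5, 100, 150, 150),  # x on the boundary, excluded by the strict >
    ],
)

for row in points_in_cube(conn):
    print(row)
```

Whether the optimizer actually uses the separate x, y and z indexes for each inline view is something you'd need to confirm with EXPLAIN on your own data.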