How to build and optimize MySQL DB for multi-dimensional search? - mysql

I want to create a table with columns like: A1, A2, A3, L1, L2, L3, L4
The main job for this database is:
The user provides some float numbers a, b, c, d; find the row with the minimum Euclidean distance, that is, the row minimizing (a-L1)^2 + (b-L2)^2 + (c-L3)^2 + (d-L4)^2.
Also, sometimes the user may provide range constraints for A1, A2, A3,
e.g., A1 > 0.15, 2 < A2 < 3.5, A3 <= 1.2,
and the search over L1-L4 should then respect these constraints.
I have read some related topics and, as a test, inserted all the data into MySQL using the MyISAM engine, running a query like:
select * from table1
order by (a-L1)*(a-L1) + (b-L2)*(b-L2) + (c-L3)*(c-L3) + (d-L4)*(d-L4)
limit 1
But I want to make this as fast as possible. I've noticed there are several optimization options, but it is still not clear how to apply them, or which of them suit my case:
there are column indexes, but given my problem, how should I build them?
there are also "SPATIAL indexes"; can I benefit from those, and how would I use them?
which search command should I use? Should I stick with the ORDER BY query above?
anything else that would improve the speed?
All the work is done in C/C++. I'm currently using the MySQL C API and the mysql_query() function; is this the best way?

Your result will be based on a specific formula. Since you're using MySQL 5 (I assume), you could try creating a stored procedure; once compiled, you can call it whenever you want, and I'd guess performance will be better than the equivalent plain SELECT.
You can pass the range constraints as input parameters to that stored procedure.
You can use indexes if the result set is filtered on some key, but I don't really see that you have a key here.
The MySQL and C combination is new to me, so I don't know how you will be consuming the results (my limited knowledge) :-(
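For what it's worth, such a procedure might look roughly like this (a sketch only: the procedure and parameter names are invented, while the table and columns follow the question):

DELIMITER //
CREATE PROCEDURE nearest_row(
    IN p_a1_min FLOAT,                      -- constraint: A1 > p_a1_min
    IN p_a2_min FLOAT, IN p_a2_max FLOAT,   -- constraint: p_a2_min < A2 < p_a2_max
    IN p_a3_max FLOAT,                      -- constraint: A3 <= p_a3_max
    IN p_a FLOAT, IN p_b FLOAT, IN p_c FLOAT, IN p_d FLOAT)  -- target point
BEGIN
    SELECT *
    FROM table1
    WHERE A1 > p_a1_min
      AND A2 > p_a2_min AND A2 < p_a2_max
      AND A3 <= p_a3_max
    ORDER BY (p_a - L1) * (p_a - L1) + (p_b - L2) * (p_b - L2)
           + (p_c - L3) * (p_c - L3) + (p_d - L4) * (p_d - L4)
    LIMIT 1;
END //
DELIMITER ;

-- called from the client, e.g. through mysql_query():
CALL nearest_row(0.15, 2, 3.5, 1.2, 0.5, 2.7, 0.9, 1.1);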

In case you're still at the stage of experimenting with MySQL, you might also want to look into using Postgres.
It has a bunch of geometry types, for one.
And Postgres 9.1 in particular (in beta) implements out-of-the-box k-nearest-neighbour searches using the gist index (see E.1.3.5.5. Indexes). If the latter implementation doesn't fit your exact requirements, you'll also find it interesting that gist indexes are extensible.
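For a taste of what that looks like, here is a minimal 2-D sketch (the table and column names are hypothetical; the question's 4-D case would need an extension type rather than the built-in point):

CREATE TABLE samples (id serial PRIMARY KEY, pt point);
CREATE INDEX samples_pt_gist ON samples USING gist (pt);

-- <-> is the distance operator; with the gist index in place, Postgres 9.1
-- can answer this ORDER BY ... LIMIT 1 with an index scan instead of
-- computing and sorting the distance for every row.
SELECT id FROM samples ORDER BY pt <-> point '(0.3, 1.7)' LIMIT 1;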


Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I was at an interesting job interview recently, where I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) containing only those scalars. My answer was accepted, and the reviewer noted that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
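For concreteness, the rewrite looks roughly like this (a sketch; the table and column names are made up):

-- load the scalars into a (possibly temporary) table...
CREATE TEMPORARY TABLE tmp_ids (id INT PRIMARY KEY);
INSERT INTO tmp_ids (id) VALUES (1), (2), (3);  -- ...and thousands more

-- ...then join, instead of WHERE t.id IN (1, 2, 3, ...)
SELECT t.*
FROM big_table t
INNER JOIN tmp_ids USING (id);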
But when I walked out, I started to have some doubts. The condition seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So, I started digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database, and my tests supported that position, but I didn't control for test purity, and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think that should match the join's performance. I didn't run any tests, since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed occur in MySQL in some cases. Other than that, I didn't find any other problem related to bad IN() treatment. Unfortunately, I didn't find any proof of the opposite either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or are they?
Of course, in cases where a list of scalars is longer than allowed database protocol packet, INNER JOIN might be the only solution available.
I think that in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, databases may be unable to prepare an IN(?) query, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, the MySQL documentation is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else
returns 0. If all values are constants, they are evaluated according
to the type of expr and sorted. The search for the item then is done
using a binary search. This means IN is very quick if the IN value
list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic, and in such cases we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In each case this results in dynamically building the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as there are distinct counts of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT 1 FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
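A sketch of that EXISTS form, with hypothetical table names:

-- the subquery only has to find one match per outer row
SELECT x.*
FROM mytable x
WHERE EXISTS (SELECT 1 FROM loaded_values v WHERE v.id = x.id);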

Can Postgres use a function in a partial index where clause?

I have a large Postgres table where I want a partial index on one of the two indexed columns. Can I use a Postgres function in the WHERE clause of a partial index, and if so, how do I get the SELECT query to utilize that partial index?
Example Scenario
The first column is "magazine", the second column is "volume", and the third column is "issue". Different magazines can have the same "volume" and "issue" numbers, but I want the index to contain only the two most recent volumes of each magazine. This is because one magazine could be older than the others and have higher volume numbers than younger magazines.
Two IMMUTABLE STRICT functions were created to determine the current and previous year's volumes for a magazine: f_current_volume('gq') and f_previous_volume('gq'). Note: the current/past volume number only changes once per year.
I tried creating a partial index with these functions; however, EXPLAIN on a query only shows a seq scan for a current-volume magazine.
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));
-- Both these do seq scans.
select * from issue where magazine = 'gq' and volume = 100;
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
What am I doing wrong, and how do I get this to work? And if it is possible, why does it need to be done that way for Postgres to use the index?
-- UPDATE: 2013-06-17, the following surprisingly used the index.
-- Why would using a field name rather than a value allow the index to be used?
select * from issue where magazine = 'gq' and volume = f_current_volume(magazine);
Immutability and 'current'
If your f_current_volume function ever changes its behaviour, as is implied by its name and by the presence of an f_previous_volume function, then the database is free to return completely bogus results.
PostgreSQL would've refused to let you create the index had the functions not been marked IMMUTABLE, complaining that you can only use IMMUTABLE functions. The thing is, marking a function IMMUTABLE means that you are telling PostgreSQL something about the function's behaviour, as per the documentation. You're saying "I promise this function's results won't change; feel free to make assumptions on that basis."
One of the biggest assumptions made is when building an index. If the function returns different outputs for the same inputs across invocations, things go splat. Or possibly boom, if you're unlucky. In theory you can kind-of get away with changing an immutable function by REINDEXing everything, but the only really safe way is to DROP every index that uses it, DROP the function, re-create the function with its new definition, and re-create the indexes.
That can actually be really useful to do if you have something that changes only infrequently; in effect you have two different immutable functions at different points in time that just happen to have the same name.
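Spelled out with the question's names (a sketch; the functions are assumed to take the magazine name as text):

BEGIN;
DROP INDEX ix_issue_magazine_volume;
DROP FUNCTION f_current_volume(text);
CREATE FUNCTION f_current_volume(magazine text) RETURNS integer AS $$
    SELECT 104;  -- new definition goes here; it must still be truly immutable
$$ LANGUAGE sql IMMUTABLE STRICT;
CREATE INDEX ix_issue_magazine_volume ON issue (magazine, volume)
    WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));
COMMIT;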
Partial index matching
PostgreSQL's partial index matching is pretty dumb, but, as I found when writing test cases for this, a lot smarter than it used to be. It ignores a dummy OR true. It uses an index on WHERE (a%100=0 OR a%1000=0) for a WHERE a = 100 query. It even got it with a non-inline-able identity function:
regress=> CREATE TABLE partial AS SELECT x AS a, x AS b FROM generate_series(1,10000) x;
regress=> CREATE OR REPLACE FUNCTION identity(integer)
RETURNS integer AS $$
SELECT $1;
$$ LANGUAGE sql IMMUTABLE STRICT;
regress=> CREATE INDEX partial_b_fn_idx
ON partial(b) WHERE (identity(b) % 1000 = 0);
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
---------------------------------------------------------------------------------------
Index Only Scan using partial_b_fn_idx on partial (cost=0.00..13.05 rows=50 width=4)
(1 row)
However, it was unable to prove the IN clause match, e.g.:
regress=> DROP INDEX partial_b_fn_idx;
regress=> CREATE INDEX partial_b_fn_in_idx ON partial(b)
WHERE (b IN (identity(b), 1));
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on partial (cost=10000000000.00..10000000195.00 rows=50 width=4)
So my advice? Rewrite IN as an OR list:
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE (volume = f_current_volume(magazine) OR volume = f_previous_volume(magazine));
... and on a current version it might just work, so long as you keep the immutability rules outlined above in mind. Well, the second version:
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
might. Update: No, it won't; for it to be used, Pg would have to recognise that magazine = 'gq' and realise that f_current_volume('gq') was therefore equivalent to f_current_volume(magazine). It doesn't attempt to prove equivalences at that level with partial index matching, so as you've noted in your update you have to write f_current_volume(magazine) directly. I should've spotted that. In theory PostgreSQL could use the index with the second query if the planner were smart enough, but I'm not sure how you'd go about efficiently looking for places where a substitution like this would be worthwhile.
The first example, volume = 100, will never use the index, since at query-planning time PostgreSQL has no idea that f_current_volume('gq') will evaluate to 100. You could add an OR clause, OR volume = 100, to your partial index WHERE clause, and PostgreSQL would figure it out then, though.
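A sketch of that, reusing the question's names:

CREATE INDEX ix_issue_magazine_volume ON issue (magazine, volume)
WHERE volume = f_current_volume(magazine)
   OR volume = f_previous_volume(magazine)
   OR volume = 100;

-- "volume = 100" now provably implies the index predicate, so this query
-- can use the partial index:
SELECT * FROM issue WHERE magazine = 'gq' AND volume = 100;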
First off, I'd like to volunteer a wild guess, because you make it sound like your f_current_volume() function calculates something based on a separate table.
If so, be wary: this means your function is volatile, in that it needs to be recalculated on every call (a concurrent transaction might be inserting, updating or deleting rows). Postgres won't allow you to index those, and I presume you worked around this by declaring the function immutable. Not only is this incorrect, but you also run into the issue of the index containing garbage, because the function gets evaluated as you edit the row, rather than at query time. What you'd probably want instead, again if my guess is correct, is to store and maintain the totals in the table itself using triggers.
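A rough sketch of that design, assuming a hypothetical magazine table with a current_volume column that queries (or a partial index) could then use directly:

CREATE OR REPLACE FUNCTION trg_track_current_volume() RETURNS trigger AS $$
BEGIN
    -- bump the magazine's current volume whenever a newer issue arrives
    UPDATE magazine
       SET current_volume = NEW.volume
     WHERE name = NEW.magazine
       AND current_volume < NEW.volume;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issue_track_current_volume
AFTER INSERT ON issue
FOR EACH ROW EXECUTE PROCEDURE trg_track_current_volume();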
Regarding your specific question: partial indexes need their WHERE condition to be met in the query to prompt Postgres to use them. I'm quite sure Postgres is smart enough to identify that e.g. 10 is between 5 and 15 and use a partial index with that clause. But considering the above-mentioned caveat, I'm very suspicious that it would know that f_current_volume('gq') is 100 in your case.
You could try this query and see if the index gets used:
select *
from issue
where magazine = 'gq'
and volume in (f_current_volume('gq'), f_previous_volume('gq'));
(Though again, if your function is in fact volatile, you'll get a seq scan as well.)

Efficiency of finding nearest X locations to decimal degree coordinates using MySQL

This sounds like a problem many others have posted about, but there's a nuance to it that I can't figure out: I don't want to limit my bounds when asking for the nearest X data points, and the query needs to be fast.
Right now I'm using SQL as follows:
SELECT * FROM myTable WHERE col1 = 'String' AND col2 = 1
ORDER BY (latCol - <suppliedLat>) * (latCol - <suppliedLat>)
       + (longCol - <suppliedLong>) * (longCol - <suppliedLong>)
LIMIT X; -- X is usually lower than 100
With lat and long stored as doubles, and the table containing about a million rows, this query takes ~6 seconds on our server; not fast enough. EXPLAIN SELECT shows me that it's using no indexes (as expected: there's only one index and it's not location-related), performing a filesort, and hitting all ~1 million rows.
Removing the two WHERE clauses doesn't improve performance at all, and the one index we applied to col1, col2 and a third column actually DECREASED the performance of this query, despite greatly increasing the speed of others.
My reading up on how to solve this leads me to believe that spatial indexing is the way to go, but we never intend to use any of the more advanced spatial features like polygons and circular bounds; we just need the speed. Is there an easy way to apply a spatial (or other sort of) index to an already-existing decimal-degrees table to improve the speed of the above query? Is there a better way of writing the query to make it more efficient?
The big killer is that most things I read about implementing a spatial index in MySQL seem to require changes to how you INSERT the data, and modifying our INSERT statements to use geo/spatial data types would greatly lengthen our dev cycle.
The idea is to use a quadkey, which can look like this: 12131212. Each character in the key represents a leaf node of a quadtree, one level deeper per character. If you want to find nearby locations you can simply match a key prefix in the WHERE clause, using MySQL's SUBSTRING (whose index starts at 1): WHERE SUBSTRING(Field, 1, 4) = '1213'. For the above data this returns the location 12131212 and any other location starting with 1213. Of course you can replace the characters 1, 2, 3, 4 with other, more meaningful characters. You can download my php class hilbert-curve at phpclasses.org.
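A minimal sketch of the quadkey idea with a hypothetical table; note that putting SUBSTRING() on the left side of the comparison defeats an ordinary index, whereas the equivalent LIKE 'prefix%' lets MySQL use a BTREE range scan:

CREATE TABLE places (
    id      INT PRIMARY KEY,
    quadkey VARCHAR(16) NOT NULL,  -- quadtree path, e.g. '12131212'
    KEY ix_places_quadkey (quadkey)
);

-- all locations in the quadtree cell '1213' and its children
SELECT id FROM places WHERE quadkey LIKE '1213%';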

Mysql create SQL subquery ALIAS

Basically, I have multiple queries like this:
SELECT a, b, c FROM (LONG QUERY) X WHERE ...
The thing is, I am using this LONG QUERY really frequently. I am looking to give this subquery an alias which would primarily:
Shorten & simplify queries (also reduce errors and code duplication)
Possibly optimize performance. (I believe this is done by default by mysql query caching)
Till now, I have been doing it this way, keeping the long query in a client-side variable:
variable = LONG QUERY;
Query("SELECT a, b, c FROM ("+variable+") X WHERE ...");
Which is not bad, but I am looking for a way to do this inside MySQL itself.
Is it possible to create a simple, read-only view that adds NO overhead, so that everywhere I could just write the query below? I believe this is a more proper and readable way of doing it.
SELECT a, b, c FROM myTable WHERE ...
Typically these are called views. For example:
CREATE VIEW vMyLongQuery
AS
SELECT a, b, c FROM (LONG QUERY) X WHERE ...
Which can then be referenced like this:
SELECT a, b, c FROM vMyLongQuery
See http://dev.mysql.com/doc/refman/5.0/en/create-view.html for more info on syntax.
As far as performance goes, best case it will be almost exactly the same as what you're doing now, and worst case it will kill your app. It depends on what you do with the views and how MySQL processes them.
MySQL implements views in two ways: merge and temptable. With the merge algorithm, it is pretty much exactly what you're doing now: your view is merged into your query as a subquery. With a temptable, MySQL will actually spool all the data into a temporary table and then select from/join to that temptable. You also lose the benefit of indexes when other tables are joined to the temptable.
As a heads up, the merge query plan doesn't support any of the following in your view.
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Reference to literals without an underlying table
So if your subquery uses these, you're likely going to hurt performance.
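For illustration, you can also name the algorithm explicitly, which documents your intent (a sketch with a hypothetical table); MySQL silently falls back to TEMPTABLE when the definition disqualifies MERGE:

CREATE ALGORITHM = MERGE VIEW v_active_users AS
SELECT id, name, last_login
FROM users
WHERE active = 1;

-- the view is inlined, so this can still use an index on last_login
SELECT name FROM v_active_users WHERE last_login >= '2013-01-01';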
Also, heed OMG Ponies' advice carefully: a view is NOT the same as a base class. Views have their place in a DB but can easily be misused. When an engineer comes to the database from an OO background, views seem like a convenient way to promote inheritance and reusability of code. Often such people eventually find themselves with nested views joined to nested views of nested views. SQL processes nested views by essentially taking the definition of each individual view and expanding it into a beast of a query that will make your DBA cry.
Also, you followed excellent practice in your example, and I encourage you to continue it: you specified all your columns individually. Never ever use SELECT * to specify the results of your view; it will, eventually, ruin your day.
I'm not sure if this is what you are looking for, but you can use stored procedures to save and re-run MySQL queries. I'm not sure whether you can use one inside another query, though.
http://www.mysqltutorial.org/getting-started-with-mysql-stored-procedures.aspx

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used, but whatever I pass to EXPLAIN, it optimizes the WHERE clause away entirely ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell mysql to not do constant optimizations in explain?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that results from the constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called), this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit 2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN for the full matrix of exist/non-exist cases. That isn't really a very good solution, though, as it would be both tedious and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example say my client side code is generating a query with a number in it's where clause.
Some times the number will result in an impossible where clause other times it won't.
How can I use explain to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read about when the MySQL optimizer does what:
Action                                       When
-------------------------------------------  -------------
Query parse                                  PREPARE
Negation elimination                         PREPARE
Subquery re-writes                           PREPARE
Nested JOIN simplification                   First EXECUTE
OUTER->INNER JOIN conversions                First EXECUTE
Partition pruning                            Every EXECUTE
COUNT/MIN/MAX elimination                    Every EXECUTE
Constant subexpression removal               Every EXECUTE
Equality propagation                         Every EXECUTE
Constant table detection                     Every EXECUTE
ref access analysis                          Every EXECUTE
range/index_merge analysis and optimization  Every EXECUTE
Join optimization                            Every EXECUTE
There is one more thing missing from this list.
MySQL can rebuild a query plan on every JOIN iteration: the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck the option of an index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
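If MySQL picks the per-record strategy, EXPLAIN reports it in the Extra column. Roughly what you would look for, against the hypothetical tables above:

EXPLAIN
SELECT *
FROM table1 t1
JOIN table2 t2
  ON t2.col1 <= t1.value1
 AND t2.col1 >= t1.value1
 AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound;
-- ...
-- Extra: Range checked for each record (index map: 0x1)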
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD.
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only NESTED LOOPS. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only to the query but where I, the programmer, don't known in advance what value will be used?
By using indexes on the specific columns (or even on combination of columns if you always query the given columns together). If you have indexes, the query planner will potentially use them.
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (for example, asking for a numeric column to be equal to a text value)
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since the conditions might have changed since the plan was generated and a better plan might exist.