I am aware that in MySQL indices on (A,B,C) benefit ANDed WHERE clauses with |A|, |A,B|, |A,B,C|. This makes it seem that having the index (A,B,C) means that there is no point in having a single index on (A) or a composite on (A,B).
1. Is this true?
2. Is it just a waste maintaining an index on (A) when you already have an index on (A,B,C)?
I believe the answer to both your questions is the same: it's almost entirely true; it's almost always wasteful to have indexes on both (A, B, C) and (A).
As Danblack mentioned, the size could make a minor difference, although that's probably negligible.
More importantly, in my experience, note that (A) is actually (A, Primary) (this is how InnoDB stores secondary indexes), where Primary is those primary key columns that are not already explicitly included in the index. In practice, that often means (A, Id). The other index, then, is actually (A, B, C, Id). Note how this affects the order in which rows are encountered in the index.
Imagine doing this:
SELECT *
FROM MyTable
WHERE A = 'Whatever'
ORDER BY Id
Index (A), AKA (A, Id), is perfect for this. For any fixed value of A, corresponding rows are then ordered by Id. No sorting is needed - the results are in our desired order.
However, for index (A, B, C), AKA (A, B, C, Id), it's different. For any fixed value of A, corresponding rows are then ordered by B! This means that the above query will require sorting of the results.
EXPLAIN should confirm what I have described. A filesort will take place if only the (A, B, C) index is available, but not if (A) is available.
It should be easy to see that this matters very little if there are generally very few rows for a particular value of A. However, if there could be 100,000 rows for such a value, then the filesort starts to be impactful. In such a case, you might choose to have index (A) to optimize for this scenario.
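The ordering argument above can be pictured with a toy model. This is only a sketch in Python, not MySQL itself: it assumes an index is simply its entries kept in sorted order, with Id appended as the last key part, and the sample rows are made up for illustration.

```python
# Toy model: an index is just its entries kept in sorted order.
# Assumption: every secondary index entry ends with the primary key (Id).
rows = [
    # (Id, A, B, C) -- hypothetical sample data
    (1, "Whatever", 9, 0),
    (2, "Other", 1, 1),
    (3, "Whatever", 2, 5),
    (4, "Whatever", 7, 3),
    (5, "Whatever", 2, 1),
]

# Index (A) is effectively (A, Id).
index_a = sorted((a, id_) for id_, a, b, c in rows)
# Index (A, B, C) is effectively (A, B, C, Id).
index_abc = sorted((a, b, c, id_) for id_, a, b, c in rows)


def ids_for(index, a_value):
    """Ids in the order a scan over the index meets them for A = a_value."""
    return [entry[-1] for entry in index if entry[0] == a_value]


ids_via_a = ids_for(index_a, "Whatever")
ids_via_abc = ids_for(index_abc, "Whatever")

print(ids_via_a)    # [1, 3, 4, 5]: already in Id order, no sort needed
print(ids_via_abc)  # [5, 3, 4, 1]: in (B, C, Id) order, ORDER BY Id needs a sort
```

The second list is what forces the extra sort step: the entries for a fixed A come back ordered by B first, so Id order is only recovered by sorting them.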
Generally speaking, such prefix indexes are superfluous. It's good to analyze your indexes and queries to identify these scenarios, though. In a rare case, one may be worth adding. In the more common case, at least you'll be able to weigh such effects into your overall index choices.
1. true
2. almost always
There is a very rare case where, if:
A as a standalone index is used most frequently, and
queries that use A,B or A,B,C are very rare, and
sizeof(A) is significantly less than sizeof(A,B,C), and
you are memory constrained, such that use of index A,B,C normally takes up a significant share of the buffer pool/key cache to the detriment of other queries;
then there may be a small benefit in having a small duplicate index on the subset A.
Note: other conditions might also be required.
My professor told us to think about this question (not homework, don't worry), and I'm stumped. I know it would tell you if columns a, b and c exist in t, but if there were more columns, wouldn't it just be left out of the output?
If you could watch the system as it executed the query, and you knew the structure of the table, and there were a lot of columns, you could get a rough estimate of the length of each row by how fast the query results were produced.
You would only know that t contained a, b, and c.
The query provides no other information.
Unfortunately, the execution plan is not going to mention the columns that are not in the query. The execution plan might even be using an index on (a, b, c), so performance would not be a guide.
Let's say I have a table with columns A, B, C, D, E, F, G, H, I in that order, and I need to select only columns A, C, F, I (it could be the case that the table has many more columns and I have to retrieve many more columns too).
My question is, would it make a difference (performance-wise) if I keep the columns to be retrieved in the projection in ascending column index order (e.g. A, C, F, I) rather than retrieving them in a completely random order (e.g. F, A, I, C)? And why?
I understand that sequential access is faster than random access, however none of the cases in my example is sequential so I'm not sure what the performance difference of these two projection orders would be.
Thank you.
Short answer: NO.
Long answer: it depends.
In the general case, this question is impossible to answer without knowing which product you use.
Ordering of output columns should not matter.
In most row-based relational databases (including Microsoft SQL Server, PostgreSQL and Oracle), the ordering of output columns will make no visible difference. This is because row data is read into memory block-wise (in 8kB or 32kB chunks, for example). Once the data is in memory, processing is quite cheap.
The number of output columns can make a difference, especially in databases built on columnar (column-based) storage. It can also matter with row-based storage (simply because of in-memory processing cost and data transfer cost).
Please specify if you have a particular database engine in mind.
The order in which you write columns in a SELECT statement does not matter: SELECT A, B, C and SELECT B, A, C cost exactly the same. It is absolutely irrelevant.
The one thing that matters, if you are selecting only 3 columns out of a gigantic table with 100 columns, is whether or not you have a composite non-sparse index on columns A,B,C that the database engine could use to avoid doing a full row read.
If you have an index on the columns A,B,C that you refer to in the SELECT statement, then the DB engine may decide that the best thing to do is to execute an index-only plan, without needing to load all the bytes of a 100-column row.
With that said:
The order in which you declare tables in a FROM clause is not at all irrelevant.
You should normally name your tables in a FROM clause starting with the tables you believe have the more selective predicates for filtering data, in the order in which you would implement the nested loop joins yourself.
I've seen DBs like HSQL whose query optimizer failed to use all the appropriate indexes I had created, depending on the order in which I named the tables in the FROM clause.
That depends on how the DB query optimizer is implemented and how many query execution plans it will explore. Writing tables in the appropriate order in the FROM clause will help you out.
So will knowing how to plan indexes for tuning a query.
Good luck.
My question is, would it make a difference (performance-wise) if I keep the columns to be retrieved in the projection in ascending column index order (e.g. A, C, F, I) rather than retrieving them in a completely random order (e.g. F, A, I, C)? And why?
Possibly, but it would be unlikely to be significant, and it will vary depending on implementation. MySQL and SQL Server could easily have entirely different answers.
My understanding of SQL Server, for example, is that it reads the disk in fixed chunks called pages, which are 8 kilobytes in size. With some exceptions for LOBs, a single row is not allowed to span more than one page, which creates the 8060 byte limit. If your data would exceed that and you're not using LOBs, you'd actually have to create another table. So, no matter what you do, when SQL Server reads a record from a table it's reading the entire page and therefore the entire record.
Now, there's a number of things that can alter what goes on. Indexes that cover all your columns, sparse columns, LOBs, and so on will significantly alter how the data are stored and accessed in your tables. But none of that is going to be affected by how you order things. Part of the query engine's job is to determine the most efficient way to retrieve the data from disk.
Bottom line: I/O is going to be orders of magnitude more costly than the ordering of those columns in memory. Beyond a possible contrived example, I can't think of a reason that this would be a consideration for writing a query.
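To make the page argument concrete, here is a toy row store in Python. It is only a sketch under loud assumptions: ToyTable, ROWS_PER_PAGE and the sample data are all hypothetical, and a real engine packs rows into pages by byte size rather than by row count. The point it illustrates is that the number of page reads is the same whichever order the projected columns are listed in.

```python
# Toy row store: rows live in pages, and a read always fetches a whole page.
COLUMNS = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
ROWS_PER_PAGE = 4  # hypothetical; real engines pack rows by byte size


class ToyTable:
    def __init__(self, rows):
        # Split the rows into fixed-size "pages".
        self.pages = [
            rows[i:i + ROWS_PER_PAGE] for i in range(0, len(rows), ROWS_PER_PAGE)
        ]
        self.page_reads = 0

    def select(self, wanted):
        """Scan every page and project the wanted columns in the given order."""
        positions = [COLUMNS.index(name) for name in wanted]
        out = []
        for page in self.pages:
            self.page_reads += 1  # the I/O cost: whole page, all columns
            for row in page:
                # Cheap in-memory projection, in whatever order was asked for.
                out.append(tuple(row[p] for p in positions))
        return out


rows = [tuple(f"{col}{n}" for col in COLUMNS) for n in range(10)]
table = ToyTable(rows)

ordered = table.select(["A", "C", "F", "I"])
reads_ordered = table.page_reads
shuffled = table.select(["F", "A", "I", "C"])
reads_shuffled = table.page_reads - reads_ordered

print(reads_ordered, reads_shuffled)  # same number of page reads for both orders
```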
I want to create a table with columns like: A1, A2, A3, L1, L2, L3, L4
The main job for this database is:
The user provides some float numbers a, b, c, d, and I need to find the row with the minimum Euclidean distance, that is, the minimum of (a-L1)^2+(b-L2)^2+(c-L3)^2+(d-L4)^2
Also, sometimes the user may provide range information for A1, A2, A3,
e.g., A1 > 0.15, 2 < A2 < 3.5, A3 <= 1.2
and then based on these constraints, do the search for L1-L4.
I have read some topics related to this and done a test, inserting all the data into MySQL using the MyISAM engine and running a command like:
select * from table1
order by (x-L1)*(x-L1)+ (y-L2)*(y-L2)+ (z-L3)*(z-L3)
limit 1
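For reference, the full search described above (range constraints on A1 to A3, then the minimum squared distance over all four of L1 to L4) can be sketched as a single query. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not the MySQL C API, the sample rows are made up, and the helper name nearest is hypothetical. The range values are the ones from the example constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.execute(
    "CREATE TABLE table1 (A1 REAL, A2 REAL, A3 REAL, "
    "L1 REAL, L2 REAL, L3 REAL, L4 REAL)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0.20, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0),  # passes the ranges, but far away
        (0.10, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0),  # closest, but fails A1 > 0.15
        (0.50, 2.5, 0.5, 0.1, 0.2, 0.0, 0.1),  # passes the ranges, close
        (0.90, 4.0, 1.1, 0.0, 0.0, 0.0, 0.1),  # close, but fails A2 < 3.5
    ],
)


def nearest(conn, a, b, c, d):
    """Return the row inside the A1..A3 ranges closest to (a, b, c, d)."""
    sql = """
        SELECT A1, A2, A3, L1, L2, L3, L4
        FROM table1
        WHERE A1 > 0.15 AND A2 > 2 AND A2 < 3.5 AND A3 <= 1.2
        ORDER BY (:a - L1) * (:a - L1) + (:b - L2) * (:b - L2)
               + (:c - L3) * (:c - L3) + (:d - L4) * (:d - L4)
        LIMIT 1
    """
    return conn.execute(sql, {"a": a, "b": b, "c": c, "d": d}).fetchone()


best = nearest(conn, 0.0, 0.0, 0.0, 0.0)
print(best)
```

Written this way, the WHERE clause is the only part an ordinary column index could help with; the ORDER BY expression still has to be evaluated for every row that survives the filter.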
But I want to make this as fast as possible. I noticed that there are some optimization options, but it is still not clear to me how to apply them or which of them suit my case:
There are column indexes, but given my problem, how should I build the index?
There are also "SPATIAL indexes"; can I benefit from these, and how do I use them?
Which search command should I use? Should I stick with the "order by" one that I'm using?
Is there anything else that would improve the speed?
All the work is done in C/C++. I'm now using the MySQL C API and the mysql_query() function; is this the best way?
Your result will be based on a specific formula.
Assuming you are using MySQL 5, you could try creating a stored procedure; once it has been compiled,
you can call it whenever you want, and I would guess performance will be better than with a normal SELECT query.
You can pass the range as input parameters to that stored procedure.
You can use indexes if you think the result set is based on some key,
but I don't really see that you have any key!
This is the first time I have heard of the MySQL and C combination, so I don't know how you will be viewing the results (my limited knowledge). :-(
In case you're still at the stage of experimenting with MySQL, you might also want to look into using Postgres.
It has a bunch of geometry types, for one.
And Postgres 9.1 in particular (in beta) implements out of the box k-nearest searches using the gist index (see E.1.3.5.5. Indexes). If the latter implementation doesn't fit your exact requirements, you'll also find it interesting that gist indexes are extensible.
When I manually create tables in MySQL, I add indexes one at a time for each field that I think I will use for queries.
When I use phpMyAdmin to create tables for me, and I select my indexes in the create-table form, I see that phpMyAdmin combines my indexes into 1 (plus my primary).
What's the difference? Is one better than the other? In which case?
Thanks!
Neither is a particularly good strategy, but if I had to choose I'd pick the multiple single indexes.
The reason is that an index can only be used if you use all the fields in any complete prefix of the index. If you have an index (a, b, c, d, e, f) then this works fine for a query that filters on a or a query that filters on both a and b, but it will be useless for a query filtering only on c.
There's no easy rule that always works for choosing the best indexes. You need to look at the types of queries you are making and choose the indexes that would speed up those particular queries. If you think carefully about the order of the columns you can find a small number of indexes that will be useful for multiple different queries. For example, if in one query you filter on both a and b, and in another query you filter on only b, then an index on (b, a) will be usable by both queries but an index on (a, b) will not.
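The prefix rule above can be written down as a few lines of Python. This is a simplified sketch, not how any real optimizer decides: the helper name usable_prefix is hypothetical, and it only models filters on whole columns, ignoring range conditions, sorting and cost estimates.

```python
def usable_prefix(index_columns, filtered_columns):
    """Return the leading index columns that the query's filters can use.

    Walks the index left to right and stops at the first column the query
    does not filter on; an empty result means the index cannot be used.
    """
    used = []
    for column in index_columns:
        if column not in filtered_columns:
            break
        used.append(column)
    return used


queries = {"filter on a and b": {"a", "b"}, "filter on b only": {"b"}}
for index in (("b", "a"), ("a", "b")):
    for name, columns in queries.items():
        print(index, name, usable_prefix(index, columns))
```

Run against the example, (b, a) yields a non-empty prefix for both queries, while (a, b) yields nothing for the query that filters only on b.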
This actually depends on your queries. Some queries make better use of multicolumn indexes, some not.
EXPLAIN is your friend.
http://dev.mysql.com/doc/refman/5.6/en/explain.html
Also a very good resource is here:
http://dev.mysql.com/doc/refman/5.6/en/optimization-indexes.html