Given a dataset D_{26×n} with columns named a through z (the number of columns is just an example) and n observations. Each column x has r_x unique states. Rows in D are sorted with descending priority on columns a through z.
Task: For columns (b, j, p) return indexes of rows such that indexes of identical rows are consecutive. Ordering among rows with different set of values for (b, j, p) is immaterial.
Can there be a solution with a complexity of O(n)?
Sol1: Columns (b, j, p) can be sorted and the respective indexes returned. But the complexity of this solution is O(no_columns * n log(n)).
Sol2: Iterate over each row and hash it. But hashing would be more expensive in practice.
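As a rough illustration of Sol2, here is a minimal Python sketch. It assumes the rows are available as dictionaries and that hashing a key tuple takes constant time for a fixed number of key columns; the function name `group_row_indexes` and the sample rows are made up for the example. Under those assumptions it runs in expected linear time in n, although the constant factor of hashing may still be noticeable in practice.

```python
from collections import defaultdict

def group_row_indexes(rows, key_columns):
    """Return row indexes ordered so that rows with identical key values are adjacent."""
    groups = defaultdict(list)  # key tuple -> list of row indexes
    for index, row in enumerate(rows):
        key = tuple(row[c] for c in key_columns)
        groups[key].append(index)
    # dicts keep insertion order, so groups come out in first-seen order
    return [i for indexes in groups.values() for i in indexes]

# Hypothetical sample data, only the key columns are shown
rows = [
    {"b": 1, "j": "x", "p": 0},
    {"b": 2, "j": "y", "p": 1},
    {"b": 1, "j": "x", "p": 0},
    {"b": 2, "j": "y", "p": 1},
    {"b": 1, "j": "z", "p": 0},
]
order = group_row_indexes(rows, ("b", "j", "p"))
print(order)  # [0, 2, 1, 3, 4]
```

The order of the groups is arbitrary (here: first seen), which matches the requirement that ordering among different key values is immaterial.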
Can there be a solution with a complexity of O(n)?
Seems unlikely. If you had such a solution, you would be able to sort data with arbitrary key length in O(n).
I have a single table where I need to query based on 4 columns a,b,c,d
The most common query will be a select based on all 4 columns at the same time, however I need to be able to search quickly for each of the columns taken separately, or also combinations of them (e.g. a&b, a&d, b&c&d and so on).
Should I create an index for every combination? Or is it better to have only one index on a&b&c&d plus one each for a, b, c, and d? In the latter case, will a query that matches only a&b, for example, be sped up because both a and b have an index?
If you want to satisfy all the combinations with an index, you need the following:
(a, b, c, d)
(a, b, d)
(a, c, d)
(a, d)
(b, c, d)
(b, d)
(c, d)
d
You don't need other combinations because any prefix of an index is also an index. The first index will be used for queries that test just a, a&b, a&b&c, so you don't need indexes for those combinations.
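The prefix argument can be checked mechanically. The following Python sketch is only a model of the rule, not of MySQL itself: it treats an index as a string of column names, assumes equality conditions only (so the order of columns inside a prefix does not matter), and the helper names are invented for the example.

```python
from itertools import combinations

def covered_subsets(indexes):
    """Column sets that appear as a leftmost prefix of at least one index."""
    covered = set()
    for index in indexes:
        for length in range(1, len(index) + 1):
            covered.add(frozenset(index[:length]))
    return covered

def missing_subsets(columns, indexes):
    """Non-empty column subsets that are not a leftmost prefix of any index."""
    covered = covered_subsets(indexes)
    return [
        combo
        for size in range(1, len(columns) + 1)
        for combo in combinations(columns, size)
        if frozenset(combo) not in covered
    ]

# The eight indexes listed above
indexes = ["abcd", "abd", "acd", "ad", "bcd", "bd", "cd", "d"]
print(missing_subsets("abcd", indexes))  # []

# The alternative from the question: one 4-column index plus single-column indexes
alternative = ["abcd", "a", "b", "c", "d"]
print(missing_subsets("abcd", alternative))
```

The second call lists the combinations that would have no fully matching index; MySQL could still use a single-column index to narrow those queries down, which is the trade-off discussed next.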
Whether you really need all these indexes depends on how much data you have. It's possible that just having indexes on each column will narrow down the search sufficiently that you don't need indexes on the combinations. The only real way to tell is by benchmarking the performance of your applications. The indexes take up disk space and memory, so trying to create all possible indexes can cause problems of its own; you need to determine if the need is strong enough.
One thing to note is that a "range" is only useful as the last item in an index:
WHERE x=2 AND y>5 -- INDEX(x,y) is useful; INDEX(y,x) only uses `y`
WHERE x=2 AND y BETWEEN 11 AND 22 -- ditto
WHERE x=2 AND s LIKE 'foo%' -- ditto
Another thing: "flags" (true/false, etc) are useless to index by themselves. They can be somewhat useful in combination:
WHERE published=1 AND ...
Also, order matters in the INDEX, but not in the WHERE: Suppose you have INDEX(a,b):
WHERE a=1 AND b=2 -- good index
WHERE b=2 AND a=1 -- equally good
WHERE a=1 -- the index is good
WHERE b=2 -- the index is useless
If some column is always a range (such as a date), it gets messier. For optimal indexing two indexes are needed here:
WHERE d BETWEEN ... -- needs INDEX(d)
WHERE a=1 AND d BETWEEN ... -- needs INDEX(a,d)
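The range and ordering rules above can be summarized in a small Python model. It is a deliberate simplification (equality columns extend the usable prefix, the first range column is used and then ends it), and `usable_prefix` is a name made up for this sketch, not anything MySQL exposes.

```python
def usable_prefix(index, equality, ranges=()):
    """Leading index columns that a simple WHERE clause can use.

    Simplified model: equality columns extend the prefix, the first range
    column is still used but ends it, and any other column ends it at once.
    """
    used = []
    for column in index:
        if column in equality:
            used.append(column)
        elif column in ranges:
            used.append(column)
            break
        else:
            break
    return used

# WHERE x=2 AND y>5
print(usable_prefix(("x", "y"), equality={"x"}, ranges={"y"}))  # ['x', 'y']
print(usable_prefix(("y", "x"), equality={"x"}, ranges={"y"}))  # ['y']
# WHERE b=2 with INDEX(a,b)
print(usable_prefix(("a", "b"), equality={"b"}))                # []
# WHERE d BETWEEN ... with INDEX(a,d)
print(usable_prefix(("a", "d"), equality=set(), ranges={"d"}))  # []
```

The last call shows why a query on d alone still needs its own INDEX(d) even when INDEX(a,d) exists.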
So, I might do these:
Make all 2-column combinations of a,b,c,d -- This would be 6 combinations if nothing is involved in "ranges". I would be sure to vary which col starts the indexes: ab, bc, cd, da, ac, db
Turn on the slowlog to see what is not being well indexed.
Log the actual combinations that people use. Some combinations will be very rarely used. Get rid of the indexes that are useless.
More on understanding index creation.
I have these queries:
1st query:
SELECT (..) FROM db WHERE A = const AND B > const AND C >= const ORDER BY B DESC LIMIT const
2nd query (different db):
SELECT (...) FROM db' WHERE A' = const ORDER BY X' DESC LIMIT const
Question about 1st query:
Is it sufficient to have a multi-column index (A, B, C), or do I need an additional single-column index (B) (or a different one) because of the ORDER BY clause?
Question about 2nd query: Do I need a multi-column index (A', X') or two single-column indexes (A'), (X') to make use of them in this query?
It is important to know that MySQL will generally use at most one index (for searching, filtering and ordering) per table and subquery (so basically per row in EXPLAIN), index merge optimizations aside, so you can use only one index here.
For your first query, an index (A,B) will allow MySQL to do a range scan and use the order. If you use (A,B,C), the column C cannot be used in the range condition (because B is already a range), but MySQL will save the time needed to read the actual table data to get the value of C for checking the last condition. So (A,B,C) is in general the fastest choice here.
"In general", because you can of course have a data distribution where another index would be best: If you e.g. have only one or two rows that match C >= const and 10M+ rows with A = const, using an index on just C would be fastest. And if C is a very big column (e.g. varchar(700)), it could blow up the index and slow it down. But to estimate such exceptions would require deeper knowledge of your data.
For your second query, (A', X') will be the best choice. If you have the two indexes (A'), (X'), MySQL will in most cases (unless A' is unique, but then you wouldn't need an ORDER BY anyway) use the index on X' and hope it will find matching rows for A' soon. This will sometimes be unexpectedly and painfully slow if only a few rows match A' = const, because it has to jump back and forth in the table (which is ordered by the primary key) in the order of X' to find rows that match the condition for A'.
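To make the difference concrete, here is a toy Python simulation. It is not how MySQL or InnoDB work internally; sorted lists merely stand in for the indexes, the function names are invented, and the data is chosen so that the few rows matching A' have the smallest X' values.

```python
from bisect import bisect_left, bisect_right

def scan_single_index_on_x(rows, a_value, limit):
    """Walk rows in descending x order, filtering on a; count rows examined."""
    examined, result = 0, []
    # the sorted list stands in for a prebuilt INDEX(x)
    for a, x in sorted(rows, key=lambda r: r[1], reverse=True):
        examined += 1
        if a == a_value:
            result.append((a, x))
            if len(result) == limit:
                break
    return result, examined

def scan_composite_index(rows, a_value, limit):
    """Seek to the a_value range of an (a, x) ordered list and read its tail."""
    index = sorted(rows)  # stands in for a prebuilt INDEX(a, x)
    lo = bisect_left(index, (a_value, float("-inf")))
    hi = bisect_right(index, (a_value, float("inf")))
    matches = index[max(lo, hi - limit):hi][::-1]  # largest x first
    return matches, len(matches)

# 1000 rows with a=0 and large x, 3 rows with a=7 and small x
rows = [(0, x) for x in range(1000, 2000)] + [(7, x) for x in range(3)]

print(scan_single_index_on_x(rows, 7, 2))  # ([(7, 2), (7, 1)], 1002)
print(scan_composite_index(rows, 7, 2))    # ([(7, 2), (7, 1)], 2)
```

Both strategies return the same rows, but in this unfavorable distribution the walk along the X' order examines almost the whole table before it finds enough matches.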
You might get the same problem for your first query if you have the indexes (A) and (B) (but not (A,B) or (A,B,C)) there: MySQL will probably use (B) instead of (A) (but check the EXPLAIN output to make sure). Even if you add just one index now, this can happen, for example, when you add the index (B) next week to optimize a different query and forget about this one, so I'd suggest sticking with (at least) (A,B).
I have a question about optimizing SQL queries with multiple indexes.
Imagine I have a table "TEST" with fields "A, B, C, D, E, F".
In my code (php), I use the following "WHERE" query :
Select (..) from TEST WHERE a = 'x' and B = 'y'
Select (..) from TEST WHERE a = 'x' and B = 'y' and F = 'z'
Select (..) from TEST WHERE a = 'x' and B = 'y' and (D = 'w' or F = 'z')
what is the best approach to get the best speed when running queries?
Three multi-column indexes like (A, B), (A, B, F) and (A, B, D, F)?
Or a single multi-column index (A, B, D, F)?
I would tend to say that the three indexes would be best, even if the indexes take up more space in the database.
In my case, I am optimizing for execution time, not space.
The database is of a reasonable size.
Multiple-column indexes:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
In other words, it is a waste of space and computing power to define an index that covers the same first N columns as another index and in the same order.
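Applied to the three candidate indexes from the question, the prefix rule can be checked with a few lines of Python. This is only a sketch of the rule as stated; `redundant_indexes` is a hypothetical helper, and it flags an index only when its full column list is a strict leftmost prefix of another one.

```python
def redundant_indexes(indexes):
    """Indexes whose column list is a strict leftmost prefix of another index."""
    return [
        idx for idx in indexes
        if any(len(other) > len(idx) and other[:len(idx)] == idx for other in indexes)
    ]

candidates = [("A", "B"), ("A", "B", "F"), ("A", "B", "D", "F")]
print(redundant_indexes(candidates))  # [('A', 'B')]
```

Note that (A, B, F) is not flagged: it is not a prefix of (A, B, D, F), because D sits between B and F.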
The best way to examine an index is to try it out. Use EXPLAIN in MySQL: it gives you the query plan and tells you which index MySQL will use. In addition, it gives an estimate of the number of rows the query will have to examine (not an estimated run time). Here is an example:
explain select * from TEST WHERE a = 'x' and B = 'y'
It is hard to give definitive answers without experiments.
BUT: ordinarily an index like (A,B,D) is considered to be superfluous if you have an index on (A,B,D,F). So, in my opinion you only need the one multicolumn index.
There is one other consideration. If your table has a lot of columns and a lot of rows and your SELECT list has a small subset of those columns, you might consider including those columns in your index. For example, if your query says SELECT D,F,G,H FROM ... you should try creating an index on
(A,B,D,F,G,H)
as it will allow the query to be satisfied from the index without having to refer back to the rows of the table. This can sometimes help performance a great deal.
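Whether an index can satisfy a query on its own comes down to a subset test. The Python sketch below expresses just that test; `is_covering` is a made-up helper, and it ignores everything else the optimizer considers.

```python
def is_covering(index, select_columns, where_columns):
    """True if every column the query touches is stored in the index."""
    return set(select_columns) | set(where_columns) <= set(index)

# SELECT D,F,G,H FROM ... WHERE A = ... AND B = ...
print(is_covering(("A", "B", "D", "F", "G", "H"), "DFGH", "AB"))  # True
print(is_covering(("A", "B", "D", "F"), "DFGH", "AB"))            # False
```

In the second case the query still has to read G and H from the table rows, so the index narrows the search but does not cover the query.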
It's hard to explain well, but generally you should use as few indexes as you can get away with, using as many columns of the common queries as you can, with the most commonly queried columns first.
In your example WHERE clauses, A and B are always included. These should thus be part of an index. If A is more commonly used in a search then list that first; if B is more commonly used then list that first. MySQL can partially use the index as long as each column (seen from the left) in the index is used in the WHERE clause. So if you have an index ( A, B, C ) then WHERE ( A = .. AND B = .. AND Z = .. ) can still use that index to narrow down the search. If you have a WHERE ( B = .. AND Z = .. ) clause then A isn't part of the search condition and that index can't be used.
You want a single multi-column index on A, B, D, F or on A, B, F, D (only one of these can be used at a time); which one depends mostly on how often D or F is queried and on the distribution of the data. Say most of the values in D are 0 but one in a hundred is 1: that column would have a poor key distribution, and thus putting the index on that column wouldn't be all that useful.
The optimiser can use a composite index for where conditions that follow the order of the index with no gaps:
An index on (A,B,F) will cover the first two queries.
The last query is a bit trickier, because of the OR. I think only the A and B conditions will be covered by (A,B,F) but using a separate index (D) or index (F) may speed up the query depending on the cardinality of the rows.
I think an index on (A,B,D,F) can only be used for the A and B conditions on all three queries. Not the F condition on query two, because the D value in the index can be anything and not the D and F conditions because of the OR.
You may have to add hints to the query to get the optimiser to use the best index and you can see which indexes are being used by running an EXPLAIN ... on the query.
Also, adding indexes slows down DML statements and can cause locking issues, so it's best to avoid over-indexing where possible.
I have a table with 5 columns, say A (primary key), B, C, D and E.
This table has almost 150k rows and there are no indices on this table. As expected the select queries are very slow.
These queries are generated from user search requests, so the user can enter values in any of the fields (B, C, D and E), and these are 'IN' kind of queries. I am not sure what a good indexing strategy would be here: having an index on each of these columns, or having them in some combinations.
Selectivity of each of these columns is the same (around 50).
Any help would be appreciated.
Are you running the same query regardless of what the user gives you? In that case, that query should tell you what indexes to use.
For example, if your query might look like
SELECT * FROM mytable WHERE
B IN (...) AND
C IN (...) AND
D IN (...) AND
E IN (...)
In this case, where you restrict on all four searchable columns, a combined index on all four of them (B, C, D, E) would probably be ok.
Otherwise, create one index per column, or combine columns that you always restrict on together in separate indexes.
Remember that if you have a combined index on e.g. B and C, then a query that does not restrict on B will not use that combined index.
If you can group two columns in one index, that would be okay. Having an index on each column is not so bad as long as you don't query a Cartesian product such as a cross join. But better not to.
I wonder how MySQL will deal with the statement if both columns A and B are indexed.
I suppose there are two ways to do it.
a. Select all records from t where A = 123 as a temporary result.
b. Find the record with the maximum B in the temporary result and return it.
The time complexity might be O(lgN + m).
Or get the record in one step; in other words, T(N) = O(lgN)?
Thanks in advance.
My instinct tells me that MySQL won't touch the index on B for this query unless B is nullable and very sparsely populated (really sparse, as low as 1% or lower, as well as numbering less than 10% of the average number of values per index key A), such that inspecting B in descending order and then checking those records for A=123 is worthwhile.
More than likely it will just use A (if A is selective enough), retrieve the records from the table, sort by B descending and return the result.
This would mean your 1st case, O(lgN + m). Here m, the number of records that on average satisfy A={any x}, is statistically directly proportional to the table size.