You are given a relation R with N columns of the same type. Write an SQL query that returns
the tuples (perhaps more than one) having the minimum number of different values. The solution
should have size polynomial in N and use aggregation in an essential way.
Examples:
R1 = {t1 : (a, a, b), t2 : (b, a, c)}: the result is t1.
R2 = {t1 : (a, a, b), t2 : (b, a, c), t3 : (b, b, b)}: the result is t3.
A part of the solution is devising a way to uniquely identify tuples without introducing the
notion of a tuple identifier.
I can't grasp the tuple concept.
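Since the SQL itself is the exercise, here is only a small Python sketch of what "minimum number of different values" means per tuple, using the examples above. The helper name min_distinct_tuples is invented for illustration:

```python
# Sketch (Python, not the requested SQL): count the distinct values in each
# tuple and keep the tuples whose count is minimal. Names t1/t2/t3 match
# the examples in the question.
def min_distinct_tuples(relation):
    """relation: dict mapping tuple name -> tuple of attribute values."""
    counts = {name: len(set(values)) for name, values in relation.items()}
    best = min(counts.values())
    return sorted(name for name, c in counts.items() if c == best)

R1 = {"t1": ("a", "a", "b"), "t2": ("b", "a", "c")}
R2 = {"t1": ("a", "a", "b"), "t2": ("b", "a", "c"), "t3": ("b", "b", "b")}

print(min_distinct_tuples(R1))  # ['t1']  (2 distinct values vs 3)
print(min_distinct_tuples(R2))  # ['t3']  (only 1 distinct value)
```

In the SQL version the same idea applies, except that a tuple is identified by the full list of its column values (e.g. grouping on all N columns), which is the hint in the problem statement.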
I have 9 columns in a table with 500 million rows on which I will run SELECT queries, and any of the columns may or may not appear in a WHERE or GROUP BY.
For example:
Columns -> (A, B, C, D, E, F, I, J, K)
query ->
SELECT *
FROM table WHERE A = 'x' AND J = 'y' GROUP BY B, E, K
What is the best way to index and optimize the database? Do I have to create a multiple-column (composite) index for each permutation of columns?
For 3 columns I know I could do:
(a, b, c), (b, c), (c), (a, c)
but what about 9 columns?
You can't achieve the desired goal. Some possible alternatives:
Discover which columns are most commonly used; then create up to 10 indexes with up to 3 columns each. If you cover the most common combinations, the result might be "good enough".
Look into MariaDB with "Columnstore".
Look into addon packages.
First, SELECT * is not appropriate with GROUP BY. Happily, MySQL no longer allows that syntax (by default).
If you intend:
SELECT B, E, K
FROM table
WHERE A = 'x' AND J = 'y'
GROUP BY B, E, K;
Then the best index is on (A, J, B, E, K).
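As a quick check of that advice, here is a minimal sqlite3 sketch (sqlite standing in for MySQL; the table, the index name idx_ajbek, and the data are all made up) that builds the (A, J, B, E, K) index and inspects the plan with EXPLAIN QUERY PLAN, sqlite's analogue of MySQL's EXPLAIN:

```python
# Demo: composite index (A, J, B, E, K) serving WHERE A/J equality + GROUP BY B, E, K.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (A TEXT, B TEXT, C TEXT, D TEXT, E TEXT, F TEXT, I TEXT, J TEXT, K TEXT)")
data = [
    ("x", "b1", "c", "d", "e1", "f", "i", "y", "k1"),
    ("x", "b1", "c", "d", "e1", "f", "i", "y", "k1"),
    ("x", "b2", "c", "d", "e2", "f", "i", "y", "k2"),
    ("x", "b9", "c", "d", "e9", "f", "i", "n", "k9"),  # J != 'y', filtered out
]
con.executemany("INSERT INTO t VALUES (?,?,?,?,?,?,?,?,?)", data)
con.execute("CREATE INDEX idx_ajbek ON t (A, J, B, E, K)")

result = con.execute(
    "SELECT B, E, K FROM t WHERE A = 'x' AND J = 'y' GROUP BY B, E, K"
).fetchall()
print(result)  # [('b1', 'e1', 'k1'), ('b2', 'e2', 'k2')]

# The plan should show a search on idx_ajbek rather than a full table scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT B, E, K FROM t WHERE A = 'x' AND J = 'y' GROUP BY B, E, K"
).fetchall()
print(plan)
```

Because the two equality columns come first and the GROUP BY columns follow in index order, the index can satisfy both the filter and the grouping without a separate sort.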
Given a dataset D_{26xn} with columns named a-z (the number of columns is just an example) and n observations. Each column x has r_x unique states. Rows in D are sorted with descending priority on columns a-z.
Task: for columns (b, j, p), return row indexes such that the indexes of identical rows are consecutive. The ordering among rows with different value sets for (b, j, p) is immaterial.
Can there be a solution with a complexity of O(n)?
Sol1: columns (b, j, p) can be sorted and the respective indexes returned, but the complexity of this solution is O(no_columns * n log(n)).
Sol2: iterate over each row and hash it. But hashing would be more expensive in practice.
Seems unlikely. If you had such a solution, you'd be able to sort data with arbitrary-length keys in O(n).
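For what it's worth, Sol2 is straightforward to sketch: a hash map gives expected (not worst-case) O(n), though, as noted above, the constants can hurt in practice. The helper name and the data below are invented:

```python
# Sketch of Sol2: bucket row indexes by their (b, j, p) projection using a
# dict (hash map). Expected O(n) over the rows; identical rows end up with
# consecutive positions in the output.
def consecutive_index_order(rows, cols):
    buckets = {}
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in cols)
        buckets.setdefault(key, []).append(i)
    # Concatenate the buckets; ordering between distinct keys is immaterial.
    return [i for bucket in buckets.values() for i in bucket]

rows = [
    {"b": 1, "j": 0, "p": 2},
    {"b": 0, "j": 0, "p": 0},
    {"b": 1, "j": 0, "p": 2},  # identical to row 0 on (b, j, p)
]
print(consecutive_index_order(rows, ("b", "j", "p")))  # [0, 2, 1]
```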
I have a question about optimizing sql queries with multiple index.
Imagine I have a table "TEST" with fields "A, B, C, D, E, F".
In my code (PHP), I use the following WHERE queries:
SELECT (..) FROM TEST WHERE A = 'x' AND B = 'y'
SELECT (..) FROM TEST WHERE A = 'x' AND B = 'y' AND F = 'z'
SELECT (..) FROM TEST WHERE A = 'x' AND B = 'y' AND (D = 'w' OR F = 'z')
What is the best approach to get the best speed when running these queries?
Three composite indexes, like (A, B), (A, B, F) and (A, B, D, F)?
Or a single composite index (A, B, D, F)?
I would tend to say the three indexes would be best, even though the indexes will take more space in the database.
In my problem I am after the best execution time, not the smallest space, the database being of a reasonable size.
Multiple-column indexes:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
In other words, it is a waste of space and computing power to define an index that covers the same first N columns as another index, in the same order.
The best way to examine index usage is to experiment. Use EXPLAIN in MySQL: it gives you the query plan and tells you which index will be used, along with an estimate of the number of rows to be examined. Here is an example:
EXPLAIN SELECT * FROM TEST WHERE A = 'x' AND B = 'y';
It is hard to give definitive answers without experiments.
BUT: ordinarily an index like (A,B,D) is considered to be superfluous if you have an index on (A,B,D,F). So, in my opinion you only need the one multicolumn index.
There is one other consideration. If your table has a lot of columns and a lot of rows and your SELECT list has a small subset of those columns, you might consider including those columns in your index. For example, if your query says SELECT D,F,G,H FROM ... you should try creating an index on
(A,B,D,F,G,H)
as it will allow the query to be satisfied from the index without having to refer back to the rows of the table. This can sometimes help performance a great deal.
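A minimal sqlite3 sketch of that covering-index idea (sqlite standing in for MySQL; the table, the index name idx_cover, and the data are invented). sqlite's EXPLAIN QUERY PLAN reports when a query is satisfied from the index alone:

```python
# Demo: an index containing every column the query touches (A, B in the
# WHERE clause; D, F, G, H in the SELECT list) can answer the query without
# reading the table rows at all.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (A TEXT, B TEXT, C TEXT, D TEXT, F TEXT, G TEXT, H TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?,?,?,?,?,?,?)",
    [("x", "y", "c", "d1", "f1", "g1", "h1"),
     ("x", "n", "c", "d2", "f2", "g2", "h2")],
)
con.execute("CREATE INDEX idx_cover ON t (A, B, D, F, G, H)")

rows = con.execute("SELECT D, F, G, H FROM t WHERE A = 'x' AND B = 'y'").fetchall()
print(rows)  # [('d1', 'f1', 'g1', 'h1')]

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT D, F, G, H FROM t WHERE A = 'x' AND B = 'y'"
).fetchall()
print(plan)  # the detail column should mention a COVERING INDEX
```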
It's hard to explain well, but generally you should use as few indexes as you can get away with, using as many columns of the common queries as you can, with the most commonly queried columns first.
In your example WHERE clauses, A and B are always included, so they should be part of an index. If A is more commonly used in searches, list it first; if B is, list B first. MySQL can use an index partially as long as each column (counting from the left) appears in the WHERE clause. So with an index ( A, B, C ), a clause WHERE ( A = .. AND B = .. AND Z = .. ) can still use the index to narrow down the search on A and B. With WHERE ( B = .. AND Z = .. ), however, A isn't part of the search condition, so that index can't be used.
You want the single composite index (A, B, D, F) or (A, B, F, D); only one of them can be used at a time. Which one is better depends mostly on how often D or F is queried, and on the distribution of the data. If, say, most of the values in D are 0 and only one in a hundred is 1, that column has a poor key distribution, so putting it in the index wouldn't be all that useful.
The optimiser can use a composite index for where conditions that follow the order of the index with no gaps:
An index on (A,B,F) will cover the first two queries.
The last query is a bit trickier, because of the OR. I think only the A and B conditions will be covered by (A,B,F) but using a separate index (D) or index (F) may speed up the query depending on the cardinality of the rows.
I think an index on (A,B,D,F) can only be used for the A and B conditions on all three queries. Not the F condition on query two, because the D value in the index can be anything and not the D and F conditions because of the OR.
You may have to add hints to the query to get the optimiser to use the best index and you can see which indexes are being used by running an EXPLAIN ... on the query.
Also, adding indexes slows down DML statements and can cause locking issues, so it's best to avoid over-indexing where possible.
Query:
SELECT a, b, c FROM table WHERE a = .. and b like 'example%' and c = '..'
Does this query use index (a,b,c) or (a,b)?
For a covering index to even begin to help this query, it needs to be
a,c,b
That's because the query wants a specific single value for a and c and a range of values (LIKE 'string%') for b.
The compound BTREE index gets random-accessed to the specific a,c value and the starting b value. It scans (in a so-called tight scan) to the last eligible b value.
Note that
c,a,b
will also work.
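A small sqlite3 sketch of the (a, c, b) layout (sqlite standing in for MySQL; the data is invented). The PRAGMA lets sqlite rewrite LIKE 'example%' into the kind of range scan described above:

```python
# Demo: equality columns (a, c) first, then the range column (b) last, so a
# single index range scan covers a = .., c = .., b LIKE 'example%'.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA case_sensitive_like = ON")  # enables the LIKE-prefix optimization
con.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?,?,?)",
    [("1", "example1", "9"), ("1", "example2", "9"),
     ("1", "other", "9"), ("2", "example3", "9")],
)
con.execute("CREATE INDEX idx_acb ON t (a, c, b)")

rows = con.execute(
    "SELECT a, b, c FROM t WHERE a = '1' AND b LIKE 'example%' AND c = '9'"
).fetchall()
print(rows)  # [('1', 'example1', '9'), ('1', 'example2', '9')]

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT a, b, c FROM t WHERE a = '1' AND b LIKE 'example%' AND c = '9'"
).fetchall()
print(plan)
```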
I'm using mysql and would like to process a very large table with a primary key of 4 parts in blocks of 10,000 (marshalling data to another system). The database is offline when I am doing the processing so I don't have to worry about any modifications. Say the primary key is (A, B, C, D) which are all integers. I first tried using LIMIT OFFSET to achieve this like this:
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000 OFFSET 0;
Where I increased the offset by 10000 on each call. This seemed to get very slow as it got towards the higher rows in the table. Is it not possible to do this LIMIT OFFSET efficiently?
Then I tried a different approach that uses comparison on the composite primary key. I can get the first block like this:
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
If the last row of that block has A = a, B = b, C = c, and D = d then I can get the next block with:
SELECT * FROM LargeTable
WHERE
A > a OR
(A = a AND B > b) OR
(A = a AND B = b AND C > c) OR
(A = a AND B = b AND C = c AND D > d)
ORDER BY A, B, C, D LIMIT 10000;
And then repeat that for each block. This also seemed to slow down greatly as I got to the higher rows in the table. Is there a better way to do this? Am I missing something obvious?
Start processing data from the very start using just plain
SELECT *
FROM LargeTable
ORDER BY A, B, C, D
and fetch rows one by one in your client code. You can fetch 10000 rows at a time in your fetch loop if you want, or add a LIMIT 10000 clause. When you want to stop a block, remember the last tuple (A, B, C, D) that was processed; let's call it (A1, B1, C1, D1).
Now, when you want to restart from last point, fetch rows again one by one, but this time use tuple comparison in your WHERE clause:
SELECT *
FROM LargeTable
WHERE (A, B, C, D) > (A1, B1, C1, D1)
ORDER BY A, B, C, D
(you can also add a LIMIT 10000 clause if you don't want to rely on the client code exiting the fetch loop prematurely).
The key to this solution is that MySQL correctly implements tuple (row-value) comparison.
EDIT: mentioned that optional LIMIT 10000 can be added.
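A runnable sketch of this keyset approach using Python's sqlite3 (standing in for MySQL here; row-value comparison needs SQLite 3.15 or newer, and the block size is shrunk for the demo):

```python
# Keyset pagination over a composite primary key using tuple comparison,
# as in the answer above. Each query seeks past the last tuple seen instead
# of scanning and discarding OFFSET rows.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE LargeTable (A INT, B INT, C INT, D INT, PRIMARY KEY (A, B, C, D))")
con.executemany(
    "INSERT INTO LargeTable VALUES (?,?,?,?)",
    [(a, b, c, d) for a in range(3) for b in range(3) for c in range(3) for d in range(3)],
)

BLOCK = 10  # LIMIT 10000 in the original question; small here for the demo
last = None
pages = []
while True:
    if last is None:
        block = con.execute(
            "SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT ?", (BLOCK,)
        ).fetchall()
    else:
        block = con.execute(
            "SELECT * FROM LargeTable WHERE (A, B, C, D) > (?, ?, ?, ?) "
            "ORDER BY A, B, C, D LIMIT ?",
            (*last, BLOCK),
        ).fetchall()
    if not block:
        break
    pages.append(block)
    last = block[-1]  # resume point for the next block

paged = [row for page in pages for row in page]
full = con.execute("SELECT * FROM LargeTable ORDER BY A, B, C, D").fetchall()
assert paged == full  # every row visited exactly once, in key order
print(len(paged))  # 81
```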
You're probably invoking a sequential scan of the table in some way.
Further, your conditional SELECT is not doing what you think it does: it short-circuits on the first condition, A > a.
It'll be more efficient if you skip the ORDER BY and LIMIT and use a statement like:
SELECT *
FROM LargeTable
WHERE A = a AND B = b AND C = c;
And just iterate through sets of a, b, and c.
A lot depends on the context in which you're doing your 'marshalling' operations, but is there a reason why you can't let the unconstrained SELECT run, and have your code do the grouping into blocks of 10,000 items?
In pseudo-code:
while (fetch_row succeeds)
{
    add row to marshalled data
    if (10,000 rows marshalled)
    {
        process the 10,000 marshalled rows
        set number of marshalled rows to 0
    }
}
if (marshalled rows > 0)
{
    process the remaining marshalled rows
}
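That loop might look like this in Python with sqlite3 (a sketch: process_rows and the data are invented, and the batch size is shrunk for the demo):

```python
# Client-side chunking: run one unconstrained ordered SELECT and let the
# client group rows into fixed-size batches via fetchmany.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE LargeTable (A INT, B INT, C INT, D INT, PRIMARY KEY (A, B, C, D))")
con.executemany("INSERT INTO LargeTable VALUES (?,?,?,?)", [(i, i, i, i) for i in range(25)])

processed = []
def process_rows(rows):
    processed.extend(rows)  # stand-in for the real marshalling work

cur = con.execute("SELECT * FROM LargeTable ORDER BY A, B, C, D")
while True:
    batch = cur.fetchmany(10)  # 10,000 in the original question
    if not batch:
        break
    process_rows(batch)

print(len(processed))  # 25
```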
LIMIT with OFFSET has to throw away rows until it reaches the ones you actually want, so it gets slower as the offset grows.
Here's an idea. Since your database is offline while you do this, the data doesn't actually have to stay in place during the job. Why not move rows to another table as you process them? I'm not sure it will be faster (it depends on how many indexes the table has), but you should try it.
CREATE TABLE processed LIKE LargeTable;

-- repeat until LargeTable is empty:
SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;  -- process this block
INSERT INTO processed SELECT * FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;
DELETE FROM LargeTable ORDER BY A, B, C, D LIMIT 10000;

DROP TABLE LargeTable;
RENAME TABLE processed TO LargeTable;