I'm using MySQL, although I suspect this is a generic database question.
I have a table consisting of 6 numeric columns. The first 5 of these make up the primary key.
It is a large table (20 million rows and growing), so some queries take time: about 10 seconds, which in itself is not too long, but I need to run a lot of them.
I understand that the primary key is automatically indexed - is there any advantage in me separately indexing some groups of columns within the primary key that I usually query on?
That is, if I regularly query on the first 3 of the 5 primary key columns, should I create an additional index for these 3, or is that redundant because they are already part of the primary key index?
Ten seconds is quite a long time for a query that returns one or a tiny handful of rows. If the query is returning 3% of the table's contents, though, ten seconds is not too long.
Your primary unique key is backed up by a composite index, let's say an index on
(I1,I2,I3,I4,I5)
You are correct that a query like
WHERE I1 = val AND I2 = val AND I3 = val
and
WHERE I3 = val AND I2 = val AND I1 = val
should use the index created for the primary key. The important thing is that the query constrains a leftmost prefix of the composite index's columns (here I1, I2, I3); the order in which the conditions appear in the WHERE clause does not matter. A query like
WHERE I3 = val AND I4 = val AND I5 = val
won't use the primary key's composite index very well, if at all. Neither will a query that does some kind of computation on the column values mentioned in the key, like
WHERE I1+I2+I3=sumvalue
Keep in mind that "should work" is not the same as "does work." Try using the EXPLAIN command in MySQL to figure out whether the DBMS is doing what you expect it to for your query.
http://dev.mysql.com/doc/refman/5.1/en/explain.html
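A minimal sketch of that EXPLAIN check, assuming a hypothetical table named t whose primary key is (I1,I2,I3,I4,I5); the literal values are placeholders. In the output, the key column shows which index was chosen and the type column shows how it is accessed (ref or range suggests an index lookup, ALL a full table scan). The exact plan depends on your data and MySQL version.

```sql
-- Hypothetical table name `t`; I1..I5 are the primary key columns.

-- Leftmost prefix (I1, I2, I3): expected to be resolved through PRIMARY.
EXPLAIN SELECT * FROM t WHERE I1 = 1 AND I2 = 2 AND I3 = 3;

-- Same conditions in a different order: expected to give the same plan.
EXPLAIN SELECT * FROM t WHERE I3 = 3 AND I2 = 2 AND I1 = 1;

-- Leftmost column missing: not expected to do a keyed lookup on PRIMARY.
EXPLAIN SELECT * FROM t WHERE I3 = 3 AND I4 = 4 AND I5 = 5;
```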
Why not just create a few test queries, create the index on a copy of the table and see how it performs?
When it comes to performance, measuring is always better than trusting an opinion.
The "best" solution in a database largely depends on the specific details of the table(s) involved. What range of values in the columns, what distribution of values, what type of queries, relative frequency of select/delete/insert/update queries, etc.
That being said, my guess is that an index on a subset will help if that subset contains all the columns used in the query's WHERE clause. You might get better performance still if the index also includes the columns in the SELECT list, so that the query can be answered from the index alone.
Related
I'm running a couple tests on MySQL Clustered vs Non Clustered indexes where I have a table 100gb_table which contains ~60 million rows:
100gb_table schema:
CREATE TABLE 100gb_table (
id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
c1 int,
c2 text,
c3 text,
c4 blob NOT NULL,
c5 text,
c6 text,
ts timestamp NOT NULL default(CURRENT_TIMESTAMP)
);
and I'm executing a query that only reads the clustered index:
SELECT id FROM 100gb_table ORDER BY id;
I'm seeing that it takes almost 55 minutes for this query to complete, which is strangely slow. I modified the table by adding another index on top of the primary key column and ran the following query, which forces the non-clustered index to be used:
SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except the clustered index contains table data in the leaf nodes so I would expect both queries to be similarly performant. Could the BLOB column possibly be distorting the clustered index structure?
The answer comes in how the data is laid out.
The PRIMARY KEY is "clustered" with the data; that is, the data is ordered by the PK in a B+Tree structure. To read all of the ids, the entire BTree must be read.
Any secondary index is also in a B+Tree structure, but it contains (1) the columns of the index, and (2) any other columns in the PK.
In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
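One way to see the size difference is to look at InnoDB's persistent index statistics. This is a sketch that assumes persistent statistics are enabled (the default in recent MySQL versions) and that the table lives in the current database; the figures are estimates and are only refreshed when the statistics are recalculated (for example by ANALYZE TABLE).

```sql
-- Approximate size of each index on the table, in pages and in MB.
-- The 'size' statistic is the number of pages InnoDB estimates for the index.
SELECT index_name,
       stat_value AS pages,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name = '100gb_table'
  AND stat_name = 'size';
```

The row for PRIMARY covers the whole clustered data BTree, so it should be far larger than the row for the secondary index on id.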
A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.
Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.
Tack on ORDER BY or WHERE, etc., and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of having extra [random] fetches to get the texts/blobs from the "off-record" storage. And from the network time to shovel far more data.
I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. EXPLAIN shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. EXPLAIN shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK, and a query with WHERE pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speed up SELECT COUNT(*) FROM tbl.
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
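A sketch of that experiment, assuming a hypothetical narrow column named narrow_col on tbl (any small, fixed-width column would do) and a made-up index name:

```sql
-- Hypothetical column and index names; pick the narrowest column you have.
ALTER TABLE tbl ADD INDEX idx_narrow (narrow_col);

-- The `key` column of the EXPLAIN output should now name the secondary index
-- if the optimizer prefers it for the count.
EXPLAIN SELECT COUNT(*) FROM tbl;

-- Run the timing twice; the second run reflects cached blocks.
SELECT COUNT(*) FROM tbl;
SELECT COUNT(*) FROM tbl;
```

Keep in mind that the extra index costs disk space and slows writes slightly, so it is worth it only if the unfiltered count is run often.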
SPATIAL and FULLTEXT indexing complicate this discussion, especially if you have 2 parts to the WHERE: one with Spatial or Fulltext, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.
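A small illustration of the difference, using a throwaway temporary table (the table name count_demo is made up for this example):

```sql
CREATE TEMPORARY TABLE count_demo (x INT);
INSERT INTO count_demo (x) VALUES (1), (NULL), (3);

-- COUNT(*) and COUNT(1) count every row; COUNT(x) skips the row where x IS NULL.
SELECT COUNT(*) AS all_rows,      -- 3
       COUNT(1) AS also_all_rows, -- 3
       COUNT(x) AS non_null_x     -- 2
FROM count_demo;

DROP TEMPORARY TABLE count_demo;
```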
I'm trying to understand if it's possible to use an index on a join if there is no limiting where on the first table.
Note: this is not a line-by-line real-case usage, just something I drafted together for understanding purposes. Don't point out the obvious "what are you trying to obtain with this schema?", "you should use UNSIGNED" or the like, because that's not the question.
Note2: this MySQL JOINS without where clause is somehow related but not the same
Schema:
CREATE TABLE posts (
id_post INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
text VARCHAR(100)
);
CREATE TABLE related (
id_relation INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id_post1 INT NOT NULL,
id_post2 INT NOT NULL
);
CREATE INDEX related_join_index ON related(id_post1) using BTREE;
Query:
EXPLAIN SELECT * FROM posts FORCE INDEX FOR JOIN(PRIMARY) INNER JOIN related ON id_post=id_post1 LIMIT 0,10;
SQL Fiddle: http://sqlfiddle.com/#!2/84597/3
As you can see, the index is being used on the second table, but the engine is doing a full table scan on the first one (the FORCE INDEX is there just to highlight the general question).
I'd like to understand if it's possible to get a "ref" on the left side too.
Thanks!
Update: if the first table has significantly more records than the second, things swap: the engine uses an index for the first one and a full table scan for the second http://sqlfiddle.com/#!2/3a3bb/1 Still, no way to get indexes used on both.
The DBMS has an optimizer to figure out the best plan to execute a query. It's up to the optimizer to decide whether to use an index or simply read the table directly.
An index makes sense when the DBMS expects to read only a few records from a table (say 1% of all rows). But once it expects to read many records (say 99% of all rows), it will not use the index. The threshold may lie as low as 5% (i.e. <= 5% -> index; > 5% -> table scan).
There are exceptions. One is when an index holds all columns needed. Then the table itself doesn't have to be read at all. Another may be when the optimizer thinks an index access may result faster in spite of having to read many rows. It's also always possible the optimizer simply guesses wrong.
There is a page on the MySQL documentation about this subject.
Regarding the possibility to get a ref on the first table from the query, the short answer is NO.
The reason is obvious: because there is no WHERE clause ALL the rows from table posts are analyzed because they could be included in the result set. There is no reason to use an index for that, a full table scan is better because it gets all the rows; and because the order doesn't matter, the access is (more or less) sequential. Using an index requires reading more information from the storage (index and data).
MySQL will use the join type index if all the columns that appear in the SELECT clause are present in an index. In this case MySQL will perform a full index scan (join type index) instead of a full table scan (join type ALL) because it requires reading less information from the storage (an index is usually smaller than the entire table data).
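As a sketch of that last point against the schema from the question: if the SELECT list names only columns that are present in an index on each table, the optimizer can read the indexes alone. The plan you actually get depends on table sizes and MySQL version, so treat the comments below as what to look for rather than guaranteed output.

```sql
-- Only indexed columns are selected: id_post is the PRIMARY KEY of posts,
-- and id_post1 is covered by related_join_index.
EXPLAIN SELECT id_post, id_post1
FROM posts
INNER JOIN related ON id_post = id_post1
LIMIT 0, 10;
-- Look for type = index (instead of ALL) on the driving table
-- and "Using index" in the Extra column.
```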
We have a table that has multiple columns, and we have a UNIQUE index on one of our columns (let's call it GBID), and we have another column (let's call it flag) that has no indexes. This table can be quite large. We query WHERE gbid IN () AND flag = 1 a lot, we occasionally query WHERE gbid = "XXX", and we rarely query WHERE flag = 1.
Which is more efficient when it comes to indexes:
1. Have gbid as UNIQUE and flag with no index
2. Have gbid as UNIQUE and have a multi column index for (gbid, flag)
3. Have gbid as UNIQUE and have a multi column index for (flag, gbid)
It depends on the % of rows with flag=1, and on how many rows you select (how many gbid's you have in the IN clause).
If it is low (1-2%) and you are selecting a lot of gbid's, options 2 and 3 might be faster (I think option 3 will be better in that case).
If you have a more even distribution of flag values, having it in the index won't make a difference.
If you want to be sure you should benchmark it with a sample of real data.
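A rough sketch of such a benchmark, run on a copy of the table. The table name items_copy, the index names, and the gbid values are all made up for illustration; substitute a realistic IN list from your own workload.

```sql
-- Multi-column index (gbid, flag)
ALTER TABLE items_copy ADD INDEX idx_gbid_flag (gbid, flag);
EXPLAIN SELECT * FROM items_copy WHERE gbid IN ('A', 'B', 'C') AND flag = 1;
ALTER TABLE items_copy DROP INDEX idx_gbid_flag;

-- Multi-column index (flag, gbid)
ALTER TABLE items_copy ADD INDEX idx_flag_gbid (flag, gbid);
EXPLAIN SELECT * FROM items_copy WHERE gbid IN ('A', 'B', 'C') AND flag = 1;
ALTER TABLE items_copy DROP INDEX idx_flag_gbid;
```

Compare the key and rows columns of each EXPLAIN, then time the real query under each setup, running it more than once so that caching does not skew the comparison.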
I have a table of the form
CREATE TABLE data
(
pk INT PRIMARY KEY AUTO_INCREMENT,
dt BLOB
);
It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.
Something like 3000 of the blobs are identical. So what I want is a query that will give me a remap table that will allow me to remove the duplicates.
The naive approach took about an hour on 30-40k rows:
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
ON a.dt=b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;
I happen to have, for other reasons, a table that has the sizes of the blobs:
CREATE TABLE sizes
(
fk INT, -- note: non-unique
sz INT
-- other cols
);
By building one index on fk and another on sz, the direct query from that takes about 24 sec with 50k rows:
SELECT da.pk,MIN(db.pk)
FROM data AS da
JOIN data AS db
JOIN sizes AS sa
JOIN sizes AS sb
ON
sa.sz=sb.sz
AND da.pk=sa.fk
AND db.pk=sb.fk
WHERE
sb.fk<sa.fk
AND da.dt=db.dt
GROUP BY da.pk;
However, that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind, I added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.
OK so for the question: Am I going to get much better than the second select? If so, how?
A bit of a corollary is: if I have a table where the key columns get very heavy use but the rest should only rarely be used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?
Xgc on #mysql at irc.freenode.net points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might even make it not too bad to keep up to date.
You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.
The question is whether you can store the hashes in your database.
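If storing them is an option, a minimal sketch could look like the following. The column name dt_hash and the index name are assumptions made for this example, and MD5 is used only as a cheap fingerprint: the final dt = dt comparison guards against hash collisions.

```sql
-- Hypothetical extra column holding a hex MD5 digest of each blob.
ALTER TABLE data
  ADD COLUMN dt_hash CHAR(32),
  ADD INDEX idx_dt_hash (dt_hash);

-- One pass over the table to fill in the hashes.
UPDATE data SET dt_hash = MD5(dt);

-- Remap table: each duplicate row paired with the lowest pk holding the same blob.
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
  ON a.dt_hash = b.dt_hash
 AND b.pk < a.pk
WHERE a.dt = b.dt   -- confirm the blobs really match
GROUP BY a.pk;
```

The join now compares short indexed strings first and only touches the blobs for rows whose hashes already match. New rows would need their hash set on insert, for example by the application or a trigger.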