MySQL performance gain by reducing index size?

I have a table with ~1.2m rows in it. It has 6 columns indexed, including one varchar(255) field that contains URLs.
I need to be able to check whether a URL already exists in the table, hence the index, but I'm wondering whether I would see a performance gain by reducing the indexed length to around 50 characters?
Of course this would mean it may have to scan more rows when searching for a URL in the database, but I only run this query about once every 30 seconds, so I'm wondering if the smaller index size would be worth it. Thoughts?

Two reasons why lowering it may be better (assuming your index is useful):
1) Indexes, too, get loaded into memory, so there is a rare possibility that your index grows to the point where it is no longer completely cacheable in memory. That's when you will see a performance hit (with all the new hardware specs... hardly a possibility with 1.2M rows, but still worth noting).
2) Many times, just the first 'n' characters are good enough to quickly identify each record. You may not need to index the whole 255 characters at all.
Two reasons why you may not care:
1) As stated, you may never see your indexes growing out of your key buffer, so why worry?
2) You will need to determine the first 'n' characters (see the sketch below for one way to check), and even after that the performance will be less than or equal to a full index, never more. Do you really need to spend time on that? Is it worth the possible loss of accuracy?
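A minimal sketch of such a check, assuming a hypothetical table pages with a url column:

    -- Compare the selectivity of a 50-character prefix against the full column.
    SELECT COUNT(DISTINCT LEFT(url, 50)) / COUNT(*) AS prefix_50_selectivity,
           COUNT(DISTINCT url)           / COUNT(*) AS full_selectivity
    FROM pages;

    -- If the two ratios are close, a prefix index may be "good enough":
    ALTER TABLE pages ADD INDEX idx_url_prefix (url(50));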

From my SQL indexing tutorial (covers MySQL as well):
Tip: Always aim to index the original data. That is often the most useful information you can put into an index.
This is a general rule I suggest until there is a very strong reason to do something different.
Space is not the issue, in most cases.
Performance-wise, the index tree depth grows logarithmically with the number of index leaf nodes. That means cutting the index size in half will probably not reduce the tree depth at all. Hence, the performance gain might be limited to an improved cache-hit rate. But you mentioned you execute that query only once every 30 seconds. On a moderately loaded machine, that means your index will not be cached at all (unless, maybe, you search for the same URL every 30 seconds).
All in all: I don't see any reason to act against the general advice mentioned above.
If you really want to save index space, try to find redundant indexes first (e.g., those starting with the same columns). These are typically the low-hanging fruit.
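A quick sketch of how one might look for such redundant indexes (the table name is a placeholder; the information_schema query is standard MySQL):

    SHOW INDEX FROM pages;

    -- Or list all indexes in the current schema with their column order,
    -- to spot indexes that start with the same columns:
    SELECT TABLE_NAME, INDEX_NAME,
           GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS cols
    FROM information_schema.STATISTICS
    WHERE TABLE_SCHEMA = DATABASE()
    GROUP BY TABLE_NAME, INDEX_NAME;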

Keep an MD5 hash of your URL, which is a fixed 32 characters long, and index that instead.
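A sketch of that approach, assuming a hypothetical pages table with url and id columns:

    -- Add a fixed-width hash column and index it instead of the long URL.
    ALTER TABLE pages
        ADD COLUMN url_md5 CHAR(32) NOT NULL DEFAULT '',
        ADD INDEX idx_url_md5 (url_md5);

    UPDATE pages SET url_md5 = MD5(url);

    -- Look up by the hash, re-checking the URL itself to guard against collisions:
    SELECT id FROM pages
    WHERE url_md5 = MD5('http://example.com/page')
      AND url = 'http://example.com/page';

Storing the hash as BINARY(16) via UNHEX(MD5(url)) would halve the index size again, at the cost of a little readability.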

I doubt you would see any difference by changing the index to only use the first 50 characters.
Since it's a VARCHAR column, the indexed values will only be as long as each URL anyway, so looking at typical URLs you may only be indexing around 50 characters per URL already.
Even if the URLs are all significantly longer, reducing the index size may just increase the chance that that part of the index is already in memory, but again I doubt you would notice any difference. This would only be useful at very high volume, when you needed to start micro-optimising for additional performance.
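If you want to check whether that is already the case for your data, a quick sketch (table and column names are placeholders):

    -- How long are the URLs actually?
    SELECT AVG(CHAR_LENGTH(url)) AS avg_len,
           MAX(CHAR_LENGTH(url)) AS max_len
    FROM pages;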

Index size only really matters for disk space, so you won't run into serious problems because of it.
Whether or not to have an index should be based on your CRUD operations: do you have more selects, or more inserts/updates/deletes?

Related

In which cases would MySQL InnoDB's COMPACT row_format be faster/better than REDUNDANT?

I saw many comparisons on Stack Overflow, DBA and Server Fault, but it's never actually clear, when it comes to performance in specific situations, whether to use the COMPACT or REDUNDANT row format with InnoDB.
Are there any cases in which some simple tables would see a definitive performance boost? For example, a simple relational table user_roles that maps a users table to a roles table, using two integers, which will always make a row the same size on disk?
If that's not a good example, are there good examples that would make a clear difference?
Thanks!
COMPACT format is slightly better at storing field lengths. REDUNDANT stores a length for every field, even if it's a fixed-size INT; COMPACT stores lengths only for variable-length fields.
IMO that will contribute so little to any performance difference that it doesn't make sense to bother with formats.
YMMV.
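For reference, a minimal sketch of how the row format is chosen and inspected (the table definition is just the user_roles example from the question):

    CREATE TABLE user_roles (
        user_id INT NOT NULL,
        role_id INT NOT NULL,
        PRIMARY KEY (user_id, role_id)
    ) ENGINE=InnoDB ROW_FORMAT=COMPACT;

    -- See which format existing tables use:
    SELECT TABLE_NAME, ROW_FORMAT
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE();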
I'm pretty sure this is a case where there is no easy categorization of schemas into "this schema would be better (worse) using that row format".
You did not mention DYNAMIC or COMPRESSED, which were introduced in later versions. Let me give my opinion on all 4 formats...
Better would be to see if your code takes advantage of any of the row formats. The main difference has to do with the handling of TEXT (etc) columns. SELECT * asks for all columns to be returned, but if you specify all but the text columns, the query will probably run a lot faster because the off-row columns don't need to be fetched.
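As a sketch, with hypothetical table and column names: if articles has a large body TEXT column that is stored off-row, the second query may avoid fetching it entirely:

    SELECT * FROM articles WHERE id = 42;                 -- also fetches the off-row TEXT
    SELECT id, title, author FROM articles WHERE id = 42; -- can skip the off-row column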
Having the first 767 bytes of a column included in the main part of the row may provide some speedup. But this depends on what you are doing that could use only the first part of the text, and whether there is an optimization to actually take advantage of the particular case. Note: "prefix indexes" (e.g., INDEX my_text(44)) are useful only in a few cases.
COMPRESSED is of dubious utility -- there is a lot of overhead involved in having both the compressed and uncompressed copies in the buffer_pool. I have trouble imagining a situation where compressed is clearly better. And the compression rate is typically only 2:1. Ordinary text compression is 3:1. If you have big text/blob strings, I think it is better to do client compression (to cut back on network bandwidth) of selected TEXT columns and store into BLOBs.
If you have a table with only two integers in it, there will be a lot of overhead. See SHOW TABLE STATUS; it will probably say Avg_row_length of maybe 40 bytes. And even more if that does not include a PRIMARY KEY.
Furthermore, due to various other things, it would be difficult to see much difference in "faster/better" unless the table has over, say, a million rows.
Bottom line: Go with the default for your version. Don't lose sleep over the decision.
For performance, focus on indexes and the formulation of queries. For space, test your table and report back. Be aware that "free space" comes in many flavors, most of which are not reported anywhere, so the size numbers can change if you sneeze at the table.
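One way to do that test, as a sketch (table name is a placeholder):

    SHOW TABLE STATUS LIKE 'user_roles';
    -- Look at Avg_row_length, Data_length, Index_length and Data_free;
    -- these are estimates and can shift after inserts, deletes, or OPTIMIZE TABLE.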

MySQL partitioning vs indexing performance

In MySQL InnoDB, is there a performance advantage to partitioning a table compared to simply using an index?
Common considerations:
Is an Index the Best Solution?
An index isn’t always the right tool. At a high level, keep in mind that indexes are most effective when they help the storage engine find rows without adding more work than they avoid. For very small tables, it is often more effective to simply read all the rows in the table. For medium to large tables, indexes can be very effective. For enormous tables, the overhead of indexing, as well as the work required to actually use the indexes, can start to add up. In such cases you might need to choose a technique that identifies groups of rows that are interesting to the query, instead of individual rows. You can use partitioning for this purpose.
If you have lots of tables, it can also make sense to create a metadata table to store some characteristics of interest for your queries. For example, if you execute queries that perform aggregations over rows in a multitenant application whose data is partitioned into many tables, you can record which users of the system are actually stored in each table, thus letting you simply ignore tables that don’t have information about those users. These tactics are usually useful only at extremely large scales. In fact, this is a crude approximation of what Infobright does. At the scale of terabytes, locating individual rows doesn’t make sense; indexes are replaced by per-block metadata.
One thing is sure: you can’t scan the whole table every time you want to query it, because it’s too big. And you don’t want to use an index because of the maintenance cost and space consumption. Depending on the index, you could get a lot of fragmentation and poorly clustered data, which would cause death by a thousand cuts through random I/O. You can sometimes work around this for one or two indexes, but rarely for more. Only two workable options remain: your query must be a sequential scan over a portion of the table, or the desired portion of the table and index must fit entirely in memory.
It’s worth restating this: at very large sizes, B-Tree indexes don’t work. Unless the index covers the query completely, the server needs to look up the full rows in the table, and that causes random I/O a row at a time over a very large space, which will just kill query response times. The cost of maintaining the index (disk space, I/O operations) is also very high. Systems such as Infobright acknowledge this and throw B-Tree indexes out entirely, opting for something coarser-grained but less costly at scale, such as per-block metadata over large blocks of data.
This is what partitioning can accomplish, too. The key is to think about partitioning as a crude form of indexing that has very low overhead and gets you in the neighborhood of the data you want. From there, you can either scan the neighborhood sequentially, or fit the neighborhood in memory and index it. Partitioning has low overhead because there is no data structure that points to rows and must be updated—partitioning doesn’t identify data at the precision of rows, and has no data structure to speak of. Instead, it has an equation that says which partitions can contain which categories of rows.
(many thanks to the great book High Performance MySQL)
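As a concrete sketch of "getting in the neighborhood" (table, column, and partition names here are hypothetical), range-partitioning a log table by month lets a query that filters on the date prune all but one or two partitions:

    CREATE TABLE access_log (
        id BIGINT NOT NULL AUTO_INCREMENT,
        logged_at DATETIME NOT NULL,
        url VARCHAR(255) NOT NULL,
        PRIMARY KEY (id, logged_at)   -- the partitioning column must be part of every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(logged_at)) (
        PARTITION p2023_12 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- EXPLAIN (or EXPLAIN PARTITIONS on older versions) shows which partitions are scanned:
    EXPLAIN SELECT COUNT(*) FROM access_log
    WHERE logged_at >= '2024-01-01' AND logged_at < '2024-02-01';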
99% of cases I have looked at do not benefit from PARTITIONing as much as from INDEXing.
My Rules of Thumb for using Partitioning are in http://mysql.rjweb.org/doc.php/partitionmaint . Also, that lists the only 4 use cases where partitioning improves performance.
OK, I can't say "exactly" 99%, but it is very close to that. I do believe strongly in the "4" -- I have been searching since partitioning was added to MySQL many years ago.
For Data Warehousing, the usual performance solution is to create and maintain "Summary tables". This works nicely for 'most' DW applications.
"Very large BTrees don't work"? Bull. A million-row index will have a BTree depth of about 3. A trillion rows -- about 6. Where's the "won't work"? A "point query" on a trillion row table will touch twice as many nodes in the BTree, and more of them are unlikely to be cached. But it "will work".
Infobright, with its "columnar storage", has its niche. TokuDB, with its "fractal indexing", has its niche. Neither one can say "we are better than BTrees most of the time". (Both those engines get part of their speed by compression.)
Bottom Line: Use an index. Probably a "composite" index. (More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql )
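A sketch of such a composite index, with hypothetical table and column names: if queries always filter on a tenant column and a date range, put the equality column first:

    ALTER TABLE events ADD INDEX idx_user_time (user_id, created_at);

    -- A typical query can then drill straight down to the relevant slice:
    SELECT COUNT(*) FROM events
    WHERE user_id = 123
      AND created_at >= '2024-01-01' AND created_at < '2024-02-01';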

Does the number of columns affect MySQL speed?

I have a table. I only need to run one type of query: find a given unique value in column 1, then get, say, the first 3 columns out.
Now, how much would it affect speed if I added an extra few columns to the table for basically "data storage"? I know I should use a separate table, but let's assume I am constrained to having just 1 table, so the only way is to add some columns at the end.
So, if I add on some columns, say 10 at the end, 30 varchar each, will this slow down the query given in the first sentence? If so, by how much of a factor, do you think, compared to without the extra redundant yet present columns?
Yes, extra data can slow down queries because it means fewer rows can fit into a page, and this means more disk accesses to read a certain number of rows and fewer rows can be cached in memory.
The exact factor in slow down is hard to predict. It could be negligible, but if you are near the boundary between being able to cache the entire table in memory or not, a few extra columns could make a big difference to the execution speed. The difference in the time it takes to fetch a row from a cache in memory or from disk is several orders of magnitude.
If you add a covering index the extra columns should have less of an impact as the query can use the relatively narrow index without needing to refer to the wider main table.
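A sketch of that covering index, with placeholder names (assuming col1 holds the unique lookup value): the composite index lets the query be answered from the index alone, without touching the wide rows.

    ALTER TABLE t ADD INDEX idx_cover (col1, col2, col3);

    -- EXPLAIN should report "Using index" for:
    SELECT col1, col2, col3 FROM t WHERE col1 = 'some-key';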
I don't understand the 'I know I should use a separate table' bit. What you've described is the reason you have a DB, to associate a key with some related data. Look at it another way, how else do you retrieve this information if you don't have the key?
To answer your question, the only way to know what the performance hit is going to be is empirical testing (though Mark's answer, posted just prior to mine, is one - of VERY many - factors to speed).
That depends a bit on how much data you already have in the records. The difference would normally be somewhere between almost none at all and not so big.
The difference comes from how much more data has to be loaded from disk to get to the data. The extra columns will likely mean that there is room for fewer records in each page, but it's possible that there happens to be enough room left in each page for most of the extra data, so that few extra blocks are needed. It depends on how well the current data lines up in the pages.

Should I index my sort fields in MySQL?

I have a field called 'sort_order' and it's a BIGINT; I use it in my MySQL queries for sorting.
Is it wise I put an index on it?
Generally, yes. If you are doing an ORDER BY on that field, it should probably be indexed. Of course you'll want to test it first to make sure it actually helps - if you only ever select a small number of rows it may not make that much of a difference.
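A minimal sketch of that test, with hypothetical table and index names:

    ALTER TABLE items ADD INDEX idx_sort_order (sort_order);

    -- If the index is used for the sort, EXPLAIN should not show "Using filesort":
    EXPLAIN SELECT * FROM items ORDER BY sort_order LIMIT 100;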
Do you honestly expect to have a sort_order value that maxes out at 9,223,372,036,854,775,807?! Assuming zero based, INT is still pretty large at a max of 2,147,483,647...
Depends on your queries, but I'd look at using it in a covering index before a stand alone one. MySQL has a space limit on indexes, you're likely to hit the ceiling if you define an index per column:
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
As Eric already mentioned above, the answer is Yes.
However, if you are doing a lot of inserts and updates in the table, keep in mind that MySQL builds a separate block of information for indexing that needs to be updated every time changes are made to the table. Thus there can be overhead in some cases.
So basically it's a mixed case and the circumstances should always be considered.
Generally, no. You justify indexes with searches. By the time you've reduced the record count to the number you normally display (say, less than several hundred) having an index doesn't buy you anything.
So only add an index if you will use the field for selecting (which would include, say, "LIMIT 500", for instance).

Are indexes good or bad for a large database?

I read on the MySQL Performance Blog that when tables are large, it is better to scan full tables instead of using indexes.
I have a table with tens of millions of rows. When conducting queries, if I use no indexes, then queries are 24 times slower than with indexes. I know a lot of things may cause this (e.g., whether rows are stored sequentially), but can you please give me some hints as to what might be happening? Or how I should start examining this issue? I want to understand when use of indexes is preferred and when it's not.
Thanks
The article says that when dealing with very large data sets, where the number of rows you need to work with is approaching the number of rows in the table, using an index might hurt performance.
In this case, going through the index will indeed hurt performance, as long as you need more data than is present in the index.
To go through the index, the database engine first has to read large parts of the index table (it is a type of table), then for each row (or set of rows) from this result, go to the real table and start cherry-picking pages to read.
If, on the other hand, you only need to retrieve columns that are already part of the index table, then the database engine only has to read from that, and not continue on to the full table for more data.
If you end up reading most or close to most of the actual table in question, all the work required to deal with the index might be more overhead than just doing a full table-scan to begin with.
Now, this is all the article is saying. For most work dealing with a database, using indexes is the exact right thing to do.
For instance, if you need to extract a small set of rows, going through an index instead of a full table scan will be many orders of magnitude faster.
In any case, if you're in doubt, you should do some performance profiling to find out how your application behaves under different types of loads, and then start tweaking, don't take a single article as a silver bullet for anything.
For instance, one way to speed up the example query in the article that does a count on the pad column would be to create a single index that covers both val and pad. That way, the count would simply be an index scan, not an index scan plus a table lookup, and would run faster than the full table scan.
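A sketch of that covering index (column and table names follow the description above; the exact query in the article may differ):

    ALTER TABLE t ADD INDEX idx_val_pad (val, pad);

    -- The count can now be satisfied by scanning the index only ("Using index" in EXPLAIN):
    SELECT COUNT(pad) FROM t WHERE val BETWEEN 1 AND 100;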
Your best option is to know your data, and to experiment, and to know how the tools you use work, so indeed, learn more about indexes, but in the end, it is you who decides what is best for your program.
As always, it depends. I've so far never run into a scenario as described in that blog post. Using indexes in my queries on large tables (50+ million rows) has made them on the order of 100 to 10,000 times faster than doing a full table scan on those big tables.
There's probably no silver bullet here, you have to test for your particular data and your particular queries.
It is good practice to put an index on each column that you use in a WHERE clause.