Am I doing indexes for my MySQL table right?

I have a MySQL table that has a list of messages, where each message belongs to a room. I do queries like this:
SELECT * FROM messages WHERE room='offtopic' ORDER BY id DESC LIMIT 5;
As my table has grown to hundreds of thousands of messages, the DB is becoming a bit slow. I added an index called room: BTREE, not unique, not packed, on column room(5), which now shows a cardinality of 425.
Will this help performance? Aka am I doing it right?

Yes, it's almost certainly a good idea for that particular query; I can't say for other queries, since you haven't shown them.
With an index on the room column, MySQL can discard a large proportion of the table early in the process, using the index to skip every row whose room is not the one you want. A cardinality of 425 (which is usually an estimate) means there are roughly that many distinct values in the index, so the query only has to examine the rows belonging to the one room you asked for - a small slice of the table - rather than the hundreds of thousands of rows you mentioned.
But the basic idea is to run queries in a production-like environment to see how they perform (by using explain), then add the index and try again to see what sort of improvement you get.
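For instance, the measure-add-measure loop might look like the sketch below (the index name idx_room is my own; the table and query come from your question):
EXPLAIN SELECT * FROM messages WHERE room='offtopic' ORDER BY id DESC LIMIT 5;
-- note the estimated "rows" and whether "key" is NULL before the index exists
ALTER TABLE messages ADD INDEX idx_room (room);
EXPLAIN SELECT * FROM messages WHERE room='offtopic' ORDER BY id DESC LIMIT 5;
-- the "rows" estimate should drop sharply and "key" should now show idx_room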
Optimisation is a fool's errand without measurement. The best mantra I've ever heard on the subject is "measure, don't guess".

Related

Optimised way to store large key value kind of data

I am working on a database that has a table user with columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this:
select user_id from user where user_service_id = 2;
Since the cardinality of the user_service_id column is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what the recommendation is for such use cases. Also, would it make more sense to move the data to another NoSQL datastore, as this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
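As a rough sketch of what that looks like in practice (the index name idx_service_user is my own; the table and columns are from your question):
ALTER TABLE user ADD INDEX idx_service_user (user_service_id, user_id);
-- both columns live in the index, so MySQL can answer the query from the index
-- alone, without touching the table rows
EXPLAIN SELECT user_id FROM user WHERE user_service_id = 2;
-- look for "Using index" in the Extra column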
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)

SQL Index to Optimize WHERE Query

I have a Postgres table with several columns; one column is the datetime when the row was last updated. My query is to get all the rows updated between a start and end time. My understanding is that I should use WHERE conditions rather than BETWEEN for this query. The basic query is as follows:
SELECT * FROM contact_tbl contact
WHERE contact."UpdateTime" >= '20150610' and contact."UpdateTime" < '20150618'
I am new at creating SQL queries, and I believe this query is doing a full table scan. I would like to optimize it if possible. I have placed a Normal index on the UpdateTime column, which takes a long time to create, but with this index the query is faster. One thing I am not sure about is whether I have to keep recalculating this index as the table gets bigger or columns get changed. Also, I am considering a CLUSTERED index on the UpdateTime column, but I wanted to ask first whether there is a canonical way of optimizing this and whether I am on the right track.
Placing an index on UpdateTime is correct. It will allow the index to be used instead of full table scans.
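A minimal sketch of that, using the table and column names from your question (the index name is an assumption):
CREATE INDEX idx_contact_updatetime ON contact_tbl ("UpdateTime");
-- with the index in place, the range predicate can be served by an index scan
-- instead of a full table scan
EXPLAIN SELECT * FROM contact_tbl
WHERE "UpdateTime" >= '20150610' AND "UpdateTime" < '20150618';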
Two WHERE conditions like the above versus using the BETWEEN keyword come down to exactly the same thing; BETWEEN a AND b is simply shorthand for >= a AND <= b:
http://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#operator_between
BETWEEN is just "syntactic sugar" for those who like that syntax better.
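For example, these two predicates match exactly the same rows (note BETWEEN is inclusive on both ends, so it corresponds to >= and <=, not the >= and < used in your query):
WHERE "UpdateTime" BETWEEN '20150610' AND '20150618'
-- is the same as
WHERE "UpdateTime" >= '20150610' AND "UpdateTime" <= '20150618'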
Indexes allow for faster reads, but slow down writes (because like you mention, the new data has to be inserted into the index as well). The entire index does not need to be recalculated. Indexes are smart data structures, so the extra data can be added without a lot of extra work, but it does take some.
You're probably doing many more reads than writes, so using an index is a good idea.
If you're doing lots of writes and few reads, then you'd want to think a bit more about it; it comes down to business requirements. If write latency is a hard requirement and read latency is not, you might not want the index, even though overall throughput may suffer.
For instance, think of this lottery example: every time someone buys a ticket, you have to record their name and the ticket number. However, the only time you ever have to read that data is after the one and only drawing, to see who had the winning ticket number. In this database, you wouldn't want to index the ticket number, since there will be so many writes and very few reads.

Why does this covering index query perform better when paginating far away from the beginning?

While reading "Optimizing Sorts" section of "High Performance MySQL 2nd Edition", I find it hard to understand the following:
mysql> SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;
Such queries can be a serious problem no matter how they’re indexed, because the
high offset requires them to spend most of their time scanning a lot of data that they
will then throw away.
...
Another good strategy for optimizing such queries is to use a covering index to
retrieve just the primary key columns of the rows you'll eventually retrieve. ... Here's an example that requires an index on (sex, rating) to work efficiently:
mysql> SELECT <cols> FROM profiles INNER JOIN (
    ->     SELECT <primary key cols> FROM profiles
    ->     WHERE sex='M' ORDER BY rating LIMIT 100000, 10
    -> ) AS x USING(<primary key cols>);
My question is: if the first query can't use the (sex, rating) index to find rows 100,000-100,010, how can the second query do it?
From "High Performance MySQL 2nd Edition"
Such queries can be a serious problem no matter how they’re indexed,
because the high offset requires them to spend most of their time
scanning a lot of data that they will then throw away. Denormalizing,
precomputing, and caching are likely to be the only strategies that
work for queries like this one. An even better strategy is to limit
the number of pages you let the user view. This is unlikely to impact
the user’s experience, because no one really cares about the 10,000th
page of search results.
Another good strategy for optimizing such queries is to use a covering
index to retrieve just the primary key columns of the rows you’ll
eventually retrieve. You can then join this back to the table to
retrieve all desired columns. This helps minimize the amount of work
MySQL must do gathering data that it will only throw away.
As the book itself states, there is no way to make such a query truly cheap; the covering-index trick just minimizes the work done per scanned row. The inner query selects only the primary key columns, so the scan stays entirely inside the covering index instead of pulling all columns for 100,010 rows, most of which would be thrown away. Only the 10 primary key values that survive the LIMIT are then joined back to the table to fetch the full rows.
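To make the book's placeholder example concrete, here is a sketch that assumes profiles uses InnoDB with a single-column primary key named id and has the (sex, rating) index the book requires:
SELECT p.* FROM profiles AS p
INNER JOIN (
    SELECT id FROM profiles
    WHERE sex='M' ORDER BY rating LIMIT 100000, 10
) AS x USING (id);
-- the subquery reads only the (sex, rating) index (InnoDB secondary indexes carry
-- the primary key implicitly), so the 100,000 skipped rows never cost a table
-- lookup; only the final 10 id values are joined back to fetch the full rows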

Indexing on a column with few fixed values where each value constitutes less than 25% of total rows

I have a field table_name in a table which can have only 20 different values. The table has a few tens of thousands of rows in total. If I do a query like this:
SELECT * FROM table WHERE table_name = 'adasd';
At most, the returned records are 25% of the total rows; usually I get only about 10% of the total records. Is there any scope to index the table_name field here? I hear that for indexes to work well, the values in that field need to be unique or close to it. In my case, it's not at all close to unique. But I also heard that if the number of returned rows is small compared to the total number of rows, it makes a good case for indexing.
How should I go about this?
No, they don't have to be unique for an index to be beneficial; however, take some time to think about what the DBMS does when processing a query:
Full table scan - a sequential read through the data (i.e. very few seek operations)
Index lookup - a few seeks on the index to find the start of the selected data, then a sequential read through the index (few seeks) to identify the matching rows, then LOTS AND LOTS of seeks to fetch those rows from the underlying table
Seeks are expensive.
(there is a secondary effect of full table scans in that they are more prone to flushing hot data out of the cache - but you should address the primary concern first).
In this case, it's unlikely that the DBMS would use the index if it were present, and even if it did, it would probably be slower than a full table scan. As a (very) rough rule of thumb, you're only going to get a benefit from an index if a predicate identifies less than around 5% of the rows (but it will vary depending on the relative size of the index and the data).
i.e. don't bother adding an index on this field alone.
I think you may benefit from spending some time thinking about why you need to run queries that return so many rows.
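To see why, here's a sketch (I've used my_table as a hypothetical table name, since `table` itself is a reserved word, and the index name is my own):
ALTER TABLE my_table ADD INDEX idx_table_name (table_name);
EXPLAIN SELECT * FROM my_table WHERE table_name = 'adasd';
-- with the predicate matching 10-25% of the rows, the optimizer will typically
-- report type: ALL (a full table scan) and leave idx_table_name unused, because
-- the scattered row lookups the index would trigger cost more than one
-- sequential pass through the table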
Revised Answer
I just learned that creating an index does not mean that MySQL will use it. Keeping that in mind, I will re-phrase my answer:
You should create an index on that column if general practice (or your own experience) suggests you do so. MySQL uses heuristics, which include looking at the available indexes and their respective cardinality, to determine the best index to use, or whether to use an index at all.
Interesting reading about this topic here.

mysql index optimization for a table with multiple indexes that index some of the same columns

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:
id, site_id, unixtime, unixtime_last, ip_address, uid
There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid
There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.
Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.
Any ideas on making this more efficient?
I don't really understand B-trees beyond some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? I considered having site_id be the second column of the index for both ip_address and uid, but I think that would make the index less efficient, since the IP and UID are going to vary far more than the site ID will: we only have about 8,000 unique sites per database server, but millions of unique visitors across all ~8,000 sites on a daily basis.
I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:
select id from sessions where uid = 'value' and site_id = 123 limit 1
... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.
Any insight on making this as efficient as possible would be appreciated :)
Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage space. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.
We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.
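A small sketch of that rule against the sessions table from the question (the index names mirror the ones above, and I've spelled out ip_address for the ip column):
ALTER TABLE sessions ADD INDEX site_ip (site_id, ip_address);
ALTER TABLE sessions ADD INDEX ip_site (ip_address, site_id);
EXPLAIN SELECT id FROM sessions WHERE ip_address = '1.2.3.4';
-- only ip_site is a candidate here: the predicate is a prefix of ip_site but
-- only a suffix of site_ip, so site_ip cannot narrow the search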
There is a second property of B-tree indices you should be aware of as well: they are sorted. A B-tree index can be used for queries like where site_id < 40.
There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.
The same, by the way, applies to seeks through an index. Finding a key in a B-tree actually potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, because each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.
Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.
PS: Benchmark InnoDB as well; it often has better concurrent performance. Same with PostgreSQL.
First of all, if you are storing the IP as a string, change it to an INT UNSIGNED column and use the INET_ATON(expr) and INET_NTOA(expr) functions to convert back and forth. Indexing an integer value is more efficient than indexing variable-length strings.
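A minimal sketch of that conversion (the ip_num column name is my own invention):
ALTER TABLE sessions ADD COLUMN ip_num INT UNSIGNED;
UPDATE sessions SET ip_num = INET_ATON(ip_address);  -- '1.2.3.4' becomes 16909060
SELECT INET_NTOA(ip_num) FROM sessions WHERE ip_num = INET_ATON('1.2.3.4');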
Well, indexes trade storage for performance; it's hard if you want both. It's also hard to optimize this any further without knowing all the queries you run and how often each one runs.
What you have will work. If you're running into a bottleneck, you'll need to find out whether it's CPU, RAM, disk and/or network and adjust accordingly. It's hard, and wrong, to optimize prematurely.
You probably want to switch to InnoDB if you have any updates; otherwise MyISAM is good for insert/select. Also, since your row size is small, you could look into MySQL Cluster (NDB). There is also an ARCHIVE engine that can help with storage requirements, but partitioning in 5.1 is probably a better thing to look into.
Flipping the order of your indexes doesn't make any sense if those indexes are already used in all of your queries.
but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
I'm not sure, but I haven't heard this before, and it doesn't seem true to me for this application. The index order matters for sorting, and having different leading columns across your indexes allows more possible queries to use an index.