Optimised way to store large key value kind of data - mysql

I am working on a database that has a table user having columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this :
select user_id from user where user_service_id = 2;
Since the cardinality of the column user_service_id is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what is the recommendation for such usecases. Also, would it make more sense to move the data to another nosql datastore as this doesn't seem to be an efficient usecase for MySQL or any SQL datastore? Tried to search this but couldn't find any recommendations here. Can someone please help or provide the necessary references?
Thanks in advance.

That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
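To see what "covering" means in practice, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere); the index name idx_service_user and the sample data are made up. The plan text is SQLite's, not MySQL's, but the idea is the same: the query is answered from the index alone. Reading in batches is one possible way to avoid choking the client.

```python
import sqlite3

# SQLite stands in for MySQL here; table contents and index name are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id INTEGER, user_service_id INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?)",
    [(i, i % 4) for i in range(1000)],  # 4 distinct service ids, like the question
)
# Composite index: filter column first, then the selected column.
conn.execute("CREATE INDEX idx_service_user ON user (user_service_id, user_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT user_id FROM user WHERE user_service_id = 2"
).fetchall()
plan_text = " | ".join(row[-1] for row in plan)  # last column holds the description
print(plan_text)

# Consume the result in batches instead of materialising every row at once.
cursor = conn.execute("SELECT user_id FROM user WHERE user_service_id = 2")
total = 0
while True:
    batch = cursor.fetchmany(100)
    if not batch:
        break
    total += len(batch)
print(total)
```

In MySQL the equivalent check would be EXPLAIN on the same SELECT, looking for "Using index" in the Extra column.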
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)

Will records order change between two identical query in mysql without order by

The problem is that I need to do pagination. I want to use order by and limit, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need order by.
So I want to ask whether what he said is correct, assuming of course that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
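A minimal sketch of paginating by the primary key with an explicit order by, using Python's sqlite3 as a stand-in for MySQL (an assumption; the items table and the fetch_page helper are hypothetical names). Each page starts strictly after the last id of the previous page, so the pages are well defined no matter which plan the optimizer picks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 26)],
)

def fetch_page(conn, after_id, page_size):
    """Return the ids of the next page, strictly after after_id, ordered by id."""
    rows = conn.execute(
        "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    return [r[0] for r in rows]

pages = []
last_id = 0
while True:
    page = fetch_page(conn, last_id, 10)
    if not page:
        break
    pages.append(page)
    last_id = page[-1]  # the next page resumes after the last id seen
```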
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't, neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown on two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.
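One common way to avoid the duplicated-row problem from the anecdote is to add a unique column as a tie-breaker to the ORDER BY. A small sketch, again with sqlite3 standing in for MySQL (an assumption; the items table and page helper are made-up names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT)")
conn.executemany(
    "INSERT INTO items (id, type) VALUES (?, ?)",
    [(i, "a" if i % 2 else "b") for i in range(1, 31)],  # only 2 distinct types
)

def page(conn, offset, size):
    # "type" alone is not unique; adding "id" makes the ordering total,
    # so every row has exactly one position and lands on exactly one page.
    rows = conn.execute(
        "SELECT id FROM items ORDER BY type, id LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()
    return [r[0] for r in rows]

first = page(conn, 0, 10)
second = page(conn, 10, 10)
third = page(conn, 20, 10)
```

With ORDER BY type alone, the engine would be free to return rows of the same type in any order on each call, which is how one row can show up on two pages.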

Whether or not SQL query (SELECT) continues or stops reading data from table when find the value

Greetings,
My question: does an SQL query (SELECT) continue or stop reading data (records) from the table when it finds the value that I was looking for?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: In general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have an order by clause, the database engine cannot stop after the first hit, because there might be later rows that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
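The reasoning about when a scan can stop can be illustrated with a toy model in plain Python. This is not how MySQL is implemented; the scan function and the sample rows are invented purely to count how many rows have to be examined in each case.

```python
def scan(rows, predicate, limit=None, order_key=None):
    """Toy full-table scan. Returns (matching_rows, rows_examined)."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if predicate(row):
            matches.append(row)
            # Without a sort, a LIMIT lets the scan stop as soon as it is satisfied.
            if order_key is None and limit is not None and len(matches) >= limit:
                break
    if order_key is not None:
        # With a sort, every row had to be examined before the winner is known.
        matches.sort(key=order_key)
        if limit is not None:
            matches = matches[:limit]
    return matches, examined

rows = [{"id": i, "a": i % 5, "b": i % 7} for i in range(1, 1001)]
wanted = lambda r: r["a"] == 3 and r["b"] == 5

_, examined_all = scan(rows, wanted)                 # no limit: whole table
_, examined_limit = scan(rows, wanted, limit=1)      # stops at the first hit
_, examined_sorted = scan(rows, wanted, limit=1, order_key=lambda r: -r["id"])
```

Here the first matching row has id 33, so the limit 1 scan examines 33 rows, while the other two examine all 1000.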
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns when running the same query several times, a couple of minutes apart.
Yes, it scans the entire file, unless you use something like
select * from user where id=100 limit 1
This will of course still search all the rows if id 100 is the last record.
If id is a primary key, it will automatically be indexed and the search will be optimized.
I'm sorry... I thought the table.
I will change the question and explain it in the following image:
I understand that in CASE 1 all columns must be read with each iteration.
My question is whether it is the same in CASE 2, or whether columns that are not selected in the query are excluded from reading in each iteration.
Also, are both queries the same from a performance perspective?
Clarify:
CASE 1: the select prints all data
CASE 2: the select prints the columns first_name and last_name
In CASE 2, does the MySQL server (SQL query) read only the columns first_name and last_name, or does it read the entire table to get that data (rows) = (first_name, last_name)?
I am interested in how the server reads a table row in CASE 1 and in CASE 2.

Am I doing indexes for my MySQL table right?

I have a MySQL table that has a list of messages, where each message belongs to a room. I do queries like this:
SELECT * FROM messages WHERE room='offtopic' ORDER BY id DESC LIMIT 5;
As my table has grown to hundreds of thousands of messages, the DB is becoming a bit slow. I added an index called room, BTREE, not unique, not packed, column room(5), and now with a cardinality of 425.
Will this help performance? Aka am I doing it right?
Yes, it's almost certainly a good idea for that particular query, can't say for other queries since you haven't shown them.
By adding an index on the room column, the engine should be able to discount a large proportion of the table far more efficiently early in the process, using the index to skip every row where the room is not the one you want. A cardinality of 425 (which is usually an estimate) means that there are about that many distinct values in the index, so if messages are spread fairly evenly, a single room matches roughly 1/425 of the table rather than the hundreds of thousands of rows you mentioned.
But the basic idea is to run queries in a production-like environment to see how they perform (by using explain), then add the index and try again to see what sort of improvement you get.
Optimisation is a fool's errand without measurement. The best mantra I've ever heard on the subject is "measure, don't guess".
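A minimal sketch of the "look at the plan, add the index, look again" loop. It uses Python's sqlite3 as a stand-in for MySQL (an assumption), so the plan text is SQLite's EXPLAIN QUERY PLAN output rather than MySQL's EXPLAIN; the index name idx_room and the generated data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, room TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO messages (room, body) VALUES (?, ?)",
    [(f"room{i % 50}", f"message {i}") for i in range(5000)],
)

QUERY = "SELECT * FROM messages WHERE room = 'room7' ORDER BY id DESC LIMIT 5"

def explain(conn, sql):
    """Return the query plan as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

plan_before = explain(conn, QUERY)            # measured without the index
conn.execute("CREATE INDEX idx_room ON messages (room)")
plan_after = explain(conn, QUERY)             # measured again with the index
latest = [r[0] for r in conn.execute(QUERY)]  # results are the same either way

print(plan_before)
print(plan_after)
```

On a real system you would also time the query on production-like data, since a plan that looks better is not guaranteed to be faster.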

mysql index optimization for a table with multiple indexes that index some of the same columns

I have a table that stores some basic data about visitor sessions on third party web sites. This is its structure:
id, site_id, unixtime, unixtime_last, ip_address, uid
There are four indexes: id, site_id/unixtime, site_id/ip_address, and site_id/uid
There are many different types of ways that we query this table, and all of them are specific to the site_id. The index with unixtime is used to display the list of visitors for a given date or time range. The other two are used to find all visits from an IP address or a "uid" (a unique cookie value created for each visitor), as well as determining if this is a new visitor or a returning visitor.
Obviously storing site_id inside 3 indexes is inefficient for both write speed and storage, but I see no way around it, since I need to be able to quickly query this data for a given specific site_id.
Any ideas on making this more efficient?
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct? Because I considered having the site_id being the second column of the index for both ip_address and uid but I think that would make the index less efficient since the IP and UID are going to vary more than the site ID will, because we only have about 8000 unique sites per database server, but millions of unique visitors across all ~8000 sites on a daily basis.
I've also considered removing site_id from the IP and UID indexes completely, since the chances of the same visitor going to multiple sites that share the same database server are quite small, but in cases where this does happen, I fear it could be quite slow to determine if this is a new visitor to this site_id or not. The query would be something like:
select id from sessions where uid = 'value' and site_id = 123 limit 1
... so if this visitor had visited this site before, it would only need to find one row with this site_id before it stopped. This wouldn't be super fast necessarily, but acceptably fast. But say we have a site that gets 500,000 visitors a day, and a particular visitor loves this site and goes there 10 times a day. Now they happen to hit another site on the same database server for the first time. The above query could take quite a long time to search through all of the potentially thousands of rows for this UID, scattered all over the disk, since it wouldn't be finding one for this site ID.
Any insight on making this as efficient as possible would be appreciated :)
Update - this is a MyISAM table with MySQL 5.0. My concerns are both with performance as well as storage space. This table is both read and write heavy. If I had to choose between performance and storage, my biggest concern is performance - but both are important.
We use memcached heavily in all areas of our service, but that's not an excuse to not care about the database design. I want the database to be as efficient as possible.
I don't really understand B-trees besides some very basic stuff, but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
There is one important property of B-tree indices you need to be aware of: It is possible (efficient) to search for an arbitrary prefix of the full key, but not a suffix. If you have an index site_ip(site_id, ip), and you ask for where ip = 1.2.3.4, MySQL will not use the site_ip index. If you instead had ip_site(ip, site_id), then MySQL would be able to use the ip_site index.
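The prefix rule can be checked with a small experiment. This sketch uses Python's sqlite3 as a stand-in for MySQL (an assumption; the two engines have different optimizers, but both use B-tree indices with this property). The table layout is simplified and the explain helper is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, site_id INTEGER, ip INTEGER, uid TEXT)"
)
conn.execute("CREATE INDEX site_ip ON sessions (site_id, ip)")

def explain(sql):
    """Return the query plan as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(r[-1] for r in rows)

# Searching on the leading column (a prefix of the key) can use site_ip.
prefix_plan = explain("SELECT * FROM sessions WHERE site_id = 123")
# Searching only on the second column (a suffix) cannot seek into site_ip.
suffix_plan = explain("SELECT * FROM sessions WHERE ip = 16909060")
# An index that leads with ip makes the same query seekable.
conn.execute("CREATE INDEX ip_site ON sessions (ip, site_id)")
fixed_plan = explain("SELECT * FROM sessions WHERE ip = 16909060")

print(prefix_plan)
print(suffix_plan)
print(fixed_plan)
```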
There is a second property of B-tree indices you should be aware of as well: they are sorted. A B-tree index can be used for queries like where site_id < 40.
There is also an important property of disk drives to keep in mind: sequential reads are cheap, seeks are not. If there are any columns used that are not in the index, MySQL must read the row from the table data. That's generally a seek, and slow. So if MySQL believes it'd wind up reading even a small percent of the table like this, it'll instead ignore the index. One big table scan (a sequential read) is usually faster than random reads of even a few percent of the rows in a table.
The same, by the way, applies to seeks through an index. Finding a key in a B-tree actually potentially requires a few seeks, so you'll find that WHERE site_id > 800 AND ip = '1.2.3.4' may not use the site_ip index, because each site_id requires several index seeks to find the start of the 1.2.3.4 records for that site. The ip_site index, however, would be used.
Ultimately, you're going to have to make liberal use of benchmarking and EXPLAIN to figure out the best indices for your database. Remember, you can freely add and drop indices as needed. Non-unique indices are not part of your data model; they are merely an optimization.
PS: Benchmark InnoDB as well, it often has better concurrent performance. Same with PostgreSQL.
First of all, if you are storing ip as a string, then change it to an INT UNSIGNED column and use the INET_ATON(expr) and INET_NTOA(expr) functions to deal with this. Indexing on an integer value is more efficient than indexing on strings of variable length.
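For reference, here is what that conversion computes, mirrored in Python with the standard ipaddress module. The function names inet_aton and inet_ntoa are just local helpers named after the MySQL functions; in the database you would call INET_ATON and INET_NTOA directly.

```python
import ipaddress

def inet_aton(dotted):
    """Dotted-quad IPv4 string -> unsigned 32-bit integer (fits INT UNSIGNED)."""
    return int(ipaddress.IPv4Address(dotted))

def inet_ntoa(number):
    """Unsigned 32-bit integer -> dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(number))

print(inet_aton("1.2.3.4"))   # 1*256**3 + 2*256**2 + 3*256 + 4
print(inet_ntoa(16909060))
```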
Well, indexes trade storage for performance. It's hard if you want both. It's hard to optimize this any further without knowing all the queries you run and their quantities per interval.
What you have will work. If you're running into a bottleneck, you'll need to find out whether it's CPU, RAM, disk and/or network and adjust accordingly. It's hard and wrong to prematurely optimize.
You probably want to switch to InnoDB if you have any updates; otherwise MyISAM is good for insert/select. Also, since your row size is small, you could look into MySQL Cluster (NDB). There is also an ARCHIVE engine that can help with storage requirements, but partitioning in 5.1 is probably a better thing to look into.
Flipping the order of your index doesn't make any sense, if these indexes are already used in all of your queries.
but it's more efficient to have the left-most column of an index be the one with the least variance - correct?
Not sure, but I haven't heard this before. It doesn't seem true to me for this application. The index order matters for sorting, and having different leading columns across your indexes allows more possible queries to use an index.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is that, without an index on status, a select with where status='enabled' would result in a full table scan with 300 Million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the where status='enabled' clause in the given example, with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: you have a where clause of where status = 'enabled', so the index will return 150m rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (the full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
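If you do go for two tables, the move could look roughly like this. The sketch uses Python's sqlite3 as a stand-in for MySQL (an assumption); the table names people_enabled and people_disabled and the disable helper are hypothetical. The point is only that the insert and the delete happen inside one transaction, so a failure cannot leave the record in both tables or in neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people_enabled (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("CREATE TABLE people_disabled (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO people_enabled VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.commit()

def disable(conn, person_id):
    """Move one record from the enabled table to the disabled table atomically."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT INTO people_disabled "
            "SELECT id, fullname FROM people_enabled WHERE id = ?",
            (person_id,),
        )
        conn.execute("DELETE FROM people_enabled WHERE id = ?", (person_id,))

disable(conn, 2)
enabled_ids = [r[0] for r in conn.execute("SELECT id FROM people_enabled ORDER BY id")]
disabled_ids = [r[0] for r in conn.execute("SELECT id FROM people_disabled ORDER BY id")]
```

In MySQL this requires a transactional engine such as InnoDB; MyISAM tables would not roll back.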
It just popped into my head that you can check whether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like where status="enabled" will avoid scanning half of the table.
Whether an index on such a table helps depends entirely on the distribution of the data. For example, if 90% of the entries have status enabled and 10% have status disabled, a query with where status="disabled" scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
@a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
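The two distributions above can be put into numbers. In this sketch the 5% cut-off is a purely hypothetical figure chosen for illustration; the real break-even point depends on the engine, the storage and the row size, and is best found by measuring.

```python
def selectivity(counts):
    """Fraction of the table matched by each value."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Hypothetical cut-off: assume an index only pays off when a predicate
# matches at most 5% of the rows. This number is an assumption, not a MySQL rule.
ASSUMED_THRESHOLD = 0.05

def index_likely_helps(fraction, threshold=ASSUMED_THRESHOLD):
    return fraction <= threshold

even = selectivity({"enabled": 150_000_000, "disabled": 150_000_000})
skewed = selectivity({"enabled": 299_000_000, "disabled": 1_000_000})

for name, dist in (("even", even), ("skewed", skewed)):
    for value, fraction in dist.items():
        print(name, value, round(fraction, 4), index_likely_helps(fraction))
```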
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "limit 10", an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
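A rough back-of-the-envelope calculation of that saving for the table in the question. It assumes a single-byte character set and one length byte per VARCHAR value, and it ignores row overhead and any index on the column, so treat the result as an order-of-magnitude estimate only.

```python
ROWS_PER_STATUS = 150_000_000   # 150M enabled + 150M disabled, from the question
LENGTH_PREFIX_BYTES = 1         # assumed: short VARCHAR values carry a 1-byte length
TINYINT_BYTES = 1

def varchar_bytes(value):
    """Approximate bytes needed to store one VARCHAR value (single-byte charset)."""
    return len(value.encode("ascii")) + LENGTH_PREFIX_BYTES

varchar_total = ROWS_PER_STATUS * (varchar_bytes("enabled") + varchar_bytes("disabled"))
tinyint_total = 2 * ROWS_PER_STATUS * TINYINT_BYTES
saved = varchar_total - tinyint_total

print(varchar_total, tinyint_total, saved)
```

Under these assumptions the column shrinks from about 2.55 GB to 0.3 GB of raw data.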
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.