This is actually an interview question from a company that builds high-load services.
Suppose we have a table holding 1 TB of records with a primary B-tree index.
We need to select all records in a range from 5000 to 5000000.
We cannot block the whole database. The database is under high load.
Does it make sense to split a huge select query into parts like
select * from a where id >= 5000 and id < 10000;
select * from a where id >= 10000 and id < 15000;
...
Please help me compare the behaviour of Postgres and MySQL in this case.
Are there any other optimal techniques to select all required records?
Thanks.
There are many unknowns in your question. First of all, what is the table structure? Will this query use any indexes?
The best way to find out is to run an execution plan and analyze performance.
But trying to retrieve so many rows in one pass does not seem very reasonable. The query will very likely put heavy load on the server, consume a lot of RAM, and probably spill to a temp file. It could fail or time out.
And then the resultset has to travel across the network, and it could be huge. You have to evaluate the size of the dataset; we cannot guess it without insight into the table structure.
The big question is, why retrieve so many rows, what is the ultimate goal? Say you have a GUI application with a datagridview or something like that. You are not going to display 5 million rows at once; this would crash the application. What the user probably wants is to paginate or search records using some filter. Maybe you'll show a few hundred records at a time, max.
What are you going to do with all those records ?
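If the rows really do have to be read in bulk, the range-splitting idea from the question can be sketched as keyset iteration over the primary key. This is only a minimal sketch under assumptions: SQLite stands in for Postgres/MySQL so that it runs anywhere, the table `a` is given a made-up `payload` column, and `iter_range_in_chunks` is a hypothetical helper name. Each statement is short and bounded by `LIMIT`, so no single query has to hold the whole range.

```python
import sqlite3


def iter_range_in_chunks(conn, lo, hi, chunk_size=5000):
    """Yield lists of rows with lo <= id < hi, one bounded chunk at a time."""
    last_id = lo - 1
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM a "
            "WHERE id > ? AND id < ? ORDER BY id LIMIT ?",
            (last_id, hi, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        # Resume strictly after the last key we saw (keyset pagination).
        last_id = rows[-1][0]


# Demo data: a small in-memory table instead of the 1 TB one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(1, 20001)),
)

chunks = list(iter_range_in_chunks(conn, 5000, 15000, chunk_size=3000))
chunk_sizes = [len(chunk) for chunk in chunks]
```

Resuming from the last key seen, rather than from precomputed boundaries, keeps every chunk the same size even when the id sequence has gaps.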
I am working on a database that has a table user having columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this :
select user_id from user where user_service_id = 2;
Since the cardinality of the column user_service_id is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what the recommendation is for such use cases. Also, would it make more sense to move the data to a NoSQL datastore, as this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
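To see what "covering" buys you, here is a small sketch. The assumptions are mine: SQLite is used instead of MySQL so that it runs without a server, the table is called `users` with an extra made-up `name` column, and `idx_service_user` is an illustrative index name. The plan text differs between engines, but the idea is the same: the query is answered from the index alone, without touching the table rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "user_id INTEGER PRIMARY KEY, user_service_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO users (user_id, user_service_id, name) VALUES (?, ?, ?)",
    ((i, i % 4, f"user-{i}") for i in range(1, 1001)),
)

# Composite index in the order suggested above: filter column first,
# then the column being selected.
conn.execute(
    "CREATE INDEX idx_service_user ON users (user_service_id, user_id)"
)

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id FROM users WHERE user_service_id = 2"
).fetchall()
# The human-readable detail is the last column of each plan row.
plan_text = " ".join(row[-1] for row in plan_rows)

matching = conn.execute(
    "SELECT COUNT(*) FROM users WHERE user_service_id = 2"
).fetchone()[0]
```

In MySQL the equivalent check is EXPLAIN on the query, where "Using index" in the Extra column indicates a covering index.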
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)
I'm not very knowledgeable about databases. I would want to retrieve, say the "newest" 10 rows with owner ID matching something, and then perhaps paginate to retrieve the next "newest" 10 rows with that owner, and so on. But say I'm adding more and more rows into a database table -- at some point, would such a query become unbearably slow, or are databases generally good enough that this won't be a worry?
I imagine it would be an issue because to get the "newest" 10 rows you'd have to order by date, which is O(n log n). With this assumption, I sought a possible solution from SQL Server SELECT LAST N Rows.
It pointed me to http://www.sqlservercurry.com/2009/02/retrieve-last-n-rows-based-on-condition.html where I found that there is a PARTITION BY option for a query. I imagine this means first selecting all the rows that match the owner ID, and THEN ordering them, which would be significantly faster, and fast enough to not worry about for most applications. Is this the correct understanding?
Otherwise, is there some better way to get the "newest" N rows ( seems to suggest it is)?
I'm developing the app in Django if anyone knows a convenient way, but otherwise Django also allows raw database queries.
Okay, if you are using Django, the ORM hides most of the database complexity from you.
Django querysets are lazy: no SQL is run until the queryset is actually evaluated, which avoids unnecessary database hits.
So, for the first part of your question, you can simply run this query:
queryset = YourModel.objects.filter(**lookup_condition).order_by('-id')
This gives you a queryset of the objects of that model class that match the condition, newest first (assuming id grows with insertion time). For details, check this: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.filter
And to paginate over it, slice it like this (the end index is exclusive, so each slice returns 10 rows):
first_ten_values = queryset[0:10]
second_ten_values = queryset[10:20]
...
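For reference, a queryset slice ends up as a LIMIT/OFFSET query, so the same "newest 10 for an owner" pagination can be sketched in plain SQL. This is only an illustration under assumptions: SQLite stands in for the real database, and the `items` table, its columns, the `idx_owner_created` index and the `newest_page` helper are all made up. An index on (owner_id, created_at) is the usual way to let the database read the newest rows for one owner directly instead of sorting every match.

```python
import sqlite3

PAGE_SIZE = 10


def newest_page(conn, owner_id, page):
    """Return one page (0-based) of the owner's rows, newest first."""
    return conn.execute(
        "SELECT id, owner_id, created_at FROM items "
        "WHERE owner_id = ? "
        "ORDER BY created_at DESC, id DESC "  # id breaks ties deterministically
        "LIMIT ? OFFSET ?",
        (owner_id, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, owner_id INTEGER, created_at INTEGER)"
)
# 300 rows spread over three owners; created_at grows with id.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    ((i, i % 3, 1000 + i) for i in range(1, 301)),
)
conn.execute("CREATE INDEX idx_owner_created ON items (owner_id, created_at)")

first_page = newest_page(conn, 1, 0)
second_page = newest_page(conn, 1, 1)
```

Large OFFSET values still make the database walk past all the skipped rows, so for deep pagination it is common to filter on "created_at less than the last value seen" instead of using an offset.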
Case 1: I have a table A that receives about 1 insert per second.
From my admin interface I need to run some heavy reads and deletes on this table for statistics and maintenance.
Does it make sense to insert incoming data into 2 different tables, A and B, and use table B for my administration? The goal is to avoid overloading table A.
Case 2:
Another example, to fully understand the logic: I have a table (tmpA) dedicated to storing search results. Each time there is a search, the results are inserted into this table and used for pagination. At night, old results are deleted.
Currently I have 5 requests per second for this table, so approximately 500 rows * 5 = 2500 rows per second.
Does it make sense to create more tables (tmpA, tmpB, tmpC, etc.) to spread the inserts and avoid overload?
For case 1, if it makes sense to duplicate, what is the difference between inserting incoming data "manually" into 2 (or more) different tables and using MySQL replication?
Thanks to you,
jess
This is kinda difficult to answer, as it depends on your setup hardware-wise.
One insert per second isn't that much. A properly set up server should be able to handle it.
Reads on an InnoDB table are non-blocking, so gathering data for statistics (assuming you don't do the statistics calculations in the database) shouldn't noticeably affect the performance of your database.
Deletes, on the other hand, are blocking and will add to the load on a table with heavy inserts.
For Case 1, I do not understand how you would want to split the load on different tables. Generally speaking, there's a database-server load, and not specifically a table load (unless we define blocking processes as table load).
I gather from the comments that Case 1 is user signups/registrations. Splitting user information over two tables is horrid from a maintenance perspective, and the coupling between the two tables that inevitably has to happen only increases overhead (load) instead of decreasing it. Deleting data (users?) is also a major issue if the data is divided over two tables. Can you explain how you would administer your data if it is divided over two tables? I'm probably missing something.
Looking at the above, I do not recommend splitting this data between tables.
What I do recommend is:
Use InnoDB as the storage engine. It locks at row level, whereas MyISAM locks the whole table.
Optimize your RAM/memory usage for MySQL. Proper memory settings allow for very quick reads and writes.
Optimize your indexes. The EXPLAIN statement can show which ones are used for each query.
Case 2
I don't fully understand the use case, but it might make sense to split this data up into several tables. Depending on why you want to push the data into these temp tables, the split could be per user, per keyword or by some other significant feature.
Depending on the use case, try limiting the search results (and thus implementing pagination) with LIMIT clauses. That way you don't need to store results for pagination, or store the results at all. Can you explain why you want to store these results? 2500 rows/sec is a lot.
Replication is a whole other topic, much more complicated, and is achieved not by copying tables but by copying servers. I can't help you with that; I've never done it, as I never needed it. (My largest MySQL server was approx. 80 GB, 350 million rows, with inserts peaking at 224 rows per second.)
Can you paste the structure of the tables you currently use, and some sample data? That might make the cases a tad clearer.
We have a data warehouse with denormalized tables ranging from 500K to 6+ million rows. I am developing a reporting solution, so we are utilizing database paging for performance reasons. Our reports have search criteria and we have created the necessary indexes, however, performance is poor when dealing with the million(s) row tables. The client is set on always knowing the total records, so I have to fetch the data as well as the record count.
Are there any other things I can do to help with performance? I'm not the MySQL dba and he has not really offered anything up, so I'm not sure what he can do configuration wise.
Thanks!
You should use "Partitioning"
Its main goal is to reduce the amount of data read for particular SQL operations so that overall response time is reduced.
Refer:
http://dev.mysql.com/tech-resources/articles/performance-partitioning.html
If you partition the large tables and store the parts on different servers, then your queries can run faster, provided each query only has to touch a few of the partitions.
see: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Also note that using NDB tables you can use HASH keys that get looked up in O(1) time.
For the row count, you can keep a running total in a separate table and update it, for example in AFTER INSERT and AFTER DELETE triggers.
Although the triggers will slow down inserts and deletes, this cost is spread over time. Note that you don't have to keep all totals in one row; you can store totals per condition. Something like:
table   field   condition  row_count
----------------------------------------
table1  field1  cond_x     10
table1  field1  cond_y     20
select sum(row_count) as count_cond_xy
from totals where field = 'field1' and `table` = 'table1'
and `condition` like 'cond_%';
-- just a silly example; you can come up with more efficient code, but I hope
-- you get the gist of it.
If you find yourself always counting on the same conditions, this can take your redesigned select count(x) from bigtable where ... from minutes to near-instant.
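The running-total idea can be sketched end to end like this. It is a minimal sketch under assumptions: SQLite is used so that it runs without a MySQL server (the trigger syntax differs slightly in MySQL), and `bigtable`, its `status` column, the trigger names and `fast_count` are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigtable (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE totals (status TEXT PRIMARY KEY, row_count INTEGER NOT NULL);
INSERT INTO totals VALUES ('cond_x', 0), ('cond_y', 0);

-- Keep the per-condition totals in step with every insert and delete.
CREATE TRIGGER bigtable_ai AFTER INSERT ON bigtable BEGIN
    UPDATE totals SET row_count = row_count + 1 WHERE status = NEW.status;
END;
CREATE TRIGGER bigtable_ad AFTER DELETE ON bigtable BEGIN
    UPDATE totals SET row_count = row_count - 1 WHERE status = OLD.status;
END;
""")

conn.executemany(
    "INSERT INTO bigtable (status) VALUES (?)",
    [("cond_x",)] * 10 + [("cond_y",)] * 20,
)
conn.execute("DELETE FROM bigtable WHERE id = 1")  # removes one cond_x row


def fast_count(conn, pattern):
    """Sum the precomputed totals instead of counting rows in bigtable."""
    return conn.execute(
        "SELECT COALESCE(SUM(row_count), 0) FROM totals WHERE status LIKE ?",
        (pattern,),
    ).fetchone()[0]


count_xy = fast_count(conn, "cond_%")
real_count = conn.execute("SELECT COUNT(*) FROM bigtable").fetchone()[0]
```

The lookup touches only the handful of rows in `totals`, no matter how large `bigtable` grows, at the cost of one extra update per insert or delete.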
1. So if I search for the word ball in the toys table, where I have 5,000,000 entries, does it search through all 5 million?
I think the answer is yes, because how else would it know, but please let me know.
2. If yes: if I need more information from that table, isn't it more logical to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I would query like this:
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't it more logical to query the whole table
mysql_query("SELECT * FROM toys"); and then do this: if($query['state'] == 1)?
3. And something else: if I put an ORDER BY id LIMIT 5 in the mysql_query, will it search through the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
A full table scan happens only when there is no matching index, and it is indeed a very slow operation.
Sorting is also accelerated by indexes.
And for #2: this is slow because transferring every row from MySQL to PHP is slow, and MySQL is MUCH faster at filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
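For your #3, yes, in general. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be, unless an index on the ORDER BY column (for example the primary key on id) already provides the rows in that order, in which case the database can stop after reading the first 5 index entries. Without such an index, while it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a good deal more work than just retrieving the first 5 records that match your WHERE clause.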
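Whether ORDER BY ... LIMIT 5 needs a sort depends on whether an index already delivers rows in the requested order, and the query plan shows this. The sketch below rests on assumptions: SQLite stands in for MySQL so that it runs anywhere, the `toys` data is generated, and `query_plan` is a hypothetical helper. In MySQL the same check is done with EXPLAIN, where "Using filesort" in the Extra column marks an explicit sort.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the plan's detail column (the last one) joined into one string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toys (id INTEGER PRIMARY KEY, toy_name TEXT, state INTEGER)"
)
conn.executemany(
    "INSERT INTO toys (toy_name, state) VALUES (?, ?)",
    ((f"toy-{i % 97}", i % 2) for i in range(1000)),
)

# Ordering by the primary key: rows already come out in id order,
# so the engine can stop after the first 5.
plan_by_id = query_plan(conn, "SELECT * FROM toys ORDER BY id LIMIT 5")

# Ordering by an unindexed column: every row has to be read and sorted
# before the first 5 are known.
plan_by_name = query_plan(conn, "SELECT * FROM toys ORDER BY toy_name LIMIT 5")

first_five = [row[0] for row in conn.execute("SELECT id FROM toys ORDER BY id LIMIT 5")]
```

Adding an index on toy_name would let the second query read rows in sorted order as well, which is what "sorting is accelerated by indexes" means in practice.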