Database `ORDER BY` speeds as more entries are added - mysql

I'm not very knowledgeable about databases. I would want to retrieve, say the "newest" 10 rows with owner ID matching something, and then perhaps paginate to retrieve the next "newest" 10 rows with that owner, and so on. But say I'm adding more and more rows into a database table -- at some point, would such a query become unbearably slow, or are databases generally good enough that this won't be a worry?
I imagine it would be an issue because to get the "newest" 10 rows you'd have to order by date, which is O(n log n). With this assumption, I sought a possible solution from SQL Server SELECT LAST N Rows.
It pointed me to http://www.sqlservercurry.com/2009/02/retrieve-last-n-rows-based-on-condition.html where I found that there is a PARTITION BY option for a query. I imagine this means first selecting all the rows that match the owner ID, and THEN ordering them, which would be significantly faster, and fast enough to not worry about for most applications. Is this the correct understanding?
Otherwise, is there some better way to get the "newest" N rows ( seems to suggest it is)?
I'm developing the app in Django if anyone knows a convenient way, but otherwise Django also allows raw database queries.

Okay, if you are using django, then you don't have to worry about DB's complexity. ORM is here to resolve your worries.
Simple fact, Django uses lazy query. So, it will reduce your DB hits and improve system performance.
So, according to your initial part of question, you can simply run this query:
queryset = YourModel.objects.filter(**lookup_condition).order_by('id')
It will get a queryset with the objects which match the condition from database of that Model class. For details, check this: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.filter
And to paginate over it, run like this:
first_ten_values = queryset[0:9]
second_ten_values = queryset[10:19]
...

Related

Will records order change between two identical query in mysql without order by

The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.

Does it make sense to split a huge select query into parts?

Actually, it is the question for an interview of a company which builds high-load service.
For example, we have a table with 1TB of records with primary b-tree index.
We need to select all records in a range from 5000 to 5000000.
We cannot block the whole database. Database in under high load.
Does it make sense to split a huge select query into parts like
select * from a where id > =5000 and id < 10000;
select * from a where id >= 10000 and id < 15000;
...
Please help me to compare behaviour in case when we use Postgres and MySQL.
Are there any other optimal techniques to select all required records?
Thanks.
There are many unknowns in your question. First of all, what is the table structure ? Will this query use any indexes ?
The best way to find out is to run an execution plan and analyze performance.
But trying to retrieve so many rows in one pass does not seem very reasonable. The query will very likely cause heavy load on the server + RAM consumption + usage of a temp file probably. It could fail or time out.
And then the resultset has to travel across the network and it could be huge. You have to evaluate the size of the dataset, we cannot guess without insight into the table structure.
The big question is, why retrieve so many rows, what is the ultimate goal ? Say you have a GUI application with a datagridview or something like that. You are not going to display 500 millions rows at once, this would crash the application. What the user probably wants is to paginate or search records using some filter. Maybe you'll show a few hundreds of records at a time max.
What are you going to do with all those records ?

ThinkingSphinx group_by causing slow searches

Setup:
Rails 4
MySQL
ThinkingSphinx
I have a model (Record) in my app with almost 500 million rows. This model has 32 fields, but the only two I care about for a particular Sphinx search are name and token. name is what I am searching against using Sphinx, and token is what I want returned to perform other actions in Rails with.
My indices set up is:
ThinkingSphinx::Index.define :records, :with => :real_time do
# fields
indexes name
indexes token
# attributes
has token, as: :token_attr, type: :string
# < several additional attributes >
end
What I want to do is query Sphinx on :records matching against name and have it return distinct token strings in an array.
Here's what I have:
Record.search("red", indices: %w(records), max_matches: num_tokens_i_need, group_by: :token_attr)
... where num_tokens_i_need is generally somewhere in the thousands (less than 10,000)
The above query takes between 5-8 minutes to complete. However, when I simply do:
Record.search("red", indices: %w(records), max_matches: num_tokens_i_need).map(&:token).uniq
The search is incredibly fast (returning several million records in a couple hundred milliseconds), but I don't get back num_tokens_i_need due to the .uniq call.
Basically what I need to do is have a fast Sphinx search which gives me back an exact number of distinct token for a given term (such as "red").
If seeing my sphinx.conf or anything else would be helpful, please let me know.
The Sphinx docs note that grouping is done in memory, so, to get grouped search results, every single document's attributes needs to be in memory at some point. Given there are several million documents in your Record index, I'm guessing this is the cause of the slowness.
Keep in mind that in your second example, millions of records may match your query, but they're not all being returned by Sphinx (and the matching is purely done on fields, attributes aren't involved), which is part of why that query is much faster.
Some thoughts on better ways forward:
If you're just wanting the tokens from Record instances where the name matches exactly, then SQL is probably a better tool for this job. Even with partial matches, using your database's fuzzy matching may be quicker.
If you're just after the number of tokens, rather than the token values, then Sphinx really isn't the right tool for the job. It's not built with aggregation in mind, hence why it's not tuned towards the query you're running.
If the keyword values (in your examples, red) are a known set (rather than user-provided), perhaps you can cache the values and recalculate them on a regular basis (once a day?).
None of these are clear winners, but hopefully they help you find a better solution.

Using MySQL to search through large data sets?

Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.
For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.
Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.
LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is complex (algorythmically) business compared to regular operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, if you have an index on that field you might be able to use.
Consider making an index table (in the literature/library context of the word "index", not a database index context) if you search for whole words. Or a substring lookup table if searching for random often repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
main_table:
id integer primary key
blah blah blah
text varchar(60)
appl_index:
id index
word varchar(20)
primary key (id,word)
index (word)
Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable like '%...'. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.
You may well have a similar problems with f.friend_from||f.friend_to but I've not seen that syntax before (if, as it seems to be, the context is u.user_id can be one or the other).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 Millions rows is still a relatively small size for a table. (And therefore a faster performance is typically possible)
As you found out, the front-ended wildcard are particularly inefficient and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, it systematically has a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read to it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the cast table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
in view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema cannot be done here (not enough info for one...) but the general process for doing so is to make the list of the most common use case and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with mySQL command EXPLAIN.
Normalization OR demormalization (or indeed a combination of both!), is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
Try running 'EXPLAIN PLAN' on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.