Time Complexity of Sorting a database - mysql

I'm currently developing a mobile app using CodeIgniter and MySQL. I'm now faced with a situation where I have a table of books (this table will have 100k+ records). Within this table I have a column called NotSelling. Example of the db:
Book A 45
Book B 0
Book C 159
Book D 78
.
.
.
Book Z 450
Where the numbers above are what appears in the NotSelling column in the db. I need to extract the top 20 books from this large table. My solution is to sort the table and then just use LIMIT (MySQL's equivalent of TOP) to extract the top 20 records.
What I would like to know about is the performance of sorting the table, as I'm sure constantly sorting the table simply to get the top 20 results would take a hideously long time. I have been given these solutions to the problem:
Index the NotSelling column.
Cache the query (but I've read about coarse invalidation, which may cause problems since in my case the invalidation frequency would be high).
Sort the table, take the top 20 records, place them in another table, and then periodically update that table, say every hour or so.
All that being said, does anyone know of a better solution to this problem, or have a way of optimizing the performance of what I'm trying to do? Note that I am a newbie, so if anyone can point me in the right direction for reading up on database performance, I would really appreciate it.

I think you are overthinking this. It is definitely a case of premature optimization. While all the above-mentioned solutions are perfectly valid, you should know that 100k+ records is child's play to MySQL. We used to routinely run ORDER BY on tables with 30 million+ rows, with excellent performance.
But you MUST have an index on the column being sorted on, and double-check your table schema. Regarding caching, don't worry either: MySQL does that for you for repeated queries as long as the table has not changed. But an index on the sorted column is a must; it is the primary and most important requirement.
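To make that concrete, here is a minimal sketch of the index plus the top-20 query, assuming a hypothetical books table with the NotSelling column from the question (the table and column names are assumptions, and I'm guessing "top" means the highest NotSelling values):
-- Index the column being sorted on:
CREATE INDEX idx_not_selling ON books (NotSelling);
-- With the index in place, MySQL can read the top 20 rows in index order
-- instead of sorting the whole table (LIMIT is MySQL's TOP):
SELECT book_name, NotSelling
FROM books
ORDER BY NotSelling DESC
LIMIT 20;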

Don't worry about the performance of sorting. That can always be fixed in the database at a later time by adding an index if it actually proves to be a problem.
In the design phase, optimization is a distraction. Instead, focus on the functionality and on how directly the implementation represents the problem. As long as those are on target, everything else can be fixed comparatively easily.

Depending on what kind of meta-data is kept inside of the data structure of the index backing the column, a traversal can likely be done in O(n) time with n being the number of items returned.
This means that in theory, whether you have 1 million or 200 trillion records, pulling the first 20 will be just as fast as long as you have an index. In practice, there is going to be a performance difference since a small index will fit in memory whereas a large one will have to use the disk.
So in short, you're worrying too much. As Srikar Appal said, a properly indexed 100k-record table is nothing to MySQL.
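If you want to check that the index is actually being used rather than a full sort, EXPLAIN will show you; a quick sketch, reusing the hypothetical books/NotSelling names from above:
EXPLAIN
SELECT book_name, NotSelling
FROM books
ORDER BY NotSelling DESC
LIMIT 20;
-- If the index is used, the plan lists idx_not_selling under "key" and the
-- Extra column does not show "Using filesort".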

Related

mysql getting rid of redundant values

I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points (~4000) a couple of times every minute and stores them in my database. I need to be able to downsample based on the timestamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd like to do would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is that this seems super inefficient with respect to memory. Using this structure, each timestamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I have thought of that reduce the memory footprint of my database require me to either use separate tables (which to my understanding is super bad practice) or store the data in CSV files, which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and would probably be substantially slower to search). Is there a database structure I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what: that is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So a record would take, say, 4 + 4 + 8 = 16 bytes (assuming a double-precision floating point representation for value). By removing the timestamp you would get down to 12 bytes -- a savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 MB/day and 138 MB/day, or 67 GB/year and 50 GB/year. Either way, you are going to have to deal with biggish-data issues regardless of how you store the timestamp.
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
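As a rough sketch of what daily range partitioning could look like for this table (the column types and partition boundaries below are assumptions for illustration, not details from the question):
CREATE TABLE results (
    point_id INT NOT NULL,
    ts       TIMESTAMP NOT NULL,   -- the question's "timestamp" column
    value    DOUBLE NOT NULL,
    KEY idx_point_ts (point_id, ts)
)
PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
    PARTITION p20240101 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-02 00:00:00')),
    PARTITION p20240102 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-03 00:00:00')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);
-- A BETWEEN condition on ts then only has to scan the partitions that cover
-- the requested date range (partition pruning).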
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that?! Normalising data (splitting it into separate, linked data structures) is actually good practice - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to it from the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.
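For illustration, a minimal sketch of that normalised layout, with hypothetical names (the question itself doesn't define these tables):
CREATE TABLE sample_times (
    sample_id INT AUTO_INCREMENT PRIMARY KEY,
    ts        TIMESTAMP NOT NULL   -- stored once per sampling run
);
CREATE TABLE results (
    sample_id INT NOT NULL,        -- links back to sample_times
    point_id  INT NOT NULL,
    value     DOUBLE NOT NULL,
    KEY idx_sample_point (sample_id, point_id)
);
-- The original query then becomes a join:
SELECT r.point_id, MAX(r.value) AS value
FROM results r
JOIN sample_times s ON s.sample_id = r.sample_id
WHERE s.ts BETWEEN date1 AND date2
GROUP BY r.point_id;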

Doing SUM() and GROUP BY over millions of rows on mysql

I have this query which only runs once per request.
SELECT SUM(numberColumn) AS total, groupColumn
FROM myTable
WHERE dateColumn < ? AND categoryColumn = ?
GROUP BY groupColumn
HAVING total > 0
myTable has fewer than a dozen columns and can grow up to 5 million rows, but more likely about 2 million in production. All columns used in the query are numbers, except for dateColumn, and there are indexes on dateColumn and categoryColumn.
Would it be reasonable to expect this query to run in under 5 seconds with 5 million rows on most modern servers, if the database is properly optimized?
The reason I'm asking is that we don't have 5 million rows of data, and we won't even hit 2 million within the next few years. If the query doesn't run in under 5 seconds at that point, it will be hard to know where the problem lies. Would it be because the query is not suitable for a large table, or because the database isn't optimized, or because the server isn't powerful enough? Basically, I'd like to know whether using SUM() and GROUP BY over a large table is reasonable.
Thanks.
As people in the comments under your question suggested, the easiest way to verify is to generate random data and test the query execution time. Please note that using a clustered index on dateColumn can significantly change execution times, because with a "<" condition only a contiguous subset of the data on disk has to be read in order to calculate the sums.
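As a sketch of that approach (MySQL 8+ syntax; the column types and value ranges below are assumptions, since the question doesn't give them):
SET SESSION cte_max_recursion_depth = 1000000;   -- allow a deep recursive CTE
INSERT INTO myTable (numberColumn, groupColumn, categoryColumn, dateColumn)
SELECT FLOOR(RAND() * 1000),                     -- numberColumn
       FLOOR(RAND() * 50),                       -- groupColumn: ~50 distinct groups
       FLOOR(RAND() * 10),                       -- categoryColumn: ~10 categories
       NOW() - INTERVAL FLOOR(RAND() * 365) DAY  -- dateColumn within the last year
FROM (
    WITH RECURSIVE seq (n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 1000000  -- 1 million test rows
    )
    SELECT n FROM seq
) AS numbers;
Time the SUM/GROUP BY query against this data, then repeat with 5 million rows to see how it scales.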
If you are at the beginning of the development process, I'd suggest concentrating not on the structure of the table and indexes that collect the data, but rather on what you expect to need to retrieve from the table in the future. I can share my own experience with presenting a website administrator with web usage statistics. I had several web pages being requested from the server, each of them falling into one or more "categories". My first approach was to collect each request in a log table with some indexes, but the table grew much larger than I had at first estimated. :-) Because the statistics were analyzed in constant groups (weekly, monthly, and yearly), I decided to create an additional table that aggregated requests into predefined week/month/year groups. Each request incremented the relevant columns - the columns referred to my "categories". This broke some normalization rules, but allowed me to calculate statistics in the blink of an eye.
An important question is the dateColumn < ? condition. I am guessing it is filtering out records that are out of date. It doesn't really matter how many records there are in the table; what matters is how many records this condition cuts the set down to.
Having aggressive filtering by date combined with partitioning the table by date can give you amazing performance on ridiculously large tables.
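One thing worth testing along those lines (my own suggestion, not something from the question) is a composite index that matches the query's filter and grouping:
-- Equality column first, then the range column, then the columns the query reads:
CREATE INDEX idx_cat_date_group_num
    ON myTable (categoryColumn, dateColumn, groupColumn, numberColumn);
-- With categoryColumn = ? as an equality and dateColumn < ? as a range, MySQL
-- scans only the matching slice and can read groupColumn and numberColumn
-- straight from the index (a covering index), with no row lookups.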
As a side note, if you are not expecting to hit this much data for many years to come, don't bother solving it now. Your business requirements may change a dozen times by then, together with the architecture, DB layout, design, and implementation details. Planning ahead is great, but sometimes you want to ship a good-enough solution as soon as possible and handle the painful issues in a future release.

is having millions of tables and millions of rows within them a common practice in MySQL database design?

I am doing the database design for an upcoming web app, and I was wondering, from anybody using MySQL heavily in their current web apps, whether this sort of design is efficient for a web app with, let's say, 80,000 users.
1 DB
in DB, millions of tables for features for each user, and within each table, potentially millions of rows.
While this design is very dynamic and scales nicely, I was wondering a few things.
Is this a common design in web applications today?
How would this perform, time-wise, when querying millions of rows?
How does a DB perform if it contains MILLIONS of tables? (Again, time-wise, and is this even possible?)
If it performs well under the above conditions, how would it perform under strenuous load, if all 80,000 users accessed the DB 20-30 times each for 10-15 minute sessions every day?
How much server space would this require, very generally speaking (reiterating: millions of tables, each containing potentially millions of rows with 10-15 columns filled with text)?
Any help is appreciated.
1 - Definitely not. Almost anyone you ask will tell you millions of tables is a terrible idea.
2 - Millions of ROWS is common, so just fine.
3 - Probably terribly, especially if the queries are written by someone who thinks it's OK to have millions of tables. That tells me this is someone who doesn't understand databases very well.
4 - See #3
5 - Impossible to tell. You will have a lot of extra overhead from the extra tables as they all need extra metadata. Space needed will depend on indexes and how wide the tables are, along with a lot of other factors.
In short, this is a very very very seriously bad idea and you should not do it.
Millions of rows is perfectly normal usage, and can respond quickly if properly optimized and indexed.
Millions of tables is an indication that you've made a major goof in how you've architected your application. Millions of rows times millions of tables times 80,000 users means what, 80 quadrillion records? I strongly doubt you have that much data.
Having millions of rows in a table is perfectly normal and MySQL can handle this easily, as long as you use appropriate indexes.
Having millions of tables on the other hand seems like a bad design.
In addition to what others have said, don't forget that finding the right table based on the given table name also takes time. How much time? Well, this is internal to the DBMS and likely not documented, but probably more than you think.
So, a query searching for a row can take either:
1. the time to find the table + the time to find the row in a (relatively) small table, or
2. just the time to find the row in one large table.
Option (2) is likely to be faster.
Also, frequently using different table names in your queries makes query preparation less effective.
If you are thinking of having millions of tables, I can't imagine that you are actually designing millions of logically distinct tables. Rather, I strongly suspect that you are creating tables dynamically based on data. That is, rather than creating a FIELD for, say, the user id and storing one or more records for each user, you are contemplating creating a new TABLE for each user id. And then you'll have thousands and thousands of tables that all have exactly the same fields in them. If that's what you're up to: Don't. Stop.
A table should represent a logical TYPE of thing that you want to store data for. You might make a city table, and then have one record for each city. One of the fields in the city table might indicate what country that city is in. DO NOT create a separate table for each country holding all the cities for each country. France and Germany are both examples of "country" and should go in the same table. They are not different kinds of thing, a France-thing and a Germany-thing.
Here's the key question to ask: What data do I want to keep in each record? If you have 1,000 tables that all have exactly the same columns, then almost surely this should be one table with a field that has 1,000 possible values. If you really seriously keep totally different information about France than you keep about Germany, like for France you want a list of provinces with capital city and the population but for Germany you want a list of companies with industry and chairman of the board of directors, then okay, those should be two different tables. But at that point the difference is likely NOT France versus Germany but something else.
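To put that in concrete terms, a small sketch of the two layouts (the table and column names here are my own, purely for illustration):
-- Don't: one table per country, all with identical columns
-- CREATE TABLE cities_france  (city_name VARCHAR(100), population INT);
-- CREATE TABLE cities_germany (city_name VARCHAR(100), population INT);
-- Do: one table with a field that distinguishes the rows
CREATE TABLE cities (
    city_id    INT AUTO_INCREMENT PRIMARY KEY,
    country    VARCHAR(50)  NOT NULL,    -- 'France', 'Germany', ...
    city_name  VARCHAR(100) NOT NULL,
    population INT,
    KEY idx_country (country)
);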
1] Look up dimension and fact tables in database design. You can start with http://en.wikipedia.org/wiki/Database_model#Dimensional_model.
2] Be careful about indexing too much: for high write/update workloads you don't want to index too much, because that gets very expensive (think of the average or worst case of rebalancing a B-tree). For high-read tables, index only the fields you search by. For example, in
SELECT * FROM mytable WHERE A = '' AND B = '';
you may want to index A and B.
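For instance (a sketch using the hypothetical mytable from the example above), a single composite index can serve both conditions:
CREATE INDEX idx_a_b ON mytable (A, B);   -- covers WHERE A = ... AND B = ...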
3] It may not be necessary to start thinking about replication yet, but since you are talking about 10^6 entries and tables, maybe you should.
So, instead of giving you a flat no to the millions-of-tables question (and yes, my answer is NO), I think a little research will serve you better. As for millions of records, it hints that you need to start thinking about "scaling out" -- as opposed to "scaling up."
SQL Server has many ways to support large tables. You may find some help by splitting your indexes across multiple partitions (filegroups), placing large tables on their own filegroup, and the indexes for those large tables on another set of filegroups.
A filegroup is essentially a group of files that can live on its own drive. Each drive has its own dedicated read and write heads, so the more drives you use, the more heads are searching the indexes at a time, and thus the faster you find your records.
Here is a page that talks in detail about filegroups.
http://cm-bloggers.blogspot.com/2009/04/table-and-index-partitioning-in-sql.html

Large MySQL Table - Advice Needed

I have a large MySQL MyISAM table with 1.5 million rows, 4.5 GB in size, and still increasing every day.
I have done all the necessary indexing and the performance has been greatly optimized. Yet, the database occasionally breaks down (showing a 500 Internal Server error), usually due to query overload. Whenever there is a breakdown, the table starts to work very slowly and I have to perform a silly but effective task: copy the entire table over to a new table and replace the old one with the new one!!
You may ask why such a stupid action. Why not repair or optimize the table? I've tried that, but the time to do a repair or optimization can be more than the time to simply duplicate the table, and, more importantly, the new table performs much faster.
A newly built table usually works very well. But over time it becomes sluggish (maybe after a month) and eventually leads to another breakdown (500 internal server error). That's when everything slows down significantly and I need to repeat the silly process of replacing the table.
For your info:
- The data in the table seldom get deleted. So there isn't a lot of overhead in the table.
- Under optimal conditions, each query takes 1-3 seconds. But when the table becomes sluggish, the same query can take more than 30 seconds.
- The table has 24 fields: 7 are int, 3 are text, 5 are varchar, and the rest are smallint. It's used to hold articles.
If you can explain what causes the sluggishness, or if you have suggestions on how to improve the situation, please feel free to share. I will be very thankful.
Consider moving to InnoDB. One of its advantages is that it's crash-safe. If you need full-text capabilities, you can achieve that by implementing external tools like Sphinx or Lucene.
Partitioning is a common strategy here. You might be able to partition the articles by the month they were committed to the database (for example) and then have your query target the month of interest (how you partition the table would be up to you and your application's design/behavior). You can UNION results if you need them to come from more than one table.
Even better, depending on your MySQL version, partitioning may be supported by your server. See this for details.
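If native partitioning is available in your version, a rough sketch of month-based partitioning for an articles table might look like this (the column and partition names are assumptions for illustration; note that MySQL requires every unique key, including the primary key, to contain the partitioning column):
ALTER TABLE articles
PARTITION BY RANGE (TO_DAYS(published_at)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
-- Queries filtering on published_at then only touch the partitions for the
-- months of interest.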

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant number of records, the time to save a new record will be slow. Please advise on how best to approach indexing these columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
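For concreteness, here is a sketch of what such a table and its three indexes might look like (all names and types are assumptions based on the description above):
CREATE TABLE runs (
    run_id      INT AUTO_INCREMENT PRIMARY KEY,
    user_id     INT NOT NULL,
    event_id    INT NOT NULL,
    point_value INT NOT NULL,
    KEY idx_user   (user_id),
    KEY idx_event  (event_id),
    KEY idx_points (point_value)
);
-- Each INSERT has to update the primary key plus these three secondary
-- indexes, which is the per-row cost the answers below are discussing.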
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example; I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small volume for a database; you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and real MySQL database.
My results are: adding an average index (1-3 columns in an index) to a table makes inserts slower by about 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for select queries, though updates and especially inserts will be orders of magnitude slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
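In MySQL terms, that bulk-load pattern might look roughly like this (the table, column, and index names are placeholders of my own):
-- Drop the secondary indexes before the bulk load...
ALTER TABLE import_target DROP INDEX idx_col_a, DROP INDEX idx_col_b;
DELETE FROM import_target;   -- the "delete all records" step
-- ... run the huge INSERT / LOAD DATA here ...
-- ...then rebuild the indexes once, after all rows are in place:
ALTER TABLE import_target ADD INDEX idx_col_a (col_a), ADD INDEX idx_col_b (col_b);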
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like premature optimization, which should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving whether it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different methods of fixing the issue.
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"