MySQL: amount of data per table

I'm designing a system, and after digging into the numbers, I realize it could reach a point where one table holds approximately 54,240,211,584 records/year. WOW!
So I broke it down, and down, to approximately 73,271,952 records/year.
I got the numbers by building an Excel model of what would happen if:
a) no success = 87 users,
b) low moderated success = 4300 users,
c) high moderated success = 13199 users,
d) success = 55100 users
e) incredible success = nah
Taking into account that the table is used for SELECT, INSERT, UPDATE & JOIN statements and that these statements would be executed by any user logged into the system hourly/daily/weekly (historical data is not an option):
Question 1: is the second quantity manageable for the MySQL engine, so that performance would suffer little impact?
Question 2: I set the table as InnoDB, but given that all of my statements use JOINs and that I'm worried about running into the 4GB limit problem, is InnoDB the right choice?
Quick overview of the tables:
table #1: user/event purchase. Up to 15 columns, some of them VARCHAR.
table #2: tickets by purchase. Up to 8 columns, only TINYINT. Primary Key INT. From 4 to 15 rows inserted by each table #1 insertion.
table #3: items by ticket. 4 columns, only TINYINT. Primary Key INT. 3 rows inserted by each table #2 insertion. I want to keep it as a separate table, but if someone has to die...
Table #3 is the target of the question. The way I got down to the second quantity was by turning each table #3 row into a table #2 column (a sketch of both layouts follows below).
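For reference, a minimal sketch of the two layouts being compared; all table and column names are assumptions, since the post only gives column counts and types.
-- Normalized layout (the 54-billion-rows/year estimate): one row per item.
CREATE TABLE purchase (                   -- table #1: user/event purchase
    purchase_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    event_name  VARCHAR(100) NOT NULL     -- ... up to 15 columns in the real design
) ENGINE=InnoDB;

CREATE TABLE ticket (                     -- table #2: 4-15 rows per purchase
    ticket_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    purchase_id INT UNSIGNED NOT NULL,
    seat_type   TINYINT UNSIGNED NOT NULL -- ... up to 8 TINYINT columns in the real design
) ENGINE=InnoDB;

CREATE TABLE ticket_item (                -- table #3: 3 rows per ticket
    item_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ticket_id   INT UNSIGNED NOT NULL,
    item_value  TINYINT UNSIGNED NOT NULL
) ENGINE=InnoDB;

-- Denormalized layout (the 73-million-rows/year estimate):
-- each table #3 row becomes a table #2 column.
CREATE TABLE ticket_wide (
    ticket_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    purchase_id  INT UNSIGNED NOT NULL,
    item_value_1 TINYINT UNSIGNED,
    item_value_2 TINYINT UNSIGNED,
    item_value_3 TINYINT UNSIGNED
) ENGINE=InnoDB;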
Something that I don't want to do, but would do if necessary, is partition the tables by week and add more logic to the application.
Every answer helps, but something like the following would be more helpful:
i) 33,754,240,211,584: No, so let's drop the last number.
ii) 3,375,424,021,158: No, so let's drop the last number.
iii) 337,542,402,115: No, so let's drop the last number. And so on until we get something like "well, it depends on many factors..."
What would I consider "little performance impact"? Up to 1,000,000 records it takes no more than 3 seconds to execute the queries. If 33,754,240,211,584 records take around 10 seconds, that's excellent to me.
Why don't I just test it myself? I don't think I'm capable of doing such a test properly; all I would do is insert that quantity of rows and see what happens. I would rather hear first from someone who has already dealt with something similar. Remember, I'm still in the design stage.
Thanks in advance.

54,240,211,584 is a lot. I only have experience with MySQL tables up to 300 million rows, and it handles that with little problem. I'm not sure what you're actually asking, but here are some notes:
Use InnoDB if you need transaction support, or are doing a lot of inserts/updates.
MyISAM tables are bad for transactional data, but ok if you're very read heavy and only do bulk inserts/updates every now and then.
There's no 4GB limit with MySQL if you're using recent releases and recent operating systems. My biggest table is 211GB now.
Purging data in large tables is very slow. e.g. deleting all records for a month takes me a few hours. (Deleting single records is fast though).
Don't use INT/TINYINT keys if you're expecting many billions of records; they'll run out of range (use BIGINT for the keys instead).
Get something working, fix the scaling after the first release. An unrealized idea is pretty much useless, something that works(for now) might be very useful.
Test. There's no real substitute: your app and DB usage might be wildly different from someone else's huge database.
Look into partitioned tables, a recent feature in MySQL that can help you scale in many ways (a sketch follows below).
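As an illustration of that last point, a partitioned table in MySQL 5.1+ might look like the sketch below. The table and column names are assumptions; monthly ranges are shown for brevity, and the weekly ranges mentioned in the question work the same way. Note that the partitioning column must be part of every unique key, including the primary key.
-- Hypothetical detail table partitioned by date range.
CREATE TABLE sale_item (
    item_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ticket_id  BIGINT UNSIGNED NOT NULL,
    sold_on    DATE NOT NULL,
    item_value TINYINT UNSIGNED NOT NULL,
    PRIMARY KEY (item_id, sold_on)       -- partitioning column must be in the PK
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(sold_on)) (
    PARTITION p200904 VALUES LESS THAN (TO_DAYS('2009-05-01')),
    PARTITION p200905 VALUES LESS THAN (TO_DAYS('2009-06-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Dropping an old partition is close to instant, unlike a bulk DELETE:
ALTER TABLE sale_item DROP PARTITION p200904;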

Start at the level you're at. Build from there.
There are plenty of people out there who will sell you services you don't need right now.
If $10/month shared hosting isn't working anymore, then upgrade, and eventually hire someone to help you get around the record limitations of your DB.

There is no 4GB limit, but of course there are limits. Don't plan too far ahead. If you're just starting up and you plan to be the next Facebook, that's great, but you have no resources.
Get something working so you can show your investors :)

Related

MySQL Table Locks

I was asked to write some PHP scripts against a MySQL DB to show some data, when I noticed the strange design they had.
They want to perform a study that would require collecting up to 2000 records per user and they are automatically creating a new table for each user that registers. It's a pilot study at this stage so they have around 30 tables but they should have 3000 users for the real study.
I wanted to suggest gathering all of them in a single table but since there might be around 1500 INSERTs per minute to that database during the study period, I wanted to ask this question here first. Will that cause table locks in MySQL?
So, is it one table with 1500 INSERTs per minute and a maximum size of 6,000,000 records, or 3000 tables with 30 INSERTs per minute and a maximum size of 2000 records each? I would like to suggest the first option, but I want to be sure that it will not cause any issues.
I read that InnoDB has row-level locks. So, will that have a better performance combined with the one table option?
This is a hugely loaded question. In my experience, performance is not really measured accurately by table size alone; it comes down to design. Do you have the primary keys and indexes in place? Is it over-indexed? That being said, I have also found that almost always one trip to the DB is faster than dozens. How big is the single table (columns)? What kind of data are you saving (larger than 4000K?)? It might be that you need to create some prototypes to see what performs best for you. The best I can recommend is that you carefully judge the size of the data you are collecting and allocate accordingly, create indexes (but not too many, don't over-index), and test.
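If it helps with prototyping, here is a minimal sketch of the single-table option. The column names and types are assumptions; the question only says up to 2,000 records per user.
-- One InnoDB table for all participants instead of one table per user.
-- InnoDB uses row-level locking, and 1,500 single-row INSERTs per minute
-- (25 per second) is a light write load that will not block readers.
CREATE TABLE study_record (
    record_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    recorded_at DATETIME NOT NULL,
    payload     VARCHAR(255) NOT NULL,
    KEY idx_user (user_id, recorded_at)  -- keeps per-user reads index-driven
) ENGINE=InnoDB;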

Performance of additional columns vs additional rows

I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which has been stored in text files up to this point, via the DOS programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas with all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucketloads of rows, or is it better to just add extra columns to represent each analyte, so that it generates only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases the PK (RequestID and SampleNumber) remains the same, so I guess it's just a matter of what would load quicker. I know the multiple-column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I were starting it out, I'd definitely choose a server database from the get-go: not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road, when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value
-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't needlessly duplicate the rest of the row's data, you can avoid the join in queries that don't need AnalyteName, you can support extra analytes, and it is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.
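In generic SQL (Jet/ACE DDL differs slightly), a literal transcription of that split might look like the sketch below; the column types are assumptions.
-- Table Common Results: the measurement itself, without the analyte name.
CREATE TABLE CommonResults (
    ClientRequestID INT          NOT NULL,
    SampleNumber    INT          NOT NULL,
    MeasureDtTm     DATETIME     NOT NULL,
    Operator        VARCHAR(50),
    UnitOfMeasure   VARCHAR(20),
    Value           DOUBLE,
    PRIMARY KEY (ClientRequestID, SampleNumber)
);

-- Table Results Analyte: consulted only when the analyte name is needed.
CREATE TABLE ResultsAnalyte (
    ClientRequestID INT         NOT NULL,
    SampleNumber    INT         NOT NULL,
    AnalyteName     VARCHAR(50) NOT NULL,
    PRIMARY KEY (ClientRequestID, SampleNumber)
);

-- Join on the shared key (Request + Sample) only where AnalyteName is required.
SELECT c.*, a.AnalyteName
FROM CommonResults AS c
JOIN ResultsAnalyte AS a
  ON a.ClientRequestID = c.ClientRequestID
 AND a.SampleNumber    = c.SampleNumber;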

Large MySQL Table - Advice Needed

I have a large MySQL MyISAM table with 1.5 million rows, 4.5GB in size and still growing every day.
I have done all the necessary indexing and performance has been greatly optimized. Yet the database occasionally breaks down (showing a 500 Internal Server error), usually due to query overload. Whenever there is a breakdown, the table starts to work very slowly and I have to perform a silly but effective task: copy the entire table over to a new table and replace the old one with the new one!
You may ask why I take such a stupid action. Why not repair or optimize the table? I've tried that, but the time to repair or optimize can be longer than the time to simply duplicate the table, and more importantly the new table performs much faster.
A newly built table usually works very well. But over time it becomes sluggish (maybe after a month) and eventually leads to another breakdown (500 Internal Server error). That's when everything slows down significantly and I need to repeat the silly process of replacing the table.
For your info:
- The data in the table seldom gets deleted, so there isn't a lot of overhead in the table.
- Under optimal condition, each query takes 1-3 secs. But when it becomes sluggish, the same query can take more than 30 seconds.
- The table has 24 fields, 7 are int, 3 are text, 5 are varchar and the rest are smallint. It's used to hold articles.
If you can explain what causes the sluggishness, or if you have suggestions on how to improve the situation, feel free to share. I will be very thankful.
Consider moving to InnoDB. One of its advantages is that it's crash safe. If you need full text capabilities, you can achieve that by implementing external tools like Sphinx or Lucene.
Partitioning is a common strategy here. You might be able to partition the articles by what month they were committed to the database (for example) and then have your query account for returning results from the month of interest (how you partition the table would be up to you and your application's design/behavior). You can union results if you will need your results to come from more than one table.
Even better, depending on your MySQL version, partitioning may be supported natively by your server. See the MySQL partitioning documentation for details.
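Concretely, the engine change is a single statement, and monthly range partitioning looks roughly like the sketch below. The table and column names are assumptions (the question only says the table holds articles), pre-5.6 InnoDB has no FULLTEXT indexes (hence the Sphinx/Lucene suggestion), and the partitioning column must be part of every unique key on the table.
-- Convert the MyISAM table to InnoDB (crash-safe, row-level locking).
-- Any MyISAM FULLTEXT index must be dropped first; move that work to Sphinx/Lucene.
ALTER TABLE articles ENGINE=InnoDB;

-- Optional: partition by month so old months can be read or pruned in isolation.
ALTER TABLE articles
PARTITION BY RANGE (TO_DAYS(published_at)) (
    PARTITION p200904 VALUES LESS THAN (TO_DAYS('2009-05-01')),
    PARTITION p200905 VALUES LESS THAN (TO_DAYS('2009-06-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);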

MySQL speed optimization on a table with many rows : what is the best way to handle it?

I'm developing a chat application. I want to keep everything logged in a table (i.e. "who said what and when").
I hope that in a near future I'll have thousands of rows.
I was wondering: what is the best way to optimize the table, knowing that I'll insert rows often and sometimes do group reads (i.e. showing an entire conversation from a user: look at when he/she logged in/started to chat, then look at when he/she quit, then show the entire conversation)?
This table should be able to handle (I hope!) many, many rows (15,000/day => 4.5M each month => 54M rows at the end of the year).
Conversations older than 15 days could be archived (but I don't know how to do that right).
Any ideas?
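For reference, a minimal sketch of such a log table; all names are assumptions, and the storage engine is the main point the answers below weigh in on. It has few indexes for fast inserts, plus one composite index that serves the "show this user's conversation" read.
CREATE TABLE chat_message (
    message_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,       -- who
    said_at    DATETIME NOT NULL,           -- when
    body       TEXT NOT NULL,               -- what
    KEY idx_user_time (user_id, said_at)    -- conversation reads per user
);

-- Reading one user's conversation for a given session window:
SELECT said_at, body
FROM chat_message
WHERE user_id = 42
  AND said_at BETWEEN '2009-05-01 18:00:00' AND '2009-05-01 19:00:00'
ORDER BY said_at;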
I have two pieces of advice for you:
1. If you are expecting lots of writes and few, low-priority reads, then you are better off with as few indexes as possible. Indexes make inserts slower; only add the ones you really need.
2. If the log table is going to get bigger and bigger over time, you should consider log rotation. Otherwise you might end up with one gigantic, corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrent inserts are enabled (concurrent_insert = 1). Inserts will then always be added to the end of the table, so SELECTs and INSERTs can happen simultaneously. So you don't have to use InnoDB-based tables (which can't be used with MERGE tables).
You could have 1 table per month, named something like data200905, data200904, etc. Your MERGE table would then include all the underlying tables you need to search on. Inserts are done on the MERGE table, so you don't have to worry about changing names. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
I've done this setup on databases that added 30 million records per month.
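A hedged sketch of that setup, reusing the hypothetical chat columns from above (the underlying tables must be identical MyISAM tables; MERGE does not work over InnoDB):
-- One MyISAM table per month, all with identical structure.
CREATE TABLE data200904 (
    message_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,
    said_at    DATETIME NOT NULL,
    body       TEXT NOT NULL,
    KEY idx_user_time (user_id, said_at)
) ENGINE=MyISAM;

CREATE TABLE data200905 LIKE data200904;

-- The MERGE table spans them; INSERT_METHOD=LAST routes inserts to the last table.
CREATE TABLE data_all (
    message_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    said_at    DATETIME NOT NULL,
    body       TEXT NOT NULL,
    KEY (message_id),
    KEY idx_user_time (user_id, said_at)
) ENGINE=MERGE UNION=(data200904, data200905) INSERT_METHOD=LAST;

-- To rotate: CREATE TABLE data200906 LIKE data200904; then
-- ALTER TABLE data_all UNION=(data200904, data200905, data200906);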
MySQL does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in a database and it ran just fine on MySQL.
MySQL does have an ARCHIVE table engine for handling many rows, but its lack of index support makes it a poor option for you, except perhaps for historical data.
Index creation will be required, but you do have to balance the indexes and not just create them because you can. They allow for faster queries (and will be required for usable queries on a table that large), but the more indexes you have, the higher the cost of inserting.
If you are just querying on your "user" id column, an index there will not be a problem, but if you want to do full-text queries on the messages, you may want to consider indexing only the user column in MySQL and using something like Sphinx or Lucene for the full-text searches, as full-text search in MySQL is not the fastest and significantly slows down insert time.
You could handle this with two tables: one for the current chat history and one archive table. At the end of a period (week, month or day, depending on your traffic) you can archive current chat messages, removing them from the small table and adding them to the archive (see the sketch after this answer).
This way your application is going to handle well the most common case - query the current chat status and this is going to be really fast.
For queries like "what did X say last month" you will query the archive table, and it is going to take a little longer, but this is OK since there won't be that many of these queries, and whoever searches like this will be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle: if there will be a lot of queries for chat messages from the last 6 months, store them in a separate table too.
Similar principle (for completely different area) is used by the .NET garbage collector which has different storage for short lived objects, long lived objects, large objects, etc.
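A rough sketch of that rotation for the two-table approach described above, reusing the hypothetical chat_message layout (the archive table name is an assumption; running it off-peak or in batches keeps the live table responsive):
-- Archive everything older than 15 days, then remove it from the live table.
-- chat_message_archive is assumed to have the same structure as chat_message.
SET @cutoff = NOW() - INTERVAL 15 DAY;

INSERT INTO chat_message_archive
SELECT * FROM chat_message WHERE said_at < @cutoff;

DELETE FROM chat_message WHERE said_at < @cutoff;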

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant number of records, the time to save a new record will be slow. Please advise how best to approach the indexing of the columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
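For concreteness, a table like the one described might look like the sketch below; the table name and column types are assumptions, while point_value, user_id and event_id come from the question.
CREATE TABLE run (
    run_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,      -- player
    event_id    INT UNSIGNED NOT NULL,      -- game
    point_value SMALLINT NOT NULL,
    KEY idx_user   (user_id),
    KEY idx_event  (event_id),
    KEY idx_points (point_value)
) ENGINE=InnoDB;
-- Every INSERT maintains the clustered index plus these three secondary
-- indexes; at 25-200 rows per day and roughly 250,000 rows after five
-- seasons, that overhead is negligible.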
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example; ANALYZE TABLE is roughly the MySQL equivalent.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
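In MySQL, EXPLAIN is the usual way to spot those full table scans before deciding whether an extra index earns its insert cost; a minimal example against the hypothetical run table above:
-- "type: ALL" with a large "rows" estimate means a full table scan;
-- that is the signal that an index on the filtered column may be worthwhile.
EXPLAIN
SELECT user_id, SUM(point_value)
FROM run
WHERE event_id = 123
GROUP BY user_id;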
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I ran some simple tests using my real project and a real MySQL database.
My results: adding an average index (1-3 columns) to a table makes inserts slower by about 2.1%. So if you add 20 indexes, your inserts will be slower by 40-50%, but your selects will be 10-100 times faster.
So is it OK to add many indexes? It depends :) I gave you my results; you decide!
Nothing for SELECT queries, though updates and especially inserts will be orders of magnitude slower, which you won't really notice until you start inserting a LOT of rows at the same time...
In fact, at a previous employer (single-user desktop system) we actually DROPPED the indexes before starting our "import routine", which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
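The same pattern in MySQL terms, sketched against the hypothetical run table from earlier (the index names and the input file are assumptions):
-- 1. Drop the secondary indexes before the bulk load.
ALTER TABLE run
    DROP INDEX idx_user,
    DROP INDEX idx_event,
    DROP INDEX idx_points;

-- 2. Load the data with no per-row index maintenance.
LOAD DATA INFILE '/tmp/runs.csv' INTO TABLE run
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- 3. Rebuild the indexes once, after the load.
ALTER TABLE run
    ADD INDEX idx_user   (user_id),
    ADD INDEX idx_event  (event_id),
    ADD INDEX idx_points (point_value);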
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like the premature optimization that should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving whether it is or is not a problem will probably be much more useful than trying to guess and worrying about what may happen. If there is a problem, you will be able to use your test setup to try different methods of fixing the issue.
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"