Large MySQL Table - Advice Needed - mysql

I have a large mysql MyISAM table with 1.5mil rows and 4.5GB big, still increasing everyday.
I have done all the necessary indexing and the performance has been greatly optimized. Yet, the database occasionally break down (showing 500 Internal Server error) usually due to query overload. Whenever there is a break down, the table will start to work very slowly and I'll have to do a silly but effective task : copy the entire table over to a new table and replace the new one with the old one!!
You may ask why such a stupid action. Why not repair or optimize the table? I've tried that but the time to do repair or optimization may be more than the time to simply duplicate the table and more importantly the new table performs much faster.
Newly built table usually work very well. But over time, it will become sluggish (maybe after a month) and eventually lead to another break down (500 internal server). That's when everything slow down significantly and I need to repeat the silly process of replacing table.
For your info:
- The data in the table seldom get deleted. So there isn't a lot of overhead in the table.
- Under optimal condition, each query takes 1-3 secs. But when it becomes sluggish, the same query can take more than 30 seconds.
- The table has 24 fields, 7 are int, 3 are text, 5 are varchar and the rest are smallint. It's used to hold articles.
If you can explain what cause the sluggishness or you have suggestion on how to improve the situation, feel free to share it. I will be very thankful.

Consider moving to InnoDB. One of its advantages is that it's crash safe. If you need full text capabilities, you can achieve that by implementing external tools like Sphinx or Lucene.

Partitioning is a common strategy here. You might be able to partition the articles by what month they were committed to the database (for example) and then have your query account for returning results from the month of interest (how you partition the table would be up to you and your application's design/behavior). You can union results if you will need your results to come from more than one table.
Even better, depending on your MySQL version, partitioning may be supported by your server. See this for details.

Related

Distributed database use cases

At the moment i do have a mysql database, and the data iam collecting is 5 Terrabyte a year. I will save my data all the time, i dont think i want to delete something very early.
I ask myself if i should use a distributed database because my data will grow every year. And after 5 years i will have 25 Terrabyte without index. (just calculated the raw data i save every day)
i have 5 tables and the most queries are joins over multiple tables.
And i need to access mostly 1-2 columns over many rows at a specific timestamp.
Would a distributed database be a prefered database than only a single mysql database?
Paritioning will be difficult, because all my tables are really high connected.
I know it depends on the queries and on the database table design and i can also have a distributed mysql database.
i just want to know when i should think about a distributed database.
Would this be a use case? or could mysql handle this large dataset?
EDIT:
in average i will have 1500 clients writing data per second, they affect all tables.
i just need the old dataset for analytics. Like machine learning and
pattern matching.
also a client should be able to see the historical data
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiple by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-a-kind-a out of the box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Best free way to store 20 million rows a day?

Daily 20-25 million rows that will be removed at midnight for next days data. Can mySQL handle 25 million indexed rows? What would be another good solution?
You give very little information on the context but sometimes not using a database and instead a binary/plain text file is just fine and can -- depending on your requirements -- be much more efficient and maintainable. e.g if it's sensor data storing it in a binary file with each record at a known offset could be a good solution. You saying that the data would be deleted every 24h seems to indicate that you might not need some the properties of a relational database solution such as ACID, replication, integrated backup and so on, so perhaps a flat file approach is just fine?
Our MySQL database has over 300 million rows indexed and we only ever experience problems with complex joins running a little slow - most can be optimized though.
Handling the rows was no problem - the key to our performance was good indexes.
Considering you are dropping the information at midnight, i would also look at MySQL partitioning which would allow you to drop that part of the table whilst allowing the next day to continue inserting if need be.
The issue is not the number of rows itself -- it's what you do with the database. Are you doing only inserts during the day followed by some batch report? Or, are you doing thousands of queries per second on the data? Inserts/Updates/Deletes? If you slam enough load at any database platform, you can max it out with a single table and a single row (taking it to the most extreme). I used MySQL 4.1 w/ MyISAM (hardly the most modern of anything) on a site with a 40M row user table. It did < 5ms queries, I think. We were rendering pages in less than 200ms. However, we had lots and lots of caching set up, so the number of queries wasn't too high. And, we were doing simple statements like SELECT * FROM USER WHERE USER_NAME = 'SMITH'
Can you comment more on your use case?
If you are using Windows, you could do worse than use SqlExpress 2008, which should easily handle that load, depending on how many indexes you are creating on it. So long as you keep < 4GB total db size, it shouldn't be a problem.
From my experience, mySQL tends to not scale well at all. If you must have a free solution for this I would highly recommend postgreSQL.
Also (this may or may not be an issue for you), but keep in mind that if you're dealing with that much data, the maximum size of a mySQL database is 4 terabytes, if I remember correctly.
I don't think there is a practical limit on the max number of rows in mySQL, so if you MUST use mySQL, I think it would work for what you want to do, but personally for a production system I wouldn't recommend it.
As a general solution I'd recommend PostgreSQL too, but depending on your specific needs, other solutions might be better/faster. For example, if you do not need to query your data while it is being written, TokyoCabinet (the table based API / TDB) might be faster and more lightweight/robust.
I haven't looked into them in mysql, but this sounds like a perfect application for table partitions
use only as an index database and store it in the form of file approach would be more effective because you will remove within 24 hours and the process will be faster also not burden your server

Building a Large Table in MySQL

This is my first time building a database with a table containing 10 million records. The table is a members table that will contain all the details of a member.
What do I need to pay attention when I build the database?
Do I need a special version of MySQL? Should I use MyISAM or InnoDB?
For a start, you may need to step back and re-examine your schema. How did you end up with 10 million rows in the member table? Do you actually have 10 million members (it seems like a lot)?
I suspect (although I'm not sure) that you have less than 10 million members in which case your table will not be correctly structured. Please post the schema, that's the first step to us helping you out.
If you do have 10 million members, my advice is to make your application vendor-agnostic to begin with (i.e., standard SQL). Then, if you start running into problems, just toss out your current DBMS and replace it with a more powerful one.
Once you've established you have one that's suitable, then, and only then would I advise using vendor-specific stuff. Otherwise it will be a painful process to change.
BTW, 10 million rows is not really considered a big database table, at least not where I come from.
Beyond that, the following is important (not necessarily an exhaustive list but a good start).
Design your tables for 3NF always. Once you identify performance problems, you can violate that rule provided you understand the consequences.
Don't bother performance tuning during development, your queries are in a state of flux. Just accept the fact they may not run fast.
Once the majority of queries are locked down, then start tuning your tables. Add whatever indexes speed up the selects, de-normalize and so forth.
Tuning is not a set-and-forget operation (which is why we pay our DBAs so much). Continuously monitor performance and tune to suit.
I prefer to keep my SQL standard to retain the ability to switch vendors at any time. But I'm pragmatic. Use vendor-specific stuff if it really gives you a boost. Just be aware of what you're losing and try to isolate the vendor-specific stuff as much as possible.
People that use "select * from ..." when they don't need every column should be beaten into submission.
Likewise those that select every row to filter out on the client side. The people that write our DBMS' aren't sitting around all day playing Solitaire, they know how to make queries run fast. Let the database do what it's best at. Filtering and aggregation is best done on the server side - only send what is needed across the wire.
Generate your queries to be useful. Other than the DoD who require reports detailing every component of their aircraft carriers down to the nuts-and-bolts level, no-one's interested in reading your 1200-page report no matter how useful you think it may be. In fact, I don't think the DoD reads theirs either, but I wouldn't want some general chewing me out because I didn't deliver - those guys can be loud and they have a fair bit of sophisticated weaponry under their control.
At least use InnoDB. You will feel the pain when you realize MyISAM has just lost your data...
Apart from this, you should give more information about what you want to do.
You don't need to use InnoDB if you don't have data integrity and atomic action requirements. You want to use InnoDB if you have foreign keys between tables and you are required to keep the constraints, or if you need to update multiple tables in atomic operation. Otherwise, if you just need to use the table to do analysis, MyISAM is fine.
For queries, make sure you build smart indexes to suite your query. For example, if you want to sort by columns c and selecting based on columns a, and b, make sure you have an index that covers columns a, b, and c, in that order, and that index includes full length of each column, rather than a prefix. If you don't do your index right, sorting over a large amount of data will kill you. See http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Just a note about InnoDB and setting up and testing a large table with it. If you start injecting your data, it will take hours. Make sure you issue commits periodically, otherwise if you want to stop and redo for whatever reason, you end up have to 1) wait hours for transaction recovery, or 2) kill mysqld, set InnoDB recover flag to no recover and restart. Also if you want to re-inject data from scratch, DROP the table and recreate it is almost instantaneous, but it will take hours to actually "DELETE FROM table".

MySQL speed optimization on a table with many rows : what is the best way to handle it?

I'm developping a chat application. I want to keep everything logged into a table (i.e. "who said what and when").
I hope that in a near future I'll have thousands of rows.
I was wondering : what is the best way to optimize the table, knowing that I'll do often rows insertion and sometimes group reading (i.e. showing an entire conversation from a user (look when he/she logged in/started to chat then look when he/she quit then show the entire conversation)).
This table should be able to handle (I hope though !) many many rows. (15000 / day => 4,5 M each month => 54 M of rows at the end of the year).
The conversations older than 15 days could be historized (but I don't know how I should do to do it right).
Any idea ?
I have two advices for you:
If you are expecting lots of writes
with little low priority reads. Then you
are better off with as little
indexes as possible. Indexes will
make insert slower. Only add what you really need.
If the log table
is going to get bigger and bigger
overtime you should consider log
rotation. Otherwise you might end up
with one gigantic corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrency is set to 1. Inserts will then always be added to the end of the table, so SELECTs and INSERTs can happen simultaneously. So you don't have to use InnoDB based tables (which can use MERGE tables).
You could have 1 table per month, named something like data200905, data200904, etc. Your merge table would them include all the underlying tables you need to search on. Inserts are done on the merge table, so you don't have to worry about changing names. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
I've done this setup on databases that added 30 million records per month.
Mysql does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in a database and was able to run it just fine on mysql.
Mysql does have an "archive" table engine option for handling many rows, but the lack of index support will make it not a great option for you, except perhaps for historical data.
Index creation will be required, but you do have to balance them and not just create them because you can. They will allow for faster queries (and will required for usable queries on a table that large), but the more indexes you have, the more cost there will be inserting.
If you are just querying on your "user" id column, an index on there will not be a problem, but if you are looking to do full text queries on the messages, you may want to consider only indexing the user column in mysql and using something like sphynx or lucene for the full text searches, as full text searches in mysql are not the fastest and significantly slow down insert time.
You could handle this with two tables - one for the current chat history and one archive table. At the end of a period ( week, month or day depending on your traffic) you can archive current chat messages, remove them from the small table and add them to the archive.
This way your application is going to handle well the most common case - query the current chat status and this is going to be really fast.
For queries like "what did x say last month" you will query the archive table and it is going to take a little longer, but this is OK since there won't be that much of this queries and if someone does search like this he would be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle - if there will be a lot of queries for chat messages during last 6 months - store them in separate table too.
Similar principle (for completely different area) is used by the .NET garbage collector which has different storage for short lived objects, long lived objects, large objects, etc.

What is the cost of indexing multiple db columns?

I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant amount of records, the time to save a new record will be slow. Please inform how best to approach the indexing of columns.
UPDATE
I am indexing a point_value, the
user_id, and an event_id, all required
for client-facing purposes. For an
instance such as scoring baseball runs
by player id and game id. What would
be the cost of inserting about 200 new
records a day, after the table holds
records for two seasons, say 72,000
runs, and after 5 seasons, maybe a
quarter million records? Only for
illustration, but I'm expecting to
insert between 25 and 200 records a
day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example, I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small value for a database, you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a days worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and real MySql database.
My results are: adding average index (1-3 columns in an index) to a table - makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for select queries, though updates and especially inserts will be order of magnitudes slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same numbers apply for any database...
So be careful with indexes, they're FAR from "free"...
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without some more details about expected usage of the data in your table worrying about indexes slowing you down smells a lot like premature optimization that should be avoided.
If you are really concerned about it, then setup a test database and simulate performance in the worst case scenarios. A test proving that is or is not a problem will probably be much more useful then trying to guess and worry about what may happen. If there is a problem you will be able to use your test setup to try different methods to fix the issue.
The index is there to speed retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data that you want. With a significant amount of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe thats okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"