Disk size converting CSV to MySQL (MyISAM)

Disk size converting CSV to MySQL (MyISAM) - mysql

I am looking for a ballpark estimate of the size of the database tables after I have converted the current CSV files to MyISAM tables. I know the file size of the CSV's and I need an estimate of the filesize of the MyISAM. I guess it will be bigger, because of indexes (just 1 simple index is enough), but how much? Is it about 2 times, 10 times or 50 times?

Presumably you're looking for capacity-planning size data to give to your server wrangler.
If your CSV files total less than about 200-300 megabytes in size, just tell her "two gigabytes or less."
If your CSV files are larger than that, you are definitely going to need to run some trials to figure this out. Notice that indexes aren't necessarily smaller than the data they index, and that they can grow larger faster than linearly.
If you are wrangling with your server wrangler about any amount of disk space less than a couple of gigabytes, then it's unfortunate evidence that your organization is short of money and talent.
In this case you're going to have to prove it to them: load the database on your personal computer, measure the size of the myISAM files, and tell them what you came up with.

Related

MySQL - Where can I find metrics on the performance of Blob vs file system?

I know and understand that there are performance hits in storing blob data in the database, but the blob portion of the data is going to be rarely retrieved/viewed, it is for smaller data (the vast majority under 256k with a max of 10mb), it is not going to be used by most customers, and the total rows is expected to be relatively low, very likely under half a million, if not less. Also some of the data is dynamic and can change for some users, as in it's not static images. In other words we're at the edge of whether or not it's worth it.
I keep reading that it's better to store in the file system but I can't find actual metrics that show the performance difference, just people repeating each other without any concrete proof or metrics. For us it may be worth the performance cost in exchange for being fully ACID as well as guaranteeing that all our backups are completely synched.
That being said does anyone know or have any real world metrics to show the performance difference between storing items as blobs vs in the file system. I'm trying to understand if the performance penalty is worth it or not rather than blindly following the general rule of thumb and after spending at least 2-3 hours I've yet to be able to see anyone show any actual numbers...
UPDATE: MySQL with an InnoDB table. The actual data table has a link to the blob data table, so the blob is not in the main able and is only retrieved when need be to avoid any I/O issues. In other words instead of the path to the data on the filesystem, it's an ID to another table with only blobs. How does that compare in terms of performance? Is it 25% worse? Is it 100%? Is it 200-500%? Is it 1000%?
If the cost is only 100%-200% it is probably worth it for us because again the data is rarely retrieved. So even if we had say 10,000 concurrent users, maybe only 50 users would be retrieving their blob data concurrently at best. Yes the data is specific to each user, it isn't images.

To BLOB or not to BLOB

I am in the process of writing a web app backed up by a MySQL database where one of the tables has a potential for getting very large (order of gigabytes) with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests thus far it has reached a size of 289 bytes but to be on the safe side I want to design for a maximum size of 1 kb. Currently I am storing that column as a MySQL MediumBlob field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs vs other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (789 if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also got the notion that if a row has more than one BLOB (which my table does not) per column then the "elsewhere" is a different location for each BLOB. That apart I have got the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this - in light of my BLOB size and the large potential size of the table should I bother with a blob at all? Also, if I use some form of inrow storage instead will that not have an adverse effect on the maximum number of rows that the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.

I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (maximum size of a table) is 64TBytes. Using variable size types to move that storage into a separate file would certainly change the upper limit on number of rows but 64TBytes is quite a lot of space so the ratio might be very small.
Having a column with a 1KByte string type that is stored inside the table seems like a viable solution since it's also very small compared to 64TBytes. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64TByte limit might be pushed down by the the maximum file size for the OS you're using. You can always link several files together to get more space for your table but then it's starting to get a bit more messy.

if the BLOB data is more then 250kb it is not worth it. In your case i wouldn't bother myself whit BLOB'n. Read this

MySQL InnoDB big table: to shard or to add more RAM?

Folks, I'm a developer of a social game and there are already 700k players in the game, and about 7k new players are registered every day, about 5k players are constantly online.
The DB server is running on a pretty powerful hardware: 16 cores CPU, 24 Gb RAM, RAID-10 with BBU built on 4 SAS disks. I'm using Percona server(patched MySQL-5.1) and currently InnoDB buffer pool is 18Gb(although according to innotop only a few free buffers available). The DB server is performing pretty well(2k QPS, iostat %util is 10-15%, almost always 0 processes in "b" state in vmstat, loadavg is 5-6). However from time to time(every few minutes) I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
There is one big InnoDB table in the MySQL database which occupies the most space. It has about 300 millions rows, it's size is about 20 Gb. Of course, this table is gradually growing... I'm starting to worry it's affecting the overall performance of the database in a negative way. In the nearest future I'll have to do something about it, but I'm not sure what exactly.
Basically question boils down to whether to shard or simply add more RAM. The latter is simpler, of course. Looks like I can add up to 256 Gb RAM. But the question is whether I should invest more time implementing sharding instead since it's more scalable?

Sharding seems reasonable if you need to have all 300m+ rows. It may be a pain to change now but when your table grows and grows there will be a point when no amount of ram will solve your problem. With such massive amounts of data it may be worth using something like couch db as you could store documents of data rather than rows ie 1 document could contain all records for an individual user.

Sounds to me like your main database table could use some normalization. Does all your information belong in that one table, or can you split it out to smaller tables? Normalization may invoke a small performance hit now, but as your table grows, that will be overwhelmed by the extra processing involved in accessing a huge, monolithic table.

I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
Quote of a comment: Database is properly normalized. The database has many tables, one of them is really huge and has nothing to do with normalization.
When im reading this i would say it has to do with your queries.. has nothing to do with your hardware.. Average companies would dream about kind of server you have!
If you write bad queries doesn't matter how good your tables are normalized, it will be slow.
maybe you got something about this, its almost a similar question with an answer(database is slow and stuff like that).
Also thought about archiving some stuff? For example from those 300 million it started with ID 1 so is that ID still get used? if not why not archive it to a other database or table(i would recommend database). I also believe that not every 700k users are logged in every day(if you got respect! but i don't believe that).
You also said 'This table contains player specific items' what kind of specific items?
Another question, can you post some of your 'slow' queries?
You also considered about a caching system from some data? that maybe changed once a month, like gear other game stuff?

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we’re concerned about the future when the amount of data will be considerably larger. Right now we’re considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I’m now wondering, what would be the best way to do it.
The options I’m currently aware of are
Using MySQL Partitioning that comes with version 5.1
Using some kind of third party library that encapsulates the partitioning of the data (like hibernate shards)
Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in it's own database on a separate server. Currently the size ot this database is about 1.9 TB, the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7- 4830
256 GB RAM
Performance is fine with this setup.

You will definitely start to run into issues on that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test is to put that table on another machine with less RAM and see how poor it performs.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
This is incorrect. Partioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is whole, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions where it is absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions in order to fulfill itself, the combined 5 GB table is a lot easier for MySQL to deal with than a monster 42 GB version.
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
I've heard that there is still some buggyness with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
Hope that helps...good luck!

If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect 1 query aggregate reporting. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.

This is a great example of what can MySql partitioning do in a real-life example of huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hoping it will be helpful for your case.

A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.

I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any Database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.

First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.

What does the big table do.
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
Splitting by Column is good for situations where the row size contains large text fields or BLOBS. If you've got a table with (for example) a user image and a huge block of text, you could farm the image into a completely different table. (on a different machine)
You break normalisation here, but I don't think it would cause too many problems.

You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.

How big can a MySQL database get before performance starts to degrade

At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?

The physical database size doesn't matter. The number of records don't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.

In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with a average machine.
The number of records do matter, depending of how your tables look like. It's a difference to have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files on grow, but don't shrink, for instance with innodb. So deleting a lot of rows, doesn't help to shrink your physical files.
There's a lot to this issues and as in a lot of cases the devil is in the details.

The database size does matter. If you have more than one table with more than a million records, then performance starts indeed to degrade. The number of records does of course affect the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example 37 signals went from 32 GB RAM to 128GB of RAM for the Basecamp database server.

I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.

I would focus first on your indexes, than have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.

2GB and about 15M records is a very small database - I've run much bigger ones on a pentium III(!) and everything has still run pretty fast.. If yours is slow it is a database/application design problem, not a mysql one.

It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
2GB does not really count as a "large" database - it's more of a medium size.

I once was called upon to look at a mysql that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there're always dimensions above and below the one you naturally suspect.

Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.

A point to consider is also the purpose of the system and the data in the day to day.
For example, for a system with GPS monitoring of cars is not relevant query data from the positions of the car in previous months.
Therefore the data can be passed to other historical tables for possible consultation and reduce the execution times of the day to day queries.

Performance can degrade in a matter of few thousand rows if database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate correct memory depending on the use and of course have good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.

It depends on your query and validation.
For example, i worked with a table of 100 000 drugs which has a column generic name where it has more than 15 characters for each drug in that table .I put a query to compare the generic name of drugs between two tables.The query takes more minutes to run.The Same,if you compare the drugs using the drug index,using an id column (as said above), it takes only few seconds.

Database size DOES matter in terms of bytes and table's rows number. You will notice a huge performance difference between a light database and a blob filled one. Once my application got stuck because I put binary images inside fields instead of keeping images in files on the disk and putting only file names in database. Iterating a large number of rows on the other hand is not for free.

No it doesnt really matter. The MySQL speed is about 7 Million rows per second. So you can scale it quite a bit

Query performance mainly depends on the number of records it needs to scan, indexes plays a high role in it and index data size is proportional to number of rows and number of indexes.
Queries with indexed field conditions along with full value would be returned in 1ms generally, but starts_with, IN, Between, obviously contains conditions might take more time with more records to scan.
Also you will face lot of maintenance issues with DDL, like ALTER, DROP will be slow and difficult with more live traffic even for adding a index or new columns.
Generally its advisable to cluster the Database into as many clusters as required (500GB would be a general benchmark, as said by others it depends on many factors and can vary based on use cases) that way it gives better isolation and gives independence to scale specific clusters (more suited in case of B2B)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008