MYISAM sharding vs using InnoDB - mysql

I have a table with very high insert rate and update rate as well as read rate. On average there are about 100 rows being inserted and updated per second. And there are about 1000 selects per second.
The table has about 100 million tuples. It is a relationship table so it only has about 5 fields. Three fields contain keys so they are indexed. All the fields are of integers.
I am thinking of sharding the data, however, it adds a lot of complexity, but does offer speed. The other alternative is to use innodb.
The database runs on a raid 1 of 256GB ssd with 32GB 1600mhz of RAM and i7 3770k over clocked to 4Ghz
The database freezes constantly at peak times where the queries can be as high as 200 rows being inserted or updated and 2500 selects per second
Could you guys please point into as what I should do?

Sharding is usually a good idea to distribute table size. Load problems should generally be addressed with a replicated data environment. In your case your problem is a) huge table and b) table level locking and c) crappy hardware.
InnoDB
If you can use one of the keys on your table as a primary key, InnoDB might be a good way to go since he'll let you do row-level locking which may reduce your queries from waiting on each other. A good test might be to replicate your table to a test server and try all your queries against him and see what the performance benefit is. InnoDB has a higher resource consumption rates then MyISAM, so keep that in mind.
Hardware
I'm sorry bud, but your hardware is crap for the performance you need. Twitter does 34 writes per second at 2.6k QPS. You can't be doing Twitter's volume and think a beefed up gaming desktop is going to cut it. Buy a $15k Dell with some SSD drives and you'll be able to burst 100k QPS. You're in the big times now. It's time to ditch the start-up gear and get yourself a nice server. You do not want to shard. It will be cheaper to upgrade your hardware, and frankly, you need to.
Sharding
Sharding is awesome for splitting up large tables. And that's it.
Let me be clear about the bad. Developing a sharded architecture sucks. You want to do everything possible to not shard. Upgrade hardware, buy multiple servers and set up replication, optimize your code, but for the love of God, do not shard. You are way below the performance line for sharding. When your pushing sustained 30k+ QPS, then we can talk sharding. Until that day, NO.
You can buy a medium-range server ($30k Dell PowerEdge) with 5TB of Fusion IO on 16 cores and 256 GB of RAM and he'll take you all the way to 200k QPS.
But if you refuse to listen to me and are going to shard anyway, then here's what you need to do.
Rule 1: Stay on the Same Shard (ie. Picking a Partition Rule)
Once you shard, you do not want to be accessing data from across multiple shards. You need to pick a partition rule that keeps your query on the same shard as much as possible. Distributing a query (rule 4) is incredibly painful in distributed data environments.
Rule 2: Build a Shard Map and Replicate it
Your code will need to be able to get to all shards. Create a shard map based on your partition rule that lets your code know where to go to get the data he wants.
Rule 3: Write a Query Wrapper for your Shards
You do not want to manually decide which shard to go to. Write a wrapper that does it for you. You will thank yourself down the road when you're writing code.
Rule 4: Auto-balance
You'll eventually need to balance your shards to keep performance optimal. Plan for this before-hand and write your code with the intention that you'll have some kron job which balances your shards for you.
Rule 4: Support Distributed Queries
You inevitably will need to break Rule 1. When that happens, you'll need a query wrapper that can pull data from multiple shards and aggregate (bring) it into one place. The more shards you have, the more likely this will need to be multi-threaded. In my shop, we call this a distributed query (ie. a query which runs on multiple shards).
Bad News: There is no code out there for doing distributed queries and aggregating results. Apache Hadoop tries, but he's terrible. So is HiveDB. A good query distributor is hard to architect, hard to write, hard to optimize. This is a problem billion-dollar a year companies deal with. I shit you not, but if you come up with a good wrapper for distributing queries across shards that supports sorting+limit clauses and scales well, you could be a millionaire over night. Selling it for $300,000? You would have a line outside your door a mile long.
My point here is sharding is hard and it is expensive. It takes a lot of work and you want to do everything humanly possible to not shard. If you must, follow the rules.

Related

Limit before sharding or partitioning a table

I am new to the database system design. After reading many articles, I am really getting confused on what is the limit till which we should have 1 table and not go for sharding or partitioning. I know that it is really hard to provide generic answer and things depend on factors like
size of row
kind of data (strings, blobs, etc)
active queries number
what kind of queries
indexes
read heavy/write heavy
the latency expected
But when someone ask that
what will you do if you have 1 billion data and million rows getting added everyday. The latency needs to be less than 5 ms for 4 read, 1 write and 2 update queries over such a big database, etc.
what will your choice if you have only 10 million rows but the updates and reads are high. The number of new rows added are not significant. High consistency and low latency are the requirement.
If the rows are less that a million and the row size is increasing by thousands then the choice is simple. But it gets trickier when the choice involves for million or billion of rows.
Note: I have not mentioned the latency number in my question. Please
answer according to the latency number which is acceptable to you. Also, we are talking about structured data.
I am not sure but I can add 3 specific questions:
Lets say that you choose sql database for amazon or any ecommerce order management system. The order numbers are increasing everyday by million. There are already 1 billion record. Now, assuming that there is no archival of data. There are high read queries more than thousand queries per second. And there are writes as well. The read:write ratio is 100:1
Let's take an example which smaller number now. Lets say that you choose a sql database for abc or any ecommerce order management system. The order numbers are increasing everyday by thousands. There are already 10 million record. Now, assuming that there is no archival of data. There are high read queries more than ten thousand queries per second. And there are writes as well. The read:write ratio is 10:1
3rd example: Free goodies distribution. We have 10 million goodies to be distributed. 1 goodies per user. High consistency and low latency is the aim. Lets assume that 20 million users already waiting for this free distribution and once the time starts, all of them will try to get the free goodies.
Note: In the whole question, the assumption is that we will go with
SQL solutions. Also, please neglect if the provided usecase doesn't make sense logically. The aim is to get the knowledge in terms of numbers.
Can someone please help with what are the benchmarks. Any practical numbers from the project you are currently working in which can tell that for such a big database with these many queries, this is the latency observed,. Anything which can help me justify the choice for the number of tables for the certain number of queries for particular latency.
Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
"Sharding" is needed when you cannot sustain the number of writes needed on a single server.
Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas.
PARTITIONing (especially in MySQL) has very few uses. More details: Partition
INDEXes are very important for performance.
For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale. (Some other engines have some built-in tools for such.)
INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. More on high speed ingestion
Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth. On the other hand, if the client and server are in the same building, latency is under 1ms. On another hand, if you are referring to how long it takes too run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD disk; 1ms for SSD.
UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
"Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
(my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what. After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But it still has limited uses.
Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
When you get into large databases, they fall into a few different types; each with somewhat different characteristics.
Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
Banking, order processing -- This gets heavy into the ACID features and the need for crafting transactions.
Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
'Find nearest' -- Need a 2D index, either SPATIAL or some of the techniques here

How to increase database performance if there is 0.1M traffic

I am developing a site and I'm concerned about the performance.
In the current system there are transactions like adding 10,000 rows to a single table. It doesn't matter it took around 0.6 seconds to insert.
But I am worrying about what happens if there are 100,000 concurrent users and 1000 of the users want to add 10,000 rows to a single table at once.
How could this impact the performance compared to a single user? How can I improve these transactions if there is a large amount of traffic like in this situation?
When write speed is mandatory, the way we tackle it is getting quicker hard drives.
You mentioned transactions, that means you need your data durable (D of ACID). This requirement rules out MyISAM storage engine or any type of NoSQL so I'll focus the answer towards what goes on with relational databases.
The way it works is this: you get a set number of Input Output Operations per Second or IOPS per hard drive. Hard drives also have a metric called bandwith. The metric you are interested in is write speed.
Some crude calculation here would be this - Number of MB per second divided by number of IOPS = how much data you can squeeze per IOPS.
For mechanical drives, this magic IOPS number is anywhere between 150 and 300 - quite low. Given their bandwith of about 100 MB/sec, you get a real small number of writes and bandwith per write. This is where Solid State Drives kick in - their IOPS number starts at about 5 000 (some even go to 80 000) which is awesome for databases.
Connecting these drives in RAID gives you a super quick storage solution. If you are able to squeeze 10 000 inserts into one transaction, the disk will try to squeeze all 10k inserts through 1 IOPS.
Another strategy is partitioning your table and having multiple drives where MySQL stores the data.
This is as far as you can go with a single MySQL installation. There are strategies for distributing data to multiple MySQL nodes etc. but I assume that's out of scope of your question.
TL;DR: you need quicker disks.
If you are trying to scale for inserting millions of rows per second, you have bigger problems. That could add up to trillions of rows per month. That's hundreds of terabytes before the end of the month. Do you have a big enough disk farm for that? Can you afford enough SSDs for that.
Another thing. With a trillion rows, it is quite challenging to have any indexes other than a simple auto_increment. Without any indexes, how do you plan on accessing the data? A table scan of a trillion rows will take day(s).
Also, you said 100,000 users; you implied that they are connected simultaneously? That, too, is a challenge.
What are the users doing to generate 10K rows all at once? What about the network bandwidth?
Etc. Etc.
If you really have a task like this, Sharding is probably the only solution. And that is in addition to SSDs, RAID, IOPs, etc, etc.
Few stuff that you must consider both from software and hardware point.
Things must consider :
Go for SSD drive to have better IO.
Good to have 10GB of network, if you have that huge traffic.
Use mysql 5.6 or above, they made good improvement on performance over previous version.
Use bulk inserts, instead of sequential one, and even better if you can store all data in a file and use load_data_infile. This would be
20 times faster then regular insert.
Mysql provide multiple ways to scaleout. Its depend upon on your product requirement which way you want to go.

MySQL InnoDB big table: to shard or to add more RAM?

Folks, I'm a developer of a social game and there are already 700k players in the game, and about 7k new players are registered every day, about 5k players are constantly online.
The DB server is running on a pretty powerful hardware: 16 cores CPU, 24 Gb RAM, RAID-10 with BBU built on 4 SAS disks. I'm using Percona server(patched MySQL-5.1) and currently InnoDB buffer pool is 18Gb(although according to innotop only a few free buffers available). The DB server is performing pretty well(2k QPS, iostat %util is 10-15%, almost always 0 processes in "b" state in vmstat, loadavg is 5-6). However from time to time(every few minutes) I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
There is one big InnoDB table in the MySQL database which occupies the most space. It has about 300 millions rows, it's size is about 20 Gb. Of course, this table is gradually growing... I'm starting to worry it's affecting the overall performance of the database in a negative way. In the nearest future I'll have to do something about it, but I'm not sure what exactly.
Basically question boils down to whether to shard or simply add more RAM. The latter is simpler, of course. Looks like I can add up to 256 Gb RAM. But the question is whether I should invest more time implementing sharding instead since it's more scalable?
Sharding seems reasonable if you need to have all 300m+ rows. It may be a pain to change now but when your table grows and grows there will be a point when no amount of ram will solve your problem. With such massive amounts of data it may be worth using something like couch db as you could store documents of data rather than rows ie 1 document could contain all records for an individual user.
Sounds to me like your main database table could use some normalization. Does all your information belong in that one table, or can you split it out to smaller tables? Normalization may invoke a small performance hit now, but as your table grows, that will be overwhelmed by the extra processing involved in accessing a huge, monolithic table.
I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
Quote of a comment: Database is properly normalized. The database has many tables, one of them is really huge and has nothing to do with normalization.
When im reading this i would say it has to do with your queries.. has nothing to do with your hardware.. Average companies would dream about kind of server you have!
If you write bad queries doesn't matter how good your tables are normalized, it will be slow.
maybe you got something about this, its almost a similar question with an answer(database is slow and stuff like that).
Also thought about archiving some stuff? For example from those 300 million it started with ID 1 so is that ID still get used? if not why not archive it to a other database or table(i would recommend database). I also believe that not every 700k users are logged in every day(if you got respect! but i don't believe that).
You also said 'This table contains player specific items' what kind of specific items?
Another question, can you post some of your 'slow' queries?
You also considered about a caching system from some data? that maybe changed once a month, like gear other game stuff?

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we’re concerned about the future when the amount of data will be considerably larger. Right now we’re considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I’m now wondering, what would be the best way to do it.
The options I’m currently aware of are
Using MySQL Partitioning that comes with version 5.1
Using some kind of third party library that encapsulates the partitioning of the data (like hibernate shards)
Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in it's own database on a separate server. Currently the size ot this database is about 1.9 TB, the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7- 4830
256 GB RAM
Performance is fine with this setup.
You will definitely start to run into issues on that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test is to put that table on another machine with less RAM and see how poor it performs.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
This is incorrect. Partioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is whole, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions where it is absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions in order to fulfill itself, the combined 5 GB table is a lot easier for MySQL to deal with than a monster 42 GB version.
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
I've heard that there is still some buggyness with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
Hope that helps...good luck!
If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect 1 query aggregate reporting. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.
This is a great example of what can MySql partitioning do in a real-life example of huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hoping it will be helpful for your case.
A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.
I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any Database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.
What does the big table do.
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
Splitting by Column is good for situations where the row size contains large text fields or BLOBS. If you've got a table with (for example) a user image and a huge block of text, you could farm the image into a completely different table. (on a different machine)
You break normalisation here, but I don't think it would cause too many problems.
You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.

How big can a MySQL database get before performance starts to degrade

At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records don't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with a average machine.
The number of records do matter, depending of how your tables look like. It's a difference to have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files on grow, but don't shrink, for instance with innodb. So deleting a lot of rows, doesn't help to shrink your physical files.
There's a lot to this issues and as in a lot of cases the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance starts indeed to degrade. The number of records does of course affect the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example 37 signals went from 32 GB RAM to 128GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, than have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
2GB and about 15M records is a very small database - I've run much bigger ones on a pentium III(!) and everything has still run pretty fast.. If yours is slow it is a database/application design problem, not a mysql one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
2GB does not really count as a "large" database - it's more of a medium size.
I once was called upon to look at a mysql that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there're always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
A point to consider is also the purpose of the system and the data in the day to day.
For example, for a system with GPS monitoring of cars is not relevant query data from the positions of the car in previous months.
Therefore the data can be passed to other historical tables for possible consultation and reduce the execution times of the day to day queries.
Performance can degrade in a matter of few thousand rows if database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate correct memory depending on the use and of course have good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your query and validation.
For example, i worked with a table of 100 000 drugs which has a column generic name where it has more than 15 characters for each drug in that table .I put a query to compare the generic name of drugs between two tables.The query takes more minutes to run.The Same,if you compare the drugs using the drug index,using an id column (as said above), it takes only few seconds.
Database size DOES matter in terms of bytes and table's rows number. You will notice a huge performance difference between a light database and a blob filled one. Once my application got stuck because I put binary images inside fields instead of keeping images in files on the disk and putting only file names in database. Iterating a large number of rows on the other hand is not for free.
No it doesnt really matter. The MySQL speed is about 7 Million rows per second. So you can scale it quite a bit
Query performance mainly depends on the number of records it needs to scan, indexes plays a high role in it and index data size is proportional to number of rows and number of indexes.
Queries with indexed field conditions along with full value would be returned in 1ms generally, but starts_with, IN, Between, obviously contains conditions might take more time with more records to scan.
Also you will face lot of maintenance issues with DDL, like ALTER, DROP will be slow and difficult with more live traffic even for adding a index or new columns.
Generally its advisable to cluster the Database into as many clusters as required (500GB would be a general benchmark, as said by others it depends on many factors and can vary based on use cases) that way it gives better isolation and gives independence to scale specific clusters (more suited in case of B2B)