I am new to database system design. After reading many articles, I am confused about the limit up to which we should keep a single table rather than go for sharding or partitioning. I know it is really hard to give a generic answer and that things depend on factors like
size of row
kind of data (strings, blobs, etc)
number of active queries
what kind of queries
indexes
read heavy/write heavy
the expected latency
But then someone asks:
what would you do if you have 1 billion rows and a million rows getting added every day, and the latency needs to be less than 5 ms for 4 read, 1 write and 2 update queries over such a big database?
what would your choice be if you have only 10 million rows but the updates and reads are high? The number of new rows added is not significant. High consistency and low latency are the requirements.
If there are fewer than a million rows and the table grows by only thousands of rows, the choice is simple. But it gets trickier when the choice involves millions or billions of rows.
Note: I have not mentioned a latency number in my question. Please
answer according to a latency number that is acceptable to you. Also, we are talking about structured data.
To make it more concrete, I can add 3 specific questions:
Let's say that you choose a SQL database for Amazon or any e-commerce order management system. The order count is increasing every day by a million, and there are already 1 billion records. Assume that there is no archival of data. There are heavy reads, more than a thousand queries per second, and there are writes as well. The read:write ratio is 100:1.
Let's take an example with smaller numbers now. Say that you choose a SQL database for abc or any e-commerce order management system. The order count is increasing every day by thousands, and there are already 10 million records. Assume that there is no archival of data. There are heavy reads, more than ten thousand queries per second, and there are writes as well. The read:write ratio is 10:1.
3rd example: free goodies distribution. We have 10 million goodies to be distributed, 1 goodie per user. High consistency and low latency are the aim. Assume that 20 million users are already waiting for this free distribution, and once the time starts, all of them will try to get the free goodies.
Note: In the whole question, the assumption is that we will go with
SQL solutions. Also, please ignore it if the provided use case doesn't make sense logically. The aim is to get knowledge in terms of numbers.
Can someone please help with what the benchmarks are? Any practical numbers from a project you are currently working on which show that, for such a big database with this many queries, this is the latency observed? Anything which can help me justify the choice of a number of tables for a certain number of queries at a particular latency.
Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
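To make the difference concrete, here is a minimal sketch (table and column names are hypothetical):

    -- A hypothetical orders table, with a secondary index on customer_id.
    CREATE TABLE orders (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      customer_id BIGINT UNSIGNED NOT NULL,
      status      VARCHAR(20) NOT NULL,
      created_at  DATETIME NOT NULL,
      INDEX idx_customer (customer_id)
    ) ENGINE=InnoDB;

    -- Point query: one index dive, milliseconds regardless of table size.
    SELECT * FROM orders WHERE id = 123456789;

    -- Pathological query: no usable index, so a full scan plus a filesort.
    SELECT * FROM orders WHERE LOWER(status) LIKE '%pend%' ORDER BY created_at;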
"Sharding" is needed when you cannot sustain the number of writes needed on a single server.
Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas.
PARTITIONing (especially in MySQL) has very few uses. More details: Partition
INDEXes are very important for performance.
For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale. (Some other engines have some built-in tools for such.)
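A hedged sketch of what a summary table and its roll-up might look like (all names are made up):

    -- Hypothetical raw "Fact" table: one row per sensor reading.
    CREATE TABLE readings (
      sensor_id INT NOT NULL,
      ts        DATETIME NOT NULL,
      value     FLOAT NOT NULL,
      PRIMARY KEY (sensor_id, ts)
    ) ENGINE=InnoDB;

    -- Summary table: one row per sensor per hour; reports read this instead.
    CREATE TABLE readings_hourly (
      sensor_id INT NOT NULL,
      hr        DATETIME NOT NULL,
      cnt       INT NOT NULL,
      sum_value DOUBLE NOT NULL,
      PRIMARY KEY (sensor_id, hr)
    ) ENGINE=InnoDB;

    -- Periodic roll-up of one hour; re-running the same hour is idempotent.
    INSERT INTO readings_hourly (sensor_id, hr, cnt, sum_value)
    SELECT sensor_id,
           DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'),
           COUNT(*),
           SUM(value)
    FROM readings
    WHERE ts >= '2024-01-01 10:00:00' AND ts < '2024-01-01 11:00:00'
    GROUP BY sensor_id
    ON DUPLICATE KEY UPDATE cnt = VALUES(cnt), sum_value = VALUES(sum_value);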
INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. More on high speed ingestion
Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth; on the other hand, if the client and server are in the same building, latency is under 1ms. If instead you are referring to how long it takes to run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD; 1ms for SSD.
UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
"Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
(my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what. After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But it still has limited uses.
Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
When you get into large databases, they fall into a few different types; each with somewhat different characteristics.
Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
Banking, order processing -- This gets heavy into the ACID features and the need for crafting transactions.
Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
'Find nearest' -- Need a 2D index, either SPATIAL or some of the techniques here
Related
I am developing a site and I'm concerned about the performance.
In the current system there are transactions like adding 10,000 rows to a single table. It currently takes around 0.6 seconds to insert them, which is acceptable.
But I am worrying about what happens if there are 100,000 concurrent users and 1000 of the users want to add 10,000 rows to a single table at once.
How could this impact the performance compared to a single user? How can I improve these transactions if there is a large amount of traffic like in this situation?
When write speed is mandatory, the way we tackle it is getting quicker hard drives.
You mentioned transactions, that means you need your data durable (D of ACID). This requirement rules out MyISAM storage engine or any type of NoSQL so I'll focus the answer towards what goes on with relational databases.
The way it works is this: you get a set number of Input/Output Operations Per Second, or IOPS, per hard drive. Hard drives also have a metric called bandwidth; the metric you are interested in is write speed.
A crude calculation would be: number of MB per second divided by number of IOPS = how much data you can squeeze into one I/O operation.
For mechanical drives, this magic IOPS number is anywhere between 150 and 300 - quite low. Given their bandwidth of about 100 MB/sec, you get a really small number of writes and little bandwidth per write. This is where Solid State Drives kick in - their IOPS number starts at about 5,000 (some even go to 80,000), which is awesome for databases.
Connecting these drives in RAID gives you a super quick storage solution. If you are able to squeeze 10,000 inserts into one transaction, the disk will try to push all 10k inserts through a single I/O operation.
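As a rough illustration of that batching idea (table and column names are hypothetical):

    -- Worst case: autocommit on, so every INSERT is its own transaction
    -- and each one needs its own log flush (its own I/O operation).
    INSERT INTO events (user_id, payload) VALUES (1, 'a');
    INSERT INTO events (user_id, payload) VALUES (2, 'b');
    -- ... repeated 10,000 times

    -- Better: wrap the whole batch in one transaction, so the expensive
    -- flush to disk happens once per 10,000 rows instead of once per row.
    START TRANSACTION;
    INSERT INTO events (user_id, payload) VALUES (1, 'a');
    INSERT INTO events (user_id, payload) VALUES (2, 'b');
    -- ... the remaining 9,998 inserts
    COMMIT;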
Another strategy is partitioning your table and having multiple drives where MySQL stores the data.
This is as far as you can go with a single MySQL installation. There are strategies for distributing data to multiple MySQL nodes etc. but I assume that's out of scope of your question.
TL;DR: you need quicker disks.
If you are trying to scale to inserting millions of rows per second, you have bigger problems. That could add up to trillions of rows per month - hundreds of terabytes before the end of the month. Do you have a big enough disk farm for that? Can you afford enough SSDs for that?
Another thing. With a trillion rows, it is quite challenging to have any indexes other than a simple auto_increment. Without any indexes, how do you plan on accessing the data? A table scan of a trillion rows will take day(s).
Also, you said 100,000 users; you implied that they are connected simultaneously? That, too, is a challenge.
What are the users doing to generate 10K rows all at once? What about the network bandwidth?
Etc. Etc.
If you really have a task like this, Sharding is probably the only solution. And that is in addition to SSDs, RAID, IOPs, etc, etc.
A few things you must consider, from both a software and a hardware point of view:
Go for SSD drives to get better IO.
It is good to have a 10Gb network if you have that much traffic.
Use MySQL 5.6 or above; they made good performance improvements over previous versions.
Use bulk inserts instead of sequential ones, and even better, if you can store all the data in a file, use LOAD DATA INFILE. This can be
around 20 times faster than regular inserts.
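A hedged sketch of both approaches (table, columns, and file path are examples only; LOAD DATA INFILE is also subject to the server's secure_file_priv setting):

    -- Multi-row (bulk) insert instead of one row per statement.
    INSERT INTO orders (id, customer_id, amount) VALUES
      (1, 42,  9.99),
      (2, 43, 19.99),
      (3, 44,  4.50);

    -- Even faster: load a pre-generated CSV file in one pass.
    LOAD DATA INFILE '/tmp/orders.csv'
    INTO TABLE orders
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (id, customer_id, amount);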
MySQL provides multiple ways to scale out. Which way you go depends on your product requirements.
I have a table with very high insert rate and update rate as well as read rate. On average there are about 100 rows being inserted and updated per second. And there are about 1000 selects per second.
The table has about 100 million tuples. It is a relationship table, so it only has about 5 fields. Three fields contain keys, so they are indexed. All the fields are integers.
I am thinking of sharding the data; however, it adds a lot of complexity, but does offer speed. The other alternative is to use InnoDB.
The database runs on a RAID 1 of 256GB SSDs with 32GB of 1600MHz RAM and an i7 3770K overclocked to 4GHz.
The database freezes constantly at peak times, when the load can be as high as 200 rows being inserted or updated and 2,500 selects per second.
Could you guys please point into as what I should do?
Sharding is usually a good idea to distribute table size. Load problems should generally be addressed with a replicated data environment. In your case your problem is a) huge table and b) table level locking and c) crappy hardware.
InnoDB
If you can use one of the keys on your table as a primary key, InnoDB might be a good way to go since he'll let you do row-level locking, which may keep your queries from waiting on each other. A good test might be to replicate your table to a test server and try all your queries against him and see what the performance benefit is. InnoDB has higher resource consumption rates than MyISAM, so keep that in mind.
Hardware
I'm sorry bud, but your hardware is crap for the performance you need. Twitter does 34 writes per second at 2.6k QPS. You can't be doing Twitter's volume and think a beefed up gaming desktop is going to cut it. Buy a $15k Dell with some SSD drives and you'll be able to burst 100k QPS. You're in the big times now. It's time to ditch the start-up gear and get yourself a nice server. You do not want to shard. It will be cheaper to upgrade your hardware, and frankly, you need to.
Sharding
Sharding is awesome for splitting up large tables. And that's it.
Let me be clear about the bad. Developing a sharded architecture sucks. You want to do everything possible to not shard. Upgrade hardware, buy multiple servers and set up replication, optimize your code, but for the love of God, do not shard. You are way below the performance line for sharding. When you're pushing a sustained 30k+ QPS, then we can talk sharding. Until that day, NO.
You can buy a medium-range server ($30k Dell PowerEdge) with 5TB of Fusion IO on 16 cores and 256 GB of RAM and he'll take you all the way to 200k QPS.
But if you refuse to listen to me and are going to shard anyway, then here's what you need to do.
Rule 1: Stay on the Same Shard (ie. Picking a Partition Rule)
Once you shard, you do not want to be accessing data from across multiple shards. You need to pick a partition rule that keeps your query on the same shard as much as possible. Distributing a query (Rule 5) is incredibly painful in distributed data environments.
Rule 2: Build a Shard Map and Replicate it
Your code will need to be able to get to all shards. Create a shard map based on your partition rule that lets your code know where to go to get the data he wants.
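A minimal sketch of what such a shard map could look like, assuming user_id ranges as the partition rule (all table, column, and host names here are hypothetical):

    -- Replicated to every application server / shard so lookups are local.
    CREATE TABLE shard_map (
      shard_id      INT NOT NULL PRIMARY KEY,
      user_id_start BIGINT UNSIGNED NOT NULL,
      user_id_end   BIGINT UNSIGNED NOT NULL,
      host          VARCHAR(255) NOT NULL,
      port          INT NOT NULL DEFAULT 3306
    ) ENGINE=InnoDB;

    INSERT INTO shard_map VALUES
      (1,        1,  5000000, 'db-shard-01.internal', 3306),
      (2,  5000001, 10000000, 'db-shard-02.internal', 3306),
      (3, 10000001, 15000000, 'db-shard-03.internal', 3306);

    -- The query wrapper (Rule 3, below) resolves a user to a shard like this:
    SELECT host, port FROM shard_map
    WHERE 123456 BETWEEN user_id_start AND user_id_end;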
Rule 3: Write a Query Wrapper for your Shards
You do not want to manually decide which shard to go to. Write a wrapper that does it for you. You will thank yourself down the road when you're writing code.
Rule 4: Auto-balance
You'll eventually need to balance your shards to keep performance optimal. Plan for this beforehand and write your code with the intention that you'll have some cron job which balances your shards for you.
Rule 5: Support Distributed Queries
You inevitably will need to break Rule 1. When that happens, you'll need a query wrapper that can pull data from multiple shards and aggregate (bring) it into one place. The more shards you have, the more likely this will need to be multi-threaded. In my shop, we call this a distributed query (ie. a query which runs on multiple shards).
Bad News: There is no code out there for doing distributed queries and aggregating results. Apache Hadoop tries, but he's terrible. So is HiveDB. A good query distributor is hard to architect, hard to write, hard to optimize. This is a problem billion-dollar a year companies deal with. I shit you not, but if you come up with a good wrapper for distributing queries across shards that supports sorting+limit clauses and scales well, you could be a millionaire over night. Selling it for $300,000? You would have a line outside your door a mile long.
My point here is sharding is hard and it is expensive. It takes a lot of work and you want to do everything humanly possible to not shard. If you must, follow the rules.
I am trying to apply for a job which asks for experience handling large-scale data sets using a relational database, like MySQL.
I would like to know which specific skill sets are required for handling large-scale data using MySQL.
Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks, and understanding MySQL internals to see how queries get planned and executed by the database server, in order to increase query performance.
Configuring the MySQL server to be able to handle a lot of concurrent connections and access its data fast. Hardware bottlenecks, and the advantages of using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 array to increase IO performance).
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remiss if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, with a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data so as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually), and reduces the number of writes you have to do in order to create/update a record.
There are different levels of normalization (as @Adam Robinson pointed out in the comments below), which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond 3NF (3rd Normal Form), whose formal definition, if you were to read that Wikipedia link above, will probably make your head hurt. So, in layman's terms (at the risk of dumbing it down too far...), a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table and referenced by the Employees table (from the previous example) by zip_code. This eliminates large sets of duplication within your tables, so any change required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code. A sketch of this appears right after the list.)
Each sub-set of data is moved to its own table and is identified by its own primary key (this is touched on/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but years_employed can be derived by subtracting start_date from end_date. This is important because as end_date increases, so does years_employed, so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1).)
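A minimal sketch of the zip-code example above as actual tables (names are illustrative, not prescriptive):

    CREATE TABLE zip_codes (
      zip_code VARCHAR(10)  NOT NULL PRIMARY KEY,
      city     VARCHAR(100) NOT NULL,
      state    CHAR(2)      NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE companies (
      company_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name       VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE employees (
      employee_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      company_id  INT NOT NULL,
      name        VARCHAR(255) NOT NULL,
      zip_code    VARCHAR(10)  NOT NULL,
      FOREIGN KEY (company_id) REFERENCES companies(company_id),
      FOREIGN KEY (zip_code)   REFERENCES zip_codes(zip_code)
    ) ENGINE=InnoDB;

    -- Changing the city for a zip code is now one write, not one per employee:
    UPDATE zip_codes SET city = 'Springfield' WHERE zip_code = '12345';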
A fully normalized (3NF) database table structure is great if you've got a very heavy write load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is that all your reads become much more expensive, because you (typically) have to run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive, and it is harder to create proper indexes for them when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets.
If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results, you're going to see a significant performance drop. This is less so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (I mean, unless you're prepared for them to fail and have a very fast backup recovery system in place... then use them and go nuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:
You need to know how to process the data in "chunks". That means instead of trying to manipulate the entire data set at once, you need to break it into smaller, more manageable pieces. For example, if you had a table with 1 billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to its knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
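A rough sketch of such a chunked update, assuming an auto-increment id on a hypothetical orders table; the loop itself would live in application code or a stored procedure:

    -- Process the table in 20,000-row slices of the primary key,
    -- committing after each slice so progress is durable and resumable.
    SET @lo := 0;
    SET @hi := 20000;

    -- Repeat the block below until no rows remain to update:
    START TRANSACTION;
    UPDATE orders
    SET    status = 'archived'
    WHERE  id > @lo AND id <= @hi
      AND  created_at < '2020-01-01';
    COMMIT;
    SET @lo := @hi;
    SET @hi := @hi + 20000;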
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance
You can read up on how to handle large datasets with MySQL, but that is not equivalent to having actual experience.
Straight and simple answer: study partitioned databases and find the appropriate MySQL data structures for large-scale datasets, along the lines of a partitioned database architecture.
Possible Duplicate:
How many rows in a database are TOO MANY?
I am building the database scheme for an application that will have users, and each user will have many rows in relation tables such as 'favorites'.
Each user could have thousands of favorites, and there could be thousands of registered users (over time).
Users are never deleted, because that would either leave other entities orphaned or require deleting them too (which isn't desired), so these tables will keep growing forever. I was wondering whether the resulting tables could become too big (e.g. 1kk rows), whether I should worry about this, and whether I should do something like mark old and inactive users as deleted and remove the relations that only affect them (such as the favorites and other preferences).
Is this the way to go? Or can MySQL easily handle 1kk (a million) rows in a table? Is there a known limit? Or is it fully hardware-dependent?
I agree with klennepette and Brian - with a couple of caveats.
If your data is inherently relational, and subject to queries that work well with SQL, you should be able to scale to hundreds of millions of records without exotic hardware requirements.
You will need to invest in indexing, query tuning, and making the occasional sacrifice to the relational model in the interests of speed. You should at least nod to performance when designing tables – preferring integers to strings for keys, for instance.
If, however, you have document-centric requirements, need free text search, or have lots of hierarchical relationships, you may need to look again.
If you need ACID transactions, you may run into scalability issues earlier than if you don't care about transactions (though this is still unlikely to affect you in practice); if you have long-running or complex transactions, your scalability decreases quite rapidly.
I'd recommend building the project from the ground up with scalability requirements in mind. What I've done in the past is set up a test environment populated with millions of records (I used DBMonster, but not sure if that's still around), and regularly test work-in-progress code against this database using load testing tools like Jmeter.
Millions of rows is fine, tens of millions of rows is fine - provided you've got an even remotely decent server, i.e. a few GBs of RAM and plenty of disk space. You will need to learn about indexes for fast retrieval, but in terms of MySQL being able to handle it: no problem.
Here's an example that demonstrates what can be achieved using a well designed/normalised InnoDB schema which takes advantage of InnoDB's clustered primary key indexes (not available with MyISAM). The example is based on a forum with threads, has 500 million rows, and query runtimes of 0.02 seconds while under load.
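As a sketch of the kind of schema being described (a forum with threads and replies; the table and column names here are made up, not taken from the linked example):

    CREATE TABLE threads (
      thread_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      title      VARCHAR(255) NOT NULL,
      created_at DATETIME NOT NULL
    ) ENGINE=InnoDB;

    -- The composite clustered PK stores all replies of a thread physically
    -- together, so fetching one thread's replies is a single range scan.
    CREATE TABLE replies (
      thread_id  INT UNSIGNED NOT NULL,
      reply_id   INT UNSIGNED NOT NULL,
      user_id    INT UNSIGNED NOT NULL,
      body       TEXT NOT NULL,
      created_at DATETIME NOT NULL,
      PRIMARY KEY (thread_id, reply_id)
    ) ENGINE=InnoDB;

    SELECT reply_id, user_id, body
    FROM replies
    WHERE thread_id = 987
    ORDER BY reply_id
    LIMIT 50;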
MySQL and NoSQL: Help me to choose the right one
It is mostly hardware dependant, but having that said MySQL scales pretty well.
I wouldn't worry too much about table size, if it does become an issue later on you can always use partitioning to ease the stress.
At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records doesn't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with a average machine.
The number of records does matter, depending on what your tables look like. It makes a difference whether you have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files grow but don't shrink - for instance with InnoDB. So deleting a lot of rows doesn't help shrink your physical files.
There's a lot to these issues, and as in a lot of cases, the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance does indeed start to degrade. The number of records of course affects performance: MySQL can be slow with large tables. If you hit one million records, you will get performance problems if the indices are not set right (for example, no indices for fields in WHERE clauses or ON conditions in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing performance again, at least to a certain degree. For example, 37signals went from 32 GB to 128 GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
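Two hedged sketches of those ideas, using a hypothetical positions table: either move old rows into a history table, or range-partition by date so old data can be dropped cheaply.

    -- Option 1: move old rows into a history table, then delete them.
    INSERT INTO positions_history
    SELECT * FROM positions WHERE recorded_at < '2023-01-01';
    DELETE FROM positions WHERE recorded_at < '2023-01-01';

    -- Option 2: range-partition by month so a whole month can be dropped at once.
    CREATE TABLE positions (
      vehicle_id  INT NOT NULL,
      recorded_at DATETIME NOT NULL,
      lat DOUBLE NOT NULL,
      lon DOUBLE NOT NULL,
      PRIMARY KEY (vehicle_id, recorded_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(recorded_at)) (
      PARTITION p202301 VALUES LESS THAN (TO_DAYS('2023-02-01')),
      PARTITION p202302 VALUES LESS THAN (TO_DAYS('2023-03-01')),
      PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

    -- Discards a month of data without a slow row-by-row DELETE.
    ALTER TABLE positions DROP PARTITION p202301;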
2GB and about 15M records is a very small database - I've run much bigger ones on a Pentium III (!) and everything still ran pretty fast. If yours is slow, it is a database/application design problem, not a MySQL one.
It's kind of pointless to talk about "database performance"; "query performance" is a better term here. And the answer is: it depends on the query, the data it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and which indexes are going to be used with the EXPLAIN syntax.
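For example (table, columns, and index are hypothetical):

    EXPLAIN
    SELECT order_id, amount
    FROM   orders
    WHERE  customer_id = 42
    ORDER  BY created_at DESC
    LIMIT  10;

    -- Key things to look at in the output:
    --   type : 'ref'/'range' is good, 'ALL' means a full table scan
    --   key  : which index (if any) was chosen
    --   rows : the optimizer's estimate of rows to be examined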
2GB does not really count as a "large" database - it's more of a medium size.
I once was called upon to look at a mysql that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there're always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
Another point to consider is the purpose of the system and how the data is used from day to day.
For example, in a system with GPS monitoring of cars, it is not relevant to query the positions of a car from previous months.
That data can therefore be moved to historical tables for occasional consultation, reducing the execution times of the day-to-day queries.
Performance can degrade with just a few thousand rows if the database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate the correct memory depending on the use, and of course have a good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your query and validation.
For example, I worked with a table of 100,000 drugs which has a generic-name column with more than 15 characters for each drug in that table. I wrote a query to compare the generic names of drugs between two tables; that query takes minutes to run. Whereas, if you compare the drugs using the drug index, an id column (as said above), it takes only a few seconds.
Database size DOES matter, in terms of bytes and the table's row count. You will notice a huge performance difference between a light database and a blob-filled one. Once my application got stuck because I put binary images inside fields instead of keeping the images in files on disk and putting only the file names in the database. Iterating over a large number of rows, on the other hand, is not free either.
No, it doesn't really matter. MySQL can process on the order of 7 million rows per second, so you can scale it quite a bit.
Query performance mainly depends on the number of records it needs to scan, indexes plays a high role in it and index data size is proportional to number of rows and number of indexes.
Queries with indexed field conditions matching a full value are generally returned in about 1ms, but starts-with, IN, BETWEEN, and obviously contains conditions can take more time as there are more records to scan.
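A small illustration, assuming a hypothetical users table with an index on email:

    CREATE INDEX idx_email ON users (email);

    -- Index-backed lookups: equality and prefix (starts-with) are fast.
    SELECT id FROM users WHERE email = 'a@example.com';
    SELECT id FROM users WHERE email LIKE 'a%';

    -- Not index-backed: a leading wildcard ("contains") forces a scan,
    -- so runtime grows with the number of rows.
    SELECT id FROM users WHERE email LIKE '%example.com';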
Also, you will face a lot of maintenance issues with DDL: ALTER and DROP will be slow and difficult with more live traffic, even for adding an index or new columns.
Generally it is advisable to split the database into as many clusters as required (500GB would be a general benchmark; as others have said, it depends on many factors and can vary based on use cases). That way you get better isolation and the independence to scale specific clusters (more suited in the case of B2B).