in mysql, does quantity of data in one table affect performance - mysql

Setting up a new database that has a comments table. I've been told to expect this table to get extremely large. I'm wondering if there is any particular reason why I wouldn't want to keep this table in the same database as the rest of the data for the site.

Quantity of data will certainly affect performance in any RDBMS, however, there's no reason that this table should exist in a separate database on the same server. If the table truly will become very large with a lot of insert activity, then you might want to consider an ETL process that keeps an indexed copy for fast selection (mostly because indexes can negatively affect inserts despite the performance gain on selects)

Keep the table in the same database for now, but it is well known that insert speed slows as the number of records increases.
There are some options to consider if the performance becomes unacceptable, for instance, if the data can be partitioned, split it across multiple tables in separate databases.

Related

Mysql what if too much data in a table

Data is increasing in one table everyday, it might lower the performance . I was thinking if I can create a trigger which move table A into A1 and create a new table A every a period of time, so that insert or update could be faster in table A. Is this the right way to save performance ? If not, what should I do ?
(for example, insert or update 1000 rows per second in table A, how is the performance after 3 years ?)
We are designing softwares for a factory. There are product lines which pcb boards are made on. We need to insert almost 60 pcb records per second for years. (1000 rows seem to be exaggerated)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it all together (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
Very Huge data may decrease the performance of server, So there is a way to handle this :
1) you have to create another table to store archive data ( old data ) using Archive storage mechanism . ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html )
2) create MySQL job/scheduler to move older records to archive table. schedule in timeslot
when server is maximum idle.
3) after moving older records to archive table, re-index the original table.
this will serve the purpose of performance.
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does cause more mysql impact especially at row insert time. Assuming there are more reads than inserts, this is an advantage because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

MySQL Performance of one vs. many tables

I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heart of some projects where tables of that size would become less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split rows. Main question: However, would this not solve the issue of having reduced performance due to such a large bunch of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself is not diminishable.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.

Query performance increase from deleting rows in SQL database?

I have a database with a single table that keeps track of user state. When I'm done handling the row, its no longer necessary to keep it in the database and can be deleted.
Now lets say I wanted to keep track of the row instead of deleting it (for historical purposes, analytics, etc). Would it be better to:
Leave the data in the same table and mark the row as 'used' (with an extra column or something like that)
Delete the row from the table and insert it into a separate table that is created only for historical purposes
For choice #1, I wonder if leaving the unnecessary rows in the database will start to affect query performance. (All of my queries are on indexed columns, so maybe this doesn't matter?)
For choice #2, I wonder if the constant deleting of rows will end up causing problems such as fragmentation?
Query performance will be better in the long run:
What is happening with forever inserts:
The table grows, indexes grow, index performance (lookup) is decreases with the size of the table, especially insert performance is hurt.
What is happening with delete:
Table pages get fragmented, so the deleted space is not re-used 100% as expected, more near 50% in MySQL. So the table still grows to about twice the size you might expect for your amount of data. The index gets fragmented and becomes lob sided: It contains your new data but also the structure for your old data. It depends on the structure of your data on how bad this gets. This situation however stabilizes at a certain performance. This performance point has 2 benefits:
1) The table is more limited in size, so potential full table scans are faster
2) Your performance is predictable.
Due to the fragmentation however this performance point is not equal to about twice your data amount, it tends to be a bit worse (benchmark it to see yourself). The benefit of the delete scenario is however since you have a smaller data set, that you might be able to rebuild your index once every reasonable period, thus improving your performance.
Alternatives
There are two alternatives you can look at to improve performance:
Switch to MariaDB: This gains about 8% performance on large datasets (my observation, dataset just about 200GB compressed data)
Look at partitioning: If you have a handy partitioning parameter, you can create a series of "small tables" for you and prevent logic for delete, rebuild and historic data management. This might give you the best performance profile.
If most of the table is flagged as deleted, you will be stumbling over them as you look for the non-deleted records. Adding is_deleted to many of the indexes is likely to help.
If you are deleting records purely on age, then PARTITION BY RANGE(TO_DAYS(...)) is an excellent way to build the table. The DROP TABLE is instantaneous and the ALTER TABLE ... REORGANIZE ... to create a new week (or month or ...) partition is also instantaneous. See my blog for details.
If you "move" records to another table, then the table will not shrink very fast due to fragmentation. If you have enough disk space, this is not a bug deal. If some queries need to see both current and archived records, use UNION ALL; it is pretty easy and efficient.

MySQL performance: many rows and columns (MyISAM)

Since I'm still in the beginning of my site design I figured now's a good time to ask this.
I know that one of the ways to optimize MySQL queries is to split your rows into seperate tables, however, that does have a few comfort issues.
What I'm considering is this: would querying a table consisting of around 1'000'000 rows and 150 columns using excellently designed indexes and getting only the needed columns from each query result in a much higher server load than splittiing the table into multiple ones, resulting in less collumns?
Big blob tables are a anti-pattern, never use them.
Normalized tables will run much much faster than a single blob.
InnoDB is optimized for many small tables that need to be joined.
Using a normalized table will save you many headaches besides:
Your data will be smaller, so more of it fits in memory.
You only store data in one place, so it cannot end up with inconsistent data.
MySQL only allows you to use one index per select per table, multiple tables means you get to use more indexes and get more speed.
Triggers on tables execute much faster.
Normalized tables are easier to maintain.
You have less indexes per table, so inserts are faster.
Indexes are smaller (fewer rows) and narrows (less columns) and will run much faster as a result.
If the data is static, you can pack the tables for greater efficiency. Here is the page in the reference manual

How to structure an extremely large table

This is more a conceptual question. It's inspired from using some extremely large table where even a simple query takes a long time (properly indexed). I was wondering is there is a better structure then just letting the table grow, continually.
By large I mean 10,000,000+ records that grows every day by something like 10,000/day. A table like that would hit 10,000,000 additional records every 2.7 years. Lets say that more recent records are accesses the most but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and lets say the query is expected to pull only a few records from a three year span, I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)
There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.
From a RDBMS management perspective, having the data divided into seperate partitions allows operations to be performed at a partition level, backup / restore / indexing etc. This helps reduce downtimes as well as allow for far faster archiving by just removing an entire partition at a time.
There are also non relational storage mechanisms such as nosql, map reduce etc, but ultimately how it is used, loaded and data is archived become a driving factor in the decision of the structure to use.
10 million rows is not that large in the scale of large systems, partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partition in MySQL -- see, in its manual : Chapter 17. Partitioning
There is good scalability approach for this tables. Union is right way, but there is better way.
If your database engine supports "semantical partitioning", then you can split one table into partitions. Each partition will cover some subrange (say 1 partition per year). It will not affect anything in SQL syntax, except DDL. And engine will transparently run hidden union logic and partitioned index scans with all parallel hardware it has (CPU, I/O, storage).
For example Sybase allows up to 255 partitions, as it is limit of union. But you will never need keyword "union" in queries.
Often the best plan is to have one table and then use database partioning.
Or you can archive data and create a view for the archived and combined data and keep only the active data in the table most functions are referencing. You will have to have a good archiving stategy though (which is automated) or you can lose data or not get things done efficiently in moving them. This is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.