MySQL performance: many rows and columns (MyISAM) - mysql

Since I'm still in the beginning of my site design I figured now's a good time to ask this.
I know that one of the ways to optimize MySQL queries is to split your rows into separate tables; however, that does have a few comfort issues.
What I'm considering is this: would querying a table consisting of around 1,000,000 rows and 150 columns, using excellently designed indexes and getting only the needed columns from each query, result in a much higher server load than splitting the table into multiple ones with fewer columns?

Big blob tables are an anti-pattern; never use them.
Normalized tables will run much much faster than a single blob.
InnoDB is optimized for many small tables that need to be joined.
Using a normalized table will save you many headaches besides:
Your data will be smaller, so more of it fits in memory.
You only store data in one place, so it cannot end up with inconsistent data.
MySQL generally uses only one index per table in a query, so multiple tables means you get to use more indexes and get more speed.
Triggers on tables execute much faster.
Normalized tables are easier to maintain.
You have less indexes per table, so inserts are faster.
Indexes are smaller (fewer rows) and narrower (fewer columns) and will run much faster as a result.
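As a rough illustration (table and column names here are invented), splitting a wide table so that rarely used columns live in their own table, joined back by the primary key, might look like this:

    -- core data queried on every request
    CREATE TABLE users (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        email VARCHAR(255) NOT NULL,
        created_at DATETIME NOT NULL,
        UNIQUE KEY uk_email (email)
    ) ENGINE=InnoDB;

    -- rarely used profile columns moved to their own table, 1:1 with users
    CREATE TABLE user_profiles (
        user_id INT UNSIGNED NOT NULL PRIMARY KEY,
        bio TEXT,
        website VARCHAR(255),
        CONSTRAINT fk_profile_user FOREIGN KEY (user_id) REFERENCES users (id)
    ) ENGINE=InnoDB;

    -- fetch only what a given page needs; join in the wide data only when required
    SELECT u.id, u.email, p.website
    FROM users AS u
    JOIN user_profiles AS p ON p.user_id = u.id
    WHERE u.email = 'someone@example.com';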

If the data is static, you can pack the tables for greater efficiency; see the myisampack page in the reference manual.

Best methods to increase database performance?

Assuming that I have 20L (2 million) records,
Approach 1: Hold all 20L records in a single table.
Approach 2: Make 20 tables and enter 1L (100,000) into each.
Which is the best method to increase performance and why, or are there any other approaches?
Splitting a large table into smaller ones can give better performance -- it is called sharding when the tables are then distributed across multiple database servers -- but when you do it manually it is most definitely an antipattern.
What happens if you have 100 tables and you are looking for a row but you don't know which table has it? If you want to put an index on the tables, you'll need to do it 100 times. If somebody wants to join the data set, they might need to include 100 tables in the join in some use cases. You'd need to invent your own naming conventions and document and enforce them yourself, with no help from the database catalog. Backup, recovery and all the other maintenance tasks will be a nightmare. Just don't do it.
Instead just break up the table by partitioning it. You get 100% of the performance improvement that you would have gotten from multiple tables but now the database is handling the details for you.
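As a sketch of what that looks like (the schema here is invented): one partitioned table instead of 100 hand-made ones, with MySQL routing each row to the right partition automatically:

    -- one logical table, 100 internal partitions managed by MySQL
    CREATE TABLE orders (
        id BIGINT UNSIGNED NOT NULL,
        customer_id INT UNSIGNED NOT NULL,
        amount DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB
    PARTITION BY HASH (id)
    PARTITIONS 100;

    -- queries are written against the single table; you never need to know
    -- which partition holds the row
    SELECT * FROM orders WHERE id = 123456;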
When looking for read-time performance, indexes are a great way to improve it. However, having indexes can slow down write queries.
So if you are looking for read performance, prefer indexes.
A few things to keep in mind when creating an index:
Try to avoid NULL values in the index.
Cardinality of the columns matters. It's been observed that having a column with lower cardinality first gives better performance compared to a column with higher cardinality.
The sequence of the columns in the index should match your WHERE clause. For example, if you create an index on Col A and Col B but query on Col C, your index will not be used. So formulate your indexes according to your WHERE clauses.
When in doubt whether an index was used or not, use EXPLAIN to see which index was used.
DB indexes can be a tricky subject for beginners, but imagining an index as a tree traversal helps visualize the path traced when reading the data.
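For example (table and column names are made up), a composite index whose leading columns match the WHERE clause, checked with EXPLAIN:

    CREATE TABLE events (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        account_id INT UNSIGNED NOT NULL,
        event_type VARCHAR(32) NOT NULL,
        created_at DATETIME NOT NULL,
        KEY idx_account_type (account_id, event_type)
    ) ENGINE=InnoDB;

    -- uses idx_account_type: the filter hits the index's leading columns
    EXPLAIN SELECT * FROM events WHERE account_id = 42 AND event_type = 'login';

    -- cannot use idx_account_type: created_at is not a leading column of any index
    EXPLAIN SELECT * FROM events WHERE created_at > '2024-01-01';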
The best/easiest option is to have a single table with proper indexes. On 100K rows I had 30 s per query, but with an index I got 0.03 s per query.
When that doesn't fit anymore you split tables (for me that was when I got to millions of rows).
And preferably onto different servers.
You can then create a microservice accessing all servers and returning data to consumers as if there were only one database.
But once you do this you'd better not have joins, because it gets messy replicating data to every database.
I would stick to the first method.

MySQL Performance of one vs. many tables

I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split the rows. Main question: however, wouldn't splitting solve the issue of reduced performance caused by such a large number of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself cannot be reduced.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.
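For example (the schema is illustrative), a table partitioned by date lets you purge a whole period with a cheap metadata operation instead of a massive DELETE:

    -- log table partitioned by year; the partitioning column must be part of
    -- every unique key, which is why created_at is in the primary key here
    CREATE TABLE access_log (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATETIME NOT NULL,
        message VARCHAR(255) NOT NULL,
        PRIMARY KEY (id, created_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(created_at)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- dropping a year of old data is near-instant compared to a huge DELETE
    ALTER TABLE access_log DROP PARTITION p2022;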

MySQL: testing performance after dividing a table

I made a test to see if dividing an indexed large table will increase the performance.
Original Table: 20000 rows.
Sub Tables: 4x5000 rows.
The main table was divided into 4 tables, all tables are indexed, and in the test each SQL query was executed 10000 times in a loop to get more accurate query times.
When I search an indexed column in the table, I see no difference in performance and Query times are the same for the original (20000 Rows) table and the new (5000 rows) tables.
I tried the same test without indexing by deleting indexes for all tables, and the difference in performance was obvious, where searching in sub tables was 6 times faster than searching in the large table. But with indexing the performance was the same.
So do you think it is a waste of time to divide my tables into smaller ones?
Note: the 20000-row size is just for testing; my real data will be on the order of 100M rows or more.
Yes, it is a waste of time. Databases can easily handle millions of rows and 20,000 is relatively small. As you noticed, indexes make finding data fast. The size of the data doesn't affect the speed of lookups noticeably in most cases. Queries might take a few more milliseconds if the difference in size is 100 or 1000 times, but the scale you're working on would make no real difference.
What you have effectively done is reinvent partitioning of tables. I would not use your own sub-table scheme; focus on partitioned tables instead. Partitioning automatically means that sub-tables are used internally, and if you formulate your SQL appropriately, sub-tables are automatically excluded from operations when they are not needed.
However, all the management of the partitions would be on the server itself, so that your client code can be kept simple and you still only have to deal with a single table.
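To see that exclusion happen (schema invented for illustration), EXPLAIN on a partitioned table shows which partitions a query actually touches:

    CREATE TABLE measurements (
        sensor_id INT UNSIGNED NOT NULL,
        measured_at DATE NOT NULL,
        value DOUBLE NOT NULL,
        PRIMARY KEY (sensor_id, measured_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(measured_at)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- in MySQL 5.7+ the partitions column of the EXPLAIN output shows that
    -- only p2024 is scanned for this date range (partition pruning)
    EXPLAIN SELECT * FROM measurements
    WHERE measured_at BETWEEN '2024-03-01' AND '2024-03-31';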

sorting 1 billion rows by one varchar column in MYSQL quickly

I have 1 billion rows stored in MySQL. I need to output them alphabetically by a varchar column; what's the most efficient way to go about it? Using other Linux utilities like sort and awk is allowed.
MySQL can deal with a billion rows. Efficiency depends on 3 main factors: Buffers, Indexes and Joins.
Some suggestions:
Try to fit the data set you're working with in memory
Processing in memory is so much faster, and you have a whole bunch of problems solved just by doing so. Use multiple servers to host portions of the data set. Store the portion of data you're going to work with in a temporary table, etc.
Prefer full table scans to index accesses
For large data sets, full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% of rows or less, a full table scan may be faster.
Avoid joins to large tables
Joining large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables are OK, but you might want to preload them into memory before the join so there is no random IO needed to populate the caches.
Be aware of MySQL limitations which require you to be extra careful when working with large data sets. In MySQL, a query runs as a single thread (with the exception of MySQL Cluster) and MySQL issues IO requests one by one for query execution, which means that if single-query execution time is your concern, many hard drives and a large number of CPUs will not help.
Sometimes it is a good idea to manually split a query into several, run them in parallel and aggregate the result sets.
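Roughly like this (the table and columns are invented, assuming a big_orders(id, customer_id, amount) table): run non-overlapping id ranges on separate connections and merge the partial results in the client:

    -- connection 1: aggregate the first half of the id range
    SELECT customer_id, SUM(amount) AS total
    FROM big_orders
    WHERE id BETWEEN 1 AND 500000000
    GROUP BY customer_id;

    -- connection 2 (run in parallel): aggregate the second half
    SELECT customer_id, SUM(amount) AS total
    FROM big_orders
    WHERE id BETWEEN 500000001 AND 1000000000
    GROUP BY customer_id;

    -- the client then merges the two result sets (summing totals per customer)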
You did not give much info on your setup or your dataset, but this should give you a couple of clues on what to watch out for. In my opinion having the (properly tuned) database sort this for you would be faster than doing it programmatically unless you have very specific needs not mentioned in your post.
Have you just tried indexing the column and dumping them out? I'd try that first to see if the performance was inadequate before going exotic.
It depends on how you define efficient. CPU/Memory/IO/Time/Coding Effort. What is important in this case?
"select * from big_table order by the_varchar_column" That is probably the most efficient use of developer resources. Adding an index might make it run a lot faster.

Maximum table size for a MySQL database

What is the maximum size for a MySQL table? Is it 2 million at 50GB? 5 million at 80GB?
At the higher end of the size scale, do I need to think about compressing the data? Or perhaps splitting the table if it grew too big?
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10-100,000 rows at a time (Join against id's 1-100,000 then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
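A simplified sketch of that "walking" pattern (the procedure and table names here are made up): each iteration joins only a bounded id range, so the working set stays small.

    DELIMITER //
    CREATE PROCEDURE walk_join(IN chunk_size INT)
    BEGIN
        DECLARE start_id BIGINT DEFAULT 1;
        DECLARE max_id BIGINT;
        SELECT MAX(id) INTO max_id FROM big_a;

        WHILE start_id <= max_id DO
            -- process one slice of the join and stash the result
            INSERT INTO join_results (a_id, b_value)
            SELECT a.id, b.value
            FROM big_a AS a
            JOIN big_b AS b ON b.a_id = a.id
            WHERE a.id BETWEEN start_id AND start_id + chunk_size - 1;

            SET start_id = start_id + chunk_size;
        END WHILE;
    END //
    DELIMITER ;

    CALL walk_join(100000);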
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL stores indexes in two pieces -- it stores indexes (other than the primary index) as indexes to the primary key values. So indexed lookups are done in two parts: first MySQL goes to an index and pulls from it the primary key values that it needs to find, then it does a second lookup on the primary key index to find where those values are.
The net of this is that for very large tables (1-200 Million plus rows) indexing against tables is more restrictive. You need fewer, simpler indexes. And doing even simple select statements that are not directly on an index may never come back. Where clauses must hit indexes or forget about it.
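One way to live with that two-step lookup (schema invented for illustration) is a covering index: if a secondary index already contains every column the query needs, InnoDB can answer from the index alone and skip the second trip to the primary key.

    CREATE TABLE page_views (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255) NOT NULL,
        viewed_at DATETIME NOT NULL,
        user_agent VARCHAR(255),
        -- covering index for the report query below
        KEY idx_url_date (url, viewed_at)
    ) ENGINE=InnoDB;

    -- EXPLAIN shows "Using index": answered from idx_url_date alone,
    -- with no second lookup into the primary key
    EXPLAIN SELECT viewed_at FROM page_views
    WHERE url = '/home';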
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
About your first question, the effective maximum size for the database is usually determined by the operating system, specifically the maximum file size MySQL Server will be able to create, not by MySQL Server itself. Those limits play a big role in table size limits, and MyISAM works differently from InnoDB, so any tables will be dependent on those limits.
If you use InnoDB you will have more options for manipulating table sizes; resizing the tablespace is an option in this case, so if you plan to resize it, this is the way to go. Take a look at the "The table is full" error page.
I am not sure of the real maximum record count for each table without all the necessary information (OS, table type, columns, data type and size of each, etc.), and I am not sure whether this is easy to calculate, but I've seen simple tables with around 1 billion records in a couple of cases and MySQL didn't give up.
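If you want to see where your own tables stand (row counts are approximate for InnoDB, and 'your_database' is a placeholder), information_schema reports per-table data and index sizes:

    -- approximate on-disk size of each table in a schema, largest first
    SELECT table_name,
           table_rows,
           ROUND(data_length  / 1024 / 1024) AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM information_schema.tables
    WHERE table_schema = 'your_database'
    ORDER BY data_length + index_length DESC;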