I have a table in a MariaDB database with 125 million rows that is used to store results from automated data analyses. A few thousand rows are written at random times each day. Queries on the table happen several times a day, and could return a few thousand or a few million results. The read performance is most important, because that happens in front of the user.
Currently I use MyISAM, which works generally well, except when the result set gets above a few thousand rows. I have indexes on the columns that are used for querying. The query cache helps, but it's rare that a user will perform the same search more than once.
I'm sure there are many optimization techniques I could use (I'd love to hear those too!), but the most basic question is: with a table of this size and this usage pattern, what is the best MariaDB storage engine for the table?
The index implementation between the Engines is different enough to prevent answering your question without seeing the actual query and SHOW CREATE TABLE.
InnoDB is the fastest engine for virtually all use cases. How much RAM do you have? How big is your table? Part of what I am fishing for is whether your queries are CPU-bound or I/O-bound.
Depending on the details of the query, taking advantage of InnoDB's "clustered" PRIMARY KEY can give a boost. This is not directly available in MyISAM.
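A sketch of what that can look like, assuming a hypothetical results table that is usually queried by analysis run (the names here are assumptions, not your actual schema):

-- Hypothetical schema: result rows are usually fetched by analysis_id
CREATE TABLE results (
    analysis_id INT NOT NULL,
    seq INT NOT NULL,
    created_at DATETIME NOT NULL,
    metric_value DOUBLE,
    PRIMARY KEY (analysis_id, seq)   -- clustered: one analysis's rows are stored together
) ENGINE=InnoDB;

-- This becomes a single range scan over the clustered index:
SELECT * FROM results WHERE analysis_id = 12345;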
What are you doing with the thousand or million rows? Summarizing? If so, then there is an approach that may make the queries run 10 times as fast: http://mysql.rjweb.org/doc.php/summarytables
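A minimal sketch of a summary table, reusing the hypothetical results table sketched above (again, the names and columns are assumptions):

-- Pre-aggregated daily totals, maintained after each load
CREATE TABLE results_daily (
    analysis_date DATE NOT NULL,
    analysis_id INT NOT NULL,
    row_count BIGINT NOT NULL,
    metric_sum DOUBLE NOT NULL,
    PRIMARY KEY (analysis_date, analysis_id)
) ENGINE=InnoDB;

-- Refresh yesterday's slice; user-facing reports then read results_daily
REPLACE INTO results_daily
SELECT DATE(created_at), analysis_id, COUNT(*), SUM(metric_value)
FROM results
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), analysis_id;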
Related
I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split the rows. Main question: however, wouldn't that solve the issue of reduced performance caused by such a large number of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself cannot be reduced.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.
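A sketch of that pattern, with made-up table and column names:

CREATE TABLE events (
    id BIGINT NOT NULL AUTO_INCREMENT,
    created_at DATE NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id, created_at)   -- the partition key must be part of every unique key
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Purging a whole year of old data is then nearly instant:
ALTER TABLE events DROP PARTITION p2022;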
DB: MySQL 5.6, InnoDB
explain result:
the real data:
I'm confused about where the 16462900 comes from. When I specify 6 wave_no values, the rows value in the EXPLAIN result is 6:
The value of rows in the EXPLAIN output is an estimate of the number of rows that will be examined.
It's just an estimate, based on the calculated statistics.
References:
http://dev.mysql.com/doc/refman/5.6/en/explain-output.html
http://dev.mysql.com/doc/refman/5.6/en/innodb-persistent-stats.html
Use this 'composite' index to improve performance:
INDEX(com_uid, exchange_state, wave_no)
And remove the FORCE.
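For example (the table name here is a placeholder, since the SHOW CREATE TABLE isn't shown):

ALTER TABLE your_table
  ADD INDEX idx_com_exch_wave (com_uid, exchange_state, wave_no);

-- ...and drop the FORCE INDEX hint from the SELECT so the optimizer can choose it on its own.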
The statistics are sometimes that far off. This can especially happen if there are TEXT or BLOB columns, which are stored elsewhere, thereby messing with the arithmetic. Don't worry about it.
You could do ANALYZE TABLE to recalculate the stats, but that might not improve the stats.
My colleague said this is because of table fragmentation; you can search for it on Google, and here is one.
MySQL tables, including MyISAM and InnoDB, two of the most common types, experience fragmentation as data is inserted and deleted randomly. Fragmentation can leave large holes in your table, blocks which must be read when scanning the table. Optimizing your table can therefore make full table scans and range scans more efficient.
Here is the article on the official site.
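The defragmentation itself is a single statement (table name is a placeholder); on InnoDB it rebuilds the table and recomputes the statistics:

OPTIMIZE TABLE your_table;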
I am trying to build a database that would contain a large number of records, each with a lot of columns (fields), maybe around 200-300 fields in total across all tables. Let's say that in a few years I would have about 40,000,000 to 60,000,000 records.
I plan to normalize the database, so I will have a lot of tables (about 30-40) -> and lots of joins for queries.
The database will be strictly related to the US, meaning that queries will be based on the 50 states alone (a query won't be allowed to search/insert/etc. across multiple states, just one).
What can I do to have better performance?
Someone came up with the idea of having all the states in different table structures, meaning I would have 50 tables * the 30-40 for the data (about 200 tables)! Should I even consider this type of approach?
The next idea was to use partitioning based on the US 50 states. How about this?
Any other way?
The best optimization is determined by the queries you run, not by your tables' structure.
If you want to use partitioning, this can be a great optimization, if the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning" so that the query would only run against the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
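For instance, list partitioning by state could look like this sketch (everything except the state column is made up):

CREATE TABLE MyTable (
    id BIGINT NOT NULL AUTO_INCREMENT,
    state CHAR(2) NOT NULL,
    amount DECIMAL(10,2),
    PRIMARY KEY (id, state)   -- the partition key must be part of every unique key
)
PARTITION BY LIST COLUMNS (state) (
    PARTITION p_ny VALUES IN ('NY'),
    PARTITION p_ca VALUES IN ('CA'),
    PARTITION p_tx VALUES IN ('TX')
    -- ...and so on for the remaining states
);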
You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS:
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';
That should report that the query uses a single partition.
Whereas if you need to run queries by date for example, then the partitioning wouldn't help; MySQL would have to repeat the query against all 50 partitions.
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';
That would list all partitions. There's a bit of overhead to query all partitions, so if this is your typical query, you should probably use range partitioning by date.
So choose your partition key with the queries in mind.
Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.
Re your comment:
Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):
Indexing
Partitioning
Caching
Tuning MySQL configuration variables
Archiving
Increasing hardware capacity (e.g. more RAM, solid state drives, RAID)
My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until after you see how your data grows and the query traffic your database receives.
I was optimizing a 3 GB table as a MEMORY table in order to do some analysis on it, and I was curious if adding indexes even help a MEMORY table. Since the data is all in memory anyway, is this just redundant?
No, they're not redundant.
Yes, continue to use indexes.
On smaller tables, access to a MEMORY table through a non-indexed column may seem almost as fast as through an indexed one, because full table scans are very fast in memory; but as the table grows, or as you join tables together to produce larger result sets, there will be a difference.
Regardless of the storage method the engine uses (disk/memory), proper indexes will improve performance as long as the storage engine supports them. How the indexes are implemented may vary, but I know they are implemented in the table types MEMORY, InnoDB, and MyISAM. BTW: the default method for indexes in MEMORY tables is a hash instead of a B-Tree.
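As a quick sketch (hypothetical table), you can choose the index type per index; HASH is great for equality lookups, while an explicit BTREE helps ranges and ORDER BY:

CREATE TABLE analysis_cache (
    id INT NOT NULL,
    score DOUBLE,
    PRIMARY KEY (id),                     -- HASH by default on a MEMORY table
    INDEX idx_score (score) USING BTREE   -- BTREE for WHERE score > 0.9 or ORDER BY score
) ENGINE=MEMORY;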
Also, I generally don't recommend coding to your storage engine. What's a MEMORY table today may need to be changed to InnoDB tomorrow; the SQL and schema should stand on their own.
No; indexing has little to do with raw data access speed. An index reorganizes data in order to optimize specific queries.
For example, if you add a balanced binary tree index to a one-million-row column, you will be able to find the item you want in about 20 read operations, instead of an average of half a million.
So placing that million rows in memory, which is 100x faster than the disk, will speed up a brute-force search by 100x. Adding the index will further improve the speed by a factor of twenty-five thousand by allowing the DB to perform a smarter search instead of a merely faster search.
Things are more complicated than this, because other factors come into play, and you rarely get such a large benefit from an index. Smarter searches are also slower on a one-by-one basis: those 20 index seeks cost much more than 20 brute-force seeks. Then there's index maintenance, etc.
But my suggestion is to keep the data in memory if you can, and index it.
I'm trying to fine-tune my MySQL server, so I check my settings, analyze the slow-query log, and simplify my queries where possible.
Sometimes indexing correctly is enough, sometimes not. I've read somewhere (please correct me if this is nonsense) that having more indexes than I need can have the same effect as having no indexes at all.
How many indexes are enough? You can say it depends on hundreds of factors, but I'm curious how I can clean up my mysql-slow.log enough to reduce server load.
Furthermore, I saw some "interesting" log entries like this:
# Query_time: 0 Lock_time: 0 Rows_sent: 22 Rows_examined: 44
SELECT * FROM `categories` ORDER BY `orderid` ASC;
The table in question contains exactly 22 rows and has an index on orderid. Why is this query showing up in the log at all? And why were 44 rows examined if the table only contains 22?
The amount of indexing, and the line between enough and too much, will depend on a lot of factors. On small tables like your "categories" table you usually don't want or need an index, and it can actually hurt performance. The reason is that it takes I/O (i.e. time) to read an index, and then more I/O and time to retrieve the records associated with the matched rows. An exception is when you only query the columns contained within the index.
In your example you are retrieving all the columns, and with only 22 rows it may be faster to just do a table scan and sort the rows instead of using the index. The optimizer may/should be doing this and ignoring the index. If that is the case, then the index is just taking up space with no benefit. If your "categories" table is accessed often, you may want to consider pinning it in memory so the db server keeps it accessible without having to go to the disk all the time.
When adding indexes you need to balance disk space, query performance, and the performance of updating and inserting into the tables. You can get away with more indexes on tables that are static and don't change much, as opposed to tables with millions of updates a day; you'll start feeling the effects of index maintenance at that point. What is acceptable in your environment, though, can only be determined by you and your organization.
When doing your analysis, be sure to generate/update your table and index statistics so that you can be assured of accurate calculations.
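In MySQL that is a one-liner; for the table from the question it would be:

ANALYZE TABLE categories;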
As a general rule, you should have indexes on all primary keys (you don't have a choice in that), all foreign keys, and any other fields you commonly use to fetch rows.
For example, if I commonly look up users by username, I would have that indexed, even if user ID was the primary key.
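A sketch of that (assuming a simple users table):

CREATE TABLE users (
    user_id INT NOT NULL AUTO_INCREMENT,
    username VARCHAR(64) NOT NULL,
    PRIMARY KEY (user_id),               -- indexed automatically
    UNIQUE KEY idx_username (username)   -- supports lookups by username
);

-- Uses idx_username instead of scanning the table:
SELECT * FROM users WHERE username = 'alice';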
How many indexes you need depends entirely on the queries you're running, what kinds of joins are being done (if any), the kind of data stored in the table, and how big the tables are (as well as many other factors). There's really no exact science to it. The greatest tool in your arsenal for figuring out how to optimize a query is explain. Using explain you can find out what kind of joins are being done, what possible keys could be used and which key (if any) was used, as well as how many rows were examined for each table in the join.
Using this information you can decide how to key your tables and/or modify your queries to make them more efficient. The syntax for explain is very simple.
EXPLAIN SELECT * FROM `categories` ORDER BY `orderid` ASC;
Note, explain does not actually run the query. So if you're using this to debug a query that takes 5 minutes to run, explain will still be very fast.
You do need to be careful when adding indexes, though, as they do cause inserts and updates to go slower, and on very large tables this performance hit can become noticeable, especially if that same table is used for a lot of reads. While adding a lot of indexes generally won't kill the performance of a query, you should still only add them as you need them.
Also keep in mind that MySQL will use a maximum of one index per SELECT statement (although if you are using a join, it can also use one for each join). So indexing "just because" is a waste of disk space and will slow the database down on writes. If you commonly use a WHERE clause on two columns, create one index containing both of those columns; it will be significantly faster than indexing just one of them alone.
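For example (hypothetical table and columns): if the common query is SELECT ... FROM orders WHERE customer_id = 7 AND status = 'open', one composite index serves both conditions:

ALTER TABLE orders ADD INDEX idx_customer_status (customer_id, status);

-- Two separate single-column indexes would usually let MySQL pick only one of them for this WHERE clause.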
An index can speed up a SELECT query, but it will slow down INSERT/UPDATE/DELETE queries because they need to update the index as well, not just the row.
This is just personal opinion (I've got no facts to back it up), but I think that if there is a query that is taking a long time and an index would speed it up - go for it! "Too many" indexes would be if you added indexes that didn't do any good (e.g. there were no queries they would speed up). For example, a silly thing to do would be to place an index on every column "just because".
There's no magic number for the "best" number of indexes. The basic rule is this: add indexes for queries that are used often and/or need to run quickly.
Having "too many" indexes shouldn't slow down queries, but it each index added adds a small amount of time to add/update items in the db (since it modifies the indices as well), and a small amount of space. However, if you're just adding indexes as required, this is probably not a big concern.