Which MySQL storage engines are fast at SELECTs/JOINs?

I have some very large databases (some up to 150M rows) that I'm working with, and after initially inserting the data there aren't many INSERTs going on; just a lot of SELECTs and JOINs.
I've been messing around with Infobright a lot (the community version), and while I believe it is a good engine, I've personally had some problems getting it to run like it should (fast).
So I was wondering if anyone else could recommend any other fast free storage engine for MySQL?
I'm just now checking out TokuDB; is there anything else out there worth checking out as well?

You should look at InfiniDB too. http://infinidb.org/ (one of the fastest)
There are a lot of considerations to make before benchmarking any engine: hardware (multicore processors, memory, configuration), design decisions around your schema, and so on, and how all of this impacts engine performance.
Do check out this blog post for how they benchmark engines (it names other engine types): http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
Note that this comparison is for a star schema design. If a columnar DB engine doesn't suit your requirements, you can look into XtraDB, which is an extended version of InnoDB (not the fastest, but ACID compliant).
PS - Always track the properties (important to you) of each engine, like referential integrity checks, ACID compliance, etc. Sometimes these limitations can be bigger deal breakers than a 10% increase in query performance.

Have you looked at Sphinx at all? While it is a search engine, it also supports query-less searches, which are similar to standard SELECT queries with indexes. I found it to be a huge help when dealing with large datasets. It's very fast and is used heavily in high-traffic forums that are up in the millions (or hundreds of millions) of posts.
There is also a plugin for MySQL called SphinxSE which allows Sphinx to act as a MySQL storage engine, making integration very easy to set up. You build your indexes by supplying the indexer program with a query, and once it's all set up, you can query it as if it were a normal table.
http://sphinxsearch.com/docs/2.0.1/sphinxse-overview.html (note, I haven't used it much since pre 1.0)
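As a rough sketch (assuming a Sphinx index named posts_index has already been built by the indexer; the id/weight/query column layout is the SphinxSE convention, but all names here are illustrative):
-- Hypothetical SphinxSE "proxy" table pointing at a running searchd instance.
CREATE TABLE posts_search (
    id     BIGINT UNSIGNED NOT NULL,
    weight INT NOT NULL,
    query  VARCHAR(3072) NOT NULL,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/posts_index";
-- Run the full-text search in Sphinx and join the matches back to the real table.
SELECT p.*
FROM posts_search s
JOIN posts p ON p.id = s.id
WHERE s.query = 'storage engine;mode=any';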

Besides taking into consideration which DBMS you use, you should also focus on optimizing your tables, indices and queries.
Whenever you have multiple joins, join first on the most selective relation and then on the less selective ones.
Analyze your query execution plans.
Create indices on columns that are hit often in your QEPs.
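As a rough sketch (table and column names are made up for illustration), that workflow looks like this:
-- Inspect the query execution plan first.
EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped';
-- If the plan shows a full scan on orders, an index on the filtered column usually helps.
CREATE INDEX idx_orders_status ON orders (status);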

Brett -
When using Infobright, you get the best performance gains by:
1) Utilizing the Knowledge Grid as much as possible
2) reducing joins
3) creating 'lookup' columns
Since the Knowledge-Grid is in-memory, you can kill off a lot of query time just by adding additional filters. Also, consider using a nested select instead of a join. By doing so, you can use an already-created knowledge node (instead of generating a pack-to-pack node on the fly).
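A rough illustration of that join-to-nested-select rewrite, with made-up table names (whether it wins depends on your data and on which knowledge nodes already exist):
-- Join version: may force a pack-to-pack node to be generated on the fly.
SELECT f.amount
FROM fact_sales f
JOIN dim_store s ON s.store_id = f.store_id
WHERE s.region = 'EU';
-- Nested-select version: the filter on dim_store is resolved first, so the outer
-- query can reuse the knowledge nodes already built for fact_sales.store_id.
SELECT f.amount
FROM fact_sales f
WHERE f.store_id IN (SELECT store_id FROM dim_store WHERE region = 'EU');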
If you have some queries that you think should be faster, post them, and I can help with potentially modifying the query to make it run faster.
Cheers,
Jeff

Related

How to improve "select min(my_col)" query in MySQL without adding an index

The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64G memory, 2T disk, a 2.30 GHz CPU with 8 cores and 16 logical processors, and the query is running on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table has been set to ENGINE=MyISAM, and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried.
EDIT:
Untagged MS SQL Server. I had tagged it hoping for help from our MS SQL Server colleagues with information on whether SQL Server has some way of structuring data and/or queries that makes this type of query run faster on that platform versus others such as MySQL.
Removed image of code to more closely conform with community standards
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySql). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution here?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback).
The MS SQL Server tag was used here because one of the systems I'm comparing is MS SQL Server. We're really working with very limited information here. I have two systems: my MySQL system, which is knowable because I'm running it, and the MS SQL Server running the same database in someone else's system that I have very little information about (all I have is a read-only SQL prompt). I am comparing apples and oranges: the same query runs well on the orange (MS SQL Server) and does not run well on the apple (my MySQL instance).
I'd like to know why, so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS SQL Server? Do I need to deploy on different hardware? Is the other system running some kind of in-memory caching system on top of its database instance? Most of these possibilities would require a non-trivial amount of time to explore and validate. So yes, I would like help from MS SQL Server experts who might know if there are caching options, transactional vs. warehouse options, etc. that could be set that would make a world of difference - that would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was included because I was just trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table explicitly specifying the ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases, help with performance. Suggest you change the engine as you add the indexes.
Show us some more queries and we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(visit_start_date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
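For the query above, a single ALTER can do both at once - a sketch only; test it on a copy first, since it rebuilds the whole table (the index name is made up):
-- Convert the engine and add the needed index in one rebuild.
ALTER TABLE visit_occurrence
    ENGINE = InnoDB,
    ADD INDEX idx_visit_start_date (visit_start_date);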
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; don't use it without getting advice. (PS: BY RANGE is perhaps the only useful variant; see the sketch after this list.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
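If, after getting advice, BY RANGE partitioning does turn out to fit (typically for dropping old data quickly rather than for speeding up queries), a sketch looks like this (table and column names are hypothetical):
CREATE TABLE visit_log (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    visit_date DATE NOT NULL,
    member_id  INT NOT NULL,
    PRIMARY KEY (id, visit_date)   -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (YEAR(visit_date)) (
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- Old data can then be removed almost instantly:
ALTER TABLE visit_log DROP PARTITION p2019;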
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server groups by the entire table.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
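To check that the index is doing its job, look at the plan (this reuses the table and index from above):
EXPLAIN
SELECT MIN(visit_start_date)
FROM visit_occurrence;
-- With the index in place, the optimizer can answer this from the first entry of the
-- index; EXPLAIN typically reports the table access as having been optimized away.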

Optimizing mysql/postgresql for create and update

As far as I know, most RDBMS packages are built with the assumption that 99% of queries will be SELECTs. However, I am in a situation where at least 50% of the queries are create/update queries. Since we also need persistence, we cannot go for NoSQL solutions. Essentially, whenever there is an update it should be immediately stored permanently. So I was wondering if MySQL performance will be hampered because of that. Our current MySQL engine is InnoDB. Is any other MySQL engine preferable? I plan to use Amazon RDS, so my focus is MySQL; but just out of curiosity I would like to know if PostgreSQL can help here.
N.B. - Just to give an idea of the scale, we are talking about create/update queries on tables with at least a million entries within a couple of months of going into production.
If your working set fits in memory, your inserts and updates will tend to be quite fast. Partitioning can help here, as others have mentioned. Most NoSQL solutions have persistence so you shouldn't exclude them outright. Cassandra has a storage model specifically tuned for writes and might be worth a look.
If you go with MySQL, there are tuning parameters to trade some durability for insert performance, and various other hardware and software settings:
https://serverfault.com/questions/118504/how-to-improve-mysql-insert-and-update-performance
You can probably expect around 100 inserts/sec with full durability on standard disks. If that's not going to cut it, set up benchmarks and start tweaking parameters, or get ready for some re-architecting. Benchmark with realistic amounts of data in your tables; it's much better to find a problem now than to discover it 6 months down the road when your tables start to fill up. Synthetic data is fine, just make sure the indexed fields are distributed similarly.
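A sketch of the durability-vs-speed knobs the linked answer discusses (values are illustrative; relaxing them means a crash can lose roughly the last second of transactions, and on RDS these go in a parameter group rather than my.cnf):
# my.cnf - illustrative values only
innodb_flush_log_at_trx_commit = 2   # flush the redo log ~once per second instead of at every commit
sync_binlog = 0                      # let the OS decide when to sync the binary log
innodb_buffer_pool_size = 4G         # keep the working set in memory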
Having as few indexes as possible increases the speed of inserts and updates, because every index has to be updated when rows are inserted or updated.
But of course, keep in mind that some indexes might speed up your updates as well (for example, when the UPDATE's WHERE clause can use them).

How is mysql different from oracle performance-wise?

I've started a new job where I'm working with MySQL instead of Oracle. What are some things that I might have to "unlearn" from using Oracle? What are some things that might make Oracle SQL go faster, but might be bad under MySQL (and vice-versa)?
In particular, is it better for MySQL code to commit less frequently (as is the case for Oracle)?
Storage Engines are the key. MyISAM and InnoDB will perform very differently, for example (not least because MyISAM is non-transactional and can therefore skip a lot of consistency logic).
So what is faster will often depend on the storage engine being used.
As you've started a new job, the application you're maintaining likely has a different architecture. In optimisation, architecture is much more important than micro-optimisations, so that is likely to make more difference.
Sure, of course the techniques for optimisation will be different, but that would be the case with any product.
However, MySQL (InnoDB) and Oracle operate in a fundamentally similar way - small transactions are generally going to perform less well than the same number of operations in large transactions if you have durability enabled, as each transaction requires an fdatasync (possibly fewer if group commit is enabled and working).
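A rough illustration of why commit frequency matters (the table name is made up): batching the work into one transaction pays the sync cost once instead of once per row.
-- One fdatasync per statement under autocommit:
INSERT INTO events (member_id, action) VALUES (1, 'login');
INSERT INTO events (member_id, action) VALUES (2, 'view');
-- One fdatasync for the whole batch:
START TRANSACTION;
INSERT INTO events (member_id, action) VALUES (1, 'login');
INSERT INTO events (member_id, action) VALUES (2, 'view');
COMMIT;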

Converting MyISAM to InnoDB. Beneficial? Consequences?

We're running a social networking site that logs every member's action (including visiting other members' pages); this involves a lot of writes to the db. These actions are stored in a MyISAM table, and since something is starting to tax the CPU, my first thought was that it's MyISAM's table locking that is causing this stress on the CPU.
There are only reads and writes, no updates to this table. I think the balance between reads and writes is about 50/50 for this table, would InnoDB therefore be a better option?
If I want to change the table to InnoDB and we don't use foreign key constraints, transactions or fulltext indexes - do I need to worry about anything?
Notwithstanding any benefits / drawbacks of its use, which are discussed in other threads ( MyISAM versus InnoDB ), migration is a nontrivial process.
Consider
Functionally testing all components which talk to the database if possible - different engines have different semantics
Running as much performance testing as you can - some things may improve, others may be much worse. A well-known example is SELECT COUNT(*) on a large table.
Checking that all your code will handle deadlocks gracefully - you can get them without explicit use of transactions
Estimating how much the space usage will change after converting - test this in a non-production environment.
You will doubtless need to change things in a large software platform; this is ok, but seeing as you (hopefully) have a lot of auto-test coverage, change should be acceptable.
PS: If "Something is starting to tax the CPU", then you should a) Find out what, in a non-production environment, b) Try various options to reduce it, in a non-production environment. You should not blindly start doing major things like changing database engines when you haven't fully analysed the problem.
All performance testing should be done in a non-production environment, with production-like data and on production-grade hardware. Otherwise it is difficult to interpret results correctly.
With regards to other potential migration problems:
1) Space - InnoDB tables often require more disk space, though the Barracuda file format in newer versions of InnoDB has narrowed the difference. You can get a sense for this by converting a recent backup of the tables and comparing the size. Use SHOW TABLE STATUS to compare the data length (see the sketch after this list).
2) Full text search - only on MyISAM
3) GIS/Spatial datatypes - only on MyISAM
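A sketch of that conversion and size check, run against a restored backup rather than production (the table name is made up for illustration):
-- Note the MyISAM sizes first.
SHOW TABLE STATUS LIKE 'actions_log';
-- Convert the restored copy.
ALTER TABLE actions_log ENGINE = InnoDB;
-- Compare Data_length and Index_length before and after.
SHOW TABLE STATUS LIKE 'actions_log';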
On performance, as the other answers and the referenced answer indicate, it depends on your workload. MyISAM is much faster for full table scans. InnoDB tends to be much faster for highly concurrent access. InnoDB can also be much faster if your lookups are based on the primary key.
Another performance issue is that MyISAM can always keep a row count, since it only does table level locking. So, if you're frequently trying to get the row count for a very large table, it may be much slower with InnoDB. Search the Internet if you need a workaround for this, as I've seen several proposed.
Depending on the size of the table(s), you may also need to update your MySQL config file. At the very least, you may want to shift bytes from key_buffer to innodb_buffer_pool_size. You won't get a fair comparison if you leave the database as being optimized for MyISAM. Read up on all the innodb_* configuration properties.
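For instance (values are placeholders only; size them to your RAM and data), the shift might look like this in the MySQL config file:
# my.cnf - illustrative values only
key_buffer_size         = 64M   # MyISAM index cache; can shrink once tables are InnoDB
innodb_buffer_pool_size = 8G    # main InnoDB cache for data and indexes
innodb_log_file_size    = 512M  # larger redo logs tend to help write-heavy workloads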
I think it's quite possible that switching to InnoDB would improve performance, but in my experience, you can't really be sure until you try it. If I were you, I would set up a test environment on the same server, convert to InnoDB and run a benchmark.
From my experience, MyISAM tables are only useful for text indexing, where you need good performance for searches on big text but don't yet need a full-fledged search engine like Solr or ElasticSearch.
If you want to switch to InnoDB but want to keep indexing your text in a MyISAM table, I suggest you take a look at this: http://blog.lavoie.sl/2013/05/converting-myisam-to-innodb-keeping-fulltext.html
Also: InnoDB supports live atomic backups using innobackupex from Percona. This is a godsend when dealing with production servers.

How to predict MySQL tipping points?

I work on a big web application that uses a MySQL 5.0 database with InnoDB tables. Twice over the last couple of months, we have experienced the following scenario:
The database server runs fine for weeks, with low load and few slow queries.
A frequently-executed query that previously ran quickly will suddenly start running very slowly.
Database load spikes and the site hangs.
The solution in both cases was to find the slow query in the slow query log and create a new index on the table to speed it up. After applying the index, database performance returned to normal.
What's most frustrating is that, in both cases, we had no warning about the impending doom; all of our monitoring systems (e.g., graphs of system load, CPU usage, query execution rates, slow queries) told us that the database server was in good health.
Question #1: How can we predict these kinds of tipping points or avoid them altogether?
One thing we are not doing with any regularity is running OPTIMIZE TABLE or ANALYZE TABLE. We've had a hard time finding a good rule of thumb about how often (if ever) to manually do these things. (Since these commands LOCK tables, we don't want to run them indiscriminately.) Do these scenarios sound like the result of unoptimized tables?
Question #2: Should we be manually running OPTIMIZE or ANALYZE? If so, how often?
More details about the app: database usage pattern is approximately 95% reads, 5% writes; database executes around 300 queries/second; the table used in the slow queries was the same in both cases, and has hundreds of thousands of records.
The MySQL Performance Blog is a fantastic resource. Namely, this post covers the basics of properly tuning InnoDB-specific parameters.
I've also found the PDF version of the MySQL Reference Manual to be essential. Chapter 7 covers general optimization, and section 7.5 covers server-specific optimizations you can toy with.
From the sound of your server, the query cache may be of IMMENSE value to you.
The reference manual also gives you some great detail concerning slow queries, caches, query optimization, and even disk seek analysis with indexes.
It may be worth your time to look into multi-master replication, allowing you to lock one server entirely and run OPTIMIZE/ANALYZE, without taking a performance hit (as 95% of your queries are reads, the other server could manage the writes just fine).
Section 12.5.2.5 covers OPTIMIZE TABLE in detail, and 12.5.2.1 covers ANALYZE TABLE in detail.
Update for your edits/emphasis:
Question #2 is easy to answer. From the reference manual:
OPTIMIZE:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows. [...] You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data table.
And ANALYZE:
ANALYZE TABLE analyzes and stores the key distribution for a table. [...] MySQL uses the stored key distribution to decide the order in which tables should be joined when you perform a join on something other than a constant. In addition, key distributions can be used when deciding which indexes to use for a specific table within a query.
OPTIMIZE is good to run when you have the free time. MySQL optimizes well around deleted rows, but if you go and delete 20GB of data from a table, it may be a good idea to run this. It is definitely not required for good performance in most cases.
ANALYZE is much more critical. As noted, having the needed table data available to MySQL (provided with ANALYZE) is very important when it comes to pretty much any query. It is something that should be run on a common basis.
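In practice that boils down to statements like these, run in a quiet period (the table name is hypothetical):
-- Refresh index statistics; cheap, and safe to run regularly.
ANALYZE TABLE page_views;
-- Rebuild the table to reclaim space after large deletes; this locks/rebuilds the table.
OPTIMIZE TABLE page_views;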
Question #1 is a bit more of a trick. I would watch the server very carefully when this happens, namely disk I/O. My bet would be that your server is thrashing either your swap or the (InnoDB) caches. In either case, it may be query, tuning, or load related. Unoptimized tables could cause this. As mentioned, running ANALYZE can immensely help performance, and will likely help out too.
I haven't found any good way of predicting MySQL "tipping points" -- and I've run into a few.
Having said that, I've found tipping points are related to table size - not merely raw table size, but rather how big the "area of interest" is to a query. For example, in a table of over 3 million rows and about 40 columns (about three-quarters of them integers), most queries that select a portion of the rows based on indices are fast. However, when one value in a query on one indexed column means two-thirds of the rows are now "interesting", the query is about five times slower than normal. Lesson: try to arrange your data so such a scan isn't necessary.
However, such behaviour now gives you a size to look for. This size will be heavily dependent on your server setup, the MySQL server variables, and the table's schema and data.
Similarly, I've seen reporting queries run in reasonable time (~45 seconds) if the period is two weeks, but take half-an-hour if the period is extended to four weeks.
Use the slow query log; it will help you narrow down the queries you want to optimize.
For time-critical queries, it is sometimes better to keep a stable plan by using index hints.
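For example, if the optimizer occasionally picks the wrong index for a critical query, a hint can pin the plan (table, column and index names are made up):
-- Force the optimizer to use a specific index for a time-critical query.
SELECT order_id, total
FROM orders FORCE INDEX (idx_created_at)
WHERE created_at >= '2024-01-01';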
It sounds like you have a frustrating situation and maybe not the best code review process and development environment.
Whenever you add a new query to your code you need to check that it has the appropriate indexes ready and add those with the code release.
If you don't do that your second option is to constantly monitor the slow query log and then go beat the developers; I mean go add the index.
There's an option to enable logging of queries that don't use an index, which would be useful to you.
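A sketch of the relevant settings (variable names are for recent MySQL versions - older servers spell some of these differently - and the thresholds are illustrative):
# my.cnf - slow query logging
slow_query_log                = 1
slow_query_log_file           = /var/log/mysql/slow.log
long_query_time               = 1   # seconds
log_queries_not_using_indexes = 1   # also log queries that resort to full scans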
If there are some queries that "work and then stop working" (but are "using an index") then it's likely that the query wasn't very good in the first place (low cardinality in the index; inefficient join; ...) and the first rule of evaluating the query carefully when it's added would apply.
For question #2 - On InnoDB, "analyze table" is basically free to run, so if you have bad join performance it doesn't hurt to run it. Unless the balance of the keys in the table is changing a lot, it's unlikely to help, though. It almost always comes down to bad queries. "optimize table" rebuilds the InnoDB table; in my experience it's relatively rare that it helps enough to be worth the hassle of having the table unavailable for the duration (or doing the master-master failover stuff while it's running).