MEMSQL vs. MySQL - mysql

I need to start off by pointing out that by no means am I a database expert in any way. I do know how to get around to programming applications in several languages that require database backends, and am relatively familiar with MySQL, Microsoft SQL Server and now MEMSQL - but again, not an expert at databases so your input is very much appreciated.
I have been working on developing an application that has to cross reference several different tables. One very simple example of an issue I recently had, is I have to:
On a daily basis, pull down 600K to 1M records into a temporary table.
Compare what has changed between this new data pull and the old one. Record that information on a separate table.
Repopulate the table with the new records.
Running #2 is a query similar to:
SELECT * FROM (NEW TABLE) LEFT JOIN (OLD TABLE) ON (JOINED FIELD) WHERE (OLD TABLE.FIELD) IS NULL
In this case, I'm comparing the two tables on a given field and then pulling the information of what has changed.
In MySQL (v5.6.26, x64), my query times out. I'm running 4 vCPUs and 8 GB of RAM but note that the rest of my configuration is default configuration (did not tweak any parameters).
In MEMSQL (v5.5.8, x64), my query runs in about 3 seconds on the first try. I'm running the exact same virtual server configuration with 4 vCPUs and 8 GB of RAM, also note that the rest of my configuration is default configuration (did not tweak any parameters).
Also, in MEMSQL, I am running a single node configuration. Same thing for MySQL.
I love the fact that using MEMSQL allowed me to continue developing my project, and I'm coming across even bigger cross-table calculation queries and views that I can run that are running fantastically on MEMSQL... but, in an ideal world, i'd use MySQL. I've already come across the fact that I need to use a different set of tools to manage my instance (i.e.: MySQL Workbench works relatively well with a MEMSQL server but I actually need to build views and tables using the open source SQL Workbench and the mysql java adapter. Same thing for using the Visual Studio MySQL connector, works, but can be painful at times, for some reason I can add queries but can't add table adapters)... sorry, I'll submit a separate question for that :)
Considering both virtual machines are exactly the same configuration, and SSD backed, can anyone give me any recommendations on how to tweak my MySQL instance to run big queries like the one above on MySQL? I understand I can also create an in-memory database but I've read there might be some persistence issues with doing that, not sure.
Thank you!

The most likely reason this happens is because you don't have index on your joined field in one or both tables. According to this article:
https://www.percona.com/blog/2012/04/04/join-optimizations-in-mysql-5-6-and-mariadb-5-5/
Vanilla MySQL only supports nested loop joins, that require the index to perform well (otherwise they take quadratic time).
Both MemSQL and MariaDB support so-called hash join, which does not require you to have indexes on the tables, but consumes more memory. Since your dataset is negligibly small for modern RAM sizes, that extra memory overhead is not noticed in your case.
So all you need to do to address the issue is to add indexes on joined field in both tables.
Also, please describe the issues you are facing with the open source tools when connect to MemSQL in a separate question, or at chat.memsql.com, so that we can fix it in the next version (I work for MemSQL, and compatibility with MySQL tools is one of the priorities for us).

Related

How to improve "select min(my_col)" query in MySQL without adding and index

The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64G memory, 2T disc, 2.30Ghz CPU with 8 cores and 16 logical, and the query is running on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table has been set to ENGINE=MyISAM and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried as shown below.
EDIT:
Untagged MS Sql Server, I had tagged it hoping for the help of our MS Sql Server colleagues with information that Sql Server had some way of structuring data and/or queries that would make this type of query run faster on that platform v other such as MySql
Removed image of code to more closely conform with community standards
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySql). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution here?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback).
The MS Sql Server tag was used here as one of the systems I'm comparing is MS Sql Server. We're really working with very limited information here. I have two systems: My MySql system, which is knowable as I'm running it, and the MS Sql Server running the same database in someone else's system that I have very little information about (all I have is a read only sql prompt). I am comparing apples and oranges: The same query runs well on the orange (MS Sql Server) and does not run well on the apple (My MySql instance). I'd like to know why so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS Sql Server? Do I need to deploy on different hardware? Is the other system running some kind of in memory caching system on top of their database instance? Most of these possibilities would require a non trivial amount of time to explore and validate. So yes, I would like help from MS Sql Server experts that might know if there are caching options, transactional v warehouse options, etc. that could be set that would make a world of difference, that would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was shown as I was just trying to make the point that the other system does not seem to have an index on the column being queried. I this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table explicitly specifying the ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases help with performance. Suggest you change the engine as you add the indexes.
Show us some more queries, we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; do use that without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB. or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server groups by the entire table.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.

how to tune MySQL 5.6 database

We're developing an enterprise product that uses MySQL 5.6 on Windows to store reports generated by multiple clients. Our database contains approximately 20 tables, each table containing ranging from a few hundred thousand to some million records. All tables have more than 10 columns with a combination of numeric and textual data. All tables use innodb engine with numeric field as primary key. The tables are indexed on another numeric field, different than the primary key.
There are about 10 connections used to merge new data into the database. The data is viewed via a web console. There is as such no limit on the number of instances on viewing the data. We also don't have any reference/foreign key in tables so we don't use joins.
We haven't created stored procedure for fetching data. Does the store procedure really improve performance?
While searching solution on the internet, I found that if we changes the values of innodb_buffer_pool_size, read_rnd_buffer_size, sort_buffer_size etc fields in my.ini/my.cnf files, then we can improve performance as well as minimize memory requirements of mysql. But I am not confident about it because I don't know what should be the proper values of it and what are the side effect of it. Currently I kept the default configuration. Please let me know which values can be changed in the configuration file to improve performance and minimize memory requirements without any side effect.
I would also like to know some other ways to optimize & fine tune MySQL engine that would boost the performance & use optimum resources.
Minimum software/hardware requirement of our product is :
OS : Windows 2000 SP4 Professional / Server and later.
RAM : 1 GB
CPU : 1 GHz.
There are no predefined values for defined for performance in any database and if they are present then they are already used in default configurations.
So first time think about schema design. If your schema is designed well then you have already resolved half the issues.
From your description schema looks good as you already have applied indexes,no joins etc.
First try your application with default setting provided by Mysql engine.
Then carry out performance test of your application's critical/widely used/resource intensive workflow. (definition of these terms varies per application.)
If you find performance bottlenecks on database side then try below things,
Check heavy queries used by application. optimize them
if queries cannot be optimized then make them faster using indexes
carry out regular tuning of database (fragmentations,runstats etc.)
if you see bufferpools shortage,sort overflow,read buffers then tune them accordingly which is not one time job.
Performance tuning is very difficult job, requires lot of patience and expertise.
And Yes, generally stored procedures helps to improve performance. apply them whenever you can move heavy code logic into database.
Hope this was useful.

Project specific: PostgreSQL or MySQL need advice

I am making a web app using the Django framework and would like some opinion regarding which database to use.
PostgreSQL works very nicely with Django and I think (please do correct me if I'm wrong) MySQL requires a bit more time and effort to work with Django.
My database in a single table will have around 60 million entries and it does read and write per request once but will sometimes require two reads making it I guess in a sense more read heavy.
Total expected DB size: around 10 tables each with around ~50 million entries.
My question is that will PostgreSQL suffice for having such a large number of entries while performing both read and writes or should I switch to MySQL because I heard MySQL is more advantageous to read heavy tasks.
Thank You.
Both MySQL and PostgreSQL are free to download and install. Install then, tune the servers for the expected load, insert 100 million rows of random data, and take some measurements.
PostgreSQL, when configured correctly for your hardware, will perform fine. (PostgreSQL's default settings are very conservative.) Its query optimizer and indexing options are far superior to MySQL.

Run analytics on huge MySQL database

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and whole database goes upto 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
having smokingstatus='current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
Have Also looked into Hadoop but I am not sure how can I implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option but since requirements for this kind of ad-hoc aggregated values keeps changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the InfoBright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - its pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of you system it may take more care & feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many read so you do per day and how many writes? Can you live with some lag, e.g write to a new table each day and merge it to the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
Hope that helps
MySQL is have a serious limitation what prevent him to be able to perform good on such scenarious. The problem is a lack of parralel query capability - it can not utilize multiple CPUs in the single query.
Hadoop has an RDMBS like addition called Hive. It is application capable of translate your queries in Hive QL (sql like engine) into the MapReduce jobs. Since it is actually small adition on top of Hadoop it inherits its linear scalability
I would suggest to deploy hive alongside MySQL, replicate daily data to there and run heavy aggregations agains it. It will offload serious part of the load fro MySQL. You still need it for the short interactive queries, usually backed by indexes. You need them since Hive is iherently not-interactive - each query will take at least a few dozens of seconds.
Cassandra is built for the Key-Value type of access and does not have scalable GroupBy capability build-in. There is DataStax's Brisk which integrate Cassandra with Hive/MapReduce but it might be not trivial to map your schema into Cassandra and you still not get flexibility and indexing capabiilties of the RDBMS.
As a bottom line - Hive alongside MySQL should be good solution.

performance effect of joining tables from different databases

I have a web site using a database named lets say "site1". I am planning to put another site on the same server which will also use some of the tables from "site1".
So should I use three different databases like "site1" (for first site specific data), "site2" (for second site specific data), and "general" (for common tables). In which there will be join statements between databases general and site1 and site2. Or should I put all tables in one database?
Which is the best practice to do?
How performances differ in each situation?
I am using MySQL. So how is the situation especially for MySQL?
Thanks in advance...
From the performance point of view, there won't be ANY difference. Just keep your indexes in place and you will not notice whether you are using single DB or multiple DBs.
Apart from performance, there are 2 small implications that I can think of:
1. You can not have foreign keys across DBs.
2. Partitioning tables in DB based on their usage or based on applications can help you manage permissions in easy way.
I can speak from recent personal experience. I have some old mysql queries in some PHP code that worked fine with a relatively small database, but as it grew the query got slower and slower.
I have freeradius running mysql in its own database along with another management php app that I wrote. The freeradius table is > 1.5 million rows. I was attempting to join tables from my app's database to the freeradius database. I can say for sure 1.5 million rows is too many. Running some queries locked up my app altogether. I ended up having to re-write portions of my php app to do things differently (ie not joining 2 tables from different database). I also indexed the radius accounting table on some key fields and optimized some queries (mysql EXPLAIN statement is wonderful to help with this). Things are MUCH faster now.
I will definitely be hesitant to join 2 tables from different databases in the future unless really really necessary.