Equivalent of oracle's Parallel option in Mysql - mysql

In oracle we can create a table and insert data and select it with parallel option.
Is there any similar option in mysql. I am migrating from oracle to mysql and my system has more select and less data change, so any option to select parallely is what i am seeking for.
eg: Lets consider my table has 1 million rows and if i use parallel(5) option then five threads are running the same query with limit and fetching approximately 200K each and as final result i get 1 million record in 1/5th of usual time.

In short, the answer is no.
The MySQL server is designed to execute concurrent user sessions in parallel, but not to execute one given user session in several parts in parallel.
This is a personal opinion, but I would refrain from wanting to apply optimizations up front, making assumptions about how the RDBMS works. Better measure the query first, and see if the response time is a real concern or not, and only then investigate possible optimizations.
"Premature optimization is the root of all evil." (Donald Knuth)

Queries within MySQL are always run parallel. If you want to run different queries simultaneously through your program, however, you would need to open different connections through workers that your program would have async access to.
You could also run tasks through creating events or using delayed inserts, however I don't think that applies very well here. Something else to consider:
Generally, some operations are guarded between individual query
sessions (called transactions). These are supported by InnoDB
backends, but not MyISAM tables (but it supports a concept called
atomic operations). There are various level of isolation which differ
in which operations are guarded from each other (and thus how
operations in one parallel transactions affect another) and in their
performance impact. - Holger Just
He also mentions the MySQL transcations page, which breifly goes over the different engine types available to MySQL (MyISAM being faster, but not as reliable):
MySQL Transcations

Related

MySQL query synchronization/locking question

I have a quick question that I can't seem to find online, not sure I'm using the right wording or not.
Do MySql database automatically synchronize queries or coming in at around the same time? For example, if I send a query to insert something to a database at the same time another connection sends a query to select something from a database, does MySQL automatically lock the database while the insert is happening, and then unlock when it's done allowing the select query to access it?
Thanks
Do MySql databases automatically synchronize queries coming in at around the same time?
Yes.
Think of it this way: there's no such thing as simultaneous queries. MySQL always carries out one of them first, then the second one. (This isn't exactly true; the server is far more complex than that. But it robustly provides the illusion of sequential queries to us users.)
If, from one connection you issue a single INSERT query or a single UPDATE query, and from another connection you issue a SELECT, your SELECT will get consistent results. Those results will reflect the state of data either before or after the change, depending on which query went first.
You can even do stuff like this (read-modify-write operations) and maintain consistency.
UPDATE table
SET update_count = update_count + 1,
update_time = NOW()
WHERE id = something
If you must do several INSERT or UPDATE operations as if they were one, you'll need to use the InnoDB engine, and you'll need to use transactions. The transaction will block SELECT operations while it is in progress. Teaching you to use transactions is beyond the scope of a Stack Overflow answer.
The key to understanding how a modern database engine like InnoDB works is Multi-Version Concurrency Control or MVCC. This is how simultaneous operations can run in parallel and then get reconciled into a consistent "view" of the database when fully committed.
If you've ever used Git you know how you can have several updates to the same base happening in parallel but so long as they can all cleanly merge together there's no conflict. The database works like that as well, where you can begin a transaction, apply a bunch of operations, and commit it. Should those apply without conflict the commit is successful. If there's trouble the transaction is rolled back as if it never happened.
This ability to juggle multiple operations simultaneously is what makes a transaction-capable database engine really powerful. It's an important component necessary to meet the ACID standard.
MyISAM, the original engine from MySQL 3.0, doesn't have any of these features and locks the whole database on any INSERT operation to avoid conflict. It works like you thought it did.
When creating a database in MySQL you have your choice of engine, but using InnoDB should be your default. There's really no reason at all to use MyISAM as any of the interesting features of that engine (e.g. full-text indexes) have been ported over to InnoDB.

What data quantity is considered as too big for MySQL?

I am looking for a free SQL database able to handle my data model. The project is a production database working in a local network not connected to the internet without any replication. The number of application connected at the same times would be less than 10.
The data volume forecast for the next 5 years are:
3 tables of 100 millions rows
2 tables of 500 millions rows
20 tables with less than 10k rows
My first idea was to use MySQL, but I have found around the web several articles saying that MySQL is not designed for big database. But, what is the meaning of big in this case?
Is there someone to tell me if MySQL is able to handle my data model?
I read that Postgres would be a good alternative, but require a lot of hours for tuning to be efficient with big tables.
I don't think so that my project would use NOSQL database.
I would know if someone has some experience to share with regarding MySQL.
UPDATE
The database will be accessed by C# software (max 10 at the same times) and web application (2-3 at the same times),
It is important to mention that only few update will be done on the big tables, only insert query. Delete statements will be only done few times on the 20 small tables.
The big tables are very often used for select statement, but the most often in the way to know if an entry exists, not to return grouped and ordered batch of data.
I work for Percona, a company that provides consulting and other services for MySQL solutions.
For what it's worth, we have worked with many customers who are successful using MySQL with very large databases. Terrabytes of data, tens of thousands of tables, tables with billions of rows, transaction load of tens of thousands of requests per second. You may get some more insight by reading some of our customer case studies.
You describe the number of tables and the number of rows, but nothing about how you will query these tables. Certainly one could query a table of only a few hundred rows in a way that would not scale well. But this can be said of any database, not just MySQL.
Likewise, one could query a table that is terrabytes in size in an efficient way. It all depends on how you need to query it.
You also have to set specific goals for performance. If you want queries to run in milliseconds, that's challenging but doable with high-end hardware. If it's adequate for your queries to run in a couple of seconds, you can be a lot more relaxed about the scalability.
The point is that MySQL is not a constraining factor in these cases, any more than any other choice of database is a constraining factor.
Re your comments.
MySQL has referential integrity checks in its default storage engine, InnoDB. The claim that "MySQL has no integrity checks" is a myth often repeated over the years.
I think you need to stop reading superficial or outdated articles about MySQL, and read some more complete and current documentation.
MySQLPerformanceBlog.com
High Performance MySQL, 3rd edition
MySQL 5.6 manual
MySQL has a two important (and significantly different) database engines - MyISAM and InnoDB. A limits depends on usage - MyISAM is nontransactional - there is relative fast import, but it is too simple (without own memory cache) and JOINs on tables higher than 100MB can be slow (due too simple MySQL planner - hash joins is supported from 5.6). InnoDB is transactional and is very fast on operations based on primary key - but import is slower.
Current versions of MySQL has not good planner as Postgres has (there is progress) - so complex queries are usually much better on PostgreSQL - and really simple queries are better on MySQL.
Complexity of PostgreSQL configuration is myth. It is much more simple than MySQL InnoDB configuration - you have to set only five parameters: max_connection, shared_buffers, work_mem, maintenance_work_mem and effective_cache_size. Almost all is related to available memory for Postgres on server. Usually work for 5 minutes. On my experience a databases to 100GB is usually without any problems on Postgres (probably on MySQL too). There are two important factors - how speed you expect and how much memory and how fast IO you have.
With large databases you have to have a experience and knowledges for any database technology. All is fast when you are in memory, and when ratio database size/memory is higher, then much more work you have to do to get good results.
First of all, MySQLs table size is only limited by the allowed file size limit of your OS which is I. The terra bytes on any modern OS. That would pose no problems. Most important are questions like this:
What kind of queries will you run?
Are the large table records updated frequently or basically archives for history data?
What is your hardware budget?
What is the kind of query speed you need?
Are you familiar with table partitioning, archive tables, config tuning?
How fast do you need to write (expected inserts per second)
What language will you use to connect to the db (Java, .net, Ruby etc)
What platform are you most familiar with?
Will you run queries which might cause table scans such like '%something%' which would have to go through every single row and take forever
MySQL is used by Facebook, google, twitter and others with large tables and 100,000,000 is not much in the age of social media. MySQL has very little drawbacks (even though I prefer postgresql in most cases) like altering large tables by adding a new index for example. That might send your company in a couple days forced vacation if you don't have a replica in the meantime. Is there a reason why NoSQL is not an option? Sometimes hybrid approaches are a good choice like having your relational business logic in MySQL and huge statistical tables in a NoSQL database like MongoDb which can scale by adding new servers in minutes (MySQL can too but it's more complicated). Now MongoDB can have a indexed column which can be searched by in blistering speed.
Bejond the bottom line: you need to answer the above questions first to make a very informed decision. If you have huge tables and only search on indexed keys almost any database will do - if you expect many changes to the structure down the road you want to use a different approach.
Edit:
Based on your update you just posted I doubt you would run into problems.

When does a slow MySQL query on a given connection affect other connections?

I think I have a basic understanding of this, but am hoping that someone can give me more details as I am interested in learning more about database performance.
Lets say I have a very large database, with many millions of entries, the database supports many connections. Doing simple queries on the database will be slow as there's so much data. I'm trying to understand exactly when a query on a given connection starts to have a direct effect on the performance of queries running on other connections.
If one connection locks some elements, I understand that that will hold up queries running the other connections that need those elements . For example doing:
SELECT FOR UPDATE
will lock what you are selecting.
What happens when you do something simple like:
SELECT COUNT(*) FROM myTable
lets say we have a table with a billion rows so running the count is going to take some time (running on innodb). Will it affect queries running on other connections?
What if you select a large amount of data using SELECT and JOIN, like:
SELECT * FROM myTable1 JOIN myTable2 ON myTable1.id = myTable2.id;
does having a join lock anything for other queries?
I'm finding it hard to know which queries will have a direct effect on the performance of queries running on other connections.
Thanks
There are different angles:
Row locking: this shouldn't happen if you tune your architecture, so you should forget about it
Real performances issues and bottleneck. In our case, collateral effects.
About this second point, the problem is mainly divided in 3 areas:
Disk reads
Memory usage (buffer)
CPU usage.
About disk reads: the more data (in bytes) you will retrieve, the more the harddrive is going to be busy and slowdown any other activity using it. Reduce the size of selected rows to avoid disk overhead.
About memory usage: mysql manages an internal buffer, that can get stuck in some situations. I don't know enough about it to give you a proper answer, but I know this is definetly something you should keep an eye on.
About cpu usage: basically the cpu will get busy when it
has to calculate (joins, preparing statements, arithmetics...)
has to do all the peripheric stuff: moving bytes from disk to memory for instance.
Optimize your queries to reduce cpu overhead. (sounds silly but, well, it always turns out to be the problem anyway...)
So, now when to know when there's a collateral effect? By profiling your hardware...
How to profile?
absolute profiling: use SHOW INNODB STATUS or SHOW PROFILE to get useful informations about main mysql harddrive, cpu and memory watches.
relative profiling: use your favorite OS profiler. Under windows xp for instance, you can use the great perfmon.exe and watch for PRIVATE BYTES and VIRTUAL BYTES of the mysql process. I say relative, because afterall if a query is time consuming on your computer, it might not be on the NASA system...
Hope it helps, regards.
This is a very general question, so giving a precise answer is difficult.
You can think of the database as a pool of shared resources; especially because the underlying hardware your database runs on has physical limits. Most often the reason you see something like a select query that causes a performance impact on other queries it's because they're all competing for using those underlying physical resources like Disk IO or RAM access or CPU time and there isn't enough to go around.
So the actual results you wil see depend heavily on your database's physical hardware, and the configuration settings.
For instance in your select examples the variables might be: Is the data the query needs already in RAM? Can it look up the rows efficiently by an index? If it does have to do IO, how many other queries are asking to read data from disk? Are you using a secondary index and have to do multiple reads? Is the database doing read-ahead to buffer other pages? Is the query causing sequential or random io? Are any updates holding locks on the data? How much read IO can physical hardware support?
You would have to answer all those questions for all queries currently executing to know if they're going to affect performance of others queries.
This is why DBAs exist. Busy databases are complex system, and it's all about the interaction of a great many different operations, all with thousands of possible variables affecting them.
So what you generally do is optimize the things you can control as well as you know how (hardware, mysql configuration, schema and indexes) then start measuring the system as it runs to understand what is actually going on.
So in your case, I would say that it's infinitely more helpful to focus on simply optimizing your queries individually. The faster they execute, the less resources they are probably using and the less change they will impact others. Then you learn to analyze the system. Just look at one thing that's slow and ask "why is this slow?" Then fix it. That's the optimization process.
However, in the first case you wrote with SELECT ... FOR UPDATE explicit locks can and will be big performance issues. Be careful with those.
Read queries are only affected by isolation levels of other queries. They themselves do not block the table ever.
Isolation levels are designated transactional safety modes. If another query that uses locking does not allow dirty reads your reads will be held until the other query finishes writing or unlocks.
MVCC is a mechanism that allows databases to create a new version of the data when they need to update or delete. Which means that when you start a read on the current version of the data, it data won't get tainted by future updates/deletes.
When you start a write on current data despite the data being currently read by another process, you're in fact writing the new stuff somewhere else and marking them as the newest version. Which in the end means no blocking for the writing process (at least not because of the reading process).

Run analytics on huge MySQL database

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and whole database goes upto 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
having smokingstatus='current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
Have Also looked into Hadoop but I am not sure how can I implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option but since requirements for this kind of ad-hoc aggregated values keeps changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the InfoBright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - its pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of you system it may take more care & feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many read so you do per day and how many writes? Can you live with some lag, e.g write to a new table each day and merge it to the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
Hope that helps
MySQL is have a serious limitation what prevent him to be able to perform good on such scenarious. The problem is a lack of parralel query capability - it can not utilize multiple CPUs in the single query.
Hadoop has an RDMBS like addition called Hive. It is application capable of translate your queries in Hive QL (sql like engine) into the MapReduce jobs. Since it is actually small adition on top of Hadoop it inherits its linear scalability
I would suggest to deploy hive alongside MySQL, replicate daily data to there and run heavy aggregations agains it. It will offload serious part of the load fro MySQL. You still need it for the short interactive queries, usually backed by indexes. You need them since Hive is iherently not-interactive - each query will take at least a few dozens of seconds.
Cassandra is built for the Key-Value type of access and does not have scalable GroupBy capability build-in. There is DataStax's Brisk which integrate Cassandra with Hive/MapReduce but it might be not trivial to map your schema into Cassandra and you still not get flexibility and indexing capabiilties of the RDBMS.
As a bottom line - Hive alongside MySQL should be good solution.

MySQL - Concurrent SELECTS - one client waits for another?

I have the following scenario:
I have a database with a particular MyISAM table of about 4 million rows. I use stored procedures (MySQL Version 5.1) and one in particular to search through these rows on various criteria. This table has several indexes on it, and the queries through this stored procedure are normally very fast ( <1s). Basically I use a prepared statement and create and execute some dynamic SQL in this search sp. After executing the prepared statement, I perform "DEALLOCATE PREPARED stmt;"
Most of the queries run in under a second (I use LIMIT to get just 15 rows at any time). However, there are some rare queries which take longer to run (say 2-3s). I have optimized the searched table as far as I can.
I have developed a web application and I can run and see the results of the fast queries in under a second on my development machine.
However, if I open two browser instances and do a simultaneous search (against the development machine), one with the longer running query, and the other with the faster query, the results are returned at the same time, i.e. it seems as if the fast query waits for the slower query to finish before returning the results. i.e. both queries will take 2-3 seconds...
Is there a reason for this? Because I thought that MyISAM handles SELECTS irrespective of one another and currently this is not the behaviour I am experiencing...
Thanks in advance!
Tim
This is just due to you doing it from the same machine, if the searches were coming from two different machines they would go at the same time. Would you really like one person to be able to bog down your MySQL server just by opening a bunch of browser windows and hitting refresh?
That is right. Each select query on a MyISAM table locks the entire table until it is finished. Their excuse is that this achieves "a very high read throughput". Switching to innoDB will allow concurrent reads.