I want to use Galera Cluster in our production environment, but I have some concerns:
Each table must have at least one explicit primary key defined.
Each table must run under InnoDB or XtraDB storage engine.
Chunk up your big transaction in batches. For example, rather than having one transaction insert 100,000 rows, break it up into smaller chunks of e.g., insert 1000 rows per transaction.
Your application can tolerate non-sequential auto-increment values.
Schema changes are handled differently.
Handle hotspots/Galera deadlocks by sending writes to a single node.
I would like some clarification on all of the points above. Also, we have over 600 databases in production; can Galera work in this environment?
Thanks
That is a LOT to handle in one shot. There are two issues: table creation (which involves the schema, see point 5) and the applications that use those tables. I'll try:
1) Each table must have at least one explicit primary key defined.
When you create a table, it must have a primary key. Tables are created with fields and indexes, and one of those indexes must be declared as the PRIMARY KEY.
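2) Each table must run under InnoDB or XtraDB storage engine.
When tables are created, they must use ENGINE=InnoDB. XtraDB is a drop-in replacement for InnoDB and is selected with the same ENGINE=InnoDB clause. Galera does not properly replicate MyISAM tables (the default engine in older MySQL versions).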
3) Chunk up your big transaction in batches. For example, rather than having one transaction insert 100,000 rows, break it up into smaller chunks, e.g., insert 1000 rows per transaction.
This concerns your application, not your schema. Try not to have the application INSERT a large amount of data in a single transaction. A large transaction will work, but it is risky. This is a suggestion, not a requirement.
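To make the batching idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; the table name `items` and the helper names are made up for illustration, and a real application would use its MySQL driver with the same pattern of one short transaction per batch.

```python
import sqlite3
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows with one short transaction per batch; return the commit count."""
    commits = 0
    for batch in chunked(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany("INSERT INTO items (id, val) VALUES (?, ?)", batch)
        commits += 1
    return commits

# sqlite3 in-memory database used only as a runnable stand-in (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, val TEXT)")
commits = insert_in_batches(conn, ((i, f"row-{i}") for i in range(100_000)))
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(commits, total)  # 100 transactions, 100000 rows
```

If one batch fails, only that batch is rolled back, so the application has to decide whether to retry it or stop.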
4) Your application can tolerate non-sequential auto-increment values.
In a cluster, several servers can accept writes. If a column is auto-incremented, each cluster member may be generating values for that same column, so the values are unique but not consecutive. Your application should NEVER assume that the next ID is related to the previous one. For auto-increment columns, do not supply a value yourself; let the DB assign it.
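A rough way to picture why the values come out non-sequential: each node hands out IDs from its own interleaved series (MySQL exposes this through the auto_increment_increment and auto_increment_offset settings, which Galera can manage automatically). The sketch below is a simplified model of that scheme, not the server's actual algorithm; the three-node cluster and the arrival order are made up.

```python
from itertools import count

def node_ids(offset, increment):
    """IDs produced by one node: offset, offset + increment, offset + 2*increment, ..."""
    return count(start=offset, step=increment)

cluster_size = 3  # assumed three-node cluster
nodes = {n: node_ids(offset=n, increment=cluster_size) for n in (1, 2, 3)}

# Hypothetical order in which inserts happen to reach the nodes
arrival = [1, 1, 3, 2, 1, 3]
ids = [next(nodes[n]) for n in arrival]
print(ids)  # [1, 4, 3, 2, 7, 6]
```

The IDs never collide, but they are neither consecutive nor in insertion order, which is why code that relies on "last ID + 1" breaks.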
5) Schema changes are handled differently.
The schema is the description of the tables and indexes, as opposed to the transactions that add, delete or retrieve data. Because you have multiple servers, a schema change has to be handled with care so that every server applies it.
6) Handle hotspots/Galera deadlocks by sending writes to a single node.
This is both application and DB related. A deadlock is a condition where one part of an app gets a value (ValueA), asks the DB to lock it so it can be changed, and then tries to get another value (ValueB) for the same operation. If another part locks ValueB first and then tries to lock ValueA, we have a deadlock, because each one holds the lock the other needs next. In Galera, conflicting writes to the same rows on different nodes are detected at commit time, and the losing transaction is rolled back with a deadlock error. To avoid this, it's best to write to only one server in the cluster and use the other servers for reading. Note that you can still have deadlocks within your application, but you avoid Galera creating the situation.
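The ValueA/ValueB scenario can be illustrated with ordinary in-process locks. This is only an analogy for the application-side deadlock described above (it says nothing about Galera's own conflict detection): if every code path takes its locks in the same agreed order, the circular wait cannot form. The names below are made up for the example.

```python
import threading

locks = {"ValueA": threading.Lock(), "ValueB": threading.Lock()}
values = {"ValueA": 0, "ValueB": 0}

def update_both(first, second, times):
    """Update two values; locks are always acquired in sorted-name order."""
    ordered = sorted([first, second])
    for _ in range(times):
        with locks[ordered[0]]:
            with locks[ordered[1]]:
                values[first] += 1
                values[second] += 1

# One worker "wants" A then B, the other B then A; the sort removes the conflict.
t1 = threading.Thread(target=update_both, args=("ValueA", "ValueB", 10_000))
t2 = threading.Thread(target=update_both, args=("ValueB", "ValueA", 10_000))
t1.start()
t2.start()
t1.join()
t2.join()
print(values)  # {'ValueA': 20000, 'ValueB': 20000}
```

The database equivalent is to touch tables and rows in a consistent order in every transaction.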
I have a MySQL database hosted on a webserver which has a set of tables with data in it. I am distributing my front end application, which is built using HTML5 / JavaScript / CSS3.
Now, when multiple users try to insert into or update one of the tables at the same time, will that create a conflict, or will MySQL handle the locking for me automatically? For example, while one user is writing, will it lock the table for him, queue the others, and hand the lock to the next in the queue once he finishes? Will this happen on its own, or do I need to handle this case myself in the MySQL database?
EXAMPLE:
When a user wants to make an insert into the database, he calls a PHP file located on a webserver, which runs an insert command to post data into the database. My concern is whether all the inserts will go through if two or more people insert at the same time.
mysqli_query($con,"INSERT INTO cfv_postbusupdate (BusNumber, Direction, StopNames, Status, comments, username, dayofweek, time) VALUES (".trim($busnum).", '".trim($direction3)."', '".trim($stopname3)."', '".$status."', '".$comments."', '".$username."', '".trim($dayofweek3)."', '".trim($btime3)."' )");
MySQL handles table locking automatically.
Note that with the MyISAM engine, the entire table gets locked, and statements will block ("queue up") waiting for the lock to be released.
The InnoDB engine provides more concurrency, and can do row level locking, rather than locking the entire table.
There may be some cases where you want to take locks on multiple MyISAM tables, if you want to maintain referential integrity, for example, and you want to disallow other sessions from making changes to any of the tables while your session does its work. But, this really kills concurrency; this should be more of an "admin" type function, not really something a concurrent application should be doing.
If you are making use of transactions (InnoDB), the issue your application needs to deal with is the order in which rows in which tables are locked. It's possible for an application to experience "deadlock" exceptions when MySQL detects that two (or more) transactions can't proceed because each needs to obtain locks held by the other. The only thing MySQL can do is detect that and choose one of the transactions to be the victim. The victim is the transaction that gets the "deadlock" exception: MySQL rolls it back so that at least one of the other transactions can proceed.
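Since the victim transaction is simply rolled back, the usual application-side answer is to run it again. Below is a minimal retry sketch. `DeadlockError` is a hypothetical stand-in for whatever exception your driver raises for MySQL error 1213 (ER_LOCK_DEADLOCK), and `flaky_transaction` just simulates two deadlocks followed by a success.

```python
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for the driver error carrying MySQL code 1213."""

def run_with_retry(txn, attempts=3, base_delay=0.01):
    """Call txn(); on a deadlock, wait briefly and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise  # give up and let the caller see the error
            time.sleep(base_delay * attempt)  # small, growing pause before retrying

calls = {"n": 0}

def flaky_transaction():
    """Simulated transaction: chosen as deadlock victim twice, then commits."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"

result = run_with_retry(flaky_transaction)
print(result, calls["n"])  # committed 3
```

The whole transaction has to be replayed from its first statement, not just the statement that failed.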
I don't have a testing environment for this yet. But before I think too much about solutions I'd like to know if people think this would be a problem.
I will have 10-20 Java processes connected to a MySQL db via JDBC. Each will be inserting unique records, all going to the same table. The rate of inserts will be on the order of 1000's per second.
My expectation is that some process will attempt to insert and encounter a table lock while another process is inserting, and that this will result in a JDBC exception and cause that insert to fail.
Clearly if you increase the insert rate sufficiently there eventually will be a point where some buffer somewhere fills up faster than it can be emptied. When such a buffer hits its maximum capacity and can't contain any more data some of your insert statements will have to fail. This will result in an exception being thrown at the client.
However, assuming you have high-end hardware I don't imagine this should happen with 1000 inserts per second, but it does depend on the specific hardware, how much data there is per insert, how many indexes you have, what other queries are running on the system simultaneously, etc.
Regardless of whether you are doing 10 inserts per second or 1000 you still shouldn't blindly assume that every insert will succeed - there's always a chance that an insert will fail because of some network communication error or some other problem. Your application should be able to correctly handle the situation where an insert fails.
Use InnoDB as it supports reads and writes at the same time. MyISAM will usually lock the table during the insert, but give preference to SELECT statements. This can cause issues if you're doing reporting or visualization of the data while trying to do inserts.
If you have a natural primary key (no auto_increment), using it will help avoid some deadlock issues. (Newer versions have fixed this.)
http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/
You might also want to see if you can queue your writes in memory and send them to the database in batches. Lots of little inserts will be much slower than doing batches in transactions.
Good presentation on getting the most out of the MySQL Connector/J JDBC driver:
http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
What engine do you use? That can make a difference.
http://dev.mysql.com/doc/refman/5.5/en/concurrent-inserts.html
I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a perfectly decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable or re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
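As a rough sketch of the CSV route (in Python rather than the asker's C program, and with made-up sample rows): write plain comma-separated lines, then hand the file to LOAD DATA INFILE. The path and table name in the statement are placeholders, not anything from the question.

```python
import csv
import io

def write_rows_as_csv(rows, out):
    """Write rows as comma-separated lines with '\\n' terminators; return the row count."""
    writer = csv.writer(out, lineterminator="\n")
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written

# In-memory buffer for the demo; a real run would open a file on disk instead.
buf = io.StringIO()
n = write_rows_as_csv([(1, 10, 20, 30), (2, 11, 21, 31)], buf)
csv_text = buf.getvalue()
print(csv_text, end="")

# Placeholder statement (hypothetical path and table name) to run in MySQL afterwards
LOAD_STATEMENT = (
    "LOAD DATA INFILE '/path/to/rows.csv' INTO TABLE mytable "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
)
```

Note that a plain load does not add to existing values the way ON DUPLICATE KEY UPDATE does, so duplicates would need to be merged before or after the load.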
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row-by-row?
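The same "collapse the duplicates before touching the real table" idea can be sketched in application code. This is only an illustration with made-up keys and amounts; it assumes the distinct keys fit in memory, which may or may not hold for 150 million rows on a 16GB machine.

```python
from collections import defaultdict

def aggregate(records):
    """Sum the amounts per key so each key is written to the table only once."""
    totals = defaultdict(int)
    for key, amount in records:
        totals[key] += amount
    return dict(totals)

# Hypothetical raw stream: the same hash key shows up several times
raw = [("a1", 2), ("b7", 1), ("a1", 5), ("c3", 4), ("b7", 1)]
merged = aggregate(raw)
print(merged)  # {'a1': 7, 'b7': 2, 'c3': 4}
```

After this step the final table only sees one plain insert per key, so no per-row lookup-and-update is needed while loading.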
I'm rather new to working with multiple threads in a database (most of my career has been spent on the frontend).
Today I tried testing a simple PHP app I wrote to store values in a MySQL db using MyISAM tables, emulating transactions with table locking.
I just wrote a blog post on the procedure Here:
Testing With JMeter
From my results my simple php app appears to keep the transactional integrity intact (as seen from the data in my csv files being the same as the data I re-extracted from the database):
CSV Files:
Query Of Data for Both Users After JMeter Test Run:
Am I right in my assumption that the transactional data integrity is intact?
How do you test for concurrency?
Why not use InnoDB and get the same effect without manual table locks?
Also, what are you protecting against? Consider two users (Bill and Steve):
Bill loads record 1234
Steve loads record 1234
Steve changes record 1234 and submits
Bill waits a bit, then updates the stale record 1234 and submits. These changes clobber Steve's.
Manual table locking doesn't offer any higher data integrity than the native MyISAM table locking. MyISAM natively locks the table when required to prevent data corruption.
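One common way to catch the Bill and Steve sequence above is a version column that every update checks, so a save based on a stale read changes zero rows. The sketch below uses Python's built-in sqlite3 only so that it runs anywhere; the table layout and the helper names `load` and `save` are made up, and the same UPDATE ... WHERE version = ? pattern applies to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in database for the demo (assumption)
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO records VALUES (1234, 'original', 1)")
conn.commit()

def load(record_id):
    """Return (body, version) as the user saw it when opening the record."""
    return conn.execute(
        "SELECT body, version FROM records WHERE id = ?", (record_id,)
    ).fetchone()

def save(record_id, new_body, seen_version):
    """Update only if nobody changed the record since it was loaded."""
    with conn:
        cur = conn.execute(
            "UPDATE records SET body = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_body, record_id, seen_version),
        )
    return cur.rowcount == 1

_, bill_version = load(1234)    # Bill loads record 1234
_, steve_version = load(1234)   # Steve loads record 1234
steve_saved = save(1234, "steve's edit", steve_version)  # succeeds
bill_saved = save(1234, "bill's edit", bill_version)     # stale version, rejected
final_body = load(1234)[0]
print(steve_saved, bill_saved, final_body)  # True False steve's edit
```

When the save is rejected, the application can reload the record and ask the user to redo or merge the change.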
In fact, the reason to use InnoDB over MyISAM is that it will do row locking instead of table locking. It also supports transactions. Multiple updates to different records won't block each other and complex updates to multiple records will block until the transaction is complete.
You need to consider how likely it is in your application that two updates to the same record will happen at the same time. If it is likely, note that table or row locking does not prevent the second update; it only postpones it until the first update completes.
EDIT
From what I remember, MyISAM has a special behavior for inserts. It doesn't need to lock the table at all for an insert as it's just appending to the end of the table. That may not be true for tables with unique indexes or non-autoincrement primary keys.