Acquiring a lock with a retry mechanism at low latency - MySQL

I am working on a data processing product where concurrent users ask for data to work on that is in a particular state.
For example, a user can ask to be assigned a data id whose status is `Not_Assigned` or `In_Review`.
Since there can be concurrent requests and everyone should get a unique id, I thought of using database locking. But then I have a retry problem: if a thread cannot acquire the lock, the transaction fails, control goes back to the application, and the application retries to fetch a new id, so the end user faces higher latency. Can someone guide me to a better approach, or explain how you solved a similar problem?
For reference, my sample data will look like below.
Data_id | Status | UserId
1 | Not_Assigned | NULL
2 | REVIEW | 1
3 | DONE | 2
4 | Not_Assigned | NULL
5 | Not_Assigned | NULL
So if two users come and ask for data in the Not_Assigned state, they should get distinct ids from (1, 4, 5), which I can handle by adding a lock on the DB.

If you use an AUTO_INCREMENT column for assigning new ids, there are never any concurrency problems.
If you use some other mechanism, be sure to use InnoDB, START TRANSACTION, COMMIT and, when necessary, SELECT ... FOR UPDATE.
Always check for errors after every query. Errors (e.g., deadlocks and other concurrency issues) may occur even inside a transaction. Plan on rerunning the entire transaction.
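For the "hand me the next Not_Assigned row" pattern specifically, a minimal sketch of the locking approach could look like the following, assuming MySQL 8.0+ for SKIP LOCKED, a hypothetical data_items table shaped like the sample above, and application-supplied placeholders :data_id and :user_id:

START TRANSACTION;

-- Lock one unassigned row; SKIP LOCKED (MySQL 8.0+) lets concurrent sessions
-- grab different rows instead of blocking on the same one.
SELECT Data_id
FROM data_items
WHERE Status = 'Not_Assigned'
ORDER BY Data_id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Claim the row the SELECT returned.
UPDATE data_items
SET Status = 'In_Review', UserId = :user_id
WHERE Data_id = :data_id;

COMMIT;

Without SKIP LOCKED, the second session simply waits on the first session's lock, which is still correct but reintroduces the latency the question is trying to avoid.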

Related

SQL: Update a row if a condition on other rows is met, in concurrent situation

I'm working on an application where multiple instances concurrently update rows in the database.
Each application instance creates an update event in the update event table; an update event can have a status of IN_PROGRESS, NEW, or CANCELED.
I want to create a query that updates an update event only if:
there is no update event on the same itemId with status = IN_PROGRESS, and
there is no update event on the same itemId with status = NEW and a timestamp greater than the current update event's timestamp.
Table:
UpdateId | itemId | status | time_stamp
1 | 1 | IN_PROGRESS | 1.1
2 | 1 | NEW | 1.2
3 | 1 | NEW | 1.3
4 | 1 | NEW | 1.4
With updates 1, 2, 3, 4 as above, basically I want 2 to wait until 1 is done; if 3 or 4 comes in, then 2 -> CANCELED, and the same for 3.
Something like:
UPDATE UPDATE_EVENT SET status = 'IN_PROGRESS'
WHERE updateId = 'abc123'
  AND (SELECT COUNT(*) FROM UPDATE_EVENT
       WHERE status = 'IN_PROGRESS' AND itemId = 'item1') = 0
  AND (SELECT COUNT(*) FROM UPDATE_EVENT
       WHERE status = 'NEW' AND itemId = 'item1'
         AND time_stamp > <time_stamp of abc123>) = 0
The updates are not very frequent, also latency is not an issue.
Any ideas on how I can build the query, and is it thread safe?
The main question is how frequently this runs and what performance requirements you have for this process. There is a shortcut and a very long way.
The very long way would require you to use an ordered/single-threaded processor that receives the requests and queues them. Use a stream processor and similar ideas to control these requests. A stream processor would scale very well if you have a large number of updates in a short time.
For smaller applications, you can look at the transaction isolation level. The database uses a locking mechanism to ensure that the first instance to start the transaction finishes it, and only after that can other instances make their changes.
Neither is a quick solution, and both require some reading: how to set the isolation level in your DBMS, in the application code, and so on.
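If you go the locking route, a rough sketch might look like the following, assuming InnoDB and the UPDATE_EVENT table from the question; :item_id, :my_time_stamp and :my_update_id are application-supplied placeholders, and the application only runs the UPDATE if the SELECT returns no rows:

START TRANSACTION;

-- Lock the competing events for this item so concurrent instances serialize here.
SELECT updateId
FROM UPDATE_EVENT
WHERE itemId = :item_id
  AND (status = 'IN_PROGRESS'
       OR (status = 'NEW' AND time_stamp > :my_time_stamp))
FOR UPDATE;

-- Run only if the SELECT above returned no rows; otherwise mark the event
-- CANCELED (or leave it NEW) and COMMIT.
UPDATE UPDATE_EVENT
SET status = 'IN_PROGRESS'
WHERE updateId = :my_update_id;

COMMIT;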

Is it good to merge two database tables?

We are working on a ticket booking platform where the user selects the number of tickets, fills in the attendee forms and makes the payment. At the database level, we store one transaction entry per transaction in one table and multiple attendee entries in a different table, so there is a one-to-many relation between the transaction table and the attendee table.
Transaction Table:
txnId | order id | buyer name | buyer email | amount | txn_status | attendee json | ....
Attendee Table:
attendeeId | order id | attendee name | attendee email | ......
Now you might be thinking, "Why do I have attendee json in the transaction table?" The answer is: when a user initiates a transaction, we store the attendee data as JSON and mark the transaction as INITIATED. After a successful payment, the same transaction is marked SUCCESS and the attendee JSON is saved into the attendee table. We also use this JSON to show attendee details to the organizer on the dashboard; that way we save a database hit on the attendee table. The JSON is not queryable, which is why we kept the attendee table for the queries we need to run.
Question: For some reason we are now thinking of merging these tables and removing the JSON column. If a transaction is initiated for 4 attendees, we would create four transaction entries, and we have an algorithm to show these entries as a single one on the dashboard. How is this approach going to affect performance? What would you suggest?
Now table will look like this:
txnId | order id | buyer name | buyer email | amount | txn_status | attendee name | attendee email ....
1 | 123 | abc | abc#abc.com | 100 | SUCCESS | xyz | xyz#xyz.com....
2 | 123 | abc | abc#abc.com | 100 | SUCCESS | pqr | pqr#pqr.com....
Normalization attempts to organize the database to minimize redundancy. The technique you're using is called denormalization and it's used to try and optimize reading tables by adding redundant data to avoid joins. It's hotly debated when denormalization is appropriate.
In your case, there should be no performance issue with having two tables and a simple join so long as your foreign keys are indexed.
I would go so far as to say you should eliminate the attendee json column, as it's redundant and likely to fall out of sync, causing bugs. The attendee table would need UPDATE, INSERT and DELETE triggers to keep the JSON up to date, slowing down writes. Many databases have built-in JSON functions which can create JSON very quickly. At minimum, move the cached JSON to the attendee table.
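For example, if the attendee rows are the source of truth, the dashboard JSON can be built on the fly. A sketch, assuming MySQL 5.7.22+ for JSON_ARRAYAGG and hypothetical identifiers (txn_tbl, order_id, attendee_name, attendee_email) standing in for the tables and columns shown above:

-- Build the attendee JSON for one transaction from the attendee rows.
SELECT t.txnId,
       JSON_ARRAYAGG(
         JSON_OBJECT('name', a.attendee_name, 'email', a.attendee_email)
       ) AS attendee_json
FROM txn_tbl t
JOIN attendee a ON a.order_id = t.order_id
WHERE t.txnId = 1
GROUP BY t.txnId;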
In addition, you have order id in both the attendee and transaction tables, hinting at another redundancy. buyer name and buyer email suggest that buyers should also be split off into another table rather than gumming up the transaction table with too much information.
The rule of thumb is to work towards normalization unless you have solid data saying otherwise. Use indexes as indicated by EXPLAIN. Then denormalize only as much as you need to make the database perform as required. Even then, consider putting a cache on the application side instead.
You might be able to cheaply squeak some performance out of your database now, but you're mortgaging your future. What happens when you want to add a feature that has to do with attendee information and nothing to do with transactions? Envision yourself explaining this to a new developer...
You get attendee information from the transaction table... buyer information, too. But a single attendee may be part of multiple transactions, so you need to use DISTINCT or GROUP BY... which will slow everything down. Also they might have slightly different information, so you have to use insert complicated mess here to figure that all out... which will slow everything down. Why is it this way? Optimization, of course! Welcome to the company!

Should I delete my MySQL records, or should I have a flag "is_deleted"?

For some reason, somebody told me never to delete any MySQL records and just flag them as deleted instead.
For example, I'm building a "follow" social network, like Twitter.
+-------------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | | NULL | |
| to_user_id | int(11) | NO | | NULL | |
+-------------+------------+------+-----+---------+----------------+
User 1 follows User 2...
So if one user stops following someone, should I delete this record? Or should I create a column for is_deleted ?
This is a concept called "soft delete". Google that term to find more. But marking with a flag is only one option: you could also actually perform the delete, but have a trigger which stores a copy in a history table. This way you won't have to update all of your SELECT queries to specifically filter out deleted records, and your table won't carry the extra load of scanning past deleted records littering it.
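A minimal sketch of the trigger idea, assuming the follow table from the question is called follows and using a hypothetical follows_history table:

CREATE TABLE follows_history (
  id         INT NOT NULL,
  user_id    INT NOT NULL,
  to_user_id INT NOT NULL,
  deleted_at DATETIME NOT NULL
);

DELIMITER //
CREATE TRIGGER follows_before_delete
BEFORE DELETE ON follows
FOR EACH ROW
BEGIN
  -- Keep a copy of every row deleted from `follows`.
  INSERT INTO follows_history (id, user_id, to_user_id, deleted_at)
  VALUES (OLD.id, OLD.user_id, OLD.to_user_id, NOW());
END//
DELIMITER ;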
Generalizing about the larger concept of "you should never delete records" would (and should) probably get this question closed as Not Constructive, but you've given a specific scenario:
User 1 follows User 2...
So if one user stops following someone, should I delete this record?
Or should I create a column for is_deleted ?
The answer in your case depends on whether, after an unfollow, you ever again need to know that User 1 followed User 2. Some made-up, possibly silly, examples where this might be the case:
if it was desirable to change the text User 1 sees when electing to follow User 2 from "Follow User 2" to "Follow User 2 again? Really? Didn't you learn your lesson?"
if you wanted to show User 2 a graph of who (or, in aggregate, how many) followers they've had over time
If you don't need functionality that relies on the past state of users following each other, then it's safe to delete the records. No need to take on the complexity of soft delete when you ain't gonna need it.
I wouldn't say "never delete any MySQL records". It depends. If you want to keep track of user interactions, you can do that with delete flags. You could even create a separate logging table which tracks each action like "follow" and "unfollow" with the appropriate user IDs and timestamps, which gives you more information in the end (a sketch follows below).
It's up to you and depends on which data you want to store. And please consider the privacy of your users: if they explicitly want their data deleted, then do so.
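A sketch of that logging table; the names are illustrative, not taken from an existing schema:

CREATE TABLE follow_log (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id    INT NOT NULL,
  to_user_id INT NOT NULL,
  action     ENUM('follow', 'unfollow') NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB;

-- Record each action as it happens; the live follow table stays small.
INSERT INTO follow_log (user_id, to_user_id, action) VALUES (1, 2, 'unfollow');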
I have always been a fan of creating a blnDeleted field and using that instead of deleting the record. It is much easier to recover or add the data back in if you leave it in the database.
You may think you will never need the data again, but it is possible you will, even for something as simple as tracking unsubscribes.

Isolation level required for reliable de/increments on a single field

Imagine we have a table as follows,
+----+---------+--------+
| id | Name | Bunnies|
+----+---------+--------+
| 1 | England | 1000 |
| 2 | Russia | 1000 |
+----+---------+--------+
And we have multiple users removing bunnies for a specified period, such as 2 hours. (So minimum 0 bunnies, maximum 1000 bunnies; bunnies are returned by users, not added.)
I'm using two basic transaction queries like
BEGIN;
UPDATE `BunnyTracker` SET `Bunnies`=`Bunnies`+1 where `id`=1;
COMMIT;
When someone returns a bunny and,
BEGIN;
UPDATE `BunnyTracker` SET `Bunnies`=`Bunnies`-1 where `id`=1 AND `Bunnies` > 0;
COMMIT;
When someone attempts to take a bunny. I'm assuming those queries implement some sort of atomicity under the hood.
It's imperative that users cannot take more bunnies than each country has (i.e., no ending up at -23 bunnies because 23 users transacted concurrently).
My issue is: how do I maintain ACID safety in this case while being able to concurrently increment and decrement the Bunnies field and stay within the bounds (0-1000)?
I could set the isolation level to SERIALIZABLE, but I'm worried that would kill performance.
Any tips?
Thanks in advance
I believe you need to implement some additional logic to prevent concurrent increment and decrement transactions from both reading the same initial value.
As it stands, if Bunnies = 1, you could have simultaneous increment and decrement transactions that both read the initial value of 1. If the increment then completes first, its results will be ignored, since the decrement has already read the initial value of 1 and will decrement the value to 0. Whichever of these operations completes last would effectively cancel the other operation.
To resolve this issue, you need to implement a locking read using SELECT ... FOR UPDATE, as described here. For example:
BEGIN;
SELECT `Bunnies` FROM `BunnyTracker` where `id`=1 FOR UPDATE;
UPDATE `BunnyTracker` SET `Bunnies`=`Bunnies`+1 where `id`=1;
COMMIT;
Although it looks to the users like multiple transactions occur simultaneously, within the DB they are actually sequential (e.g., entries get written to the redo/transaction logs one at a time).
Would it therefore work for you to put a constraint on the table (Bunnies >= 0) and catch the failure of a transaction which attempts to breach that constraint?
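A sketch of that idea; note that MySQL only enforces CHECK constraints from 8.0.16 onward (older versions parse and silently ignore them), so on older servers the AND Bunnies > 0 guard plus a check of the affected-row count is the practical equivalent:

ALTER TABLE `BunnyTracker`
  ADD CONSTRAINT chk_bunnies_range CHECK (`Bunnies` BETWEEN 0 AND 1000);

-- A decrement that would take the count below 0 now fails the statement,
-- which the application can catch and treat as "no bunnies left".
UPDATE `BunnyTracker` SET `Bunnies` = `Bunnies` - 1 WHERE `id` = 1;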

Database - Designing an "Events" Table

After reading the tips from this great Nettuts+ article, I've come up with a table schema that would separate highly volatile data from tables subjected to heavy reads and, at the same time, lower the number of tables needed in the whole database schema. However, I'm not sure if this is a good idea since it doesn't follow the rules of normalization, and I would like to hear your advice. Here is the general idea:
I have four types of users modeled in a Class Table Inheritance structure; in the main "user" table I store data common to all users (id, username, password, several flags, ...) along with some TIMESTAMP fields (date_created, date_updated, date_activated, date_lastLogin, ...).
To quote the tip #16 from the Nettuts+ article mentioned above:
Example 2: You have a "last_login" field in your table. It updates every time a user logs in to the website. But every update on a table causes the query cache for that table to be flushed. You can put that field into another table to keep updates to your users table to a minimum.
Now it gets even trickier, I need to keep track of some user statistics like
how many unique times a user profile was seen
how many unique times an ad from a specific type of user was clicked
how many unique times a post from a specific type of user was seen
and so on...
In my fully normalized database this adds up to about 8 to 10 additional tables. It's not a lot, but I would like to keep things simple if I can, so I've come up with the following "events" table:
|------|----------------|----------------|---------------------|-----------|
| ID | TABLE | EVENT | DATE | IP |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 00:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 1 | user | login | 2010-04-19 02:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | created | 2010-04-19 00:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | activated | 2010-04-19 02:34:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | approved | 2010-04-19 09:30:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | login | 2010-04-19 12:00:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | created | 2010-04-19 12:30:00 | 127.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | impressed | 2010-04-19 12:31:00 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:01 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:02 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:03 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:04 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 15 | user_ads | clicked | 2010-04-19 12:31:05 | 127.0.0.2 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | blocked | 2010-04-20 03:19:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
| 2 | user | deleted | 2010-04-20 03:20:00 | 217.0.0.1 |
|------|----------------|----------------|---------------------|-----------|
Basically, ID refers to the primary key (id) field of the table named in the TABLE column; I believe the rest should be pretty straightforward. One thing I've come to like in this design is that I can keep track of all user logins instead of just the last one, and thus generate some interesting metrics with that data.
Due to the growing nature of the events table I also thought of making some optimizations, such as:
#9: Since there are only a finite number of tables and a finite (and predetermined) number of events, the TABLE and EVENT columns could be set up as ENUMs instead of VARCHARs to save some space.
#14: Store IPs as UNSIGNED INTs with INET_ATON() instead of VARCHARs (a quick example follows below).
Store DATEs as TIMESTAMPs instead of DATETIMEs.
Use the ARCHIVE (or the CSV?) engine instead of InnoDB / MyISAM.
Only INSERTs and SELECTs are supported, and data is compressed on the fly.
Overall, each event would only consume 14 (uncompressed) bytes which is okay for my traffic I guess.
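As an illustration of tip #14, INET_ATON() and INET_NTOA() round-trip an IPv4 address through an UNSIGNED INT:

SELECT INET_ATON('127.0.0.2');   -- 2130706434, fits in an UNSIGNED INT column
SELECT INET_NTOA(2130706434);    -- '127.0.0.2'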
Pros:
Ability to store more detailed data (such as logins).
No need to design (and code for) almost a dozen additional tables (dates and statistics).
Reduces a few columns per table and keeps volatile data separated.
Cons:
Non-relational (still not as bad as EAV):
SELECT * FROM events WHERE id = 2 AND `table` = 'user' ORDER BY `date` DESC;
6 bytes overhead per event (ID, TABLE and EVENT).
I'm more inclined to go with this approach since the pros seem to far outweigh the cons, but I'm still a little bit reluctant... Am I missing something? What are your thoughts on this?
Thanks!
#coolgeek:
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Just to be clear, you mean I should add an additional table that maps which events are allowed in a table and use the PK of that table in the events table instead of having a TABLE / EVENT pair?
#ben:
These are all statistics derived from existing data, aren't they?
The additional tables are mostly related to statistics, but the data doesn't already exist. Some examples:
user_ad_stats                user_post_stats
-------------                ---------------
user_ad_id (FK)              user_post_id (FK)
ip                           ip
date                         date
type (impressed, clicked)
If I drop these tables I have no way to keep track of who, what or when; I'm not sure how views can help here.
I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things. I don't think volatility is so important.
I've heard it both ways, and I couldn't find anything in the MySQL manual that says which one is right. Anyway, I agree with you that they should be separate tables because they represent different kinds of data (with the added benefit of being more descriptive than a regular approach).
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues.
What I meant by "not as bad as EAV" is that all records follow a linear structure and are pretty easy to query; there is no hierarchical structure, so all queries can be done with a simple SELECT.
Regarding your second statement, I think you misunderstood me here; the IP address is not necessarily associated with the user. The table structure should read something like this:
IP address (IP) did something (EVENT) to the PK (ID) of the table (TABLE) on date (DATE).
For instance, the last row of my example above should read: IP 217.0.0.1 (some admin) deleted user #2 (whose last known IP is 127.0.0.2) at 2010-04-20 03:20:00.
You can still join, say, user events to users, but you can't implement a foreign key constraint.
Indeed, that's my main concern. However I'm not totally sure what can go wrong with this design that couldn't go wrong with a traditional relational design. I can spot some caveats but as long as the app messing with the database knows what it is doing I guess there shouldn't be any problems.
One other thing that counts in this argument is that I will be storing many more events, more than double compared to the original design, so it makes perfect sense to use the ARCHIVE storage engine here; the only thing is that it supports neither FKs nor UPDATEs/DELETEs.
I highly recommend this approach. Since you're presumably using the same database for OLTP and OLAP, you can gain significant performance benefits by adding in some stars and snowflakes.
I have a social networking app that is currently at 65 tables. I maintain a single table to track object (blog/post, forum/thread, gallery/album/image, etc) views, another for object recommends, and a third table to summarize insert/update activity in a dozen other tables.
One thing that I do slightly differently is to maintain an entity_type table, and use its ID in the object_type column (in your case, the 'TABLE' column). You would want to do the same thing with an event_type table.
Clarifying for Alix - Yes, you maintain a reference table for objects and a reference table for events (these would be your dimension tables). Your fact table would have the following fields (a rough DDL sketch follows the list):
id
object_id
event_id
event_time
ip_address
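A hedged DDL sketch of that dimension/fact layout; all names are illustrative rather than taken from an existing schema, and the object_type column mentioned earlier is included so each fact row identifies which table its object_id points into:

CREATE TABLE entity_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL UNIQUE              -- e.g. 'user', 'user_ads'
) ENGINE=InnoDB;

CREATE TABLE event_type (
  id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32) NOT NULL UNIQUE              -- e.g. 'login', 'clicked'
) ENGINE=InnoDB;

CREATE TABLE event_fact (
  id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  object_type_id TINYINT UNSIGNED NOT NULL,     -- the 'TABLE' column
  object_id      INT UNSIGNED NOT NULL,         -- PK of the row in that table
  event_id       TINYINT UNSIGNED NOT NULL,
  event_time     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  ip_address     INT UNSIGNED NOT NULL,         -- INET_ATON() value
  KEY idx_object (object_type_id, object_id, event_time),
  CONSTRAINT fk_fact_entity FOREIGN KEY (object_type_id) REFERENCES entity_type (id),
  CONSTRAINT fk_fact_event  FOREIGN KEY (event_id)       REFERENCES event_type (id)
) ENGINE=InnoDB;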
It looks like a pretty reasonable design, so I just wanted to challenge a few of your assumptions to make sure you had concrete reasons for what you're doing.
In my fully normalized database this adds up to about 8 to 10 additional tables
These are all statistics derived from existing data, aren't they? (Update: okay, they're not, so disregard following.) Why wouldn't these simply be views, or even materialized views?
It may seem like a slow operation to gather those statistics, however:
proper indexing can make it quite fast
it's not a common operation, so the speed doesn't matter all that much
eliminating redundant data might make other common operations fast and reliable
I've come up with a table schema that would separate highly volatile data from other tables subjected to heavy reads
I guess you're talking about how the user events (just to pick one table), which would be pretty volatile, are separated from the user data. I agree that it ought to be separate, but more because it's fundamentally different data. What someone is and what someone does are two different things.
I don't think volatility is so important. The DBMS should already allow you to put the log file and database file on separate devices, which accomplishes the same thing, and contention shouldn't be an issue with row-level locking.
Non-relational (still not as bad as EAV)
I think you're missing the forest for the trees, so to speak.
The predicate for your table would be "User ID from IP IP at time DATE EVENTed to TABLE" which seems reasonable, but there are issues. (Update: Okay, so it's sort of kinda like that.)
You can still join, say, user events to users, but you can't implement a foreign key constraint. That's why EAV is generally problematic; whether or not something is exactly EAV doesn't really matter. It's generally one or two lines of code to implement a constraint in your schema, but in your app it could be dozens of lines of code, and if the same data is accessed in multiple places by multiple apps, it can easily multiply to thousands of lines of code. So, generally, if you can prevent bad data with a foreign key constraint, you're guaranteed that no app will do that.
You might think that events aren't so important, but, as an example, ad impressions are money. I would definitely want to catch any bugs relating to ad impressions as early in the design process as possible.
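For comparison, the "one or two lines" of schema-level protection would look something like this if events for a single entity lived in their own table; user_events here is a hypothetical per-entity events table, not the polymorphic design above:

ALTER TABLE user_events
  ADD CONSTRAINT fk_user_events_user
  FOREIGN KEY (user_id) REFERENCES `user` (id);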
Further comment
I can spot some caveats but as long as the app messing with the database knows what it is doing I guess there shouldn't be any problems.
And with some caveats you can make a very successful system. With a proper system of constraints, you get to say, "if any app messing with the database doesn't know what it's doing, the DBMS will flag an error." That may require more time and money than you've got, so something simpler that you can have is probably better than something more perfect that you can't. C'est la vie.
I can't add a comment to Ben's answer, so two things...
First, it would be one thing to use views in a standalone OLAP/DSS database; it's quite another to use them in your transaction database. The High Performance MySQL people recommend against using views where performance matters.
WRT data integrity, I agree, and that's another advantage of using a star or snowflake with 'events' as the central fact table (as well as using multiple event tables, like I do). But you cannot design a referential integrity scheme around IP addresses.