Converting processes from MySQL to Redis

I'm coming from MySQL and trying to wrap my head around Redis. Some things were very obvious, but a couple have me stumped. How would you go about implementing the following in Redis?
First
I have a first-come/first-served reservation system. When a user goes to a specific page, it queries the table below, returns the first badge where reservedby = 0, and then updates reservedby with the user's ID. If the user doesn't complete the process within 15 minutes, reservedby is reset to 0. If the user completes the process, I delete the row from the table and store the badge with the user data. Order is important: the higher on the list a badge is, the better, so if I were to remove it instead of somehow marking it reserved, it would need to go back in at the top if the process isn't completed within 15 minutes.
id | badge | reservedby
------------------------
240 | abc | 4249
241 | bbb | 0
242 | rrr | 0
Second
I have a set of data that doesn't change very often but is queried a lot. When a page loads, it populates a select box with each color; when you select a color, the corresponding sm and lg values are displayed.
id | color | sm | lg
---------------------------
1 | blue | 1 | 5
2 | red | 3 | 10
3 | yellow | 7 | 8
Lastly
As far as storing user data goes, what I'm doing is INCR users, then taking that value and running hmset user:<INCR users value> badge "aab" joindate "10/30/2013", etc. Is that typically how it should be done?

In reverse order:
Yes, that's how you increment IDs in Redis; there is no auto-increment feature.
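For illustration, the pattern can be sketched like this. A tiny in-memory stand-in is used instead of a live Redis connection so the example is self-contained; with redis-py the equivalent calls would be `r.incr("users")` and `r.hset(...)`.

```python
# Sketch of the INCR-based ID pattern from the question. FakeRedis is a
# minimal in-memory stand-in (not the real client) so the flow is clear
# without a running server.

class FakeRedis:
    """Implements just the commands the sketch needs: INCR and HSET/HGETALL."""
    def __init__(self):
        self.store = {}

    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]

    def hset(self, key, mapping):
        self.store.setdefault(key, {}).update(mapping)

    def hgetall(self, key):
        return self.store.get(key, {})

def create_user(r, **fields):
    user_id = r.incr("users")          # INCR users -> next numeric ID
    r.hset(f"user:{user_id}", fields)  # HMSET user:<id> field value ...
    return user_id

r = FakeRedis()
uid = create_user(r, badge="aab", joindate="10/30/2013")
```

The point is simply that the counter and the hash are two separate keys; the INCR result is what ties them together.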
It depends on how frequently the table changes. If it's once a month, consider serving a static JSON file and letting the client side deal with the rest of the logic.
Consider using a ZSET (keeping the scores unique), or a LIST of JSON values.
IMHO your reservation system's internals are weak: I could easily tie up all of your site's offers simply by sending multiple HTTP requests. Other production websites have a limited offer and the user who pays first gets it; the reservation holds while the payment is being processed, completes on success, and reverts on failure.
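To make the LIST suggestion concrete, here is a pure-Python simulation of the reservation flow (not actual Redis calls; the class and method names are made up). The idea: an ordered free list plus a per-badge reservation with an expiry time, with expired badges pushed back onto the head of the list so they keep their priority. In Redis itself this could map to a LIST of free badges plus a SETEX'd key per reservation.

```python
# Pure-Python simulation of the proposed layout; not real Redis calls.
import time
from collections import deque

RESERVATION_TTL = 15 * 60  # seconds

class BadgeQueue:
    """Ordered free list of badges plus per-badge reservations with expiry.
    In Redis: a LIST for the free badges, a SETEX'd key per reservation."""

    def __init__(self, badges):
        self.free = deque(badges)   # head = highest priority
        self.reserved = {}          # badge -> (user_id, expires_at)

    def reserve(self, user_id, now=None):
        now = now if now is not None else time.time()
        self._expire(now)
        if not self.free:
            return None
        badge = self.free.popleft()
        self.reserved[badge] = (user_id, now + RESERVATION_TTL)
        return badge

    def complete(self, badge):
        # User finished the process: the badge leaves the pool for good.
        self.reserved.pop(badge, None)

    def _expire(self, now):
        # Timed-out reservations go back on the *head*, keeping priority.
        for badge, (_, expires) in list(self.reserved.items()):
            if now <= expires:
                continue
            del self.reserved[badge]
            self.free.appendleft(badge)

q = BadgeQueue(["abc", "bbb", "rrr"])
b = q.reserve(user_id=4249, now=0)          # takes "abc"
q2 = q.reserve(user_id=5000, now=16 * 60)   # "abc" expired, goes back first
```

Note that `reserve` pops and records the reservation in one step, which is the part the real implementation would need to make atomic (e.g. MULTI/EXEC or a Lua script) to avoid the multiple-request problem mentioned above.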

Related

Keep certain rows as constant in a MySQL table

I have a situation where I have a table, for example:
| id | type |
------------------
| 0 | Complete |
| 1 | Zone |
Now, I always want my database to be populated with these values, but additionally users should be able to CRUD their own custom types beyond these. For example, a user might decide they want a "Partial Zone" type:
| id | type |
---------------------
| 0 | Complete |
| 1 | Zone |
| 2 | Partial Zone |
This is all fine. But I don't want anyone to be able to delete/modify the first and second rows.
This seems like it should be so simple, but is there a common strategy for handling this case that ensures that these rows go unaffected? Should I put a lock column on the table and only lock these two values when I initially populate the database on application setup? Is there something much more obvious and elegant that I am missing?
Unless I'm missing something, you should be able to just add a third column to your table for the user ID/owner of the record. For the Complete and Zone records, the owner could be, e.g., user 0, corresponding to an admin. In your deletion logic, check the owner column and do not allow admin-owned records to be deleted by anyone from the application.
If this won't work, you could also consider having two tables: one for system records, which cannot be deleted, and another for user-created records. You would likely have to take a union of the two tables whenever you query.
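A sketch of the owner-column check, with SQLite standing in for MySQL (table and column names are made up for illustration): system rows get owner_id = 0, and the application's delete path refuses to touch them.

```python
import sqlite3

# Owner-column approach sketched with SQLite in place of MySQL.
# Rows owned by user 0 (the admin) are the protected system rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE types (id INTEGER PRIMARY KEY, type TEXT, owner_id INTEGER)")
conn.executemany("INSERT INTO types VALUES (?, ?, ?)",
                 [(0, "Complete", 0), (1, "Zone", 0), (2, "Partial Zone", 7)])

def delete_type(conn, type_id):
    # Refuse to delete admin-owned rows; rowcount 0 means protected or absent.
    cur = conn.execute(
        "DELETE FROM types WHERE id = ? AND owner_id <> 0", (type_id,))
    return cur.rowcount

removed_system = delete_type(conn, 0)   # protected row, nothing deleted
removed_user = delete_type(conn, 2)     # user row, deleted
```

The guard lives in one WHERE clause, so every application code path that uses `delete_type` gets the protection for free.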

ruby on rails what is vs what was architecture without papertrail

Concise explanation
There is a row in the database which shows the current state of 'Umbrella', forged from the Model 'Product'.
You want to access the complete history of what you deem to be relevant changes to Umbrella, involving related models, quickly and painlessly.
The problem is that paper trail doesn't bring in the beef when the events table is tens of thousands of rows long; you can't truncate it, as it contains important history, and its performance is woeful, as it has to parse thousands of lines of YAML to find the 'relevant' changes.
Background reading done, still no idea what the problem is called
This seems like something basic to me, but I see no mention of others tackling it beyond using papertrail, so I don't know what it's commonly (and non-proprietarily) referred to as, if anything. "ruby on rails what is vs what was architecture without papertrail" was the best title I could come up with. Am I creating a one-to-many relationship between models and time?
Have read "Design Patterns in Ruby" (2007), which references the Gang of Four's design patterns; no mention of this problem.
Have tried the "paper trail" gem, but it doesn't quite solve it.
The problem
Assuming you have Products, Companies and Categories, and
Product: id, name, price, barcode, (also company_id and category_id)
Company: id, name, registered_company_number
Category: id, name, some_immutable_field
Company has many Products
Category has many Products
And you need to see history of each Product, including changes on itself such as price, changes to which company it belongs to, changes to company name, same thing for categories, such as:
date  | event      | company name | cmp id | category name | cat id | name     | price
------|------------|--------------|--------|---------------|--------|----------|------
jan11 | created | megacorp | 1 | outdoors | 101 | umbrella | 10
feb11 | cat change | megacorp | 1 | fashion | 102 | umbrella | 10
mar11 | cat rename | megacorp | 1 | vogue | 102 | umbrella | 10
apr11 | cmp rename | megacorp inc | 1 | vogue | 102 | umbrella | 10
may11 | cmp change | ultra & sons | 2 | vogue | 102 | umbrella | 12
jul11 | cmp change | megacorp | 1 | vogue | 102 | umbrella | 12
Note that while umbrella was with ultra & sons, megacorp inc changed its name back to megacorp, but we don't show that in this history as it's not relevant to this product. (The name change of company 1 happens in jun11, but is not shown.)
This can be accomplished with papertrail, but the code to do it is either very complex, long and procedural; or if written 'elegantly' in the way papertrail intended, very very slow as it makes many db calls to what is currently a very bloated events table.
Why paper trail is not the right solution here
Paper trail stores all changes in YAML; the database table is polymorphic and stores a lot of data from many different models. This table, and thus this gem, seems suited to identifying who made what changes... but to use it for the history I need, it's like a god table that stores everything about what was, and it has too much responsibility.
The history I am after does not care about all changes to an object, only certain fields. (We still need to record all the small changes, just not include them in the history of products, so we can't simply stop recording them; paper trail has its regular duty of identifying who did what and cannot be optimised solely for this purpose.) Pulling this information requires getting all records where item_type is Product and item_id is the currently viewed product_id, parsing the YAML, and checking whether the changes are interesting (did a field change, and is it one we want to see changes to?). Then doing the same for every category and company the product has been associated with in its lifetime, keeping only the changes that occur within the windows during which the product was associated with that category/company.
Paper trail can be turned off quite easily... so if one of your devs disables it somewhere as an optimisation while some operations run, but forgets to turn it back on, no history is recorded. And because paper trail is more of a man-on-the-loop than a man-in-the-loop, if it's not running you might not notice (and then have to write overly complex code to catch all the possible scenarios with holey data). A solution which enforces the saving of history is required.
Half baked solution
Conceptually, I think the models should be split between that which persists and that which changes. I am surprised this is not something baked into Rails from the ground up, but there are some issues with it:
Product: id, barcode
Product_period: id, name, price, product_id, start_date, (also company_id and category_id)
Company: id, registered_company_number
Company_period: id, name, company_id, start_date
Category: id, some_immutable_field
Category_period: id, name, category_id, start_date
Every time the price of the product, or the company_id of the product changes, a new row is added to product_period which records the beginning of a new era where the umbrella now costs $11, along with the start_date (well, time) that this auspicious period begins.
Thus in the product model, calls to things which are immutable, or for which we only care about the most recent value, remain as they are; whereas things which change, and which we care about, get methods which to an outside user (or existing code) appear to operate on the product model, but in fact fetch the latest values from the most recent product_period for this product.
This solves the problem superficially, but it's a little long-winded, and it still has the problem that you have to poke around through company_period and category_period selecting relevant entries (i.e. the company/category experiences a change during a time when the product was associated with it) rather than something more elegant.
At least MySQL will run faster, there is more freedom to add indexes, and there are no longer thousands of YAML parses bogging it down.
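The window filtering described above can be sketched in plain Python (all names and dates are illustrative): a company_period row is relevant to the product's history only if its start date falls inside an interval during which the product belonged to that company.

```python
from datetime import date

# (company_id, start, end) intervals during which the product belonged to
# that company; end=None means "still current". Data mirrors the example
# table: megacorp (jan-may), ultra & sons (may-jul), back to megacorp.
associations = [
    (1, date(2011, 1, 1), date(2011, 5, 1)),
    (2, date(2011, 5, 1), date(2011, 7, 1)),
    (1, date(2011, 7, 1), None),
]

# company_period rows: (company_id, start_date, name-as-of-that-date)
company_periods = [
    (1, date(2011, 1, 1), "megacorp"),
    (1, date(2011, 4, 1), "megacorp inc"),
    (1, date(2011, 6, 1), "megacorp"),  # rename while product was elsewhere
]

def relevant_company_events(associations, periods):
    """Keep only company_period changes that fall inside an interval when
    the product was actually associated with that company."""
    out = []
    for cid, start, name in periods:
        for a_cid, a_start, a_end in associations:
            if cid == a_cid and a_start <= start and (a_end is None or start < a_end):
                out.append((start, name))
    return out

events = relevant_company_events(associations, company_periods)
```

The jun11 rename is filtered out, matching the example table: the product was with company 2 at the time, so company 1's change is not part of its history.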
On the quest to write more readable code: are these improvements sufficient? What do other people do? Does this have a name? Is there a more elegant solution, or is it just a quagmire of trade-offs?
There are a bunch of other versioning and history gems for Rails (I contributed to the first one, 10 years ago!); find them here: https://www.ruby-toolbox.com/categories/Active_Record_Versioning
They all have different methods of storage like the one you suggest above, and some are configurable. I also don't agree with the polymorphic god table approach, but it's not too slow if you have decent indexes.

Advice on avoiding duplicate inserts when the data is repetitive and I don't have a timestamp

Details
This is a rather weird scenario. I'm trying to store records of sales from a service that I have no control over: I just visit a URL and store the JSON it returns. It returns the last 25 sales of an item, sorted by cost, and it appears the values stay there for a maximum of 10 hours. The biggest problem is that these values have no timestamps, so I can't accurately infer how long items have been on the list or whether they are duplicates.
Example:
Say I check this url at 1pm and I get these results
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Mike | A | 1500 |
| Sue | B | 2000 |
+--------+----------+-------+
At 2pm I get the values and they are:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This would imply that Mike's sale was over 10 hrs ago and the value timed out
At 3pm:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This implies that Joe made 1 sale of $1000 sometime in the past 10 hours, but has also made another sale at the same price since we last checked.
My Goal:
I'd like to be able to store each unique sale in the database once, but allow multiple sales if they do occur (I'm OK with allowing only one sale per day if the original plan is too complicated). I realize that without a timestamp, and with the potential for 25+ sales to push a value off the list early, the results aren't going to be 100% accurate, but I'd like at least an approximate idea of the sales occurring.
What I've done so far:
So far, I've made a table with the aforementioned columns, plus a DATETIME of when I insert the row into my DB and my own string version of the insert day (YYYYMMDD). I made the combination of Seller, Category, Price, and my YYYYMMDD date the primary key. I contemplated searching for entries less than 10 hours old prior to each insert, but I'm doing this operation on about 50k entries per hour, so I'm afraid of it being too much load on the system (I don't know, however; MySQL is not my forte). What I'm currently doing is allowing the recording of only one sale per day (via the composite primary key above), but I discovered that a sale made at 10pm gets a duplicate added the next day at 1am, because the value hasn't timed out yet and is considered unique again once the date changes.
What would you do?
I'd love any ideas on how you'd go about achieving something like this. I'm open to all suggestions, and I'm OK if the solution results in a seller only having one unique sale per day.
Thanks a lot, folks. I've been staring this problem down for a week now and I feel it's time to get some fresh eyes on it. Any comments are appreciated!
Update: while toying with the thought that I basically want to block entries for a given pseudo-PK (seller-category-price) for 10 hours each time, it occurred to me: what if I had a two-stage insert process? Any time I get unique values, I put them in a stage-one table that stores the data plus a timestamp of entry. If a duplicate tries to get inserted, I just ignore it. After 10 hours, I move those values from the stage-one table to my final values table, thus re-allowing entry of a duplicate sale after 10 hours. I think this would even allow multiple sales with overlapping times, with just a bit of a delay. Say sales occurred at 1pm and 6pm: the 1pm entry would sit in the first-stage table until 11pm, and once it was moved, the 6pm entry would be recorded, just 5 hours late. (Unfortunately the value would end up with an insert date that is 5 hours off too, which could push a sale to the next day, but I'm OK with that.) This avoids the big issue I feared: querying the DB for duplicates on every insert. The only thing it complicates is live viewing of the data, but I think querying from 2 different tables shouldn't be too bad. What do you guys and gals think? See any flaws in this method?
The problem is less about how to store the data than how to recognize which records are distinct in the first place (despite the fact there is no timestamp or transaction ID to distinguish them). If you can distinguish logically distinct records, then you can create a distinct synthetic ID or timestamp, or do whatever you prefer to store the data.
The approach I would recommend is to sample the URL frequently. If you can consistently harvest the data considerably faster than it is updated, you will be able to determine which records have been observed before by noting the sequence of records that precede them.
Assuming the fields in each record have some variability, it would be very improbable for the same sequence of 5, 10, or 15 records to occur twice in a 10-hour period. So as long as you sample the data quickly enough that only a fraction of the 25 records roll over each time, you can be very confident in your conclusions. This is similar to "shotgun" DNA sequencing.
You can determine how frequent the samples need to be empirically: take samples, measure how often you don't see enough prior records, and dial the sample frequency up or down accordingly.
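A sketch of that overlap test (assuming each sample preserves a consistent ordering with new records appended at the end, which is an assumption about the feed): the longest suffix of the previous sample that is a prefix of the current one marks where the new records begin.

```python
def new_records(prev, curr):
    """Return the records in `curr` not already seen in `prev`, by finding
    the longest suffix of prev that is a prefix of curr. Assumes both
    samples preserve a consistent ordering (an assumption about the feed;
    the real one is cost-sorted, so a stable secondary order is needed)."""
    for k in range(min(len(prev), len(curr)), 0, -1):
        if prev[-k:] == curr[:k]:
            return curr[k:]
    return list(curr)

# Records are (seller, category, price) tuples, as in the example tables.
prev = [("Joe", "A", 1000), ("Mike", "A", 1500), ("Sue", "B", 2000)]
curr = [("Mike", "A", 1500), ("Sue", "B", 2000), ("Ann", "C", 700)]
fresh = new_records(prev, curr)   # only Ann's sale is new
```

The more records overlap between consecutive samples, the more confident the match; sampling fast enough to keep the overlap long is exactly the tuning knob described above.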

Database design and query optimization/general efficiency when joining 6 tables in mySQL

I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email@address.com
2 | second@email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table, and the current version is 1, so the version in our user_items table is also 1. The two tables are linked by name and version. The setup is like this because a user could have an item for which we don't have data; likewise, there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question: is this a good, well-optimized database setup?
My next question, and my main one, is about joining this complex data structure.
What i had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries, or as an essentially unbounded number depending on how many user items there are?
I was led to believe that the above was inefficient, and as such I wanted to condense my setup into a single query that returns the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require; this question seeks to clarify what the benefits of using it are.
I have had to use a subquery (I believe?) to get the IPs, as previously the results were being multiplied (I believe by the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
- Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
- How does the subquery in my SQL actually work? What is the query doing?
- Am I correct to keep using LEFT JOINs? I want to return the user item, with NULL values to the right where applicable.
- Am I essentially replacing a potentially unbounded number of queries with 2? Does this make a REAL difference? Can the above be improved?
- Given that when I update the version of an item in my data table I now have to update the version in the user_items table as well, I have a few more update queries to do. Is the tradeoff of this setup worthwhile in practice?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem; that will change quickly as the amount of data grows. However, when you have a large amount of data, there are very few circumstances in which you should ever see all your data in one go, implying the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly: there is no GROUP BY.
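To see what the missing GROUP BY does, here is a toy version of the subquery using SQLite in place of MySQL (SQLite's `group_concat(expr, sep)` replaces MySQL's `SEPARATOR` syntax; the sample data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data_ips (name TEXT, version INTEGER, ip_id INTEGER);
    CREATE TABLE ips (ID INTEGER, ip TEXT);
    INSERT INTO data_ips VALUES ('test', 1, 1), ('test', 1, 2), ('other', 1, 3);
    INSERT INTO ips VALUES (1, '1.2.3.4'), (2, '2.3.4.5'), (3, '9.9.9.9');
""")

# Without GROUP BY, the aggregate collapses everything into a single row,
# mixing 'test' and 'other' together (name/version come from an arbitrary row):
no_group = conn.execute("""
    SELECT name, version, group_concat(ips.ip, ',') AS ip
    FROM data_ips LEFT JOIN ips ON data_ips.ip_id = ips.ID
""").fetchall()

# With GROUP BY name, version we get one concatenated row per (name, version),
# which is what the outer join condition x.name/x.version actually needs:
grouped = conn.execute("""
    SELECT name, version, group_concat(ips.ip, ',') AS ip
    FROM data_ips LEFT JOIN ips ON data_ips.ip_id = ips.ID
    GROUP BY name, version
""").fetchall()
```

With only one (name, version) pair in play the bug hides, which is presumably why the original query appeared to work.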
Is the tradeoff off of this setup in practice worthwhile?
No - it implies that your schema is too normalized.

Whether to merge avatar and profile tables?

I have two tables:
Avatars:
Id | UserId | Name | Size
-----------------------------------------------
1 | 2 | 124.png | Large
2 | 2 | 124_thumb.png | Thumb
Profiles:
Id | UserId | Location | Website
-----------------------------------------------
1 | 2 | Dallas, Tx | www.example.com
These tables could be merged into something like:
User Meta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | location | Dallas, Tx
2 | 2 | website | www.example.com
3 | 2 | avatar_lrg | 124.png
4 | 2 | avatar_thmb | 124_thumb.png
This, to me, could be a cleaner, more flexible setup (at least at first glance). For instance, if I need to allow a "user status message", I can do so without touching the database schema.
However, the user's avatars will be pulled far more than their profile information.
So I guess my real questions are:
What kind of performance hit would this produce?
Is merging these tables just a really bad idea?
This is almost always a bad idea. What you are doing is a form of the Entity-Attribute-Value (EAV) model. This model is sometimes necessary when a system needs a flexible attribute system to allow the addition of attributes (and values) in production.
This type of model is essentially built on metadata in lieu of real relational data. That can lead to referential integrity issues, orphaned data, and poor performance (depending on the amount of data in question).
As a general matter, if your attributes are known up front, you want to define them as real data (i.e. actual columns with actual types) as opposed to string-based metadata.
In this case, it looks like users may have one large avatar and one small avatar, so why not make those columns on the user table?
We have a similar table at work that probably started with good intentions but is now quite the headache to deal with, because it now has hundreds of different "MetaKeys" and there is no good documentation of what is allowed or what each one does. You basically have to look at how each is used in the code and figure it out from there. So figure out how you will document this for future developers before you go down that route.
Also, retrieving all the information about each user is no longer a one-row query but an n-row query (where n is the number of fields on the user). And once you have that data, you have to post-process each row based on its meta-key to get the details about your user (which usually turns out to be more development effort, because you have to do a bunch of string comparisons). Next, many databases limit the number of rows a query can return, so the number of users you can retrieve at once is divided by n. Last, ordering users based on information stored this way will be much more complicated and expensive.
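That n-row reconstruction looks roughly like this in code (a sketch; the keys come from the question's User Meta table):

```python
# Folding n meta rows back into one user record, with a string comparison
# per attribute. Rows are (id, user_id, meta_key, meta_value), as in the
# question's User Meta table.
meta_rows = [
    (1, 2, "location",    "Dallas, Tx"),
    (2, 2, "website",     "www.example.com"),
    (3, 2, "avatar_lrg",  "124.png"),
    (4, 2, "avatar_thmb", "124_thumb.png"),
]

def load_user(rows, user_id):
    user = {"id": user_id}
    for _, uid, key, value in rows:
        if uid != user_id:
            continue
        # Every attribute needs its own key comparison / post-processing step.
        if key in ("location", "website", "avatar_lrg", "avatar_thmb"):
            user[key] = value
    return user

user = load_user(meta_rows, 2)   # four rows folded into one record
```

With real columns, all of this is a single `SELECT * ... WHERE id = 2` and zero post-processing.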
In general, I would say that you should make any fields that have specialized functionality, or that require ordering, into columns in your table. Since they will require development effort anyway, you might as well add them as extra columns when you implement them. I would say your avatar pics fall into this category, because you'll probably have one of each and will always want to display the large one in certain places and the small one in others. However, if you want to allow users to define their own fields, the meta table is a good way to do that, though I would make it a separate table that can be joined to from the user table. Below are the tables I'd suggest; I assume that "Status" and "Favorite Color" are custom fields entered by user 2:
User:
| Id | Name |Location | Website | avatarLarge | avatarSmall
----------------------------------------------------------------------
| 2 | iPityDaFu |Dallas, Tx | www.example.com | 124.png | 124_thumb.png
UserMeta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | Status | Hungry
2 | 2 | Favorite Color | Blue
I'd stick with the original layout. Here are the downsides of replacing your existing table structure with a big table of key-value pairs that jump out at me:
Inefficient storage - since the data stored in the metavalue column is mixed, the column must be declared with the worst-case data type, even if all you would need to hold is a boolean for some keys.
Inefficient searching - should you ever need to do a lookup from the value in the future, the mishmash of data will make indexing a nightmare.
Inefficient reading - reading a single user record now means doing an index scan for multiple rows, instead of pulling a single row.
Inefficient writing - writing out a single user record is now a multi-row process.
Contention - having mixed your user data and avatar data together, you've forced threads that only care about one or the other to operate on the same table, increasing your risk of running into locking problems.
Lack of enforcement - your data constraints have now moved into the business layer. The database can no longer ensure that all users have all the attributes they should, or that those attributes are of the right type, etc.