Is it better to use database polling or events for the following system? - mysql

I'm working on an ordering system that works exactly the way Netflix's service works (see end of this question if you're not familiar with Netflix). I have two approaches and I am unsure which approach is the right one; one relies on database polling and the other is event driven.
The following two approaches assume this simplified schema:
member(id, planId)
plan(id, moviesPerMonthLimit, moviesAtHomeLimit)
wishlist(memberId, movieId, rank, shippedOn, returnedOn)
Polling: I would run the following count queries in wishlist
Count movies shippedThisMonth (where shippedOn IS NOT NULL #memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL #memberId)
Count moviesInList (#memberId)
The following function will determine how many movies to ship:
moviesToShip = Min(moviesPerMonthLimit - shippedThisMonth, moviesAtHomeLimit - moviesAtHome, moviesInList)
I will loop through each member, run the counts, and loop through their list as many times as moviesToShip. Seems like a pain in the neck, but it works.
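In SQL the polling counts might look like this (a sketch, assuming MySQL; note that shippedThisMonth needs a date filter that my shorthand above leaves implicit):
SELECT
  SUM(shippedOn >= DATE_FORMAT(CURDATE(), '%Y-%m-01')) AS shippedThisMonth,
  SUM(shippedOn IS NOT NULL AND returnedOn IS NULL) AS moviesAtHome,
  COUNT(*) AS moviesInList
FROM wishlist
WHERE memberId = @memberId;
moviesToShip would then be computed in application code with the Min above.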
Event Driven: This approach involves adding an extra column "queuedForShipping" and setting it to 0 or 1 whenever an event takes place. I will do the following counts:
Count movies shippedThisMonth (where shippedOn IS NOT NULL #memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL #memberId)
Count moviesQueuedForShipping (where queuedForShipping = 1, #memberId)
Instead of using min, I have to use the following if statements
If moviesPerMonthLimit > (shippedThisMonth + moviesQueuedForShipping)
AND IF moviesAtHomeLimit > (moviesAtHome + moviesQueuedForShipping))
If both conditions are true, I will select a row from wishlist where queuedForShipping = 0, and set its queuedForShipping to 1. I will run this function every time someone adds, deletes, or reorders their list. When it's time to ship, I would select #memberId where queuedForShipping = 1. I would also run this when updating shippedOn and returnedOn.
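In SQL, the queuing step might look like this (a sketch; MySQL allows ORDER BY and LIMIT on a single-table UPDATE, and rank is backquoted because it can collide with a keyword):
UPDATE wishlist
SET queuedForShipping = 1
WHERE memberId = @memberId
  AND queuedForShipping = 0
  AND shippedOn IS NULL
ORDER BY `rank`
LIMIT 1;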
Approach one is simple. It also allows members to mess around with their ranks until someone decides to run the polling, so what to ship is always decided by rank. But people keep telling me polling is bad.
The event driven approach is self-sustaining, but it seems like a waste of time to ping the database with all those counts every time a person changes their list. I would also have to write to the column queuedForShipping. It also means when a member re-ranks their list and they have pending shipments (shippedOn IS NULL, queuedForShipping = 1) I would have to reset those rows and re-flag queuedForShipping based on the new ranks. (What if someone added 5 movies, and then suddenly went to change the order? Well, queuedForShipping would already be set to 1 on the first two movies he or she added.)
Can someone please give me their opinion on the best approach here and the cons/advantages of polling versus event driven?
Netflix is a monthly subscription service where you create a movie list, and your movies are shipped to you based on your service plan limits.

Based on what you described, there's no reason to keep the data "ready to use" (event) when you can create it very easily when needed (poll).
Reasons to cache it:
If you needed to display the next item to the user.
If the detailed data was being removed due to some retention policy.
If the polling queries were too slow.

Related

Suggestion/feedback on database design for work order tracking in multiple stations

I'm a student intern on a business team and my coworkers don't have a CS background, so I hope to get some feedback and suggestions for improvement on the database design for the Flask web application that I will work on. Also, I taught myself SQL a couple of years ago by following tutorials on YouTube.
When a new work order is received by the business, it is then passed to a line of 5 stations to process it further. Currently the status of the work order is either started or finished. We hope to track it better by knowing the current station/stage (A, B, C, D, E) of the work order and then help improve the flow by letting the operator at each station know what's next in line.
My idea is to create a web app (using Python 3, Flask, and PostgreSQL) that updates the database when an operator at each station scans the work order's barcode and two other static barcodes (in_station_X and out_station_X). Each station will have a tablet connected to a scanner.
I. Station Operator perspective (for example Station 1)
Scan the batch of all incoming work orders (barcodes) for that shift. For each item, they would also scan the in_station_1 barcode to record the time_in for each work order.
The work orders come in a queue, so eventually the web app running on the tablet can show them what's next in line.
When an item is processed, the operator would scan the work order again and also the out_station_1 barcode to record the time_out for each work order.
The items coming out of that station may not be in the same order as the incoming queue due to differing priorities (boolean Yes/No).
II. Admin/dashboard perspective:
See the current station and cycle time of each work order in that day.
Modify the priority of a work order if need be.
Also, possibly see a reloop if a work order fails to be processed at station 2 and needs to go back to station 1.
III. The database:
a. Work Order Info table that contains fields such as:
id, workorder_barcode, requestor, priority (boolean Yes/No), date_created.
b. The Tracking table: I'm thinking of having columns like:
- id (automatically generated for new row)
- workorder_barcode (nullable = False)
- current_station (nullable = False)
- time_in
- time_out
I have several questions/concerns related to this tracking table:
Every time a work order is scanned in or out, a new row will be created (which means one of the two time columns is blank). Do you see any issues with this approach vs. looking up the row for the same work order that has a time_in and filling in its time_out (see the sketch below)? The reason for this is to avoid multiple lookups when the database scales big.
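For comparison, the lookup alternative would be a single UPDATE per out-scan, something like this sketch (tracking is a hypothetical name for the tracking table):
UPDATE tracking
SET time_out = NOW()
WHERE workorder_barcode = '100.1'
  AND current_station = 'A'
  AND time_out IS NULL;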
Since the app screen at each station will show what's next in line, do you think a simple query with ORDER BY to show the order needed would suffice (see the sketch after the queue listing below)? What concerns me is showing the next item based on both the priority of each item and the current incoming order. I think I can sort by multiple columns (time_in and priority) and filter by current_station. However, as you can see below, I think the current table design may be more suitable for capturing events than doing queue control.
For example: the table for today would look like
id, workorder_barcode, current_station, time_in, time_out
61, 100.1, A, 6:00pm, null
62, 100.3, A, 6:01pm, null
63, 100.2, A, 6:02pm, null
...
70, 100.1, A, null, 6:03pm
71, 100.1, B, 6:04pm, null
...
74, 100.5, C, 6:05pm, null
At 6:05pm, the queue at each station would be
Station A queue: 100.3, 100.2
Station B queue: 100.1
Station C queue: 100.5
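For the next-in-line screen, the query I have in mind is something like this sketch (tracking and workorder are hypothetical table names; an "in" row with no later "out" row at the same station means the item is still queued there):
SELECT t_in.workorder_barcode
FROM tracking t_in
JOIN workorder w ON w.workorder_barcode = t_in.workorder_barcode
WHERE t_in.current_station = 'A'
  AND t_in.time_in IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM tracking t_out
                  WHERE t_out.workorder_barcode = t_in.workorder_barcode
                    AND t_out.current_station = t_in.current_station
                    AND t_out.time_out >= t_in.time_in)
ORDER BY w.priority DESC, t_in.time_in;
Against the rows above, this yields 100.3, 100.2 for station A at 6:05pm, matching the queue listing.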
I think this can get complicated with all 5 stations sharing the same table but seeing different queues. Is there a queue-based database that you would recommend I look into?
Thank you so much for reading this. I appreciate any questions, comments, and suggestions since I'm trying to learn more about databases as I get hands-on with this project.

Query ActiveRecord for records and relation calculations at once

TL;DR? See Edit 2
I've got a little Rails application that has a few different sorts of games people can play: it's based around sports, so they can pick the winners of each game every week (model PickEm, attribute correct boolean with nil for unfinished games), and predict the outcome of a specific team's game (model Guess, attribute score with integer, nil for unfinished games). Every User has_many PickEms and Guesses. And I'm trying to display standings (correct/total - total being all non-nil, score/total possible).
What I'm finding is that I can gather the users and their associated records, but in trying to display standings I'm discovering that every single User is triggering another query - slow and not sustainable as the user base increases. That's because user.pick_em_score is pick_ems.where(correct: true).size and user.guess_score is guesses.where.not(score: nil).sum(:score). So I call user.pick_em_score and it runs that query. I feel like there should be a way to get every User, as well as these specific counts, at once, rather than buffering a whole bunch of needless extra stuff.
What I need:
User record
User.pick_em_score (calculated by counting correct records)
User.pick_ems count where NOT NULL
User.guesses_score (calculated by guesses.sum(:score))
User.guesses count where NOT NULL
Most of the stuff I find on Rails's ActiveRecord helpers, especially related to calculations, is for retrieving only the calculation. It looks like I'll probably need to delve directly into select() etc. But I can't get it working. Can someone point me in the right direction?
Edit
For clarification: I'm aware that I can write this information to the User model, but this is overly restrictive: next season, I'll need to add a new column to the User for that year's results, etc. In addition, this is a third degree of callbacks updating related models – the Match model already updates related PickEms and Guesses on save. I'm looking for the simplest ActiveRecord query or queries to be able to work with this information, as indicated by the title. Ideally one query that returns the above information, but if it needs to be a few, that's OK.
I used to work directly in MySQL with PHP, but those skills have rusted (in raw MySQL, I imagine, I'd have several sub-select statements to help pull these counts) and I'd also like to be able to use Rails's ActiveRecord helpers and such, and avoid constructing raw SQL as much as possible.
Second Edit:
I seem to have it down to one call that starts to work, but I'm writing a lot of SQL. It's also brittle, IMO, and trying to run with it has failed. It also looks like I'm just pushing the million singular SELECT queries from Rails right into SQL, but that may still be a step up.
User.unscoped.select('users.*',
  '(SELECT COUNT(*) FROM pick_ems WHERE pick_ems.user_id = users.id AND pick_ems.correct) AS correct_pick_ems',
  '(SELECT COUNT(*) FROM pick_ems WHERE pick_ems.user_id = users.id AND pick_ems.correct IS NOT NULL) AS total_pick_ems',
  '(SELECT SUM(guesses.score) FROM guesses WHERE guesses.user_id = users.id AND guesses.score IS NOT NULL) AS guesses_score',
  '(SELECT COUNT(*) FROM guesses WHERE guesses.user_id = users.id AND guesses.score IS NOT NULL) AS guesses_count')
The issue seems to be: is there a way to use Rails, and not raw SQL, to link up users.id that we see there with these subqueries? Or just … a better way to construct this, in general?
In addition, I'm running another set of SELECTs for the WHERE, which would hinge on total_pick_ems and guesses_count being > 0 but since I can't use those aliased columns, I have to call the SELECT one more time.
Welcome to AR. It's really only good for simple CRUD-like queries. Once you actually want to query your data in anger, it just doesn't have the capabilities to do the queries you want without resorting to wholesale SQL strings and often abandoning the ability to chain as a result.
It's precisely why I moved to Sequel, as it does have the features to compose queries using a much fuller SQL feature set, including join conditions, window functions, recursive common table expressions, and advanced eager loading. The author is incredibly responsive and the documentation is excellent compared to AR and Arel.
I don't expect you will like this answer, but a time will come when you will start to look outside the opinionated components that come with Rails, which I have to say are hardly best of breed. Sequel also sped my application up many times over what I was able to get with AR; it's not just developer happiness, it means fewer servers to run. Yes, it will be a learning curve, but IMO it's better to learn tools that have your back covered.
Joins might work. Something like below:
User.unscoped.joins(:guesses).joins(:pick_ems).
  where("guesses.score IS NOT NULL").
  select("users.*,
          SUM(guesses.score) AS guesses_score,
          COUNT(guesses.id) AS guesses_count,
          COUNT(CASE WHEN pick_ems.correct = TRUE THEN 1 ELSE NULL END) AS correct_pick_ems,
          COUNT(CASE WHEN pick_ems.correct IS NOT NULL THEN 1 ELSE NULL END) AS total_pick_ems").
  group("users.id")
If you need this information for a limited number of users at a time, then the above query or eager loading (User.includes(:guesses, :pick_ems)) with instance methods like
def correct_pick_ems
  pick_ems.count(&:correct)
end
would work.
However, if you need this information for all the users most of the time, cached counters within the users table would be more optimal.
What you need is some sort of custom (smart) counter_cache that counts only under certain conditions (e.g. correct is true).
You can achieve this using conditional after_save & after_destroy callbacks to build your own custom counter_cache that looks like this:
class PickEm < ActiveRecord::Base
  belongs_to :user
  after_save :increment_finished_counter_cache, if: Proc.new { |pick_em| pick_em.correct }
  after_destroy :decrement_finished_counter_cache, if: Proc.new { |pick_em| pick_em.correct }
  private
  def increment_finished_counter_cache
    # update_column does not trigger validations or callbacks
    user.update_column(:finished_games_counter, user.finished_games_counter + 1)
  end
  def decrement_finished_counter_cache
    # update_column does not trigger validations or callbacks
    user.update_column(:finished_games_counter, user.finished_games_counter - 1)
  end
end
Notes:
Code not tested (only to show the idea)
Some people say it's better to avoid naming custom counters the way Rails names its own (foo_counter_cache)
You should benchmark it, but my hunch is that adding all of that data into a single SELECT isn't going to be much faster than breaking it up into separate SELECTs (I've actually had cases where the latter was faster). By breaking it up, you can also stick to more ActiveRecord and less raw SQL, e.g.:
user_ids_to_pick_em_score = User.joins(:pick_ems).where(pick_ems: {correct: true}).group(:user_id).count
user_ids_to_pick_ems_count = User.joins(:pick_ems).where.not(pick_ems: {correct: nil}).group(:user_id).count
user_ids_to_guesses_score = Hash[User.select("users.id, SUM(guesses.score) AS total_score").joins(:guesses).group(:user_id).map{|u| [u.id, u.total_score]}]
user_ids_to_guesses_count = User.joins(:guesses).where.not(guesses: {score: nil}).group(:user_id).count
Edit: To display them, you could do like so:
<%- User.select(:id, :name).find_each do |u| -%>
Name: <%= u.name %>
Picks Correct: <%= user_ids_to_pick_em_score[u.id] %>/<%= user_ids_to_pick_ems_count[u.id] %>
Total Score: <%= user_ids_to_guesses_score[u.id] %>/<%= user_ids_to_guesses_count[u.id] %>
<%- end -%>

How do I calculate the importance/weight of input based on a user's reputation?

I have a couple of systems which contain a users table along with some form of karma/weight/reputation. Sometimes it's the number of posts a user has made, sometimes it's the number of up/down votes a user has received across all their activity on the site.
USER {
id int
name string
karma int
}
How do I use these numbers to calculate that user's "weight" or "authority"? For example, the vote of one long-time member is often worth much more than 4 votes from brand new users.
I was thinking about adding up the total points/karma/reputation of all members and then trying to come up with a 1-100 scale.
SUM(user.points) / COUNT(user.*) = average user points
Then something like
CEIL(userA.points / average user points) = their weight on an issue
However, there also needs to be a curve on the points this way, as I don't want someone with 5,000 posts/karma to outweigh 20 new users' votes.
Mathematically, your best bet is to weight by the log of the percentile ranking of the user in question. However, that is painful in SQL.
Simpler would be to cheat and assume the mean is the same as the median (a very bad assumption statistically, but much simpler programmatically):
SELECT 1 - LOG10(
         -- fraction of users with more points than this user
         -- (floored at 1 to avoid LOG10(0) for the top user)
         GREATEST((SELECT COUNT(*) FROM user others WHERE others.points > u.points), 1)
         / (SELECT COUNT(*) FROM user)) AS weight
FROM user u
WHERE u.id = @userId
In this way, your top 10% of karma would have one and a half times the impact of your average user, and almost twice the impact of a noob.
Changing the log base would scale this, obviously, where natural log (LOG() in MySQL) would give the upper 10% 3 times as much impact as a noob, and twice the impact of an average user. LOG2() is even more extreme. (Note: subtraction is required because the log will be negative.)
If you want a more severe effect you might try squaring the log. (Note: squaring makes the log squared positive, so addition is appropriate here.)
If you want a hyperprecise rule, you can go into standard deviations, but the SQL gets cumbersome and slow. It all depends on how far down the rabbit hole you want to go....
There are probably some resources that can provide you with parameters for this, but you should probably decide exactly what you want rather than using some predefined model. I suggest you define some rules for which sets of users should be equivalent or which should outweigh each other (e.g. ten 0-karma users = one 5k-karma user; equivalence is much easier to work with), which will very quickly produce parameters for some chosen equation.
Using log (as already suggested), some (fractional) power (like square root) or even just linear can work.
I suggest something like newKarma = a.karma^b + c, and it shouldn't be too difficult to solve for a, b and c. I suggest you pick b rather than trying to calculate it. Using new users (with karma = 0) should make this quite easy to solve: each noob contributes c, so the rule above with b = 0.5 gives a.5000^0.5 + c = 10c, i.e. a ≈ 9c/70.7. Guessing values to get close to what you want can be easier than determining them mathematically (since some rules together won't fit any simple equation).
Note that c above is an offset to karma, which will give many new users more total karma than high-karma users. You may also want to think about a.(karma + c)^b, or a.(karma + c)^b + d. Analysing the rules you defined should tell you which one to use.
UPDATE: Added alternatives for c
EDIT: You have some options for SQL. A temp table (with sums) might actually be the fastest. You can also just use a view. A join on the same table might also be possible, though I'm not sure. Using a view would look something like: (for some chosen a,b,c and d) (you may also want to add indices to the view)
Votes(issueID, userID) // table structure
User(userID, karma, ...) // table structure
CREATE VIEW Sums AS
SELECT issueID, SUM(1*POWER(karma + 2, 3) + 4) AS sumVal
FROM Votes JOIN User ON User.userID = Votes.userID
GROUP BY issueID
Query:
SELECT (1*POWER(karma + 2, 3) + 4)/sumVal AS influenceOnIssue
FROM Votes JOIN User ON User.userID = Votes.userID
JOIN Sums on Sums.issueID = Votes.issueID
WHERE Votes.userID = #UserID AND Votes.issueID = #IssueID
A simplification may be to have a computed column equal to 1*POWER(karma + 2, 3) + 4.
The faster option would be to calculate the derived karma on insert/update, either by having an additional column and using triggers, or just calculating it before you call insert/update and calling insert/update with the new value.
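For the computed-column variant, a sketch using MySQL 5.7+ generated columns (same constants as the view above; on older versions you would maintain the column with the triggers just mentioned):
ALTER TABLE User
  ADD COLUMN derivedKarma DOUBLE AS (1*POWER(karma + 2, 3) + 4) STORED;
The view and query can then read derivedKarma instead of recomputing the expression.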

Using logic within an update and returning updated fields using as few queries as possible

I'm writing a video game in JavaScript on a server that saves info in a MySQL database, and I am trying to make my first effect, attached to a simple healing potion item. To implement the effect I look up a spells table by spell_id, which gets a field called effect containing the code to execute on my server; I use the eval() function to execute the code in the string. In order to optimize the game I want to run as few queries as possible. For this instance (and I think the answer will help me evaluate other similar effects) I want to update the 'player' table, which contains a stat column like 'health', by adding n, a decreasing number: 15, then 250 ms later 14, then 13, until n = 1. The net effect is a large jump in health followed by smaller and smaller gains. Accomplishing this is relatively easy, but if the player's health reaches his maximum allowed limit the effect should stop immediately...
But I'd like to do a single update statement for each increase rather than a select and an update every 250 ms to check if health > max_health and make sure the player's health doesn't go above his max health. So, to digress a bit, I'd like a single update that, given this data
player_id  health  max_health
=========  ======  ==========
1          90      100
will add 15 to health unless (max_health-health) < 15... in this case it should only add 10.
An easier solution might be if I could just return health and max_health after each update I run. I don't mind doing a final bit of pseudo code:
if health > max_health
update health set health = max health
So if anyone could explain how to return fields after an update, that would help.
Or if anyone could show how to use logic within the update, that would also help.
Also, if I didn't give enough information I'm sorry; I'd be glad to provide more, I just didn't want to make the question hard to understand.
update player
set health = least(max_health, health + <potion effect>)
where player_id = ...
EDIT
For your other question: normally, I think that UPDATE returns the number of affected rows. So if you try to update health when health is already = max_health, it should return 0.
I'd know how to do this in PHP, for example, but you just said you were using JavaScript... so?
http://dev.mysql.com/doc/refman/5.6/en/update.html
UPDATE returns the number of rows that were actually changed. The mysql_info() C API function returns the number of rows that were matched and updated and the number of warnings that occurred during the UPDATE.
Use the ANSI-standard CASE expression, or the MySQL-only LEAST function as in the other answer:
UPDATE player
SET health = CASE WHEN health + [potion] > max_health
                  THEN max_health
                  ELSE health + [potion]
             END
WHERE player_id = [player_id]
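If you also want the new values back, note that MySQL's UPDATE has no RETURNING clause, so the usual pattern is a follow-up SELECT on the same key (a sketch using the LEAST form and the player table from the question):
UPDATE player
SET health = LEAST(max_health, health + 15)
WHERE player_id = 1;
SELECT health, max_health
FROM player
WHERE player_id = 1;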

DynamicQuery: How to select a column with linq query that takes parameters

We want to set up a directory of all the organizations working with us. They are incredibly diverse (government, embassies, private companies, and organizations depending on them). So, I've resolved to create 2 tables. Table 1 will treat all the organizations equally, i.e. it'll collect all the basic information (name, address, phone number, etc.). Table 2 will establish the hierarchy among all the organizations. For instance, the Program for illiterate adults depends on the National Institute for Social Security, which depends on the Labor Ministry.
In the Hierarchy table, each column represents a level. So, for the example above, (i)Labor Ministry - Level1(column1), (ii)National Institute for Social Security - Level2(column2), (iii)Program for illiterate adults - Level3(column3).
To attach an organization to an hierarchy, the user needs to go level by level(i.e. column by column). So, there will be at least 3 situations:
If an adequate hierarchy exists for an organization(for instance, level1: US Embassy), that organization can be added (For instance, level2: USAID).--> US Embassy/USAID, and so on.
How about if one or more levels are missing? - then they need to be added
How about if the hierarchy needs to be modified? -- not everything needs to be modified.
I do not have any choice but to work level by level (i.e. column by column). It does not make sense to have all the levels in one form, as the user needs to navigate hierarchies to find the right one to attach an organization.
Let's say, I have those queries in my repository (just that you get the idea).
Query1
var orgHierarchy = (from orgH in db.Hierarchy
                    select orgH.Level1).FirstOrDefault();
Query2
var orgHierarchy = (from orgH in db.Hierarchy
                    select orgH.Level2).FirstOrDefault();
Query3, Query4, etc.
The above queries are the same except for the property queried (level1, level2, level3, etc.)
Question: Is there a general way of writing the above queries in one? So that the user can track an hierarchy level by level to attach an organization.
In other words, not knowing in advance which column to query, I still need to be able to do so depending on some conditions. For instance, an organization X depends on Y. Knowing that Y is somewhere on the 3rd level, I'll go to the 4th level, linking X to Y.
I need to select (not manually) a column with only one query that takes parameters.
=======================
EDIT
As I just said to @Mark Byers, all I want is just to be able to query a column not knowing in advance which one. Check this out:
How about this:
public Hierarchy GetHierarchy(string name)
{
    var myHierarchy = (from hierarc in db.Hierarchy
                       where hierarc.Level1 == name
                       select hierarc).FirstOrDefault();
    return myHierarchy;
}
Above, the query depends on name, which is a variable. It might be Planning Ministry, Embassy, Local Phone, etc.
Can I write the same query, but this time, instead of looking to match a value in the DB, make my query select a particular column?
var myVar = from orgH in db.Hierarchy
where (orgH.Level1 == "Government")
select orgH.where(level == myVariable);
return myVar;
I don't pretend that select orgH.where(level == myVariable) is even close to be valid. But that is what I want: to be able to select a column depending on a variable (i.e. the value is not known in advance like with name).
Thanks for helping
How about using DynamicQueryable?
http://weblogs.asp.net/scottgu/archive/2008/01/07/dynamic-linq-part-1-using-the-linq-dynamic-query-library.aspx
Your database is not normalized, so you should start by changing the hierarchy table to, for example:
OrganizationId  Parent
1               NULL
2               1
3               1
4               3
To query this you might need to use recursive queries. This is difficult (but not impossible) using LINQ, so you might instead prefer to create a parameterized stored procedure using a recursive CTE and put the query there.
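A sketch of such a recursive CTE, assuming SQL Server, the normalized table above named OrganizationHierarchy, and @OrgId as the procedure's parameter (it walks up from an organization to its ancestors):
WITH OrgChain AS (
    -- anchor: the organization to start from
    SELECT OrganizationId, Parent, 0 AS Depth
    FROM OrganizationHierarchy
    WHERE OrganizationId = @OrgId
    UNION ALL
    -- recursive step: move up to each parent until Parent is NULL
    SELECT h.OrganizationId, h.Parent, c.Depth + 1
    FROM OrganizationHierarchy h
    JOIN OrgChain c ON h.OrganizationId = c.Parent
)
SELECT OrganizationId, Depth
FROM OrgChain;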