Maintaining a points table to reduce rows - MySQL

I'm struggling to design an efficient automated task to clean up a reputation points table, similar to SO I suppose.
If a user reads an article, comments on an article and/or shares an article, I give that member some reputation points. If a member does all three of these, for example, there are three separate rows in that DB table. When showing a member's points, I simply use a SUM query to total all points for that member.
Now, with a million active members with high reputation, there are many, many rows in my table, and I would like to clean them up somehow. Using a cron job, I would like to merge all reputation rows for each member that are older than 3 months into one row. For example:
user  | repTask              | repPoints | repDate
------+----------------------+-----------+--------------------
10001 | Commented on article | 5         | 2012-11-12 08:40:32
10001 | Read an article      | 2         | 2012-06-12 12:32:01
10001 | Shared an article    | 10        | 2012-06-04 17:39:44
10001 | Read an article      | 2         | 2012-05-19 01:04:11
Would become:
user  | repTask              | repPoints | repDate
------+----------------------+-----------+--------------------
10001 | Commented on article | 5         | 2012-11-12 08:40:32
10001 | (merged points)      | 14        | Now()
Or (merging months):
user  | repTask              | repPoints | repDate
------+----------------------+-----------+--------------------
10001 | Commented on article | 5         | 2012-11-12 08:40:32
10001 | (Merged for 06/2012) | 12        | Now()
10001 | (Merged for 05/2012) | 2         | Now()
Anything older than 3 months is considered legitimate; anything more recent may still need to be revoked in case of cheating, hence why I chose 3 months.
First of all, is this a good idea? I'm trying to avoid, say in three years' time, having hundreds of millions of rows. If it's not a good idea to merge points, is there a better way to store the data as it comes in? I obviously cannot change what's already stored, but I could make it better for the future.
If this is a good idea, I'm struggling to come up with an efficient query to modify the data. I'm not looking for exact code, but if somebody could describe a suitable query that merges all points older than 3 months for each user, or merges them into separate months for each user, it would be extremely helpful.
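For concreteness, here is a sketch of the kind of per-month merge being asked about. It assumes the table is named reputation (a placeholder name) with exactly the columns shown above, and it captures the cutoff once so the INSERT and the DELETE agree on which rows count as "older than 3 months":

SET @cutoff = NOW() - INTERVAL 3 MONTH;

-- One merged row per user and per calendar month older than the cutoff;
-- rows that are themselves the result of an earlier merge are left alone.
INSERT INTO reputation (user, repTask, repPoints, repDate)
SELECT user,
       CONCAT('(Merged for ', ym, ')'),
       total,
       NOW()
FROM (
    SELECT user,
           DATE_FORMAT(repDate, '%m/%Y') AS ym,
           SUM(repPoints) AS total
    FROM reputation
    WHERE repDate < @cutoff
      AND repTask NOT LIKE '(Merged%'
    GROUP BY user, ym
) AS rolled_up;

-- Then drop the detail rows that were just rolled up.
DELETE FROM reputation
WHERE repDate < @cutoff
  AND repTask NOT LIKE '(Merged%';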

You can do it that way, with cron jobs, but how about this:
Create a trigger or procedure so that any time a point is added, it updates a total column in the users table, and any time a point is revoked, that total is subtracted from.
This way, no matter how many millions or billions of rows are in the points table, you don't have to query them to get the total points. You could even have separate columns for months or years. Also, since you're not deleting any rows, you can go back and retroactively revoke a point from, say, a year ago if needed.
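A minimal sketch of the trigger idea, assuming the points table is named reputation with the columns from the question, and that users has an id and a total_points column (all of these names are assumptions):

-- Keep users.total_points in sync whenever a points row is added or removed.
CREATE TRIGGER reputation_after_insert
AFTER INSERT ON reputation
FOR EACH ROW
  UPDATE users
     SET total_points = total_points + NEW.repPoints
   WHERE id = NEW.user;

CREATE TRIGGER reputation_after_delete
AFTER DELETE ON reputation
FOR EACH ROW
  UPDATE users
     SET total_points = total_points - OLD.repPoints
   WHERE id = OLD.user;

With this in place, showing a member's points is a single-row lookup on users instead of a SUM over the whole points table.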

Related

MySQL: Find most similar numerical rows based on multiple columns

This is my first question here; I'll try my best to be clear and factual. I've googled for quite a long time but never got the result I wanted. My MySQL knowledge isn't the best, and maybe that's why I can't get this to work the way I want.
First, here's my MySQL data:
user | speed | strength | stamina | precision
---------------------------------------------
1 | 4 | 3 | 5 | 2
2 | 2 | 5 | 3 | 4
3 | 3 | 4 | 6 | 3
Question
I want a MySQL query that finds the most similar row to a specific user. For example, if I want to see who's most similar to user 1, I want it to find user 3. Users 1 and 2 have the same total value (14), but users 1 and 3 are more similar across the individual columns.
I'd be so glad and grateful if someone knew what MySQL function I should look at, or if you have any ideas.
I think your requirement, translated into a function, would be "the minimum average of the absolute differences between the users' scores for each ability".
If that's the case, it can be translated into SQL like this:
select t2.user,
(
abs(t1.speed - t2.speed) +
abs(t1.strength - t2.strength) +
abs(t1.stamina - t2.stamina) +
abs(t1.precision - t2.precision)
) / 4 as diff_avg
from users t1
cross join
users t2
where t2.user <> t1.user and
t1.user = 1 /* the starting user id goes here */
order by 2 asc
limit 1
The most accurate way to do this numerically is to use profile similarity: get the rows with the highest correlation coefficient to user 1.
I have been looking for a way to do this in MySQL but can't seem to find a way to. Hope someone knows enough about this to help us
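For what it's worth, here is a sketch of one way to compute the Pearson correlation in plain MySQL, assuming the table is named users as in the answer above. It unpivots the four ability columns into rows with UNION ALL, then applies the standard correlation formula with aggregates (the result is NULL for a user whose four scores are all identical, since the variance is zero):

SELECT y.user,
       (COUNT(*) * SUM(x.score * y.score) - SUM(x.score) * SUM(y.score))
       / SQRT( (COUNT(*) * SUM(x.score * x.score) - SUM(x.score) * SUM(x.score))
             * (COUNT(*) * SUM(y.score * y.score) - SUM(y.score) * SUM(y.score)) ) AS correlation
FROM (
    SELECT user, 'speed' AS ability, speed AS score FROM users
    UNION ALL SELECT user, 'strength', strength FROM users
    UNION ALL SELECT user, 'stamina', stamina FROM users
    UNION ALL SELECT user, 'precision', `precision` FROM users  -- precision is a reserved word, hence the backticks
) x
JOIN (
    SELECT user, 'speed' AS ability, speed AS score FROM users
    UNION ALL SELECT user, 'strength', strength FROM users
    UNION ALL SELECT user, 'stamina', stamina FROM users
    UNION ALL SELECT user, 'precision', `precision` FROM users
) y ON y.ability = x.ability
WHERE x.user = 1       /* the starting user id goes here */
  AND y.user <> 1
GROUP BY y.user
ORDER BY correlation DESC
LIMIT 1;

For the sample data this picks user 3, the same result as the difference-based query above.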

Advice on avoiding duplicate inserts when the data is repetitive and I don't have a timestamp

Details
This is a rather weird scenario. I'm trying to store records of sales from a service that I have no control over. I am just visiting a URL and storing the JSON it returns. It returns the last 25 sales of an item, sorted by cost, and it appears that the values stay there for a maximum of 10 hours. The biggest problem is that these values don't have timestamps, so I can't very accurately infer how long items have been on the list or whether they are duplicates.
Example:
Say I check this url at 1pm and I get these results
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Mike | A | 1500 |
| Sue | B | 2000 |
+--------+----------+-------+
At 2pm I get the values and they are:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This would imply that Mike's sale was over 10 hours ago and the value timed out.
At 3pm:
+--------+----------+-------+
| Seller | Category | Price |
+--------+----------+-------+
| Joe | A | 1000 |
| Joe | A | 1000 |
| Sue | B | 2000 |
+--------+----------+-------+
This implies that Joe made 1 sale of $1000 sometime in the past 10 hours, but has also made another sale at the same price since we last checked.
My Goal:
I'd like to be able to store each unique sale in the database once, but allow multiple sales if they do occur (I'm OK with allowing only one sale per day if the original plan is too complicated). I realize that without a timestamp, and with the potential for 25+ sales to push a value off the list early, the results aren't going to be 100% accurate, but I'd like to get at least an approximate idea of the sales occurring.
What I've done so far:
So far, I've made a table that has the aforementioned columns as well as a DATETIME of when I insert the record into my DB, plus my own string version of the day it was inserted (YYYYMMDD). I made the combination of the Seller, Category, Price, and my YYYYMMDD date my primary key. I contemplated searching for entries less than 10 hours old prior to each insert, but I'm doing this operation on about 50k entries per hour, so I'm afraid that would be too much load for the system (I don't know for sure, however; MySQL is not my forte). What I'm currently doing is accepting the rule that only one sale per day is recorded (enforced by my PK being the combination of the values mentioned above), but I discovered that a sale made at 10pm will end up with a duplicate added the next day at 1am, because the value hasn't timed out yet and is considered unique once again now that the date has changed.
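Roughly, the table described above looks like this (column names and types are guesses, not the actual schema):

CREATE TABLE sales (
  seller      VARCHAR(64) NOT NULL,
  category    VARCHAR(16) NOT NULL,
  price       INT         NOT NULL,
  inserted_at DATETIME    NOT NULL,
  insert_day  CHAR(8)     NOT NULL,  -- string version of the insert date, YYYYMMDD
  PRIMARY KEY (seller, category, price, insert_day)
);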
What would you do?
I'd love any ideas on how you'd go about achieving something like this. I'm open to all suggestions, and I'm OK if the solution results in a seller only having one unique sale per day.
Thanks a lot, folks. I've been staring this problem down for a week now, and I feel it's time to get some fresh eyes on it. Any comments are appreciated!
Update - While toying with the thought that I basically want to block entries for a given pseudo-PK (seller-category-price) for 10 hours at a time, it occurred to me: what if I had a two-stage insert process? Any time I get unique values, I put them in a stage-one table that stores the data plus a timestamp of entry. If a duplicate tries to get inserted, I just ignore it. After 10 hours, I move those values from the stage-one table to my final values table, thus re-allowing entry of a duplicate sale after 10 hours. I think this would even allow multiple sales with overlapping times, just with a bit of a delay. Say sales occurred at 1pm and 6pm: the 1pm entry would sit in the stage-one table until 11pm, and once it got moved, the 6pm entry would be recorded, just 5 hours late (unfortunately the value would end up with an insert date that is 5 hours off too, which could push a sale to the next day, but I'm OK with that). This avoids the big issue I feared of querying the DB for duplicates on every insert. The only thing it complicates is live viewing of the data, but I think querying two different tables shouldn't be too bad. What do you guys and gals think? See any flaws in this method?
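A sketch of that two-stage idea, reusing the assumed sales table from above plus an assumed sales_stage table (all names are placeholders):

-- Stage table: one open slot per seller/category/price at a time.
CREATE TABLE sales_stage (
  seller    VARCHAR(64) NOT NULL,
  category  VARCHAR(16) NOT NULL,
  price     INT         NOT NULL,
  staged_at DATETIME    NOT NULL,
  PRIMARY KEY (seller, category, price)
);

-- On each poll: a row already sitting in the stage table is silently skipped.
INSERT IGNORE INTO sales_stage (seller, category, price, staged_at)
VALUES ('Joe', 'A', 1000, NOW());

-- Periodic job: promote rows staged more than 10 hours ago to the final table,
-- then clear them so the same seller/category/price can be staged again.
SET @cutoff = NOW() - INTERVAL 10 HOUR;

INSERT IGNORE INTO sales (seller, category, price, inserted_at, insert_day)
SELECT seller, category, price, staged_at, DATE_FORMAT(staged_at, '%Y%m%d')
FROM sales_stage
WHERE staged_at < @cutoff;

DELETE FROM sales_stage
WHERE staged_at < @cutoff;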
The problem is less about how to store the data than how to recognize which records are distinct in the first place (despite the fact there is no timestamp or transaction ID to distinguish them). If you can distinguish logically distinct records, then you can create a distinct synthetic ID or timestamp, or do whatever you prefer to store the data.
The approach I would recommend is to sample the URL frequently. If you can consistently harvest the data considerably faster than it is updated, you will be able to determine which records have been observed before by noting the sequence of records that precede them.
Assuming the fields in each record have some variability, it would be very improbable for the same sequence of 5, 10, or 15 records to recur by chance in a 10-hour period. So as long as you sample the data quickly enough that only a fraction of the 25 records roll over each time, you can be very confident in your conclusions. This is similar to how DNA is sequenced with a "shotgun" algorithm.
You can determine how frequent the samples need to be by just taking samples and measuring how often you don't see enough prior records -- dial the sample frequency up or down.

Advantages of a lookup table with INTs over decimals in MySQL records?

Trying to summarize in as few words as possible:
I am trying to create a system that tracks the various products an individual can sell and the commission percentage they earn on that particular item. I am thinking about creating reference integers for each product, called "levels", which relate to the commission percentage in a new lookup table, instead of storing the percentage inline on each record. Is this overkill, though, or are there benefits over just placing the value inline for each record?
My gut tells me there are advantages to design 1 below, but the more I think about it, the less sure I am what they are. If I need to update all individuals selling product X with level Y, indexes and replaces ultimately make that easy and fast in both methods. By using design 2, I can dynamically change any "earn" to whatever percentage I can come up with (0.58988439) for a product, whereas in design 1 I would have to create that "level" first.
Note: the product does not relate to the earn directly (one sales rep can earn 50% for the same product on which another sales rep only earns 40%).
Reference Examples:
Design 1 - two tables
table 1
ID | seller_id | product_id | level
-----------------------------------------------
1 | 11111 | 123A | 2
2 | 11111 | 15J1 | 6
3 | 22222 | 123A | 3
table 2
ID | level | earn
--------------------------
1 | 1 | .60
2 | 2 | .55
3 | 3 | .50
4 | 4 | .45
5 | 5 | .40
6 | 6 | .35
Design 2 - one table
ID | seller_id | product_id | earn
-----------------------------------------------
1 | 11111 | 123A | .55
2 | 11111 | 15J1 | .35
3 | 22222 | 123A | .45
(where earn is the commission percentage, stored as a decimal)
Update 1 - 7/9/13
It should also be noted that a rep's commission level can change at any given time. For this, we plan on simply using status, start, and end dates to define the ranges for eligible commission levels / earn. For example, a rep may earn at Level 2 (or 55%) from Jan 1 to Feb 1. This would be noted in both designs above. Then, to find what level or percentage a rep was earning at any given time: select * from table where (... agent information) AND start <= :date AND (end > :date OR end IS NULL)
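A fleshed-out form of that inline query, with assumed table and column names (note that end is a reserved word in MySQL, so it needs backticks when used as a column name):

SELECT *
FROM seller_products
WHERE seller_id = :seller
  AND product_id = :product
  AND start <= :date
  AND (`end` > :date OR `end` IS NULL);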
Does level mean anything to the business?
For instance, I could imagine a situation where the levels are the unit of management. Perhaps there is a rush for sales one quarter, and the rates for each level change. Or, is there reporting by level? In these situations it would make sense to have a separate "level" table.
Another situation would be different levels for different prices of the product -- perhaps the more you sell it for, the higher the commission. Or, the commissions could be based on thresholds, so someone who has sold enough this year suddenly gets a higher commission.
In other words, there could be lots of rules around commission that go beyond the raw percentage. In that case, a "rule" table would be a necessary part of the data model (and "levels" are a particular type of rule).
On the other hand, if you don't have any such rules and the commission is always based on the person and product, then storing the percentage in the table makes a lot of sense. It is simple and understandable. It also has good performance when accessing the percentage -- which presumably happens much more often than changing it.
First of all, using id values to reference a lookup table has nothing to do with normalization per se. Your design #2 shown above is just as normalized. Lots of people have this misunderstanding about normalization.
One advantage to using a lookup table (design #1) is that you can change what is earned by level 6 (for example), and by updating one row in the lookup table, you implicitly affect all rows that reference that level.
Whereas in design #2, you would have to update every row to apply the same change. Not only does this mean updating many rows (which has performance implications), but it opens the possibility that you might not execute the correct UPDATE matching all the rows that need updating. So some rows may end up with the wrong value for what should be the same earning level.
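A sketch of that difference, using made-up names for the two tables in the designs above:

-- Design 1: change one lookup row and every seller assigned to level 6 follows.
UPDATE commission_levels
   SET earn = 0.30
 WHERE level = 6;

-- Design 2: every assignment row storing that rate has to be found and updated,
-- and a WHERE clause like this can easily miss rows (or catch unintended ones).
UPDATE seller_products
   SET earn = 0.30
 WHERE earn = 0.35;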
Again, using a lookup table can be a good idea in many cases, it's just not correct to call it normalization.

Database suggestions for storing a "count" for every hour of the day

I have had an online archive service for over a year now. Unfortunately, I didn't put in the infrastructure to keep statistics. All I have now are the archive access logs.
For every hour there are two audio files (0-30 min in one and 30-60 min in the other). I've currently used MySQL to store the counts. It looks something like this:
| DATE | TIME | COUNT |
| 2012-06-12 | 20:00 | 39 |
| 2012-06-12 | 20:30 | 26 |
| 2012-06-12 | 21:00 | 16 |
and so on...
That makes 365 days * 24 hours * 2 (two half-hours per hour) = 17,520 rows. This makes reads/writes slow, and I feel a lot of space is wasted storing it this way.
So do you know of any other database that will store this data more efficiently and is faster?
That's not too many rows. If it's properly indexed, reads should be pretty fast (writes will be a little slower, but even with tables of up to about half a million rows I hardly notice).
If you are selecting items from the database using something like
select * from my_table where date='2012-06-12'
Then you need to make sure that you have an index on the date column. You can also create multiple-column indexes if you are using more than one column in your WHERE clause. That will make your read statements very fast (as I said, up to on the order of a million rows).
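For example, a minimal sketch (the index names are placeholders):

-- Single-column index covering WHERE date = '...'
CREATE INDEX idx_date ON my_table (date);

-- Composite index, if queries also filter on the time column
CREATE INDEX idx_date_time ON my_table (date, time);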
If you're unacquainted with indexes, see here:
MySQL Indexes

Best to build a SQL Query or extrapolate with another program?

I am having trouble developing some queries on the fly for our clients, and I sometimes find myself asking, "Would it be better to start with a subset of the data I know I'm looking for, then just import it into a program like Excel and process the data there using functions such as Pivot Tables?"
One instance in particular I am struggling with is the following example:
I have an online member enrollment system. For simplicity's sake, let's assume the data captured is: member ID, sign-up date, their referral code, and their state.
A sample member table may look like the following:
MemberID | Date | Ref | USState
=====================================
1 | 2011-01-01 | abc | AL
2 | 2011-01-02 | bcd | AR
3 | 2011-01-03 | cde | CA
4 | 2011-02-01 | abc | TX
and so on....
ultimately, the types of queries I want to build and run with this data set can extend to:
"Show me a list of all referral codes and the number of sign ups they had by each month in a single result set".
For example:
Ref | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
abc | 1 | 1 | 0 | 0
bcd | 1 | 0 | 0 | 0
cde | 1 | 0 | 0 | 0
I have no idea how to build this type of query in MySQL, to be honest (I imagine that if it can be done, it would require a LOT of code, joins, subqueries, and unions).
Similarly, another sample query may be how many members signed up in each state by month:
USState | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
AL | 1 | 0 | 0 | 0
AR | 1 | 0 | 0 | 0
CA | 1 | 0 | 0 | 0
TX | 0 | 1 | 0 | 0
I suppose my question is twofold:
1) Is it in fact best to just try to build these out with the necessary data from within a MySQL GUI such as Navicat, or to import the entire subset of data into Excel and work from there?
2) If I were to use the MySQL route, what is the proper way to build the subsets of data in the examples mentioned above? (Note that the queries could become far more complex, such as "Show how many sign-ups came in for each particular month by each state, grouped by each agent as well" -- each agent has 50 possible rows.)
Thank you so much for your assistance ahead of time.
I am a proponent of doing this kind of querying on the server side, at least to get just the data you need.
You should create a time-periods table. It can get as complex as you desire, going down to days even.
id | year | month | monthstart | monthend
---+------+-------+------------+-----------
1  | 2011 | 1     | 1/1/2011   | 1/31/2011
...
This gives you almost limitless ability to group and query data in all sorts of interesting ways.
Getting the data for the original referral counts by month query you mentioned would be quite simple...
select a.Ref, b.year, b.month, count(*) as referralcount
from myTable a
join months b on a.Date between b.monthstart and b.monthend
group by a.Ref, b.year, b.month
order by a.Ref, b.year, b.month
The result set would be in rows like ref = abc, year = 2011, month = 1, referralcount = 1, as opposed to a column for every month. I am assuming that, since getting a larger set of data and manipulating it in Excel was an option, changing the layout of this data wouldn't be difficult.
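If you do want a column per month straight out of MySQL, conditional aggregation is one way to pivot. A sketch against the member table from the question (myTable, as in the query above), with the month list written out by hand:

SELECT a.Ref,
       SUM(DATE_FORMAT(a.Date, '%Y-%m') = '2011-01') AS `2011-01`,
       SUM(DATE_FORMAT(a.Date, '%Y-%m') = '2011-02') AS `2011-02`,
       SUM(DATE_FORMAT(a.Date, '%Y-%m') = '2011-03') AS `2011-03`,
       SUM(DATE_FORMAT(a.Date, '%Y-%m') = '2011-04') AS `2011-04`
FROM myTable a
GROUP BY a.Ref
ORDER BY a.Ref;

Each SUM counts the rows where the comparison is true (MySQL treats the boolean as 0 or 1), so you get one count column per month.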
Check out this previous answer that goes into a little more detail about the concept with different examples: SQL query for Figuring counts by month
I work on an Excel-based application that deals with multi-dimensional time series data, and I have recently been working on implementing predefined pivot table spreadsheets, so I know exactly what you're thinking. I'm a big proponent of giving users tools rather than writing up individual reports or a whole query language for them to use. You can create pivot tables on the fly that connect to the database, and it's not that hard. Andrew Whitechapel has a great example here. But you will also need to launch that in Excel or set up a basic Excel VSTO program, which is fairly easy to do in Visual Studio 2010. (microsoft.com/vsto)
Another thing: don't feel like you have to create ridiculously complex queries. Every join that you have will slow down any relational database. I discovered years ago that doing multi-step queries into temp tables is in most cases much clearer, faster, and easier to write and support.
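As a small illustration of that multi-step style, using the member table from the question (myTable is the name used earlier and is an assumption here):

-- Step 1: materialize the monthly counts once.
CREATE TEMPORARY TABLE monthly_counts AS
SELECT Ref, USState, DATE_FORMAT(Date, '%Y-%m') AS ym, COUNT(*) AS signups
FROM myTable
GROUP BY Ref, USState, ym;

-- Step 2: each report is then a simple query against the pre-aggregated rows.
SELECT USState, ym, SUM(signups) AS signups
FROM monthly_counts
GROUP BY USState, ym;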