How do I calculate the importance/weight of input based on a user's reputation? - mysql

I have a couple of systems that each contain a users table along with some form of karma/weight/reputation. Sometimes it's the number of posts a user has made, sometimes it's the number of up/down votes a user has received across all their activity on the site.
USER {
id int
name string
karma int
}
How do I use these numbers to calculate that user's "weight" or "authority"? For example, the vote of one long-time member is often worth much more than 4 votes from brand new users.
I was thinking about adding up the total points/karma/reputation of all members and then trying to come up with a 1-100 scale.
SUM(user.points) / COUNT(user.*) = average user points
Then something like
CEIL(userA.points / average user points) = their weight on an issue
However, there also needs to be a curve on the points, as I don't want someone with 5,000 posts/karma to outweigh the votes of 20 new users.

Mathematically, your best bet is to weight by the log of the percentile ranking of the user in question. However, that is painful in SQL.
Simpler would be to cheat and assume the mean is the same as the median (a very bad assumption statistically, but much simpler programmatically):
SELECT 1 - LOG10(
(SELECT COUNT(*) FROM user
WHERE points > (SELECT AVG(points) FROM user))
/ (SELECT COUNT(*) FROM user)
);
In this way, your top 10% of karma would have about one and a half times the impact of your average user, and twice the impact of a noob.
Changing the log base would scale this, obviously, where natural log (LOG() in MySQL) would give the upper 10% about 3 times as much impact as a noob, and about twice the impact of the average user. LOG2() is even more extreme. (Note: subtraction is required because the log will be negative.)
If you want a more severe effect you might try squaring the log. (Note: squaring makes the log squared positive, so addition is appropriate here.)
If you want a hyperprecise rule, you can go into standard deviations, but the sql gets cumbersome and slow. It all depends on how far down the rabbit hole you want to go....
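As a rough illustration of the "log of the percentile ranking" idea, here is a minimal Python sketch done in application code rather than SQL. It assumes p is the fraction of users whose karma is at least as high as the user's own, and the function name percentile_weights is made up for this example.

```python
import bisect
import math


def percentile_weights(karma_by_user, log=math.log10):
    """Weight each user by 1 - log(p).

    p is the fraction of users whose karma is at least as high as this
    user's, so 0 < p <= 1 and the log is never positive.
    """
    values = sorted(karma_by_user.values())
    total = len(values)
    weights = {}
    for user, karma in karma_by_user.items():
        # Users with karma >= this user's karma (includes the user).
        at_least = total - bisect.bisect_left(values, karma)
        weights[user] = 1 - log(at_least / total)
    return weights


if __name__ == "__main__":
    # Ten hypothetical users with karma 0..9.
    demo = {f"user{k}": k for k in range(10)}
    for name, w in sorted(percentile_weights(demo).items()):
        print(name, round(w, 3))
```

Passing log=math.log or log=math.log2 changes the spread in the way described above.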

There are probably some resources that can provide you with parameters for this, but you should probably decide exactly what you want rather than using some predefined model. I suggest you define some rules for which sets of users should be equivalent or which should outweigh each other (e.g. 10 0 karma users = 1 5k karma user) (equivalence is much easier to work with), which will very quickly produce parameters for some chosen equation.
Using log (as already suggested), some (fractional) power (like square root) or even just linear can work.
I suggest something like newKarma = a * karma^b + c, and it shouldn't be too difficult to solve for a, b and c. I suggest you pick b rather than trying to calculate it. Using new users (with karma = 0) should make this quite easy to solve. Guessing values to get close to what you want can be easier than determining them mathematically (since some sets of rules won't fit any simple equation).
Note that c above is an offset added to every user's weight, so a large enough group of new users will have more total weight than a high-karma user. You may also want to think about a * (karma + c)^b, or a * (karma + c)^b + d. Analysing the rules you defined should tell you which one to use.
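To make the "define an equivalence rule, then solve" step concrete, here is a small Python sketch for the form weight = a * karma^b + c. It assumes b is picked by hand, that a user with karma 0 gets a base weight of c, and that the single rule is "n new users together equal one high-karma user". The function names are hypothetical.

```python
def fit_power_weight(b, high_karma, n_new_users, base_weight=1.0):
    """Solve a and c in weight = a * karma**b + c from one rule:
    n_new_users users with karma 0 together equal one user with high_karma.
    """
    c = base_weight  # a user with karma 0 has weight exactly c
    # n * c = a * high_karma**b + c  =>  a = (n - 1) * c / high_karma**b
    a = (n_new_users - 1) * c / high_karma ** b
    return a, c


def weight(karma, a, b, c):
    """Derived weight for a given karma value."""
    return a * karma ** b + c


if __name__ == "__main__":
    b = 0.5  # square-root curve, chosen rather than solved for
    a, c = fit_power_weight(b, high_karma=5000, n_new_users=10)
    print("a =", a, "c =", c)
    print("weight(5000) =", weight(5000, a, b, c))
    print("10 * weight(0) =", 10 * weight(0, a, b, c))
```

With more than one rule the system may have no exact solution, in which case guessing values as suggested above is the practical route.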
UPDATE: Added alternatives for c
EDIT: You have some options for SQL. A temp table (with sums) might actually be the fastest. You can also just use a view. A join on the same table might also be possible, though I'm not sure. Using a view would look something like this, for some chosen a, b, c and d (here a = 1, c = 2, b = 3, d = 4; you may also want to index the underlying tables, since MySQL views cannot be indexed themselves):
Votes(issueID, userID) // table structure
User(userID, karma, ...) // table structure
CREATE VIEW Sums AS
SELECT issueID, SUM(1*POWER(karma + 2, 3) + 4) AS sumVal
FROM Votes JOIN User ON User.userID = Votes.userID
GROUP BY issueID
Query:
SELECT (1*POWER(karma + 2, 3) + 4)/sumVal AS influenceOnIssue
FROM Votes JOIN User ON User.userID = Votes.userID
JOIN Sums on Sums.issueID = Votes.issueID
WHERE Votes.userID = @UserID AND Votes.issueID = @IssueID
A simplification may be to have a computed column that = 1*POWER(karma + 2, 3) + 4
The faster option would be to calculate the derived karma on insert/update, either by having an additional column and using triggers, or by calculating it before you call insert/update and passing in the new value.


MySQL finding data if any 4 of 5 columns are found in a row

I have an imported table of several thousand customers. The system I am working on allows anonymous purchase checkouts (customers do not need to log in to check out), but if enough of their details match a database record, it should do a soft match, email the (probably new) email address, and eventually associate the anonymous checkout with the account record on file.
This is rolling out this way due to the age of the records: many people have the same postal address or name but not the same email address, some people will have moved house, and some will have changed name (marriage etc.).
What I think I am looking for is a MySQL CASE system, however the CASE questions on Stack Overflow I've found don't appear to cover what I'm trying to get from this query.
The query should work something like this:
$input[0] = postcode (zip code)
$input[1] = postal address
$input[2] = phone number
$input[3] = surname
$input[4] = forename
SELECT account_id FROM account WHERE <4 or more of the variables listed match the same row>
The only way I KNOW I can do this is with a massive bunch of OR statements but that's excessive and I'm sure there's a cleaner more concise method.
I also apologise in advance if this is relatively easy but I don't [think I] know the keyword to research constructing this. As I say, CASE is my best guess.
I'm having trouble working out how to manipulate CASE to fit what I'm trying to do. I do not need to return the values only the account_id from the valid row (only) that matches 4 or 5 of the given inputs.
I imagine that I could construct a layout that does this:
SELECT account_id CASE <if postcode_column=postcode_var> X=X+1
CASE <if surname_column=surname_var> X=X+1
...
...
WHERE X > 3
Is CASE the right idea?
If not, What is the process I need to use to achieve the desired results?
What is [another] MySQL keyword / syntax I need to research, if not CASE.
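Here is your pseudo query. In MySQL a comparison evaluates to 1 (true) or 0 (false), so the matches can simply be added up:
SELECT account_id
FROM account
WHERE (postcode = 'pc') +
(postal_address = 'pa') +
(phone_number = '12345678901') +
(surname = 'sn') +
(forename = 'fn') > 3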
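For illustration, here is a minimal Python sketch of the same "add up the comparisons" query with bound parameters. It runs against an in-memory SQLite database (which also evaluates comparisons to 1 or 0) purely so it is self-contained; the sample rows and the function names are made up. One caveat it handles: a comparison against a NULL column yields NULL, which would make the whole sum NULL, so each term is wrapped in COALESCE(..., 0).

```python
import sqlite3

# Column names from the question; values are bound, names are fixed here.
FIELDS = ("postcode", "postal_address", "phone_number", "surname", "forename")


def soft_match(conn, values, min_matches=4):
    """Return account_ids where at least min_matches of the columns match."""
    score = " + ".join(f"COALESCE({col} = ?, 0)" for col in FIELDS)
    sql = f"SELECT account_id FROM account WHERE {score} >= ? ORDER BY account_id"
    return [row[0] for row in conn.execute(sql, (*values, min_matches))]


def demo_connection():
    """Build a tiny in-memory account table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account (account_id INTEGER PRIMARY KEY, postcode TEXT, "
        "postal_address TEXT, phone_number TEXT, surname TEXT, forename TEXT)"
    )
    conn.executemany(
        "INSERT INTO account VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "AB1 2CD", "1 High St", "0123456789", "Smith", "Ann"),
            (2, "AB1 2CD", "1 High St", None, "Smith", "Bob"),
            (3, "ZZ9 9ZZ", "9 Low Rd", "0999999999", "Jones", "Ann"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    checkout = ("AB1 2CD", "1 High St", "0123456789", "Smith", "Anne")
    print(soft_match(conn, checkout))
```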

What's a good pattern to implement running queries on individual entities from a obtained result set in Datanucleus/JPA

I am basically obtaining a decently sized result set (a few thousand) through datanucleus by running a JPQL query. On each of these, I also want to find the number of references from another table. The data is in a MySQL db.
For example:
List<Instrument> instruments = em().createQuery("SELECT i FROM Instrument AS i").getResultList();
for(Instrument i : instruments)
{
Query q = em().createQuery("SELECT COUNT(c) FROM Component AS c WHERE c.instrument.id = :id");
q.setParameter("id", i.getId());
long count = (Long) q.getSingleResult();
}
So, basically I want the list of instruments and also the list of components attached to the instrument as per the above example.
I've used similar code at a bunch of places and it performs pretty poorly. I understand that for 2000 instruments, I'll fire 2000 additional queries to count components and that will slow things down. I'm sure there's a better pattern to obtain the same result that I want. How can I get things to speed up?
That's right, this is not an optimal solution. But the good news is that all of this can be done with one or at most two queries.
For instance you don't have to execute the counting query once for each instrument.
You can use grouping and get all counts with one query:
List<Instrument> instruments = em().createQuery("SELECT i FROM Instrument AS i").getResultList();
Query q = em().createQuery("SELECT c.instrument.id, COUNT(c) FROM Component AS c GROUP BY c.instrument.id");
List<Object[]> counts = q.getResultList();
for (Object[] elem : counts) {
// do something
// elem[0] is instrument ID
// elem[1] is count
}
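One detail worth spelling out: the grouped query only returns rows for instruments that have at least one component, so when the counts are merged back onto the instrument list, missing IDs should default to 0. A minimal, language-neutral sketch of that merge step in Python (the function name attach_counts and the sample IDs are made up):

```python
def attach_counts(instrument_ids, count_rows):
    """Map every instrument ID to its component count.

    count_rows is the result of the grouped query: (instrument_id, count)
    pairs. Instruments with no components are absent there, so default to 0.
    """
    counts = {instrument_id: n for instrument_id, n in count_rows}
    return {instrument_id: counts.get(instrument_id, 0)
            for instrument_id in instrument_ids}


if __name__ == "__main__":
    # Instrument 2 has no components, so the grouped query returns no row for it.
    print(attach_counts([1, 2, 3], [(1, 4), (3, 2)]))
```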
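I haven't checked this, but you can probably also do everything with one query by putting the second query as a subquery in the first one (the JPA specification only guarantees subqueries in WHERE and HAVING clauses, so this depends on the provider):
SELECT i,
(SELECT COUNT(c) FROM Component AS c WHERE c.instrument.id = i.id)
FROM Instrument AS i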
Similar to the first example, in the result list elem[0] would be an Instrument and elem[1] the count. It can be less efficient because the DB will have to execute the subquery for each instrument anyway, but it will still be quicker than your code, because it happens fully on the DB side (no round-trips to the DB for each counting query).

Is it better to use database polling or events for the following system?

I'm working on an ordering system that works exactly the way Netflix's service works (see end of this question if you're not familiar with Netflix). I have two approaches and I am unsure which approach is the right one; one relies on database polling and the other is event driven.
The following two approaches assume this simplified schema:
member(id, planId)
plan(id, moviesPerMonthLimit, moviesAtHomeLimit)
wishlist(memberId, movieId, rank, shippedOn, returnedOn)
Polling: I would run the following count queries in wishlist
Count movies shippedThisMonth (where shippedOn IS NOT NULL @memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL @memberId)
Count moviesInList (@memberId)
The following function will determine how many movies to ship:
moviesToShip = Min(moviesPerMonthLimit - shippedThisMonth, moviesAtHomeLimit - moviesAtHome, moviesInList)
I will loop through each member, run the counts, and loop through their list as many times as moviesToShip. Seems like a pain in the neck, but it works.
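The polling calculation above can be sketched as a small Python function. The parameter names follow the question; clamping the result at zero is an assumption added here (the original formula would go negative if a member is already over a limit).

```python
def movies_to_ship(movies_per_month_limit, movies_at_home_limit,
                   shipped_this_month, movies_at_home, movies_in_list):
    """How many movies to ship to one member right now."""
    remaining = min(
        movies_per_month_limit - shipped_this_month,  # monthly allowance left
        movies_at_home_limit - movies_at_home,        # free slots at home
        movies_in_list,                               # cannot ship more than listed
    )
    # Assumption: never return a negative number when a limit is exceeded.
    return max(remaining, 0)


if __name__ == "__main__":
    print(movies_to_ship(8, 3, shipped_this_month=5, movies_at_home=1,
                         movies_in_list=10))
```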
Event Driven: This approach involves adding an extra column "queuedForShipping" and setting it to 0 or 1 every time an event takes place. I will do the following counts:
Count movies shippedThisMonth (where shippedOn IS NOT NULL @memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL @memberId)
Count moviesQueuedForShipping (where queuedForShipping = 1, @memberId)
Instead of using min, I have to use the following if statements
If moviesPerMonthLimit > (shippedThisMonth + moviesQueuedForShipping)
AND IF moviesAtHomeLimit > (moviesAtHome + moviesQueuedForShipping)
If both conditions are true, I will select a row from wishlist where queuedForShipping = 0, and set its queuedForShipping to 1. I will run this function every time someone adds to, deletes from, or reorders their list. When it's time to ship, I would select @memberId where queuedForShipping = 1. I would also run this when updating shippedOn and returnedOn.
Approach one is simple. It also allows members to mess around with their ranks until someone decides to run the polling. That way what to ship is always decided by rank. But people keep telling me polling is bad.
The event driven approach is self-sustaining, but it seems like a waste of time to ping the database with all those counts every time a person changes their list. I would also have to write to the column queuedForShipping. It also means when a member re-ranks their list and they have pending shipments (shippedOn IS NULL, queuedForShipping = 1) I would have to update those rows and set queuedForShipping again based on the new ranks. (What if someone added 5 movies, and then suddenly went to change the order? Well, queuedForShipping would already be set to 1 on the first two movies he or she added.)
Can someone please give me their opinion on the best approach here and the cons/advantages of polling versus event driven?
Netflix is a monthly subscription service where you create a movie list, and your movies are shipped to you based on your service plan limits.
Based on what you described, there's no reason to keep the data "ready to use" (event) when you can create it very easily when needed (poll).
Reasons to cache it:
If you needed to display the next item to the user.
If the detailed data was being removed due to some retention policy.
If the polling queries were too slow.

DynamicQuery: How to select a column with linq query that takes parameters

We want to set up a directory of all the organizations working with us. They are incredibly diverse (government, embassy, private companies, and organizations depending on them ). So, I've resolved to create 2 tables. Table 1 will treat all the organizations equally, i.e. it'll collect all the basic information (name, address, phone number, etc.). Table 2 will establish the hierarchy among all the organizations. For instance, Program for illiterate adults depends on the National Institute for Social Security which depends on the Labor Ministry.
In the Hierarchy table, each column represents a level. So, for the example above, (i)Labor Ministry - Level1(column1), (ii)National Institute for Social Security - Level2(column2), (iii)Program for illiterate adults - Level3(column3).
To attach an organization to a hierarchy, the user needs to go level by level (i.e. column by column). So, there will be at least 3 situations:
If an adequate hierarchy exists for an organization(for instance, level1: US Embassy), that organization can be added (For instance, level2: USAID).--> US Embassy/USAID, and so on.
How about if one or more levels are missing? - then they need to be added
How about if the hierarchy needs to be modified? -- not everything needs to be modified.
I do not have any choice but to work by level (i.e. column by column). It does not make sense to have all the levels in one form, as the user needs to navigate hierarchies to find the right one to attach an organization to.
Let's say, I have those queries in my repository (just that you get the idea).
Query1
var orgHierarchy = (from orgH in db.Hierarchy
select orgH.Level1).FirstOrDefault();
Query2
var orgHierarchy = (from orgH in db.Hierarchy
select orgH.Level2).FirstOrDefault();
Query3, Query4, etc.
The above queries are the same except for the property queried (level1, level2, level3, etc.)
Question: Is there a general way of writing the above queries as one, so that the user can follow a hierarchy level by level to attach an organization?
In other words, not knowing in advance which column to query, I still need to be able to do so depending on some conditions. For instance, an organization X depends on Y. Knowing that Y is somewhere on the 3rd level, I'll go to the 4th level, linking X to Y.
I need to select (not manually) a column with only one query that takes parameters.
=======================
EDIT
As I just said to @Mark Byers, all I want is just to be able to query a column not knowing in advance which one. Check this out:
How about this
public Hierarchy GetHierarchy(string name)
{
// FirstOrDefault() turns the query into a single Hierarchy (or null)
var myHierarchy = (from hierarc in db.Hierarchy
where hierarc.Level1 == name
select hierarc).FirstOrDefault();
return myHierarchy;
}
Above, the query depends on name, which is a variable. It might be Planning Ministry, Embassy, Local Phone, etc.
Can I write the same query, but this time, instead of looking to match a value in the DB, make my query select a particular column?
var myVar = from orgH in db.Hierarchy
where (orgH.Level1 == "Government")
select orgH.where(level == myVariable);
return myVar;
I don't pretend that select orgH.where(level == myVariable) is even close to being valid. But that is what I want: to be able to select a column depending on a variable (i.e. the value is not known in advance, like with name).
Thanks for helping
How about using DynamicQueryable?
http://weblogs.asp.net/scottgu/archive/2008/01/07/dynamic-linq-part-1-using-the-linq-dynamic-query-library.aspx
Your database is not normalized, so you should start by changing the hierarchy table to, for example:
OrganizationId Parent
1 NULL
2 1
3 1
4 3
To query this you might need to use recursive queries. This is difficult (but not impossible) using LINQ, so you might instead prefer to create a parameterized stored procedure using a recursive CTE and put the query there.
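As a rough sketch of what such a recursive query looks like against the OrganizationId/Parent table above, here is a self-contained Python example using an in-memory SQLite database (chosen only so it runs anywhere; a SQL Server stored procedure would use a similar recursive CTE). The function names and the chain/depth aliases are made up for this example.

```python
import sqlite3


def demo_hierarchy():
    """Create the example OrganizationId/Parent table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Hierarchy (OrganizationId INTEGER PRIMARY KEY, Parent INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Hierarchy VALUES (?, ?)",
        [(1, None), (2, 1), (3, 1), (4, 3)],
    )
    return conn


def ancestors(conn, organization_id):
    """Return organization_id followed by its parents, up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent, depth) AS (
            SELECT OrganizationId, Parent, 0
            FROM Hierarchy WHERE OrganizationId = ?
            UNION ALL
            SELECT h.OrganizationId, h.Parent, chain.depth + 1
            FROM Hierarchy AS h JOIN chain ON h.OrganizationId = chain.parent
        )
        SELECT id FROM chain ORDER BY depth
        """,
        (organization_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(ancestors(demo_hierarchy(), 4))
```

The recursion stops at the root because its Parent is NULL and matches no row.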

Stumbleupon type query

Wow, makes your head spin!
I am about to start a project, and although my MySQL is OK, I can't get my head around what's required for this:
I have a table of web addresses.
id,url
1,http://www.url1.com
2,http://www.url2.com
3,http://www.url3.com
4,http://www.url4.com
I have a table of users.
id,name
1,fred bloggs
2,john bloggs
3,amy bloggs
I have a table of categories.
id,name
1,science
2,tech
3,adult
4,stackoverflow
I have a table of categories the user likes as numerical ref relating to the category unique ref. For example:
user,category
1,4
1,6
1,7
1,10
2,3
2,4
3,5
.
.
.
I have a table of scores relating to each website address. When a user visits one of these sites and says they like it, it's stored like so:
url_ref,category
4,2
4,3
4,6
4,2
4,3
5,2
5,3
.
.
.
So based on the above data, URL 4 would score (in its own right) as follows: 2=2 3=2 6=1
What I was hoping to do was pick out a random URL from over 2,000,000 records based on the current user's interests.
So if the logged-in user likes categories 1, 2 and 3, then I would like to ORDER BY a score generated from their interests.
For URL 4, if the logged-in user likes categories 2, 3 and 6, the total score would be 5. However, if the logged-in user only likes categories 2 and 6, the URL score would be 3. So the ORDER BY would be in the context of the logged-in user's interests.
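The scoring rule described above can be sketched as a single grouped query. The Python example below runs it against an in-memory SQLite database with the sample rows from the question; the table name url_likes and the function names are hypothetical, and this says nothing about performance on 2,000,000 rows.

```python
import sqlite3


def demo_likes():
    """Load the sample url_ref/category rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE url_likes (url_ref INTEGER, category INTEGER)")
    conn.executemany(
        "INSERT INTO url_likes VALUES (?, ?)",
        [(4, 2), (4, 3), (4, 6), (4, 2), (4, 3), (5, 2), (5, 3)],
    )
    return conn


def score_urls(conn, liked_categories):
    """Score each URL by how many of its likes fall in the user's categories."""
    liked = list(liked_categories)
    if not liked:
        return []  # an empty IN () list is not valid in MySQL
    placeholders = ", ".join("?" for _ in liked)
    sql = (
        "SELECT url_ref, COUNT(*) AS score FROM url_likes "
        f"WHERE category IN ({placeholders}) "
        "GROUP BY url_ref ORDER BY score DESC, url_ref"
    )
    return conn.execute(sql, liked).fetchall()


if __name__ == "__main__":
    conn = demo_likes()
    print(score_urls(conn, [2, 3, 6]))
    print(score_urls(conn, [2, 6]))
```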
Think of stumbleupon.
I was thinking of using a set of VIEWS to help with sub queries.
I'm guessing that all 2,000,000 records will need to be looked at and based on the id of the url it will look to see what scores it has based on each selected category of the current user.
So we need to know the user ID and this gets passed into the query as a constant from the start.
Ain't got a clue!
Chris Denman
What I was hoping to do was pick out a random URL from over 2,000,000 records based on the current user's interests.
This screams for predictive modeling, something you probably wouldn't be able to pull off in the database. Basically, you'd want to precalculate your score for a given interest (or more likely set of interests) / URL combination, and then query based on the precalculated values. You'd most likely be best off doing this in application code somewhere.
Since you're trying to guess whether a user will like or dislike a link based on what you know about them, Bayes seems like a good starting point (sorry for the wikipedia link, but without knowing your programming language this is probably the best place to start): Naive Bayes Classifier
edit
The basic idea here is that you continually run your precalculation process, and once you have enough data you can try to distill it to a simple formula that you can use in your query. As you collect more data, you continue to run the precalculation process and use the expanded results to refine your formula. This gets really interesting if you have the means to suggest a link and then find out whether the user liked it or not, as you can use this feedback loop to really improve the prediction algorithm (have a read on machine learning, particularly genetic algorithms, for more on this).
I did this in the end:
$dbh = new NewSys::mySqlAccess("xxxxxxxxxx","xxxxxxxxxx","xxxxxxxxx","localhost");
$icat{1}='animals pets';
$icat{2}='gadget addict';
$icat{3}='games online play';
$icat{4}='painting art';
$icat{5}='graphic designer design';
$icat{6}='philosophy';
$icat{7}='strange unusual bizarre';
$icat{8}='health fitness';
$icat{9}='photography photographer';
$icat{10}='reading books';
$icat{11}='humour humor comedy comedian funny';
$icat{12}='psychology psychologist';
$icat{13}='cartoons cartoonist';
$icat{14}='internet technology';
$icat{15}='science scientist';
$icat{16}='clothing fashion';
$icat{17}='movies movie latest';
$icat{18}="\"self improvement\"";
$icat{19}='drawing art';
$icat{20}='latest band member';
$icat{21}='shop prices';
$icat{22}='recipe recipes food';
$icat{23}='mythology';
$icat{24}='holiday resorts destinations';
$icat{25}="(rude words)";
$icat{26}="www website";
$dbh->Sql("DELETE FROM precalc WHERE member = '$fdat{cred_id}'");
$dbh->Sql("SELECT * FROM prefs WHERE member = '$fdat{cred_id}'");
@chos=();
while($dbh->FetchRow()){
$cat=$dbh->Data('category');
$cats{$cat}='#';
}
foreach $cat (keys %cats){
push @chos,"\'$cat\'";
push @strings,$icat{$cat};
}
$sqll=join("\,",@chos);
$words=join(" ",@strings);
$dbh->Sql("select users.id,users.url,IFNULL((select sum(scoretot.scr) from scoretot where scoretot.id = users.id and scoretot.category IN \($sqll\)),0) as score from users WHERE MATCH (description,lasttweet) AGAINST ('$words' IN BOOLEAN MODE) AND IFNULL((SELECT ref FROM visited WHERE member = '$fdat{cred_id}' AND user = users.id LIMIT 1),0) = 0 ORDER BY score DESC limit 30");
$cnt=0;
while($dbh->FetchRow()){
$id=$dbh->Data('id');
$url=$dbh->Data('url');
$score=$dbh->Data('score');
$dbh2->Sql("INSERT INTO precalc (member,user,url,score) VALUES ('$fdat{cred_id}','$id','$url','$score')");
$cnt++;
}
I came up with this answer about three months ago, and just cannot read it. So sorry, I can't explain how it finally worked, but it managed to query 2 million websites and choose one based on the history of a user's past votes on other sites.
Once I got it working, I moved on to another problem!
http://www.staggerupon.com is where it all happens!
Chris