Storing Hierarchical Data (MySQL) for Referral Marketing - mysql

I need to have a 5 levels hierarchy for the users registered to a website. Every user is invited by another, and I need to know all descendants for a user. And also ancestors for a user.
I have in mind 2 solution.
Keeping a table with relationships this way. A closure table:
ancestor_id descendant_id distance
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
2 3 1
Having this table for relationships. Keeping in a table 5 levels ancestors. A "ancestors" table:
user_id ancestor_level1_id ancestor_level2_id ancestor_level3_id ancestor_level4_id ancestor_level5_id
10 9 7 4 3 2
9 7 4 3 2 1
Are these good ideas?
I know about "the adjacency list model" and "the modified preorder tree traversal algorithm", but are these good solutions for a "referral" system?
The queries that I need to perform on this tree are:
frequently adding a new users
when a user buys something, their referrers get a percentage commission
every user should be able to find out how many people they've referred (and how many people were referred by people who they referred....) at each level

Closure Table
ancestor_id descendant_id distance
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
2 3 1
To add user 10, referred by user 3. (I don't think you need to lock the table between these two insertions):
insert into ancestor_table
select ancestor_id, 10, distance+1
from ancestor_table
where descendant_id=3;
insert into ancestor_table values (10,10,0);
To find all users referred by user 3.
select descendant_id from ancestor_table where ancestor_id=3;
To count those users by depth:
select distance, count(*) from ancestor_table where ancestor_id=3 group by distance;
To find the ancestors of user 10.
select ancestor_id, distance from ancestor_table where descendant_id=10;
The drawback to this method is amount of storage space this table will take.

Use the OQGRAPH storage engine.
You probably want to keep track of an arbitrary number of levels, rather than just 5 levels. Get one of the MySQL forks that supports the QGRAPH engine (such as MariaDB or OurDelta), and use that to store your tree. It implements the adjacency list model, but by using a special column called latch to send a command to the storage engine, telling it what kind of query to perform, you get all of the advantages of a closure table without needing to do the bookkeeping work each time someone registers for your site.
Here are the queries you'd use in OQGRAPH. See the documentation at
http://openquery.com/graph-computation-engine-documentation
We're going to use origid as the referrer, and destid as the referree.
To add user 11, referred by user 10
insert into ancestors_table (origid,destid) values (10,11)
To find all users referred by user 3.
SELECT linkid FROM ancestors_table WHERE latch = 2 AND origid = 3;
To find the ancestors of user 10.
SELECT linkid FROM ancestors_table WHERE latch = 2 AND destid = 10;
To find the number of users at each level, referred by user 3:
SELECT count(linkid), weight
FROM ancestors_table
WHERE latch = 2 AND origid = 3
GROUP BY weight;

Managing Hierarchical Data in MySQL
In general, I like the "nested set", esp. in MySQL which doesn't really have language support for hierarchical data.
It's fast, but you'll need to make sure your developers read that article if ease of maintenance is a big deal. It's very flexible - which doesn't seem to matter much in your case.
It seems a good fit for your problem - in the referral model, you need to find the tree of referrers, which is fast in the nested set model; you also need to know who are the ~children# of a given user, and the depth of their relationship; this is also fast.

Delimited String of Ancestors
If you're strongly considering the 5-level relationship table, it may simplify things to use a delimited string of ancestors instead of 5 separate columns.
user_id depth ancestors
10 7 9,7,4,3,2,1
9 6 7,4,3,2,1
...
2 2 1
1 1 (empty string)
Here are some SQL commands you'd use with this model:
To add user 11, referred by user 10
insert into ancestors_table (user_id, depth, ancestors)
select 11, depth+1, concat(10,',',ancestors)
from ancestors_table
where user_id=10;
To find all users referred by user 3. (Note that this query can't use an index.)
select user_id
from ancestors_table
where ancestors like '%,3,%' or ancestors like '3,%' or ancestors like '%,3';
To find the ancestors of user 10. You need to break up the string in your client program. In Ruby, the code would be ancestorscolumn.split(",").map{|x| x.to_i}. There's no good way to break up the string in SQL.
select ancestors from ancestors_table where user_id=10;
To find the number of users at each level, referred by user 3:
select
depth-(select depth from ancestors_table where user_id=3),
count(*)
from ancestors_table
where ancestors like '%,3,%' or ancestors like '3,%' or ancestors like '%,3'
group by depth;
You can avoid SQL injection attacks in the like '%,3,%' parts of these queries by using like concat('%,', ?, ',%') instead and binding the an integer for the user number to the placeholder.

Related

Access: finding the corresponding value of maximum value

I have a database in which I perform an audit on a set of required documents, for several locations of those documents.
So I have a table named Locations and a table named Documents, which are correlated through a 2 x 2 relationship.
Every document can have multiple versions. In my query, I want to see only the most recent version of each document, so the max(Id).
Now, every version can be 'audited' (checked) multiple times, for example 2 times each year. Each Audit/check is stored in a record, and I want to show only the most recent audit for each document, so Max(ID).
This is my Selection Query:
SELECT [~Locations].Location, [+DocuProperties].Category, [~Documents].[Document name], Max([DocuVersion].Id) AS MaxDocuID, Max([Audit].Id) AS MaxAuditID, [Audit].Conclusion
FROM ([~Documents] INNER JOIN ([~Locations] INNER JOIN ([+DocuLocation] INNER JOIN [+DocuProperties] ON [+DocuLocation].Id = [+DocuProperties].DocuLocation) ON [~Locations].Id = [+DocuLocation].Location) ON [~Locations].Id = [+DocuLocation].DocuName) INNER JOIN (DocuVersion INNER JOIN 2Audit ON [DocuVersion].Id = [Audit].DocuVersion) ON [+DocuProperties].Id = [DocuVersion].DocuLocation
GROUP BY [~Locations].Location, [+Docuproperties].Category, [~Documents].[Document name], [Audit].Conclusion
However: I do not wish to Group on Audit Conclusion, I wish to show the Audit conclusion that corresponds to the Max(Id) of that Audit.
So for every most recent Audit, I want to show the Conclusion. This conclusion I want to show for each Document, grouped byCategory and grouped byLocation.
I know I need to build a nested subquery of some form, but I just can't get any code to work.
I hope anybody can help.
The basic idea is like this:
Table 1
DocuProperties
Id Location Category
1 15 1
2 15 1
3 14 2
(every location can have multiple document properties a.k.a. objects)
Table2
DocuVersion
Id DocuProperty DocumentEndDate
1 1 01-01-2022
2 1 20-07-2023
3 2 31-07-2023 etc.
4 3 01-10-2023
(every DocuProperties can have multiple versions, I have to check If they are still valid, but also on some other criteria ).
Table 3
Audit
Id DocuVersion Conclusion
1 1 Not Valid
2 1 Not Valid
3 2 Valid
4 4 Valid
(every version can be audited multiple times. Every audit can have a different conclusion)
Which I would like to translate into the following:
LASTAudit (a.k.a. the most recent audit of the most recent version of the most recent property)
Location DocutPropertyId DocuVersionId AuditId Conclusion
15 2 2 2 Not Valid
14 3 4 4 Valid
The ID’s were easy to get right, as those were just Max(Id) functions. The problem was to get the Conclusion corresponding to that audit of that version of that object.

Storing csv in MySQL field – bad idea?

I have two tables, one user table and an items table. In the user table, there is the field "items". The "items" table only consists of a unique id and an item_name.
Now each user can have multiple items. I wanted to avoid creating a third table that would connect the items with the user but rather have a field in the user_table that stores the item ids connected to the user in a "csv" field.
So any given user would have a field "items" that could have a value like "32,3,98,56".
It maybe is worth mentioning that the maximum number of items per user is rather limited (<5).
The question: Is this approach generally a bad idea compared to having a third table that contains user->item pairs?
Wouldn't a third table create quite an overhead when you want to find all items of a user (I would have to iterate through all elements returned by MySQL individually).
You don't want to store the value in the comma separated form.
Consider the case when you decide to join this column with some other table.
Consider you have,
x items
1 1, 2, 3
1 1, 4
2 1
and you want to find distinct values for each x i.e.:
x items
1 1, 2, 3, 4
2 1
or may be want to check if it has 3 in it
or may be want to convert them into separate rows:
x items
1 1
1 2
1 3
1 1
1 4
2 1
It will be a HUGE PAIN.
Use atleast normalization 1st principle - have separate row for each value.
Now, say originally you had this as you table:
x item
1 1
1 2
1 3
1 1
1 4
2 1
You can easily convert it into csv values:
select x, group_concat(item order by item) items
from t
group by x
If you want to search if x = 1 has item 3. Easy.
select * from t where x = 1 and item = 3
which in earlier case would use horrible find_in_set:
select * from t where x = 1 and find_in_set(3, items);
If you think you can use like with CSV values to search, then first like %x% can't use indexes. Second, it will produce wrong results.
Say you want check if item ab is present and you do %ab% it will return rows with abc abcd abcde .... .
If you have many users and items, then I'd suggest create separate table users with an PK userid, another items with PK itemid and lastly a mapping table user_item having userid, itemid columns.
If you know you'll just need to store and retrieve these values and not do any operation on it such as join, search, distinct, conversion to separate rows etc. etc. - may be just may be, you can (I still wouldn't).
Storing complex data directly in a relational database is a nonstandard use of a relational database. Normally they are designed for normalized data.
There are extensions which vary according to the brand of software which may help. Or you can normalize your CSV file into properly designed table(s). It depends on lots of things. Talk to your enterprise data architect in this case.
Whether it's a bad idea depends on your business needs. I can't assess your business needs from way out here on the internet. Talk to your product manager in this case.

MySQL- Counting rows VS Setting up a counter

I have 2 tables posts<id, user_id, text, votes_counter, created> and votes<id, post_id, user_id, vote>. Here the table vote can be either 1 (upvote) or -1(downvote). Now if I need to fetch the total votes(upvotes - downvotes) on a post, I can do it in 2 ways.
Use count(*) to count the number of upvotes and downvotes on that post from votes table and then do the maths.
Set up a counter column votes_counter and increment or decrement it everytime a user upvotes or downvotes. Then simply extract that votes_counter.
My question is which one is better and under what condition. By saying condition, I mean factors like scalability, peaktime et cetera.
To what I know, if I use method 1, for a table with millions of rows, count(*) could be a heavy operation. To avoid that situation, if I use a counter then during peak time, the votes_counter column might get deadlocked, too many users trying to update the counter!
Is there a third way better than both and as simple to implement?
The two approaches represent a common tradeoff between complexity of implementation and speed.
The first approach is very simple to implement, because it does not require you to do any additional coding.
The second approach is potentially a lot faster, especially when you need to count a small percentage of items in a large table
The first approach can be sped up by well designed indexes. Rather than searching through the whole table, your RDBMS could retrieve a few records from the index, and do the counts using them
The second approach can become very complex very quickly:
You need to consider what happens to the counts when a user gets deleted
You should consider what happens when the table of votes is manipulated by tools outside your program. For example, merging records from two databases may prove a lot more complex when the current counts are stored along with the individual ones.
I would start with the first approach, and see how it performs. Then I would try optimizing it with indexing. Finally, I would consider going with the second approach, possibly writing triggers to update counts automatically.
As this sounds a lot like StackExchange, I'll refer you to this answer on the meta about the database schema used on the site. The votes table looks like this:
Votes table:
Id
PostId
VoteTypeId, one of the following values:
1 - AcceptedByOriginator
2 - UpMod
3 - DownMod
4 - Offensive
5 - Favorite (if VoteTypeId = 5, UserId will be populated)
6 - Close
7 - Reopen
8 - BountyStart (if VoteTypeId = 8, UserId will be populated)
9 - BountyClose
10 - Deletion
11 - Undeletion
12 - Spam
15 - ModeratorReview
16 - ApproveEditSuggestion
UserId (only present if VoteTypeId is 5 or 8)
CreationDate
BountyAmount (only present if VoteTypeId is 8 or 9)
And so based on that it sounds like the way it would be run is:
SELECT VoteTypeId FROM Votes WHERE VoteTypeId = 2 OR VoteTypeId = 3
And then based on the value, do the maths:
int score = 0;
for each vote in voteQueryResults
if(vote == 2) score++;
if(vote == 3) score--;
Even with millions of results, this is probably going to be a very fast operation as it's so simple.

How to request lists that contain certain items in MySQL

In the application I am developing, the user has to set parameters to define the end product he will get.
My tables look like this :
Categories
-------------
Id Name
1 Material
2 Color
3 Shape
Parameters
-------------
Id CategoryId Name
1 1 Wood
2 1 Plastic
3 1 Metal
4 2 Red
5 2 Green
6 2 Blue
7 3 Round
8 3 Square
9 3 Triangle
Combinations
-------------
Id
1
2
...
ParametersCombinations
----------------------
CombinationId ParameterId
1 1
1 4
1 7
2 1
2 5
2 7
Now only some combinations of parameters are available to the user. In my example, he could get a red round wooden thingy or a green round wooden thingy but not a blue one because I can't produce it.
Let's say the user selected wood and round parameters. How do I make a request to know that there's only red and green available so I can disable the blue option for him ?
Or is there some better way to model my database ?
Let us assume you provide the selected parameters id in the following format
// I call this a **parameterList** for convenience sake.
(1,7) // this is parameter id 1 and id 7.
I am also assuming you are using some scripting language to help you with your app. Like ruby or php.
I am also assuming you want to avoid putting as much logic into your stored procedure or MySQL queries as much as possible.
Another assumption is that you are using one of the Rapid Application MVC Frameworks like Rails, Symfony or CakePHP.
Your logic would be:
Find all the combinations that contain ALL the parameters in your parameterList and put these found combinations in a list called relevantCombinations
Find all the parameters_combinations that contain at least 1 of the combinations in the list relevantCombinations. Retrieve only the unique parameter values.
First two steps can be solved using simple Model::find methods and a forloop in the frameworks I described above.
If you are not using frameworks, it is also cool to use the scripting language raw.
If you require them in MySQL queries, here are some possible queries. Be aware that these are not necessary the best queries.
First one is
SELECT * FROM (
SELECT `PossibleList`.`CombinationId`, COUNT(`PossibleList`.`CombinationId`) as number
FROM (
SELECT `CombinationId` FROM `ParametersCombinations`
WHERE `ParameterId` IN (1, 7)
) `PossibleList` GROUP BY `PossibleList`.`CombinationId`
) `PossibleGroupedList` WHERE `number` = 2;
-- note that the (1, 7) and the number 2 needs to be supplied by your app.
-- 2 refers to the number of parameters supplied.
-- In this case you supplied 1 and 7 therefore 2.
To confirm, look at http://sqlfiddle.com/#!2/16831/3.
Note how I purposely have a Combination 3 which only has the Parameter 1 but not 7. Therefore the query did not give you back 3, but only 1 and 2. Feel free to tweak the asterisk * in the first line.
Second one is
SELECT DISTINCT(`ParameterID`)
FROM `ParametersCombinations`
WHERE `CombinationId` IN (1, 2);
-- note that (1, 2) is the result we expect from the first step.
-- the one we call relevantCombinations
To confirm, look at http://sqlfiddle.com/#!2/16831/5
I do not recommend being a masochist and attempt to get your answer in a single query.
I also do NOT recommend using the MySQL queries I have supplied. It is less masochistic. But sufficiently masochistic for me NOT to recommend this way.
Since you did not indicate any tag other than mysql, I suspect that you are stronger with mysql. Hence my answer contains mysql.
My strongest suggestion would be my first. Make full use of established frameworks and put your logic in the business logic layer. Not in the data layer. Even if you don't use frameworks and just use raw php and ruby, that is still a better place for you to place your logic in than MySQL.
I saw that T gave an answer in a single MySQL query but I can tell you that (s)he considers only 1 parameter.
See this part:
WHERE ParameterId = 7 -- 7 is the selected parameter
You can adapt his/her answer with some trickery using a forloop and appending OR clauses.
Again, I do NOT recommend that in the big picture of building an app.
I have also tested his/her answer with http://sqlfiddle.com/#!2/2eda4/2. There may be 1 or 2 small bugs.
In summary, my recommendations in descending order of strength:
Use a framework like Rails or CakePHP and the pseudocode step 1 and 2 and as many find as you need. (STRONGEST)
Use raw scripting language and the pseudocode step 1 and 2 and as many simple queries as you need.
Use the raw MySQL queries I created. (LEAST STRONG)
P.S. I left out the part in my queries as to how to get the name of the Parameters. But given that you can get the ParameterIDs from my answer, I think that is trivial. I have also left out how you may need to remove the already selected parameters (1, 7). Again, that should be trivial to you.
Try the following
SELECT p.*, pc.CombinationId
FROM Parameters p
-- get the parameter combinations for all the parameters
JOIN ParametersCombinations pc
ON pc.ParameterId = p.Id
-- filter the parameter combinations to only combinations that include the selected parameter
JOIN (
SELECT CombinationId
FROM ParametersCombinations
WHERE ParameterId = 7 -- 7 is the selected parameter
) f ON f.CombinationId = pc.CombinationId
Or removing the already selected parameters
SELECT p.*, pc.CombinationId
FROM Parameters p
JOIN ParametersCombinations pc
ON pc.ParameterId = p.Id
JOIN (
SELECT CombinationId
FROM ParametersCombinations
WHERE ParameterId IN (7, 1)
) f ON f.CombinationId = pc.CombinationId
WHERE ParameterId NOT IN (7, 1)

compatibility matrix

How can I model a compatibility matrix database in sql server 2008;
i.e. I have 4 groups of design. If I select group 1 and group 4, my choices will influence group 2 and 3 and will make them unselectable because they are not compatible with my choices.
I'm a beginner, so my explanation might seems watery, pls forgive!
Any kind of "matrix" is essentially an M:N relationship, only in this case both "axis" of the matrix represent the same thing ("groups of design"). The M:N relationship of a table with itself can be modeled like this:
Assuming relationships are bidirectional, you should also add: CHECK(GROUP_ID1 < GROUP_ID2) in the COMPATIBILITY table. This would, for example, allow (1, 4), but prevent (4, 1) from being in the table.
The example from your question would be represented by the following data in the database:
GROUP:
GROUP_ID
--------
1
2
3
4
COMPATIBILITY:
GROUP_ID1 GROUP_ID2
--------- ---------
1 4
When user selects a group X, you'd run the following query to find which groups it is compatible with. The remaining groups are in-compatible.
SELECT GROUP_ID1
FROM COMPATIBILITY
WHERE GROUP_ID2 = :X
UNION ALL
SELECT GROUP_ID2
FROM COMPATIBILITY
WHERE GROUP_ID1 = :X
For 1, this would return 4 and for 4 it would return 1. In either of these cases, 2 and 3 would not be returned - a sign they are incompatible.
On the other hand, if you want do disable 2 and 3 when 1 and 4 are selected, but not when just 1 or just 4 is selected, this is a different problem that would be more complicated to model in the relational paradigm. Let me know if that's what you actually need...