Use a transaction database to calculate the probability of an item appearing in a future transaction, using R or SQL (MySQL)

I have a database of transactions like the table below:
user_id  order_id  order_number  product_name            n
<int>    <int>     <int>         <fctr>                  <int>
1        11878590  3             Pistachios              1
1        11878590  3             Soda                    1
1        12878790  4             Yogurt                  1
1        12878790  4             Cheddar Popcorn         1
1        12878790  4             Cinnamon Toast Crunch   1
2        12878791  11            Milk Chocolate Almonds  1
2        12878791  11            Half & Half             1
2        12878791  11            String Cheese           1
11       12878792  19            Whole Milk              1
11       12878792  19            Pistachios              1
11       12878792  19            Soda                    1
11       12878792  19            Paper Towel Rolls       1
The table has multiple users who each have multiple transactions. Some users only have 3 transactions, other users have 15, etc. This is all in one table.
I'm trying to calculate a transition matrix for a Markov model. I want to find the probability that an item will be in a new basket given that it was present in the previous basket of transactions.
I want my final table to look something like this:
user_id  product_name  probability_present  probability_absent
1        Soda          .5                   .5
1        Pistachios    .5                   .5
I'm having trouble figuring out how to get the data into a form so that I can calculate the probabilities and specifically coming up with a way to compare all of the t,t-1 combinations.
I have code that I've written to get things into this form, but I'm stuck at this point. I've written my code using the dplyr R package, but I could translate something in SQL into the R code. I can post my code in R if it will be helpful, but it is pretty simple at this point as I just had to do a few joins to get the table into this shape.
What else do I have to do to get the table/values that I'm trying to calculate?

This seems to give you the desired probabilities:
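-- Share of the user's orders that contain each product
SELECT t.user_id,
       t.product_name,
       COUNT(DISTINCT t.order_number) / u.n_orders AS prob_present,
       1 - COUNT(DISTINCT t.order_number) / u.n_orders AS prob_absent
FROM tbl t
JOIN (SELECT user_id, COUNT(DISTINCT order_number) AS n_orders
      FROM tbl
      GROUP BY user_id) u ON u.user_id = t.user_id
WHERE t.user_id = 1
GROUP BY t.user_id, t.product_name, u.n_orders;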
Or at least it gives you the numbers you have. If this is not right, please provide a slightly more complex example dataset.
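For the t, t-1 comparison the question asks about, here is a rough Python sketch (not the asker's dplyr code). It assumes that order_number gives the order of a user's baskets, and the function name transition_probabilities is made up for illustration. For each user it pairs every basket with the next one and counts how often a product that was in the earlier basket shows up again.

```python
from collections import defaultdict

# (user_id, order_number, product_name) rows from the sample table
rows = [
    (1, 3, "Pistachios"), (1, 3, "Soda"),
    (1, 4, "Yogurt"), (1, 4, "Cheddar Popcorn"), (1, 4, "Cinnamon Toast Crunch"),
    (2, 11, "Milk Chocolate Almonds"), (2, 11, "Half & Half"), (2, 11, "String Cheese"),
    (11, 19, "Whole Milk"), (11, 19, "Pistachios"), (11, 19, "Soda"),
    (11, 19, "Paper Towel Rolls"),
]

def transition_probabilities(rows):
    """Return {(user_id, product): (p_present, p_absent)} for the next basket,
    given that the product was in the previous basket."""
    baskets = defaultdict(lambda: defaultdict(set))
    for user_id, order_number, product in rows:
        baskets[user_id][order_number].add(product)

    result = {}
    for user_id, orders in baskets.items():
        # Assumption: sorting by order_number puts the baskets in time order.
        ordered = [orders[number] for number in sorted(orders)]
        seen = defaultdict(int)      # times the product was in basket t-1
        repeated = defaultdict(int)  # ...and also in basket t
        for previous, current in zip(ordered, ordered[1:]):
            for product in previous:
                seen[product] += 1
                if product in current:
                    repeated[product] += 1
        for product, n in seen.items():
            p_present = repeated[product] / n
            result[(user_id, product)] = (p_present, 1 - p_present)
    return result

probs = transition_probabilities(rows)
```

Users with a single order have no (t-1, t) pair, so they produce no rows. This only covers the "present at t-1" row of the transition matrix; the "absent at t-1" row would need the same counting over products missing from the earlier basket.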

Edit product selling location using mysql

I'm building an e-commerce platform (PHP + MySQL) and I want to add an attribute (feature) to products: the ability to enable or disable the selling status for a specific city.
Here are simplified tables:
cities
id name
==========
1 Roma
2 Berlin
3 Paris
4 London
products
id name cities
==================
1 TV 1,2,4
2 Phone 1,3,4
3 Book 1,2,3,4
4 Guitar 3
In this simple example it is easy to query (using FIND_IN_SET or LIKE) to check the availability of a product for a specific city.
This is OK for the 4 cities in this example, or even 100 cities, but will it be practical for a large number of cities and a very large number of products?
For better "performance" or better database design, should I add another table to JOIN in the query (productid, cityid, status)?
availability
id productid cityid status
=============================
1 1 1 1
2 1 2 1
3 1 4 1
4 2 1 1
5 2 3 1
6 2 4 1
7 3 1 1
8 3 2 1
9 3 3 1
10 3 4 1
11 4 3 1
"For better "performance" or better database design should I add another table"
Yes, you should definitely create another table to hold that information, like the one you posted, rather than storing it in a comma-separated list, which goes against normalization. Also, there is no way to get good performance from the comma-separated column when you try to JOIN and find out which products are available in which cities.
If at any point you want to get back a comma-separated list of values like 1,2,4, you can GROUP BY productid and use GROUP_CONCAT(cityid) to get the same.
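To make the junction-table idea concrete, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (SQLite also has GROUP_CONCAT, though without MySQL's ORDER BY option inside it). The composite primary key in place of the separate id column, and the helper names products_in_city and city_list, are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
-- One row per (product, city) pair instead of a comma-separated column.
CREATE TABLE availability (
    productid INTEGER REFERENCES products(id),
    cityid    INTEGER REFERENCES cities(id),
    status    INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (productid, cityid)
);
INSERT INTO cities VALUES (1,'Roma'),(2,'Berlin'),(3,'Paris'),(4,'London');
INSERT INTO products VALUES (1,'TV'),(2,'Phone'),(3,'Book'),(4,'Guitar');
INSERT INTO availability (productid, cityid) VALUES
    (1,1),(1,2),(1,4),(2,1),(2,3),(2,4),(3,1),(3,2),(3,3),(3,4),(4,3);
""")

def products_in_city(city_name):
    """Names of products currently sold in the given city."""
    rows = conn.execute(
        """SELECT p.name
           FROM products p
           JOIN availability a ON a.productid = p.id
           JOIN cities c ON c.id = a.cityid
           WHERE c.name = ? AND a.status = 1
           ORDER BY p.id""",
        (city_name,),
    ).fetchall()
    return [name for (name,) in rows]

def city_list(product_id):
    """Rebuild the old comma-separated city list for one product."""
    (csv,) = conn.execute(
        """SELECT GROUP_CONCAT(cityid)
           FROM availability
           WHERE productid = ? AND status = 1""",
        (product_id,),
    ).fetchone()
    return csv

berlin = products_in_city("Berlin")
tv_cities = city_list(1)
```

The lookups on cityid and productid are plain equality joins, which an index can serve, unlike FIND_IN_SET over a text column.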

how to compare the current record to the next in the same table in terms of 'datetime' in MySQL?

Problem: I'm going to explain this problem using the Sakila sample database and its data so it is easier for you. My question is: how can I compare the current record to the next one in the same table in terms of 'datetime'? This is what the table looks like:
payment_id  customer_id  staff_id  rental_id  amount  payment_date         last_update
1           1            1         76         2.99    25/05/2005 11:30:37  15/02/2006 22:12:30
2           1            1         573        0.99    28/05/2005 10:35:23  15/02/2006 22:12:30
3           1            1         1185       5.99    15/06/2005 00:54:12  15/02/2006 22:12:30
4           1            2         1422       0.99    15/06/2005 18:02:53  15/02/2006 22:12:30
5           1            2         1476       9.99    15/06/2005 21:08:46  15/02/2006 22:12:30
Using this sample, for each 'staff_id', how can I compare the current row with the next one (ordered by 'payment_date'), so that the query returns only the pairs of records where the amount of the current record equals the amount of the next (something like current.amount = next.amount)? Each record should be compared to the next one with the same 'staff_id', and so on.
I'm currently using this query, which does the job, but it takes forever. I know it works because I set LIMIT 3 and it returned the correct rows (you can test it as well if you have the Sakila sample database):
SELECT * FROM payment a
JOIN payment b ON a.staff_id = b.staff_id AND a.payment_date > b.payment_date AND a.amount = b.amount
LEFT JOIN payment c ON a.staff_id = c.staff_id AND c.payment_date < a.payment_date AND c.payment_date > b.payment_date
WHERE c.payment_id IS NULL
LIMIT 3;
Could you please help me?
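One way to picture the "current vs. next" comparison is a sorted pass outside the database. This is only a sketch in Python, not a tuned MySQL answer. It assumes the dates are strings in the dd/mm/yyyy format shown above, and same_amount_pairs is a hypothetical helper name. Sorting by (staff_id, payment_date) once and comparing neighbours avoids the double self-join.

```python
from datetime import datetime
from decimal import Decimal
from itertools import groupby
from operator import itemgetter

# (payment_id, staff_id, amount, payment_date) taken from the sample rows
payments = [
    (1, 1, "2.99", "25/05/2005 11:30:37"),
    (2, 1, "0.99", "28/05/2005 10:35:23"),
    (3, 1, "5.99", "15/06/2005 00:54:12"),
    (4, 2, "0.99", "15/06/2005 18:02:53"),
    (5, 2, "9.99", "15/06/2005 21:08:46"),
]

def parse_date(text):
    # Assumption: dates are formatted as day/month/year hour:minute:second.
    return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")

def same_amount_pairs(payments):
    """Return (current_id, next_id) for consecutive payments of the same
    staff member that have equal amounts."""
    ordered = sorted(payments, key=lambda p: (p[1], parse_date(p[3])))
    pairs = []
    for _, group in groupby(ordered, key=itemgetter(1)):
        group = list(group)
        for current, following in zip(group, group[1:]):
            # Decimal avoids float rounding when comparing money amounts.
            if Decimal(current[2]) == Decimal(following[2]):
                pairs.append((current[0], following[0]))
    return pairs

pairs = same_amount_pairs(payments)
```

With the five sample rows no two consecutive payments of the same staff member share an amount, so pairs comes back empty. Inside the database, MySQL 8.0 and later have the LEAD() window function, which may express the same neighbour comparison in one pass; I have not benchmarked it against the join above.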

Compare rows and get percentage

I found it hard to find a fitting title. For simplicity let's say I have the following table:
cook_id cook_rating
1 2
1 1
1 3
1 4
1 2
1 2
1 1
1 3
1 5
1 4
2 5
2 2
Now I would like to get an output of 'good' cooks. A good cook is someone for whom at least 70% of the ratings are 1, 2 or 3 (rather than 4 or 5).
So in my example table, the cook with id 1 has a total of 10 ratings, 7 of which have type 1, 2 and 3. Only three have type 4 or 5. Therefore the cook with id 1 would be a 'good' cook, and the output should be the cook's id with the number of good ratings.
cook_id cook_rating
1 7
The cook with id 2, however, doesn't satisfy my condition, therefore should not be listed at all.
select cook_id, count(cook_rating) - sum(case when cook_rating = 4 OR cook_rating = 5 then 1 else 0 end) as numberOfGoodRatings from cook
where cook_rating in (1,2,3,4,5)
group by cook_id
order by numberOfGoodRatings desc
However, this doesn't take into account the fact that there might be more 4 or 5 than good ratings, resulting in negative outputs. Plus, the requirement of at least 70% is not included.
You can get this with a comparison in your HAVING clause. If you must have just the two columns in the result set, this can be wrapped in a sub-select: select cook_id, positive_ratings FROM (...)
SELECT
    cook_id,
    SUM(cook_rating < 4) AS positive_ratings,
    COUNT(*) AS total_ratings
FROM cook
GROUP BY cook_id
HAVING (positive_ratings / total_ratings) >= 0.70
ORDER BY positive_ratings DESC
Edit: Note that SUM(cook_rating < 4) is intended to count only the rows where the rating is less than 4. The MySQL documentation says that COUNT counts non-NULL values, and in MySQL a false comparison evaluates to 0 rather than NULL, so COUNT(cook_rating < 4) would count every rated row; summing the 0/1 result avoids that. An equivalent alternative is COUNT(IF(cook_rating < 4, 1, NULL)).
I suggest you change your schema a little to make this kind of query trivial.
Suppose you add 5 columns to your cook table that simply count the number of ratings of each value:
nb_ratings_1 nb_ratings_2 nb_ratings_3 nb_ratings_4 nb_ratings_5
Updating such a table when a new rating is entered in the DB is trivial, as is recomputing those numbers if the redundancy makes you nervous. And it makes all filtering and sorting fast and easy.
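For anyone who wants to check the 70% rule against the sample ratings, here is a quick sketch using Python's sqlite3 module as a stand-in for MySQL. The HAVING condition is written with integers (positive * 10 >= total * 7), which is my own choice to sidestep SQLite's integer division; in MySQL the plain ratio works too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cook (cook_id INTEGER, cook_rating INTEGER)")

# Ratings from the question: ten for cook 1, two for cook 2.
ratings = [(1, r) for r in (2, 1, 3, 4, 2, 2, 1, 3, 5, 4)] + [(2, 5), (2, 2)]
conn.executemany("INSERT INTO cook VALUES (?, ?)", ratings)

# SUM adds up the 0/1 result of the comparison, so it counts ratings below 4.
good_cooks = conn.execute("""
    SELECT cook_id, SUM(cook_rating < 4) AS positive_ratings
    FROM cook
    GROUP BY cook_id
    HAVING SUM(cook_rating < 4) * 10 >= COUNT(*) * 7
    ORDER BY positive_ratings DESC
""").fetchall()

# COUNT over the same comparison counts every non-NULL result, true or false.
count_based = conn.execute("""
    SELECT cook_id, COUNT(cook_rating < 4)
    FROM cook
    GROUP BY cook_id
    ORDER BY cook_id
""").fetchall()
```

Here good_cooks contains only cook 1 with 7 positive ratings, while count_based shows the total number of ratings per cook, which is why SUM (or COUNT over an IF that yields NULL) is the safer way to write the condition.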

MySQL: Matching inexact values using "ON"

I'm way out of my league here...
I have a mapping table (table1) to assign particular values (value) to a whole number (map_nu). My second table (table2), is a collection of averages (avg) for each user (user_id).
(I couldn't figure out how to properly make a markdown table, please feel free to edit!)
table1:
value   map_nu
-----   ------
1       1
1.045   2
1.09    3
1.135   4
1.18    5
1.225   6
1.27    7
1.315   8
1.36    9
1.405   10
table2:
user_id  avg
-------  -------
1        1.111
2        1.2
3        1.33333
4        1
5        1.389
6        1.42
7        1.07
The value map_nu is a special number that each user gets assigned according to their average. I need to find a way to match the averages from table2 to the closest value in table1. I only need to match to the second digit past the decimal point, so I've added the TRUNCATE function:
SELECT table2.user_id, map_nu
FROM `table1`
JOIN table2 ON TRUNCATE(table1.value,2)=TRUNCATE(table2.avg,2)
I still miss the values that don't match the averages exactly. Is there a way to pick the nearest truncated value, or even to round to the second decimal? Rounding up or down won't matter as long as it's applied to all values the same way.
I am trying to have the following result (if rounded up):
(user_id)(Map_nu)
----
1 4
2 6
3 6
4 1
5 10
6 11
7 3
Thanks!
I think you might have to do this in two separate queries. There is no 'nearest' operator in SQL, so you can either calculate it in your software, or you could use
select map_nu from table1 ORDER BY abs(value - $avg) LIMIT 1
inside a loop. However, that cannot be used as a join condition, since ORDER BY and LIMIT are not valid there.
Another way of looking at it: your map_nu and value appear to be deterministic in relation to each other, value = 1 + ((map_nu - 1) * 0.045), so maybe you could make use of that fact and calculate an integer from that equation, assuming the relationship holds for all values of map_nu.
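Both ideas can be sketched in application code. The Python below is only an illustration: nearest_by_scan mirrors the ORDER BY abs(value - avg) LIMIT 1 loop, and nearest_by_formula inverts value = 1 + ((map_nu - 1) * 0.045). The function names, and the clamp to the range 1..10 (the rows of table1 shown in the question), are assumptions of mine.

```python
# table1 from the question: (value, map_nu)
table1 = [(1, 1), (1.045, 2), (1.09, 3), (1.135, 4), (1.18, 5),
          (1.225, 6), (1.27, 7), (1.315, 8), (1.36, 9), (1.405, 10)]
# table2 from the question: (user_id, avg)
table2 = [(1, 1.111), (2, 1.2), (3, 1.33333), (4, 1),
          (5, 1.389), (6, 1.42), (7, 1.07)]

def nearest_by_scan(avg):
    """Pick the map_nu whose value is closest to avg (first one wins a tie)."""
    _value, map_nu = min(table1, key=lambda row: abs(row[0] - avg))
    return map_nu

def nearest_by_formula(avg, step=0.045, lowest=1, highest=10):
    """Invert value = 1 + (map_nu - 1) * step and round to the nearest step.
    Assumption: the linear relationship holds and map_nu stays in 1..10."""
    map_nu = round((avg - 1) / step) + 1
    return max(lowest, min(highest, map_nu))

by_scan = {user_id: nearest_by_scan(avg) for user_id, avg in table2}
by_formula = {user_id: nearest_by_formula(avg) for user_id, avg in table2}
```

On the sample data the two approaches agree. Swapping round for math.ceil would give "round up to the next value" instead of "nearest", if that is the rule you settle on.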
This is an awkward database design. What is the data representing and what are you trying to solve? There might be a better way.
Maybe do something like...
SELECT a.user_id, b.map_nu, abs(a.avg - b.value)
FROM
table2 a
join table1 b
left join table1 c on abs(a.avg - b.value) > abs(a.avg - c.value)
where c.value is null
order by a.user_id
This doesn't actually produce the output you were expecting (it doesn't do any rounding), though you should be able to tweak it from there. The query above produces the output below with the data you've provided:
user_id map_nu abs(a.avg - b.value)
------- ------ --------------------
1 3 0.0209999999999999
2 5 0.02
3 8 0.01833
4 1 0
5 10 0.016
6 10 0.0149999999999999
7 3 0.02
Beware, though, if you're dealing with large tables. Check the EXPLAIN output of the query above to see whether it is practical to run within MySQL or better done outside it.
Note 2: It will produce duplicate rows if an avg value is equidistant from two value entries in table1 (e.g. if the values for map_nu 11 and 12 are 2 and 3 and someone gets an avg of 2.5). Your question doesn't specify what to do in that case, so you might want to take that into account.
It's taking a little extra work, but I figure the easiest way to get my results will be to map all values to the second decimal place in table1:
1 1
1.01 1
1.02 1
1.03 1
1.04 1
1.05 2
1.06 2
1.07 2
1.08 2
1.09 3
1.1 3
1.11 3
1.12 3
1.13 3
1.14 4
...
Thanks for the suggestions! Sorry I couldn't present the question more clearly.

Getting count of rows with multiple combinations

I need help with a MySQL query. We have a database (~10K rows) which I have simplified down to this problem.
We have 7 truck drivers who visit 3 out of a possible 9 locations, daily. Each day they visit exactly 3 different locations and each day they can visit different locations than the previous day. Here are representative tables:
Table: Drivers
id name
10 Abe
11 Bob
12 Cal
13 Deb
14 Eve
15 Fab
16 Guy
Table: Locations
id day address driver.id
1 1 Oak 10
2 1 Elm 10
3 1 4th 10
4 1 Oak 16
5 1 4th 16
6 1 Toy 16
7 1 Toy 11
8 1 5th 11
9 1 Law 11
10 2 Oak 11
11 2 4th 11
12 2 Toy 11
.........
We have data for a full year and we need to find out how many times each "route" is visited over a year, sorted from most to least.
From my high school math, I believe there are 9!/(6!3!) route combinations, or 84 in total. I want to do something like:
Get count of routes where route addresses = 'Oak' and 'Elm' and '4th'
then run again
where route addresses = 'Oak' and 'Elm' and '5th'
then again and again, etc. Then sort the route counts, descending. But I don't want to do it 84 times. Is there a way to do this?
I'd be looking at GROUP_CONCAT
SELECT t.day
     , t.driver
     , GROUP_CONCAT(t.address ORDER BY t.address)
  FROM mytable t
 GROUP
    BY t.day
     , t.driver
What's not clear here is whether there's an order to the stops on the route. Does the sequence make a difference, and how do we tell what the sequence is? To ask that a different way, consider these two routes:
('Oak','Elm','4th') and ('Elm','4th','Oak')
Are these equivalent (because it's the same set of stops) or are they different (because they are in a different sequence)?
If sequence of stops on the route distinguishes it from other routes with the same stops (in a different order), then replace the ORDER BY t.address with ORDER BY t.id or whatever expression gives the sequence of the stops.
Some caveats with GROUP_CONCAT: the maximum length is limited by the group_concat_max_len and max_allowed_packet variables. Also, a comma is used as the separator, so if we combine strings that contain commas, we can't reliably distinguish between 'a,b'+'c' and 'a'+'b,c' in the result.
We can use that query as an inline view, and get a count of the number of rows with identical routes:
SELECT c.route
     , COUNT(*) AS cnt
  FROM ( SELECT t.day
              , t.driver
              , GROUP_CONCAT(t.address ORDER BY t.address) AS route
           FROM mytable t
          GROUP
             BY t.day
              , t.driver
       ) c
 GROUP
    BY c.route
 ORDER
    BY cnt DESC
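The same grouping can be sketched outside the database to see what the inline view produces. The Python below is only an illustration and assumes the order of stops does not matter, so each (day, driver) route is reduced to a sorted tuple of addresses; route_counts is a hypothetical helper name.

```python
from collections import Counter
from math import comb

# (day, driver_id, address) rows from the sample Locations table
visits = [
    (1, 10, "Oak"), (1, 10, "Elm"), (1, 10, "4th"),
    (1, 16, "Oak"), (1, 16, "4th"), (1, 16, "Toy"),
    (1, 11, "Toy"), (1, 11, "5th"), (1, 11, "Law"),
    (2, 11, "Oak"), (2, 11, "4th"), (2, 11, "Toy"),
]

def route_counts(visits):
    """Count how often each set of stops was driven, one route per (day, driver)."""
    stops = {}
    for day, driver_id, address in visits:
        stops.setdefault((day, driver_id), []).append(address)
    # Sorting the addresses plays the role of GROUP_CONCAT(... ORDER BY t.address).
    return Counter(tuple(sorted(addresses)) for addresses in stops.values())

counts = route_counts(visits)
ranked = counts.most_common()   # routes sorted from most to least visited
possible_routes = comb(9, 3)    # 9!/(6!3!) from the question
```

In the sample rows the Oak/4th/Toy route is driven twice (driver 16 on day 1 and driver 11 on day 2), so it ranks first. Routes that never occur simply do not appear, so there is no need to enumerate all 84 combinations.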