Retention Tracking - mysql

Let’s say I have an Angry Birds game.
I want to know how many players buy the ‘mighty eagle’ weapon each month, out of the players who bought it in previous months of their LTV in the system.
I have the dates of all items bought per each client.
What I would like in practice is a two-dimensional matrix that tells me, for a specific current month, what percentage of players moved from LTV_month_X to LTV_month_Y for each combination of X<Y.
An example:
[Image: example_png]
Now, I have found a way to get the number of players who actually moved from LTV_month_X to LTV_month_Y, where LTV_month_Y is their current month of activity within the system, using an SQL query and an Excel pivot table.
What I am mainly trying to find out is how to get the base number of those who could potentially have made that transition.
A few definitions:
LTV_month_X = DATEDIFF(MONTH, first_eagle_month, specific_eagle_month)+1
Preferably I would like to have the solutions in ANSI SQL; if not, then MySQL or MSSQL, but no Oracle functions should be used at all.
Since I'm looking for the percentage of the transition, a two-step plan could also work: first find the potential ones and then find the actual ones who moved, to measure the retention from LTV_month_X to LTV_month_Y.
One last issue: I need it to be possible to drill down and find the actual IDs of the clients who moved from any stage X to any other stage Y (>X).

The use of the term LTV here is not clear. I assume you mean the lifetime of the user.
If I understand the question, you are asking: given a list of entities, each with one or more events, how do I group (e.g. count) the entities by the month of the last event and the month of the second-to-last event?
In MySQL, you can use a user variable to do that. I'm not going to explain the whole concept, but basically, when you write @var:=column within a SELECT statement, the variable is assigned the value of that column, and you can use it to compare values between consecutive rows, e.g.
LEAST(IF(@var=column,@same:=@same+1,@same:=0),@var:=column)
The use of LEAST is a trick to ensure execution order.

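The two dimensions you are looking for are:
Actual purchase month
Relative purchase month
SELECT
    player_id,
    -- first day of the month of the first purchase
    DATE_FORMAT(first_purchase_date, '%Y-%m-01') AS first_month,
    -- first day of the month of this purchase
    DATE_FORMAT(current_purchase_date, '%Y-%m-01') AS purchase_month,
    -- calendar months since the first purchase month, plus one
    PERIOD_DIFF(EXTRACT(YEAR_MONTH FROM current_purchase_date),
                EXTRACT(YEAR_MONTH FROM first_purchase_date)) + 1 AS relative_month,
    SUM(purchase_amount) AS total_purchase,
    COUNT(DISTINCT player_id) AS player_count
FROM ...
Now you can pivot purchase month against relative month and aggregate.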
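To make the "potential base" part of the question concrete, here is a minimal sketch in Python rather than SQL. It is only one possible reading of the question, not the asker's or either answerer's method. The assumption is that a player counts toward the base of cell (X, Y) if they bought in LTV month X and their lifetime has already reached month Y by the current month, and counts as a mover if they also bought in LTV month Y. The function names and the sample player ids are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_index(d):
    """Number of calendar months since year 0, so differences count month boundaries."""
    return d.year * 12 + d.month


def retention_matrix(purchases, current):
    """purchases: iterable of (player_id, date); current: any date in the reporting month.

    Returns {(x, y): (movers, base)} for x < y, where movers and base are sets of
    player ids, so the ids behind every percentage remain available for drill-down.
    """
    purchases = list(purchases)

    # First purchase month of each player.
    first = {}
    for pid, d in purchases:
        m = month_index(d)
        first[pid] = min(first.get(pid, m), m)

    # LTV months in which each player bought (first purchase month is LTV month 1).
    active = defaultdict(set)
    for pid, d in purchases:
        active[pid].add(month_index(d) - first[pid] + 1)

    now = month_index(current)
    cells = defaultdict(lambda: (set(), set()))
    for pid, months in active.items():
        reached = now - first[pid] + 1  # highest LTV month this player could have reached
        for x in months:
            for y in range(x + 1, reached + 1):
                movers, base = cells[(x, y)]
                base.add(pid)           # assumption: could have bought again in month y
                if y in months:
                    movers.add(pid)     # actually bought again in month y
    return dict(cells)


# Hypothetical sample data.
purchases = [
    ("A", date(2023, 1, 10)), ("A", date(2023, 2, 5)), ("A", date(2023, 3, 20)),
    ("B", date(2023, 1, 15)), ("B", date(2023, 3, 2)),
    ("C", date(2023, 2, 1)),
]
matrix = retention_matrix(purchases, current=date(2023, 3, 31))
```

The percentage for a cell is then len(movers) / len(base), and the two sets give the client IDs for the drill-down.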

Related

How to store recent usage frequency in MySQL

I'm working on the Product Catalog module of an Invoicing application.
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalog.
How can I store this "usage recency/frequency" in the database?
I'm thinking about adding a new field recency which would be increased by 1 every time the product was used, and decreased by 1/(count of all products) when another product is used. I would then use this recency field for ordering, but it doesn't seem to me to be the best solution.
Can you help me find the best practice for this kind of problem?
Solution for the recency calculation:
Create a new column in the products table, named last_used_on for example. Its data type should be TIMESTAMP (the MySQL representation for the Unix-time).
Advantages:
Timestamps contain both date and time parts.
It makes VERY precise calculations and comparisons of dates and times possible.
It lets you format the saved values in the date-time format of your choice.
You can convert from any date-time format into a timestamp.
For your autocomplete fields, it allows you to filter the product list as you wish: for example, to display all products used since [date-time], to fetch all products used between [date-time-1] and [date-time-2], or to get the products used only on Mondays, at 1:37:12 PM, in the last two years, two months and three days (that is how flexible timestamps are).
Solution for the usage rate calculation:
Well, actually, you are not speaking about a frequency calculation, but about a rate - even though one can argue that frequency is a rate, too.
Frequency implies using the time as the reference unit and it's measured in Hertz (Hz = [1/second]). For example, let's say you want to query how many times a product was used in the last year.
A rate, on the other hand, is a comparison, a relation between two related units. Like for example the exchange rate USD/EUR - they are both currencies. If the comparison takes place between two terms of the same type, then the result is a number without measurement units: a percentage. Like: 50 apples / 273 apples = 0.1832 = 18.32%
That said, I suppose you tried to calculate the usage rate: the number of usages of a product in relation to the number of usages of all products. For example, for one product: usage rate = 17 usages of the product / 112 total usages = 0.1518 = 15.18%. In the autocomplete you'd then want to display the products with a usage rate bigger than a given percentage (9%, for example).
This is easy to implement. In the products table add a column usages of type int or bigint and simply increment its value each time a product is used. And then, when you want to fetch the most used products, just apply a filter like in this sql statement:
SELECT
id,
name,
(usages*100) / (SELECT sum(usages) as total_usages FROM products) as usage_rate
FROM products
GROUP BY id
HAVING usage_rate > 9
ORDER BY usage_rate DESC;
In the end, recency, frequency and rate are three different things.
Good luck.
To allow for future flexibility, I'd suggest the following additional (*) table to store the entire history of product usage by all users:
Name: product_usage
Columns:
id - internal surrogate auto-incrementing primary key
product_id (int) - foreign key to product identifier
user_id (int) - foreign key to user identifier
timestamp (datetime) - date/time the product was used
This would allow the query to be fine tuned as necessary. E.g. you may decide to only order by past usage for the logged in user. Or perhaps total usage within a particular timeframe would be more relevant. Such a table may also have a dual purpose of auditing - e.g. to report on the most popular or unpopular products amongst all users.
(*) assuming something similar doesn't already exist in your database schema
Your problem is related to many other web-scale search applications, such as showing spell corrections, related searches, or "trending" topics. You recognized correctly that both recency and frequency are important criteria in determining "popular" suggestions. In practice, it is desirable to compromise between the two: recency alone will suffer from random fluctuations, but you also don't want to use only frequency, since some products might have been purchased a lot in the past while their popularity is declining (or they might have gone out of stock or been replaced by successor models).
A very simple but effective implementation that is typically used in these scenarios is exponential smoothing. First of all, most of the time it suffices to update popularities at fixed intervals (say, once each day). Set a decay parameter α (say, 0.95) that tells you how much yesterday's orders count compared to today's. Similarly, orders from two days ago will be worth α·α ≈ 0.9 times as much as today's, and so on. To choose this parameter, note that the value decays to one half after log(0.5)/log(α) days (about 14 days for α = 0.95).
The implementation only requires a single additional field per product, orders_decayed. Then, all you have to do is update this value each night with the total daily orders:
orders_decayed = α * orders_decayed + (1-α) * orders_today
You can sort your applicable suggestions according to this value.
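As a rough illustration of the nightly update and of the half-life estimate, here is a small Python sketch. The function names and the sample product figures are made up for the example, and in practice the update would more likely be a single UPDATE statement run by a scheduled job.

```python
import math


def decay_update(orders_decayed, orders_today, alpha=0.95):
    """One nightly step: orders_decayed = alpha * orders_decayed + (1 - alpha) * orders_today."""
    return alpha * orders_decayed + (1 - alpha) * orders_today


def half_life_days(alpha):
    """Number of days after which an old order counts half as much: log(0.5) / log(alpha)."""
    return math.log(0.5) / math.log(alpha)


def rank_products(scores):
    """Order product names by their decayed score, most popular first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical products: current orders_decayed value and today's order count.
state = {"stapler": 10.0, "toner": 2.0}
today = {"stapler": 0, "toner": 30}
state = {name: decay_update(value, today[name]) for name, value in state.items()}
ranking = rank_products(state)
```

With these numbers the stapler drops to 9.5 after one quiet day while the toner rises to 3.4, so a single busy day does not immediately overturn an established ranking.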
To have an individual user experience, you should not rely on a field in the product table, but rather on the history of the user.
The occurrences of the product in past invoices created by the user would be a good starting point. The advantage is that you don't need to add fields or tables for this functionality. You simply rely on data that is already present anyway.
Since it is an auto-complete field, maybe past usage is not really relevant. Display n search results as the user types. If you feel that results are better if you include recency in the calculation of the order, go with it.
Now, the implementation may differ depending on how and when products should be displayed, and on whether it has to be a user-specific usage frequency or an application-wide (overall) one. But in both cases, I would suggest having a history table, which you can later use for other analysis.
You could design your history table with at least the following columns:
Id | ProductId | LastUsed (timestamp) | UserId
Now you can create a view that queries this table for a specific time range (something like product frequency over the last week, last month or last year) and gives you the most sold products for that time range.
The same can be used for a user-specific frequency by adding a condition to filter by UserId.
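A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for MySQL and a parameterized query in place of the view. The table name product_history, the snake_case column names and the sample rows are assumptions made for the example.

```python
import sqlite3


def top_products(conn, since, user_id=None, limit=5):
    """Return (product_id, uses) pairs for usage since the given timestamp, most used first."""
    sql = "SELECT product_id, COUNT(*) AS uses FROM product_history WHERE last_used >= ?"
    params = [since]
    if user_id is not None:
        # Optional user-specific frequency.
        sql += " AND user_id = ?"
        params.append(user_id)
    sql += " GROUP BY product_id ORDER BY uses DESC, product_id LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE product_history (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           product_id INTEGER NOT NULL,
           last_used TEXT NOT NULL,      -- ISO timestamp, so string comparison orders correctly
           user_id INTEGER NOT NULL)"""
)
# One inserted row per usage; no existing row is ever updated.
conn.executemany(
    "INSERT INTO product_history (product_id, last_used, user_id) VALUES (?, ?, ?)",
    [
        (1, "2024-01-01 09:00:00", 10),
        (2, "2024-03-01 09:00:00", 10),
        (2, "2024-03-02 09:00:00", 11),
        (3, "2024-03-03 09:00:00", 11),
        (2, "2024-03-04 09:00:00", 11),
    ],
)
overall = top_products(conn, "2024-02-01 00:00:00")
for_user_10 = top_products(conn, "2024-02-01 00:00:00", user_id=10)
```

Changing the since argument gives the "last week", "last month" or "last year" variants without touching the schema.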
I'm thinking about adding a new field recency which would be increased
by 1 every time the product was used, and decreased by 1/(count of all
products), when an other product is used. Then use this recency field
for ordering, but it doesn't seem to me the best solution.
Yes, it is not good practice to add a column for this and update it every time. Imagine this product is highly anticipated and people love to buy it. Now suppose 1000 people or more request this product at the same time, and for every request you update the same row. To maintain concurrency, the database has to lock that specific row for each update, which is definitely going to hurt your database and application performance. Instead, you can simply insert a new row.
The other possible solution is to use your existing invoice table, as it will definitely have all the product- and user-specific information, and create a view to get the frequently used products as I mentioned above.
Please note that this is just another option to achieve what you are expecting. I would personally recommend having a history table instead.
The scenario
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalogue.
Your suggested solution
How can I store this "usage recency/frequency" in the database?
If it is a web application, don't store it in a database on your server. Each user makes different choices.
Store it in the user's browser as a cookie or in localStorage, because it will improve the user experience.
If you still want to store it in a MySQL table, do the following:
Create a column recency as described in the question.
Each time the item is used, increase the count by 1 as described in the question.
Don't decrease it when other items are used.
To get the most used item, query:
SELECT * FROM products WHERE recency = (SELECT MAX(recency) FROM products);
Side note
Use the database only if you want to show the most used products independently of the user.
As you aren't certain which measure to choose, and it's rather a user-experience problem, I advise you to have a number of measures and give the user the option to choose the one he/she prefers. For example, the set of available measures could include the most popular products of the last week, last month, last 3 months, last year, and overall. For the sake of performance I'd prefer to store those statistics in a separate table which is refreshed by a scheduled job running, for example, every 3 hours.

ACCESS - calculating prices

I have one table in my Access DB which contains a list of all devices that have been sold to our customer. We have only one customer and only one type of device. The table contains details like the name, serial number and warranty details of each device. Now we need to calculate the price for invoicing purposes. The price should be calculated depending on the number of devices sold. We also don't want to hard-code the price; instead we would like to use a separate table with the different price categories and calculate the price based on that table, because the prices change frequently and we want to modify the price values in only one place.
We have 3 price categories: if the customer bought 100 devices then the unit price is $15, if 200 devices then the unit price is $10, and if 300 devices then the unit price is $5. So we will need to calculate the price based on these rules.
So I'm looking for the right approach to this problem.
This is a very open-ended question and is impossible to definitively answer without more information about the architecture of the database.
There are 3 different ways I usually perform a calculation inside my Access databases.
Perform a calculation inside a query. For simple things.
Perform a calculation inside a module function that is called by a query. For complicated things.
Perform a calculation inside a module VBA subroutine that is called by a button's OnClick event. For extremely complicated things.
You could do a cross join as long as you have a range wide enough to account for all possibilities (i.e. some ridiculous upper boundary like 9999999). Otherwise a subquery would work with the same WHERE clause:
SELECT Prices.Price
FROM Prices
WHERE (Prices.MinQuantity <= Invoice.Quantity) AND
(Prices.MaxQuantity >= Invoice.Quantity)
As Gustav and V-ball point out there are many ways to set this up depending on your needs.
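Here is a rough sketch of the range lookup, written in Python with the built-in sqlite3 module as a stand-in for Access. The Prices table and its MinQuantity/MaxQuantity columns follow the query above. The tier boundaries (up to 199 at $15, 200 to 299 at $10, 300 and above at $5) are only one assumed reading of the rules in the question, and the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Prices (MinQuantity INTEGER, MaxQuantity INTEGER, Price REAL)")
# Assumed tiers; when prices change, only these rows need to be edited.
conn.executemany(
    "INSERT INTO Prices VALUES (?, ?, ?)",
    [(1, 199, 15.0), (200, 299, 10.0), (300, 9999999, 5.0)],
)


def unit_price(conn, quantity):
    """Look up the unit price of the tier whose range contains the quantity."""
    row = conn.execute(
        "SELECT Price FROM Prices WHERE MinQuantity <= ? AND MaxQuantity >= ?",
        (quantity, quantity),
    ).fetchone()
    if row is None:
        raise ValueError(f"no price tier covers quantity {quantity}")
    return row[0]


def invoice_total(conn, quantity):
    """Total for an invoice where every device is billed at the tier's unit price."""
    return quantity * unit_price(conn, quantity)
```

The ranges must not overlap and must cover every quantity you expect, otherwise the lookup returns several rows or none.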

School Site SQL Design

I was just handed a project to modify a school site for class registration. The current system is designed for each class to run one time per week, but I need to change things so that a class can occur on one or more nights per week. I am struggling with finding an efficient method to relate each of the classes, and allow the customers to view and select a single class when registering for the series.
My first thought is to add another field (groupid) that can hold a unique value to tie corresponding classes together. Finding a query that sorts this correctly is difficult, because if I sort by day-of-week followed by groupid (for display, class selection, etc), then the classes will be separated. Sorting by groupid then day-of-week produces a non-chronological order, which doesn't work either. Is there a way to keep the classes grouped together by groupid after sorting, without affecting the date order?
My second thought was to modify the table to support multiple classes per row. This would be the easier method, but less flexible, and even more problematic if the classes don't run at the same time of the week.
Anyway, I'm a little lost, and would appreciate any feedback on design, and/or a query to help with my sort problem.
Thanks!
A class is a single entity regardless of how many days it meets in a given week. Create a Schedule table. It would include a FK_ClassID and ScheduleDate. If the class meets three days in a week, it would have three records. This way, a student could schedule multiple classes, and you can check that they do not overlap on the same day of the week.
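A small sketch of that layout, using Python's built-in sqlite3 module as a stand-in for the site's database. It also shows one possible way to handle the sorting problem from the question: order the classes by their earliest meeting, then list each class's meetings together. The Class table, the class names and the dates are made up for the example; only FK_ClassID and ScheduleDate come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Class (ClassID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Schedule (
        FK_ClassID INTEGER NOT NULL REFERENCES Class(ClassID),
        ScheduleDate TEXT NOT NULL            -- one row per meeting
    );
    INSERT INTO Class VALUES (1, 'Pottery'), (2, 'Yoga');
    INSERT INTO Schedule VALUES
        (1, '2024-09-03'), (1, '2024-09-05'),
        (2, '2024-09-02'), (2, '2024-09-04');
    """
)

# Classes are ordered by their first meeting, and the meetings of one class stay
# adjacent and chronological even when another class meets in between.
rows = conn.execute(
    """
    SELECT c.Name, s.ScheduleDate
    FROM Schedule AS s
    JOIN Class AS c ON c.ClassID = s.FK_ClassID
    JOIN (SELECT FK_ClassID, MIN(ScheduleDate) AS FirstDate
          FROM Schedule
          GROUP BY FK_ClassID) AS f ON f.FK_ClassID = s.FK_ClassID
    ORDER BY f.FirstDate, s.FK_ClassID, s.ScheduleDate
    """
).fetchall()
```

Here both Yoga meetings are listed before the Pottery meetings, because Yoga starts first, even though the Pottery class meets between the two Yoga dates.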

Design for 'Total' field in a database

I am trying to find an optimal solution for my Database (MySQL), but I'm stuck over the decision whether or not to store a Total column.
This is the simplified version of my database :
I have a Team table, a Game table and a 'Score' table. Game will have {teamId, scoreId,...} while Score table will have {scoreId, Score,...} (Here ... indicates other columns in the tables).
On the home page I need to show the list of Teams with their scores. Over time the number of Teams will grow to 100s while the list of Score(s) will grow to 100000s. Which is the preferred way:
Should I sum up the scores and show along with teams every time the page is requested. (I don't want to cache because the scores will keep changing) OR
Should I have a total_score field in the Team table where I update the total_score of a team every time a new score is added to the Scores table for that group?
Which of the two is a better option or is there any other better way?
I use two guidelines when deciding to store a calculated value. In the best of all worlds, both of these statements will be true.
1) The value must be computationally expensive.
2) The value must have a low probability of changing.
If the cost of calculating the value is very high, but it changes daily, I might consider making a nightly job that updates the value.
Start without the total column and only add it if you start having performance issues.
Calculating sum at request time is better for accuracy but worse for efficiency.
Caching total in a field (dramatically) improves performance of certain queries, but increases code complexity or may show stale data (if you update cached value not at the same time, but via cron job).
It's up to you! :)
I agree that computed values should not be used except for special situations such as month end snapshots of databases.
I would simply create a view with one column in the view equal to your computed total column. Then you can query the view instead of the base tables.
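A minimal sketch of such a view, with Python's built-in sqlite3 module standing in for MySQL. The Team, Game and Score tables follow the simplified schema from the question. The view name TeamTotals, the name column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Team  (teamId INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Score (scoreId INTEGER PRIMARY KEY, score INTEGER NOT NULL);
    CREATE TABLE Game  (teamId INTEGER NOT NULL, scoreId INTEGER NOT NULL);

    -- The total is computed on every read, so it can never be stale.
    CREATE VIEW TeamTotals AS
    SELECT t.teamId, t.name, COALESCE(SUM(s.score), 0) AS total_score
    FROM Team AS t
    LEFT JOIN Game  AS g ON g.teamId = t.teamId
    LEFT JOIN Score AS s ON s.scoreId = g.scoreId
    GROUP BY t.teamId, t.name;

    INSERT INTO Team  VALUES (1, 'Reds'), (2, 'Blues'), (3, 'Greens');
    INSERT INTO Score VALUES (1, 10), (2, 5), (3, 7);
    INSERT INTO Game  VALUES (1, 1), (1, 2), (2, 3);
    """
)

totals = conn.execute(
    "SELECT name, total_score FROM TeamTotals ORDER BY teamId"
).fetchall()
```

The home page then queries TeamTotals like a table. If this ever becomes too slow, the view can be replaced by a stored total_score column without changing the queries that read it.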
It depends on how often your scores get updated and what exactly the "score" means.
Case1: Score is a LIVE score
If the "score" is a live score like "runs scored in cricket or baseball" or "score of a volleyball or table tennis match", then I really don't understand the need to show the "sum" of the "running" scores. However, this may be a requirement in some cases, like showing the total runs scored by a team until now plus the runs scored so far in the ongoing (live) game.
In this case I suggest another option, which is a combination of your 1st and 2nd options.
A total_score column in the team table would be good, with a slight change in your data model:
Add a new column called LIVE to the scores table, which will be 0 for a finished match and 1 for a live match (and optionally -1 to indicate that the match is about to start but the scores won't get updated yet).
Now union the two tables, something like
select team_id, sum(total_score) from (
select team_id, total_score from team
union all
select team_id, sum(score) total_score from scores where live = 1 group by team_id) subquery
group by team_id
Case2: Score is just a RESULT
Well, just query the db directly (your 1st option), because the result will be updated only after the game ends, and the update will in fact be a new entry in the score table.
If my assumption is correct, the scores get updated only after a game is finished. Moreover, the updates are even less frequent when you consider the games played by a single team.

MySQL Query eliminate duplicates but only adjacent to each other

I have the following query..
SELECT Flights.flightno,
Flights.timestamp,
Flights.route
FROM Flights
WHERE Flights.adshex = '400662'
ORDER BY Flights.timestamp DESC
Which returns the following screenshot.
However I cannot use a simple group by as for example BCS6515 will appear a lot later in the list and I only want to "condense" the rows that are the same next to each other in this list.
An example of the output (note BCS6515 twice in this list as they were not adjacent in the first query)
Which is why a GROUP BY flightno will not work.
I don't think there's a good way to do so in SQL without a column to help you. At best, I'm thinking it would require a subquery that would be ugly and inefficient. You have two options that would probably end up with better performance.
One would be to code the logic yourself to prune the results. (Added:) This can be done with a procedure clause of a select statement, if you want to handle it on the database server side.
Another would be to either use other information in the table or add new information to the table for this purpose. Do you currently have something in your table that is a different value for each instance of a number of BCS6515 rows?
If not, and if I'm making correct assumptions about the data in your table, there will be only one flight with the same number per day, though the flight number is reused to denote a flight with the same start/end and times on other days. (e.g. the 10a.m. from NRT to DTW is the same flight number every day). If the timestamps were always the same day, then you could use DAY(timestamp) in the GROUP BY. However, that doesn't allow for overnight flights. Thus, you'll probably need something such as a departure date to group by to identify all the rows as belonging to the same physical flight.
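If you go with the first option and prune the results in application code, a minimal Python sketch could look like this. It assumes the rows arrive in the same order as the query returns them and keeps the first row of each run of identical flight numbers. Apart from BCS6515, the flight numbers, timestamps and routes are invented for the example.

```python
from itertools import groupby


def collapse_adjacent(rows):
    """Keep the first row of every run of consecutive rows that share a flightno.

    rows: (flightno, timestamp, route) tuples, already ordered by timestamp DESC.
    Unlike GROUP BY, a flight number that reappears later starts a new run.
    """
    return [next(run) for _, run in groupby(rows, key=lambda row: row[0])]


# Hypothetical result set of the query.
rows = [
    ("BCS6515", "2013-05-03 10:05", "LEJ-EMA"),
    ("BCS6515", "2013-05-03 10:00", "LEJ-EMA"),
    ("BCS123", "2013-05-02 22:30", "EMA-LEJ"),
    ("BCS6515", "2013-05-02 09:55", "LEJ-EMA"),
]
condensed = collapse_adjacent(rows)
```

BCS6515 still appears twice in the condensed list, once for each separate run, which is the behaviour described in the question.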
GROUP BY does not work because the timestamp value is different for the two BCS6515 records.
It will work only if the timestamp is left out of the query:
SELECT Flights.flightno,
Flights.route
FROM Flights
WHERE Flights.adshex = '400662'
GROUP BY (Flights.flightno)