I have a small program, similar to Goodreads, that tracks my reading progress so I can see how much I read per day.
I created two tables for that: tbl_materials(material_id int, name varchar) and tbl_progress(date_of_update timestamp, material_id int foreign key, read_pages int, skipped bit).
Whenever I read some pages, I insert the page I have reached into tbl_progress.
I may read from the same book several times a day. If I skip some pages, I insert the page I skipped to into tbl_progress and set the skipped bit to true. The problem is that I can't work out how to query tbl_progress to find how much I read per day.
What I have tried is to find the last inserted progress row for every material on every day.
For example:
+-------------+------------+---------+---------------------+
| material_id | read_pages | skipped | last_update |
+-------------+------------+---------+---------------------+
| 4 | 1 | | 2017-09-22 00:56:02 |
| 3 | 1 | | 2017-09-22 00:56:14 |
| 12 | 1 | | 2017-09-24 20:13:01 |
| 4 | 30 | | 2017-09-25 01:56:38 |
| 4 | 34 | | 2017-09-25 02:19:47 |
| 54 | 1 | | 2017-09-29 04:22:11 |
| 59 | 9 | | 2017-10-14 15:25:14 |
| 4 | 68 | T | 2017-10-18 02:33:04 |
| 4 | 72 | | 2017-10-18 03:50:51 |
| 2 | 3 | | 2017-10-18 15:02:46 |
| 2 | 5 | | 2017-10-18 15:10:46 |
| 4 | 82 | | 2017-10-18 16:18:03 |
| 4 | 84 | | 2017-10-20 18:06:40 |
| 4 | 87 | | 2017-10-20 19:11:07 |
| 4 | 103 | T | 2017-10-21 19:50:29 |
| 4 | 104 | | 2017-10-22 19:56:14 |
| 4 | 108 | | 2017-10-22 20:08:08 |
| 2 | 6 | | 2017-10-23 00:35:45 |
| 4 | 111 | | 2017-10-23 02:29:32 |
| 4 | 115 | | 2017-10-23 03:06:15 |
+-------------+------------+---------+---------------------+
I calculate the total pages read per day as (last page read on that day) - (last page read on any earlier date). This works, but I can't exclude skipped pages.
On 2017-09-22 I read 1 page of material_id = 4 and 1 page of material_id = 3, so the total for that day is 2.
On 2017-09-25 the last update for material_id = 4 is page 34, so I read 34 - 1 = 33 pages (the last update on that day, 34, minus the last update before that date, 1).
Up to this point everything works, but I couldn't handle skipped pages. For example:
Before 2017-10-18 the last page read for material_id = 4 was 34 (on 2017-09-25). I then skipped 34 pages, so the current page became 68, read 4 pages (2017-10-18 03:50:51), and then another 10 pages (2017-10-18 16:18:03), so the total for material_id = 4 on that day is 14.
I created a view to select the most recent last_update for every book on every day:
create view v_mostRecentPerDay as
select material_id id,
(select title from materials where materials.material_id = id) title,
completed_pieces,
last_update,
date(last_update) dl,
skipped
from progresses
where last_update = (
select max(last_update)
from progresses s2
where s2.material_id = progresses.material_id
and date(s2.last_update) = date(progresses.last_update)
and s2.skipped = false
);
If there are several updates for a single book on one day, this view retrieves the last one (the row with the maximum last_update), which carries the highest page number, and it does so for every book.
I created another view to get the total pages read each day:
create view v_totalReadInDay as
select dl, sum(diff) totalReadsInThisDay
from (
select dl,
completed_pieces - ifnull((select completed_pieces
from progresses
where material_id = id
and date(progresses.last_update) < dl
ORDER BY last_update desc
limit 1
), 0) diff
from v_mostRecentPerDay
where skipped = false
) omda
group by dl;
The problem is that this last view still counts skipped pages.
Expected result:
+------------+------------------+
| day | total_read_pages |
+------------+------------------+
| 2017-09-22 | 2 |
+------------+------------------+
| 2017-09-24 | 1 |
+------------+------------------+
| 2017-09-25 | 33 |
+------------+------------------+
| 2017-09-29 | 1 |
+------------+------------------+
| 2017-10-14 | 9 |
+------------+------------------+
| 2017-10-18 | 19 |
+------------+------------------+
| 2017-10-20 | 5 |
+------------+------------------+
| 2017-10-21 | 0 |
+------------+------------------+
| 2017-10-22 | 21 |
+------------+------------------+
| 2017-10-23 | 8 |
+------------+------------------+
mysql> SELECT VERSION();
+-----------------------------+
| VERSION() |
+-----------------------------+
| 5.7.26-0ubuntu0.16.04.1-log |
+-----------------------------+
This seems like a super-convoluted way to evaluate pages read per day. Have you considered denormalising your data slightly and storing both the current page and the number of pages read?
The current page may make more sense stored in the material table, or in a separate bookmark table e.g.
bookmark - id, material_id, page_number
reading - id, bookmark_id, pages_complete, was_skipped, ended_at
When a reading (or skipping!) session is complete, pages_complete is simply the new current page minus the old current page stored in the bookmark, and this can be calculated in your application logic.
Your pages-per-day query then becomes:
SELECT SUM(pages_complete) pages_read
FROM reading
WHERE ended_at >= :day
AND ended_at < :day + INTERVAL 1 DAY
AND was_skipped IS NOT TRUE
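The application-side calculation described above could look roughly like the following sketch. It uses Python's built-in sqlite3 in place of MySQL so that it runs on its own; the bookmark and reading tables follow the answer, while the function names record_session and pages_read_on, the one-bookmark-per-material rule, and the starting page of 0 are assumptions made for illustration.

```python
import sqlite3


def record_session(conn, material_id, new_page, was_skipped, ended_at):
    """Store one reading (or skipping) session and advance the bookmark."""
    row = conn.execute(
        "SELECT id, page_number FROM bookmark WHERE material_id = ?",
        (material_id,),
    ).fetchone()
    if row is None:
        # Assumption: the first session for a material starts from page 0.
        cur = conn.execute(
            "INSERT INTO bookmark (material_id, page_number) VALUES (?, 0)",
            (material_id,),
        )
        bookmark_id, old_page = cur.lastrowid, 0
    else:
        bookmark_id, old_page = row
    # pages_complete = new current page - old current page
    conn.execute(
        "INSERT INTO reading (bookmark_id, pages_complete, was_skipped, ended_at)"
        " VALUES (?, ?, ?, ?)",
        (bookmark_id, new_page - old_page, int(was_skipped), ended_at),
    )
    conn.execute(
        "UPDATE bookmark SET page_number = ? WHERE id = ?",
        (new_page, bookmark_id),
    )


def pages_read_on(conn, day):
    """Sum the non-skipped pages of sessions that ended on the given day."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(pages_complete), 0) FROM reading"
        " WHERE ended_at >= ? AND ended_at < date(?, '+1 day')"
        " AND was_skipped = 0",
        (day, day),
    ).fetchone()
    return total


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookmark (
        id INTEGER PRIMARY KEY, material_id INTEGER UNIQUE, page_number INTEGER);
    CREATE TABLE reading (
        id INTEGER PRIMARY KEY, bookmark_id INTEGER, pages_complete INTEGER,
        was_skipped INTEGER, ended_at TEXT);
""")
record_session(conn, 4, 34, False, "2017-09-25 02:19:47")
record_session(conn, 4, 68, True, "2017-10-18 02:33:04")   # skipped to page 68
record_session(conn, 4, 72, False, "2017-10-18 03:50:51")  # read 4 pages
record_session(conn, 4, 82, False, "2017-10-18 16:18:03")  # read 10 pages
print(pages_read_on(conn, "2017-10-18"))
```

With these four sessions the skipped jump from 34 to 68 is stored but not counted, so only the 4 and 10 pages read on 2017-10-18 are summed.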
You can make a view that uses the same columns as the progresses table plus one derived column, which follows the same idea @Arth suggested (the pages_complete column).
This column holds the difference between the current completed_pieces and the completed_pieces of the update immediately before it for the same material.
For example, suppose your progress table looks like this:
+-------------+------------+---------+---------------------+
| material_id | read_pages | skipped | last_update |
+-------------+------------+---------+---------------------+
| 4 | 68 | T | 2017-10-18 02:33:04 |
| 4 | 72 | | 2017-10-18 03:50:51 |
| 2 | 3 | | 2017-10-18 15:02:46 |
| 2 | 5 | | 2017-10-18 15:10:46 |
| 4 | 82 | | 2017-10-18 16:18:03 |
+-------------+------------+---------+---------------------+
We add a derived column called diff. For the first row, diff = read_pages at 2017-10-18 02:33:04 - read_pages of the update directly before 2017-10-18 02:33:04:
+-------------+------------+---------+---------------------+------------------+
| material_id | read_pages | skipped | last_update         | Derived_col_diff |
+-------------+------------+---------+---------------------+------------------+
| 4           | 68         | T       | 2017-10-18T02:33:04 | 68 - null = 0    |
| 4           | 72         |         | 2017-10-18T03:50:51 | 72 - 68 = 4      |
| 2           | 3          |         | 2017-10-18T15:02:46 | 3 - null = 0     |
| 2           | 5          |         | 2017-10-18T15:10:46 | 5 - 3 = 2        |
| 4           | 82         |         | 2017-10-18T16:18:03 | 82 - 72 = 10     |
+-------------+------------+---------+---------------------+------------------+
Note that 68 - null is actually null; I wrote 0 for clarity.
In other words, the derived column is the difference between a row's read_pages and the read_pages of the row directly before it for the same material.
Here is the view:
create view v_progesses_with_read_pages as
select s0.*,
completed_pieces - ifnull((select completed_pieces
from progresses s1
where s1.material_id = s0.material_id
and s1.last_update = (
select max(last_update)
FROM progresses s2
where s2.material_id = s1.material_id and s2.last_update < s0.last_update
)), 0) read_pages
from progresses s0;
Then you can select the sum of this derived column per day:
select date(last_update) dl,
sum(read_pages) totalReadsInThisDay
from v_progesses_with_read_pages
where skipped = false
group by dl;
This will result in something like:
+------------+-----------------------------+
| dl         | totalReadsInThisDay         |
+------------+-----------------------------+
| 2017-10-18 | 16                          |
| 2017-10-19 | 20 (just for clarification) |
+------------+-----------------------------+
Note that the last row is made up for illustration; it is not in the sample data.
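A minimal sketch of the same idea run against the question's sample rows, using Python's built-in sqlite3 as a stand-in for MySQL (an assumption made only so the example is self-contained). It computes each row's difference from the previous update of the same material, with a missing previous row treated as page 0 as in the view's ifnull. As a small variation on the view above, skipped rows are summed as 0 instead of being filtered out, so a day that contains only skipped pages still appears in the result.

```python
import sqlite3

# (material_id, completed_pieces, skipped, last_update) from the question
ROWS = [
    (4, 1, 0, "2017-09-22 00:56:02"), (3, 1, 0, "2017-09-22 00:56:14"),
    (12, 1, 0, "2017-09-24 20:13:01"), (4, 30, 0, "2017-09-25 01:56:38"),
    (4, 34, 0, "2017-09-25 02:19:47"), (54, 1, 0, "2017-09-29 04:22:11"),
    (59, 9, 0, "2017-10-14 15:25:14"), (4, 68, 1, "2017-10-18 02:33:04"),
    (4, 72, 0, "2017-10-18 03:50:51"), (2, 3, 0, "2017-10-18 15:02:46"),
    (2, 5, 0, "2017-10-18 15:10:46"), (4, 82, 0, "2017-10-18 16:18:03"),
    (4, 84, 0, "2017-10-20 18:06:40"), (4, 87, 0, "2017-10-20 19:11:07"),
    (4, 103, 1, "2017-10-21 19:50:29"), (4, 104, 0, "2017-10-22 19:56:14"),
    (4, 108, 0, "2017-10-22 20:08:08"), (2, 6, 0, "2017-10-23 00:35:45"),
    (4, 111, 0, "2017-10-23 02:29:32"), (4, 115, 0, "2017-10-23 03:06:15"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE progresses (material_id INTEGER, completed_pieces INTEGER,"
    " skipped INTEGER, last_update TEXT)"
)
conn.executemany("INSERT INTO progresses VALUES (?, ?, ?, ?)", ROWS)

QUERY = """
SELECT date(p.last_update) AS dl,
       SUM(CASE WHEN p.skipped THEN 0
                ELSE p.completed_pieces - IFNULL(
                    (SELECT s1.completed_pieces
                       FROM progresses s1
                      WHERE s1.material_id = p.material_id
                        AND s1.last_update < p.last_update
                      ORDER BY s1.last_update DESC
                      LIMIT 1), 0)
           END) AS totalReadsInThisDay
FROM progresses p
GROUP BY dl
ORDER BY dl
"""
# Map each day to its total of non-skipped pages.
totals = dict(conn.execute(QUERY).fetchall())
for day, total in totals.items():
    print(day, total)
```

Because every row is compared with the update immediately before it, a skipped jump only affects its own row, and the pages read afterwards are measured from the page that was skipped to.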
Related
Let's suppose I have a MySQL table 'orders' with the following data:
| id | order_no | item_id | amount | datetime |
| 1 | 123 | 901 | 1 | 2020-08-05 00:00:01 |
| 2 | 324 | 902 | 2 | 2020-08-06 00:00:01 |
| 3 | 324 | 905 | 1 | 2020-08-06 00:00:01 |
| 4 | 511 | 902 | 1 | 2020-08-07 00:00:01 |
| 5 | 400 | 904 | 3 | 2020-08-08 00:00:01 |
| 6 | 195 | 903 | 1 | 2020-08-09 00:00:01 |
| 7 | 195 | 905 | 2 | 2020-08-09 00:00:01 |
| 8 | 250 | 908 | 1 | 2020-08-10 00:00:01 |
| 9 | 222 | 901 | 3 | 2020-08-11 00:00:01 |
| 10 | 315 | 903 | 1 | 2020-08-12 00:00:01 |
| 11 | 315 | 905 | 2 | 2020-08-12 00:00:01 |
| 12 | 198 | 903 | 1 | 2020-08-13 00:00:01 |
| 13 | 651 | 902 | 2 | 2020-08-14 00:00:01 |
| 14 | 651 | 907 | 2 | 2020-08-14 00:00:01 |
| 15 | 405 | 902 | 1 | 2020-08-15 00:00:01 |
| 16 | 112 | 905 | 1 | 2020-08-16 00:00:01 |
On my website I want to display the orders according to the user's settings: orders per page and page number. The data needs to be ordered by 'datetime' in ascending order, so if the page number is 2 with 5 orders per page, I would need the rows with ids 8-14 (rows 1-7 make up the first 5 orders and rows 8-14 the next 5). Note that some orders (324, 195, 315 and 651 here) have 2 rows, and could have more, with the same order_no but different item_id.
The simple LIMIT and OFFSET clauses are of no use here unless I combine them with some subqueries, but so far I have not found the solution.
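One possible shape for the subquery approach mentioned above, sketched with Python's built-in sqlite3 instead of MySQL so it can be run as-is: apply LIMIT and OFFSET to a derived table with one row per order_no, then join back to fetch every item row of those orders. The function name fetch_page and the tie-break on order_no are assumptions for illustration.

```python
import sqlite3

# (id, order_no, item_id, amount, datetime) from the question
ORDERS = [
    (1, 123, 901, 1, "2020-08-05 00:00:01"), (2, 324, 902, 2, "2020-08-06 00:00:01"),
    (3, 324, 905, 1, "2020-08-06 00:00:01"), (4, 511, 902, 1, "2020-08-07 00:00:01"),
    (5, 400, 904, 3, "2020-08-08 00:00:01"), (6, 195, 903, 1, "2020-08-09 00:00:01"),
    (7, 195, 905, 2, "2020-08-09 00:00:01"), (8, 250, 908, 1, "2020-08-10 00:00:01"),
    (9, 222, 901, 3, "2020-08-11 00:00:01"), (10, 315, 903, 1, "2020-08-12 00:00:01"),
    (11, 315, 905, 2, "2020-08-12 00:00:01"), (12, 198, 903, 1, "2020-08-13 00:00:01"),
    (13, 651, 902, 2, "2020-08-14 00:00:01"), (14, 651, 907, 2, "2020-08-14 00:00:01"),
    (15, 405, 902, 1, "2020-08-15 00:00:01"), (16, 112, 905, 1, "2020-08-16 00:00:01"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_no INTEGER,"
    " item_id INTEGER, amount INTEGER, datetime TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", ORDERS)


def fetch_page(conn, page, per_page):
    """Return all item rows belonging to the orders on the given page."""
    sql = """
        SELECT o.id, o.order_no, o.item_id, o.amount, o.datetime
        FROM orders AS o
        JOIN (
            -- one row per order, paginated by the order's earliest datetime
            SELECT order_no, MIN(datetime) AS first_dt
            FROM orders
            GROUP BY order_no
            ORDER BY first_dt, order_no
            LIMIT ? OFFSET ?
        ) AS pg ON pg.order_no = o.order_no
        ORDER BY o.datetime, o.id
    """
    return conn.execute(sql, (per_page, (page - 1) * per_page)).fetchall()


for row in fetch_page(conn, page=2, per_page=5):
    print(row)
```

The pagination happens on orders, not rows, so a page always contains complete orders regardless of how many items each one has.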
I have come to a solution that I think will work best for me: one table 'orders' holding all the order 'header' data plus the first item of the order (each piece of item data in its own column, such as item_id and amount), and another column, 'items', of type JSON, storing the second and further items when there are two or more. This way I can use LIMIT and OFFSET and need only one query, especially for inserting a new order, which had worried me the most because with 2 tables I would have to use a transaction.
A select query will be simple in most cases; only for orders with 2 or more items will I need to handle the items from the JSON column. As I mentioned in a comment, most orders contain only one item, so I do not expect this to hurt performance.
Thank you @Shadow for your comments. They really helped me find the solution, as I think I had been going in the wrong direction.
I have the following table, let's call it Segments:
-------------------------------------
| SegmentStart | SegmentEnd | Value |
-------------------------------------
| 1 | 4 | 20 |
| 4 | 8 | 60 |
| 8 | 10 | 20 |
| 10 | 1000000 | 0 |
-------------------------------------
I am trying to join this table with itself, to obtain the following result set:
-------------------------------------
| SegmentStart | SegmentEnd | Value |
-------------------------------------
| 1 | 4 | 20 |
| 1 | 8 | 60 |
| 1 | 10 | 60 |
| 1 | 1000000 | 60 |
| 4 | 8 | 60 |
| 4 | 10 | 60 |
| 4 | 1000000 | 60 |
| 8 | 10 | 20 |
| 8 | 1000000 | 20 |
| 10 | 1000000 | 0 |
-------------------------------------
Basically, I need to join every row with every row that comes after it, then take the MAX() of Value over all the rows between the two joined rows, inclusive. Example: if I am joining row 1 with row 3, I need the MAX(Value) over rows 1, 2 and 3.
What I have done so far is the following query:
SELECT s1.SegmentStart, s2.SegmentEnd, GREATEST(s1.Value, s2.Value) as Value
FROM Segments s1
CROSS JOIN Segments s2 ON s1.SegmentStart < s2.SegmentEnd
This query creates a table similar to the desired one, but the Value fields get mixed up in the following way (I've marked the rows that differ with !!):
-------------------------------------
| SegmentStart | SegmentEnd | Value |
-------------------------------------
| 1 | 4 | 20 |
| 1 | 8 | 60 |
| 1 | 10 | !20! |
| 1 | 1000000 | !20! |
| 4 | 8 | 60 |
| 4 | 10 | 60 |
| 4 | 1000000 | 60 |
| 8 | 10 | 20 |
| 8 | 1000000 | 20 |
| 10 | 1000000 | 0 |
-------------------------------------
The problem is with the GREATEST() function, because it only compares the two rows that are being joined (start-end 1-4, 8-10), and not the whole interval (in this case, it would be 3 rows, the ones with start-end 1-4, 4-8, 8-10)
How should I modify this query, or what query should I use, to get my desired result?
Additional info that may help: the rows in the original table are always ordered by SegmentStart, and there are no duplicate or missing values. Every interval between x and y appears only once in the table, with no overlaps and no gaps.
I am using MariaDB 10.3.13.
Something like this?
SELECT
s1.SegmentStart
, s2.SegmentEnd
, MAX(s.Value) as Value
FROM
Segments s1
INNER JOIN Segments s2 ON (
s2.SegmentEnd > s1.SegmentStart
)
INNER JOIN Segments s ON (
s.SegmentStart >= s1.SegmentStart
AND s.SegmentEnd <= s2.SegmentEnd
)
GROUP BY
s1.SegmentStart
, s2.SegmentEnd
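As a quick check, the query above can be run against the question's four rows. The sketch below uses Python's built-in sqlite3 rather than MariaDB (an assumption made so the example is self-contained) and adds an ORDER BY so the rows come back in the same order as the desired table.

```python
import sqlite3

# (SegmentStart, SegmentEnd, Value) from the question
SEGMENTS = [(1, 4, 20), (4, 8, 60), (8, 10, 20), (10, 1000000, 0)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Segments (SegmentStart INTEGER, SegmentEnd INTEGER, Value INTEGER)"
)
conn.executemany("INSERT INTO Segments VALUES (?, ?, ?)", SEGMENTS)

QUERY = """
SELECT s1.SegmentStart, s2.SegmentEnd, MAX(s.Value) AS Value
FROM Segments s1
INNER JOIN Segments s2
        ON s2.SegmentEnd > s1.SegmentStart
INNER JOIN Segments s            -- every segment inside [s1 start, s2 end]
        ON s.SegmentStart >= s1.SegmentStart
       AND s.SegmentEnd <= s2.SegmentEnd
GROUP BY s1.SegmentStart, s2.SegmentEnd
ORDER BY s1.SegmentStart, s2.SegmentEnd
"""
merged = conn.execute(QUERY).fetchall()
for row in merged:
    print(row)
```

The third join is what fixes the GREATEST() problem: it brings in the segments lying between the two endpoints, so MAX() sees the whole interval and not only its first and last rows.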
I have a database that tracks the size of claims.
Each claim has fixed information that is stored in claim (such as claim_id and date_reported_to_insurer).
Each month, I get a report which is added to the table claim_month. This includes fields such as claim_id, month_id [101 is 31/01/2018, 102 is 28/02/2018, etc] and paid_to_date.
Since most claims don't change from month to month, I only add a record for claim_month when the figure has changed since last month. As such, a claim may have a June report and an August report, but not a July report. This would be because the amount paid to date increased in June and August, but not July.
The problem that I have now is that I want to be able to check the amount paid each month.
Consider the following example data:
+----------------+----------+----------------+--------------+
| claim_month_id | claim_id | month_id | paid_to_date |
+----------------+----------+----------------+--------------+
| 1 | 1 | 6 | 1000 |
+----------------+----------+----------------+--------------+
| 5 | 1 | 7 | 1200 |
+----------------+----------+----------------+--------------+
| 7 | 2 | 6 | 500 |
+----------------+----------+----------------+--------------+
| 12 | 1 | 9 | 1400 |
+----------------+----------+----------------+--------------+
| 18 | 2 | 8 | 600 |
+----------------+----------+----------------+--------------+
If we assume that this is all of the information regarding claim 1 and 2, then that would suggest that they are both claims that occurred during June 2018. Their transactions should look like the following:
+----------------+----------+----------------+------------+
| claim_month_id | claim_id | month_id | paid_month |
+----------------+----------+----------------+------------+
| 1 | 1 | 6 | 1000 |
+----------------+----------+----------------+------------+
| 5 | 1 | 7 | 200 |
+----------------+----------+----------------+------------+
| 7 | 2 | 6 | 500 |
+----------------+----------+----------------+------------+
| 12 | 1 | 9 | 200 |
+----------------+----------+----------------+------------+
| 18 | 2 | 8 | 100 |
+----------------+----------+----------------+------------+
The algorithm I'm using for this is
SELECT new.claim_month_id,
new.month_id,
new.claim_id,
new.paid_to_date - old.paid_to_date AS paid_to_date_change
FROM claim_month AS new
LEFT JOIN claim_month AS old
ON new.claim_id = old.claim_id
AND ( new.month_id > old.month_id
OR old.month_id IS NULL )
GROUP BY new.claim_month_id
HAVING old.month_id = Max(old.month_id)
However, this has two issues:
1. It seems really inefficient at dealing with claims with multiple records. I haven't run any benchmarking, but it's pretty obvious.
2. It doesn't show new claims. In the above example, it would only show lines 2, 3 and 5.
Where am I going wrong with my algorithm, and is there a better logic to use to do this?
Use the LAG function to get the previous paid_to_date of each claim_id, and subtract it from the current paid_to_date. LAG is a window function, so it needs MySQL 8.0 or later (or MariaDB 10.2 or later).
SELECT
claim_month_id,
claim_id,
month_id,
paid_to_date - LAG(paid_to_date, 1, 0) OVER (PARTITION BY claim_id ORDER BY month_id) AS paid_month
FROM claim_month
The output table is:
+----------------+----------+----------+------------+
| claim_month_id | claim_id | month_id | paid_month |
+----------------+----------+----------+------------+
| 1 | 1 | 6 | 1000 |
| 5 | 1 | 7 | 200 |
| 12 | 1 | 9 | 200 |
| 7 | 2 | 6 | 500 |
| 18 | 2 | 8 | 100 |
+----------------+----------+----------+------------+
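The LAG query can be tried on the question's five rows with the sketch below. It uses Python's built-in sqlite3 instead of MySQL (an assumption for self-containment; window functions need SQLite 3.25 or later) and adds an ORDER BY so the rows come back grouped by claim.

```python
import sqlite3

# (claim_month_id, claim_id, month_id, paid_to_date) from the question
CLAIM_MONTH = [
    (1, 1, 6, 1000),
    (5, 1, 7, 1200),
    (7, 2, 6, 500),
    (12, 1, 9, 1400),
    (18, 2, 8, 600),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim_month (claim_month_id INTEGER PRIMARY KEY,"
    " claim_id INTEGER, month_id INTEGER, paid_to_date INTEGER)"
)
conn.executemany("INSERT INTO claim_month VALUES (?, ?, ?, ?)", CLAIM_MONTH)

QUERY = """
SELECT claim_month_id,
       claim_id,
       month_id,
       -- the default of 0 makes a claim's first report count in full
       paid_to_date - LAG(paid_to_date, 1, 0)
           OVER (PARTITION BY claim_id ORDER BY month_id) AS paid_month
FROM claim_month
ORDER BY claim_id, month_id
"""
paid = conn.execute(QUERY).fetchall()
for row in paid:
    print(row)
```

The third argument of LAG is what makes new claims show up: a claim's first report has no previous row, so 0 is subtracted and the full amount is reported. Months without a report are skipped naturally because LAG looks at the previous stored row, not the previous calendar month.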
I have a ratings table, where each user can add one rating a day. But each user might miss several days between ratings.
I'd like to get the average rating for each user_id's first 7 entries of created_at.
My table:
mysql> desc entries;
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| rating | tinyint(4) | NO | | NULL | |
| user_id | int(10) unsigned | NO | MUL | NULL | |
| created_at | timestamp | YES | | NULL | |
+------------+------------------+------+-----+---------+----------------+
Ideally I'd just get something like:
+------------+------------------+
| day | average_rating |
+------------+------------------+
| 1 | 2.53 |
+------------+------------------+
| 2 | 4.30 |
+------------+------------------+
| 3 | 3.67 |
+------------+------------------+
| 4 | 5.50 |
+------------+------------------+
| 5 | 7.23 |
+------------+------------------+
| 6 | 6.98 |
+------------+------------------+
| 7 | 7.22 |
+------------+------------------+
The closest I've been able to get is:
SELECT rating, user_id, created_at FROM entries ORDER BY user_id asc, created_at desc
Which isn't very close at all...
Is it even possible? Will the performance be terrible? It's something that would need to run every time a web page is loaded, so would it be better to just run this once a day and save the results? (to another table!?)
edit - second attempt
Working towards a solution, I think this would get the rating for each user's first day:
select rating from entries where user_id in
(select user_id from entries order by created_at limit 1);
But I get:
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
So now I'm going to play around with JOIN to see if that helps.
edit - third attempt, getting closer
I found this stackoverflow post, which is closer to what I want.
select e1.* from entries e1 left join entries e2
on (e1.user_id = e2.user_id and e1.created_at > e2.created_at)
where e2.id is null;
It gets the rating for the first day for each user.
The next step is to work out how to get days 2 to 7. I can't use e1.created_at > e2.created_at for that, so I'm really confused now.
edit - fourth attempt
Okay, I think it's not possible. Once I worked out how to turn off 'full group by' mode, I realised I'll probably need to use a subquery with limit <user_id>, <day_num>, for which I get:
ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
My current method is to just get the entire table, and use PHP to calculate the average for each day.
If I understand correctly you want to take the last 7 ratings the user gave, ordered by the date they gave the rating. The last 7 ratings of one user may fall on different days to another user, however they will be averaged together regardless of date.
First we need to order the data by user and date and give each user their own incrementing row count. I do this by adding two variables, one for the last user id and one for the row number:
select e.created_at,
e.rating,
if(@lastUser=user_id,@row := @row+1, @row:=1) as row,
@lastUser:= e.user_id as user_id
from entries e,
( select @row := 0, @lastUser := 0 ) vars
order by e.user_id asc,
e.created_at desc;
If the previous user_id is different we reset the row counter to 1. The result from this is:
+---------------------+--------+------+---------+
| created_at | rating | row | user_id |
+---------------------+--------+------+---------+
| 2017-01-10 00:00:00 | 1 | 1 | 1 |
| 2017-01-09 00:00:00 | 1 | 2 | 1 |
| 2017-01-08 00:00:00 | 1 | 3 | 1 |
| 2017-01-07 00:00:00 | 1 | 4 | 1 |
| 2017-01-06 00:00:00 | 1 | 5 | 1 |
| 2017-01-05 00:00:00 | 1 | 6 | 1 |
| 2017-01-04 00:00:00 | 1 | 7 | 1 |
| 2017-01-03 00:00:00 | 1 | 8 | 1 |
| 2017-01-02 00:00:00 | 1 | 9 | 1 |
| 2017-01-01 00:00:00 | 1 | 10 | 1 |
| 2017-01-13 00:00:00 | 1 | 1 | 2 |
| 2017-01-11 00:00:00 | 1 | 2 | 2 |
| 2017-01-09 00:00:00 | 1 | 3 | 2 |
| 2017-01-07 00:00:00 | 1 | 4 | 2 |
| 2017-01-05 00:00:00 | 1 | 5 | 2 |
| 2017-01-03 00:00:00 | 1 | 6 | 2 |
| 2017-01-01 00:00:00 | 1 | 7 | 2 |
| 2017-01-13 00:00:00 | 1 | 1 | 3 |
| 2017-01-01 00:00:00 | 1 | 2 | 3 |
| 2017-01-03 00:00:00 | 1 | 1 | 4 |
| 2017-01-01 00:00:00 | 1 | 2 | 4 |
| 2017-01-02 00:00:00 | 1 | 1 | 5 |
+---------------------+--------+------+---------+
We now simply wrap this in another statement to select the avg where the row number is less than or equal to seven.
select e1.row day, avg(e1.rating) avg
from (
select e.created_at,
e.rating,
if(@lastUser=user_id,@row := @row+1, @row:=1) as row,
@lastUser:= e.user_id as user_id
from entries e,
( select @row := 0, @lastUser := 0 ) vars
order by e.user_id asc,
e.created_at desc) e1
where e1.row <=7
group by e1.row;
This outputs:
+------+--------+
| day | avg |
+------+--------+
| 1 | 1.0000 |
| 2 | 1.0000 |
| 3 | 1.0000 |
| 4 | 1.0000 |
| 5 | 1.0000 |
| 6 | 1.0000 |
| 7 | 1.0000 |
+------+--------+
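For comparison, here is a sketch of a different technique that avoids user variables: a correlated COUNT(*) subquery numbers each user's entries from oldest to newest (so it targets the first entries the question asks about, not the last), and the outer query averages by that number. It is written with Python's built-in sqlite3 as a stand-in for MySQL; the sample rows, the alias entry_no, and the function name average_by_entry_number are made up for illustration, and it assumes a user never has two entries with the same created_at.

```python
import sqlite3

# (rating, user_id, created_at): made-up sample rows, not from the question
ENTRIES = [
    (2, 1, "2017-01-01 00:00:00"),
    (4, 1, "2017-01-03 00:00:00"),
    (6, 1, "2017-01-04 00:00:00"),
    (4, 2, "2017-01-02 00:00:00"),
    (8, 2, "2017-01-09 00:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, rating INTEGER,"
    " user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO entries (rating, user_id, created_at) VALUES (?, ?, ?)", ENTRIES
)


def average_by_entry_number(conn, max_entries=7):
    """Average rating of every user's 1st entry, 2nd entry, and so on."""
    sql = """
        SELECT ranked.entry_no, AVG(ranked.rating) AS average_rating
        FROM (
            SELECT e.rating,
                   -- number of this user's entries up to and including this one
                   (SELECT COUNT(*)
                      FROM entries e2
                     WHERE e2.user_id = e.user_id
                       AND e2.created_at <= e.created_at) AS entry_no
            FROM entries e
        ) AS ranked
        WHERE ranked.entry_no <= ?
        GROUP BY ranked.entry_no
        ORDER BY ranked.entry_no
    """
    return conn.execute(sql, (max_entries,)).fetchall()


print(average_by_entry_number(conn))
```

The correlated subquery runs once per row, so on a large table it is worth having an index on (user_id, created_at) or caching the result, as the question already considers.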
I have 2 tables.
Transaction table
+-----+------------+------------+
| TID | CampaignID | DATE       |
+-----+------------+------------+
| 1   | 5          | 2016-01-01 |
| 2   | 5          | 2016-01-01 |
| 3   | 2          | 2016-01-01 |
| 4   | 5          | 2016-01-01 |
| 5   | 1          | 2016-01-01 |
| 6   | 1          | 2016-02-02 |
| 7   | 3          | 2016-02-02 |
| 8   | 3          | 2016-02-02 |
| 9   | 5          | 2016-02-02 |
| 10  | 4          | 2016-02-02 |
+-----+------------+------------+
Campaign table
+------------+---------------------+----------------+
| CampaignID | DailyMaxImpressions | CampaignActive |
+------------+---------------------+----------------+
| 1          | 5                   | Y              |
| 2          | 5                   | Y              |
| 3          | 5                   | Y              |
| 4          | 5                   | Y              |
| 5          | 1                   | Y              |
+------------+---------------------+----------------+
What I am trying to do is get a single random campaign where the count in the transaction table is less than the daily max impressions in the campaign table. I might also pass a date as part of the query for the transaction table.
So for CampaignID 1 there must be 4 transactions or fewer in the transaction table, and CampaignActive must be "Y".
Any help would be appreciated if this can be done in a single statement. ( mysql )
Thanks in advance,
Jeff Godstein
This should get it for you. The basic query selects each campaign that is active. The inner query pre-aggregates the transaction count per campaign for the date in question. The LEFT JOIN then lets a campaign be returned either when it does not appear in the subquery at all (no transactions that day) or when it does appear but its count is less than the allowed daily maximum. ORDER BY RAND() shuffles the qualifying campaigns and LIMIT 1 keeps a single one.
SELECT
c.CampaignID
from
Campaign c
LEFT JOIN
( select
t1.CampaignID,
count(*) as CampCount
from
Transaction t1
where
t1.Date = YourDateParameterValue
group by
t1.CampaignID ) as t
ON c.CampaignID = t.CampaignID
where
c.CampaignActive = 'Y'
AND ( t.CampaignID IS NULL
OR t.CampCount < c.DailyMaxImpressions )
order by
RAND()
limit 1
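The following sketch runs the same LEFT JOIN logic on the question's data using Python's built-in sqlite3 instead of MySQL. Two things differ because of that assumption: SQLite spells the random function RANDOM(), not RAND(), and the table name Transaction is quoted because it is a keyword there. The function names eligible_campaigns and random_campaign are made up for illustration.

```python
import sqlite3

# (TID, CampaignID, Date) and (CampaignID, DailyMaxImpressions, CampaignActive)
TRANSACTIONS = [
    (1, 5, "2016-01-01"), (2, 5, "2016-01-01"), (3, 2, "2016-01-01"),
    (4, 5, "2016-01-01"), (5, 1, "2016-01-01"), (6, 1, "2016-02-02"),
    (7, 3, "2016-02-02"), (8, 3, "2016-02-02"), (9, 5, "2016-02-02"),
    (10, 4, "2016-02-02"),
]
CAMPAIGNS = [(1, 5, "Y"), (2, 5, "Y"), (3, 5, "Y"), (4, 5, "Y"), (5, 1, "Y")]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Transaction" (TID INTEGER PRIMARY KEY,'
    " CampaignID INTEGER, Date TEXT)"
)
conn.execute(
    "CREATE TABLE Campaign (CampaignID INTEGER PRIMARY KEY,"
    " DailyMaxImpressions INTEGER, CampaignActive TEXT)"
)
conn.executemany('INSERT INTO "Transaction" VALUES (?, ?, ?)', TRANSACTIONS)
conn.executemany("INSERT INTO Campaign VALUES (?, ?, ?)", CAMPAIGNS)

ELIGIBLE_SQL = """
SELECT c.CampaignID
FROM Campaign c
LEFT JOIN (
    -- transactions per campaign on the requested date
    SELECT t1.CampaignID, COUNT(*) AS CampCount
    FROM "Transaction" t1
    WHERE t1.Date = ?
    GROUP BY t1.CampaignID
) AS t ON c.CampaignID = t.CampaignID
WHERE c.CampaignActive = 'Y'
  AND (t.CampaignID IS NULL OR t.CampCount < c.DailyMaxImpressions)
"""


def eligible_campaigns(conn, day):
    """All active campaigns still under their daily maximum on that day."""
    return sorted(row[0] for row in conn.execute(ELIGIBLE_SQL, (day,)))


def random_campaign(conn, day):
    """One eligible campaign picked at random, or None if there is none."""
    row = conn.execute(
        ELIGIBLE_SQL + " ORDER BY RANDOM() LIMIT 1", (day,)
    ).fetchone()
    return row[0] if row else None


print(eligible_campaigns(conn, "2016-01-01"))
print(random_campaign(conn, "2016-01-01"))
```

On 2016-01-01 campaign 5 already has 3 transactions against a daily maximum of 1, so it is excluded, while campaigns 3 and 4 qualify through the IS NULL branch because they have no transactions that day.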