I have a table rate with the following structure (approximate):
CREATE TABLE `rate` (
`id` int(11) PRIMARY KEY NOT NULL AUTO_INCREMENT,
`from` date NOT NULL,
`to` date NOT NULL
)
And an (approximately) identical table stop_sale:
CREATE TABLE `stop_sale` (
`id` int(11) PRIMARY KEY NOT NULL AUTO_INCREMENT,
`from` date NOT NULL,
`to` date NOT NULL
)
Considering that each row's time interval is the range of days it covers between its from and to fields:
I want to query these tables together in such a way that the time intervals do not overlap, with stop_sale taking priority where they do.
Example
rates
| id | from | to |
|----|--------------|--------------|
| 1 | "2018-01-05" | "2018-01-31" |
| 2 | "2018-02-01" | "2018-02-15" |
stop_sale
| id | from | to |
|----|--------------|--------------|
| 1 | "2018-01-11" | "2018-01-20" |
| 2 | "2018-02-01" | "2018-02-10" |
Desired Result
| rate_id | from | to |
|---------|--------------|--------------|
| 1 | "2018-01-05" | "2018-01-10" |
| 0 | "2018-01-11" | "2018-01-20" |
| 1 | "2018-01-21" | "2018-01-31" |
| 0 | "2018-02-01" | "2018-02-10" |
| 2 | "2018-02-11" | "2018-02-15" |
Notice how the rate with id=1 gets split into two records based on the time interval of the stop_sale with id=1 (note: the ids are not important, just the time intervals).
In other words, the stop_sale time intervals perform a subtraction operation upon the time intervals of rate, and also appear in the final result set themselves.
Is this possible with SQL? With MySQL?
If so, how efficient is such a query? Or is it better to handle this operation in PHP?
As far as I know there is no way to do this with just a SQL query. It could be solved iteratively within a stored function, but there is no clean option for returning the data; the best would be returning a delimited string.
An alternative is to build a stored procedure that periodically populates a table of the result data and have PHP query against that. The basic logic (sketched in code after the steps below) would be:
Pass in a starting date parameter.
Create a temp table with this query:
select * from rate
where `from` >= starting_date
union
select * from stop_sale
where `from` >= starting_date
order by `from` asc
Iterate through the temp table:
Get the first `from` value.
Get the next greater `to` value.
Look for a `from` value that is less than the current `to` value and greater than the current `from` value, but not equal to any `from` value already in the results table being populated.
If found, insert the current `from` value and the `to` value minus one day into the results table being populated.
Otherwise, insert the current `from` value and `to` value into the results table being populated.
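As far as that goes, here is a minimal skeleton of such a procedure. This is a sketch, not a drop-in solution: the tmp_intervals and rate_schedule names and the is_stop flag are assumptions, and the actual interval-subtraction step is left as a placeholder comment.

DELIMITER //
CREATE PROCEDURE populate_rate_schedule(IN starting_date DATE)
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE cur_from DATE;
  DECLARE cur_to DATE;
  DECLARE cur_stop TINYINT;
  DECLARE cur CURSOR FOR
    SELECT `from`, `to`, is_stop FROM tmp_intervals ORDER BY `from` ASC;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  DROP TEMPORARY TABLE IF EXISTS tmp_intervals;
  CREATE TEMPORARY TABLE tmp_intervals AS
    SELECT `from`, `to`, 0 AS is_stop FROM rate WHERE `from` >= starting_date
    UNION ALL
    SELECT `from`, `to`, 1 AS is_stop FROM stop_sale WHERE `from` >= starting_date;

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO cur_from, cur_to, cur_stop;
    IF done THEN
      LEAVE read_loop;
    END IF;
    -- The subtraction steps above go here: compare cur_from/cur_to against
    -- rows already written, trimming or splitting rate intervals around
    -- stop_sale intervals. As a placeholder this copies the row through.
    INSERT INTO rate_schedule (`from`, `to`, is_stop)
    VALUES (cur_from, cur_to, cur_stop);
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;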
This basic logic could be done in PHP, although it may be more complicated, since you would need to build an array of the rows returned from the temp table query above, run the logic on it, and build out an array of results. It would also be less efficient than the stored procedure: once the stored procedure has run, you only need to run it again on data newer than that run. Hence the starting_date parameter.
I want to update a field in a MySQL table "Persons" weekly, with the average difference between two fields of the "Tasks" table, end_date and start_date:
PERSON:
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| average_speed | int(11) | NO | | 0 | |
+----------------+-------------+------+-----+---------+-------+
TASKS:
+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| person_id | int(11) | NO | | NULL | |
| start_date | date | NO | | NULL | |
| end_date | date | NO | | NULL | |
+----------------+-------------+------+-----+---------+-------+
(tables are not complete).
average_speed = AVG(task.end_date - task.start_date)
Now, the Tasks table is really big, and **I don't want to compute the average on every task for every person every week**. (That's a solution, but I'm trying to avoid it).
What's the best way to update the average_speed?
I thought about adding two columns in the person's table:
"last_count": the count of tasks computed so far for each person
"last_sum": the running sum of (end_date - start_date) for each person
So that on each update I could do something like average_speed = (last_sum + new_sum) / (last_count + new_count), where new_count is the number of tasks in the last week.
Is there a better solution/architecture?
EDIT:
To answer a comment, the query I would run is something like this:
SELECT
count(t.id) as last_count,
sum(TIMESTAMPDIFF(MINUTE, t.start_date, t.end_date)) as last_sum,
avg(TIMESTAMPDIFF(MINUTE, t.start_date, t.end_date)) as avg_speed
from tasks as t
where t.end_date BETWEEN DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND CURDATE()
And I can rely on a PHP script to fetch the result and do some calculations.
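For reference, a minimal sketch of that incremental scheme, assuming persons has an id column plus the new last_count and last_sum columns (all names here are illustrative):

-- Step 1: fold last week's finished tasks into each person's running totals.
UPDATE persons p
JOIN (
    SELECT person_id,
           COUNT(*) AS new_count,
           SUM(TIMESTAMPDIFF(MINUTE, start_date, end_date)) AS new_sum
    FROM tasks
    WHERE end_date BETWEEN DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND CURDATE()
    GROUP BY person_id
) t ON t.person_id = p.id
SET p.last_count = p.last_count + t.new_count,
    p.last_sum   = p.last_sum + t.new_sum;

-- Step 2: the running average is simply total sum over total count.
UPDATE persons
SET average_speed = last_sum / last_count
WHERE last_count > 0;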
Having a periodic update to the table is a bad way to go for all the reasons you've listed above, and others.
If you have access to the code that writes to the Tasks table, that’s the best place to put the update. Add an Average field and calculate and set the value when you write the task end time.
If you don’t have access to the code, you can add a calculated field to the table that shows the average and let SQL figure it out during the execution of a query. This can slow queries down a little, but the data is always valid and SQL is smart enough to only calculate that value when it is needed.
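For example, here is a sketch using a generated column; this assumes MySQL 5.7+ and an illustrative column name duration_minutes:

-- A VIRTUAL column is computed on read; nothing extra is stored.
ALTER TABLE tasks
  ADD COLUMN duration_minutes INT
  GENERATED ALWAYS AS (TIMESTAMPDIFF(MINUTE, start_date, end_date)) VIRTUAL;

-- The average then becomes a plain aggregate over the generated column.
SELECT person_id, AVG(duration_minutes) AS average_speed
FROM tasks
GROUP BY person_id;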
A third (ugly) option is a trigger on the table that updates the value when appropriate. I’m not a fan of triggers because they hide business logic in unexpected places, but sometimes you just have to get the job done.
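If you do go that route, here is a hedged sketch that reuses the last_count/last_sum idea from the question (the trigger and column names are illustrative):

DELIMITER //
CREATE TRIGGER tasks_after_insert
AFTER INSERT ON tasks
FOR EACH ROW
BEGIN
    -- Single-table UPDATE assignments are evaluated left to right, so
    -- average_speed below sees the freshly updated totals.
    UPDATE persons
    SET last_count    = last_count + 1,
        last_sum      = last_sum + TIMESTAMPDIFF(MINUTE, NEW.start_date, NEW.end_date),
        average_speed = last_sum / last_count
    WHERE id = NEW.person_id;
END //
DELIMITER ;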
Here is a table Evact:
+--------------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-----------------------+------+-----+---------+-------+
| EvActMas | char(10) | NO | PRI | | |
| EvActSub | char(10) | NO | PRI | | |
| EvActCode | char(10) | NO | PRI | | |
| EvActIncOutg | enum('I','O','B','N') | YES | | NULL | |
| EvActBudAct | enum('B','A','O') | YES | | NULL | |
...other columns ...
and here are some records:
EvActMas EvActSub EvActCode EvActIncOutg EvActBudAct ..other..
Bank-2017 Incoming mth01 I A
Bank-2017 Incoming mth02 I A
Bank-2017 Incoming mth03 I A
Bank-2017 Incoming mth04 I A
Bank-2017 Incoming mth05 I A
Bank-2017 Incoming mth06 I A
I want to add six new records to the table where 'Incoming' is changed to 'Outgoing' and 'I' is changed to 'O'.
I did it the hard way by creating a new table from the old one, updating the new table, and then inserting back into Evact:
Create table btemp like Evact;
insert into btemp select * from Evact where EvActSub = 'Incoming';
update btemp set EvActSub = 'Outgoing', EvActIncOutg = 'O';
insert into Evact select * from btemp;
That worked, but I want to get better at SQL. What I wish for is a way to do this in one step by joining Evact to itself in some way. Does anyone have a suggestion?
If you want to insert a bunch of rows that are part copies of existing rows:
INSERT INTO evact
SELECT evactmas, 'Outgoing', evactcode, 'O', evactbudact, ...other..
FROM evact
WHERE evactsub = 'Incoming'
You write a SELECT statement that produces the data you want to insert; some columns in the select are the existing values as-is, other columns are the new values.
If you aren't specifying all the columns in the select, you'll have to put a list of column names in parentheses after the table name so MySQL knows which columns get what data. You can only omit the column list if your select query returns the same number of columns as the table has (in which case the selected columns must be in the same order as the table's columns).
If your table has a calculated primary key (auto-increment, for example), specify the value to insert as 0 or NULL to have MySQL calculate a new value for it, or name all the columns except that one after the table name and omit it from the select list.
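For instance, a version of the insert above that names the target columns explicitly (only the columns shown in the question are listed; this assumes the remaining ...other... columns accept defaults):

INSERT INTO evact (EvActMas, EvActSub, EvActCode, EvActIncOutg, EvActBudAct)
SELECT EvActMas, 'Outgoing', EvActCode, 'O', EvActBudAct
FROM evact
WHERE EvActSub = 'Incoming';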
I have a table with more than 50 million rows.
trackpoint:
+----+------------+-------------------+
| id | created_at | tag |
+----+------------+-------------------+
| 1 | 1484407910 | visitorDevice643 |
| 2 | 1484407913 | visitorDevice643 |
| 3 | 1484407916 | visitorDevice643 |
| 4 | 1484393575 | anonymousDevice16 |
| 5 | 1484393578 | anonymousDevice16 |
+----+------------+-------------------+
where 'created_at' is the timestamp of when the row was added.
And I have a list of timestamps, for example like this one:
timestamps = [1502744400, 1502830800, 1502917200]
I need to count the rows in every interval between timestamp i and timestamp i+1.
Using the Django ORM it looks like this:
step = 86400
for ts in timestamps[:-1]:
    trackpoint_set.filter(created_at__gte=ts, created_at__lt=ts + step).values('tag').distinct().count()
Because the actual timestamps list is much longer and the table has many rows, I end up getting a 500 timeout.
So my question is: how do I do this in ONE raw SQL query, joining the rows against the list of values, so the result looks like [(1502744400, 650), (1502830800, 1550), ...],
where the first value is the timestamp and the second is the count of unique tags in each interval?
First, index created_at. Next, build a query with a range condition like created_at >= timestamp AND created_at < timestamp + step. For each timestamp, run the query one by one rather than all at once.
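A hedged sketch of that approach (the index name and the 86400-second step are assumptions):

-- One-time: index created_at so each interval becomes a range seek.
CREATE INDEX idx_trackpoint_created_at ON trackpoint (created_at);

-- Run once per timestamp from the list (shown here for 1502744400):
SELECT COUNT(DISTINCT tag) AS unique_tags
FROM trackpoint
WHERE created_at >= 1502744400
  AND created_at < 1502744400 + 86400;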
INTRO: Given a table with a column 'time' of unique dates (or datetimes) and another column with some random integer called 'users'.
I usually do a call as such:
select `table`.dates, count(`table`.dates)
from `table`
group by year(`table`.dates), month(`table`.dates)
order by `table`.dates desc
which will return the number of users per month, albeit in an unformatted way. (I know it's not the standard way, but I check my values and this seems to work)
Here is my problem:
DATA: a table with non-unique year/month dates, and a corresponding user count on that row.
PROBLEM: I wish to sum the user counts for identical dates, and again show a user count for every month.
EDIT: Perhaps you can ignore the INTRO, and here is an example of the data I need to work with:
| Date    | user count |
|---------|------------|
| 2015-01 | 9          |
| 2014-09 | 5          |
| 2014-09 | 2          |
| 2014-08 | 5          |
| 2014-09 | 7          |
| 2014-08 | 2          |
| 2014-07 | 3          |
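A minimal sketch of the aggregation being asked for, assuming the data lives in a table named monthly_users with columns `date` and user_count (names are illustrative):

-- Sum the user counts for identical year-month values.
SELECT `date`, SUM(user_count) AS users
FROM monthly_users
GROUP BY `date`
ORDER BY `date` DESC;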
I have an INNODB table levels:
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id | int(9) | NO | PRI | NULL | |
| level_name | varchar(20) | NO | | NULL | |
| user_id | int(10) | NO | | NULL | |
| user_name | varchar(45) | NO | | NULL | |
| rating | decimal(5,4) | NO | | 0.0000 | |
| votes | int(5) | NO | | 0 | |
| plays | int(5) | NO | | 0 | |
| date_published | date | NO | MUL | NULL | |
| user_comment | varchar(255) | NO | | NULL | |
| playable_character | int(2) | NO | | 1 | |
| is_featured | tinyint(1) | NO | MUL | 0 | |
+--------------------+--------------+------+-----+---------+-------+
There are ~4 million rows. Because of the front-end functionality, I need to query this table with a variety of filters and sorts. They are on playable_character, rating, plays, and date_published. The date_published can be filtered to show by the last day, week, month, or anytime (last 3 years). There's also paging. So, depending on the user choices, the queries can look, for example, like one of these:
SELECT * FROM levels
WHERE playable_character = 0 AND
date_published BETWEEN date_sub(now(), INTERVAL 3 YEAR) AND now()
ORDER BY date_published DESC
LIMIT 0, 1000;
SELECT * FROM levels
WHERE playable_character = 4 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 WEEK) AND now()
ORDER BY rating DESC
LIMIT 4000, 1000;
SELECT * FROM levels
WHERE playable_character = 5 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 MONTH) AND now()
ORDER BY plays DESC
LIMIT 1000, 1000;
I started out with an index idx_date_char(date_published, playable_character) that works great on the first example query here -- basically anything that's ordering by date_published. Using EXPLAIN, I get 'using index condition', which is good. I think I understand why the index works, since the same two indexed columns exist in the WHERE and ORDER BY clauses.
My problem is with queries that ORDER by plays or rating. I understand I'm introducing a third column, but for the life of me I can't get an index that works well, despite trying just about every variation I could think of: composite indexes of all three or four in every order, and so on. Maybe the query could be written differently?
I should add that rating and plays are always queried as DESC. Only date_published may be either DESC or ASC.
Any suggestions greatly appreciated. TIA.
It seems you would make good use of data sorted in this way for each of the queries:
playable_character, date_published
playable_character, date_published, rating
playable_character, date_published, plays
Bear in mind that the data the first query needs sorted happens to be a prefix of what the second and third queries need, so we can drop the index for the first query.
Also note that adding DESC or ASC to an index is syntactically correct but doesn't actually change anything, as that feature is not currently supported (it is expected to be supported in the future, which is why the syntax is there). All indexes are stored in ascending order.
So these are the indexes that you should create:
ALTER TABLE levels ADD INDEX (playable_character, date_published, rating)
ALTER TABLE levels ADD INDEX (playable_character, date_published, plays)
That should make the 3 queries up there run faster than Forrest Gump.
The columns used in your WHERE clause and ORDER BY should be part of the index. I would have an index on
( playable_character, date_published DESC, rating DESC, plays DESC )
The reason I would put the playable character FIRST is that you want that ID matched first, then all the dates in question. The rating and plays columns are just along for the ride, to assist the ORDER BY clause.
Think of the index like this. If you have it ordered by date_published, then playable_character, think of a room of boxes. Each box has a date. Within the box for a given date, the records are in order of character. So, with 3 years' worth of data to go through, you have to open every box from the last 3 years and find the character you are looking for in each.
Now, think of it the other way. Each box is for one character, and within it all of that character's dates are pre-sorted. So, you go to one box, open it, move to the date range in question, and grab the records in the X-Y range you want. Then a simple ORDER BY of just those records applies.
When your query includes a range predicate like BETWEEN, the order of columns in your index is important.
First, include one or more columns referenced by equality predicates.
Next, include one column referenced by a range predicate.
Any further columns in the index after the column referenced by a range predicate cannot be used for other range predicates or for sorting.
If you have no range predicate, you can add a column for sort order.
So your first query can benefit from an index on (playable_character, date_published). The sorting should be a no-op because the optimizer will just fetch rows in the index order.
The second and third queries are bound to do a filesort, because you have a range predicate and then you're sorting by a different column. If you had had only equality predicates, you would be able to use the third column to avoid the filesort, but that doesn't work when you have a range predicate.
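For instance, with the index (playable_character, date_published, rating), a query with equality predicates only can read rows straight out of the index, scanning it backward for the DESC order (the single-day equality here is purely illustrative):

-- Both WHERE columns use equality, so the third index column (rating)
-- still provides the sort order; no filesort is needed.
SELECT * FROM levels
WHERE playable_character = 4
  AND date_published = '2018-01-05'
ORDER BY rating DESC
LIMIT 100;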
The best you can hope for is that the conditions reduce the size of the result set so that it can sort in memory without doing too many sort merge passes. You can help this by increasing sort_buffer_size, but be careful not to increase it too much, because it's allocated per thread.
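For example (session scope only, so other connections are unaffected; the 4 MB figure is just illustrative):

-- Raise the sort buffer for this session before the heavy query.
SET SESSION sort_buffer_size = 4 * 1024 * 1024;

SELECT * FROM levels
WHERE playable_character = 4
  AND date_published BETWEEN date_sub(now(), INTERVAL 1 WEEK) AND now()
ORDER BY rating DESC
LIMIT 4000, 1000;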
The ASC/DESC keywords in index definitions make no difference in MySQL.
See http://dev.mysql.com/doc/refman/5.6/en/create-index.html:
These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.