How to calculate a moving average in MySQL

I have an application that stores stock quotes into my MySQL database.
I have a table called stock_history:
mysql> desc stock_history;
+-------------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| date | date | NO | MUL | NULL | |
| close | decimal(12,5) | NO | MUL | NULL | |
| dmal_3 | decimal(12,5) | YES | MUL | NULL | |
+-------------------+---------------+------+-----+---------+----------------+
5 rows in set (0.01 sec)
These are all the values in this table:
mysql> select date, close, dmal_3 from stock_history order by date asc;
+------------+----------+----------+
| date | close | dmal_3 |
+------------+----------+----------+
| 2000-01-03 | 2.00000 | NULL |
| 2000-01-04 | 4.00000 | NULL |
| 2000-01-05 | 6.00000 | NULL |
| 2000-01-06 | 8.00000 | NULL |
| 2000-01-07 | 10.00000 | NULL |
| 2000-01-10 | 12.00000 | NULL |
| 2000-01-11 | 14.00000 | NULL |
| 2000-01-12 | 16.00000 | NULL |
| 2000-01-13 | 18.00000 | NULL |
| 2000-01-14 | 20.00000 | NULL |
+------------+----------+----------+
10 rows in set (0.01 sec)
I am guaranteed that there will be 0 or 1 record for each date.
Can I write a single query that will store the three-day moving average (i.e., the average of the closing prices of that day and the two previous trading days) in the dmal_3 field? How?
When the query is done, I want the table to look like this:
mysql> select date, close, dmal_3 from stock_history order by date asc;
+------------+----------+----------+
| date | close | dmal_3 |
+------------+----------+----------+
| 2000-01-03 | 2.00000 | NULL |
| 2000-01-04 | 4.00000 | NULL |
| 2000-01-05 | 6.00000 | 4.00000 |
| 2000-01-06 | 8.00000 | 6.00000 |
| 2000-01-07 | 10.00000 | 8.00000 |
| 2000-01-10 | 12.00000 | 10.00000 |
| 2000-01-11 | 14.00000 | 12.00000 |
| 2000-01-12 | 16.00000 | 14.00000 |
| 2000-01-13 | 18.00000 | 16.00000 |
| 2000-01-14 | 20.00000 | 18.00000 |
+------------+----------+----------+
10 rows in set (0.01 sec)

That is what I call a good challenge. My solution first builds a derived table that numbers the rows with a counter, in date order. I select everything from it and, for each row, run the same numbering query as a subquery, comparing the counter positions of the two to pick the current row and the two before it. Once that query works, it just needs an inner join with the actual table to do the update. Here is my solution:
update stock_history tb1
inner join
(
  select a.id,
         case when a.step < 3 then null
         else
           (select avg(b.close)
            from (
              select hh.*,
                     @stp:=@stp+1 stp
              from stock_history hh,
                   (select @sum:=0, @stp:=0) x
              order by hh.dt
              limit 17823232
            ) b
            where b.stp >= a.step-2 and b.stp <= a.step
           )
         end dmal_3
  from (select h1.*,
               @step:=@step+1 step
        from stock_history h1,
             (select @sum:=0, @step:=0) x
        order by h1.dt
        limit 17823232
       ) a
) x on tb1.id = x.id
set tb1.dmal_3 = x.dmal_3;
I changed some column names to make my test easier. Here is the working SQLFiddle: http://sqlfiddle.com/#!9/e7dc00/1
If you have any doubts, let me know so I can clarify!
Edit
The limit 17823232 clause is there in the subqueries because I don't know which version of MySQL you are on. Depending on the version (>= 5.7, not sure exactly), the optimizer ignores the inner order by clauses, and the query then doesn't work the way it should. I just chose an arbitrary big number; usually you can use the maximum allowed.
The only column whose name differs between your table and mine is the date one, which I named dt: date is a MySQL keyword, and a column with such a name should be quoted with backticks ( ` ). That is why I left it as dt in the query above.
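For comparison, the same rule can be checked outside the database. The snippet below is only a sketch in Python, not part of the answer above: moving_average is a made-up helper name, and the closes are the ten sample values from the question, assumed to be already ordered by date.

```python
from decimal import Decimal


def moving_average(closes, window=3):
    """Return one entry per close: None until `window` rows exist,
    then the mean of the current close and the window-1 closes before it."""
    averages = []
    for i in range(len(closes)):
        if i + 1 < window:
            # Not enough history yet, same as the NULL rows in the question.
            averages.append(None)
        else:
            chunk = closes[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages


# Sample closes from the question (2, 4, ..., 20), ordered by date.
closes = [Decimal(n) for n in range(2, 21, 2)]
dmal_3 = moving_average(closes)
```

With the sample closes, the first two entries are None and the rest run from 4 to 18, which matches the table the question asks for.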

Related

Mysql: How to create a column which is the difference between a column in a Table & another column in a View

In the database 'college2' there are 3 tables ('student', 'course' and 'enrolment') and one view, 'enrolment_status', which is created using the following command:
CREATE VIEW enrolment_status AS
SELECT code, COUNT(id)
FROM enrolment
GROUP BY code;
Explain command for 'course,enrolment and enrolment_status' results in:
mysql> EXPLAIN course;
+---------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-------------+------+-----+---------+-------+
| code | char(8) | NO | PRI | NULL | |
| name | varchar(90) | YES | MUL | NULL | |
| max_enrolment | char(2) | YES | | NULL | |
+---------------+-------------+------+-----+---------+-------+
3 rows in set (0.09 sec)
mysql> explain enrolment;
+-------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id | char(6) | YES | MUL | NULL | |
| code | char(8) | YES | MUL | NULL | |
+-------+---------+------+-----+---------+-------+
2 rows in set (0.02 sec)
mysql> explain enrolment_status;
+-----------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| code | char(8) | YES | | NULL | |
| COUNT(id) | bigint(21) | NO | | 0 | |
+-----------+------------+------+-----+---------+-------+
2 rows in set (0.18 sec)
The 'max_enrolment' column in the 'course' table is the maximum allowed number of students for each course, say 10 or 20.
The 'count(id)' column in the 'enrolment_status' view (not a table) is the actual number of students enrolled in each course.
The 'id' column in the 'enrolment' table is the id of a student enrolled in a course.
HERE'S MY QUESTION:
I want to have the number of seats left, which is the difference between the 'max_enrolment' column and the 'count(id)' column.
The number of seats left can be a standalone table or view, or a column added to any of the above tables. How can I do this?
I tried many commands, including the following,
CREATE VIEW seats_left AS (
SELECT course.code, course.max_enrolment - enrolment_status.count
FROM course, enrolment_status
WHERE course.code = enrolment_status.code);
...which gives me the following error message:
ERROR 1054 (42S22): Unknown column 'enrolment_status.count' in 'field list'
mysql> SELECT*FROM enrolment_status;
+----------+-----------+
| code | COUNT(id) |
+----------+-----------+
| COMP9583 | 7 |
| COMP9585 | 9 |
| COMP9586 | 7 |
| COMP9653 | 7 |
| COMP9654 | 7 |
| COMP9655 | 8 |
| COMP9658 | 7 |
+----------+-----------+
7 rows in set (0.00 sec)
mysql> SELECT code, max_enrolment FROM course;
+----------+---------------+
| code | max_enrolment |
+----------+---------------+
| COMP9583 | 10 |
| COMP9585 | 15 |
| COMP9586 | 15 |
| COMP9653 | 12 |
| COMP9654 | 10 |
| COMP9655 | 12 |
| COMP9658 | 12 |
+----------+---------------+
7 rows in set (0.00 sec)
+----------+---------------------+
| code | max_enrolment - cnt |
+----------+---------------------+
| COMP9583 | 9 |
| COMP9585 | 14 |
| COMP9586 | 14 |
| COMP9653 | 11 |
| COMP9654 | 9 |
| COMP9655 | 11 |
| COMP9658 | 11 |
+----------+---------------------+
7 rows in set (0.09 sec)
Try giving the count column an alias in the view.
CREATE VIEW enrolment_status AS
SELECT code, COUNT(id) count
FROM enrolment
GROUP BY code;
Then you should be able to do this:
CREATE VIEW seats_left AS (
SELECT course.code, course.max_enrolment - enrolment_status.count
FROM course, enrolment_status
WHERE course.code = enrolment_status.code);
If you cannot change the view, then you must use the exact same column name in the query, quoted with backticks:
CREATE VIEW seats_left AS (
SELECT course.code, course.max_enrolment - enrolment_status.`COUNT(id)`
FROM course, enrolment_status
WHERE course.code = enrolment_status.code);
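To see the alias idea end to end, here is a small self-contained sketch. It is an illustration under assumptions, not your schema: it runs on SQLite through Python's sqlite3 module rather than MySQL, the alias enrolled and the sample rows are invented, and max_enrolment is stored as an integer instead of char(2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE course (code TEXT PRIMARY KEY, max_enrolment INTEGER);
    CREATE TABLE enrolment (id TEXT, code TEXT);

    -- Invented sample rows, only for this sketch.
    INSERT INTO course VALUES ('COMP9583', 10), ('COMP9585', 15);
    INSERT INTO enrolment VALUES
        ('s1', 'COMP9583'), ('s2', 'COMP9583'), ('s3', 'COMP9585');

    -- The aggregate gets an alias, so other queries can refer to it by name.
    CREATE VIEW enrolment_status AS
        SELECT code, COUNT(id) AS enrolled FROM enrolment GROUP BY code;

    -- LEFT JOIN plus COALESCE keeps courses that have no enrolments yet.
    CREATE VIEW seats_left AS
        SELECT c.code,
               c.max_enrolment - COALESCE(s.enrolled, 0) AS seats_left
        FROM course AS c
        LEFT JOIN enrolment_status AS s ON s.code = c.code;
    """
)

# Map each course code to its remaining seats.
seats = dict(conn.execute("SELECT code, seats_left FROM seats_left"))
```

The LEFT JOIN with COALESCE is a choice made for this sketch; the comma join in the answer drops courses that have no rows in enrolment_status.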
Try this:
SELECT b.`code`,max_enrolment - cnt from
(select `code`, cnt from
(select count(1) as cnt,`code` from enrolment
GROUP BY `code`) as a) as a
LEFT JOIN
(SELECT code,max_enrolment from course) as b
on a.`code` = b.`code`
You can change the LEFT JOIN to a RIGHT JOIN if every course should be listed, including those with no enrolments.

mysql join with sub-query

This is my schema:
mysql> describe stocks;
+-----------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| symbol | varchar(32) | NO | | NULL | |
| date | datetime | NO | | NULL | |
| value | float(10,3) | NO | | NULL | |
| contracts | int(8) | NO | | NULL | |
| open | float(10,3) | NO | | NULL | |
| close | float(10,3) | NO | | NULL | |
| high | float(10,3) | NO | | NULL | |
| low | float(10,3) | NO | | NULL | |
+-----------+-------------+------+-----+---------+----------------+
9 rows in set (0.03 sec)
I added the columns open and low, and I want to fill them from the data already in the table.
The open/close values are per day (so the rows with the minimum and maximum id of each day should give me the correct values). My first idea is to get the list of dates and then left join it with the table:
SELECT DISTINCT(DATE(date)) as date FROM stocks
but I'm stuck because I can't get the max/min id or the first/last value. Thanks
The query below gives you the minimum and maximum id for each day:
SELECT DATE_FORMAT(date, "%d/%m/%Y"),min(id) as min_id,max(id) as max_id FROM stocks group by DATE_FORMAT(date, "%d/%m/%Y")
But the other requirement is not clear.
Solved!
mysql> UPDATE stocks s JOIN
-> (SELECT k.date, k.value as v1, y.value as v2
->  FROM (SELECT x.date, x.min_id, x.max_id, stocks.value
->        FROM (SELECT DATE(date) as date, min(id) as min_id, max(id) as max_id
->              FROM stocks group by DATE(date)) AS x
->        LEFT JOIN stocks ON x.min_id = stocks.id) AS k
->  LEFT JOIN stocks y ON k.max_id = y.id) sd
-> ON DATE(s.date) = sd.date
-> SET s.open = sd.v1, s.close = sd.v2;
Query OK, 995872 rows affected (1 min 50.38 sec)
Rows matched: 995872 Changed: 995872 Warnings: 0
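The idea behind that UPDATE (per day, take the value at the smallest id as the open and the value at the largest id as the close) can be sketched in a few lines of Python. This is only an illustration: open_close_by_day and the sample rows are invented, and it assumes, as the question does, that ids increase with time within a day.

```python
from datetime import datetime


def open_close_by_day(rows):
    """rows: iterable of (id, timestamp, value).
    Returns {day: (open, close)} using the min-id and max-id row of each day."""
    first = {}  # day -> (smallest id seen, its value)
    last = {}   # day -> (largest id seen, its value)
    for row_id, ts, value in rows:
        day = ts.date()
        if day not in first or row_id < first[day][0]:
            first[day] = (row_id, value)
        if day not in last or row_id > last[day][0]:
            last[day] = (row_id, value)
    return {day: (first[day][1], last[day][1]) for day in first}


# Invented sample rows: three ticks on one day, one tick on the next.
sample = [
    (1, datetime(2014, 5, 14, 9, 0), 10.5),
    (2, datetime(2014, 5, 14, 12, 0), 11.0),
    (3, datetime(2014, 5, 14, 17, 0), 10.8),
    (4, datetime(2014, 5, 15, 9, 0), 10.9),
]
result = open_close_by_day(sample)
```

A day with a single row gets the same value for open and close, which is also what the SQL version produces.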

mysql find missing items

I have a table with the following structure:
mysql> describe stock_prices;
+---------------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| code | varchar(16) | YES | MUL | NULL | |
| pricelist | varchar(10) | YES | MUL | NULL | |
| settlement_discount | tinyint(1) | YES | | NULL | |
| overal_discount | tinyint(1) | YES | | NULL | |
| sale | tinyint(1) | YES | | NULL | |
| price_blob | longtext | YES | | NULL | |
+---------------------+-------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)
When I run this query:
mysql> SELECT pricelist, count(pricelist) as dup from stock_prices group by pricelist having dup>1 order by dup;
+-----------+------+
| pricelist | dup |
+-----------+------+
| GMBH | 1843 |
| DISTCART | 2241 |
| DISTSTD | 2241 |
| CART | 2242 |
| USSD | 2242 |
| SPCA | 2242 |
| SPCB | 2242 |
| SPCC | 2242 |
| EUCN | 2242 |
| STD | 2242 |
| EUSD | 2242 |
| USCN | 2242 |
+-----------+------+
12 rows in set (0.03 sec)
All the pricelists should have the same count, but GMBH has 399 fewer and DISTCART and DISTSTD have 1 fewer.
Basically, I have codes that do not have an entry for every pricelist.
When I run:
mysql> SELECT code, count(code) as dup from stock_prices group by code having dup>1 order by dup;
+-------------+-----+
| code | dup |
+-------------+-----+
| XN44-CH2 | 9 |
| XN23-MGY1 | 11 |
| XN24-CH2 | 11 |
| XN25-VWH1 | 11 |
| XN36-BL2 | 11 |
| XN36-CH3 | 11 |
| XN37-BL3 | 11 |
| XN38-BC3 | 11 |
| XN38-CE3 | 11 |
....
So in this case XN44-CH2 is missing 3 pricelists and XN23-MGY1 is missing 1.
mysql> SELECT COUNT(pricelist) FROM stock_prices WHERE pricelist = 'GMBH';
+------------------+
| COUNT(pricelist) |
+------------------+
| 1843 |
+------------------+
1 row in set (0.00 sec)
What would be the correct way to find out which pricelists are missing for each code?
Any advice much appreciated.
Assuming there is a reference table for all the price lists and one for all the codes, you could do something like this in standard SQL:
SELECT
p.pricelist,
c.code
FROM
pricelists AS p
CROSS JOIN
codes AS c
EXCEPT
SELECT
pricelist,
code
FROM
stock_prices
;
That is, get all the combinations of the existing pricelists and codes and subtract those that are present in stock_prices. The result would be the missing pairs.
MySQL versions before 8.0.31 don't support EXCEPT, but you can implement the same logic with a LEFT JOIN:
SELECT
p.pricelist,
c.code
FROM
pricelists AS p
CROSS JOIN
codes AS c
LEFT JOIN
stock_prices AS s ON p.pricelist = s.pricelist
AND c.code = s.code
WHERE s.id IS NULL
;
If you do not have those reference tables, you could replace them with derived tables in this way:
pricelists ==> (SELECT DISTINCT pricelist FROM stock_prices)
codes ==> (SELECT DISTINCT code FROM stock_prices)
And the query would then look like this:
SELECT
p.pricelist,
c.code
FROM
(SELECT DISTINCT pricelist FROM stock_prices) AS p
CROSS JOIN
(SELECT DISTINCT code FROM stock_prices) AS c
LEFT JOIN
stock_prices AS s ON p.pricelist = s.pricelist
AND c.code = s.code
WHERE s.id IS NULL
;
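The logic of the last query (all combinations minus the ones that exist) is easy to check on a handful of rows. Here is a rough Python sketch of it; missing_pairs is a made-up name and the three sample rows are invented, reusing codes from the question only for flavour.

```python
from itertools import product


def missing_pairs(rows):
    """rows: iterable of (pricelist, code) pairs present in stock_prices.
    Returns the (pricelist, code) combinations that have no row."""
    present = set(rows)
    # Same role as the two SELECT DISTINCT derived tables.
    pricelists = {pricelist for pricelist, _ in present}
    codes = {code for _, code in present}
    # CROSS JOIN of both sets, minus what is already there.
    return sorted(set(product(pricelists, codes)) - present)


# Invented sample: XN23-MGY1 has no GMBH entry.
rows = [
    ("STD", "XN44-CH2"),
    ("GMBH", "XN44-CH2"),
    ("STD", "XN23-MGY1"),
]
missing = missing_pairs(rows)
```

As with the derived-table version of the query, a pricelist or code that never appears in the data at all cannot be reported as missing.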

How can I optimize this mysql query to find maximum simultaneous calls?

I'm trying to calculate maximum simultaneous calls. My query, which I believe to be accurate, takes way too long given ~250,000 rows. The cdrs table looks like this:
+---------------+-----------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| CallType | varchar(32) | NO | | NULL | |
| StartTime | datetime | NO | MUL | NULL | |
| StopTime | datetime | NO | | NULL | |
| CallDuration | float(10,5) | NO | | NULL | |
| BillDuration | mediumint(8) unsigned | NO | | NULL | |
| CallMinimum | tinyint(3) unsigned | NO | | NULL | |
| CallIncrement | tinyint(3) unsigned | NO | | NULL | |
| BasePrice | float(12,9) | NO | | NULL | |
| CallPrice | float(12,9) | NO | | NULL | |
| TransactionId | varchar(20) | NO | | NULL | |
| CustomerIP | varchar(15) | NO | | NULL | |
| ANI | varchar(20) | NO | | NULL | |
| ANIState | varchar(10) | NO | | NULL | |
| DNIS | varchar(20) | NO | | NULL | |
| LRN | varchar(20) | NO | | NULL | |
| DNISState | varchar(10) | NO | | NULL | |
| DNISLATA | varchar(10) | NO | | NULL | |
| DNISOCN | varchar(10) | NO | | NULL | |
| OrigTier | varchar(10) | NO | | NULL | |
| TermRateDeck | varchar(20) | NO | | NULL | |
+---------------+-----------------------+------+-----+---------+----------------+
I have the following indexes:
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cdrs | 0 | PRIMARY | 1 | id | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | id | 1 | id | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | call_time_index | 1 | StartTime | A | 269622 | NULL | NULL | | BTREE | | |
| cdrs | 1 | call_time_index | 2 | StopTime | A | 269622 | NULL | NULL | | BTREE | | |
+-------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
The query I am running is this:
SELECT MAX(cnt) AS max_channels FROM
(SELECT cl1.StartTime, COUNT(*) AS cnt
FROM cdrs cl1
INNER JOIN cdrs cl2
ON cl1.StartTime
BETWEEN cl2.StartTime AND cl2.StopTime
GROUP BY cl1.id)
AS counts;
It seems like I might have to chunk this data for each day and store the results in a separate table like simultaneous_calls.
I'm sure you want to know not only the maximum simultaneous calls, but when that happened.
I would create a table containing the timestamp of every individual minute
CREATE TABLE times (ts DATETIME PRIMARY KEY);
INSERT INTO times (ts) VALUES ('2014-05-14 00:00:00');
. . . until 1440 rows, one for each minute . . .
Then join that to the calls.
SELECT ts, COUNT(*) AS count FROM times
JOIN cdrs ON times.ts BETWEEN cdrs.starttime AND cdrs.stoptime
GROUP BY ts ORDER BY count DESC LIMIT 1;
Here's the result in my test (MySQL 5.6.17 on a Linux VM running on a Macbook Pro):
+---------------------+----------+
| ts | count(*) |
+---------------------+----------+
| 2014-05-14 10:59:00 | 1001 |
+---------------------+----------+
1 row in set (1 min 3.90 sec)
This achieves several goals:
Reduces the number of rows examined by two orders of magnitude.
Reduces the execution time from 3 hours+ to about 1 minute.
Also returns the actual timestamp when the highest count was found.
Here's the EXPLAIN for my query:
explain select ts, count(*) from times join cdrs on times.ts between cdrs.starttime and cdrs.stoptime group by ts order by count(*) desc limit 1;
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
| 1 | SIMPLE | times | index | PRIMARY | PRIMARY | 5 | NULL | 1440 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | cdrs | ALL | starttime | NULL | NULL | NULL | 260727 | Range checked for each record (index map: 0x4) |
+----+-------------+-------+-------+---------------+---------+---------+------+--------+------------------------------------------------+
Notice the figures in the rows column, and compare to the EXPLAIN of your original query. You can estimate the total number of rows examined by multiplying these together (but that gets more complicated if your query is anything other than SIMPLE).
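To make the minute-table idea concrete, here is a small Python sketch of the same join: one timestamp per minute of the day, each compared against every call with an inclusive BETWEEN. It is an illustration only; busiest_minute and the sample calls are invented.

```python
from datetime import datetime, timedelta


def busiest_minute(calls, day_start):
    """calls: list of (start, stop) datetimes.
    Returns (timestamp, count) for the minute with the most active calls."""
    # Same content as the times table: 1440 rows, one per minute.
    minutes = [day_start + timedelta(minutes=i) for i in range(1440)]
    best_ts, best_count = None, 0
    for ts in minutes:
        # Mirrors: times.ts BETWEEN cdrs.starttime AND cdrs.stoptime
        count = sum(1 for start, stop in calls if start <= ts <= stop)
        if count > best_count:
            best_ts, best_count = ts, count
    return best_ts, best_count


# Invented sample calls.
day = datetime(2014, 5, 14)
calls = [
    (day.replace(hour=10, minute=0, second=30), day.replace(hour=10, minute=5, second=10)),
    (day.replace(hour=10, minute=2), day.replace(hour=10, minute=3)),
    (day.replace(hour=10, minute=2, second=30), day.replace(hour=10, minute=10)),
    (day.replace(hour=11, minute=0), day.replace(hour=11, minute=1)),
]
best = busiest_minute(calls, day)
```

One thing the sketch makes visible: only calls active exactly on a minute boundary are counted, so a call that starts and stops inside the same minute is never seen by this approach.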
The inline view isn't strictly necessary. (You're right that EXPLAIN takes a long time on the query with the inline view: EXPLAIN materializes the inline view, i.e. runs the inline view query and populates the derived table, and only then gives an EXPLAIN on the outer query.)
Note that this query will return an equivalent result:
SELECT COUNT(*) AS max_channels
FROM cdrs cl1
JOIN cdrs cl2
ON cl1.StartTime BETWEEN cl2.StartTime AND cl2.StopTime
GROUP BY cl1.id
ORDER BY max_channels DESC
LIMIT 1
It still has to do all the work and probably doesn't perform any better, but the EXPLAIN should run a lot faster. (We expect to see "Using temporary; Using filesort" in the Extra column.)
The number of rows in the resultset is going to be the number of rows in the table (~250,000 rows), and those are going to need to be sorted, so that's going to be some time there. The bigger issue (my gut is telling me) is that join operation.
I'm wondering if the EXPLAIN (or performance) would be any different if you swapped the cl1 and cl2 in the predicate, i.e.
ON cl2.StartTime BETWEEN cl1.StartTime AND cl1.StopTime
I'm thinking that, just because I'd be tempted to try a correlated subquery. That's ~250,000 executions, and that's not likely going to be any faster...
SELECT ( SELECT COUNT(*)
FROM cdrs cl2
WHERE cl2.StartTime BETWEEN cl1.StartTime AND cl1.StopTime
) AS max_channels
, cl1.StartTime
FROM cdrs cl1
ORDER BY max_channels DESC
LIMIT 11
You could run an EXPLAIN on that, we're still going to see a "Using temporary; Using filesort", and it will also show the "dependent subquery"...
Obviously, adding a predicate on the cl1 table to cut down the number of rows returned (for example, checking only the past 15 days) should speed things up, but it doesn't get you the answer you want.
WHERE cl1.StartTime > NOW() - INTERVAL 15 DAY
(None of my musings here are sure-fire answers to your question, or solutions to the performance issue; they're just musings.)
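If pulling the start and stop times out of the database is an option, the peak can also be computed in one sorted pass instead of a self-join. To be clear, this is a different technique from the queries above (a sweep over start/stop events), sketched in Python under the assumption that an inclusive overlap is wanted, as with BETWEEN; max_simultaneous and the sample calls are invented.

```python
from datetime import datetime


def max_simultaneous(calls):
    """calls: list of (start, stop) datetimes. Returns the peak number of
    calls active at the same instant, treating both endpoints as inclusive."""
    events = []
    for start, stop in calls:
        # The middle element orders a start (0) before a stop (1) at the
        # same instant, so a call ending exactly when another begins overlaps.
        events.append((start, 0, 1))
        events.append((stop, 1, -1))
    events.sort()

    active = peak = 0
    for _, _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


# Invented sample calls: three overlap at 10:03:00.
calls = [
    (datetime(2014, 5, 14, 10, 0, 30), datetime(2014, 5, 14, 10, 5, 10)),
    (datetime(2014, 5, 14, 10, 2, 0), datetime(2014, 5, 14, 10, 3, 0)),
    (datetime(2014, 5, 14, 10, 3, 0), datetime(2014, 5, 14, 10, 10, 0)),
    (datetime(2014, 5, 14, 11, 0, 0), datetime(2014, 5, 14, 11, 1, 0)),
]
peak = max_simultaneous(calls)
```

The cost is dominated by the sort, so it grows roughly as n log n in the number of calls rather than with the number of overlapping pairs.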

Join by part of string

I have the following tables:
**visitors**
+---------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+---------+----------------+
| visitors_id | int(11) | NO | PRI | NULL | auto_increment |
| visitors_path | varchar(255) | NO | | | |
+---------------------+--------------+------+-----+---------+----------------+
**fedora_info**
+----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| pid | varchar(255) | NO | PRI | | |
| owner_uid | int(11) | YES | | NULL | |
+----------------+--------------+------+-----+---------+-------+
First I look for the visitors_path values that are related to specific pages:
SELECT visitors_id, visitors_path
FROM visitors
WHERE visitors_path REGEXP '[[:<:]]fedora/repository/.*:[0-9]+$';
The above query returns the expected result.
Now, the .*:[0-9]+ part in the above query corresponds to pid in the second table. I want to know the count of results of the above query, grouped by owner_uid from the second table.
How can I JOIN these tables?
EDIT
sample data:
visitors
+-------------+---------------------------------+
| visitors_id | visitors_path |
+-------------+---------------------------------+
| 4574 | fedora/repository/islandora:123 |
| 4575 | fedora/repository/islandora:123 |
| 4580 | fedora/repository/islandora:321 |
| 4681 | fedora/repository/islandora:321 |
| 4682 | fedora/repository/islandora:321 |
| 4704 | fedora/repository/islandora:321 |
| 4706 | fedora/repository/islandora:456 |
| 4741 | fedora/repository/islandora:456 |
| 4743 | fedora/repository/islandora:789 |
| 4769 | fedora/repository/islandora:789 |
+-------------+---------------------------------+
fedora_info
+-----------------+-----------+
| pid | owner_uid |
+-----------------+-----------+
| islandora:123 | 1 |
| islandora:321 | 2 |
| islandora:456 | 3 |
| islandora:789 | 4 |
+-----------------+-----------+
Expected result:
+-----------------+-----------+
| count | owner_uid |
+-----------------+-----------+
| 2 | 1 |
| 4 | 2 |
| 3 | 3 |
| 2 | 4 |
| 0 | 5 |
+-----------------+-----------+
I suggest you normalize your database. When inserting rows into visitors, extract the pid in the front-end language and put it in a separate column (e.g. fi_pid). Then you can join on it easily.
The following query might work for you, but it will be a little CPU intensive.
SELECT
COUNT(a.visitors_id) as `count`,
f.owner_uid
FROM (SELECT visitors_id,
visitors_path,
SUBSTRING(visitors_path, ( LENGTH(visitors_path) -
LOCATE('/', REVERSE(visitors_path)) )
+ 2) AS
pid
FROM visitors
WHERE visitors_path REGEXP '[[:<:]]fedora/repository/.*:[0-9]+$') AS `a`
JOIN fedora_info AS f
ON ( a.pid = f.pid )
GROUP BY f.owner_uid
The following query returns the expected result, but it's very slow (query took 9.6700 sec):
SELECT COUNT(t2.pid), t1.owner_uid
FROM fedora_info t1
JOIN (SELECT TRIM(LEADING 'fedora/repository/' FROM visitors_path) as pid
FROM visitors
WHERE visitors_path REGEXP '[[:<:]]fedora/repository/.*:[0-9]+$') t2 ON t1.pid = t2.pid
GROUP BY t1.owner_uid
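Both queries do the same two steps: cut the pid out of visitors_path, then count the matches per owner_uid. Here is a rough sketch of those steps in Python, using the sample data from the question. It is only an illustration: visits_per_owner is a made-up name, and Python's \b stands in for the [[:<:]] word boundary of the MySQL pattern.

```python
import re
from collections import Counter

# Captures the pid part (e.g. "islandora:123") at the end of the path.
PATH_RE = re.compile(r"\bfedora/repository/(.*:[0-9]+)$")


def visits_per_owner(paths, owner_by_pid):
    """Count matching visitor paths per owner_uid.
    Owners with no matching visit are reported with a count of 0."""
    counts = Counter({owner: 0 for owner in owner_by_pid.values()})
    for path in paths:
        match = PATH_RE.search(path)
        if match and match.group(1) in owner_by_pid:
            counts[owner_by_pid[match.group(1)]] += 1
    return dict(counts)


# Sample data from the question.
paths = (
    ["fedora/repository/islandora:123"] * 2
    + ["fedora/repository/islandora:321"] * 4
    + ["fedora/repository/islandora:456"] * 2
    + ["fedora/repository/islandora:789"] * 2
)
owner_by_pid = {
    "islandora:123": 1,
    "islandora:321": 2,
    "islandora:456": 3,
    "islandora:789": 4,
}
counts = visits_per_owner(paths, owner_by_pid)
```

Doing the extraction once, when the row is written, is the same idea as the separate fi_pid column suggested above; the join then no longer needs any string work.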