How to optimize this query with multiple substring and subquery - mysql

Okay, I'm working on a website right now that shows information about parts of electronic devices. These parts sometimes get a revision. The part number stays the same, but they append an A, B, C etc. to the part number, so the 'higher' the letter, the newer it is. A date is also added. So the table looks something like this:
------------------------------------------------------------
| Partcode | Description | Partdate |
------------------------------------------------------------
| 12345A | Some description 1 | 2009-11-10 |
| 12345B | Some description 2 | 2010-12-30 |
| 17896A | Some description 3 | 2009-01-12 |
| 12345C | Some description 4 | 2011-08-06 |
| 17896B | Some description 5 | 2009-07-10 |
| 12345D | Some description 6 | 2012-05-04 |
------------------------------------------------------------
What I need right now is the data from the newest revision of a part. So for this example I need:
12345D and 17896B
The query that someone built before me is something along these lines:
SELECT substring(Partcode, 1, 5) AS Part,
(
SELECT pt.Partcode
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
),
(
SELECT pt.Description
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
),
(
SELECT pt.Partdate
FROM Parttable pt
WHERE substring(pt.PartCode, 1, 5) = Part
ORDER BY pt.Partdate DESC
LIMIT 0,1
)
FROM Parttable
GROUP BY Part
As you will understand, this query is insanely slow and feels really inefficient. But I just can't get my head around how to optimize this query.
So I really hope someone can help.
Thanks in advance!
PS. I'm working on a MySQL database and before anyone asks, I can't change the database.

First: why not store the version in a separate column? That way you wouldn't need to call substring to extract it first. If you really need the code and version concatenated, I think it's good practice to do that at the end.
Then, in your place, I would first split the code and version, and simply use MAX in an aggregate query, like:
SELECT code, MAX(version) FROM
(SELECT SUBSTRING(Partcode, 1, 5) AS code,
SUBSTRING(Partcode, 6) AS version
FROM Parttable
)
AS part
GROUP BY code;
Note: I haven't tested this query, so you may need to fix a few parameters, like the substring indexes.
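If the whole row of the newest revision is needed (description and date as well), one option is to join the table back onto the per-part maximum. Below is a minimal sketch, not a tested MySQL solution: it uses Python's built-in sqlite3 module only so that it can be run as-is, with SQLite's substr() standing in for MySQL's SUBSTRING(). It assumes the part number is always the first 5 characters and the revision suffix is a single letter, so that MAX(Partcode) within a part picks the newest revision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Parttable (Partcode TEXT, Description TEXT, Partdate TEXT)")
conn.executemany(
    "INSERT INTO Parttable VALUES (?, ?, ?)",
    [
        ("12345A", "Some description 1", "2009-11-10"),
        ("12345B", "Some description 2", "2010-12-30"),
        ("17896A", "Some description 3", "2009-01-12"),
        ("12345C", "Some description 4", "2011-08-06"),
        ("17896B", "Some description 5", "2009-07-10"),
        ("12345D", "Some description 6", "2012-05-04"),
    ],
)

# The derived table finds the highest Partcode per 5-character part number;
# joining back on the full Partcode returns the complete newest row.
query = """
SELECT p.Partcode, p.Description, p.Partdate
FROM Parttable p
JOIN (
    SELECT MAX(Partcode) AS newest
    FROM Parttable
    GROUP BY substr(Partcode, 1, 5)
) latest ON latest.newest = p.Partcode
ORDER BY p.Partcode
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The table is scanned once for the grouping instead of running three correlated subqueries per part, which is where the original query spends its time.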

Implementing SUMIF() function from Excel to SQL

Lately, I have been learning how to use SQL in order to process data. Normally, I would use Python for that purpose, but SQL is required for the classes and I still very much struggle with using it comfortably in more complicated scenarios.
What I want to achieve is the same result as in the following screenshot in Excel:
Behaviour in Excel, that I want to implement in SQL
The formula I used in Excel:
=SUMIF(B$2:B2;B2;C$2:C2)
Sample of the table:
> select * from orders limit 5;
+------------+---------------+---------+
| ID | clientID | tonnage |
+------------+---------------+---------+
| 2005-01-01 | 872-13-44-365 | 10 |
| 2005-01-04 | 369-43-03-176 | 2 |
| 2005-01-05 | 408-24-90-350 | 2 |
| 2005-01-10 | 944-16-93-033 | 5 |
| 2005-01-11 | 645-32-78-780 | 14 |
+------------+---------------+---------+
The implementation is supposed to return similar results as following group by query:
select
orders.clientID as ID,
sum(orders.tonnage) as Tonnage
from orders
group by orders.clientID;
That is, return how much each client has purchased, but at the same time I want it to return each step of the addition as a separate record.
For instance:
Client A bought 350 in the first order and then 231 in the second one. In such a case the query would return something like this:
client A - 350 - 350 // first order
client A - 231 - 581 // second order
Example, how it would look like in Excel
I have already tried to use something like:
select
orders.clientID as ID,
sum(case when orders.clientID = <ID> then orders.tonnage end)
from orders;
But I got stuck quickly, since I would need to somehow dynamically change this <ID> and store its value in some kind of temporary variable, and I can't really figure out how to implement such a thing in SQL.
You can use a window function for a running sum (available in MySQL 8.0 and later).
In your case, use it like this:
select id, clientID, sum(tonnage) over (partition by clientID order by id) tonnageRunning
from orders
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=13a8c2d46b5ac22c5c120ac937bd6e7a
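To see the running sum on the "client A" example from the question, here is a small runnable sketch. It uses Python's sqlite3 module (SQLite 3.25 or newer is assumed, since older versions lack window functions) rather than MySQL, and the client names and the two "B" orders are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ID TEXT, clientID TEXT, tonnage INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2005-01-01", "A", 350),  # client A, first order
        ("2005-01-04", "B", 10),   # hypothetical second client
        ("2005-01-05", "A", 231),  # client A, second order
        ("2005-01-10", "B", 5),
    ],
)

# PARTITION BY restarts the sum for every client;
# ORDER BY ID makes it accumulate in order-date sequence.
running = conn.execute(
    """
    SELECT ID, clientID, tonnage,
           SUM(tonnage) OVER (PARTITION BY clientID ORDER BY ID) AS tonnageRunning
    FROM orders
    ORDER BY clientID, ID
    """
).fetchall()
for row in running:
    print(row)
```

This plays the same role as the growing range B$2:B2 in the SUMIF formula: each row only sees the rows of the same client up to and including itself.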

Random sampling using subquery and rand() gives unexpected results

Edit: If it makes any difference, I am using mysql 5.7.19.
I have a table A, and am trying to randomly sample on average 10% of the rows. I have decided that using rand() in a subquery, and then filtering out on that random result would do the trick, but it is giving unexpected results. When I print out the randomly generated value after filtering, I get random values that do not match my main query's "where" clause, so I suppose it is regenerating the random value in the outer select.
I guess I'm missing something to do with subqueries and when things are executed, but I'm really not sure what's going on.
Can anyone explain what I might be doing wrong? I've checked out this post: In which sequence are queries and sub-queries executed by the SQL engine? , and my subquery is correlated so I assume that my subquery is being executed first, and then the main query is filtering off of it. Given my assumptions, I do not understand why the result has values that should have been filtered away.
Query:
select
*
from
(
select
*,
rand() as rand_value
from
A
) a_rand
where
rand_value < 0.1;
Result:
--------------------------------------
| id | events | rand_value |
--------------------------------------
| c | 1 | 0.5512495763145849 | <- not what I expected
--------------------------------------
I am not able to reproduce this using this SQL Fiddle. Use that link and click the blue [Run SQL] button a few times.
CREATE TABLE Table1
(`x` int)
;
INSERT INTO Table1
(`x`)
VALUES
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
;
Query 1:
select
*
from (
select
*
, rand() as rand_value
from Table1
) a_rand
where
rand_value < 0.1
[Results]:
| x | rand_value |
|---|---------------------|
| 1 | 0.03006686086772649 |
| 1 | 0.09353976332912199 |
| 1 | 0.08519635823107917 |
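If the value really is being regenerated somewhere between the subquery and the outer select (I have not confirmed that this is the cause on 5.7.19), one way to rule it out is to store the random value once and filter on the stored copy. A minimal sketch of that idea follows, using Python's sqlite3 module instead of MySQL. The rand() function registered here is a stand-in for MySQL's, and the temporary table name a_rand and the 1000 sample rows are assumptions for illustration.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for MySQL's rand(): a uniform float in [0, 1).
conn.create_function("rand", 0, random.random)

conn.execute("CREATE TABLE A (id INTEGER PRIMARY KEY, events INTEGER)")
conn.executemany("INSERT INTO A (events) VALUES (?)", [(1,)] * 1000)

# Step 1: evaluate rand() exactly once per row and store the result.
conn.execute(
    "CREATE TEMP TABLE a_rand AS SELECT id, events, rand() AS rand_value FROM A"
)

# Step 2: filter on the stored value; nothing is regenerated at this point.
sample = conn.execute(
    "SELECT id, events, rand_value FROM a_rand WHERE rand_value < 0.1"
).fetchall()
print(len(sample), "of 1000 rows sampled")
```

Because the printed rand_value is read from the stored column, every returned row is guaranteed to satisfy the filter, at the cost of writing the table out once.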

SQL calculating difference between columns

I'm a bit of a newbie at SQL and I don't really understand what to do here, so any help is really appreciated. I have a table full of readings from different readers. There are about 500.000 of them, so I can't do this by hand.
I received the table without the difference in it. I managed to calculate it, but there's a bit of a problem there...
It looks a bit like this:
reader_id | date | reading | difference
1 | 01-01-2013 | 205 | 0
1 | 02-01-2013 | 210 | 5
1 | 03-01-2013 | 213 | 3
... | ... | ... | ...
1 | 31-12-2013 | 2451 | 4
2 | 01-01-2013 | 8543 | 6092
2 | 02-01-2013 | 8548 | 5
reader_id and date form the primary key. The combination is unique.
How can I make sure the difference doesn't get calculated when the previous row contained a different reader_id?
When querying my data with a query like this one, the data get skewed by the incorrect difference between the two reader_ids:
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
I just want to get the average difference for each reader.
Your query is perfectly good. I think something went wrong in your difference calculation. The first value for reader_id = 2, 6092, is the difference between the last reading from reader 1 and the first reading from reader 2, which I don't think makes sense. If I'm not mistaken, the difference value is the current day's reading minus the previous day's reading. Therefore you should set the difference value of the first reading of each reader to 0.
You can do this with the following query:
UPDATE table t
INNER JOIN (
SELECT reader_id, MIN(date) AS first_day
FROM table
GROUP BY reader_id
) AS tmp ON tmp.reader_id = t.reader_id AND tmp.first_day = t.date
SET t.difference = 0
Then
SELECT AVG(difference), reader_id FROM table GROUP BY reader_id
will do what you expect.
If you simply want the average difference, you can use the following query:
SELECT
reader_id,
(MAX(reading) - MIN(reading)) / COUNT(*) average_difference
FROM table
GROUP BY reader_id
ORDER BY reader_id;
It works on the logic that the total difference for a given reader_id should be equal to MAX(reading) - MIN(reading), provided the readings never decrease.
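If the difference column can be recomputed rather than patched, a window function partitioned by reader avoids the cross-reader value in the first place. This is only a sketch under assumptions: it runs on Python's sqlite3 module (SQLite 3.25+), the table name readings is made up, and the dates are written in ISO format so that they sort correctly as text. MySQL 8.0 has the same LAG() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (reader_id INTEGER, date TEXT, reading INTEGER, "
    "PRIMARY KEY (reader_id, date))"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, "2013-01-01", 205),
        (1, "2013-01-02", 210),
        (1, "2013-01-03", 213),
        (2, "2013-01-01", 8543),
        (2, "2013-01-02", 8548),
    ],
)

# LAG() only looks at earlier rows of the same reader_id, so the first
# reading of each reader has no predecessor (NULL) and COALESCE turns it into 0.
diffs = conn.execute(
    """
    SELECT reader_id, date, reading,
           COALESCE(reading - LAG(reading) OVER (
               PARTITION BY reader_id ORDER BY date), 0) AS difference
    FROM readings
    ORDER BY reader_id, date
    """
).fetchall()
for row in diffs:
    print(row)
```

With this, the first row of reader 2 gets 0 instead of 6092, and AVG(difference) grouped by reader_id is no longer skewed.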

MySQL search query ordered by match relevance

I know basic MySQL querying, but I have no idea how to achieve an accurate and relevant search query.
My table looks like this:
id | kanji
-------------
1 | 一子
2 | 一人子
3 | 一私人
4 | 一時
5 | 一時逃れ
I already have this query:
SELECT * FROM `definition` WHERE `kanji` LIKE '%一%'
The problem is that I want to order the results by the learnt characters, 一 being a required character for the results of this query.
Say, a user knows those characters: 人,子,時
Then, I want the results to be ordered that way:
id | kanji
-------------
2 | 一人子
1 | 一子
4 | 一時
3 | 一私人
5 | 一時逃れ
The result which matches the most learnt characters should be first. If possible, I'd like to show results that contain only learnt characters first, then a mix of learnt and unknown characters.
How do I do that?
Per your preference, this orders by the number of unmatched characters (increasing), and then by the number of matched characters (decreasing).
SELECT *,
(kanji LIKE '%人%')
+ (kanji LIKE '%子%')
+ (kanji LIKE '%時%') score
FROM `definition`
WHERE `kanji` LIKE '%一%'
ORDER BY CHAR_LENGTH(kanji) - score, score DESC
Or, the relational way to do it is to normalize. Create the table like this:
kanji_characters
kanji_id | index | character
----------------------------
1 | 0 | 一
1 | 1 | 子
2 | 0 | 一
2 | 1 | 人
2 | 2 | 子
...
Then
SELECT kanji_id,
COUNT(*) length,
SUM(CASE WHEN `character` IN ('人','子','時') THEN 1 ELSE 0 END) score
FROM kanji_characters
WHERE `index` <> 0
AND kanji_id IN (SELECT kanji_id FROM kanji_characters WHERE `index` = 0 AND `character` = '一')
GROUP BY kanji_id
ORDER BY length - score, score DESC
Though you didn't specify what should be done in the case of duplicate characters. The two solutions above handle that differently.
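To check the first (LIKE-based) ordering against the sample data from the question, here is a small runnable sketch. It uses Python's sqlite3 module instead of MySQL, so length() stands in for CHAR_LENGTH(), and it builds the score expression from a list of learnt characters. The final tie-breaker on id is my own addition, so that rows with equal scores come out in a fixed order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE definition (id INTEGER PRIMARY KEY, kanji TEXT)")
conn.executemany(
    "INSERT INTO definition VALUES (?, ?)",
    [(1, "一子"), (2, "一人子"), (3, "一私人"), (4, "一時"), (5, "一時逃れ")],
)

learnt = ["人", "子", "時"]
required = "一"

# One "(kanji LIKE ?)" term per learnt character; each evaluates to 0 or 1.
score_expr = " + ".join("(kanji LIKE ?)" for _ in learnt)
sql = f"""
SELECT id, kanji
FROM (
    SELECT id, kanji, length(kanji) AS len, {score_expr} AS score
    FROM definition
    WHERE kanji LIKE ?
)
ORDER BY len - score, score DESC, id
"""
params = [f"%{c}%" for c in learnt] + [f"%{required}%"]
ordered = conn.execute(sql, params).fetchall()
for row in ordered:
    print(row)
```

Note that the required character 一 counts as unmatched here for every row, which shifts all rows by the same amount and so does not change the order.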
Just a thought, but a full-text index may help. You can get a score back like this:
SELECT MATCH(`kanji`) AGAINST ('your search' IN NATURAL LANGUAGE MODE) AS relevance
FROM `definition` WHERE MATCH(`kanji`) AGAINST ('your search' IN NATURAL LANGUAGE MODE)
ORDER BY relevance DESC, CHAR_LENGTH(`kanji`)
The trick is to index these terms (or words?) the right way. I think the general trick is to wrap each word in double quotes and put a space between them. This way the tokenizer will populate the index the way you want. Of course you would need to add the quotes on the way in and remove them on the way out.
Hope this doesn't bog you down.

How can I get the difference between the individual maximum values of different days?

I am new to MySQL, and I am trying to find:
The difference between a given day's maximum value occurred and the previous day's maximum value.
I was able to get the maximum values for dates via:
select max(`bundle_count`), `Production_date`
from `table`
group by `Production_date`
But I don't know how to use SQL to calculate the differences between maximums for two given dates.
I am expecting output like this
Please help me.
Update 1: Here is a fiddle, http://sqlfiddle.com/#!2/818ad/2, that I used for testing.
Update 2: Here is a fiddle, http://sqlfiddle.com/#!2/3f78d/10 that I used for further refining/fixing, based on Sandy's comments.
Update 3: For some reason the case where there is no previous day was not being dealt with correctly. I thought it was. However, I've updated to make sure that works (a bit cumbersome--but it appears to be right. Last fiddle: http://sqlfiddle.com/#!2/3f78d/45
I think @Grijesh conceptually got you the main thing you needed via the self-join of the input data (so make sure you vote up his answer!). I've cleaned up his query a bit on syntax (building off of his query!):
SELECT
DATE(t1.`Production_date`) as theDate,
MAX( t1.`bundle_count` ) AS 'max(bundle_count)',
MAX( t1.`bundle_count` ) -
IF(
EXISTS
(
SELECT date(t2.production_date)
FROM input_example t2
WHERE t2.machine_no = 1 AND
date_sub(date(t1.production_date), interval 1 day) = date(t2.production_date)
),
(
SELECT MAX(t3.bundle_count)
FROM input_example t3
WHERE t3.machine_no = 1 AND
date_sub(date(t1.production_date), interval 1 day) = date(t3.production_date)
GROUP BY DATE(t3.production_date)
), 0
)
AS Total_Bundles_Used
FROM `input_example` t1
WHERE t1.machine_no = 1
GROUP BY DATE( t1.`production_date` )
Note 1: I think @Grijesh and I were cleaning up the query syntax issues at the same time. It's encouraging that we ended up with very similar versions after we were both doing cleanup. My version differs in using IFNULL() for when there is no preceding data. I also ended up with a DATE_SUB, and I made sure to reduce the various dates to mere dates without a time component, via DATE().
Note 2: I originally had not fully understood your source tables, so I thought I needed to implement a running count in the query. But upon better inspection, it's clear that your source data already has a running count, so I took that stuff back out.
I am not sure, but you need something like this. I hope it will be helpful to you to some extent:
Try this:
SELECT t1.`Production_date` ,
MAX(t1.`bundle_count`) - MAX(t2.`bundle_count`) ,
COUNT(t1.`bundle_count`)
FROM `table_name` AS t1
INNER JOIN `table_name` AS t2
ON ABS(DATEDIFF(t1.`Production_date` , t2.`Production_date`)) = 1
GROUP BY t1.`Production_date`
EDIT
I created a table named 'table_name', as below:
mysql> SELECT * FROM `table_name`;
+---------------------+--------------+
| Production_date | bundle_count |
+---------------------+--------------+
| 2004-12-01 20:37:22 | 1 |
| 2004-12-01 20:37:22 | 2 |
| 2004-12-01 20:37:22 | 3 |
| 2004-12-02 20:37:22 | 2 |
| 2004-12-02 20:37:22 | 5 |
| 2004-12-02 20:37:22 | 7 |
| 2004-12-03 20:37:22 | 6 |
| 2004-12-03 20:37:22 | 7 |
| 2004-12-03 20:37:22 | 2 |
| 2004-12-04 20:37:22 | 1 |
| 2004-12-04 20:37:22 | 9 |
+---------------------+--------------+
11 rows in set (0.00 sec)
My query: to find difference in bundle_count between two consecutive dates:
SELECT t1.`Production_date` ,
MAX(t2.`bundle_count`) - MAX(t1.`bundle_count`) ,
COUNT(t1.`bundle_count`)
FROM `table_name` AS t1
INNER JOIN `table_name` AS t2
ON ABS(DATEDIFF(t1.`Production_date` , t2.`Production_date`)) = 1
GROUP BY t1.Production_date;
its output:
+---------------------+-------------------------------------------------+--------------------------+
| Production_date | MAX(t2.`bundle_count`) - MAX(t1.`bundle_count`) | COUNT(t1.`bundle_count`) |
+---------------------+-------------------------------------------------+--------------------------+
| 2004-12-01 20:37:22 | 4 | 9 |
| 2004-12-02 20:37:22 | 0 | 18 |
| 2004-12-03 20:37:22 | 2 | 15 |
| 2004-12-04 20:37:22 | -2 | 6 |
+---------------------+-------------------------------------------------+--------------------------+
4 rows in set (0.00 sec)
This is PostgreSQL syntax (sorry; it's what I'm familiar with) but should fundamentally work in either database. Note this doesn't exactly run in PostgreSQL either because group is not a valid table name (it's a reserved keyword). The approach is a self-join as others have mentioned but I've used a view to handle the max-by-day and the difference as separate steps.
create view max_by_day as
select
date_trunc('day', production_date) as production_date,
max(bundle_count) as bundle_count
from
group
group by
date_trunc('day', production_date);
select
today.production_date as production_date,
today.bundle_count,
today.bundle_count - coalesce(yesterday.bundle_count, 0)
from
max_by_day as today
left join max_by_day yesterday on (yesterday.production_date = today.production_date - '1 day'::interval)
order by
production_date;
PostgreSQL also has a construct called window functions which is useful for this and a bit easier to understand. Just had to stick in a bit of advocacy for a superior database. :-P
select
date_trunc('day', production_date),
max(bundle_count),
max(bundle_count) - lag(max(bundle_count), 1, 0)
over
(order by date_trunc('day', production_date))
from
group
group by
date_trunc('day', production_date);
These two approaches differ in how they handle missing days in the data - the first will treat it as a 0, the second will use the previous day which is present. There wasn't a case like this in your sample so I don't know if this is something you care about.
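For anyone who wants to try the window-function approach without a PostgreSQL server, here is a runnable sketch on the sample table_name data posted in the other answer. It is an adaptation, not the query above: it uses Python's sqlite3 module (SQLite 3.25+ assumed), date() instead of date_trunc(), and a common table expression named max_by_day in place of the view. MySQL 8.0 accepts the same WITH and LAG() constructs with DATE() for the truncation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (Production_date TEXT, bundle_count INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [
        ("2004-12-01 20:37:22", 1), ("2004-12-01 20:37:22", 2),
        ("2004-12-01 20:37:22", 3), ("2004-12-02 20:37:22", 2),
        ("2004-12-02 20:37:22", 5), ("2004-12-02 20:37:22", 7),
        ("2004-12-03 20:37:22", 6), ("2004-12-03 20:37:22", 7),
        ("2004-12-03 20:37:22", 2), ("2004-12-04 20:37:22", 1),
        ("2004-12-04 20:37:22", 9),
    ],
)

# Step 1 (CTE): maximum bundle_count per calendar day.
# Step 2: subtract the previous day's maximum; LAG's default of 0
# covers the first day, which has no previous row.
daily = conn.execute(
    """
    WITH max_by_day AS (
        SELECT date(Production_date) AS day, MAX(bundle_count) AS day_max
        FROM table_name
        GROUP BY date(Production_date)
    )
    SELECT day, day_max,
           day_max - LAG(day_max, 1, 0) OVER (ORDER BY day) AS diff
    FROM max_by_day
    ORDER BY day
    """
).fetchall()
for row in daily:
    print(row)
```

As with the LAG() query above, a missing day is not treated as 0 here: the difference is taken against the previous day that is present in the data.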