MySQL multi-step GROUP BY without subquery - mysql

I'm working on improving some queries I inherited, and was curious if it was possible to do the following - given a table the_table that looks like this:
id uri
---+-------------------------
1 /foo/bar/x
1 /foo/bar/y
1 /foo/boo
2 /alpha/beta/carotine
2 /alpha/delic/ipa
3 /plastik/man/spastik
3 /plastik/man/krakpot
3 /plastik/man/helikopter
As an implicit intermediate step I'd like to group these by the 1st + 2nd tuple (the first two path segments) of uri. The results of that step would look like:
id base
---+---------------
1 /foo/bar
1 /foo/boo
2 /alpha/beta
2 /alpha/delic
3 /plastik/man
And the final result would reflect the number of unique tuple1 + tuple2 values, per unique id:
id cnt
---+-----
1 2
2 2
3 1
I can achieve these results, but not without doing a subquery (to get the results of the implicit step mentioned above), and then select/grouping out of that. Something like:
SELECT
id,
count(base) cnt
FROM (
SELECT
id,
substring_index(uri, '/', 3) AS base
FROM the_table
GROUP BY id, base
) AS t
GROUP BY id;
My reason for wanting to avoid the subquery is that I'm working with a fairly large (20M rows) data set, and the subquery gets very expensive. Gut tells me it's not doable, but figured I'd ask SO...

There's no need for a subquery -- you can use count with distinct to achieve the same result:
SELECT
id,
count(distinct substring_index(uri, '/', 3)) AS cnt
FROM the_table
GROUP BY id
BTW -- this returns count of 1 for id 3 -- I assume that was a typo in your posting.
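For anyone who wants to check the COUNT(DISTINCT ...) approach without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no SUBSTRING_INDEX, so the small Python function registered below is a stand-in written for this example (an assumption, not MySQL's own implementation); the table and rows are the ones from the question.

```python
import sqlite3

def substring_index(s, delim, count):
    """Illustrative stand-in for MySQL's SUBSTRING_INDEX (not built into SQLite)."""
    if s is None:
        return None
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything after the count-th delimiter from the right
    return ""

conn = sqlite3.connect(":memory:")
conn.create_function("substring_index", 3, substring_index)
conn.execute("CREATE TABLE the_table (id INTEGER, uri TEXT)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?)",
    [
        (1, "/foo/bar/x"), (1, "/foo/bar/y"), (1, "/foo/boo"),
        (2, "/alpha/beta/carotine"), (2, "/alpha/delic/ipa"),
        (3, "/plastik/man/spastik"), (3, "/plastik/man/krakpot"),
        (3, "/plastik/man/helikopter"),
    ],
)

# One pass over the table: count the distinct two-segment prefixes per id.
rows = conn.execute(
    """
    SELECT id, COUNT(DISTINCT substring_index(uri, '/', 3)) AS cnt
    FROM the_table
    GROUP BY id
    ORDER BY id
    """
).fetchall()
print(rows)
```

Whether this is actually faster than the subquery on 20M rows depends on the MySQL version and indexes, so it is worth comparing both with EXPLAIN.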

Related

How to treat missing id 's value as 0 and order by it?

When I use the IN keyword in SQL, some ids may be missing from the table, but I want to treat them as if they exist, with the other columns being NULL or 0.
For example, suppose I have a table with two columns and some rows:
[id,value1]
1      1
2      4
3      3
5      5
I may write sql like this:
select * from table where id in (1,4,5) order by value1 limit 0,2 ;
When this sql is executed, the return result is [(1,1),(5,5)].
But what I want is [(4,0),(1,1)], because I want to treat the missing id 4 like it exists in the table.
So the question is: is there an elegant way to achieve this in SQL, instead of selecting all rows and sorting them in memory?
Use a left join:
select *
from (select 1 as id union all
select 4 union all
select 5
) i left join
table t
using (id)
order by t.value1
limit 0, 2 ;
Note that you are ordering by a value in the existing table, so this depends on the fact that NULL is ordered before other values.
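If the missing id should show up with 0 rather than NULL, as in the [(4,0),(1,1)] result the question asks for, COALESCE can be applied to both the output and the sort key. A minimal sketch with Python's sqlite3 module; the table name items is a placeholder for this example (the question just calls it table), and COALESCE behaves the same way in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" is a hypothetical name standing in for the question's table.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value1 INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 1), (2, 4), (3, 3), (5, 5)])

rows = conn.execute(
    """
    SELECT i.id, COALESCE(t.value1, 0) AS value1
    FROM (SELECT 1 AS id UNION ALL
          SELECT 4 UNION ALL
          SELECT 5) AS i
    LEFT JOIN items AS t ON t.id = i.id
    ORDER BY COALESCE(t.value1, 0), i.id   -- missing ids sort as 0
    LIMIT 2
    """
).fetchall()
print(rows)
```

Sorting on the COALESCE expression makes the ordering explicit instead of relying on where NULLs happen to sort.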

Creating new data table on existing one

Hello, I have a question: how (if it is possible) can I create a new data table with almost the same rows, except that a row is split when its column_value contains the string "/"? For example:
ID      column_param  column_sym  column_value  column_val2
-------+-------------+-----------+-------------+------------
First   param_test1   ABC         11/12         test
Second  param_test2   CDE         22/11         test
Third   param_test3   EFG         44            teste
4'th    param_test4   HIJ         33/22         test
Here, if I pick param_test1 and param_test4 and the column_value of those rows contains "/", I want to create two rows in place of each; if I do not pick param_test2, it should stay as it is. Everything should go into the new data table. Is there any way to do this?
Thank you in advance.
Expected result:
As per Gordon's answer, I'm not sure what should be done with your ID column.
I've replaced these with row numbers.
Depending on your version of MySQL/MariaDB, the ROW_NUMBER() window function may not be available. Depending on whether IDs in the output are necessary you may be able to simply omit this.
I've assumed the existence of a table called myNumbers which contains a single field num and is populated with positive integers from 1 to whatever you're likely to need.
I've included more in the output than you asked for, which will hopefully help you understand what's going on:
SELECT
ROW_NUMBER() OVER (ORDER BY d.ID, n.num) as NewID,
d.ID as OriginalID,
n.num,
d.column_param,
d.column_sym,
d.column_value as orig_value,
CASE WHEN column_param = 'param_test2' THEN d.column_value
ELSE substring_index(substring_index(d.column_value,'/',n.num),'/',-1) END as split_value,
d.column_val2
FROM
myData d
JOIN myNumbers n on char_length(d.column_value)-char_length(replace(d.column_value,'/','')) >= n.num-1
WHERE
n.num = 1 OR d.column_param <> 'param_test2'
ORDER BY
d.ID,
n.num
See this DB Fiddle (the columns output in a different order than I've specified, but I think that's a DB Fiddle quirk).
If you only want to "split", say, the param_test1 and param_test4 rows, the code above could be amended as follows:
SELECT
ROW_NUMBER() OVER (ORDER BY d.ID, n.num) as NewID,
d.ID as OriginalID,
d.column_param,
d.column_sym,
n.num,
d.column_value as orig_value,
CASE WHEN column_param NOT IN ('param_test1','param_test4') THEN d.column_value
ELSE substring_index(substring_index(d.column_value,'/',n.num),'/',-1) END as split_value,
d.column_val2
FROM
myData d
JOIN myNumbers n on char_length(d.column_value)-char_length(replace(d.column_value,'/','')) >= n.num-1
WHERE
n.num = 1 OR d.column_param IN ('param_test1','param_test4')
ORDER BY
d.ID,
n.num
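The numbers-table join above can be tried out locally with Python's sqlite3 module. This is only a sketch under a few assumptions: integer IDs replace the question's First/Second labels, length() stands in for MySQL's char_length(), and substring_index is a small Python function registered for this example because SQLite does not ship one.

```python
import sqlite3

def substring_index(s, delim, count):
    """Illustrative stand-in for MySQL's SUBSTRING_INDEX (not built into SQLite)."""
    if s is None:
        return None
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

conn = sqlite3.connect(":memory:")
conn.create_function("substring_index", 3, substring_index)
conn.execute(
    "CREATE TABLE myData (ID INTEGER, column_param TEXT, column_sym TEXT,"
    " column_value TEXT, column_val2 TEXT)"
)
conn.executemany(
    "INSERT INTO myData VALUES (?, ?, ?, ?, ?)",
    [
        (1, "param_test1", "ABC", "11/12", "test"),
        (2, "param_test2", "CDE", "22/11", "test"),
        (3, "param_test3", "EFG", "44", "teste"),
        (4, "param_test4", "HIJ", "33/22", "test"),
    ],
)
conn.execute("CREATE TABLE myNumbers (num INTEGER)")
conn.executemany("INSERT INTO myNumbers VALUES (?)", [(n,) for n in range(1, 6)])

# Each row joins to as many numbers as it has '/'-separated parts;
# rows that must not be split keep only num = 1 and their original value.
rows = conn.execute(
    """
    SELECT d.ID, d.column_param,
           CASE WHEN d.column_param NOT IN ('param_test1', 'param_test4')
                THEN d.column_value
                ELSE substring_index(substring_index(d.column_value, '/', n.num), '/', -1)
           END AS split_value
    FROM myData d
    JOIN myNumbers n
      ON length(d.column_value) - length(replace(d.column_value, '/', '')) >= n.num - 1
    WHERE n.num = 1 OR d.column_param IN ('param_test1', 'param_test4')
    ORDER BY d.ID, n.num
    """
).fetchall()
for row in rows:
    print(row)
```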
I don't know how the id is being set, but you can do what you want using union all:
select column_param, column_sym,
substring_index(column_value, '/', 1) as column_value,
column_val2
from t
union all
select column_param, column_sym,
substring_index(column_value, '/', -1) as column_value,
column_val2
from t
where column_value like '%/%';

Minus the value based on data using MySQL

I have the following data.
What I need is shown below.
For each branch, I need to subtract the order_by 2 value from the order_by 1 value.
Example: (1 - 2), and the result should be displayed with order_by 3.
If a branch only has order_by 1, display it as it is.
Using MySQL, how can I get this result?
You can get this result with a UNION query. The first part selects all rows from your table, the second uses a self-join to find branches which have order_by values of both 1 and 2, and subtracts their due values to get the new due value:
SELECT *
FROM data
UNION ALL
SELECT 3, d1.branch, d1.due - d2.due
FROM data d1
JOIN data d2 ON d2.branch = d1.branch AND d2.order_by = 2
WHERE d1.order_by = 1
ORDER BY branch, order_by
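Since the original sample data was posted as an image, the rows below are made up purely for illustration (branch A has order_by 1 and 2, branch B has only order_by 1), and the column order order_by, branch, due is an assumption inferred from the SELECT 3, d1.branch, ... part of the query. A minimal sketch with Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows; the real table was only shown as an image.
conn.execute("CREATE TABLE data (order_by INTEGER, branch TEXT, due INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, "A", 100), (2, "A", 30), (1, "B", 50)],
)

rows = conn.execute(
    """
    SELECT order_by, branch, due
    FROM data
    UNION ALL
    -- extra row (order_by = 3) only for branches that have both 1 and 2
    SELECT 3, d1.branch, d1.due - d2.due
    FROM data d1
    JOIN data d2 ON d2.branch = d1.branch AND d2.order_by = 2
    WHERE d1.order_by = 1
    ORDER BY branch, order_by
    """
).fetchall()
for row in rows:
    print(row)
```

Branch B has no order_by 2 row, so the inner join produces no difference row for it and it is displayed as it is.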

Comparing attributes from the same table using SQL query

I have a contents table and the entries in it are as shown in the attached figure.
There are more than 100,000 entries. I want to fetch the data where the update_date for commit=0 is greater than update_date for commit=1. I also need the corresponding row for commit=1.
I tried a few things, but they take a long time to retrieve the results. What is the best SQL query I can use? I am using a MySQL database.
EDIT
I have now updated the table. There is an attribute called content_id which binds the rows together.
A query like this gives me half of what I want
select a.* from contents a, contents b where
a.content_id=b.content_id and
a.update_date > b.update_date and
a.committed=0 and b.committed=1
I also want the corresponding entries from committed=1, but they should be appended at the bottom as rows and not concatenated side by side as columns.
For example, I cannot use
select * from contents a, contents b where
a.content_id=b.content_id and
a.update_date > b.update_date and
a.committed=0 and b.committed=1
because the results from 'b' are appended as extra columns. Also, is there a better way to write this query? It runs really slowly if there are many entries in the database.
I am assuming that in the above example you only need id=2, since for content_id = 1 the update_date for commit=0 is greater than the update_date for commit=1, and in that case you need the data for commited = 1.
I am using Oracle, so you need to find a suitable replacement for the row_number() function in MySQL.
The logic would be:
Create a view on the existing table that adds a row number per content_id, ordered by update_date ascending (see if you can use a nested query to do it), so it gives row numbers like below:
ID, CONTENT_ID, COMMITED, UPDATE_DATE, ROWN
2 1 1 06-SEP-15 00:00:56 1
1 1 0 07-SEP-15 00:00:56 2
3 2 0 03-SEP-15 00:00:56 1
4 2 1 04-SEP-15 00:00:56 2
Now select only rows where rown=1 and commited=1
This is the query in oracle. The second with query c2 will be your view.
Oracle query
with c1 (id, content_id,commited,update_date) as
(
select 1,1,0,sysdate from dual union
select 2,1,1,sysdate-1 from dual union
select 3,2,0,sysdate-4 from dual union
select 4,2,1,sysdate-3 from dual
),
c2 as
(select c1.*,row_number() over(partition by content_id order by update_date) as rown from c1)
select id,content_id,commited,update_date from c2
where rown=1 and commited=1
Output
ID, CONTENT_ID, COMMITED, UPDATE_DATE
2 1 1 06-SEP-15 00:06:17
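MySQL 8.0 and later do support ROW_NUMBER(), so the same logic should carry over largely unchanged there. The sketch below runs it with Python's sqlite3 module (this needs an SQLite build with window functions, 3.25 or newer), using the four sample rows from this answer with ISO-formatted date strings and the question's column name committed. The second query is an addition for this example, not part of the answer above: it returns both rows of each matching content_id as rows, which is what the question asked for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contents (id INTEGER PRIMARY KEY, content_id INTEGER,"
    " committed INTEGER, update_date TEXT)"
)
conn.executemany(
    "INSERT INTO contents VALUES (?, ?, ?, ?)",
    [
        (1, 1, 0, "2015-09-07 00:00:56"),
        (2, 1, 1, "2015-09-06 00:00:56"),
        (3, 2, 0, "2015-09-03 00:00:56"),
        (4, 2, 1, "2015-09-04 00:00:56"),
    ],
)

# Number the rows of each content_id from oldest to newest.
RANKED = """
    WITH ranked AS (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY content_id ORDER BY update_date) AS rown
        FROM contents c
    )
"""

# Committed rows that are older than their uncommitted counterpart.
committed_rows = conn.execute(
    RANKED
    + """
    SELECT id, content_id, committed, update_date
    FROM ranked
    WHERE rown = 1 AND committed = 1
    """
).fetchall()

# Both rows (committed and uncommitted) for those content_ids, as rows.
both_rows = conn.execute(
    RANKED
    + """
    SELECT id, content_id, committed, update_date
    FROM contents
    WHERE content_id IN (SELECT content_id FROM ranked
                         WHERE rown = 1 AND committed = 1)
    ORDER BY content_id, committed
    """
).fetchall()
print(committed_rows)
print(both_rows)
```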

Need Help streamlining a SQL query to avoid redundant math operations in the WHERE and SELECT

Hey everyone, I am working on a query and am unsure how to make it process as quickly as possible and with as little redundancy as possible. I am really hoping someone there can help me come up with a good way of doing this.
Thanks in advance for the help!
Okay, so here is what I have as best I can explain it. I have simplified the tables and math to just get across what I am trying to understand.
Basically I have a smallish table that never changes and will always only have 50k records like this:
Values_Table
ID Value1 Value2
1 2 7
2 2 7.2
3 3 7.5
4 33 10
….50000 44 17.2
And a couple tables that constantly change and are rather large, eg a potential of up to 5 million records:
Flags_Table
Index Flag1 Type
1 0 0
2 0 1
3 1 0
4 1 1
….5,000,000 1 1
Users_Table
Index Name ASSOCIATED_ID
1 John 1
2 John 1
3 Paul 3
4 Paul 3
….5,000,000 Richard 2
I need to tie all 3 tables together. The most results that are likely to ever be returned from the small table is somewhere in the neighborhood of 100 results. The large tables are joined on the index and these are then joined to the Values_Table ON Values_Table.ID = Users_Table.ASSOCIATED_ID …. That part is easy enough.
Where it gets tricky for me is that I need to return, as quickly as possible, a list limited to 10 results where Value1 and Value2 are mathematically operated on to return a new_value, where that new_value is less than 10, the result is sorted by that new_value, and any other WHERE conditions I need can be applied to the flags. I do need to be able to move along the limit, e.g. LIMIT 0,10 / 10,10 / 20,10 etc...
In a subsequent (or the same if possible) query I need to get the top 10 count of all types that matched that criteria before the limit was applied.
So for example I want to join all of these and return anything where Value1 + Value2 < 10 AND I also need the count.
So what I want is:
Index Name Flag1 New_Value
1 John 0 9
2 John 0 9
5000000 Richard 1 9.2
The second response would be:
ID (not index) Count
1 2
2 1
I tried this a few ways and ultimately came up with the following somewhat ugly query:
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
ORDER BY New_Value
LIMIT 0,10
And then for the count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
Being able to filter on the different flags and such in my WHERE clause is important. That may sound stupid to comment on, but I mention it because, from what I could see, a quicker method would have been to use a HAVING clause, and I don't believe that will work in certain instances depending on what I want my WHERE clause to filter against.
And when filtering using the flags table :
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10 AND Flag1 = 0
ORDER BY New_Value
LIMIT 0,10
...filtered count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10 AND Flag1 = 0
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
That works fine but has to run the math multiple times for each row, and I get the nagging feeling that it is also running the math multiple times on the same row in the Values_table table. My thought was that I should just get only the valid responses from the Values_table first and then join those to the other tables to cut down on the processing; with how SQL optimizes things though I wasn't sure if it might not already be doing that. I know I could use a HAVING clause to only run the math once if I did it that way but I am uncertain how I would then best join things.
My questions are:
Can I avoid running that math twice and still make the query work (or I suppose if there is a good way to make the first one work as well that would be great)?
What is the fastest way to do this, as this is something that will be running very often?
It seems like this should be painfully simple but I am just missing something stupid.
I contemplated pulling into a temp table then joining that table to itself but that seems like I would trade math for iterations against the table and still end up slow.
Thank you all for your help in this and please let me know if I need to clarify anything here!
To clarify on a question: I can't use a 3rd column with the values pre-calculated because in reality the math is much more complex than addition; I just simplified it for illustration's sake.
Do you have a benchmark query to compare against? Usually it doesn't work to try to outsmart the optimizer. If you have acceptable performance from a starting query, then you can see where extra work is being expended (indicated by disk reads, cache consumption, etc.) and focus on that.
Avoid the temptation to break it into pieces and solve those. That's an antipattern. That includes temp tables especially.
Redundant math is usually ok - what hurts is disk activity. I've never seen a query that needed CPU work reduction on pure calculations.
Gather your results and put them in a temp table
CREATE TEMPORARY TABLE TempTable AS
SELECT Users_Table.`Index`, NAME, Type, ID, Flag1, (Value1 + Value2) AS New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.`Index` = Users_Table.`Index`
-- MySQL lets HAVING refer to the New_Value alias; WHERE cannot
HAVING New_Value < 10
Return Result for First Query
-- sort and page here, so the counts below still see every matching row
SELECT `Index`, NAME, Flag1, New_Value
FROM TempTable
ORDER BY New_Value
LIMIT 0,10
Return Results for count of Types
-- grouped by ID to match the per-ID counts the question expects
SELECT ID, COUNT(Type)
FROM TempTable
GROUP BY ID
Is there any chance that you can add a third column to the values_table with the pre-calculated value? Even if the result of your calculation is dependent on other variables, you could run the calculation for the whole table but only when those variables change.
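One way to express the "compute the value once on the small table, then join" idea from the question is a derived table, so the expression is written a single time and both the paged query and the count query reuse it. This is only a sketch with Python's sqlite3 module: the sample rows are the ones from the question, the Index columns are renamed to Idx here to avoid the reserved word (an assumption for this example), and plain addition stands in for the real formula. Whether the optimizer truly evaluates the expression once per Values_Table row is something to verify with EXPLAIN on the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Values_Table (ID INTEGER PRIMARY KEY, Value1 REAL, Value2 REAL);
    CREATE TABLE Users_Table (Idx INTEGER PRIMARY KEY, Name TEXT, ASSOCIATED_ID INTEGER);
    CREATE TABLE Flags_Table (Idx INTEGER PRIMARY KEY, Flag1 INTEGER, Type INTEGER);
    INSERT INTO Values_Table VALUES (1, 2, 7), (2, 2, 7.2), (3, 3, 7.5), (4, 33, 10);
    INSERT INTO Users_Table VALUES
        (1, 'John', 1), (2, 'John', 1), (3, 'Paul', 3), (4, 'Paul', 3), (5, 'Richard', 2);
    INSERT INTO Flags_Table VALUES (1, 0, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1), (5, 1, 1);
    """
)

# The math appears once, in the derived table v; everything else refers to New_Value.
BASE = """
    FROM (SELECT ID, Value1 + Value2 AS New_Value FROM Values_Table) AS v
    JOIN Users_Table AS u ON u.ASSOCIATED_ID = v.ID
    JOIN Flags_Table AS f ON f.Idx = u.Idx
    WHERE v.New_Value < 10
"""

# First page of results, sorted by the computed value.
page = conn.execute(
    "SELECT u.Idx, u.Name, f.Flag1, v.New_Value"
    + BASE
    + "ORDER BY v.New_Value, u.Idx LIMIT 10 OFFSET 0"
).fetchall()

# Counts per ID over all matching rows, taken before any LIMIT.
counts = conn.execute(
    "SELECT v.ID, COUNT(*)" + BASE + "GROUP BY v.ID ORDER BY v.ID"
).fetchall()

print(page)
print(counts)
```

Extra flag filters such as AND f.Flag1 = 0 can be appended to the WHERE clause in BASE without touching the derived table.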