Say you have an inventory journal table with these fields:
ID, ProductID, WarehouseID, etc.
ID = PK
ProductID & WarehouseID are both FK and indexed.
Then let's say we populate 5 million rows of data into the table. I ran 2 queries.
The first query used both FKs, ProductID and WarehouseID:
SELECT inventoryjournals.id,inventoryjournals.ProductID
FROM zenlite.inventoryjournals
where productid = 1 && WarehouseID = 1
limit 30 offset 2500000
This took 5.75s to return the result, which is understandable because the server has to step through rows from the 1st record to the 2.5-millionth. But then I ran another query with an arbitrary ID constraint:
SELECT inventoryjournals.id,inventoryjournals.ProductID
FROM zenlite.inventoryjournals
where productid = 1 && WarehouseID = 1 && id <10000000
limit 30 offset 2500000
or even this
SELECT inventoryjournals.id,inventoryjournals.ProductID
FROM zenlite.inventoryjournals
where productid = 1 && WarehouseID = 1 && id > 0
limit 30 offset 2500000
This shrank the time down to 1.5~1.6s?! Does this mean it's always better to add a PK constraint to all read queries, even an always-true one like id > 0?
My question is, will doing this pose any risk?
There is no way to make offset 2500000 run fast. It must skip over that many rows (unless it hits the end of the table).
All 3 of your queries could benefit from
INDEX(ProductID, WarehouseID, id)
Large offsets are a poor way to do "pagination". It is better to "remember where you left off". Or are you using the large OFFSET for some other purpose?
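The "remember where you left off" approach (keyset pagination) can be sketched as follows. This uses SQLite as a stand-in for MySQL and shrinks the table to 100 rows for the demo; the table and column names mirror the question:

```python
import sqlite3

# In-memory stand-in for the inventoryjournals table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inventoryjournals (
    id INTEGER PRIMARY KEY, ProductID INTEGER, WarehouseID INTEGER)""")
conn.executemany(
    "INSERT INTO inventoryjournals (id, ProductID, WarehouseID) VALUES (?, 1, 1)",
    [(i,) for i in range(1, 101)])
# The composite index recommended above: (ProductID, WarehouseID, id).
conn.execute("CREATE INDEX ix_pwid ON inventoryjournals (ProductID, WarehouseID, id)")

def next_page(last_seen_id, page_size=30):
    """Keyset pagination: seek past the last seen id instead of using OFFSET,
    so the engine never scans over the skipped rows."""
    return conn.execute(
        """SELECT id, ProductID FROM inventoryjournals
           WHERE ProductID = 1 AND WarehouseID = 1 AND id > ?
           ORDER BY id LIMIT ?""",
        (last_seen_id, page_size)).fetchall()

page1 = next_page(0)             # first 30 rows
page2 = next_page(page1[-1][0])  # next 30 rows, starting after the last id seen
```

Each page request does an index seek to the remembered id, so page 83,334 costs the same as page 1.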
I am trying to find a reliable query which returns the first instance of an acceptable insert range.
Research:
Some of the links below address similar questions, but I could not get any of them to work for me.
Find first available date, given a date range in SQL
Find closest date in SQL Server
MySQL difference between two rows of a SELECT Statement
How to find a gap in range in SQL
and more...
Objective Query Function:
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
Where InsertRange(1) is the value the query should return. In other words, this would be the first instance where the above condition is satisfied.
Table Structure:
Primary Key: StartRange
StartRange(i-1) < StartRange(i)
StartRange(i-1) + EndRange(i-1) < StartRange(i)
Example Dataset
Below is an example User table (3 columns) with a set range distribution. StartRanges are always ordered in a strictly ascending way, UserIDs are arbitrary strings, and only the sequences of StartRange and EndRange matter:
StartRange EndRange UserID
312 6896 user0
7134 16268 user1
16877 22451 user2
23137 25142 user3
25955 28272 user4
28313 35172 user5
35593 38007 user6
38319 38495 user7
38565 45200 user8
46136 48007 user9
My current Query
I am trying to use this query at the moment:
SELECT t2.StartRange, t2.EndRange
FROM user AS t1, user AS t2
WHERE (t1.StartRange - t2.StartRange+1) > NewValue
ORDER BY t1.EndRange
LIMIT 1
Example Case
Given the table, if NewValue = 800, then the returned answer should be 23137. This means, the first available slot would be between user3 and user4 (with an actual slot size = 813):
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
InsertRange = (StartRange(6) - EndRange(5)) > NewValue
InsertRange = 23137, since 25955 - 25142 = 813 > 800
More Comments
My query above seemed to be working for the special case where StartRanges were tightly packed (i.e. StartRange(i) = StartRange(i-1) + EndRange(i-1) + 1). It no longer works with a less tightly packed set of StartRanges.
Keep in mind that SQL tables have no implicit row order. It seems fair to order your table by StartRange value, though.
We can start to solve this by writing a query to obtain each row paired with the row preceding it. In MySQL (before version 8.0), it's hard to do this beautifully because it lacks a row-numbering function.
This works (http://sqlfiddle.com/#!9/4437c0/7/0). It may have nasty performance because it generates O(n^2) intermediate rows. There's no row for user0; it can't be paired with any preceding row because there is none.
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
Then, you can use that as a subquery, and apply your conditions, which are
gap >= 800
first matching row (lowest StartRange value) ORDER BY SB
just one LIMIT 1
Here's the query (http://sqlfiddle.com/#!9/4437c0/11/0)
SELECT SB-EA Gap,
EA+1 Beginning_of_gap, SB-1 Ending_of_gap,
UserId UserID_after_gap
FROM (
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
) pairs
WHERE SB-EA >= 800
ORDER BY SB
LIMIT 1
Notice that you may actually want the smallest matching gap instead of the first matching gap. That's called best fit, rather than first fit. To get that you use ORDER BY SB-EA instead.
Edit: There is another way to use MySQL to join adjacent rows that doesn't have the O(n^2) performance issue. It involves employing user variables to simulate a row_number() function. The query involved is a hairball (that's a technical term). It's described in the third alternative of the answer to this question: How do I pair rows together in MYSQL?
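For reference, on engines that do have window functions (MySQL 8.0+, SQLite 3.25+), LAG() pairs each row with its predecessor in a single pass, avoiding both the O(n^2) self-join and the user-variable hairball. A sketch against the sample data from the question:

```python
import sqlite3

# Sample ranges from the question's User table.
rows = [(312, 6896, "user0"), (7134, 16268, "user1"), (16877, 22451, "user2"),
        (23137, 25142, "user3"), (25955, 28272, "user4"), (28313, 35172, "user5"),
        (35593, 38007, "user6"), (38319, 38495, "user7"), (38565, 45200, "user8"),
        (46136, 48007, "user9")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user "
             "(StartRange INTEGER PRIMARY KEY, EndRange INTEGER, UserID TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?, ?)", rows)

# LAG(EndRange) gives each row its predecessor's EndRange in O(n);
# the first row has no predecessor, so its PrevEnd is NULL and is filtered out.
first_fit = conn.execute("""
    SELECT PrevEnd + 1 AS gap_start, StartRange - 1 AS gap_end,
           StartRange - PrevEnd AS gap
    FROM (SELECT StartRange,
                 LAG(EndRange) OVER (ORDER BY StartRange) AS PrevEnd
          FROM user) AS numbered
    WHERE PrevEnd IS NOT NULL AND StartRange - PrevEnd >= 800
    ORDER BY gap_start
    LIMIT 1""").fetchone()
```

With NewValue = 800 this finds the 813-wide slot between user3 and user4, matching the worked example. Swap `ORDER BY gap_start` for `ORDER BY gap` to get best fit instead of first fit.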
I have the following query which takes about 28 seconds on my machine. I would like to optimize it and know if there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id, rr1.t1_value, rr2.t0_value
from (select r1.person_id, avg(r1.avg_normalized_value1) as t1_value
from (select ma1.person_id, mn1.store_name, avg(mn1.normalized_value) as avg_normalized_value1
from matrix_report1 ma1, matrix_normalized_notes mn1
where ma1.final_value = 1
and (mn1.normalized_value != 0.2
and mn1.normalized_value != 0.0 )
and ma1.user_id = mn1.user_id
and ma1.request_id = mn1.request_id
and ma1.request_id = 4 group by ma1.person_id, mn1.store_name) r1
group by r1.person_id) rr1
,(select r2.person_id, avg(r2.avg_normalized_value) as t0_value
from (select ma.person_id, mn.store_name, avg(mn.normalized_value) as avg_normalized_value
from matrix_report1 ma, matrix_normalized_notes mn
where ma.final_value = 0 and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0 )
and ma.user_id = mn.user_id
and ma.request_id = mn.request_id
and ma.request_id = 4
group by ma.person_id, mn.store_name) r2
group by r2.person_id) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on the request_id and final_value (0 or 1). Is there a way to simplify it for optimization? And it would be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much.
There are about 4907424 rows on matrix_report1 and 335740 rows on matrix_normalized_notes table. These tables will grow as we have more requests.
First, the others are right: it helps to format your samples better, and explaining in plain language what you are trying to do is also a benefit. Sample data and expected results are even better.
However, that said, I think the query can be significantly simplified. Your two subqueries are almost completely identical, differing only in "final_value" = 1 or 0 respectively. Since each one results in 1 record per "person_id", you can compute both averages with a CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on ( request_id, final_value, user_id ). Your matrix_normalized_notes table should have an index on ( request_id, user_id, store_name, normalized_value ).
Since your outer query averages the per-store averages, you do need to keep it nested. The following should help.
SELECT
r1.person_id,
avg(r1.ANV1) as t1_value,
avg(r1.ANV0) as t0_value
from
( select
ma1.person_id,
mn1.store_name,
avg( case when ma1.final_value = 1
then mn1.normalized_value end ) as ANV1,
avg( case when ma1.final_value = 0
then mn1.normalized_value end ) as ANV0
from
matrix_report1 ma1
JOIN matrix_normalized_notes mn1
ON ma1.request_id = mn1.request_id
AND ma1.user_id = mn1.user_id
AND NOT mn1.normalized_value in ( 0.0, 0.2 )
where
ma1.request_id = 4
AND ma1.final_Value in ( 0, 1 )
group by
ma1.person_id,
mn1.store_name) r1
group by
r1.person_id
Notice the inner query is pulling all transactions for the final value as either a zero OR one. But then, the AVG is based on a case/when of the respective value for the normalized value. When the condition is NOT the 1 or 0 respectively, the result is NULL and is thus not considered when the average is computed.
So at this point, the inner query is already grouped on a per-person, per-store basis with Avg1 and Avg0 set. Now, roll these values up directly per person regardless of the store. Again, NULL values are not considered as part of the average computation, so if Store "A" doesn't have a value in Avg1, it won't skew the results; similarly if Store "B" doesn't have a value in Avg0.
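The NULL-skipping behavior described above can be checked on a toy dataset. This sketch uses SQLite as a stand-in, with made-up values; note how store B contributes to t1_value but leaves t0_value untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t "
             "(person_id INTEGER, store TEXT, final_value INTEGER, v REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, "A", 1, 4.0), (1, "A", 1, 6.0),   # store A, final_value = 1 -> avg 5.0
    (1, "A", 0, 2.0),                     # store A, final_value = 0 -> avg 2.0
    (1, "B", 1, 10.0),                    # store B has NO final_value = 0 rows
])

# AVG(CASE ...) yields NULL for non-matching rows; AVG ignores NULLs at
# both levels, so each side of the average only sees its own final_value.
row = conn.execute("""
    SELECT person_id, AVG(ANV1) AS t1_value, AVG(ANV0) AS t0_value
    FROM (SELECT person_id, store,
                 AVG(CASE WHEN final_value = 1 THEN v END) AS ANV1,
                 AVG(CASE WHEN final_value = 0 THEN v END) AS ANV0
          FROM t GROUP BY person_id, store) AS per_store
    GROUP BY person_id""").fetchone()
# t1_value = (5.0 + 10.0) / 2 = 7.5; t0_value = 2.0 (store B's NULL is skipped)
```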
I'm having trouble calculating the percentage of failures for every column of my very large MySQL table. Here is an example of what a small version of the table looks like:
Assuming TABLE1 has 5 columns and 100 rows,
CREATE TABLE IF NOT EXISTS TABLE1 (id VARCHAR(255) NOT NULL, col1 DOUBLE NOT NULL, col2 DOUBLE NOT NULL, col3 DOUBLE NOT NULL, col4 DOUBLE NOT NULL);
Each column from "col1" to "col4" having its own upper and lower limits and I need to find what is the percentage of failure for "col1" to "col4". Here is the example on how I run my calculation now.
Calculate total number of rows and group by column "id"
SELECT id, COUNT(*) FROM TABLE1 GROUP BY id;
Calculate total number of rows where col1,col2,col3,col4 meets all the limits and group by column "id"
SELECT id, COUNT(*) FROM TABLE1 WHERE (col1 BETWEEN 0 AND 10) AND (col2 BETWEEN 10 AND 20) AND (col3 BETWEEN 20 AND 30) AND (col4 BETWEEN 30 AND 40) GROUP BY id;
Calculate total number of rows that not meet col1 limit
SELECT id, COUNT(col1) FROM TABLE1 WHERE (col1 NOT BETWEEN 0 AND 10) GROUP BY id;
Calculate total number of rows that meet col1 limit but not meet col2 limit, group by "id"
SELECT id, COUNT(col2) FROM TABLE1 WHERE (col1 BETWEEN 0 AND 10) AND (col2 NOT BETWEEN 10 AND 20) GROUP BY id;
Calculate total number of rows that meet col1,col2 limit but not meet col3 limit, group by "id"
SELECT id, COUNT(col3) FROM TABLE1 WHERE (col1 BETWEEN 0 AND 10) AND (col2 BETWEEN 10 AND 20) AND (col3 NOT BETWEEN 20 AND 30) GROUP BY id;
Calculate total number of rows that meet col1,col2,col3 limit but not meet col4 limit, group by "id"
SELECT id, COUNT(col4) FROM TABLE1 WHERE (col1 BETWEEN 0 AND 10) AND (col2 BETWEEN 10 AND 20) AND (col3 BETWEEN 20 AND 30) AND (col4 NOT BETWEEN 30 AND 40) GROUP BY id;
I've written an R script to execute the above 5 queries and combine the results into one data frame. Here is an example of the output after processing in R:
id,total_no_rows,yield,col1,col2,col3,col4
CATEGORY1,25,80%,2%,8%,4%,6%,0%
CATEGORY2,25,70%,6%,14%,2%,6%,2%
CATEGORY3,25,90%,5%,0%,5%,0%,0%
CATEGORY4,25,65%,20%,2.5%,2.5%,5%,5%
Using this method I can get the result pretty quickly for a small table. However, if the table becomes very large, say 1000 columns and 1 million rows, the calculation takes ~2 hours, which is extremely long.
Is there any way I can speed up the calculation?
I've tried indexing but apparently MySQL cannot index 1000 columns.
I tried running simultaneous queries (10 at a time) but saw little improvement. (I'm using InnoDB, by the way.)
I've read some posts where users suggest splitting the table into smaller chunks to speed up query execution. However, my raw data is poorly managed (long story) and is all dumped into one big text file, so dividing it into smaller chunks will be a challenge.
Please let me know if you have any alternative method to approach this kind of problem.
Edit:
Looks like the proposal from Mani saves a lot of time. However, the query still takes around 10 minutes for a very large table (thousands of columns and millions of rows). Is there any way to further improve the query time?
You can use CASE and cover all possible scenarios in a single SELECT. It will reduce your time.
example
select id, count(*),
sum(case when col1 between 0 and 10 then 1 else 0 end) col1_yes,
sum(case when (col1 not between 0 and 10) and (col2 between 0 and 10) then 1
else 0 end) col1no_col2yes
from table
group by id;
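The same pattern extends to the full failure breakdown from the question: one table scan produces every count at once. A sketch using SQLite with the question's limits for col1 (0-10) and col2 (10-20), on a tiny made-up dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id TEXT, col1 REAL, col2 REAL)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", [
    ("CAT1", 5, 15),    # passes both limits
    ("CAT1", 5, 99),    # passes col1, fails col2
    ("CAT1", 50, 15),   # fails col1 (so it never counts against col2)
    ("CAT2", 5, 15),    # passes both limits
])

# One scan yields the total and each stage's failure count per id;
# percentages are then just 100.0 * fail / total.
rows = conn.execute("""
    SELECT id, COUNT(*) AS total,
           SUM(CASE WHEN col1 NOT BETWEEN 0 AND 10 THEN 1 ELSE 0 END) AS col1_fail,
           SUM(CASE WHEN col1 BETWEEN 0 AND 10
                     AND col2 NOT BETWEEN 10 AND 20 THEN 1 ELSE 0 END) AS col2_fail
    FROM t1 GROUP BY id ORDER BY id""").fetchall()
```

Extending the CASE chain to col3 and col4 (each condition requiring all earlier columns to pass) replaces all 5 of the original queries with a single pass over the table.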
I've created MySQL user-defined functions from Levenshtein distance and ratio source code. I am comparing 2 records, and based on a 75% match I want to select the record.
Order comes into table paypal_ipn_orders with an ITEM title
A query executes against a table itemkey to find a 75% match in a record called ITEM as well
If a 75% title match is found, it assigns an eight-digit number from table itemkey to table paypal_ipn_orders
Here is the query
UPDATE paypal_ipn_orders
SET sort_num = (SELECT sort_id
FROM itemkey
WHERE levenshtein_ratio(itemkey.item, paypal_ipn_orders.item_name) > 75)
WHERE packing_slip_printed = 0
AND LOWER(payment_status) = 'completed'
AND address_name <> ''
AND shipping < 100
I have adjusted this a few times but it's failing between lines 4 and 5, at the levenshtein_ratio part. When it does run, it says that the subquery returns more than one row. I don't know how to fix it to return the correct result; I'm just lost as to how to make this work.
A subquery in a SET must return only one value. If itemkey has more than one item that is a >75% match for item_name, what do you want to do? The query below uses one of the best matches:
UPDATE paypal_ipn_orders
SET sort_num = (SELECT sort_id
FROM itemkey
WHERE levenshtein_ratio(itemkey.item, paypal_ipn_orders.item_name) > 75
ORDER BY levenshtein_ratio(itemkey.item, paypal_ipn_orders.item_name) DESC
LIMIT 1)
WHERE packing_slip_printed = 0
AND LOWER(payment_status) = 'completed'
AND address_name <> ''
AND shipping < 100
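The levenshtein_ratio source isn't shown in the question, but one common definition is 100 * (1 - distance / max(len)). A plain-Python sketch of that convention (the asker's UDF may be defined differently):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """Percent similarity: 100 means identical, 0 means nothing shared.
    This is one common convention; other implementations differ."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```

Under this convention, `levenshtein_ratio(itemkey.item, item_name) > 75` means at most a quarter of the longer title's characters differ.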
If I have a table with columns ID, num, varchar,
where ID is an integer, either 1, 2 or 3,
num is a number, counting from 1 to 100,
and varchar is just some text,
then in total we have 300 rows, in no particular order in this table.
What query can I use to get the rows with ID=2 and num from 16-21 out of this table?
(resulting in 6 rows total)
How about
SELECT * from yourtable where ID = 2 AND num >= 16 AND num <= 21
Or, equivalent using BETWEEN
SELECT * from yourtable where ID = 2 AND num BETWEEN 16 AND 21
Create an index to have faster lookups later (but will slow down your inserts a bit):
CREATE INDEX indexname on yourtable (ID,num);
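Either form of the query can be checked against the 300-row layout from the question; a quick sketch using SQLite (table and index names taken from the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (ID INTEGER, num INTEGER, txt TEXT)")
# 3 ID values x num 1..100 = 300 rows, as described in the question.
conn.executemany("INSERT INTO yourtable VALUES (?, ?, ?)",
                 [(i, n, f"row-{i}-{n}") for i in (1, 2, 3) for n in range(1, 101)])
# Composite index so the lookup doesn't scan all 300 rows.
conn.execute("CREATE INDEX indexname ON yourtable (ID, num)")

rows = conn.execute(
    "SELECT * FROM yourtable WHERE ID = 2 AND num BETWEEN 16 AND 21 ORDER BY num"
).fetchall()
# rows holds the 6 matching records, num 16 through 21
```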
SELECT * FROM TABLE WHERE ID = 2 AND NUM > 15 AND NUM < 22;
where TABLE is the name of your table. In general, given that you're selecting on columns ID and NUM, they should probably be indexed for faster retrieval (i.e. the database doesn't have to check every row). Although given your table is small, it probably won't make much difference here.
This should do it:
SELECT * FROM table WHERE id = 2 AND num > 15 AND num < 22