Splitting PL/SQL result - plsqldeveloper

I'm trying to fetch data from a database using PL/SQL Developer; the total number of rows that need to be fetched is more than 1.5 million. When I tried to fetch all the data at once, it took a very long time. I'd like to split it into two fetch phases: the first one covering rows 1 to 1 million, with the rest going to the second phase.
How can I do this in PL/SQL?

This select numbers each row using the analytic function ROW_NUMBER, so you can query by row number:
SELECT *
FROM
(
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY id_column_here) r
    FROM my_table t
)
WHERE r <= 100000;
You can use this with smaller row intervals to retrieve the first 1000 rows, then the next 1000, and so on:
SELECT *
FROM
(
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY id_column_here) r
    FROM my_table t
)
WHERE r BETWEEN 1001 AND 2000;
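If you are on Oracle 12c or later, a minimal alternative sketch (assuming that version and the placeholder names from above) is the OFFSET ... FETCH syntax, which avoids the inline view entirely:

-- Second phase of the fetch: skip the first million rows.
-- Assumes Oracle 12c+ and a deterministic ORDER BY on id_column_here.
SELECT *
FROM my_table
ORDER BY id_column_here
OFFSET 1000000 ROWS FETCH NEXT 500000 ROWS ONLY;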

Related

SQL Query Restricting to 1 return per condition contained within IN clause

Running MySQL 5.5 and trying to essentially return only 1 record for each of the conditions in my IN clause. I can't use DISTINCT because there are multiple distinct records attached to each code (namely, cost will differ) from the IN clause. Below is a dummy query of what I was trying to do, but it doesn't work in 5.5 because of the ROW_NUMBER() function.
'1b' may have multiple records with differing cost values. title should always be the same across every record with the same codes value.
Any thoughts?
SELECT codes, name_place, title, cost
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY codes) rn
    FROM MyDB.MyTable
) t
WHERE codes IN ('1b', '1c', '1d', '1e')
AND rn = 1;
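Since ROW_NUMBER() is unavailable before MySQL 8.0, one hedged workaround sketch is to pick an arbitrary representative row per code, here the cheapest one, via a grouped self-join (table and column names taken from the question; assumes ties on MIN(cost) are absent or acceptable, otherwise a code can still return more than one row):

-- One row per code: join each code to its minimum cost.
SELECT t.codes, t.name_place, t.title, t.cost
FROM MyDB.MyTable t
JOIN (
    SELECT codes, MIN(cost) AS min_cost
    FROM MyDB.MyTable
    WHERE codes IN ('1b', '1c', '1d', '1e')
    GROUP BY codes
) m ON m.codes = t.codes AND m.min_cost = t.cost;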

Keep only last two rows for grouped columns in table

I have a table "History" with about 300,000 rows, which is filled with new data daily. I want to keep only the last two rows of every refSchema/refId combination.
Currently I do it this way:
First Step:
SELECT refSchema,refId FROM History GROUP BY refSchema,refId
With this statement I get all combinations (about 40,000).
Second Step:
I run a foreach loop that looks up the existing rows for each combination from the query above:
SELECT id
FROM History
WHERE refSchema = ? AND refId = ? AND state = 'done'
ORDER BY importedAt DESC
LIMIT 2, 2000
Keep in mind that I want to keep the two newest rows in the table, so I use LIMIT 2, 2000 to skip them. If I find matching rows, I put the ids in an array called idList.
Final Step:
I delete all the ids from the array like this:
DELETE FROM History WHERE id in ($idList)
This doesn't perform well, because I have to check every combination with an extra query. Is there a way to write one delete statement that does the magic and avoids the 40,000 extra queries?
Edit: I use AWS Aurora DB.
If you are using MySQL 8+, then one conceptually simple way to proceed here is to use a CTE to identify the top two rows per group which you do want to retain. Then, delete any record whose id does not appear in this whitelist. (Note that whitelisting the refSchema/refId pairs themselves would not work: every pair has a top-two row, so nothing would ever be deleted.)
WITH cte AS (
    SELECT id
    FROM
    (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
        FROM History
    ) t
    WHERE rn IN (1, 2)
)
DELETE
FROM History
WHERE id NOT IN (SELECT id FROM cte);
If you can't use CTEs, then try inlining the above CTE:
DELETE
FROM History
WHERE id NOT IN (
    SELECT id
    FROM
    (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
        FROM History
    ) t
    WHERE rn IN (1, 2)
);
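A hedged alternative sketch, since NOT IN over a large id list can be slow: a multi-table DELETE that joins the ranked derived table directly and removes everything ranked below the top two (assumes MySQL 8+ window functions, which Aurora MySQL 3 supports, and that id is the primary key):

DELETE h
FROM History h
JOIN (
    -- Rank rows newest-first within each refSchema/refId group.
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY refSchema, refId
                              ORDER BY importedAt DESC) AS rn
    FROM History
) ranked ON ranked.id = h.id
WHERE ranked.rn > 2;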

mysql randomizing result and optimization

I want randomized rows from a query, but using ORDER BY RAND() is just exhausting on a table that has 120k+ rows. I found a small solution that outputs a number of rows, but it works by starting at a random index and returning the rows after it. It is pretty fast, but it only returns consecutive rows after that random index. The code goes like this:
SELECT *
FROM lieky AS r1
JOIN (SELECT (RAND() * (SELECT MAX(col_0) FROM lieky)) AS id) AS r2
WHERE r1.col_0 >= r2.id
ORDER BY r1.col_0 ASC
LIMIT 100
I found it here: http://jan.kneschke.de/projects/mysql/order-by-rand/
Is there something that would help me?
I am trying to paginate randomized data, so that when the user queries the database, he always gets the rows in a random order.
Thanks for the help.
It should be noted that
(SELECT (RAND() * (SELECT MAX(col_0) FROM lieky)) AS id)
can return MAX(col_0), so you'll get only 1 row (because of WHERE r1.col_0 >= r2.id).
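A hedged sketch of a fix for that degenerate case: subtract the page size from the upper bound, so the random starting point always leaves at least 100 rows above it (assumes col_0 values are reasonably dense; gaps would still shrink some pages):

SELECT r1.*
FROM lieky AS r1
JOIN (SELECT FLOOR(RAND() * ((SELECT MAX(col_0) FROM lieky) - 100)) AS id) AS r2
WHERE r1.col_0 >= r2.id
ORDER BY r1.col_0 ASC
LIMIT 100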
I think a good solution would be something like this:
add two columns, groupId int and seed int, plus an index on (groupId, seed);
every x seconds (maybe every hour, or every day) run a script that recalculates these columns (see the sketch below);
when the user opens your row list for the first time (or when you want to re-randomize the items), save a random groupId in the user's session; groupId can be anything from 0 to (select max(groupId) from lieky);
to show rows, use a query like select * from lieky where groupId = %saved groupId% order by seed limit x, 100; it should be very fast.
As for the recalculation script, it will be rather slow (so it's a good idea to run it at night).
You can update seed by using:
update lieky set seed = rand()*1000000
Then set groupId = 0 for the first N rows, groupId = 1 for the following N rows, and so on.
N is the maximum number of rows you can show to a user: (max_page)*(per_page_count).
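A minimal sketch of that recalculation script, assuming MySQL and N = 1000 (it uses user-variable row counting; on MySQL 8+ a ROW_NUMBER()-based update would be the more idiomatic choice):

-- Re-shuffle the per-row sort key.
UPDATE lieky SET seed = FLOOR(RAND() * 1000000);

-- Number the rows in their new random order and bucket them
-- into groups of N = 1000 rows each.
SET @row := -1;
UPDATE lieky
SET groupId = (@row := @row + 1) DIV 1000
ORDER BY seed;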

'GROUP BY' and subquery optimization

I have two tables, data and img, with each row from data having 0, 1, or n images in the img table.
I want to be able to get all records from data having more than 1 image in img. I cannot use JOINs, because I cannot edit the first part of the SQL query: SELECT some_default_columns FROM data WHERE.
Here are my solutions:
Solution 1: takes a while to run, but works
SELECT some_default_columns FROM data WHERE `id` IN
    (SELECT data_id FROM
        (SELECT data_id, COUNT(*) AS occ FROM
         img GROUP BY data_id
         HAVING occ > 1)
     AS tmp)
Solution 2: this should be faster than the previous one (MySQL: View with Subquery in the FROM Clause Limitation), but it literally kills my MySQL server
SELECT * FROM data WHERE id IN
(SELECT data_id FROM img
GROUP BY `data_id`
HAVING count(`data_id`) > 1)
Solution 3: maybe the fastest, but it needs the creation of a view:
CREATE VIEW my_data_with_more_than_one_img AS
SELECT all_columns_of_data_table FROM
data JOIN img ON img.data_id = data.id
GROUP BY img.data_id
HAVING COUNT(img.data_id) > 1
Then execute a simple SELECT on it:
SELECT * FROM my_data_with_more_than_one_img WHERE 1
This last solution is rather fast, but I want to know if there is any faster (or smarter) way to get this done.
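One more hedged sketch that fits the fixed SELECT some_default_columns FROM data WHERE prefix: a correlated count in the WHERE clause. With an index on img(data_id) (an assumption; the question doesn't say one exists) each count becomes a short index range scan per data row:

SELECT some_default_columns FROM data WHERE
    -- Count this row's images; keep the row only if there are at least 2.
    (SELECT COUNT(*) FROM img WHERE img.data_id = data.id) > 1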

SQL select a sample of rows

I need to select sample rows from a set. For example, if my select query returns x rows and x is greater than 50, I want only 50 rows returned, but not just the top 50: I want 50 rows that are evenly spread out over the result set. The table in this case records routes (GPS locations + DateTime).
I am ordering on DateTime and need a reasonable sample of the Latitude and Longitude values.
Thanks in advance
[ SQL Server 2008 ]
To get sample rows in SQL Server, use this query:
SELECT TOP 50 * FROM Table
ORDER BY NEWID();
If you want to get every n-th row (10th, in this example), try this query:
SELECT * FROM
(
    SELECT *, DENSE_RANK() OVER (ORDER BY Column ASC) AS Rank
    FROM Table
) AS Ranking
WHERE Rank % 10 = 0;
More examples of queries selecting random rows for other popular RDBMS can be found here: http://www.petefreitag.com/item/466.cfm
Every n-th row, to get 50 (pseudocode: most engines won't accept a window function directly in WHERE, hence the MS SQL rewrite below):
SELECT *
FROM table
WHERE MOD(ROW_NUMBER() OVER (), (SELECT COUNT(*) FROM table) / 50) = 0
FETCH FIRST 50 ROWS ONLY
And if you want a random sample, go with jimmy_keen's answer.
UPDATE:
In regard to the requirement that it run on MS SQL, I think it should be changed to this (no MS SQL Server around to test, though):
SELECT TOP 50 *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY DateTime) AS rn,
           (SELECT COUNT(*) FROM table) / 50 AS step
    FROM table t
) ranked
WHERE rn % step = 0
I suggest that you add a calculated column to your resultset on selection that is obtained as a random number, and then select the top 50 sorted by that column. That will give you a random sample.
For example:
SELECT TOP 50 *, RAND(Id) AS Random
FROM SourceData
ORDER BY Random
where SourceData is your source data table or view. This assumes T-SQL on SQL Server 2008, by the way. It also assumes that you have an Id column with unique ids on your data source. If your ids are very low numbers, it is a good practice to multiply them by a large integer before passing them to RAND, like this:
RAND(Id * 10000000)
If you want a statistically correct sample, TABLESAMPLE is the wrong solution. A good solution, as I described here based on a Microsoft Research paper, is to create a materialized view over your table which includes an additional column like
CAST( ROW_NUMBER() OVER (...) AS BYTE ) AS RAND_COL_; you can then add an index on this column, plus other interesting columns, and get statistically correct samples for your queries fairly quickly (by using WHERE RAND_COL_ = 1).
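A hedged sketch of the same bucketing idea using a persisted computed column instead of a materialized view (T-SQL; all names here are illustrative, and CHECKSUM could be swapped for a stronger hash such as HASHBYTES if the ids are too regular):

-- Hash each row into 256 buckets; filtering on one bucket gives
-- roughly a 1-in-256 sample that an index can serve directly.
ALTER TABLE SourceData
    ADD rand_col AS CAST(CHECKSUM(Id) & 255 AS tinyint) PERSISTED;

CREATE INDEX IX_SourceData_rand_col ON SourceData (rand_col);

-- Statistically spread sample without scanning the whole table:
SELECT * FROM SourceData WHERE rand_col = 1;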