I have written a script that takes about 15 hours to execute. I need query optimization techniques or suggestions to make this script as fast as possible.
If anyone can help, please take a look at the script:
declare @max_date date
declare @client_bp_id int
Select @max_date=MAX(tran_date) from All_Share_Txn
DELETE FROM Client_Share_Balance
DECLARE All_Client_Bp_Id CURSOR FOR
SELECT Bp_id FROM Client --Take All Client's BPID
OPEN All_Client_Bp_Id
FETCH NEXT FROM All_Client_Bp_Id
INTO @client_bp_id
WHILE @@FETCH_STATUS = 0
BEGIN
Insert Client_Share_Balance(Bp_id,Instrument_Id,Quantity_Total,Quantity_Matured,Quantity_Pledge,AVG_Cost,Updated_At,Created_At,Company_Id,Created_By,Updated_By)
select @client_bp_id,Instrument_Id,
sum(case when Is_buy='True' then Quantity when Is_buy='False' then -quantity end), --as Total Quantity
sum(case when Mature_Date_Share <= @max_date then (case Is_buy when '1' then quantity when '0' then -quantity end) else 0 end), --as Free Qty
ISnull((select sum(case pu.IsBuy when '1' then -pu.quantity else pu.quantity end) from
(Select * from Pledge UNION Select * from Unpledge) pu
where pu.Client_Bp_id=@client_bp_id and pu.Instrument_Id=t1.Instrument_Id and pu.Txn_Date<=@max_date
group by pu.Client_Bp_id,pu.Instrument_Id),0), -- as Pledge_Quantity
dbo.Avg_Cost(@client_bp_id,Instrument_Id), --as Avg_rate
GETDATE(),GETDATE(),309,1,1
from All_Share_Txn t1
where Client_Bp_id=@client_bp_id and Instrument_Id is not null
group by Instrument_Id
having sum(case Is_buy when '1' then quantity when '0' then -quantity end)<> 0
or sum(case when Mature_Date_Share <= @max_date then (case Is_buy when '1' then quantity when '0' then -quantity end) else 0 end) <> 0
FETCH NEXT FROM All_Client_Bp_Id
INTO @client_bp_id
END
CLOSE All_Client_Bp_Id
DEALLOCATE All_Client_Bp_Id
I just need to verify whether the code could be written more efficiently.
Replace * with explicit column names. For example, instead of Select * from Pledge, write
Select Instrument_Id from Pledge
Avoid the cursor.
Do you have unique records in the Pledge and Unpledge tables? If so, use UNION ALL, which is faster than UNION because it skips duplicate elimination.
Insert the records of All_Share_Txn into a local temporary table.
Create another local temporary table that holds the 'Total Quantity' per Instrument_Id. Evaluate the CASE-based condition and insert the quantity figures into this table, reading from the temporary copy of All_Share_Txn rather than from the base table.
Create another local temporary table that holds the 'Free Qty' per Instrument_Id, populated the same way from the temporary copy of All_Share_Txn.
Create another local temporary table that holds the 'Pledge_Quantity' per Instrument_Id, populated the same way.
Create another local temporary table that holds the 'Avg_rate' per Instrument_Id, populated the same way.
Finally, join the temporary tables created in the previous steps to get the result set in a single query.
If I understand your code correctly, the cursor is the bottleneck. So I would skip the cursor and do something like this:
Insert Client_Share_Balance(Bp_id,Instrument_Id..)
select Client_Bp_id,
......
from All_Share_Txn t1
where EXISTS(SELECT NULL FROM Client c WHERE c.Bp_id=t1.Client_Bp_id)
and Instrument_Id is not null
group by Instrument_Id,Client_Bp_id
.......
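The set-based idea above can be tried out on a small scale. The following is only a minimal sketch, not the poster's real schema: it uses Python's sqlite3 module as a stand-in for SQL Server, the table layout is cut down to the columns needed for the total quantity, and all sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client (bp_id INTEGER PRIMARY KEY);
    CREATE TABLE all_share_txn (
        client_bp_id INTEGER, instrument_id INTEGER,
        is_buy INTEGER, quantity INTEGER
    );
    CREATE TABLE client_share_balance (
        bp_id INTEGER, instrument_id INTEGER, quantity_total INTEGER
    );
    INSERT INTO client VALUES (1), (2);
    -- Hypothetical sample data
    INSERT INTO all_share_txn VALUES
        (1, 10, 1, 100), (1, 10, 0, 40),   -- client 1 nets 60 of instrument 10
        (1, 20, 1, 5),   (1, 20, 0, 5),    -- nets to zero, removed by HAVING
        (2, 10, 1, 7),
        (3, 10, 1, 9),                     -- client 3 is not in client: skipped
        (2, NULL, 1, 1);                   -- no instrument: skipped
""")

# One INSERT ... SELECT handles every client at once; grouping by client and
# instrument replaces the per-client cursor loop.
conn.execute("""
    INSERT INTO client_share_balance (bp_id, instrument_id, quantity_total)
    SELECT t.client_bp_id, t.instrument_id,
           SUM(CASE WHEN t.is_buy = 1 THEN t.quantity ELSE -t.quantity END)
    FROM all_share_txn AS t
    WHERE t.instrument_id IS NOT NULL
      AND EXISTS (SELECT 1 FROM client AS c WHERE c.bp_id = t.client_bp_id)
    GROUP BY t.client_bp_id, t.instrument_id
    HAVING SUM(CASE WHEN t.is_buy = 1 THEN t.quantity ELSE -t.quantity END) <> 0
""")

balances = conn.execute(
    "SELECT bp_id, instrument_id, quantity_total "
    "FROM client_share_balance ORDER BY bp_id, instrument_id"
).fetchall()
print(balances)
```

The remaining columns of the original script (matured quantity, pledge quantity, average cost) would be added to the same SELECT; they are left out here to keep the sketch short.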
Unless you need to read only COMMITTED data, you can tell SQL Server to read the data as it is, without holding any locks on the objects. This is basically the same behavior as WITH (NOLOCK), but Microsoft recommends not using table hints and letting SQL Server decide on the best locking method. Not having to "worry" about whether data is committed greatly speeds up fetching data, at the cost of possibly reading uncommitted (dirty) rows.
Add this to the top of your query:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
--Set at the connection level, not a one time thing
GO
See the documentation for SET TRANSACTION ISOLATION LEVEL.
Next, please check that your unique id columns have clustered indexes on them. If they only have unique constraints, that still leaves you with a heap rather than a clustered table. Heaps are basically a scattered mess. (You probably know this stuff, just mentioning it.) Then make sure you have non-clustered indexes on the columns you use in WHERE, ORDER BY, GROUP BY and HAVING, and include columns that are returned frequently; this consumes more disk space, by the way. Use compressed indexes and tables if you are using Enterprise Edition or higher.
I would run UPDATE STATISTICS WITH FULLSCAN on the tables to give SQL Server the latest and greatest statistics. Rebuilding the clustered index will do this for you, by the way, and you can also tell it to update only the data statistics to speed this process up, since it can take a while if your tables have millions of rows.
The biggest performance hit you are taking comes from grouping and filtering on aggregated results. SQL Server is going to be scanning, using work tables, sorting... everything BUT seeking indexes, all because there is no index that can help your GROUP BY, HAVING, etc.
Sometimes we have no choice in this matter, but there are tricks, like creating a #temporary table (or a table variable, and yes, you can create indexes on both), populating it with pre-calculated results, and making sure that this #temporary table can be joined on. When ORDER BY, GROUP BY or HAVING operate on anything other than a plain column or set of columns, for example on aggregates like SUM or on scalar user-defined functions, it's going to be slow, depending on what you define as slow :) but you did state that you think it should be faster.
Some basic settings to look at for any instance of SQL Server:
tempdb should have the same number of files as you have cores, all the same size and all with the same growth rate, set in MB, not %. I would go up to only 8 files even if you have more than 8 cores. Restart the instance after making changes; you don't HAVE TO, but I recommend it.
Example: 12 cores.
tempdb.mdf size= 1024MB growby= 256MB
tempdb2.ndf size= 1024MB growby= 256MB
(etc)
tempdb8.ndf size= 1024MB growby= 256MB
Same with your database. If you need to add more files, do so with the recommended size settings and rebuild all of your clustered indexes; this will spread the data across the files, since it rebuilds the physical structure of the data.
Don't let SQL Server take up all the memory! Set its limit to total_avail_phys_memory minus 2GB (leave the O/S a good amount of memory).
Don't just use the PRIMARY file group; separate your data and your indexes into their own file groups. If you can, put the indexes on their own RAID 10 drives and the data on RAID 5 or 6.
Make sure you're giving the SQL Server user databases the care they need with scheduled maintenance, either through a maintenance plan or by rolling your own scripts.
Data on RAID 5, logs on RAID 10, tempdb on RAID 10; each LUN (drive letter) should have dedicated spindles (drives).
I hope these suggestions are helpful; if nothing else, they should help the overall performance of the instance.
We have just one table with millions of rows, where this query, as it stands, takes 138 seconds to run on a server with a buffer pool size of 25G; the server itself runs Linux with SSD drives.
I am wondering if anyone could suggest any improvements to the MySQL settings or to the query itself that would reduce the run time. Only about 8 large member_id values have this performance problem; the rest run in under 5 seconds. We run multiple summary tables like this for rollup reporting.
select *
from (
SELECT distinct account_name AS source,SUM(royalty_amount) AS total_amount
FROM royalty_stream
WHERE member_id = '1050705'
AND deleted = 0
AND period_year_quarter >= '2016_Q1'
AND period_year_quarter <= '2016_Q2'
GROUP BY account_name
ORDER BY total_amount desc
LIMIT 1
) a
I see a few obvious improvements.
Subselects
Don't use a subselect. This isn't a huge deal, but it makes little sense to add the overhead here.
Using Distinct
Is the distinct really needed here? Since you're grouping, it should be unnecessary overhead.
Data Storage Practices
Your period_year_quarter evaluation is going to be a hurdle. String comparisons are one of the slower things you can do, unfortunately. If you have the ability to update the data structure, I would highly recommend that you break period_year_quarter into two distinct, integer fields. One for the year, one for the quarter.
Is royalty_amount actually stored as a number, or are you making the database implicitly convert it every time? If it is the latter (a surprisingly common mistake), converting the column to a numeric type will also help.
Indexing
You haven't explained what indexes are on this table. I'm hoping that you at least have one on member_id. If not, it should certainly be indexed.
I would further recommend an index on (member_id, period_year_quarter). If you took my advice from the previous section, that should be (member_id, year, quarter).
select
account_name as source
, sum(royalty_amount) as total_amount
from
royalty_stream
where
member_id = '1050705'
and deleted = 0
and period_year_quarter between '2016_Q1' and '2016_Q2'
group by
account_name
order by
total_amount desc
limit 1
I have a table
name: order
with
id, product_id, comment
Now I want to add a state.
new table: order_state
1 -> finished
2 -> started
etc
Then I would add a field order_state_id to the table order.
In what way do I have to worry about performance?
Does this always perform well, or in which cases won't it? I mean, for example, when doing joins with a lot of orders, say 200,000 orders.
I have used MySQL views before and they were horrible; the view I created obviously contained several joins. Is this not a related problem?
Not an answer, just too big for a comment
In addition to what has been said, consider partial indexes.
Some databases, like Postgres and SQL Server, allow you to create indexes that specify not only columns but also which rows to include.
It seems that you will end up with a constantly growing number of orders with order_state_id equal to finished (1) and a stable number of orders with order_state_id equal to started (2).
If your business makes use of queries like this
SELECT id, comment
FROM order
WHERE order_state_id = 2
AND product_id = @some_value
partial indexing allows you to limit the index so that it includes only the unfinished orders:
CREATE INDEX Started_Orders
ON order(product_id)
WHERE order_state_id = 2
This index will be smaller than its unfiltered counterpart.
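To see a partial index being picked up, here is a minimal sketch using Python's sqlite3 module (SQLite also supports partial indexes). The table is named orders to sidestep the reserved word, the value used for "started" follows the 1 -> finished, 2 -> started mapping from the question, and the generated rows are purely illustrative.

```python
import sqlite3

STARTED = 2  # assumed state id for "started", per the question's mapping

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        order_state_id INTEGER NOT NULL,
        comment TEXT
    );
    -- Partial index: only rows in the "started" state are indexed.
    CREATE INDEX started_orders ON orders (product_id)
        WHERE order_state_id = {STARTED};
""")

# Hypothetical data: 1 order in 100 is still "started", the rest are finished.
rows = [(i, i % 50, STARTED if i % 100 == 0 else 1, f"order {i}")
        for i in range(1, 10001)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

query = (f"SELECT id, comment FROM orders "
         f"WHERE order_state_id = {STARTED} AND product_id = ?")
started_rows = conn.execute(query, (0,)).fetchall()

# Ask the planner which access path it chose for the query.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (0,)).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(len(started_rows), plan_text)
```

The query's WHERE clause has to repeat the index condition literally for the planner to consider the partial index, which is why the state id is written into the SQL text rather than passed as a parameter.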
Don't normalize order_state. Instead add this column
order_state ENUM('finished', 'started') NOT NULL
Then use it this way (for example):
SELECT ...
WHERE order_state = 'finished'
...
An ENUM (with up to 255 options) takes only 1 byte. INT takes 4 bytes. TINYINT takes 1 byte.
Back to your question... There are good uses of JOIN and there are unnecessary uses.
I need to develop an SSIS package that imports a flat file (it has only 1 column) and compares each row against 2 columns (start and end) of an existing table.
Flat File data -
110000
111112
111113
112222
112525
113222
113434
113453
114343
114545
And compare each row of the flat file against structure/data -
id start end
8 110000 119099
8 119200 119999
3 200000 209999
3 200000 209999
2 300000 300049
2 770000 779999
2 870000 879999
If I needed to implement this in a simple stored procedure, it would be fairly simple; however, I can't get my head around how to do it in an SSIS package.
Any ideas? Any help is much appreciated.
At the core, you will need to use a Lookup Component. Write a query, SELECT T.id, T.start, T.end FROM dbo.MyTable AS T and use that as your source. Map the input column to the start column and select the id so that it will be added to the data flow.
If you hit run, it will perform an exact lookup and only find values of 110000 and 119200. To convert it to a range query, you will need to go into the Advanced tab. There should be 3 things you can check: amount of memory, rows and customize the query. When you check the last one, you should get a query like
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ref.start = ?
You will need to modify that to become
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ? BETWEEN ref.start AND ref.end
It's been my experience that range queries can become rather inefficient, as the lookup seems to cache only what it has already seen: if the source file had 110001, 110002, 110003, you would see 3 unique queries sent to the database. For small data sets that may not be so bad, but it led to some ugly load times for my DW.
An alternative to this is to explode the ranges. I had a source system that only kept date ranges, and I needed to know by day what certain counts were. The range lookups were not performing well, so I crafted a query to convert a single row with a range of 2010-01-01 to 2013-07-07 into many rows, each with a single date: 2010-01-01, 2010-01-02... While this approach led to a longer pre-execute phase (it took a few minutes, as the query had to generate ~30k rows per day for the past 5 years), once cached locally it was a simple seek to find a given transaction by day.
Preferably, I'd create a numbers table, fill it to the max of int and be done with it, but you might get by with just an inline table-valued function to generate numbers. Your query would then look something like
SELECT
T.id
, GN.number
FROM
dbo.MyTable AS T
INNER JOIN
-- Make this big enough to satisfy your theoretical ranges
dbo.GenerateNumbers(1000000) AS GN
ON GN.number BETWEEN T.start and T.end;
That would get used in a "straight" lookup without the need for any of the advanced features. The lookup is going to get very memory hungry though so make the query as tight as possible. For example, cast the GN.number from a bigint to an int in the source query if you know your values will fit in an int.
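The exploded-range idea can be sketched outside SSIS as well. The example below is an illustration only: it uses Python's sqlite3 module, a recursive CTE in place of the dbo.GenerateNumbers function, and the column names range_start and range_end are made up (start and end are awkward identifiers). The data is the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ranges (id INTEGER, range_start INTEGER, range_end INTEGER);
    INSERT INTO ranges VALUES
        (8, 110000, 119099), (8, 119200, 119999),
        (3, 200000, 209999), (3, 200000, 209999),
        (2, 300000, 300049), (2, 770000, 779999), (2, 870000, 879999);

    -- One row per number covered by any range, so the lookup becomes an
    -- exact match instead of a BETWEEN.
    CREATE TABLE exploded (number INTEGER PRIMARY KEY, id INTEGER);
    WITH RECURSIVE expand(id, number, range_end) AS (
        SELECT id, range_start, range_end FROM ranges
        UNION ALL
        SELECT id, number + 1, range_end FROM expand WHERE number < range_end
    )
    INSERT INTO exploded (number, id)
    SELECT DISTINCT number, id FROM expand;  -- DISTINCT drops the duplicate range
""")

flat_file_values = [110000, 111112, 111113, 112222, 112525,
                    113222, 113434, 113453, 114343, 114545]

# "Straight" lookup: a primary-key seek per incoming value.
matches = []
for value in flat_file_values:
    row = conn.execute(
        "SELECT id FROM exploded WHERE number = ?", (value,)
    ).fetchone()
    matches.append((value, row[0] if row else None))

exploded_count = conn.execute("SELECT COUNT(*) FROM exploded").fetchone()[0]
print(exploded_count, matches)
```

The trade-off is the one described above: the exploded table is much larger than the range table, in exchange for lookups that need no range comparison.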
This table contains server monitoring records. Each time a server fails to respond to a ping, a new record is inserted, so one server can fail multiple times. I want to count how many times SERVER 3 has failed.
This is the table, where failure_id is the primary key.
failure_id server_id protocol added_date
---------- --------- -------- ---------------------
1 1 HTTP 2013-02-04 15:50:42
2 3 HTTP 2013-02-04 16:35:20
Using (*) to count the rows
SELECT
COUNT(*) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using server_id to count the rows
SELECT
COUNT(`f`.`server_id`) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
Using SUM to count the rows
SELECT
IFNULL(SUM(1), 0) AS `total`
FROM
`failures` `f`
WHERE CAST(`f`.`server_id` AS CHAR) = 3;
All the above queries return the correct output. But my database will be very large in the future. Which method is best to use based on performance? Thanks in advance...
I'd say none of the above, provided you have control over the app that's inserting the records. If you don't have a table for your servers, just create one; otherwise add a field called current_failure_count or something similar to the existing one. Then, when you insert a failure record, also update the server table and set current_failure_count = current_failure_count + 1 for that server. That way you only have to read one record in the server table (indexed by server_id, I'd assume) and you're set. No, this does not follow the normalization rules, but you are seeking speed, and this is the best way to get it if you can control the client software.
If you cannot control the client software, perhaps you can put a trigger on inserts into the failures table that increments the current_failure_count value in the servers table. That should work as well.
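A trigger-maintained counter along these lines might look as follows. This is a rough sketch using Python's sqlite3 module rather than MySQL, so the trigger syntax differs slightly; the servers table, the trigger name and the third failure row are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE servers (
        server_id INTEGER PRIMARY KEY,
        current_failure_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE failures (
        failure_id INTEGER PRIMARY KEY,
        server_id INTEGER NOT NULL,
        protocol TEXT,
        added_date TEXT
    );
    -- Every inserted failure bumps the counter of the affected server.
    CREATE TRIGGER failures_after_insert AFTER INSERT ON failures
    BEGIN
        UPDATE servers
        SET current_failure_count = current_failure_count + 1
        WHERE server_id = NEW.server_id;
    END;

    INSERT INTO servers (server_id) VALUES (1), (3);
    INSERT INTO failures (server_id, protocol, added_date) VALUES
        (1, 'HTTP', '2013-02-04 15:50:42'),
        (3, 'HTTP', '2013-02-04 16:35:20'),
        (3, 'HTTP', '2013-02-05 09:00:00');  -- hypothetical extra failure
""")

# Reading the count is now a single primary-key lookup.
server3_failures = conn.execute(
    "SELECT current_failure_count FROM servers WHERE server_id = ?", (3,)
).fetchone()[0]
print(server3_failures)
```

The counter is only as reliable as the trigger: rows deleted from failures, or loaded while the trigger is disabled, would need a matching correction.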
Well, the second is definitely more efficient than the first.
I recommend you create a view for the server, which will speed things up considerably:
CREATE VIEW server3 AS
SELECT f.server_id
FROM failures f
WHERE CAST(f.server_id AS CHAR) = 3;
Then simply run a count on that view as if it were a table!
Like others, it's not clear to me why you're casting the server_id value. That is going to cost you more performance than any other issue.
If you can eliminate that cast so that you're searching WHERE server_id = (value) and you create an index on server_id then either of the first two queries you suggested will be able to perform index-only retrieval and will provide optimal performance.
SELECT COUNT(*) AS `total`
FROM failures f
WHERE f.server_id = 3;
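As a quick check of the "drop the cast, add an index" advice, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The index name and the generated rows are made up, and SQLite's planner output is not the same as MySQL's EXPLAIN, so treat it only as an illustration of the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE failures (
        failure_id INTEGER PRIMARY KEY,
        server_id INTEGER NOT NULL,
        protocol TEXT,
        added_date TEXT
    );
    CREATE INDEX idx_failures_server_id ON failures (server_id);
""")

# Hypothetical data: 1000 failures spread evenly over servers 0..4.
conn.executemany(
    "INSERT INTO failures (server_id, protocol) VALUES (?, 'HTTP')",
    [(i % 5,) for i in range(1000)],
)

# No CAST on the column, so the comparison can be answered from the index.
query = "SELECT COUNT(*) AS total FROM failures AS f WHERE f.server_id = 3"
total = conn.execute(query).fetchone()[0]

plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
print(total, plan_text)
```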
COUNT(*) will always be better than the arithmetic calculation, although adding an index will make it faster still.
The second-best solution would be
SELECT IFNULL(SUM(1), 0) AS `total`
FROM failures `f`
WHERE f.server_id = 3;
This method is used by the SQL engines of many tools, such as MicroStrategy.
Hope this answer helps... :)
We're doing an update query between two database tables and it is ridiculously slow. As in: it would take 30 days to perform the query.
One table, lab.list, contains about 940,000 records, the other, mind.list about 3,700,000 (3.7 million)
The update sets a field when two BETWEEN conditions are met. This is the query:
UPDATE lab.list L , mind.list M SET L.locId = M.locId WHERE L.longip BETWEEN M.startIpNum AND M.endIpNum AND L.date BETWEEN "20100301" AND "20100401" AND L.locId = 0
As it is now, the query is performing with about 1 update every 8 seconds.
We also tried it with the mind.list table in the same database, but that doesn't matter for the query time.
UPDATE lab.list L, lab.mind M SET L.locId = M.locId WHERE longip BETWEEN M.startIpNum AND M.endIpNum AND date BETWEEN "20100301" AND "20100401" AND L.locId = 0;
Is there a way to speed up this query? Basically IMHO it should make two subsets of the databases:
mind.list.longip BETWEEN M.startIpNum AND M.endIpNum
lab.list.date BETWEEN "20100301" AND "20100401"
and then update the values for these subsets. Somewhere along the line I think I made a mistake, but where? Maybe there is a faster query possible?
We tried log_slow_queries, but that shows that it is indeed examining 100s of millions of rows, probably going up all the way to 3331 gigarows.
Tech info:
Server version: 5.5.22-0ubuntu1-log (Ubuntu)
lab.list has indexes on locId, longip, date
lab.mind has indexes on locId, startIpNum and endIpNum
hardware: 2x xeon 3.4 GHz, 4GB RAM, 128 GB SSD (so that should not be a problem!)
First of all, I would try to index mind on startIpNum, endIpNum, locId, in this order. locId is not used when SELECTing from mind, even though it is used for the update.
For the same reason, I'd index lab on locId, date and longip, in this order (longip isn't used in the first chunking, which should run on date).
Then, what datatype is assigned to startIpNum and endIpNum? For IPv4, it's best to store an (unsigned) INTEGER and use INET_ATON and INET_NTOA for user I/O. I assume you already did this.
To run the update, you might try to segment the M table using temporary tables. That is:
* select all records of lab in the given range of dates with locId = 0 into a temporary table TABLE1.
* run an analysis on TABLE1, grouping IP addresses by their first N bits (using AND with a suitable mask: 0x80000000, 0xC0000000, ... 0xF8000000... and so on), until you find that you have divided them into a "suitable" number of IP "families". These will, by and large, match with startIpNum (but that's not strictly necessary).
* say that you have divided in 1000 families of IP.
* For each family:
* select those IPs from TABLE1 to TABLE3.
* select the IPs matching that family from mind to TABLE2.
* run the update of the matching records between TABLE3 and TABLE2. This should take place in about one hundred thousandth of the time of the big query.
* copy-update TABLE3 into lab, discard TABLE3 and TABLE2.
* Repeat with next "family".
It is not really ideal, but if the slightly improved indexing does not help, I really don't see all that many options.
In the end, the query was too big or cumbersome for MySQL to handle, even after indexing. Testing the same query with the same data on a high-end Sybase server still took 3 hours.
So we abandoned the idea of doing it all on the database server and went back to scripting languages.
We did the following in Python:
load a chunk of 100000 records of the 3.7 million records, and loop over the rows
for each row, set the locId and fill in the rest of the columns
All these updates together take about 5 minutes, so a huge improvement!
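The poster's script is not shown, so the following is only a guess at how such a range match can be done quickly in Python: sort the ranges by start and use a binary search per address. It assumes the ranges do not overlap; the sample ranges, the locId values and the helper name build_lookup are all hypothetical.

```python
from bisect import bisect_right


def build_lookup(ranges):
    """Return a function mapping an integer IP to a locId (0 if no match).

    ranges: iterable of (start_ip_num, end_ip_num, loc_id) tuples, assumed
    to be non-overlapping.
    """
    ordered = sorted(ranges)
    starts = [r[0] for r in ordered]

    def find_loc_id(ip_num):
        # Rightmost range whose start is <= ip_num, found in O(log n).
        i = bisect_right(starts, ip_num) - 1
        if i >= 0 and ordered[i][0] <= ip_num <= ordered[i][1]:
            return ordered[i][2]
        return 0

    return find_loc_id


# Hypothetical stand-ins for rows of mind.list (startIpNum, endIpNum, locId).
mind_ranges = [
    (16777216, 16777471, 17),
    (16777472, 16778239, 49),
    (3232235520, 3232301055, 5),
]
find_loc_id = build_lookup(mind_ranges)

# Hypothetical stand-ins for rows of lab.list that still have locId = 0.
lab_rows = [
    {"longip": 16777300, "locId": 0},
    {"longip": 16777500, "locId": 0},
    {"longip": 3232235777, "locId": 0},
    {"longip": 10, "locId": 0},
]
for row in lab_rows:
    if row["locId"] == 0:
        row["locId"] = find_loc_id(row["longip"])

loc_ids = [row["locId"] for row in lab_rows]
print(loc_ids)
```

With 3.7 million ranges, each lookup costs roughly 22 comparisons instead of a scan over every range, which is consistent with the whole job finishing in minutes.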
Conclusion:
think outside of the database box!