MS SQL SELECT WITH NO LOCK - sql-server-2008

Is it safe to use SELECT with (NOLOCK) on a table that is never updated?

NOLOCK is safe under a very specific set of circumstances. Also, safety is not all or nothing; you might not care about some safety properties.
NOLOCK scans will sometimes fail if data in the table moves around physically, and scans can also see rows twice or not at all. If your DML does not cause row movement, this shouldn't happen. I say "shouldn't" because this is not formally guaranteed by the product. Note, too, that operations other than DML, such as shrinking a data file, will also cause row movement.
Updates can cause insert/delete pairs. Inserts and deletes can cause row movement.
Some specific forms of DML cannot cause row movement in the current implementation of the product although I doubt this is formally guaranteed either. For example, inserts that append to the b-tree being scanned don't cause row movement although the newly inserted row might be missed (I think).
Most of the time, when you use NOLOCK you should expect to very rarely see slightly broken data and very rarely see scans of b-trees fail. If that's alright with you then go ahead.
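For reference, here is a minimal sketch of the hint in question (dbo.Lookup is a hypothetical read-only table; the session-level equivalent is the READ UNCOMMITTED isolation level):

-- Per-query hint (dbo.Lookup is a made-up table name):
SELECT LookupID, LookupValue
FROM dbo.Lookup WITH (NOLOCK);

-- Session-level equivalent:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT LookupID, LookupValue FROM dbo.Lookup;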

Related

How to optimize a slow SQL query

I have the following SQL query, which is taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id")
        .where("renewals.customer_id IS NULL AND customers.status_id = 4")
        .order("customers.created_at DESC")
        .select("customers.first_name, customers.last_name, customers.customer_state, customers.created_at, customers.id, customers.status_id")
The above query takes 230,976.6 ms (nearly four minutes) to execute.
I have added indexes on firstname, lastname, customer_state, and status_id.
How can I get the query to execute in less than 3 seconds?
Try this...
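For reference, the ActiveRecord call above corresponds roughly to the SQL below, and this LEFT JOIN / IS NULL anti-join pattern usually benefits from an index on renewals.customer_id plus a composite index on customers (status_id, created_at). Treat the index names and the suggestion itself as assumptions to test, not a guaranteed fix.

SELECT customers.first_name, customers.last_name, customers.customer_state,
       customers.created_at, customers.id, customers.status_id
FROM customers
LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id
WHERE renewals.customer_id IS NULL
  AND customers.status_id = 4
ORDER BY customers.created_at DESC;

-- Candidate indexes (MySQL syntax; the index names are hypothetical):
CREATE INDEX idx_renewals_customer_id ON renewals (customer_id);
CREATE INDEX idx_customers_status_created ON customers (status_id, created_at);

Beyond that, the general advice below applies.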
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Breaking your data modifications into smaller pieces like this can greatly increase concurrency. I'll finish by saying you almost never need to use a cursor; there's almost always a set-based solution, and you need to learn to see it.
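A minimal sketch of that pattern, with hypothetical table and column names: stage the comparison work in a temp table, then apply one short UPDATE to the live table.

-- Do the expensive comparison against a temp table, not the live table
-- (dbo.Orders, dbo.Customers, and OverLimit are made up for illustration).
SELECT o.OrderID,
       CASE WHEN o.Total > c.CreditLimit THEN 1 ELSE 0 END AS OverLimit
INTO #OrderFlags
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.CustomerID;

-- Then hit the live table with a single, short UPDATE that holds locks briefly.
UPDATE o
SET o.OverLimit = f.OverLimit
FROM dbo.Orders AS o
JOIN #OrderFlags AS f ON f.OrderID = o.OrderID;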
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
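As a hedged illustration (the function and table names are hypothetical), this is the shape of the conversion: a scalar function called once per row versus an inline table-valued function used with CROSS APPLY.

-- Scalar version: dbo.fn_OrderTotal would be evaluated once per row.
-- SELECT o.OrderID, dbo.fn_OrderTotal(o.OrderID) AS Total FROM dbo.Orders AS o;

-- Inline table-valued version, which the optimizer can expand into the main plan:
CREATE FUNCTION dbo.tvf_OrderTotal (@OrderID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(d.Quantity * d.UnitPrice) AS Total
    FROM dbo.OrderDetails AS d
    WHERE d.OrderID = @OrderID
);
GO

SELECT o.OrderID, t.Total
FROM dbo.Orders AS o
CROSS APPLY dbo.tvf_OrderTotal(o.OrderID) AS t;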
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
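A minimal sketch of the SWITCH pattern described above, with hypothetical table names; both tables must have identical structure, sit on the same filegroup, and the target must be empty.

-- Metadata-only move of the current day's data into the archive table
-- (dbo.DailyLoad and dbo.DailyLoadArchive are made-up names):
ALTER TABLE dbo.DailyLoad
SWITCH TO dbo.DailyLoadArchive;

-- If the load fails, switch it back just as quickly:
ALTER TABLE dbo.DailyLoadArchive
SWITCH TO dbo.DailyLoad;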
This is a case where understanding that all tables are partitioned slashed hours from a data load.
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it called from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
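As a hedged sketch of the pattern (the procedure, table, and column names are made up), the idea is to expose a parameterized stored procedure and have the ORM call it instead of composing the query itself:

-- Hypothetical procedure replacing an ORM-generated query:
CREATE PROCEDURE dbo.GetCustomerOrders
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT o.OrderID, o.OrderDate, o.TotalAmount
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID;
END;
GO

-- The ORM then sends one short call across the wire:
-- EXEC dbo.GetCustomerOrders @CustomerID = 42;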
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
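Here is a minimal sketch of that advice, with hypothetical table names: instead of an AFTER UPDATE trigger on Orders that writes to an audit table inside the same transaction, a stored procedure performs the two steps in separate, short transactions.

-- dbo.Orders and dbo.OrderAudit are made-up tables for illustration.
CREATE PROCEDURE dbo.UpdateOrderStatus
    @OrderID INT,
    @NewStatus INT
AS
BEGIN
    SET NOCOUNT ON;

    -- First transaction: the update itself, holding locks on Orders only.
    BEGIN TRANSACTION;
        UPDATE dbo.Orders
        SET Status = @NewStatus
        WHERE OrderID = @OrderID;
    COMMIT TRANSACTION;

    -- Second transaction: the follow-up insert, holding locks on OrderAudit only.
    BEGIN TRANSACTION;
        INSERT INTO dbo.OrderAudit (OrderID, NewStatus, ChangedAt)
        VALUES (@OrderID, @NewStatus, SYSDATETIME());
    COMMIT TRANSACTION;
END;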
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs; it goes for any volatile column.
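For illustration (hypothetical tables), these are the two patterns being compared: a clustered index on a random GUID versus one on an ever-increasing IDENTITY value.

-- Clustering on a random GUID: inserts land all over the b-tree and fragment it quickly.
CREATE TABLE dbo.EventsGuid (
    EventID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    Payload VARCHAR(200)
);

-- Clustering on IDENTITY: inserts append at the end of the b-tree.
CREATE TABLE dbo.EventsIdentity (
    EventID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload VARCHAR(200)
);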
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
DECLARE @CT INT;
SET @CT = (SELECT COUNT(*) FROM dbo.T1);
IF @CT > 0
BEGIN
    -- do something with the data
END
It’s completely unnecessary. If you want to check for existence, then do this:
IF EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
    -- do something with the data
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows from sysindexes where object_name(id) = 'T1' and indid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID > 3
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref: http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html

Delete MySql rows, or mark "dead"?

I've always had a weird feeling in my gut about actually deleting rows from certain types of tables.
For example, if I have a table of Users...when they delete their account, rather than fully deleting their row, I have been marking as "dead" or inactive. This allows me to retain a record of their existence if I ever need it again.
In situations like this - considering performance, overhead, etc - should I delete the row, or simply mark as inactive?
Which is more "common"?
Personally, I almost always use "soft deletes" as you describe.
If space is a concern, I'll have a job that will periodically clean up the soft-deleted records after they've been deleted for a certain amount of time.
Perhaps you could move the inactive MySQL records to a separate table designed to hold inactive accounts? That way, you could simply move them back over if you need to, or delete the table if database size becomes an issue.
Data is too valuable to be permanently deleted from the database; mark it as dead instead.
I generally use a status column for such cases, with a pattern like this:
0 = Inactive
1 = Active
2 = Trashed
In addition to "soft" deletes, another solution is to use "audit tables". I asked what they were on dba.stackexchange.com recently.
Audit tables are typically used to record actions, such as insert/update/delete, performed on a second table, possibly storing old and new values, time, etc.
They can be implemented using triggers in a straightforward way.
Pros:
the "unused" data is in a separate table
it's easy to turn the level-of-detail knob from fine-grained to coarse-grained
it may be more efficient space-wise, depending on the exact implementation
Cons:
since the data is in a separate table, it could cause key conflicts if a row were ever "undeleted"
it may be less efficient space-wise, depending on the exact implementation
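As a hedged sketch of the trigger-based audit-table approach mentioned above (MySQL syntax; the table and column names are made up):

-- Hypothetical audit table for a users table.
CREATE TABLE users_audit (
    audit_id   INT AUTO_INCREMENT PRIMARY KEY,
    user_id    INT NOT NULL,
    action     VARCHAR(10) NOT NULL,
    old_email  VARCHAR(255),
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER users_after_delete
AFTER DELETE ON users
FOR EACH ROW
BEGIN
    -- Record the deleted row's key and old value before it disappears.
    INSERT INTO users_audit (user_id, action, old_email)
    VALUES (OLD.id, 'DELETE', OLD.email);
END//
DELIMITER ;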
This question made me remember this entertaining anecdote. My point: there are so many factors to consider when choosing between hard and soft delete that there is no rule of thumb telling you which one to pick.

Is mysql UPDATE faster than INSERT INTO?

This is more of a theory question.
If I'm running 50,000 queries that insert new rows, and 50,000 queries that updates those rows, which one will take less time?
Insert would be faster because with update you need to first search for the record that you are going to update and then perform the update.
Though this hardly seems like a valid comparison, as you never really have a choice between inserting and updating; the two fill completely different needs.
EDIT: I should add too that this is with the assumption that there are no insert triggers or other situations that could cause potential bottlenecks.
Insert operation: Create -> Store
Update operation: Retrieve -> Modify -> Store
So the insert operation is faster.
With an insert into the same table, you can always insert all the rows with one query, making it much faster than inserting them one by one. When updating, you can update several rows at a time, but you cannot apply this to every update situation; often you have to run one update query at a time (when updating a specific id), and on a big table it is very slow to find the row and then update it every time. In my experience, it is slower even if the table is indexed.
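A minimal illustration of that contrast, with a hypothetical table: many rows can go into one INSERT, while updates keyed on different ids are typically issued one statement per row.

-- One statement, one round trip, three new rows (scores is a made-up table):
INSERT INTO scores (user_id, points)
VALUES (1, 10), (2, 20), (3, 30);

-- Updates by specific id usually end up as one statement per row:
UPDATE scores SET points = 11 WHERE user_id = 1;
UPDATE scores SET points = 21 WHERE user_id = 2;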
As an aside here, don't forget that by doing loads more inserts than updates, you have more rows when you come to select, so you'll slow down the read operation.
So the real question then becomes: what do you care about more, a quick insert or a speedy read? Again, this depends on certain factors, particularly (and not yet mentioned) the storage engine, such as InnoDB (which is now the default in MySQL, incidentally).
I agree with everyone else though - there's too much to consider on a case-by-case basis and therefore you really need to run your own tests and assess the situation from there based on your needs.
There are a lot of non-practical answers here. Yes, theoretically updates are slower because they have to do the extra step of looking up the row. But this is not at all the full picture if you're working with a database made after 1992.
Short answer: they're the same speed. (Don't pick one operation over the other for the sake of speed, just pick the right operation).
Long answer: When updating, you're writing to memory pages and marking them as dirty. Any modern database will detect this and keep those pages in cache longer (as opposed to a normal select statement, which doesn't set this flag). The cache is also smart enough to hold on to pages that are accessed frequently (see LRU-K). So subsequent updates to the same rows will be pretty much instant, no lookups needed. This assumes you're updating based on indexed columns such as IDs (I'll talk about that in a second).
Compare this to a rapid stream of inserts: new pages need to be created and loaded into the cache. Sure, you can put multiple new rows on the same page, but as you continue to insert, each page fills up and is tossed away, never to be used again, so you don't take advantage of re-using pages in the cache. (And just as a note, "loading pages into the cache" is also known as a "page fault", which is the number-one performance killer for databases in most environments; MongoDB's documentation is fond of pointing this out.)
If you're updating based on a column that isn't indexed: yeah, that is WAY slower than inserting. This should be infrequent in any app. But mind you, if you DO have indexes on a table, they will speed up your updates but also slow down your inserts, because newly inserted rows have to write new index entries as well (as compared to updates, which re-use existing index entries instead of generating new ones). See here for more details on that in terms of how MySQL does it.
Finally, multi-threaded and multi-process environments can also turn this idea on its head, but I'm not going to get into that; it's a whole other can of worms. You can research your type of database and storage engine, and gauge your app's use of concurrency... or you can just ignore all that and use the most intuitive operation.

InnoDB row level locking performance - how many rows?

I just read a lot of stuff about MyISAM and InnoDB as I have to decide which type to use.
Row-level locking support was always mentioned for InnoDB. Of course, this only makes sense above a certain number of rows.
Roughly how many would that be?
EDIT: Apparently I mis-worded my question. I know what table locking and row locking mean, but I wondered when the difference actually matters.
If I have just 100 rows inserted per day, table locking would of course be more than enough, but for a case of, let's say, 100 rows per SECOND, I think InnoDB would be the better choice.
My question: does row locking also make sense at 10 rows per second, or 5 rows per second? When does this choice significantly affect performance?
It's not entirely clear what you're asking. Locking ensures that only one user attempts to modify a given row at any given time. Row-level locking means only the one row they're modifying is locked. The usual alternatives are to either lock the entire table for the duration of the modification, or else to lock some subset of the table. Row-level locking simply reduces that subset of the rows to the smallest number that still ensures integrity.
The idea is to allow one user to modify one thing without preventing other users from modifying other things. It's worth noting, however, that in some cases this can be something of a false economy, so to speak. A few databases support row-level locking, but make a row-level lock considerably more expensive than locking a larger part of the table -- enough more expensive that it can be counterproductive.
Edit: Your edit to the original post helps, but not really a lot. First of all, the sizes of rows and levels of hardware involved have a huge effect (inserting an 8-byte row onto a dozen striped 15K SAS hard drives is just a tad faster than inserting a one megabyte row onto a single consumer class hard drive).
Second, it's largely about the number of simultaneous users, so the pattern of insertion makes a big difference. 1000 rows inserted at 3 AM probably won't be noticed at all. 1000 rows inserted evenly throughout the day means a bit more (but probably only a bit). 1000 rows inserted as a batch right when 100 other users need data immediately might get somebody fired (especially if one of those 100 is the owner of the company).
MyISAM tables support concurrent inserts (aka no table lock for inserts). So if you meet the criteria, there's no problem:
http://dev.mysql.com/doc/refman/5.0/en/concurrent-inserts.html
So, like most things, the answer is "it depends". There is no bright line test. Only you can make the determination; we know nothing about your application/hardware/usage statistics/etc. and, by definition, can't know more about it than you do.

How can I fix this scaling issue with soft deleting items?

I have a database where most tables have a delete flag, so the system soft deletes items (they are no longer accessible except, for example, by admins).
What worries me is that in a few years, when the tables are much larger, the overall speed of the system is going to be reduced.
What can I do to counteract effects like that?
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without the deleted items?
So over time the deleted partition will grow large and the occasional fetches from it will be slow (and slower over time)
Would the speed difference be something I should worry about? Since I fetch most (if not all) data by some key value (some are searches but they can be slow for this setup)
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in another place, but from SQL's point of view the table remains the same.
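A minimal sketch of that idea in MySQL (the table is hypothetical); note that MySQL requires the partitioning column to be part of every unique key, including the primary key.

-- items is a made-up table; is_deleted is the soft-delete flag.
CREATE TABLE items (
    id         INT NOT NULL,
    payload    VARCHAR(255),
    is_deleted TINYINT NOT NULL DEFAULT 0,
    PRIMARY KEY (id, is_deleted)
)
PARTITION BY LIST (is_deleted) (
    PARTITION p_live    VALUES IN (0),
    PARTITION p_deleted VALUES IN (1)
);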
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. IF, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can do it manually. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks, or months, depending on your constraints and suspected usage patterns. Newer data then lives inside a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (e.g., most online banking applications let you browse transactions but show only the first 30 days of history unless you request otherwise).
Hopefully you can avoid having to go a step further, and sharding on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (eg, keeping your forum_posts on a separate machine than your sales_records), as it is FAR easier to setup and maintain. If you end up needing to shard on user_id, I suggest using google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.