I have a field in my database where, the majority of the time, case sensitivity doesn't matter.
However there are specific circumstances where this isn't the case.
So the options, as I see them, are to add the BINARY keyword to the query:
select id from table where BINARY fieldname = 'tesT'
or to alter the table to make it a case-sensitive field.
The question is: does using the BINARY keyword cause additional overhead on the query, or is this negligible?
(It would actually be used in the WHERE clause of an UPDATE statement.)
update table set field='tesT' where BINARY anotherfield='4Rtrtactually '
The short answer is that BINARY has been recommended in a similar SO question; whether the overhead is negligible, however, depends on several things.
Part one is how you define negligible. If you mean performance, define your threshold return time(s) and run the same selects with and without BINARY to see what return times you get (these need to be cases where case won't matter). Follow that up with an EXPLAIN on each to root out any difference.
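A minimal sketch of such a comparison, reusing the table and column names from the question as stand-ins:
EXPLAIN SELECT id FROM `table` WHERE BINARY fieldname = 'tesT';  -- cast applied to the column side; this usually prevents an index on fieldname from being used
EXPLAIN SELECT id FROM `table` WHERE fieldname = 'tesT';         -- plain comparison; can use an index on fieldname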
The second part is that it depends on how big the table is and how the WHERE-clause column is indexed. Generally speaking, the more rows there are, the more prominent any difference will be. So if you're looking at, say, ten records in the table, it's probably nothing, given you're doing a single-table update. With ten million records there's a better chance it will be noticeable.
The third part is based on Yann Neuhaus' comment in the MySQL 5.7 reference guide. It might depend on exactly where you put the BINARY keyword. Assuming his findings are accurate, putting BINARY on the constant side of the equals sign might be faster (or at least different). So test a rewrite of your update to:
update table set field='tesT' where anotherfield = BINARY '4Rtrtactually '
To be honest I have my doubts about this third part, but I've seen weirder MySQL magic. I'm kind of curious what the results would be if you test it out for your scenario.
The final part is what type of collation you are using. Based on your question it seems you are using a case-insensitive collation. Depending on how you use this data, you might want to look into a case-sensitive or binary collation. This SO question was educational for me on this subject.
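A sketch of the collation route, assuming a VARCHAR column and the utf8mb4 character set (adjust the type and charset to match your actual schema):
ALTER TABLE `table` MODIFY fieldname VARCHAR(255)
  CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;  -- a binary collation makes every comparison on this column case-sensitive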
Related
Suppose I have a database with several columns. In each column there are lots of values that are often similar.
For example I can have a column with the name "Description" and a value could be "This is the description for the measurement". This description can occur up to 1000000 times in this column.
My question is not how I could optimize the design of this database, but how a database handles such redundant values. Are these redundant values stored as efficiently as they would be with a perfect design (with respect to the total size of the database)? If so, how are the values compressed?
The only correct answer is: it depends on the database and the configuration, because there is no silver bullet for this one. Some databases do store each distinct value of a column only once (some column stores and the like), but technically there is no necessity to do or not do this.
In some databases you can let the DBMS propose optimizations, and in such a case it could propose an ENUM field that holds only the existing values, which reduces each string to an id that references it. This "optimization" comes at a price: for example, when you want to add a new value to the description field, you have to adapt the ENUM definition.
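For illustration only (the table name and value list here are hypothetical), such a proposal would look roughly like:
ALTER TABLE measurements
  MODIFY Description ENUM(
    'This is the description for the measurement',
    'Some other recurring description'
  ) NOT NULL;
-- Each row now stores a small integer internally; adding a new
-- description later means altering the ENUM definition again.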
Depending on the actual use case, those optimizations are worth nothing or are even a show stopper, for example when the data changes very often (inserts or updates). The DBMS would then spend more time managing uniqueness/duplicates than actually processing queries.
On the question of compression: that also depends on the configuration and the database system, I guess, and on the field type. Text data can be compressed, and for non-indexed text fields there should be almost no drawback to using a simple compression algorithm. Which algorithm depends on the DBMS and configuration, I suspect.
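In MySQL/InnoDB, for instance, one form of this is transparent page compression via the row format (this is just one system's mechanism and requires innodb_file_per_table; the table name is hypothetical):
ALTER TABLE measurements ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;  -- pages are compressed/decompressed transparently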
Unless you become more specific, there is no more specific answer, I believe.
I attended an interesting job interview recently, where I was asked about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) which contains only those scalars. My answer was accepted, with a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
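The rewrite I had in mind looks roughly like this (table and column names are hypothetical):
CREATE TEMPORARY TABLE wanted_ids (id INT PRIMARY KEY);
INSERT INTO wanted_ids VALUES (1), (2), (3);   -- ... thousands of values, possibly bulk-loaded
SELECT t.*
FROM big_table t
JOIN wanted_ids w ON w.id = t.id;              -- instead of WHERE t.id IN (1, 2, 3, ...)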
But when I walked out, I started to have doubts. The condition seemed too trivial and too widely used for modern RDBMSs not to be able to optimize it. So I started digging.
PostgreSQL:
It seems that PostgreSQL parses a scalar IN() construction into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I'd expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't care much about test purity and Postgres was running under Vagrant, so I might be wrong.
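The kind of check I mean looks like this (hypothetical table; with an index on id, the plan typically shows a single index scan whose condition is "id = ANY ('{...}')" rather than a join node):
EXPLAIN ANALYZE
SELECT * FROM big_table WHERE id IN (1, 2, 3, 4, 5);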
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that is a performance match for the JOIN approach, I think. I didn't do any tests since I have no experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN () treatment. I didn't find any proof of the opposite either, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. No other information was found.
So I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where the list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
I think in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, a database could be unable to prepare an IN(?) query, which would lead to reparsing it again and again (which may kill performance). Actually, I never tried it, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database-specific. However, the MySQL documentation is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else
returns 0. If all values are constants, they are evaluated according
to the type of expr and sorted. The search for the item then is done
using a binary search. This means IN is very quick if the IN value
list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such a case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In each case this results in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as we have distinct numbers of IN values: IN (?), IN (?,?), ...).
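In MySQL syntax, for example, each distinct placeholder count is a separate statement that has to be parsed and planned on its own (t is a hypothetical table):
PREPARE in2 FROM 'SELECT id FROM t WHERE id IN (?, ?)';
PREPARE in3 FROM 'SELECT id FROM t WHERE id IN (?, ?, ?)';  -- a different statement, hard-parsed again
SET @a = 1, @b = 2;
EXECUTE in2 USING @a, @b;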
I would either load these values into a table and use a join, as you mentioned (unless the loading itself is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is larger, I would consider using EXISTS (select 1 from mytable m where m.key = x.key) or EXISTS (select x from foo(params)) instead of IN. In such cases EXISTS gives better performance than IN.
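A fuller sketch of that rewrite, with hypothetical table names:
-- IN over a subquery (or a long literal list):
SELECT o.* FROM orders o WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.active = 1);
-- Equivalent EXISTS form, which can stop at the first match per outer row:
SELECT o.* FROM orders o WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.active = 1);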
I was asked this question a few days back in an interview. What is the reason for this slowness?
According to the documentation, in with a list of constants implements a binary search:
The search for the item then is done using a binary search. This
means IN is very quick if the IN value list consists entirely of
constants.
It is not explicitly stated that NOT IN is handled the same way, but it is reasonable to assume based on this:
expr NOT IN (value,...)
This is the same as NOT (expr IN (value,...)).
That would leave two possible differences in performance. The first would be index usage. Because NOT IN is documented as essentially using IN, I would expect the index optimizations to be the same. This is an expectation, but it might not be correct.
The second is that there are fewer comparisons with the NOT IN than with the IN. This would lead to the NOT IN being faster -- but the difference is less than microscopic given the binary search method used by MySQL. In other words, this should not make a difference.
Finally, if you are using a subquery, then all bets are off. The NOT IN has to process the entire list to check if the value is not really in the list. IN can stop processing at the first match. Under some circumstances, this could have an impact on performance.
It is up to your data.
And I found another answer for you:
Use "IN" as it will most likely make the DBMS use an index on the
corresponding column.
"NOT IN" could in theory also be translated into an index usage, but
in a more complicated way which DBMS might not "spend overhead time"
using.
From : https://stackoverflow.com/a/16892027/3444315
It's because NOT IN is simply NOT (IN). It's always one more step than IN: NOT IN takes the result of IN and then negates it.
I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, ordering by any single field, with any number of qualifiers. There's a big slowdown in the sort, which generally takes several seconds, sometimes as much as 7-10 seconds.
An example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing which fields are going to be sorted on, I'd have to index all 200 columns - and what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to each column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes, most of them would probably never be used while still generating quite a lot of overhead (see chapter 7.3 in the link below for details). What you could do, however, is try to identify which columns appear most often in WHERE clauses, and index those.
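For instance, if b and c from your example query turn out to be the most common filter columns, something like this would be the starting point:
CREATE INDEX idx_b_c ON `table` (b, c);  -- composite index for the common "WHERE b = ... AND c = ..." filters
CREATE INDEX idx_a ON `table` (a);       -- supports filtering on a and "ORDER BY a"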
In the long run, however, you will probably need to find a way to rework your data structure into something more manageable, because as it is now it has the smell of 'spreadsheet turned into database', which is not a nice smell.
I've read about sort_buffer_size but it seems like everything I read
conflicts with the last thing I read - is it advisable to increase
this value, or any of the other similar values (read_buffer_size,
etc)?
In general the answer is yes. However, the actual details depend on your hardware, OS and which storage engine you use. See chapter 7.11 (especially 7.11.4) in the link below.
Also, the primary identifier is a crazy pattern they came up with in
the nineties.[...] I wonder if there's anything I can do to speed up
the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, into which you could store numeric values that map to the PK order (precalculated from the stored procedure you're using).
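A sketch, assuming the existing sort logic can be wrapped in a stored function (the table, column and function names here are hypothetical stand-ins):
ALTER TABLE records ADD COLUMN primarySortOrder INT;
UPDATE records SET primarySortOrder = compute_sort_key(pk_column);  -- compute_sort_key wraps your existing concatenation/substring logic
CREATE INDEX idx_sort ON records (primarySortOrder);
-- The default ordering then becomes ... ORDER BY primarySortOrder LIMIT 25, which can use the index.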
And the link you've been waiting for: Chapter 7 from the MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.
Please explain which of the following queries will be faster in MySQL:
SELECT * FROM `userstatus` where BINARY Name = 'Raja'
[OR]
SELECT * FROM `userstatus` where Name = 'raja'
The db entry for the Name field is 'Raja'.
I have 10000 records in my db; I tried with an "explain" query, but both show the same execution time.
Your question does not make sense.
The collation of a column determines the layout of the index and whether tests will be case-sensitive or not.
If you cast a column, the cast will take time.
So logically the uncast operation should be faster....
However, if the cast makes it find fewer rows, then the cast operation will be faster, or the other way round.
This of course changes the whole problem and makes the comparison invalid.
A cast to BINARY makes the comparison case-sensitive, changing the nature of the test and very probably the number of hits.
My advice
Never worry about the speed of collations; the percentages are so small it is never worth bothering about.
The speed penalty from using select * (a big no-no) will far outweigh the collation issues.
Start with putting in an index. That's a factor 10,000 speedup with a million rows.
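A sketch of that first step for the queries in the question:
CREATE INDEX idx_name ON userstatus (Name);  -- lets the case-insensitive form "WHERE Name = 'raja'" do an index lookup
-- Note: "WHERE BINARY Name = ..." applies a cast to the column, which typically prevents this index from being used.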
Assuming that the Name field is a simple latin-1 text type, and there's no index on it, then the BINARY version of the query will be faster. By default, MySQL does case-insensitive comparisons, which means the field values and the value you're comparing against both get smashed into a single case (either all-upper or all-lower) and then compared. Doing a binary comparison skips the case conversion and does a raw 1:1 numeric comparison of each character value, making it a case-sensitive comparison.
Of course, that's just one very specific scenario, and it's unlikely to be met in your case. Too many other factors affect this, especially the presence of an index.