Django startswith vs endswith performance on MySQL

Let's say I have the following model:
class Person(models.Model):
    name = models.CharField(max_length=20, primary_key=True)
So I would have objects in the database like
Person.objects.create(name='alex white')
Person.objects.create(name='alex chen')
Person.objects.create(name='tony white')
I could then query for all users whose first name is alex or whose last name is white by doing the following:
all_alex = Person.objects.filter(name__startswith='alex')
all_white = Person.objects.filter(name__endswith='white')
I do not know how Django implements this under the hood, but I am going to guess it is with a SQL LIKE 'alex%' or LIKE '%white'
However, according to the MySQL index documentation, the primary key index can only be used (as opposed to a full table scan) if the % appears at the end of the LIKE pattern.
Does that mean that, as the database grows, startswith will be viable - whereas endswith will not be since it will resort to full table scans?
Am I correct or did I go wrong somewhere? Keep in mind these are not facts but just my deductions that I made from general assumptions - hence why I am asking for confirmation.
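One quick way to confirm this is to ask MySQL for the query plan with EXPLAIN; a minimal sketch, assuming the table is called person (Django's default name would actually include the app label, e.g. myapp_person):
EXPLAIN SELECT * FROM person WHERE name LIKE 'alex%';   -- can use the primary key index (range access)
EXPLAIN SELECT * FROM person WHERE name LIKE '%white';  -- leading % typically means a full table scan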

Assuming you want AND -- that is only Alex White and not Alex Chen or Tony White, ...
Even better (assuming there is an index starting with name) is
SELECT ...
WHERE name LIKE 'Alex%White'
If Django can't generate that, then it is getting in the way of efficient use of MySQL.
This construct will scan all the names starting with alex, further filtering on the rest of the expression.
If you do want OR (and 3 names), then you are stuck with
SELECT ...
WHERE ( name LIKE 'Alex%'
OR name LIKE '%White' )
And there is no choice but to scan all the names.
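Even rewriting the OR as a UNION does not help much; the prefix half can use the index, but the suffix half still has to scan (a sketch, following the SELECT ... shorthand above):
SELECT ...
WHERE name LIKE 'Alex%'
UNION
SELECT ...
WHERE name LIKE '%White'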
In some situations, perhaps this one, FULLTEXT would be better:
FULLTEXT(name) -- This index is needed for the following:
SELECT ...
WHERE MATCH(name) AGAINST('Alex White' IN BOOLEAN MODE) -- for OR
SELECT ...
WHERE MATCH(name) AGAINST('+Alex +White' IN BOOLEAN MODE) -- for AND
(Again, I don't know the Django capabilities.)
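For reference, adding such an index to an existing MySQL table might look like this (the table name is an assumption; Django's default would include the app label):
ALTER TABLE myapp_person ADD FULLTEXT INDEX ft_name (name);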

Yes, your understanding is correct.
select *
from foo
where bar like 'text1%' and bar like '%text2'
is not necessarily optimal. This could be an improvement:
select *
from (select *
      from foo
      where foo.bar like 'text1%') t
where t.bar like '%text2'
You need to make measurements to check whether this is better. If it is, the cause is that in the inner query you use an index, while in the outer query you do not use an index, but the set is prefiltered by the first query, and therefore you have a much smaller set to query.
I am not at all a Django expert, so my answer might be wrong, but I believe chaining your filters would be helpful if filter actually executes the query. If that is the case, then you can use the optimization described above. If filter just prepares a query and chaining filters results in a single query different from the one above, then I recommend using hand-written MySQL. However, if you do not have performance issues yet, it is premature to optimize, since you cannot really measure the performance you gained.
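For reference, chained filter() calls on the same queryset are normally combined into a single query with the conditions ANDed, rather than nested, so the resulting SQL is roughly of this shape (illustrative, not the exact statement Django emits):
SELECT ... FROM myapp_person
WHERE name LIKE 'alex%' AND name LIKE '%white';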

Related

Which is better: a prepared statement with LIKE '%' or generating the query on the fly?

I'm implementing a typical list Rest endpoint /items with some optional filtering URI query parameters like ?attr=val&attr2=val etc..
The Rest server is backed with Go/MySQL
Regarding query performance, I wonder if it is better to create a prepared statement which makes use of LIKE clauses:
SELECT cols from items WHERE attr LIKE ? and attr2 LIKE ?;
and simply set the value to '%' for attributes not filled in by the user,
or to generate the query on the fly based on the given attributes?
Example with no attributes:
SELECT cols from items;
Example with one attribute:
SELECT cols from items where attr LIKE 'val';
More generally, I wonder if using LIKE '%' has a performance cost (considering indexes are configured properly on these columns), and whether these performance costs in a prepared statement outweigh the cost of generating the query on the fly (parsing, etc.).
Note: since the number of distinct filtering attributes is fairly large, it is not feasible to generate a specific prepared statement for every possible combination of attributes.
There are three parts involved when you are doing a query.
Parsing the query.
Optimizing the query based on the query structure and the used parameter values.
Doing the query using the optimized query.
The query optimizer technically could optimize LIKE '%' away to something not using LIKE, but it seems as if MySQL doesn't do that (but I'm not 100% sure about that).
For booleans, the query optimizer however does such optimizations.
If you do:
SELECT * FROM test WHERE (attr='val' OR TRUE) AND (attr2='val' OR FALSE);
The resulting query will be:
SELECT * FROM test WHERE attr2='val';
Because (attr='val' OR TRUE) will always be TRUE, and OR FALSE doesn't do anything.
So you could always have something like:
SELECT * FROM test WHERE (attr=#attr OR !#useAttr) AND (attr2=#attr2 OR !#useAttr2);
And enable/disable the usage of the corresponding filter using a boolean.
Or something like this if the value is null if it is not set:
SELECT * FROM test WHERE (attr=? OR ISNULL(?)) AND (attr2=? OR ISNULL(?));
And call the query like that stmnt.execute(attr, attr, attr2, attr2).
I recommend you spend a little effort building the query based on the data provided. That is, construct the WHERE clause with only the items that the user wants to search on.
This may or may not actually speed things up, but it does help you think about your UI and could lead to better UI designs.
When a column could be, say, 0 or NULL and they mean the same thing, performance suffers by using OR. Instead, rethink the schema -- pick either 0 or NULL, not both, as an indicator of whatever.
If you tend to have LIKE '%xyz%', consider whether a FULLTEXT index would be more appropriate (and a lot faster).
(Meanwhile, #t.niese discusses optimization nicely.)
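To make the FULLTEXT suggestion concrete, a sketch against the question's items table (keep in mind FULLTEXT matches whole words, not arbitrary substrings):
ALTER TABLE items ADD FULLTEXT INDEX ft_attr (attr);

SELECT cols FROM items
WHERE MATCH(attr) AGAINST('+someword' IN BOOLEAN MODE);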

SQL - speeding up performance

I have merged several CSV imports into one table - but this query takes about 70 seconds to complete. Is there a way of rewriting it to speed it up?
SELECT
`table_merged`.Property AS 'Property',
AVG(`table_merged`.`Value`) AS 'Average Asking Price'
FROM
`table_merged`
WHERE
`table_merged`.`Area` LIKE '%NW1%'
AND `table_merged`.`Property` LIKE '%2-bed flat%'
AND `table_merged`.`Year` = '2016'
GROUP BY
`table_merged`.Property, `table_merged`.`Year`
ORDER BY
`table_merged`.Property ASC
Output is
| Property | Average Asking Price
| 2-bed flat | 751427.1935581862
There is not a lot that can be done with the query as it is. Your biggest performance hit right now is your use of %...% style LIKE statements. As Gordon has already mentioned in the comments, this basically removes MySQL's ability to use indexes.
The best solution here would be to slightly alter your schema. I would suggest creating a table that stores the type of property (like 2-bed flat) and then some kind of code (could just be an int). Searching on a code without the LIKE statement would give you the ability to use an index, and if you absolutely have to search on a like string, you can do it as a subselect on a much smaller table.
Edit for further clarification:
You're searching for any string of text that matches %2-bed flat%. My assumption is that you're doing this because you have strings like cute 2-bed flat, 2-bed flat near the water, or other random marketing things. There is no way to optimize your LIKE statements. They're slow, and that's an unfortunate fact of life.
I would add a column to your current tables. If the string matches the string %2-bed flat% then give it a value of something like 2bf. While you're at it, you can do this with any other search strings you may use. Put an index on this column.
Once that's done, instead of using an intensive LIKE statement you can just do an indexed search on the new column.
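A sketch of that change (column and index names are illustrative):
ALTER TABLE table_merged ADD COLUMN property_code VARCHAR(10);

UPDATE table_merged
SET property_code = '2bf'
WHERE `Property` LIKE '%2-bed flat%';

CREATE INDEX ix_property_code ON table_merged (property_code);

-- The original query can then filter on the indexed code instead of the slow LIKE:
SELECT `Property`, AVG(`Value`) AS 'Average Asking Price'
FROM table_merged
WHERE property_code = '2bf'
  AND `Year` = '2016'
  AND `Area` LIKE '%NW1%'
GROUP BY `Property`, `Year`;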

How can I improve the response time on my query when using ibatis/spring/mysql?

I have a database with 2 tables, and I must run a simple query:
select *
from tableA,tableB
where tableA.user = tableB.user
and tablea.email LIKE "%USER_INPUT%"
where USER_INPUT is a part of the tablea.email string that has to match.
The problem:
The table will hold about 10 million records and the query is taking a while. The iBatis cache (as far as I know) will only be used if the previous query looks the same. Example: for USER_INPUT = john_doe, if the second query is john_doe again the cache will work, but if it is john_do it will not (that is, as I said, as far as I know).
Currently, the tableA structure is like this:
id int(11) not_null auto_increment
email varchar(255)not_null
many more fields...
I don't know if email, a varchar of 255, might be too long and take more time because of that. If I decreased it to 150 characters, for example, would the response time be shorter?
Right now the query is taking too long... I know I could add more memory to the servers, but I would like to know if there is another way to improve this query.
tableA and tableB have about 30 fields each and they are related by the ID on a relational schema.
I'm going to create an index on tableA.email.
Ideas?
I'd recommend running an execution plan for that query in your DB. That'll tell you how the DB plans to execute your query, and what you're looking for is something like a "full table scan". I'd guess you'll see just that, due to the LIKE clause, and an index on the email field won't help that part.
If you need to search by substrings of email addresses, you might want to consider the granularity of how you store your data. For example, instead of storing email addresses in a single field as usual, you could split them into two fields (or maybe more), where everything before the '@' is in one field and the domain name is in another. Then you could search by either component without needing a LIKE, and indexes would speed things up significantly. For example, you could do this to search:
WHERE tableA.email_username = 'USER_INPUT' OR tableA.email_domain = 'USER_INPUT'
Of course you then have to concatenate the two fields to recreate the email address, but I think iBatis will let you add a method to your data object to do that in a single place instead of all over your app (been a while since I used iBatis, though, so I could be wrong).
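A sketch of that split, assuming MySQL (column and index names are illustrative; SUBSTRING_INDEX splits on the '@'):
ALTER TABLE tableA
    ADD COLUMN email_username VARCHAR(255),
    ADD COLUMN email_domain VARCHAR(255);

UPDATE tableA
SET email_username = SUBSTRING_INDEX(email, '@', 1),
    email_domain   = SUBSTRING_INDEX(email, '@', -1);

CREATE INDEX ix_email_username ON tableA (email_username);
CREATE INDEX ix_email_domain ON tableA (email_domain);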
MySQL cannot utilize indexes on LIKE queries where the wildcard precedes the search string (%query).
You can try a Full-Text search instead. You'll have to add a FULLTEXT index to your email column:
ALTER TABLE tablea
ADD FULLTEXT(email);
From there you can revise your query
SELECT *
FROM tableA,tableB
WHERE tableA.user = tableB.user
AND MATCH (tablea.email) AGAINST ('+USER_INPUT' IN BOOLEAN MODE)
You'll have to make sure you can use full text indexes.
Full-text indexes can be used only with MyISAM tables. (In MySQL 5.6 and up, they can also be used with InnoDB tables.)
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

Can Postgres use a function in a partial index where clause?

I have a large Postgres table where I want a partial index on one of the two columns indexed. Can I use a Postgres function in the WHERE clause of a partial index, and if so, how do I get the select query to utilize that partial index?
Example Scenario
The first column is "magazine", the second column is "volume", and the third column is "issue". All the magazines can have the same "volume" and "issue" numbers, but I want the index to only contain the two most recent volumes for each magazine. This is because a magazine could be older than others and have higher volume numbers than younger magazines.
Two immutable strict functions were created to determine the current and previous year's volumes for a magazine: f_current_volume('gq') and f_previous_volume('gq'). Note: the current/past volume # only changes once per year.
I tried creating a partial index with the functions; however, when using EXPLAIN on a query it only does a seq scan, even for a current-volume magazine.
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));
-- Both these do seq scans.
select * from issue where magazine = 'gq' and volume = 100;
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
What am I doing wrong? And if it is possible, why does it need to be done that way for Postgres to use the index?
-- UPDATE: 2013-06-17, the following surprisingly used the index.
-- Why would using a field name rather than value allow the index to be used?
select * from issue where magazine = 'gq' and volume = f_current_volume(magazine);
Immutability and 'current'
If your f_current_volume function ever changes its behaviour - as is implied by its name and the presence of an f_previous_volume function - then the database is free to return completely bogus results.
PostgreSQL would've refused to let you create the index, complaining that you can only use IMMUTABLE functions. The thing is, marking a function IMMUTABLE means that you are telling PostgreSQL something about the function's behaviour, as per the documentation. You're saying "I promise this function's results won't change, feel free to make assumptions on that basis."
One of the biggest assumptions made is when building an index. If the function returns different outputs for different inputs on multiple invocations, things go splat. Or possibly boom if you're unlucky. In theory you can kind-of get away with changing an immutable function by REINDEXing everything, but the only really safe way is to DROP every index that uses it, DROP the function, re-create the function with its new definition and re-create the indexes.
That can actually be really useful to do if you have something that changes only infrequently, but you really have two different immutable functions at different points in time that just happen to have the same name.
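For concreteness, that drop-and-recreate cycle might look like this (a sketch based on the question's names; the argument type and the new function body are assumptions):
DROP INDEX ix_issue_magazine_volume;
DROP FUNCTION f_current_volume(text);

CREATE FUNCTION f_current_volume(text) RETURNS integer AS $$
    SELECT 101;   -- illustrative new definition for the new year
$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE (magazine, volume)
WHERE volume IN (f_current_volume(magazine), f_previous_volume(magazine));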
Partial index matching
PostgreSQL's partial index matching is pretty dumb - but, as I found when writing test cases for this, a lot smarter than it used to be. It ignores a dummy OR true. It uses an index on WHERE (a%100=0 OR a%1000=0) for a WHERE a = 100 query. It even got it with a non-inline-able identity function:
regress=> CREATE TABLE partial AS SELECT x AS a, x AS b FROM generate_series(1,10000) x;
regress=> CREATE OR REPLACE FUNCTION identity(integer)
RETURNS integer AS $$
SELECT $1;
$$ LANGUAGE sql IMMUTABLE STRICT;
regress=> CREATE INDEX partial_b_fn_idx
ON partial(b) WHERE (identity(b) % 1000 = 0);
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
---------------------------------------------------------------------------------------
Index Only Scan using partial_b_fn_idx on partial (cost=0.00..13.05 rows=50 width=4)
(1 row)
However, it was unable to prove the IN clause match, eg:
regress=> DROP INDEX partial_b_fn_idx;
regress=> CREATE INDEX partial_b_fn_in_idx ON partial(b)
WHERE (b IN (identity(b), 1));
regress=> EXPLAIN SELECT b FROM partial WHERE b % 1000 = 0;
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on partial (cost=10000000000.00..10000000195.00 rows=50 width=4)
So my advice? Rewrite IN as an OR list:
CREATE INDEX ix_issue_magazine_volume ON issue USING BTREE ( magazine, volume )
WHERE (volume = f_current_volume(magazine) OR volume = f_previous_volume(magazine));
... and on a current version it might just work, so long as you keep the immutability rules outlined above in mind. Well, the second version:
select * from issue where magazine = 'gq' and volume = f_current_volume('gq');
might. Update: No, it won't; for it to be used, Pg would have to recognise that magazine='gq' and realise that f_current_volume('gq') was therefore equivalent to f_current_volume(magazine). It doesn't attempt to prove equivalences on that level with partial index matching, so as you've noted in your update you have to write f_current_volume(magazine) directly. I should've spotted that. In theory PostgreSQL could use the index with the second query if the planner was smart enough, but I'm not sure how you'd go about efficiently looking for places where a substitution like this would be worthwhile.
The first example, volume = 100, will never use the index, since at query planning time PostgreSQL has no idea that f_current_volume('gq') will evaluate to 100. You could add an OR clause, OR volume = 100, to your partial index WHERE clause and PostgreSQL would figure it out then, though.
First off, I'd like to volunteer a wild guess, because you're making it sound like your f_current_volume() function calculates something based on a separate table.
If so, be wary, because this means your function is volatile, in that it needs to be recalculated on every call (a concurrent transaction might be inserting, updating or deleting rows). Postgres won't allow you to index those, and I presume you worked around this by declaring the function immutable. Not only is this incorrect, but you also run into the issue of the index containing garbage, because the function gets evaluated as you edit the row, rather than at run time. What you'd probably want instead -- again, if my guess is correct -- is to store and maintain the totals in the table itself using triggers.
Regarding your specific question, partial indexes need to have their WHERE condition met in the query to prompt Postgres to use them. I'm quite sure that Postgres is smart enough to identify that e.g. 10 is between 5 and 15 and use a partial index with that clause. I very much doubt that it would know that f_current_volume('gq') is 100 in your case, however, considering the above-mentioned caveat.
You could try this query and see if the index gets used:
select *
from issue
where magazine = 'gq'
and volume in (f_current_volume('gq'), f_previous_volume('gq'));
(Though again, if your function is in fact volatile, you'll get a seq scan as well.)

Selecting all fields (but one) instead of using the asterisk (*) decreases running time by 10 times [duplicate]

I've Googled this question and can't seem to find a consistent opinion, or many opinions that are based on solid data. I simply would like to know if using the wildcard in a SQL SELECT statement incurs additional overhead than calling each item out individually. I have compared the execution plans of both in several different test queries, and it seems that the estimates always read the same. Is it possible that some overhead is incurred elsewhere, or are they truly handled identically?
What I am referring to specifically:
SELECT *
vs.
SELECT item1, item2, etc.
SELECT * FROM...
and
SELECT every, column, list, ... FROM...
will perform the same because both are an unoptimised scan
The difference is:
the extra lookup in sys.columns to resolve *
the contract/signature change when the table schema changes
inability to create a covering index. In fact, no tuning options at all, really
have to refresh views if they are not schemabound
cannot index or schemabind a view using *
...and other stuff
Other SO questions on the same subject...
What is the reason not to use select * ?
Is there a difference betweeen Select * and Select list each col
SQL Query Question - Select * from view or Select col1,col2…from view
“select * from table” vs “select colA,colB,etc from table” interesting behaviour in SqlServer2005
Do you mean select * from ... instead of select col1, col2, col3 from ...?
I think it's always better to name the column and retrieve the minimal amount of information, because
your code will work independently of the physical order of the columns in the db. The column order should not impact your application, but it will if you use *. That can be dangerous in case of a db migration, etc.
if you name the columns, the DBMS can further optimize the execution. For instance, if there is an index that contains all the data you are interested in, the table will not be accessed at all (see the sketch below).
If you mean something else with "wildcard", just ignore my answer...
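To illustrate the covering-index point: with an index that includes every column the query selects, the engine can answer it from the index alone (table, column, and index names here are illustrative):
CREATE INDEX ix_person_name_city ON person (name, city);

SELECT name, city
FROM person
WHERE name = 'alex white';   -- can be satisfied entirely from the index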
EDIT: If you are talking about the asterisk wild card as in Select * From ... then see other responses...
If you are talking about wildcards in predicate clauses or other query expressions using the LIKE operator (_, %), as described below, then:
This has to do with whether using the wildcard affects whether the SQL is "SARGable" or not. SARGable (Search-ARGument-able) means whether or not the query's search or sort arguments can be used as entry parameters to an existing index. If you prepend the wildcard to the beginning of an argument:
Where Name Like '%ing'
Then there is no way to traverse an index on the name field to find the nodes that end in 'ing'.
If, on the other hand, you append the wildcard to the end,
Where Name like 'Donald%'
then the optimizer can still use an index on the name column, and the query is still SARG-able
If what you call the SQL wildcard is *, it does not imply a performance overhead by itself. However, if the table is extended you could find yourself retrieving fields you don't need.
In general, not being specific about the fields you select or insert is a bad habit.
Consider
insert into mytable values(1,2)
What happens if the table is extended to three fields?
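For example (column names are illustrative), the positional insert breaks once a third column exists, while an explicit column list keeps working:
ALTER TABLE mytable ADD COLUMN c3 INT;

INSERT INTO mytable VALUES (1, 2);            -- now fails: column count does not match
INSERT INTO mytable (c1, c2) VALUES (1, 2);   -- still works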
It may not be more work from an execution plan standpoint. But if you're fetching columns you don't actually need, that's additional network bandwidth being used between the database and your application. Also if you're using a high-level client API that performs some work on the returned data (for example, Perl's selectall_hashref) then those extra columns will impose performance cost on the client side. How much? Depends.