SQL - speeding up performance - mysql

I have merged several CSV imports into one table, but this query takes about 70 seconds to complete. Is there a way of rewriting it to speed it up?
SELECT
    `table_merged`.`Property` AS 'Property',
    AVG(`table_merged`.`Value`) AS 'Average Asking Price'
FROM
    `table_merged`
WHERE
    `table_merged`.`Area` LIKE '%NW1%'
    AND `table_merged`.`Property` LIKE '%2-bed flat%'
    AND `table_merged`.`Year` = '2016'
GROUP BY
    `table_merged`.`Property`, `table_merged`.`Year`
ORDER BY
    `table_merged`.`Property` ASC
Output is:

| Property   | Average Asking Price |
| 2-bed flat | 751427.1935581862    |

There is not a lot that can be done with the query as it is. Your biggest performance hit right now is your use of %...% style LIKE statements. As Gordon has already mentioned in the comments, this basically removes MySQL's ability to use indexes.
The best solution here would be to slightly alter your schema. I would suggest creating a table that stores the type of property (like 2-bed flat) and then some kind of code (could just be an int). Searching on a code without the LIKE statement would give you the ability to use an index, and if you absolutely have to search on a like string, you can do it as a subselect on a much smaller table.
Edit for further clarification:
You're searching for any string of text that matches %2-bed flat%. My assumption is that you're doing this because you have strings like cute 2-bed flat, 2-bed flat near the water, or other random marketing phrases. There is no way to optimize these LIKE statements. They're slow, and that's an unfortunate fact of life.
I would add a column to your current tables. If the string matches the string %2-bed flat% then give it a value of something like 2bf. While you're at it, you can do this with any other search strings you may use. Put an index on this column.
Once that's done, instead of using an intensive LIKE statement you can just do an indexed search on the new column.
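A minimal sketch of that change, with hypothetical names (property_code, the code 2bf, and the index name are all made up, and Area could get the same treatment):

ALTER TABLE `table_merged` ADD COLUMN `property_code` VARCHAR(8);

-- Backfill once; this single slow LIKE pass replaces a slow LIKE on every search
UPDATE `table_merged`
SET    `property_code` = '2bf'
WHERE  `Property` LIKE '%2-bed flat%';

CREATE INDEX `idx_property_code` ON `table_merged` (`property_code`);

-- The search itself becomes an indexed equality lookup
SELECT `Property`, AVG(`Value`) AS 'Average Asking Price'
FROM   `table_merged`
WHERE  `property_code` = '2bf'
  AND  `Year` = '2016'
GROUP BY `Property`, `Year`;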

Related

Django startswith vs endswith performance on MySQL

Let's say I have the following model
class Person(models.Model):
    name = models.CharField(max_length=20, primary_key=True)
So I would have objects in the database like
Person.objects.create(name='alex white')
Person.objects.create(name='alex chen')
Person.objects.create(name='tony white')
I could then subsequently query for all users whose first name is alex or last name is white by doing the following
all_alex = Person.objects.filter(name__startswith='alex')
all_white = Person.objects.filter(name__endswith='white')
I do not know how Django implements this under the hood, but I am going to guess it is with a SQL LIKE 'alex%' or LIKE '%white'
However, according to the MySQL index documentation, the primary key index can only be used (as opposed to a full table scan) if the % appears at the end of the LIKE pattern.
Does that mean that, as the database grows, startswith will be viable - whereas endswith will not be since it will resort to full table scans?
Am I correct or did I go wrong somewhere? Keep in mind these are not facts but just my deductions that I made from general assumptions - hence why I am asking for confirmation.
Assuming you want AND -- that is only Alex White and not Alex Chen or Tony White, ...
Even better (assuming there is an index starting with name) is
SELECT ...
WHERE name LIKE 'Alex%White'
If Django can't generate that, then it is getting in the way of efficient use of MySQL.
This construct will scan all the names starting with alex, further filtering on the rest of the expression.
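This is easy to confirm with EXPLAIN (a sketch; the actual Django table name would carry an app prefix, e.g. myapp_person):

EXPLAIN SELECT * FROM person WHERE name LIKE 'Alex%White';
-- Expected: type = range, key = PRIMARY, i.e. a range scan over the
-- index on name (the primary key here), not a full table scan.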
If you do want OR (and 3 names), then you are stuck with
SELECT ...
WHERE ( name LIKE 'Alex%'
     OR name LIKE '%White' )
And there is no choice but to scan all the names.
In some situations, perhaps this one, FULLTEXT would be better:
FULLTEXT(name) -- This index is needed for the following:
SELECT ...
WHERE MATCH(name) AGAINST('Alex White' IN BOOLEAN MODE) -- for OR
SELECT ...
WHERE MATCH(name) AGAINST('+Alex +White' IN BOOLEAN MODE) -- for AND
(Again, I don't know the Django capabilities.)
Yes, your understanding is correct.
select *
from foo
where bar like 'text1%' and bar like '%text2'
is not necessarily optimal. This could be an improvement:
select *
from (select *
      from foo
      where foo.bar like 'text1%') t
where t.bar like '%text2'
You need to make measurements to check whether this is better. If it is, the cause is that the inner query uses an index while the outer one does not; the outer query runs over a set already prefiltered by the inner one, so it has far fewer rows to examine.
I am not at all a Django expert, so my answer might be wrong, but I believe chaining your filters would help if filter actually executes a query each time. If that is the case, you can use the optimization described above. If filter merely prepares a query, and chained filters collapse into a single query different from the one above, then I recommend hand-written MySQL. However, if you do not yet have performance issues, optimizing now is premature, since you cannot measure how much performance you gained.

Disadvantages of quoting integers in a Mysql query?

I am curious about the disadvantage of quoting integers in MySQL queries.
For example
SELECT col1,col2,col3 FROM table WHERE col1='3';
VS
SELECT col1,col2,col3 FROM table WHERE col1= 3;
If there is a performance cost, what is the size of it and why does it occur? Are there any other disadvantages other than performance?
Thanks
Andrew
Edit: The reason for this question
1. Because I am curious and want to learn the difference
2. I am experimenting with a way of passing composite keys from my database around in my PHP code as pseudo-ID keys (PIKs). These PIKs are then used to target the record.
For example, given a primary key (AreaCode,Category,RecordDtm)
My PIK in the url would look like this:
index.php?action=hello&Id=20001,trvl,2010:10:10 17:10:45
And I would select this record like this:
$Id = $_POST['Id']; // equals 20001,trvl,2010:10:10 17:10:45
$sql = "SELECT AreaCode,Category,RecordDtm,OtherColumns.... FROM table WHERE (AreaCode,Category,RecordDtm) = ({$Id});
$mysqli->query($sql):
......and so on.
At this point the query won't work because of the datetime (which must be quoted), and it is open to SQL injection because I haven't escaped those values. Given that I won't always know how my PIKs are constructed, I would write a function that splits the Id PIK at the commas, cleans each part with real_escape_string, and puts it back together with the values quoted. For example:
$Id = "'20001','trvl','2010:10:10 17:10:45'"
Of course, in the function that breaks apart and cleans the Id I could check whether each value is a number. If it is a number, don't quote it; if it is anything but a number, quote it.
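For reference, the assembled query then compares the whole composite key using MySQL's row-constructor syntax (the table name records below is a stand-in, since the real one isn't shown):

SELECT AreaCode, Category, RecordDtm
FROM   records
WHERE  (AreaCode, Category, RecordDtm) = ('20001', 'trvl', '2010:10:10 17:10:45');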
The performance cost comes in whenever MySQL needs to do a type conversion from whatever you give it to the datatype of the column. So with your query
SELECT col1,col2,col3 FROM table WHERE col1='3';
If col1 is not a string type, MySQL needs to convert '3' to that type. This type of query isn't really a big deal, as the performance overhead of that conversion is negligible.
However, it becomes a problem when you do the same thing while, say, joining two tables that have several million rows each. If the columns in the ON clause are not the same datatype, then MySQL will have to convert several million rows every single time you run your query, and that is where the performance overhead comes in.
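A sketch of that join case with two hypothetical tables; the mismatched key types force a per-row conversion, which also keeps MySQL from using the index on the converted side:

CREATE TABLE orders (
    customer_id INT NOT NULL,
    KEY (customer_id)
);
CREATE TABLE customers (
    customer_id VARCHAR(10) NOT NULL,
    KEY (customer_id)
);

-- INT vs VARCHAR: every row compared must be cast first,
-- so this join cannot make full use of the indexes.
SELECT *
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id;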
Strings also have a different sort order from numbers.
Compare:
SELECT 312 < 41
(yields 0, because 312 numerically comes after 41)
to:
SELECT '312' < '41'
(yields 1, because '312' lexicographically comes before '41')
Depending on the way your query is built, using quotes might give wrong results or none at all.
Numbers should be used as such, so never use quotes unless you have a special reason to do so.
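The comparison direction matters for correctness too: when a string column meets an unquoted number, MySQL coerces the strings to numbers, which can match more rows than you expect. A quick illustration:

SELECT '3abc' = 3;   -- yields 1 (with a truncation warning)
SELECT '03' = 3;     -- yields 1: both sides become the number 3
SELECT '03' = '3';   -- yields 0: pure string comparison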
In my opinion, there is no performance/size cost in the case you have mentioned. Even if there is, it is negligible and won't affect your application as such.
It gives the wrong impression about the data type for the column. As an outsider, I assume the column in question is CHAR/VARCHAR & choose operations accordingly.
Otherwise MySQL, like most other databases, will implicitly convert the value to whatever the column data type is. There's no performance issue with this that I'm aware of but there's a risk that supplying a value that requires explicit conversion (using CAST or CONVERT) will trigger an error.

MySQL: One row or more?

I have content of various kinds, each identified by an ID, and a piece of content can have multiple types.
The question is: should I use multiple rows to store the multiple types, or should I use a single type field with the types separated by commas and parse them in PHP?
Multiple Rows

content_id | type
-----------+-----
         1 | 1
         1 | 2
         1 | 3

VS

Single Row

content_id | type
-----------+------
         1 | 1,2,3
EDIT
I'm looking for the faster option, not the easier one; please consider this. Performance is really important to me: I'm talking about a huge database with millions or tens of millions of rows.
I'd generally always recommend the "multiple rows" approach as it has several advantages:
You can use SQL to return, for example, the WHERE type=3 rows without any great difficulty, as you don't have to use WHERE type LIKE '%3%', which is less efficient
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when you store them as a comma separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is allow you to store the data, and obtain subsets of the data.
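A sketch of the multiple-row table, using names based on the question (the table name content_type is made up); the composite primary key rules out duplicate pairs and the extra index covers searches by type:

CREATE TABLE content_type (
    content_id INT NOT NULL,
    type       INT NOT NULL,
    PRIMARY KEY (content_id, type),  -- no duplicate pairs
    KEY idx_type (type)              -- fast WHERE type = ? lookups
);

-- Both directions are simple indexed queries:
SELECT type       FROM content_type WHERE content_id = 1;
SELECT content_id FROM content_type WHERE type = 3;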
I always use multiple rows. If you use single rows your data is hard to read and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs de-normalised design.
Usually I would recommend sticking to the "multiple rows" (normalised) style.
Although sometimes (for performance/storage reasons) people deliberately implement the "single row" (denormalised) style.
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; the main example is that reporting tends to be easier with some denormalization. So if your code is cleaner/performs better with the single row, then go for that. Otherwise the multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens if there are more than 11 types? The above query would find "12". Okay, you could say "where type like '%,2,%'". Except that doesn't work if 2 is the first or the last in the list. Even if you came up with a way to do it reliably, a LIKE search with an initial % means a sequential read of every record in the table, which is very slow.
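For completeness: MySQL's FIND_IN_SET() handles those comma edge cases reliably, but it still cannot use an index, so every row gets examined anyway:

SELECT content_id FROM mytable WHERE FIND_IN_SET('2', type) > 0;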
How big will you make the cram field? What if the string of types is too big to fit in your maximum?
Do you carry any data about the types? If you create a second table with key of "type" and, say, a description of that type, how will you join to that table? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you make it consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5" ? Or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" will not look equal but they are really equivalent. In either case, updating a type field now becomes a program rather than a single SQL statement.
If there is some trivial performance gain, it's just not worth it.

MySQL : Does a query searches inside the whole table?

1. So if I search for the word ball inside the toys table, where I have 5,000,000 entries, does it search through all 5 million?
I think the answer is yes, because how else would it know, but please let me know.
2. If yes: if I need more information from that table, isn't it more logical to query just once and work with the results?
An example
I have this table structure for example:
id | toy_name | state
Now I should query like this
mysql_query("SELECT * FROM toys WHERE STATE = 1");
But isn't it more logical to query for the whole table
mysql_query("SELECT * FROM toys"); and then do this: if($query['state'] == 1)?
3. And something else: if I put an ORDER BY id LIMIT 5 in the mysql_query, will it search through the 5 million entries or just the last 5?
Thanks for the answers.
Yes, unless you have a LIMIT clause it will look through all the rows. It will do a table scan unless it can use an index.
You should use a query with a WHERE clause here, not filter the results in PHP. Your RDBMS is designed to be able to do this kind of thing efficiently. Only when you need to do complex processing of the data is it more appropriate to load a resultset into PHP and do it there.
With the LIMIT 5, the RDBMS will look through the table until it has found you your five rows, and then it will stop looking. So, all I can say for sure is, it will look at between 5 and 5 million rows!
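A quick way to check what the optimizer will actually do is EXPLAIN (a sketch; the index name idx_state is made up):

CREATE INDEX idx_state ON toys (state);

EXPLAIN SELECT * FROM toys WHERE state = 1;
-- With the index: type = ref, and rows shows roughly the number of
-- matching rows. Without it: type = ALL and rows is near 5,000,000.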
Read this about indexes :-)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
It makes it uber-fast :-)
A full table scan happens only if there are no matching indexes, and it is indeed a very slow operation.
Sorting is also accelerated by indexes.
And for #2: this is slow because the transfer rate from MySQL to PHP is slow, and MySQL is MUCH faster at doing the filtering.
For your #1 question: Depends on how you're searching for 'ball'. If there's no index on the column(s) where you're searching, then the entire table has to be read. If there is an index, then...
WHERE field LIKE 'ball%' will use an index
WHERE field LIKE '%ball%' will NOT use an index
For your #2, think of it this way: Doing SELECT * FROM table and then perusing the results in your application is exactly the same as going to the local super walmart, loading the store's complete inventory into your car, driving it home, picking through every box/package, and throwing out everything except the pack of gum from the impulse buy rack by the front till that you'd wanted in the first place. The whole point of a database is to make it easy to search for data and filter by any kind of clause you could think of. By slurping everything across to your application and doing the filtering there, you've reduced that shiny database to a very expensive disk interface, and would probably be better off storing things in flat files. That's why there's WHERE clauses. "SELECT something FROM store WHERE type=pack_of_gum" gets you just the gum, and doesn't force you to truck home a few thousand bottles of shampoo and bags of kitty litter.
For your #3, yes. If you have an ORDER BY clause in a LIMIT query, the result set has to be sorted before the database can figure out what those 5 records should be. While it's not quite as bad as actually transferring the entire record set to your app and only picking out the first five records, it still involves a bit more work than just retrieving the first 5 records that match your WHERE clause.
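One nuance worth adding: if the ORDER BY column is indexed (and id, presumably the primary key here, is), MySQL can walk the index in order and stop after five rows, skipping the sort entirely:

-- Reads just five entries in index order; no sort step needed:
SELECT * FROM toys ORDER BY id LIMIT 5;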

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+-----------
001 | 100001  | Alex
002 | 100002  | Micheal
003 | 100003  | Daniel
Let's say we want to search for 001. How does the fast search work using a hash table?
Isn't it the same as using "SELECT * from .." in MySQL? I have read a lot, and people say "SELECT *" searches from beginning to end, but a hash table does not? Why and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value LIKE 'S%'
Isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we use to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the number of lists we have:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examining the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
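If you want to try a real hash index in MySQL, the MEMORY storage engine uses them by default; this is just a sketch (note that InnoDB tables silently build a B-tree even if you ask for HASH):

CREATE TABLE lookup (
    id   INT NOT NULL,
    name VARCHAR(50),
    INDEX idx_id USING HASH (id)
) ENGINE = MEMORY;

-- Equality lookups hit the hash index in O(1):
SELECT name FROM lookup WHERE id = 1;
-- Range conditions (e.g. id < 100) cannot use a hash index at all.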
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since the row will always have the same id (because of the input you build the hash value from), you can always recreate this id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you should have a deterministic strategy for each time you get two values with the same id. You should check Wikipedia for more info on this strategy; it's called collision handling and should be mentioned in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.