How do you design a database to allow fast multicolumn searching? - mysql

I am creating a real estate search from RETS data using MySQL, but this is a general question. When you have a variety of columns that you would like the user to be able to filter their search result by, how do you optimize this?
For example, http://www.charlestonrealestateguide.com/listings.php has 16 or so optional filters. Granted, he only has up to 11,000 entries (I have the same data), but I don't imagine the search is performed with just a giant WHERE AND AND AND ... clause. Or is this typically accomplished with one giant multicolumn index?
Newegg, Amazon, and countless others also have cool & fast filtering systems for large amounts of data. How do they do it? And is there a database optimization reason for the tendency to provide ranges instead of empty inputs, or is that merely for user convenience?

I believe this post by Explain Extended addresses your question. It's long and detailed, with many examples. I'll cut and paste his summary to whet your appetite:
In some cases, a range predicate (like "less than", "greater than" or "between") can be rewritten as an IN predicate against the list of values that could satisfy the range condition.
Depending on the column datatype, check constraints and statistics, that list could be comprised of all possible values defined by the column's domain; all possible values defined by the column's minimal and maximal value; or all actual distinct values contained in the table. In the latter case, a loose index scan could be used to retrieve the list of such values.
Since an equality condition is applied to each value in the list, more access and join methods could be used to build the query plan, including range conditions on secondary index columns, hash lookups, etc.
Whenever the optimizer builds a plan for a query that contains a range predicate, it should consider rewriting the range condition as an IN predicate and use the latter method if it proves more efficient.

MySQL Edit
It seems that some RDBMSs have some capability in this regard.
MySQL does have some index "joins" according to the documentation:
[Before MySQL 5.0], MySQL was able to use at most one index for each referenced table
But in 5.0 and later it supports some limited index merging.
You really need to understand how indexes work and when they are useful. At what percentage of rows does a full table scan make more sense than an index? Would you believe that in some scenarios a full table scan (FTS) is cheaper than an index scan that returns 2% of the rows? If your bedroom histogram looks like this: 1 = 25%, 2 = 50%, 3 = 20%, >3 = 5%, then the only time an index on that column is useful is for finding houses with more than 3 bedrooms, and even then it won't be used because of bind variables and clustering factors.
Think of it like this. Assume my bedroom percentages are correct. Let's say you have 8 KB pages (I don't know what page size MySQL uses) and each row is 80 bytes long. Ignoring overhead, you have 100 rows (listings) per page of disk. Since houses are added in random order (random insofar as bedrooms go), each page will hold roughly 50 two-bedroom houses, 25 one-bedroom houses, 20 three-bedroom houses, and maybe a four- or five-bedroom house. EVERY page will have at least one one-bedroom house, so you'll read EVERY page for BEDROOMS = 1; the same goes for 2 and 3. An index could help for five-bedroom houses, but if MySQL's bind variables work like Oracle's then it won't switch plans for a given value of Bedrooms.
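If you want to see what your own bedroom distribution looks like, a quick GROUP BY gives you the histogram. This is only a sketch; the listings table and column names are hypothetical:

SELECT bedrooms,
       COUNT(*) AS listings,
       ROUND(100 * COUNT(*) / (SELECT COUNT(*) FROM listings), 1) AS pct
FROM listings
GROUP BY bedrooms
ORDER BY bedrooms;

A column whose most common value covers a quarter or more of the rows, as in the histogram above, is rarely worth a single-column index on its own.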
As you can see, there's a lot to understand... Far more than Jon Skeet has indicated.
Original Post
Most RDBMSs can't combine indexes on a single table. Say you have a table with columns A, B and C, with single-column indexes on A, B and C, and you search WHERE A = a AND B = b AND C = c. The database will pick the most selective index and use only that one.
If you create a single multicolumn index on (A, B, C), that index won't be used unless you include A = a in the WHERE. If your WHERE is B = b AND C = c, then that index is ignored - in most RDBMSs.
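To make the leading-column rule concrete, here is a minimal sketch with a hypothetical table t and columns a, b, c:

CREATE INDEX idx_abc ON t (a, b, c);

-- Can use idx_abc, because the leading column a is constrained:
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
SELECT * FROM t WHERE a = 1 AND c = 3;

-- Cannot seek on idx_abc, because a is missing from the WHERE:
SELECT * FROM t WHERE b = 2 AND c = 3;

For the last query, most engines fall back to another index or to a table scan.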
This is why Oracle invented the bitmap index. Bitmap indexes on A, B and C can be combined with bitwise AND and bitwise OR operations until a final set of rowids is determined and the selected columns are retrieved.
A bitmap index on the REGION column is shown in the last four columns.
Row Region North East West South
1 North 1 0 0 0
2 East 0 1 0 0
3 West 0 0 1 0
4 West 0 0 1 0
5 South 0 0 0 1
6 North 1 0 0 0
So if you say you want a house WHERE Region IN (North, East), you'd bitwise OR the North index and the East index and wind up with rows 1, 2 and 6.
If you had another column with bedroom count such as
Row Bedrooms 1BR 2BR
1 1 1 0
2 2 0 1
3 1 1 0
4 1 1 0
5 2 0 1
6 2 0 1
If you AND in Bedrooms = 2, that index would return rows 2, 5 and 6, and when bitwise ANDed with the Region result this leaves rows 2 and 6.
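In Oracle this would look roughly like the following. The listings table and column names are hypothetical, but CREATE BITMAP INDEX is real Oracle syntax:

CREATE BITMAP INDEX listings_region_bx   ON listings (region);
CREATE BITMAP INDEX listings_bedrooms_bx ON listings (bedrooms);

-- The optimizer can OR the North and East bitmaps, AND the result with
-- the bedrooms = 2 bitmap, and only then visit the matching table rows:
SELECT *
FROM listings
WHERE region IN ('North', 'East')
  AND bedrooms = 2;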
But since you failed to mention the RDBMS I may have completely wasted my time. Oh well.

Wouldn't it be a WHERE x='y' AND a='b' etc query instead?
I'd have thought that several separate indexes should be fine there - no need for anything special.

I'm assuming that your search criteria are discrete, not free-form; that is, you are filtering on something you can quantify, like number of bedrooms, size of plot, etc., not whether or not it's in a "sunny location." In that case, I'd suggest that you build the query dynamically so that it only considers the columns of interest. Single-column indexes are probably adequate, especially given that you don't seem to have a lot of data. If you find, though, that people are always specifying a couple of columns -- number of bedrooms and number of bathrooms, for example -- then adding a compound index for that combination of columns might be useful. I'd certainly let the statistics and performance drive those decisions, though.
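As a sketch of what building the query dynamically means in practice (table and column names are hypothetical), the application emits only the predicates the user actually filled in, and you add a compound index for the combination people use most often:

-- User specified bedrooms and bathrooms only:
SELECT * FROM listings WHERE bedrooms = 3 AND bathrooms >= 2;

-- A different user specified price only:
SELECT * FROM listings WHERE price <= 250000;

-- If bedrooms + bathrooms turns out to be the common combination:
CREATE INDEX idx_bed_bath ON listings (bedrooms, bathrooms);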
If you're only querying a single table, it will choose the best index to use, if one is applicable. From this perspective, you want to choose columns that are good discriminators and likely to be used in the filter. Limiting the number of indexes can be a good thing, if you know that certain columns will either quickly limit the number of results returned or, conversely, that a particular column isn't a good discriminator. If, for instance, 90% of your houses listed have a plot size of less than an acre and most people search for plots of less than an acre (or don't care), then an index scan based on this index is typically no better than a table scan and there's no need for the index. Indexes do cost something to compute, though for a small database such as yours with infrequent inserts that's likely not an issue.
#Jon is right; I think you probably want to combine the filter properties using an AND rather than an OR. That is, people are generally looking for a house with 3 bedrooms AND 2 bathrooms, not 3 bedrooms OR 2 bathrooms. If you have a filter that allows multiple choices, then you may want to use IN -- say PropertyType IN ('Ranch','SplitLevel',...) instead of an explicit OR (it works out the same, but is more readable). Note you'd likely be using the foreign key to the PropertyTypes table rather than the text here, but I used the values just for illustration.

What you need is a full-text search engine; Amazon and others use the same approach. Have a look at http://lucene.apache.org/ and, if your platform is based on Java, at higher-level abstractions such as www.elasticsearch.com and Hibernate Search.

Related

MySQL: Long table vs wide table

What is the more efficient (in terms of query performance) database table design - long or wide?
I.e., this
id size price
1 S 12.4
1 M 23.1
1 L 33.3
2 S 3.3
2 M 5.3
2 L 11.0
versus this
id S M L
1 12.4 23.1 33.3
2 3.3 5.3 11.0
Generally (I reckon) it comes down to the comparison of performance between GROUP BY and selecting the columns directly:
SELECT AVG(price) FROM table GROUP BY size
or
SELECT AVG(S), AVG(M), AVG(L) FROM table
The second one is a bit longer to write (when there are many columns), but what about the performance of the two? If possible, what are the general advantages/disadvantages of each of these table formats?
First of all, these are two different data models suitable for different purposes.
That being said, I'd expect[1] the second model to be faster for aggregation, simply because the data is packed more compactly, therefore needing less I/O:
The GROUP BY in the first model can be satisfied by a full scan on the index {size, price}. The alternative to the index (sorting the whole table) is too slow when the data is too large to fit in RAM.
The query in the second model can be satisfied by a full table scan. No index is needed[2].
Since the first approach requires table + index and the second one just the table, the cache utilization is better in the second case. Even if we disregard caching and compare the index (without table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has unused "holes" typical for B-Trees (though the same is true for the table if it is clustered).
And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.
Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM by COUNT.
[1] But you should really measure on representative amounts of data to be sure.
[2] Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for getting a relatively small subset of a table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of the rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.
The first option results in more rows and will generally be slower than the second option.
However, as Deltalima also indicated, the first option is more flexible. Not only when it comes to different query options, but also if/when you one day need to extend the table with other sizes, colors etc.
Unless you have a very large dataset or need ultra-fast lookup time, you'll probably be better off with the first option.
If you do have or need a very large dataset, you may be better off creating a table with pre-calculated summary values.
The long format is more flexible in use. It allows you to filter on size, for example:
SELECT MAX(price) FROM table WHERE size = 'L'
It also allows for indexing on the size and on the id. This speeds up the GROUP BY and any queries where other tables are joined on id and/or size, such as a product stock table.
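A sketch of the long format with those indexes (the names are hypothetical; the composite primary key covers lookups by id, and the secondary key covers filtering and grouping by size):

CREATE TABLE prices (
  id    INT           NOT NULL,
  size  CHAR(1)       NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (id, size),
  KEY idx_size (size)
);

SELECT MAX(price) FROM prices WHERE size = 'L';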

Which SQL will be faster and why?

Suppose I have a student table containing id, class, and school_id, with 1000 records.
There are 3 schools and 12 classes.
Which of these 2 queries would be faster (if there is a difference)?
Query 1:
SELECT * FROM student WHERE school = 2 and class = 5;
Query 2:
SELECT * FROM student WHERE class = 5 and school = 2;
Note: I just swapped the order of the 2 conditions in the WHERE.
Which will be faster, and is the following true?
-> probable number of records for query 1 is 333
-> probable number of records for query 2 is 80
It seriously doesn't matter one little bit. 1000 records is a truly tiny database table and, if there's a difference at all, you need to upgrade from such a brain-dead DBMS.
A decent DBMS would have already collected the stats from tables (or the DBA would have done it as part of periodic tuning) and the order of the where clauses would be irrelevant.
The execution engine would choose the one which reduced the cardinality (i.e., reduced the candidate group of rows) the fastest. That means that (assuming classes and schools are roughly equally distributed) the class = 5 filter would happen first, no matter the order in the SELECT statement.
Explaining the cardinality issue in a little more depth, for a roughly evenly distributed spread of those 1000 records, there would be 333 for each school and 83 for each class.
What a DBMS would do would be to filter first on what gives you the smallest result set. So it would tend to prefer using the class filter. That would immediately drop the candidate list of rows to about 83. Then, it's a simple matter of tossing out those which have a school other than 2.
In both cases, you end up with the same eventual row set but the initial filter is often faster since it can use an index to only select desired rows. The second filter, on the other hand, most likely goes through those rows in a less efficient manner so the quicker you can reduce the number of rows, the better.
If you really want to know, you need to measure rather than guess. That's one of the primary responsibilities of a DBA, tuning the database for optimal execution of queries.
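If you want to check this yourself rather than take anyone's word for it, compare the plans. Using the question's column names, both statements should produce the same EXPLAIN output:

EXPLAIN SELECT * FROM student WHERE school = 2 AND class = 5;
EXPLAIN SELECT * FROM student WHERE class = 5 AND school = 2;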
These 2 queries are strictly the same :)
Hypothetical, to teach a DB concept: "how your DB uses cardinality to optimize your queries".
So, it's basically true that they are identical, but I will mention one thought hinting at the "why", which will actually introduce a good RDBMS concept.
Let's just say hypothetically that your RDBMS used the WHERE clauses strictly in the order you specified them.
In that case, the optimal query would be the one in which the column with maximum cardinality was specified first. What that means is that specifying class = 5 first would be faster, as it more quickly eliminates rows from consideration: if the row's "class" column does not contain 5 (which is statistically more likely than its "school" column not containing 2), then the "school" column doesn't even need to be evaluated.
Coming back to reality, however, you should know that almost all modern relational database management systems do what is called "building a query plan" and "compiling the query". This involves, among other things, evaluating the cardinality of columns specified in the WHERE clause (and what indexes are available, etc). So essentially, it is probably true to say they are identical, and the number of results will be, too.
The number of rows returned will not change simply because you reorder the conditions in the WHERE clause of the SQL statement.
The execution time will also not be affected, since the SQL server will look for a matching index first.
The first query executes faster than the second query because the WHERE clause filters on school first, so it is easier to get the class details later.

Is there an indexable way to store several bitfields in MySQL?

I have a MySQL table which needs to store several bitfields...
notification.id -- autonumber int
association.id -- BIT FIELD 1 -- stores one or more association ids (which are obtained from another table)
type.id -- BIT FIELD 2 -- stores one or more types that apply to this notification (again, obtained from another table)
notification.day_of_week -- BIT FIELD 3 -- stores one or more days of the week
notification.target -- where to send the notification -- the data type is irrelevant, as we'll never index or sort on this field, but it will probably store an email address.
My users will be able to configure their notifications to trigger on one or more days, in one or more associations, for one or more types. I need a quick, indexable way to store this data.
Bit fields 1 and 2 can expand to have more values than they do presently. Currently 1 has values as high as 125, and 2 has values as high as 7, but both are expected to go higher.
Bit field 3 stores days of the week, and as such, will always have only 7 possible values.
I'll need to run a script frequently (every few minutes) that scans this table based on type, association, and day, to determine if a given notification should be sent. Queries need to be fast, and the simpler it is to add new data, the better. I'm not above using joins, subqueries, etc as needed, but I can't imagine these being faster.
One last requirement -- if I have 1000 different notifications stored in here, with 125 association possibilities, 7 types, and 7 days of the week, the number of combinations is too high for my taste if I just use plain integers and store multiple copies of each row instead of using bit fields, so it seems like bit fields are a requirement.
However, from what I've heard, if I wanted to select everything from a particular day of the week, say Tuesday (b0000100 in a bit field, perhaps), bit fields are not indexed such that I can do...
SELECT * FROM `mydb`.`mytable` WHERE `notification.day_of_week` & 4 = 4;
This, from my understanding, would not use an index at all.
Any suggestions on how I can do this, or something similar, in an indexable fashion?
(I work on a pretty standard LAMP stack, and I'm looking for specifics on how the MySQL indexing works on this or a similar alternative.)
Thanks!
There's no "good" way (that I know of) to accomplish what you want to.
Note that the BIT datatype is limited to a size of 64 bits.
For bits that can be statically defined, MySQL provides the SET datatype, which is in some ways the same as BIT, and in other ways it is different.
For days of the week, for example, you could define a column
dow SET('SUN','MON','TUE','WED','THU','FRI','SAT')
There's no built-in way (that I know of) of getting the internal bit representation back out, but you can add 0 to the column, or cast it to unsigned, to get a decimal representation.
SELECT dow+0, CONVERT(dow,UNSIGNED), dow, ...
1 1 SUN
2 2 MON
3 3 SUN,MON
4 4 TUE
5 5 SUN,TUE
6 6 MON,TUE
7 7 SUN,MON,TUE
It is possible for MySQL to use a "covering index" to satisfy a query with a predicate on a SET column, when the SET column is the leading column in the index. (i.e. EXPLAIN shows 'Using where; Using index') But MySQL may be performing a full scan of the index, rather than doing a range scan. (And there may be differences between the MyISAM engine and the InnoDB engine.)
SELECT id FROM notification WHERE FIND_IN_SET('SUN',dow)
SELECT id FROM notification WHERE (dow+0) MOD 2 = 1
BUT... this usage is non-standard, and can't really be recommended. For one thing, this behavior is not guaranteed, and MySQL may change this behavior in a future release.
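For what it's worth, a sketch of the kind of table and covering index the queries above assume (the names are hypothetical):

CREATE TABLE notification (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  dow SET('SUN','MON','TUE','WED','THU','FRI','SAT') NOT NULL,
  PRIMARY KEY (id),
  KEY idx_dow_id (dow, id)
);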
I've done a bit more research on this, and realized there's no way to get the indexing to work as I outlined above. So, I've created an auxiliary table (somewhat like the WordPress meta table format) which stores entries for day of week, etc. I'll just join these tables as needed. Fortunately, I don't anticipate having more than ~10,000 entries at present, so it should join quickly enough.
I'm still interested in a better answer if anyone has one!
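For reference, one possible shape for that auxiliary table and the join (names are hypothetical; similar tables would hold the association and type entries):

CREATE TABLE notification_day (
  notification_id INT UNSIGNED     NOT NULL,
  day_of_week     TINYINT UNSIGNED NOT NULL,  -- 1 = Sunday ... 7 = Saturday
  PRIMARY KEY (notification_id, day_of_week),
  KEY idx_day (day_of_week)
);

SELECT n.*
FROM notification n
JOIN notification_day d ON d.notification_id = n.id
WHERE d.day_of_week = 3;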

MySQL SELECT rows where a specific bit of an integer is set

I have to do a SELECT query on a postings table where a specific bit of an integer is set.
The integer represents a set of categories in a bitmask:
E.g.
1 => health
2 => marketing
3 => personal
4 => music
5 => video
6 => design
7 => fashion
8 => ......
Data example:
id | categories | title
1 | 11 | bla bla
2 | 48 | blabla, too
I need a mysql query that selects postings, that are marked with a specific category.
Let's say "all video postings"
This means I need a result set of postings where the 5th bit of the categories column is set (e.g. 16, 17, 48, ...)
SELECT * FROM postings WHERE ....????
Any ideas ?
You can use bitwise operators like this. For video (bit 5):
WHERE categories & 16 = 16
Substitute the value 16 using the following values for each bit:
1 = 1
2 = 2
3 = 4
4 = 8
5 = 16
6 = 32
7 = 64
8 = 128
This numbering goes from the least significant bit upward, which is the opposite of the way most programmers think about bits; programmers also usually start counting at zero.
How about
SELECT * FROM postings WHERE (categories & 16) > 0; -- 16 is 5th bit over
One issue with this is you probably won't hit an index, so you could run into perf issues if it's a large amount of data.
Certain databases (such as PostgreSQL) let you define an index on an expression like this. I'm not sure if mySQL has this feature. If this is important, you might want to consider breaking these out into separate Boolean columns or a new table.
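In PostgreSQL the expression index would look something like this (a sketch only; the postings table is from the question, the index name is made up):

CREATE INDEX idx_postings_video ON postings ((categories & 16));

SELECT * FROM postings WHERE (categories & 16) = 16;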
SQL (not just mySQL) is not suitable for bitwise operations. If you do a bitwise AND you will force a table scan as SQL will not be able to use any index and will have to check each row one at a time.
It would be better if you created a separate "Categories" table and a properly indexed many-to-many PostingCategories table to connect the two.
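A sketch of that normalized design (apart from postings, which is from the question, the table and column names are hypothetical):

CREATE TABLE categories (
  id   INT UNSIGNED NOT NULL PRIMARY KEY,
  name VARCHAR(50)  NOT NULL
);

CREATE TABLE posting_categories (
  posting_id  INT UNSIGNED NOT NULL,
  category_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (posting_id, category_id),
  KEY idx_category (category_id)
);

-- All video postings (category 5) via an index seek, no bitwise math:
SELECT p.*
FROM postings p
JOIN posting_categories pc ON pc.posting_id = p.id
WHERE pc.category_id = 5;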
UPDATE
For people insisting that bitmap fields aren't an issue, it helps to check Joe Celko's BIT of a Problem.  At the bottom of the article is a list of serious problems caused by bitmaps.
Regarding the comment that a blanket statement can't be right, note #10 - it breaks 1NF so yes, bitmap fields are bad:
The data is unreadable. ...
Constraints are a b#### to write....
You are limited to two values per field. That is very restrictive; even the ISO sex code cannot fit into such a column...
There is no temporal element to the bit mask (or to single bit flags). For example, a flag “is_legal_adult_flg” ... A DATE for the birth date (just 3 bytes) would hold complete fact and let us compute what we need to know; it would always be correct, too. ...
You will find out that using the flags will tend to split the status of an entity over multiple tables....
Bit flags invite redundancy. In the system I just mentioned, we had "is_active_flg" and "is_completed_flg" in the same table. A completed auction is not active and vice versa. It is the same fact in two flags. Human psychology (and the English language) prefers to hear an affirmative wording (remember the old song "Yes, we have no bananas today!"?).
All of these bit flags, and sequence validation are being replaced by two sets of state transition tables, one for bids and one for shipments. For details on state transition constraints. The history of each auction is now in one place and has to follow business rules.
By the time you disassemble a bit mask column and throw out the fields you did not need, performance is not going to be improved over simpler data types.
Grouping and ordering on the individual fields is a real pain. Try it.
You have to index the whole column, so unless you luck out and have them in the right order, you are stuck with table scans.
Since a bit mask is not in First Normal Form (1NF), you have all the anomalies we wanted to avoid in RDBMS.
I'd also add, what about NULLs? What about missing flags? What if something is neither true or false?
Finally, regarding the compression claim, most databases pack bit fields into bytes and ints internally, so the bitmap field doesn't offer any kind of compression in this case. Other databases (e.g. PostgreSQL) actually have a Boolean type that can be true/false/unknown. It may take 1 byte, but that's not a lot of storage, and transparent compression is available if a table gets too large.
In fact, if a table gets large, the bitmap field's problems become a lot more serious. Saving a few MBs in a GB table is no gain if you are forced to use table scans or if you lose the ability to group.

Simple MySQL output questions

I have a two-column table consisting of a number and that number's cube. Right now, I have about 13 million numbers inserted, and that's growing very, very quickly.
Is there a way to output simple tables quicker than using a command like SELECT * FROM table?
My second question pertains to the selection of a range of numbers. As stated above, I have a large database growing extremely fast to hold numbers and their cubes. If you're wondering, I'm trying to find the 3 numbers that will sum up to 33 when cubed. So, I'm doing this by using a server/client program to send a range of numbers to a client so they can do the equations on said range of numbers.
So, for example, let's say the first client chimes in. I give him a range of 0-100. He then goes off to compute the numbers and reports back to tell the server whether he found the triplet. If he didn't, the loop just continues.
When the client is doing the calculations for the numbers by itself, it goes extremely slow. So, I have decided to use a database to store the cubed numbers so the client does not have to do the calculations. The problem is, I don't know how to access only a range of numbers. For example, if the client had the range 0-100, it would need to access the cubes of all numbers from 0-100.
What is the select command that will return a range of numbers?
The engine I am using for the table is MyISAM.
If your table "mytable" has two columns
number cube
0 0
1 1
2 8
3 27
the query command will be (Assuming the start of the range is 100 and the end is 200):
select number, cube from mytable where number between 100 and 200 order by number;
If you want this query to be as fast as possible, make sure of the following:
number is an index. Thus you don't need to do a table scan to find the start of your range.
the index you create is clustered. Clustered indexes are way faster for scans like this, as the leaf in the index is the record itself (in comparison, the leaf in a non-clustered index is a pointer to the record, which may be in a completely different part of the disk). As well, the clustered index forces a sorted structure on the data. Thus you may be able to read all 100 records from a single block.
Of course, adding an index will make writing to the table slightly slower. As well, I am assuming you are writing to the table in order (i.e. 0,1,2,3,4 etc. not 10,5,100,3 etc.). Writes to tables with clustered indexes are very slow if you write to the table in a random order (as the DB has to keep moving records to fit the new ones in).
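For completeness, a sketch of the mytable definition this answer has in mind. MyISAM (which the question uses) does not cluster data, so this assumes switching the table to InnoDB, where the primary key is the clustered index:

CREATE TABLE mytable (
  number BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- clustered in InnoDB
  cube   DECIMAL(60,0)   NOT NULL
) ENGINE=InnoDB;

The range query shown earlier (WHERE number BETWEEN 100 AND 200) can then read the matching rows from a handful of adjacent pages.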