MySQL: SELECT rows where a specific bit of an integer is set

I have to do a SELECT query on a postings table where a specific bit of an integer is set.
The integer represents a set of categories in a bitmask:
E.g.
1 => health
2 => marketing
3 => personal
4 => music
5 => video
6 => design
7 => fashion
8 => ......
Data example:
id | categories | title
1 | 11 | bla bla
2 | 48 | blabla, too
I need a MySQL query that selects postings that are marked with a specific category.
Let's say "all video postings".
This means I need a result set of postings where the 5th bit of the categories column is set (e.g. 16, 17, 48, ...).
SELECT * FROM postings WHERE ....????
Any ideas?

You can use bitwise operators like this. For video (bit 5):
WHERE categories & 16 = 16
Substitute the value 16 using the following values for each bit:
1 = 1
2 = 2
3 = 4
4 = 8
5 = 16
6 = 32
7 = 64
8 = 128
This numbering goes from the least significant bit to the most significant, which is the opposite of the way most programmers think about bits; they also usually start counting at zero.
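If you'd rather not hard-code the mask, you can compute it from the 1-based bit position with a shift. A minimal sketch, reusing the table and column names from the question:

-- bit 5 (video) in the question's 1-based numbering: 1 << (5-1) = 16
SELECT *
FROM postings
WHERE categories & (1 << (5 - 1)) <> 0;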

How about
SELECT * FROM postings WHERE (categories & 16) > 0; -- 16 is 5th bit over
One issue with this is you probably won't hit an index, so you could run into perf issues if it's a large amount of data.
Certain databases (such as PostgreSQL) let you define an index on an expression like this. I'm not sure if MySQL has this feature. If this is important, you might want to consider breaking these out into separate Boolean columns or a new table.
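For what it's worth, one way to get an indexable version of this predicate in MySQL is a stored generated column (a feature added in MySQL 5.7). A minimal sketch against the question's table; the column and index names here are mine:

-- (categories >> 4) & 1 extracts bit 5 (video)
ALTER TABLE postings
  ADD COLUMN is_video TINYINT(1)
    GENERATED ALWAYS AS ((categories >> 4) & 1) STORED,
  ADD INDEX idx_is_video (is_video);

-- This predicate can now be satisfied from idx_is_video:
SELECT * FROM postings WHERE is_video = 1;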

SQL (not just MySQL) is poorly suited to bitwise operations. If you do a bitwise AND, you force a table scan: SQL will not be able to use any index and will have to check each row one at a time.
It would be better if you created a separate "Categories" table and a properly indexed many-to-many PostingCategories table to connect the two.
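A minimal sketch of that normalized design (the table and column names are illustrative):

CREATE TABLE categories (
  id   INT PRIMARY KEY,
  name VARCHAR(50) NOT NULL
);

CREATE TABLE posting_categories (
  posting_id  INT NOT NULL,
  category_id INT NOT NULL,
  PRIMARY KEY (posting_id, category_id),
  KEY idx_category (category_id, posting_id),
  FOREIGN KEY (posting_id)  REFERENCES postings(id),
  FOREIGN KEY (category_id) REFERENCES categories(id)
);

-- "All video postings" becomes an ordinary indexed join:
SELECT p.*
FROM postings p
JOIN posting_categories pc ON pc.posting_id = p.id
JOIN categories c         ON c.id = pc.category_id
WHERE c.name = 'video';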
UPDATE
For people insisting that bitmap fields aren't an issue, it helps to check Joe Celko's BIT of a Problem.  At the bottom of the article is a list of serious problems caused by bitmaps.
Regarding the comment that a blanket statement can't be right, note #10: it breaks 1NF, so yes, bitmap fields are bad:
The data is unreadable. ...
Constraints are a b#### to write....
You are limited to two values per field. That is very restrictive; even the ISO sex code cannot fit into such a column...
There is no temporal element to the bit mask (or to single bit flags). For example, a flag “is_legal_adult_flg” ... A DATE for the birth date (just 3 bytes) would hold complete fact and let us compute what we need to know; it would always be correct, too. ...
You will find out that using the flags will tend to split the status of an entity over multiple tables....
Bit flags invite redundancy. In the system I just mentioned, we had “is_active_flg” and “is_completed_flg” in the same table. A completed auction is not active and vice versa. It is the same fact in two flags. Human psychology (and the English language) prefers to hear an affirmative wording (remember the old song “Yes, we have no bananas today!”?).
All of these bit flags and sequence validations are being replaced by two sets of state transition tables, one for bids and one for shipments (for details, see state transition constraints). The history of each auction is now in one place and has to follow business rules.
By the time you disassemble a bit mask column and throw out the fields you did not need, performance is not going to be improved over simpler data types.
Grouping and ordering on the individual fields is a real pain. Try it.
You have to index the whole column, so unless you luck out and have them in the right order, you are stuck with table scans.
Since a bit mask is not in First Normal Form (1NF), you have all the anomalies we wanted to avoid in RDBMS.
I'd also add: what about NULLs? What about missing flags? What if something is neither true nor false?
Finally, regarding the compression claim: most databases pack bit fields into bytes and ints internally, so a bitmap field doesn't offer any kind of compression in this case. Other databases (e.g. PostgreSQL) actually have a Boolean type that can be true/false/unknown. It may take 1 byte, but that's not a lot of storage, and transparent compression is available if a table gets too large.
In fact, if a table gets large, the bitmap field's problems become a lot more serious. Saving a few MBs in a GB-sized table is no gain if you are forced to use table scans, or if you lose the ability to group and order on the individual fields.

Related

Is there an indexable way to store several bitfields in MySQL?

I have a MySQL table which needs to store several bitfields...
notification.id -- autonumber int
association.id -- BIT FIELD 1 -- stores one or more association ids (which are obtained from another table)
type.id -- BIT FIELD 2 -- stores one or more types that apply to this notification (again, obtained from another table)
notification.day_of_week -- BIT FIELD 3 -- stores one or more days of the week
notification.target -- where to send the notification -- data type is irrelevant, as we'll never index or sort on this field, but it will probably store an email address.
My users will be able to configure their notifications to trigger on one or more days, in one or more associations, for one or more types. I need a quick, indexable way to store this data.
Bit fields 1 and 2 can expand to have more values than they do presently. Currently 1 has values as high as 125, and 2 has values as high as 7, but both are expected to go higher.
Bit field 3 stores days of the week, and as such, will always have only 7 possible values.
I'll need to run a script frequently (every few minutes) that scans this table based on type, association, and day, to determine if a given notification should be sent. Queries need to be fast, and the simpler it is to add new data, the better. I'm not above using joins, subqueries, etc as needed, but I can't imagine these being faster.
One last requirement: if I have 1000 different notifications stored in here, with 125 association possibilities, 7 types, and 7 days of the week, the number of rows required is too high for my taste if I just use integers and store multiple copies of the row instead of using bit fields. So it seems like bit fields are a requirement.
However, from what I've heard, if I wanted to select everything from a particular day of the week, say Tuesday (b0000100 in a bit field, perhaps), bit fields are not indexed such that I can do...
SELECT * FROM `mydb`.`mytable` WHERE `notification.day_of_week` & 4 = 4;
This, from my understanding, would not use an index at all.
Any suggestions on how I can do this, or something similar, in an indexable fashion?
(I work on a pretty standard LAMP stack, and I'm looking for specifics on how the MySQL indexing works on this or a similar alternative.)
Thanks!
There's no "good" way (that I know of) to accomplish what you want.
Note that the BIT datatype is limited to a size of 64 bits.
For bits that can be statically defined, MySQL provides the SET datatype, which is in some ways similar to BIT and in other ways different.
For days of the week, for example, you could define a column
dow SET('SUN','MON','TUE','WED','THU','FRI','SAT')
There's no builtin way (that I know of) of getting the internal bit representation back out, but you can add 0 to the column, or cast it to unsigned, to get a decimal representation.
SELECT dow+0, CONVERT(dow,UNSIGNED), dow, ...
1 1 SUN
2 2 MON
3 3 SUN,MON
4 4 TUE
5 5 SUN,TUE
6 6 MON,TUE
7 7 SUN,MON,TUE
It is possible for MySQL to use a "covering index" to satisfy a query with a predicate on a SET column, when the SET column is the leading column in the index. (i.e. EXPLAIN shows 'Using where; Using index') But MySQL may be performing a full scan of the index, rather than doing a range scan. (And there may be differences between the MyISAM engine and the InnoDB engine.)
SELECT id FROM notification WHERE FIND_IN_SET('SUN',dow)
SELECT id FROM notification WHERE (dow+0) MOD 2 = 1
BUT... this usage is non-standard, and can't really be recommended. For one thing, this behavior is not guaranteed, and MySQL may change this behavior in a future release.
I've done a bit more research on this, and realized there's no way to get the indexing to work as I outlined above. So, I've created an auxiliary table (somewhat like the WordPress meta table format) which stores entries for day of week, etc. I'll just join these tables as needed. Fortunately, I don't anticipate having more than ~10,000 entries at present, so it should join quickly enough.
I'm still interested in a better answer if anyone has one!
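For reference, a minimal sketch of that auxiliary-table approach for the day-of-week case (the names and day encoding here are illustrative):

CREATE TABLE notification_day (
  notification_id INT NOT NULL,
  dow             TINYINT NOT NULL,  -- 0 = Sunday ... 6 = Saturday
  PRIMARY KEY (notification_id, dow),
  KEY idx_dow (dow, notification_id)
);

-- "Everything that fires on Tuesday" is now an index range scan:
SELECT n.*
FROM notification n
JOIN notification_day nd ON nd.notification_id = n.id
WHERE nd.dow = 2;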

MySQL performance; large data table or multiple data tables?

I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there I will use a JOIN to reference information from other tables. My question is, what would be better for performance of the following:
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1 | 1 | email | test@domain.com
2 | 1 | phone | 1234567890
3 | 2 | email | test@domain2.com
Or
Would it be better to make a table of all the email addresses, then a table of all phone numbers, etc., and then use a select statement that has multiple joins?
Keep in mind, this database will start with over 75000 rows in the member table, and will actually include phone, email, fax, first and last name, company name, and address (city, state, zip), meaning each member will have at least 1 of each of those but can have multiple (normally 1-3 per member), so in excess of 75000 phone numbers, email addresses, etc.
So basically, join 1 table of in excess of 750,000 rows or join 7-10 tables of in excess of 75,000 rows
edit: performance of this database becomes an issue when we are inserting sales data that needs to be matched to existing data in the database, i.e. taking a CSV file of 10k rows of sales and contact data and querying the database to try to find which member each sales row from the CSV belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice).
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
If you run into performance problems with this design (personally, I don't think 75000 rows is that much - it should not give problems if you have indexes to properly support your queries) then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage here is that you can keep querying one logical table and do not need to manually bunch up the data from several places.
If you still don't think this is an option, you could consider vertical partitioning: that is, making groups of columns, or even single columns, and putting those in their own table. This makes sense if you have some queries that always need one particular set of columns, and other queries that tend to use another set of columns. Only then would it make sense to apply this vertical partitioning, because the join itself will cost performance.
(If you're really running into the billions then you could consider sharding - that is, use separate database servers to keep a partition of the rows. This makes sense only if you can either quickly limit the number of shards that you need to query to find a particular member row or if you can efficiently query all shards in parallel. Personally it doesn't seem to me you are going to need this.)
I would strongly recommend against making a single "data" table. This would essentially spread out each thing that would naturally be a column to a row. This requires a whole bunch of joins and complicates writing of what otherwise would be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: entity attribute value antipattern http://karwin.blogspot.com/2009/05/eav-fail.html)
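A minimal sketch of the conventional design described above (the column choices are illustrative):

CREATE TABLE member (
  id         INT PRIMARY KEY,
  first_name VARCHAR(50) NOT NULL,
  last_name  VARCHAR(50) NOT NULL,
  company    VARCHAR(100)
);

-- One child table per naturally repeating item:
CREATE TABLE member_email (
  member_id INT NOT NULL,
  email     VARCHAR(255) NOT NULL,
  PRIMARY KEY (member_id, email),
  KEY idx_email (email),
  FOREIGN KEY (member_id) REFERENCES member(id)
);

-- Matching one CSV sales row to a member is then a single indexed lookup:
SELECT member_id FROM member_email WHERE email = 'test@domain.com';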
You should research scaling out vs. scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k rows is really nothing for a DB. You might not even notice the benefits of indexes with that many (index anyway :)).
Point is that though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to remain truly declarative rather than programmatic as to which object you're addressing/querying. It is important to note sharding vs. partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither as-is, although a variant of the first option is the right approach.
Create a 'lookup' table that will store values of data type (mail, phone, etc.). Then use the id from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
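A minimal sketch of that three-table variant (names are illustrative):

CREATE TABLE data_type (
  id   INT PRIMARY KEY,
  name VARCHAR(30) NOT NULL UNIQUE  -- 'email', 'phone', ...
);

CREATE TABLE member_data (
  data_id      INT PRIMARY KEY,
  member_id    INT NOT NULL,
  data_type_id INT NOT NULL,
  data         VARCHAR(255) NOT NULL,
  KEY idx_member (member_id, data_type_id),
  FOREIGN KEY (data_type_id) REFERENCES data_type(id)
);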

Too many fields in MySQL?

I developed a stats site for a game as a learning project a few years back. It's still used today and I'd like to get it cleaned up a bit.
The database is one area that needs improvement. I have a table for the game statistics, which has GameID, PlayerID, Kills, Deaths, DamageDealt, DamageTaken, etc. In total, there are about 50 fields in that single table and many more that could be added in the future. At what point are there too many fields? It currently has 57,341 rows and is 153.6 MiB by itself.
I also have a few fields that stores arrays in a BLOB in this same table. An example of the array is Player vs Player matchups. The array stores how many times that player killed another player in the game. These are the bigger fields in filesize. Is storing an array in a BLOB advised?
The array looks like:
[Killed] => Array
(
[SomeDude] => 13
[GameGuy] => 10
[AnotherPlayer] => 8
[YetAnother] => 7
[BestPlayer] => 3
[APlayer] => 9
[WorstPlayer] => 2
)
These tend to not exceed more than 10 players.
I prefer not to have one table with an undetermined number of columns (with more to come), but rather an associated table of labels and values: each user has an id, and you use that id as a key into the table of labels and values. That way you only store the data you need per user. I believe this approach is called EAV (as per Triztian's comment), and it's how medical databases are kept, since there are SO many potential fields for an individual patient, even though any given patient has actual data in only a very small number of them.
so, you'd have
user:
id | username | some_other_required_field
user_data:
id | user_id | label | value
Now you can have as many or as few user_data rows as you need per user.
[Edit]
As to your array, I would treat this with a relational table as well. Something like:
player_interaction:
id | player1_id | player2_id | interaction_type
Here you would store the two players who had an interaction and what type of interaction it was.
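A hypothetical sketch of that idea specialized to the kill counts from the question (the names are mine):

CREATE TABLE player_kill (
  game_id   INT NOT NULL,
  killer_id INT NOT NULL,
  victim_id INT NOT NULL,
  kills     INT NOT NULL,
  PRIMARY KEY (game_id, killer_id, victim_id)
);

-- The per-game "Killed" array becomes rows, and aggregates become trivial:
SELECT victim_id, SUM(kills) AS total_kills
FROM player_kill
WHERE killer_id = 42
GROUP BY victim_id;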
The table design seems mostly fine, as long as the columns you are storing can't be calculated from the other columns within the same row. I.e., you're not storing SelfKills, OtherDeath, and TotalDeaths (where TotalDeaths = SelfKills + OtherDeath); that wouldn't make sense and could be cut out of your table.
I'd be curious to learn more about how you are storing those arrays in a BLOB - what purpose do they serve in a BLOB? Why aren't they normalized into a table for easy data transformation and analytics? (Or are they, and the array here is just for ease of displaying the data to end users?)
Also, I'd be curious how much data your BLOBs take up vs. the rest of the table. Generally speaking, the size of the rows isn't as big of a deal as the number of rows, and ~60K is no big deal at all. As long as you aren't writing queries that need to check every column value (ideally you're ignoring the BLOBs when trying to write a WHERE clause).
With MySQL you've got a hard limit of roughly 4,000 columns (fields) and 65KB of total storage per row. If you need to store large strings, use a TEXT field; those are stored separately on disk. BLOBs really should be reserved for non-textual data (if you must use them).
Don't worry in general about the size of your db, but think about the structure and how it's organized and indexed. I've seen small db's run like crap.
If you still want numbers: when your total DB gets into the GB range, or past a couple hundred thousand rows in a single table, then start worrying more about things--150M in 60K rows isn't much, and table scans aren't going to cost you much in performance. However, now's the time to make sure you create good covering indexes for your heavily used queries.
There's nothing wrong with adding columns to a database table as time goes on. Database designs change all the time. The thing to keep in mind is how the data is grouped. I have always treated a database table as a collection of like items.
Things I consider are as follows:
When inserting data into a row, how many columns will be null?
Does this new column apply to 80% of my data that is already there?
Will I be doing several updates to a few columns in this table?
If so, do I need to keep track of what the previous values were, just in case?
By thinking about your data like this, you may discover that you need to break your table up into a handful of separate smaller tables linked together by foreign keys.

Which is better to control state and perform queries with? One TINYINT column, one BIT(8) column, or eight BIT(1) columns

I intend to use bitmaps to store state (like this guy) and to make bitwise queries on my tables.
What column types should I use? And how would I perform the selects?
This article got me a little worried about going through with this idea. I want to be able to index the fields, do joins, and everything else I would do with a normal field.
So if I have a table with the lines:
|1234 5678|
|Id|Name|State |
|01| xxx|0111 0001|
|02| yyy|1101 1001|
|03| zzz|0101 0011|
I would want to get back the lines where:
State bits 2, 3 and 4 = 101 and State bit 8 = 1
That would be => (0101 0001)
I should get back the lines with Id 02 and 03.
Is it a good idea to do this kind of search, or am I just crazy?
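For reference, that select could be written with MySQL bit literals like this (a sketch against the example table, with a hypothetical table name; whether you should is what the answers below address):

SELECT Id, Name
FROM example
WHERE (State & b'01110000') = b'01010000'  -- bits 2-4, counted from the left, = 101
  AND (State & b'00000001') = b'00000001'; -- bit 8, counted from the left, = 1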
While a bitmasking approach does have some uses other than impressing friends (it may reduce storage requirements), I strongly advise against using it on data that needs to be queried. The reason is that you can't index it efficiently: most if not all queries have to be resolved using full scans. I was really burned on this one a long time ago, because I tested it on too small a data set while being alone in the database. Add a few hundred thousand rows and a dozen users, and it just doesn't scale up.
Therefore, unless you have some exceptional requirements, I advise you to put each piece of data in its own column (bit or int), along with appropriate indexes (single or compound columns) depending on your query needs.
The "downside" of this (in my opinion correct) approach is increased storage (due to the separate indexes), but unless you have millions of rows it's hardly noticeable.
If for some reason that doesn't work for you, there are other options that exploit patterns in the data to make an efficient search structure. But they all come with a price (severely limited flexibility, locking issues in multi-user environments, etcetera).
My advice: store each piece of data in its own column. This is how the database was intended to be used, and it will leverage all the benefits of a database. This also happens to be the best performing approach in all but the most exceptionally twisted circumstances.
I want to be able to index the fields, do joins and everything else I would do with a normal field.
"Do joins" implies that you hope to be able to select rows where the 8th bit of the State column in one table matches the 8th bit of the state column in another table.
Don't do that.
Create a column for each distinct attribute. Pick the right data type. Declare all relevant integrity constraints. Index the right columns.
Do that, and you can select and join 'till the cows come home.
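A minimal sketch of that advice (the names are illustrative):

CREATE TABLE things (
  id        INT PRIMARY KEY,
  name      VARCHAR(50) NOT NULL,
  is_urgent TINYINT(1) NOT NULL DEFAULT 0,  -- one column per former bit
  is_public TINYINT(1) NOT NULL DEFAULT 0,
  KEY idx_flags (is_urgent, is_public)
);

-- Former bit tests become plain, indexable predicates:
SELECT * FROM things WHERE is_urgent = 1 AND is_public = 0;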

How do you design a database to allow fast multicolumn searching?

I am creating a real estate search from RETS data using MySQL, but this is a general question. When you have a variety of columns that you would like the user to be able to filter their search result by, how do you optimize this?
For example, http://www.charlestonrealestateguide.com/listings.php has 16 or so optional filters. Granted, he only has up to 11,000 entries (I have the same data), but I don't imagine the search is performed with just a giant WHERE AND AND AND ... clause. Or is this typically accomplished with one giant multicolumn index?
Newegg, Amazon, and countless others also have cool & fast filtering systems for large amounts of data. How do they do it? And is there a database optimization reason for the tendency to provide ranges instead of empty inputs, or is that merely for user convenience?
I believe this post by Explain Extended addresses your question. It's long and detailed, showing many examples. I'll cut/paste his summary to whet your appetite:
In some cases, a range predicate (like "less than", "greater than" or "between") can be rewritten as an IN predicate against the list of values that could satisfy the range condition.
Depending on the column datatype, check constraints and statistics, that list could be comprised of all possible values defined by the column's domain; all possible values defined by the column's minimal and maximal value; or all actual distinct values contained in the table. In the latter case, a loose index scan could be used to retrieve the list of such values.
Since an equality condition is applied to each value in the list, more access and join methods could be used to build the query plan, including range conditions on secondary index columns, hash lookups etc.
Whenever the optimizer builds a plan for a query that contains a range predicate, it should consider rewriting the range condition as an IN predicate and use the latter method if it proves more efficient.
MySQL Edit
It seems that some RDBMSs have some capacity in this regard.
MySQL does have some index "joins" according to the documentation.
[Before MySQL5], MySQL was able to use at most only one index for each referenced table
But in 5 it supports some limited index merging.
You really need to understand how indexes work and when they are useful. At what percentage of rows does a Full Table Scan make more sense than an index? Would you believe that in some scenarios a FTS is cheaper than an index scan that returns 2% of rows? If your Bedrooms histogram looks like this: 1 = 25%, 2 = 50%, 3 = 20%, >3 = 5%, then the only time an index on that column is useful is for finding more than 3 bedrooms, and it won't use it then because of bind variables and clustering factors.
Think of it like this. Assume my percentage of bedrooms is correct. Let's say you have 8K pages (dunno what MySQL uses) and each row is 80 bytes long. Ignoring overhead, you have 100 rows (listings) per page of disk. Since houses are added in random order (random insofar as bedrooms go), each page will have 50 2-bedroom houses, 25 1-bedroom houses, 20 3-bedroom houses, and maybe a 4- or 5-bedroom house or so. EVERY page will have at least one 1-bedroom house, so you'll read EVERY page for BEDROOMS = 1, same for 2, same for 3. An index could help for 5-bedroom houses... but if MySQL bind variables work like Oracle's, then it won't switch plans for a given value of Bedrooms.
As you can see, there's a lot to understand... Far more than Jon Skeet has indicated.
Original Post
Most RDBMSs can't combine indexes on a single table. If you have a table with columns A, B and C, with single-column indexes on A, B and C, and you search WHERE A = a AND B = b AND C = c, it will pick the most selective index and use only that one.
If you create a single multicolumn index on (A, B, C), then that index won't be used unless you include A = a in the WHERE. If your WHERE is B = b AND C = c, then that index is ignored - in most RDBMSs.
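To illustrate that leftmost-prefix behavior (hypothetical table and index):

CREATE INDEX idx_abc ON t (a, b, c);

-- Can use idx_abc (there is a predicate on the leading column a):
SELECT * FROM t WHERE a = 1 AND b = 2;

-- Ignored in most RDBMSs, as described above (no predicate on a):
SELECT * FROM t WHERE b = 2 AND c = 3;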
This is why Oracle invented the bitmap index. Bitmap indexes on A, B and C can be combined with bitwise AND and bitwise OR operations until a final set of rowids is determined and the selected columns are retrieved.
A bitmap index on the REGION column is shown in the last four columns.
Row Region North East West South
1 North 1 0 0 0
2 East 0 1 0 0
3 West 0 0 1 0
4 West 0 0 1 0
5 South 0 0 0 1
6 North 1 0 0 0
So if you say you want a house WHERE Region IN (North, East), you'd bitwise OR the North index and the East index and wind up with rows 1, 2 and 6.
If you had another column with bedroom count such as
Row Bedrooms 1BR 2BR
1 1 1 0
2 2 0 1
3 1 1 0
4 1 1 0
5 2 0 1
6 2 0 1
If you add Bedrooms = 2, that index would return rows 2, 5 and 6, and when bitwise AND'ed with the Region result, would leave rows 2 and 6.
But since you failed to mention the RDBMS I may have completely wasted my time. Oh well.
Wouldn't it be a WHERE x='y' AND a='b' etc query instead?
I'd have thought that several separate indexes should be fine there - no need for anything special.
I'm assuming that your search criteria is discrete, not free-form, that is, you are filtering on something you can quantify like number of bedrooms, size of plot, etc. not whether or not it's in a "sunny location." In that case, I'd suggest that you want to build the query dynamically so that the query only considers the columns of interest in the database. Single column indexes are probably adequate, especially given that you don't seem to have a lot of data. If you find, though, that people are always specifying a couple of columns -- number of bedrooms and number of bathrooms, for example -- then adding a compound index for that combination of columns might be useful. I'd certainly let the statistics and performance drive those decisions, though.
If you're only querying a single table, it will choose the best index to use, if one is applicable. From this perspective, you want to choose columns that are good discriminators and likely to be used in the filter. Limiting the number of indexes can be a good thing, if you know that certain columns will either quickly limit the number of results returned or, conversely, that a particular column isn't a good discriminator. If, for instance, 90% of your houses listed have a plot size of less than an acre and most people search for plots of less than an acre (or don't care), then an index scan based on this index is typically no better than a table scan and there's no need for the index. Indexes do cost something to compute, though for a small database such as yours with infrequent inserts that's likely not an issue.
@Jon is right; I think you probably want to combine the filter properties using an AND rather than an OR. That is, people are generally looking for a house with 3 bedrooms AND 2 bathrooms, not 3 bedrooms OR 2 bathrooms. If you have a filter that allows multiple choices, then you may want to use IN -- say PropertyType IN ('Ranch','SplitLevel',...) instead of an explicit OR (works out the same, but more readable). Note you'd likely be using the foreign key to the PropertyTypes table rather than the text here, but I used the values just for illustration.
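A sketch of the kind of dynamically built filter query being described (the table and column names are illustrative):

-- Only the filters the user actually set end up in the WHERE clause:
SELECT *
FROM listings
WHERE bedrooms >= 3
  AND bathrooms >= 2
  AND property_type_id IN (1, 4);  -- e.g. Ranch, SplitLevel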
What you need is a full-text search engine; Amazon and others use the same. Have a look at http://lucene.apache.org/, and if your platform is based on Java, then higher-level abstractions are available at www.elasticsearch.com and in Hibernate Search.