Are the following queries effective in MySQL:
SELECT * FROM table WHERE field & number = number;
# to find values with superset of number's bits
SELECT * FROM table WHERE field | number = number;
# to find values with subset of number's bits
...if an index for the field has been created?
If not, is there a way to make it run faster?
Update:
See this entry in my blog for performance details:
Bitwise operations and indexes
SELECT * FROM table WHERE field & number = number
SELECT * FROM table WHERE field | number = number
This index can be effective in two ways:
To avoid early table scans (since the value to compare is contained in the index itself)
To limit the range of values examined.
Neither condition in the queries above is sargable; that is, the index will not be used for a range scan (with the conditions as they are now).
However, point 1 still holds, and the index can be useful.
If your table contains, say, 100 bytes per row on average and 1,000,000 records, then a table scan will need to read 100 MB of data.
If you have an index (with a 4-byte key, a 6-byte row pointer and some internal overhead), the query will need to scan only 10 MB of data, plus additional data from the table for the rows that pass the filter.
The table scan is more efficient if your condition is not selective (you have a high probability of matching the condition).
The index scan is more efficient if your condition is selective (you have a low probability of matching the condition).
Both these queries will require scanning the whole index.
But by rewriting the AND query you can benefit from the ranging on the index too.
This condition:
field & number = number
can only match fields in which the highest set bit of number is set too (in fact, every set bit of number must be set in the field).
And you should just provide this extra condition to the query:
SELECT *
FROM table
WHERE field & number = number
AND field >= 0xFFFFFFFF & ~((2 << FLOOR(LOG(2, 0xFFFFFFFF & ~number))) - 1)
This will use the range for coarse filtering and the condition for fine filtering.
The more bits of number are unset at the end, the better.
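The superset property behind this trick can be checked outside MySQL. The sketch below uses Python's sqlite3 with a made-up table, and a simpler (but also valid) lower bound, field >= number, which holds because a value whose bits are a superset of number's bits is always numerically at least number:

```python
import sqlite3

# Hypothetical table and data. Any field whose bits are a superset of
# number's bits must be numerically >= number, so the extra range
# condition cannot lose any matching rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in range(256)])

number = 0b10100  # bits 2 and 4 set

plain = conn.execute(
    "SELECT field FROM t WHERE field & ? = ?", (number, number)
).fetchall()

with_bound = conn.execute(
    "SELECT field FROM t WHERE field & ? = ? AND field >= ?",
    (number, number, number),
).fetchall()

assert plain == with_bound      # the coarse range filter loses no rows
print(len(plain))               # 64: the 8-bit supersets of 0b10100
```

The mask expression in the query above computes a coarser bound that covers more of the range; the point of either bound is only to let the optimizer start a range scan instead of reading every row.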
I doubt the optimizer would figure that one out...
Maybe you can run EXPLAIN on these queries and confirm my pessimistic guess (remembering, of course, that many query-plan decisions depend on the specific instance of a given database, i.e. a different amount of data, or merely data with a different statistical profile, may produce distinct plans).
Assuming that the table has a significant number of rows, and that the "bitwised" criteria remain selective enough, a possible optimization is to avoid the bitwise operation on every single row by rewriting the query with an IN construct (or with a JOIN).
Something like this (conceptual, i.e. not tested):
CREATE TEMPORARY TABLE tblFieldValues
  (Field INT);

INSERT INTO tblFieldValues
  SELECT DISTINCT Field
  FROM table;

-- SELECT * FROM table WHERE field | number = number;
-- now becomes
SELECT *
FROM table t
WHERE field IN
  (SELECT Field
   FROM tblFieldValues
   WHERE Field | number = number);
The full benefits of an approach like this need to be evaluated with different use cases (all of them with a sizeable number of rows in table, since otherwise the direct "WHERE field | number = number" approach is efficient enough), but I suspect this could be significantly faster. Further gains can be achieved if tblFieldValues doesn't need to be recreated each time. Efficient creation of this table of course implies an index on Field in the original table.
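A quick way to convince yourself the rewrite is equivalent is to run both forms against the same data. This is a minimal sketch in Python's sqlite3 (table name and data are invented; a subquery stands in for the actual temporary table):

```python
import sqlite3

# Invented table: many rows, few distinct field values (the situation
# where the rewrite helps). The IN form must return the same rows as
# the direct bitwise WHERE clause.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, field INTEGER)")
conn.executemany(
    "INSERT INTO t (field) VALUES (?)",
    [(v % 64,) for v in range(10_000)],
)

number = 0b101000  # field | number = number <=> field's bits are a subset

direct = conn.execute(
    "SELECT id FROM t WHERE field | ? = ? ORDER BY id", (number, number)
).fetchall()

rewritten = conn.execute(
    """SELECT id FROM t
       WHERE field IN (SELECT DISTINCT field FROM t
                       WHERE field | ? = ?)
       ORDER BY id""",
    (number, number),
).fetchall()

assert direct == rewritten
print(len(direct))  # rows whose field is one of 0, 8, 32, 40
```

The payoff comes from evaluating the bitwise expression once per distinct value instead of once per row.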
I've tried this myself, and the bitwise operations are not enough to prevent MySQL from using an index on the field column. It is likely, though, that a full scan of the index is taking place.
Related
The user will select a date, e.g. 06-MAR-2017, and I need to retrieve hundreds of thousands of records with dates earlier than 06-MAR-2017 (but this could vary depending on the user's selection).
For the above case, I am using this query:
SELECT col FROM table_a WHERE DATE_FORMAT(mydate,'%Y%m%d') < '20170306'
I feel that the query is kind of slow. Is there a faster way to get date results like this?
With 100,000 records to read, the DBMS may decide to read the table record by record (full table scan), and there wouldn't be much you could do about it.
If on the other hand the table contains billions of records, so 100,000 would just be a small part, then the DBMS may decide to use an index instead.
In any case you should at least give the DBMS the opportunity to select via an index. This means: create an index first (if one doesn't exist yet).
You can create an index on the date column alone:
create index idx on table_a (mydate);
or even provide a covering index that contains the other columns used in the query, too:
create index idx on table_a (mydate, col);
Then write your query such that the date column is accessed directly. You have no index on DATE_FORMAT(mydate,'%Y%m%d'), so the above indexes don't help with your original query. You need a query that looks up the date itself:
select col from table_a where mydate < date '2017-03-06';
Whether the DBMS then uses the index or not is still up to the DBMS. It will try to use the fastest approach, which very well can still be the full table scan.
If you apply a function to a column on the left side of a comparison, MySQL will do a full table scan.
The fastest method would be to have an index on mydate and to make the right side ('20170306') the same datatype as the column (and the index).
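The difference between the two predicates is easy to demonstrate, with Python's sqlite3 standing in for MySQL (strftime plays the role of DATE_FORMAT; table and data are hypothetical): both return the same rows, but only the bare-column comparison is eligible for an index range scan:

```python
import sqlite3

# sqlite3 stands in for MySQL; strftime plays the role of DATE_FORMAT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (mydate TEXT, col INTEGER)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [("2017-03-%02d" % d, d) for d in range(1, 11)],
)
conn.execute("CREATE INDEX idx ON table_a (mydate, col)")  # covering index

# Same rows either way...
wrapped = conn.execute(
    "SELECT col FROM table_a WHERE strftime('%Y%m%d', mydate) < '20170306'"
).fetchall()
bare = conn.execute(
    "SELECT col FROM table_a WHERE mydate < '2017-03-06'"
).fetchall()
assert sorted(wrapped) == sorted(bare)

# ...but only the bare comparison can use the index for a range search:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT col FROM table_a WHERE mydate < '2017-03-06'"
).fetchall()
print(plan[0][3])  # mentions the index, e.g. "... INDEX idx (mydate<?)"
```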
We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance-test the star vs. snowflake schema (where all of the ids go into a single column, but an additional column id_type is added to determine which type of identifier it is) for this table. But as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it, since it will make our load process much easier. Before I do that, however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
  id_1 int,
  id_2 int,
  id_3 int,
  id_4 int,
  metric varchar(20),
  browser varchar(20),
  device varchar(20),
  os varchar(20),
  timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move such a value to another table (especially in a "snowflake" design), querying based on the value becomes somewhere between a little and a lot slower. This especially hurts when you need to filter on more than one metric that is not in the main table. Either of these performs very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things is also inefficient on space.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x)
Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
Only now, do you get to b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id)
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
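This effect can be observed with a small sketch in Python's sqlite3 (standing in for MySQL; table and data are invented): before the composite index exists, the plan contains an explicit sort step; afterwards the index delivers the rows already in order:

```python
import sqlite3

# Invented table; x and y live in the main table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO main VALUES (?, ?)",
    [(i % 10, i % 7) for i in range(100)],
)

def order_by_plan():
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT x, y FROM main ORDER BY x, y"
    ).fetchall()
    return " | ".join(r[3] for r in rows)

before = order_by_plan()  # includes a sort step (a tmp table, in MySQL terms)
conn.execute("CREATE INDEX idx_xy ON main (x, y)")
after = order_by_plan()   # the index delivers rows already sorted

assert "TEMP B-TREE" in before
assert "TEMP B-TREE" not in after
print(after)
```

sqlite's "USE TEMP B-TREE FOR ORDER BY" is the analogue of MySQL's "Using temporary; Using filesort" in EXPLAIN output.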
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: If you also have LIMIT 10,... If the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.
I have a simple history table that I am developing a new lookup for. I am wondering what is the best index (if any) to add to this table so that the lookups are as fast as possible.
The history table is a simple set of records of actions taken. Each action has a type and an action date (and some other attributes). Every day a new set of action records is generated by the system.
The relevant pseudo-schema is:
TABLE history
id int,
type int,
action_date date
...
INDEX
id
...
Note: the table is not indexed on type or action_date.
The new lookup function is intended to retrieve all the records of a specific type that occurred on a specific action date.
My initial inclination is to define a compound key consisting of both the type and the action_date.
However in my case there will be many actions with the same type and date. Further, the actions will be roughly evenly distributed in number each day.
Given all of the above: (a) is an index worthwhile; and (b) if so, what is the preferred index(es)?
I am using MySQL, but I think my question is not specific to this RDBMS.
The first field on index should be the one giving you the smallest dataset for the majority of queries after the condition is applied.
Depending on your business requirements, you may request a specific date or a specific date range (most likely the date range), so the date should be the last field in the index. Most likely you will always have a date condition.
A common answer is to have the (type,date) index, but you should consider just the date index if you ever query more than one type value in the query or if you have just a few types (like less than 5) and they are not evenly distributed.
For example, if type 1 makes up 70% of the table, types 2, 3, 4, ... each account for less than a few percent, and you often query type 1, you are better off with a separate date index plus a separate type index (for the cases when you query types 2, 3, 4, ...), not a compound (type, date) index.
INDEX(type, action_date), regardless of cardinality or distribution of either column. Doing so will minimize the number of 'rows' of the index's BTree that need to be looked at. (Yes, I am disagreeing with Sergiy's answer.)
Even WHERE type IN (2,3) AND action_date ... can use that index.
For checking against a date range of, say 2 weeks, I recommend this pattern:
AND action_date >= '2016-10-16'
AND action_date <  '2016-10-16' + INTERVAL 2 WEEK
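The same half-open pattern, translated to Python's sqlite3 for a quick check (data invented): the window includes the start date and excludes the end date, so adjacent two-week windows never overlap or skip a day:

```python
import sqlite3

# Invented data: one action per day from 2016-10-01 onwards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (action_date TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (date('2016-10-01', ? || ' days'))",
    [(i,) for i in range(40)],
)

# Half-open two-week window: start included, end excluded.
rows = conn.execute(
    """SELECT COUNT(*) FROM history
       WHERE action_date >= '2016-10-16'
         AND action_date <  date('2016-10-16', '+14 days')"""
).fetchone()[0]

print(rows)  # 14: 2016-10-16 through 2016-10-29
```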
A way to see how much "work" is needed for a query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers presented will give you a feel for how many index (or data) rows need to be touched. This makes it easy to see which of two possible queries/indexes works better, even when the table is too small to get reliable timings.
Yes, an index is worthwhile. Especially if you search for a small subset of the table.
If your search would match 20% or more of the table (approximately), the MySQL optimizer decides that the index is more trouble than it's worth, and it'll do a table-scan even if the index is available.
If you search for one specific type value and one specific date value, an index on (type, date) or an index on (date, type) is a good choice. It doesn't matter much which column you list first.
If you search for multiple values of type or multiple dates, then the order of columns matters. Follow this guide:
The leftmost columns of the index should be the ones on which you do equality comparisons. An equality comparison is one that matches exactly one value (even if that value is found on many rows).
WHERE type = 2 AND date = '2016-10-19' -- both equality
The next column of the index can be part of a range comparison. A range comparison matches multiple values. For example, > or IN( ) or BETWEEN or !=.
WHERE type = 2 AND date > '2016-10-19' -- one equality, one range
Only one such column benefits from an index. If you have range comparisons on multiple columns, only the first column of the index will use the index to support lookups. The subsequent column(s) will have to search through those matching rows "the hard way".
WHERE type IN (2, 3, 4) AND date > '2016-10-19' -- multiple range
If you sometimes search using a range condition on type and equality on date, you'll need to create a second index.
WHERE type IN (2, 3, 4) AND date = '2016-10-19' -- make index on (date, type)
The order of terms in your WHERE clause doesn't matter. The SQL query optimizer will figure that out and reorder them to match the right columns defined in an index.
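The equality-then-range rule can be seen in a query plan. Below is a sketch in Python's sqlite3 (as a stand-in for MySQL; schema and data are invented): with the (type, action_date) index, an equality on type plus a range on action_date is resolved by a single index search over both columns:

```python
import sqlite3

# Invented schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (type INTEGER, action_date TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, date('2016-10-01', ? || ' days'))",
    [(i % 5, i % 30) for i in range(500)],
)
conn.execute("CREATE INDEX idx_td ON history (type, action_date)")

plan = conn.execute(
    """EXPLAIN QUERY PLAN
       SELECT type, action_date FROM history
       WHERE type = 2 AND action_date > '2016-10-19'"""
).fetchall()
detail = plan[0][3]
print(detail)  # equality on type and range on action_date, both via idx_td
```

The plan detail shows both constraints (type=? and action_date>?) handled by the index search, which is exactly what the leftmost-equality guideline above predicts.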
What is the best solution for storing boolean values in a database if you want better query performance and minimal wasted memory in select statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records in which particular boolean fields have true values.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use one of the other two methods.
Bitmasks
As suggested above, you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags = flags | b'100';
UPDATE table SET flags = flags | b'10000';
Then the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
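A minimal runnable version of the bitmask approach, in Python's sqlite3 (sqlite has no b'...' literals, so the masks are passed as integer parameters; note the OR-assignment, since a plain flags = ... would overwrite flags that were already set):

```python
import sqlite3

# sqlite has no b'...' literals; masks are passed as Python ints.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, flags INTEGER DEFAULT 0)")
conn.execute("INSERT INTO t (id) VALUES (1)")

# OR-assignment sets a flag without clearing the others:
conn.execute("UPDATE t SET flags = flags | ? WHERE id = 1", (0b100,))
conn.execute("UPDATE t SET flags = flags | ? WHERE id = 1", (0b10000,))

flags = conn.execute("SELECT flags FROM t WHERE id = 1").fetchone()[0]
print(bin(flags))  # 0b10100: both flags set

# Testing for one particular flag:
hit = conn.execute(
    "SELECT id FROM t WHERE flags & ? != 0", (0b100,)
).fetchall()
assert hit == [(1,)]
```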
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
  main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE flag (
  main_id INT UNSIGNED NOT NULL,
  name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
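A runnable sketch of this one-to-many design in Python's sqlite3 (syntax differs slightly from MySQL; the idx_flag index and the flag names are illustrative additions):

```python
import sqlite3

# Schema from the text, in sqlite3 syntax; the index and the flag
# names are illustrative additions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (main_id INTEGER PRIMARY KEY);
    CREATE TABLE flag (main_id INTEGER NOT NULL, name VARCHAR(16));
    CREATE INDEX idx_flag ON flag (name, main_id);
""")
conn.execute("INSERT INTO main (main_id) VALUES (1), (2), (3)")
conn.executemany(
    "INSERT INTO flag (main_id, name) VALUES (?, ?)",
    [(1, "active"), (1, "verified"), (3, "active")],
)

# All main rows having a given flag set:
active = conn.execute(
    """SELECT m.main_id FROM main m
       WHERE EXISTS (SELECT 1 FROM flag f
                     WHERE f.main_id = m.main_id AND f.name = 'active')"""
).fetchall()
print(active)  # [(1,), (3,)]
```

Only rows 1 and 3 carry the "active" flag, so only they come back, and the (name, main_id) index makes that lookup cheap.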
For a performance comparison, you can read a blog post I wrote on the topic:
Set Performance Compare
Also, when you ask which is "best", that's a very subjective question. Best at what? It all really depends on what your data looks like, what your requirements are, and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then MySQL will ignore indexes and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in a single integer value and use bit operations to test for them. It's not indexable, but the storage is very well packed. With TINYINT fields and indexes, MySQL would pick one index to use and scan from there.
Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyISAM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if the available index is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable-length indexes (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indexes can be counted quickly, which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NOT NULL, then COUNT() can reach a million quite easily: the number of records can be read straight from the index. If NULL values are allowed, those aren't indexed, so the count can't be obtained from the index alone.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
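The badges example from the question can be sketched in Python's sqlite3 (as a stand-in for InnoDB; data invented): with an index on user_id, the per-user COUNT turns into a narrow index search rather than a full table scan:

```python
import sqlite3

# Invented data: 1000 badges spread across 100 users.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_badges (id INTEGER PRIMARY KEY, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO user_badges (user_id) VALUES (?)",
    [(i % 100,) for i in range(1000)],
)
conn.execute("CREATE INDEX idx_user ON user_badges (user_id)")

n = conn.execute(
    "SELECT COUNT(id) FROM user_badges WHERE user_id = 11"
).fetchone()[0]
print(n)  # 10

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(id) FROM user_badges WHERE user_id = 11"
).fetchall()
print(plan[0][3])  # a narrow search on idx_user, not a full scan
```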
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most cases that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.