MySql Indexing Strategy With Multiple Shared Columns - mysql

We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?

Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially "snowflake", it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these perform very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things are also inefficient on space.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to the another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x)
Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
Only now, do you get to b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id)
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: If you also have LIMIT 10,... If the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.

Related

How to get Count for large tables?

Sample Table:
+----+-------+-------+-------+-------+-------+---------------+
| id | col1 | col2 | col3 | col4 | col5 | modifiedTime |
+----+-------+-------+-------+-------+-------+---------------+
| 1 | temp1 | temp2 | temp3 | temp4 | temp5 | 1554459626708 |
+----+-------+-------+-------+-------+-------+---------------+
above table has 50 million records
(col1, col2, col3, col4, col5 these are VARCHAR columns)
(id is PK)
(modifiedTime)
Every column is indexed
For Ex: I have two tabs in my website.
FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
SeocndTab - I print the count of above table with following criteria [col3 like "value3%"]
As I have 50 million records, the count with those criteria takes too much time to get the result.
Note: I would change records data(rows in table) sometime. Insert new rows. Delete not needed records.
I need a feasible solution instead of querying the whole table. Ex: like caching the older count. Is anything like this possible.
While I'm sure it's possible for MySQL, here's a solution for Postgres, using triggers.
Count is stored in another table, and there's a trigger on each insert/update/delete that checks if the new row meets the condition(s), and if it does, add 1 to the count. Another part of the trigger checks if the old row meets the condition(s), and if it does, subtracts 1.
Here's the basic code for the trigger that counts the rows with temp2 = '5':
CREATE OR REPLACE FUNCTION updateCount() RETURNS TRIGGER AS
$func$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt + 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING NEW;
END IF;
IF TG_OP = 'DELETE' OR TG_OP = 'UPDATE' THEN
EXECUTE 'UPDATE someTableCount SET cnt = cnt - 1 WHERE 1 = (SELECT 1 FROM (VALUES($1.*)) x(id, temp1, temp2, temp3) WHERE x.temp2 = ''5'')'
USING OLD;
END IF;
RETURN new;
END
$func$ LANGUAGE plpgsql;
Here's a working example on dbfiddle.
You could of course modify the trigger code to have dynamic where expressions and store counts for each in the table like:
CREATE TABLE someTableCount
(
whereExpr text,
cnt INT
);
INSERT INTO someTableCount VALUES ('temp2 = ''5''', 0);
In the trigger you'd then loop through the conditions and update accordingly.
FirstTab - I print the count of above table with following criteria [col1 like "value1%" and col2 like "value2%"]
That would benefit from a 'composite' index:
INDEX(col1, col2)
because it would be "covering". (That is, all the columns needed in the query are found in a single index.)
SeocndTab - I print the count of above table with following criteria [col3 like "value3%"]
You apparently already have the optimal (covering) index:
INDEX(col3)
Now, let's look at it from a different point of view. Have you noticed that search engines no longer give you an exact count of rows that match? You are finding out why -- It takes too long to do the tally not matter what technique is used.
Since "col1" gives me no clue of your app, nor any idea of what is being counted, I can only throw out some generic recommendations:
Don't give the counts.
Precompute the counts, save them somewhere and deliver 'stale' values. This can be handy if there are only a few different "values" being counted. It is probably not practical for arbitrary strings.
Say "about nnnn" in the output.
Play some tricks to decide whether it is practical to compute the exact value or just say "about".
Say "more than 1000".
etc
If you would like to describe the app and the columns, perhaps I can provide some clever tricks.
You expressed concern about "insert speed". This is usually not an issue, and the benefit of having the 'right' index for SELECTs outweighs the slight performance hit for INSERTs.
It sounds like you're trying to use a hammer when a screwdriver is needed. If you don't want to run batch computations, I'd suggest using a streaming framework such as Flink or Samza to add and subtract from your counts when records are added or deleted. This is precisely what those frameworks are built for.
If you're committed to using SQL, you can set up a job that performs the desired count operations every given time window, and stores the values to a second table. That way you don't have to perform repeated counts across the same rows.
As a general rule of thumb when it comes to optimisation (and yes, 1 SQL server node#50mio entries per table needs one!), here is a list of few possible optimisation techniques, some fairly easy to implement, others maybe need more serious modifications:
optimize your MYSQL field type and sizes, eg. use INT instead of VARCHAR if data can be presented with numbers, use SMALL INT instead of BIG INT, etc. In case you really need to have VARCHAR, then use as small as possible length of each field,
look at your dataset; is there any repeating values? Let say if any of your field has only 5 unique values in 50mio rows, then save those values to separate table and just link PK to this Sample Table,
MYSQL partitioning, basic understanding is shown at this link, so the general idea is so implement some kind of partitioning scheme, e.g. new partition is created by CRONJOB every day at "night" when server utilization is at minimum, or when you reach another 50k INSERTs or so (btw also some extra effort will be needed for UPDATE/DELETE operations on different partitions),
caching is another very simple and effective approach, since requesting (almost) same data (I am assuming your value1%, value2%, value3% are always the same?) over and over again. So do SELECT COUNT() once a while, and then use differencial index count to get actual number of selected rows,
in-memory database can be used alongside tradtional SQL DBs to get often-needed data: simple key-value pair style could be enough: Redis, Memcached, VoltDB, MemSQL are just some of them. Also, MYSQL also knows in-memory engine,
use other types of DBs, e.g NoSQL DB like MongoDB, if your dataset/system can utilize different concept.
If you are looking for aggregation performance and don't really care about insert times, I would consider changing your Row DBMS for a Column DBMS.
A Column RDBMS stores data as columns, meaning each column is indexed independantly from the others. This allows way faster aggregations, I have switched from Postgres to MonetDB (an open source column DBMS) and summing one field from a 6 milions lines table dropped down from ~60s to 50ms. I chose MonetDB as it supports SQL querying and odbc connections which were a plus for my use case, but you will experience similar performance improvements with other Column DBMS.
There is a downside to Column storing, which is that you lose performance on insert, update and delete queries, but from what you said, I believe it won't affect you that much.
In Postgres, you can get an estimated row count from the internal statistics that are managed by the query planner:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'mytable';
Here you have more details: https://wiki.postgresql.org/wiki/Count_estimate
You could create a materialized view first. Something like this:
CREATE MATERIALIZED VIEW mytable AS SELECT * FROM the_table WHERE col1 like "value1%" and col2 like "value2%";`
You can also materialize directly the count queries. If you have 10 tabs, then you should have to materialize 10 views:
CREATE MATERIALIZED VIEW count_tab1 AS SELECT count(*) FROM the_table WHERE col1 like "value1%" and col2 like "value2%";`
CREATE MATERIALIZED VIEW count_tab2 AS SELECT count(*) FROM the_table WHERE col2 like "value2%" and col3 like "value3%";`
...
After each insert, you should refresh views (asynchronously):
REFRESH MATERIALIZED VIEW count_tab1
REFRESH MATERIALIZED VIEW count_tab2
...
As noted in the critique, you have not posted what you have tried. So I would assume that the limit of question is exactly what you posted. So kindly report results of exactly that much
What is the current time you are spending for the subset of the problem, i.e. count of [col1 like "value1%" and col2 like "value2%"] and 2nd [col3 like "value3%]
The trick would be to scan the data source once and make the data source smaller by creating an index. So first create an index on col1,col2,col3,id. Purpose of col3 and id is so that database scans just the index. And I would get both counts in same SQL
select sum
(
case
when col1 like 'value1%' and col2 like 'value2%' then 1
else 0
end
) cnt_condition_1,
sum
(
case
when col3 like 'value3%' then 1
else 0
end
) cnt_condition_2
from table
where (col1 like 'value1%' and col2 like 'value2%') or
(col3 like 'value3%')
```
So the 50M row table is probably very wide right now. This should trim it down - on a reasonable server I would expect above to return in a few seconds. If it does not and each condition returns < 10% of the table, second option will be to create multiple indexes for each scenario and do count for each so that index is used in each case.
If there is no bulk insert/ bulk updates happening in your system, Can you try vertical partitioning in your table? By vertical partitioning, you can separate the data block of col1, col2 from other data of the table and so your searching space will reduce.
Also, indexing on every columns doesn't seem to be the best approach to go with. Index wherever it is absolutely needed. In this case, I would say Index(col1,col2) and Index(col3).
Even after indexing, you need to look into the fragmentation of those indexes and modify it accordingly to get the best results. Because, sometimes 50 million index of one column can sit as one huge chunk, which will restrict multi processing capabilities of your SQL server.
Each Database has their own peculiarities in how to "enhance" their RDBMS. I can't speak for MySQL or SQL Server but for PostgreSQL you should consider making the indexes that you search as GIN (Generalized Inverted Index)-based indexes.
CREATE INDEX name ON table USING gin(col1);
CREATE INDEX name ON table USING gin(col2);
CREATE INDEX name ON table USING gin(col3);
More information can be found here.
-HTH
this will work:
select count(*) from (
select * from tablename where col1 like 'value1%' and col2 like 'value2%' and col3
like'value3%')
where REGEXP_LIKE(col1,'^value1(.*)$') and REGEXP_LIKE(col2,'^value2(.*)$') and
REGEXP_LIKE(col1,'^value2(.*)$');
try not to apply index on all the columns as it slows down the processing of a sql
query and have it in required columns only.

Handling a very large table without index

I have a very large table 20-30 million rows that is completely overwritten each time it is updated by the system supplying the data over which I have no control.
The table is not sorted in a particular order.
The rows in the table are unique, there is no subset of columns that I can be assured to have unique values.
Is there a way I can run a SELECT query followed by a DELETE query on this table with a fixed limit without having to trigger any expensive sorting/indexing/partitioning/comparison whilst being certain that I do not delete a row not covered by the previous select.
I think you're asking for:
SELECT * FROM MyTable WHERE x = 1 AND y = 3;
DELETE * FROM MyTable WHERE NOT (x = 1 AND y = 3);
In other words, use NOT against the same search expression you used in the first query to get the complement of the set of rows. This should work for most expressions, unless some of your terms return NULL.
If there are no indexes, then both the SELECT and DELETE will incur a table-scan, but no sorting or temp tables.
Re your comment:
Right, unless you use ORDER BY, you aren't guaranteed anything about the order of the rows returned. Technically, the storage engine is free to return the rows in any arbitrary order.
In practice, you will find that InnoDB at least returns rows in a somewhat predictable order: it reads rows in some index order. Even if your table has no keys or indexes defined, every InnoDB table is stored as a clustered index, even if it has to generate an internal key called GEN_CLUST_ID behind the scenes. That will be the order in which InnoDB returns rows.
But you shouldn't rely on that. The internal implementation is not a contract, and it could change tomorrow.
Another suggestion I could offer:
CREATE TABLE MyTableBase (
id INT AUTO_INCREMENT PRIMARY KEY,
A INT,
B DATE,
C VARCHAR(10)
);
CREATE VIEW MyTable AS SELECT A, B, C FROM MyTableBase;
With a table and a view like above, your external process can believe it's overwriting the data in MyTable, but it will actually be stored in a base table that has an additional primary key column. This is what you can use to do your SELECT and DELETE statements, and order by the primary key column so you can control it properly.

Ensure certain default sort order in MySql table

I have a large MySql table with over 11 million rows. This is just a huge data set and my task is to be able to analyze the dataset based on certain rules.
Each row belongs to a certain category. There are 2 million different categories. I want to get all rows for a category and perform operations on that.
So currently, I do the following:
Select distinct categories from the table.
for each category : Select fields from table WHERE category=category
Even though my category column is indexed, it takes a really long time to execute Step 2. This is mainly because of the huge data set.
Alternatively, I can use GROUP BY clause, however I am not sure if it will be as fast since GROUP BY on such a huge dataset may be expensive, especially when considering that I will be running my analysis several times on parts of the dataset. A way to permanently ensure a sorted table would be useful.
Therefore as an alternative, I can speed up my queries if only my table is pre-sorted by category. Now I can just read the table row by row and perform the same operations in a much faster time, as all rows of one category will be fetched consecutively.
As the dataset (MySql table) is fixed and no update, delete, insert operations will be performed on it. I want to be able to ensure a way to maintain a default sort order by category. Can anyone suggest a trick to ensure the default sort order of the rows.
Maybe read all rows and rewrite them to a new table or add a new primary key which ensures this order?
Even though my category column is indexed
Indexed by a secondary index? If so, you can encounter the following performance problems:
InnoDB tables are always clustered and the secondary index in clustered table can require a double-lookup (see the "Disadvantages of clustering" in this article).
Indexed rows can be scattered all over the place (index can have bad clustering factor - the link is for Oracle but the principle is the same). If so, an index range scan (such as WHERE category = whatever) can end-up loading many table pages, even though the index is actually used and only a small subset of rows is actually selected. This can destroy the range scan performance.
In alternative to the secondary index, consider using a natural primary key, which in InnoDB tables also acts as a clustering key. The primary/clustering key such as {category, no} will keep the rows of the same category physically close together, making both of your queries (and especially the second one) maximally efficient.
OTOH, if you want to keep the secondary index, consider covering all the fields that you query, so the primary B-Tree doesn't have to be touched at all.
You can do this in one step regardless of indexing by doing something like (pseudo code):
Declare #LastCategory int = Null
Declare #Category int
For Each Row In
Select
#Category = Category,
...
From
Table
Order By
Category
If #LastCategory Is Null Or #LastCategory != #Category
Do any "New Category Steps"
Set #LastCategory = #Category
End
Process Row
End For
With the index on category I'd expect this to perform OK. Your performance issues may be down to what you are doing when processing each row.
Here's an example: http://sqlfiddle.com/#!2/e53c98/1

(Bitwise) Supersets and Subsets in MySQL

Are the following queries effective in MySQL:
SELECT * FROM table WHERE field & number = number;
# to find values with superset of number's bits
SELECT * FROM table WHERE field | number = number;
# to find values with subset of number's bits
...if an index for the field has been created?
If not, is there a way to make it run faster?
Update:
See this entry in my blog for performance details:
Bitwise operations and indexes
SELECT * FROM table WHERE field & number = number
SELECT * FROM table WHERE field | number = number
This index can be effective in two ways:
To avoid early table scans (since the value to compare is contained in the index itself)
To limit the range of values examined.
Neither condition in the queries above is sargable, this is the index will not be used for the range scan (with the conditions as they are now).
However, point 1 still holds, and the index can be useful.
If your table contains, say, 100 bytes per row in average, and 1,000,000 records, then the table scan will need to scan 100 Mb of data.
If you have an index (with a 4-byte key, 6-byte row pointer and some internal overhead), the query will need to scan only 10 Mb of data plus additional data from the table if the filter succeeds.
The table scan is more efficient if your condition is not selective (you have high probablility to match the condition).
The index scan is more efficient if your condition is selective (you have low probablility to match the condition).
Both these queries will require scanning the whole index.
But by rewriting the AND query you can benefit from the ranging on the index too.
This condition:
field & number = number
can only match the fields if the highest bits of number set are set in the field too.
And you should just provide this extra condition to the query:
SELECT *
FROM table
WHERE field & number = number
AND field >= 0xFFFFFFFF & ~((2 << FLOOR(LOG(2, 0xFFFFFFFF & ~number))) - 1)
This will use the range for coarse filtering and the condition for fine filtering.
The more bits for number are unset at the end, the better.
I doubt the optimizer would figure that one...
Maybe you can call EXPLAIN on these queries and confirm my pessimistic guess. (remembering of course that much of query plan decisions are based on the specific instance of a given database, i.e. variable amounts of data and/ore merely data with a different statistical profile may produce distinct plans).
Assuming that the table has a significant amount of rows, and that the "bitwised" criteria remain selective enough) a possible optimization is achieved when avoiding a bitwise operation on every single row, by rewriting the query with an IN construct (or with a JOIN)
Something like that (conceptual, i.e. not tested)
CREATE TEMPORARY TABLE tblFieldValues
(Field INT);
INSERT INTO tblFieldValues
SELECT DISTINCT Field
FROM table;
-- SELECT * FROM table WHERE field | number = number;
-- now becomes
SELECT *
FROM table t
WHERE field IN
(SELECT Field
FROM tblFieldValues
WHERE field | number = number);
The full benefits of an approach like this need to be evaluated with different use cases (all of which with a sizeable number of rows in table, since otherwise the direct "WHERE field | number = number" approach is efficient enough), but I suspect this could be significantly faster. Further gains can be achieved if the "tblFieldValues" doesn't need to be recreated each time. Efficient creation of this table of course implies an index on Field in the original table.
I've tried this myself, and the bitwise operations are not enough to prevent Mysql from using an index on the "field" column. It is likely, though, that a full scan of the index is taking place.

How do I effectively find duplicate blob rows in MySQL?

I have a table of the form
CREATE TABLE data
{
pk INT PRIMARY KEY AUTO_INCREMENT,
dt BLOB
};
It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.
Something like 3000 of the blobs are identical. So what I want is a query that will give me a re map table that will allow me to remove the duplicates.
The naive approach took about an hour on 30-40k rows:
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
ON a.dt=b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;
I happen to have, for other reasons, a table that has the sizes of the blobs:
CREATE TABLE sizes
(
fk INT, // note: non-unique
sz INT
// other cols
);
By building indexes for both fk and another for sz the direct query from that takes about 24 sec with 50k rows:
SELECT da.pk,MIN(db.pk)
FROM data AS da
JOIN data AS db
JOIN sizes AS sa
JOIN sizes AS sb
ON
sa.size=sb.size
AND da.pk=sa.fk
AND db.pk=sb.fk
WHERE
sb.fk<sa.fk
AND da.dt=db.dt
GROUP BY da.pk;
However that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low I'd think that an index scan would be better. With that in mind in added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.
OK so for the question: Am I going to get much better than the second select? If so, how?
A bit of a corollary is: if I have a table where the key column's get very heavy use but the rest should only get rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?
Xgc on #mysql#irc.freenode.net points out that the adding a utility table like sizes but with a unique constraint on fk might help a lot. Some fun with triggers and what not might make it even not to bad to keep up to date.
You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.
The question is if you can save the hashes in your database?