I have a table of the form
CREATE TABLE data
(
pk INT PRIMARY KEY AUTO_INCREMENT,
dt BLOB
);
It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.
Something like 3000 of the blobs are identical. So what I want is a query that will give me a remap table that will allow me to remove the duplicates.
The naive approach took about an hour on 30-40k rows:
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b
ON a.dt=b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;
I happen to have, for other reasons, a table that has the sizes of the blobs:
CREATE TABLE sizes
(
fk INT, -- note: non-unique
sz INT
-- other cols
);
After building one index on fk and another on sz, the direct query using that table takes about 24 sec on 50k rows:
SELECT da.pk, MIN(db.pk)
FROM data AS da
JOIN sizes AS sa ON da.pk = sa.fk
JOIN sizes AS sb ON sa.sz = sb.sz
JOIN data AS db ON db.pk = sb.fk
WHERE sb.fk < sa.fk
AND da.dt = db.dt
GROUP BY da.pk;
However, that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind, I added a 3rd copy of data as a 5th join to get that, and lost about 3 sec.
OK so for the question: Am I going to get much better than the second select? If so, how?
A bit of a corollary is: if I have a table where the key columns get very heavy use but the rest are only rarely used, will I ever be better off adding another join of that table to encourage an index scan vs. a full table scan?
Xgc on #mysql (irc.freenode.net) points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might make it not too bad to keep up to date.
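A rough sketch of that helper table (the names data_meta and data_meta_ai are made up; one row per data row, kept current with an insert trigger):

CREATE TABLE data_meta
(
fk INT PRIMARY KEY, -- exactly one row per data.pk
sz INT,
INDEX (sz)
);

CREATE TRIGGER data_meta_ai AFTER INSERT ON data
FOR EACH ROW
INSERT INTO data_meta (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));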
You can always use a hashing function (MD5 or SHA1) on your data and then compare the hashes.
The question is whether you can store the hashes in your database.
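For example, a minimal sketch (the dt_hash column and index name are made up):

ALTER TABLE data ADD COLUMN dt_hash BINARY(16);
UPDATE data SET dt_hash = UNHEX(MD5(dt));
CREATE INDEX idx_data_dt_hash ON data (dt_hash);

-- remap table: for each duplicate, the smallest pk holding the same content
SELECT a.pk, MIN(b.pk)
FROM data AS a
JOIN data AS b ON a.dt_hash = b.dt_hash AND a.dt = b.dt
WHERE b.pk < a.pk
GROUP BY a.pk;

Comparing the full blobs only for rows whose hashes already match keeps the expensive comparison down to the handful of candidate pairs (and guards against hash collisions).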
Related
We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs. snowflake schema for this table (in the snowflake version, all of the ids go into a single column and an additional id_type column says which type of identifier it is). As long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it, since it will make our load process much easier. Before I do that, however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
);
Would it be better to create individual indexes on just the id columns, or also include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially in a "snowflake", querying on that value becomes somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these performs very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things is also inefficient on space.
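For instance, if one common query filters on a single id plus metric over a time range, a composite index with the equality columns first and the range column last tends to serve it well. A sketch (the literal values, metric name and index name are made up):

SELECT device, browser, os
FROM browser_data
WHERE id_1 = 42
AND metric = 'pageviews'
AND timestamp BETWEEN 1600000000 AND 1600086400;

ALTER TABLE browser_data ADD INDEX idx_id1_metric_ts (id_1, metric, timestamp);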
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized and you are testing on the values, you do not have the two ids in hand, only the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
1. Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. (a needs INDEX(x).)
2. Go back to the main table, looking up rows with those id(s). (main needs INDEX(a_id).) Again, more rows than necessary may be hauled around.
3. Only now do you get to b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. (b needs INDEX(b_id).)
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
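Spelled out for the hypothetical main/a/b tables above, that plan wants something like:

ALTER TABLE a ADD INDEX (x);
ALTER TABLE main ADD INDEX (a_id);
ALTER TABLE b ADD INDEX (b_id);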
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: if you also have LIMIT 10 and the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.
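Concretely, a sketch against the hypothetical main table above (the index name idx_x_y and the selected a_id column are just illustrative):

ALTER TABLE main ADD INDEX idx_x_y (x, y);

-- equality filter: the index pinpoints exactly the desired rows
SELECT a_id FROM main WHERE x = 123 AND y = 345;

-- sort: the index already delivers rows in (x, y) order, so with LIMIT 10
-- only ten rows are read and no tmp table or filesort is needed
SELECT a_id FROM main ORDER BY x, y LIMIT 10;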
I'm trying to understand if it's possible to use an index on a join if there is no limiting where on the first table.
Note: this is not a line-by-line real-case usage, just something I drafted for understanding purposes. Don't point out the obvious "what are you trying to obtain with this schema?", "you should use UNSIGNED" or the like, because that's not the question.
Note 2: the question MySQL JOINS without where clause is somewhat related, but not the same.
Schema:
CREATE TABLE posts (
id_post INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
text VARCHAR(100)
);
CREATE TABLE related (
id_relation INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id_post1 INT NOT NULL,
id_post2 INT NOT NULL
);
CREATE INDEX related_join_index ON related(id_post1) using BTREE;
Query:
EXPLAIN SELECT * FROM posts FORCE INDEX FOR JOIN(PRIMARY) INNER JOIN related ON id_post=id_post1 LIMIT 0,10;
SQL Fiddle: http://sqlfiddle.com/#!2/84597/3
As you can see, the index is being used on the second table, but the engine is doing a full table scan on the first one (the FORCE INDEX is there just to highlight the general question).
I'd like to understand if it's possible to get a "ref" on the left side too.
Thanks!
Update: if the first table has significantly more records than the second, the situation swaps: the engine uses an index for the first one and a full table scan for the second (http://sqlfiddle.com/#!2/3a3bb/1). Still, no way to get indexes used on both.
The DBMS has an optimizer to figure out the best plan to execute a query. It's up to the optimizer to decide whether to use an index or simply read the table directly.
An index makes sense when the DBMS expects to read only a few records from a table (say 1% of all rows). But once it expects to read many records (say 99% of all rows), it will not use the index. The threshold may lie as low as 5% (i.e. <= 5% -> index; > 5% -> table scan).
There are exceptions. One is when an index holds all the columns needed; then the table itself doesn't have to be read at all. Another may be when the optimizer thinks an index access will be faster in spite of having to read many rows. It's also always possible that the optimizer simply guesses wrong.
There is a page on the MySQL documentation about this subject.
Regarding the possibility to get a ref on the first table from the query, the short answer is NO.
The reason is obvious: because there is no WHERE clause, ALL the rows from table posts have to be analyzed, because any of them could be included in the result set. There is no reason to use an index for that; a full table scan is better because it gets all the rows, and because the order doesn't matter, the access is (more or less) sequential. Using an index would require reading more information from storage (index and data).
MySQL will use the join type index if all the columns that appear in the SELECT clause are present in an index. In this case MySQL will perform a full index scan (join type index) instead of a full table scan (join type ALL) because it requires reading less information from the storage (an index is usually smaller than the entire table data).
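A sketch against the schema above (the index name is made up, and whether the optimizer actually reports index rather than ALL for posts still depends on the engine and table statistics, so verify with EXPLAIN):

-- every selected column is now available from an index:
-- posts.id_post from PRIMARY, related's columns from the index below
CREATE INDEX related_covering ON related (id_post1, id_post2);

EXPLAIN
SELECT p.id_post, r.id_post2
FROM posts p
INNER JOIN related r ON r.id_post1 = p.id_post;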
I currently run a site which tracks up-to-the-minute scores and ratings in a list. The list has thousands of entries that are updated frequently, and the list should be sortable by these score and ratings columns.
My SQL for getting this data currently looks like (roughly):
SELECT e.*, SUM(sa.amount) AS score, AVG(ra.rating) AS rating
FROM entries e
LEFT JOIN score_adjustments sa
ON sa.entry_id = e.id
AND sa.created BETWEEN ... AND ...
LEFT JOIN rating_adjustments ra
ON ra.entry_id = e.id
AND ra.rating > 0
GROUP BY e.id
ORDER BY score
LIMIT 0, 10
Where the tables are (simplified):
entries:
id: INT(11) PRIMARY
...other data...
score_adjustments:
id: INT(11), PRIMARY
entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
created: DATETIME
amount: INT(4)
rating_adjustments:
id: INT(11), PRIMARY
entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
rating: DOUBLE
There are approx 300,000 score_adjustments entries and they grow by about 5,000 a day. rating_adjustments is about a quarter of that.
Now, I'm no DBA expert but I'm guessing calling SUM() and AVG() all the time isn't a good thing - especially when sa and ra contain hundreds of thousands of records - right?
I already do caching on the query, but I want the query itself to be fast - yet still as up to date as possible. I was wondering if anyone could share any solutions to optimise heavy join/aggregation queries like this? I'm willing to make structural changes if necessary.
EDIT 1
Added more info about the query.
Your data is badly clustered.
InnoDB will store rows with "close" PKs physically close together. Since your child tables use surrogate PKs, their rows will be stored in effect randomly. When the time comes to make calculations for the given row in the "master" table, DBMS must jump all over the place to gather the related rows from the child tables.
Instead of surrogate keys, try using more "natural" keys, with the parent's PK in the leading edge, similar to this:
score_adjustments:
entry_id: INT(11), FOREIGN KEY (entries.id)
created: DATETIME
amount: INT(4)
PRIMARY KEY (entry_id, created)
rating_adjustments:
entry_id: INT(11), FOREIGN KEY (entries.id)
rating_no: INT(11)
rating: DOUBLE
PRIMARY KEY (entry_id, rating_no)
NOTE: This assumes created's resolution is fine enough and the rating_no was added to allow multiple ratings per entry_id. This is just an example - you may vary the PKs according to your needs.
This will "force" rows belonging to the same entry_id to be stored physically close together, so a SUM or AVG can be calculated by just a range scan on the PK/clustering key and with very few I/Os.
Alternatively (e.g. if you are using MyISAM that doesn't support clustering), cover the query with indexes so the child tables are not touched during querying at all.
On top of that, you could denormalize your design, and cache the current results in the parent table:
Store SUM(score_adjustments.amount) as a physical field and adjust it via triggers every time a row is inserted, updated or deleted from score_adjustments.
Store SUM(rating_adjustments.rating) as "S" and COUNT(rating_adjustments.rating) as "C". When a row is added to rating_adjustments, add it to S and increment C. Calculate S/C at run-time to get the average. Handle updates and deletes similarly.
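A minimal sketch of the insert side (the cached columns score_total, rating_sum and rating_count are made-up names; UPDATE and DELETE triggers would follow the same pattern):

ALTER TABLE entries
ADD COLUMN score_total INT NOT NULL DEFAULT 0,
ADD COLUMN rating_sum DOUBLE NOT NULL DEFAULT 0,
ADD COLUMN rating_count INT NOT NULL DEFAULT 0;

CREATE TRIGGER score_adjustments_ai AFTER INSERT ON score_adjustments
FOR EACH ROW
UPDATE entries SET score_total = score_total + NEW.amount WHERE id = NEW.entry_id;

CREATE TRIGGER rating_adjustments_ai AFTER INSERT ON rating_adjustments
FOR EACH ROW
UPDATE entries
SET rating_sum = rating_sum + NEW.rating,
    rating_count = rating_count + 1
WHERE id = NEW.entry_id;

At read time the average is simply rating_sum / rating_count (guarding the rating_count = 0 case).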
If you're worried about performance, you could add the score and rating columns to the corresponding tables and update them, via triggers, on every insert or update to the referenced tables. This caches the results whenever they change, so you won't have to recalculate them on every read, significantly reducing the amount of joining needed to get the results. Just guessing, but in most cases the results of your query are probably fetched much more often than they are updated.
Check out this SQL Fiddle http://sqlfiddle.com/#!2/b7101/1 to see how to create the triggers and their effect. I only added triggers on insert; you can add update triggers just as easily, and if you ever delete data, add triggers for delete as well.
I didn't add the datetime field. If the BETWEEN ... AND ... parameters change often, you might still have to do that part manually every time; otherwise you can just add the BETWEEN clause to the score_update trigger.
I'm using MySQL, although I suspect this is a generic database question.
I have a table consisting of 6 numeric columns. The first 5 of these make up the primary key.
It is a large table (20 million rows and growing), so some queries take time - about 10 seconds, which in itself is not too long, but I need to run a lot of them.
I understand that the primary key is automatically indexed - is there any advantage in me separately indexing some groups of columns within the primary key that I usually query on?
That is, if I regularly query on the first 3 of the 5 primary key columns, should I create an additional index for these 3, or is that redundant because it's already part of the primary key index?
Ten seconds is quite a long time for a query that returns one or a tiny handful of rows. If the query is returning 3% of the table's contents, though, ten seconds is not too long.
Your primary unique key is backed up by a composite index, let's say an index on
(I1,I2,I3,I4,I5)
You are correct that a query like
WHERE I1 = val AND I2 = val AND I3 = val
and
WHERE I3 = val AND I2 = val AND I1 = val
should use the index created for the primary key. The important thing is that the columns in the composite index are all used, starting with the leftmost one. A query like
WHERE I3 = val AND I4 = val AND I5 = val
won't use the primary key's composite index very well, if at all. Neither will a query that does some kind of computation on the column values mentioned in the key, like
WHERE I1+I2+I3=sumvalue
Keep in mind that "should work" is not the same as "does work." Try using the EXPLAIN command in MySQL to figure out whether the DBMS is doing what you expect it to for your query.
http://dev.mysql.com/doc/refman/5.1/en/explain.html
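If the lookup on the trailing columns matters, the usual fix is a secondary index on exactly those columns. A sketch (the table name big_table, the index name and the literal values are made up, since the real names weren't given); EXPLAIN will show whether it gets picked up:

ALTER TABLE big_table ADD INDEX idx_i3_i4_i5 (I3, I4, I5);

EXPLAIN
SELECT *
FROM big_table
WHERE I3 = 7 AND I4 = 8 AND I5 = 9;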
Why not just create a few test queries, create the index on a copy of the table and see how it performs?
When it comes to performance, measuring is always better than trusting an opinion.
The "best" solution in a database largely depends on the specific details of the table(s) involved. What range of values in the columns, what distribution of values, what type of queries, relative frequency of select/delete/insert/update queries, etc.
That being said, my guess is that an index on a subset will help if that subset contains all the columns used in a query. You might get better performance if you also include the result-set columns (the columns in the SELECT) in the index, making it a covering index.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, WHERE user_name='Albert Gates'. So every time the MySQL server needs to scan 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
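For example, a sketch of the indexing approach (the table name user_data and the literal values are made up, since the original table name wasn't given):

ALTER TABLE user_data ADD INDEX idx_user_property (user_name, user_property);

-- the leading column of the composite index also serves plain user_name lookups,
-- so both of these avoid a full scan of the billion-row table
SELECT user_property, value_of_property
FROM user_data
WHERE user_name = 'Albert Gates';

SELECT value_of_property
FROM user_data
WHERE user_name = 'Albert Gates' AND user_property = 'age';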
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many-to-many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column data types
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.