Removing duplicate TEXT entries from a large MySQL table - mysql

I have a MySQL table with the following structure:
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| id         | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| content    | longtext         | NO   |     | NULL    |                |
| valid      | tinyint(1)       | NO   |     | NULL    |                |
| created_at | timestamp        | YES  |     | NULL    |                |
| updated_at | timestamp        | YES  |     | NULL    |                |
+------------+------------------+------+-----+---------+----------------+
I need to remove duplicate entries based on the content column. This would be easy if it weren't a longtext column: the entries vary in length from 1 character to well over 12,000 characters, and the table has over 4,000,000 rows. Even a simple query like select id from table where content like "%stackoverflow%"; takes 15 seconds to execute. What would be the best approach to remove the duplicate entries without waiting two days for the query to finish?

MD5 is your friend here. Make a separate hashvalues table (to avoid locking/contention with this table in production) with columns for the id and the hash. Index the hash column; a composite primary key on (hash, id) works well. The hash alone can't be the primary key, because duplicate hashes are exactly what you're looking for.
Once the new, empty table is created, use MySQL's md5() function to populate it from your original data, with the original id and md5(content) as the field values. If necessary you can populate the table in batches, if doing it all at once would take too long or slow things down too much.
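For illustration, a minimal sketch of that setup, assuming the original table is named articles (a hypothetical name) and batching by id range:

CREATE TABLE hashvalues (
  hash CHAR(32) NOT NULL,        -- hex string returned by MD5()
  id   INT UNSIGNED NOT NULL,    -- id of the row in the original table
  PRIMARY KEY (hash, id)
) ENGINE=InnoDB;

-- populate in batches so the original table isn't hammered in one pass
INSERT INTO hashvalues (hash, id)
SELECT MD5(content), id
FROM articles
WHERE id BETWEEN 1 AND 500000;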
When the new table is fully populated with data, you can JOIN it to itself like this:
SELECT h1.*
FROM hashvalues h1
INNER JOIN hashvalues h2 on h1.hash = h2.hash and h1.id <> h2.id
This should be MUCH faster than comparing the content directly, since the database only has to compare pre-computed hash values; I'd expect it to run almost instantly. It will tell you which records are potential duplicates. There is still a potential for hash collisions, so you also need to compare the matches back to the original data to be sure, or include an originalcontent column in the new table that you can use with the query above. Once that's done, you will know which records to remove.
This system can be even better if you can add a column to the original table to keep the md5() hash of your content field up to date every time it changes. A Generated Column will work well for this if you have the right storage engine. Otherwise, you can use a trigger. This column will allow you to re-run your duplicates check as needed, without all the extra work with the separate table.
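As a hedged sketch (assuming MySQL 5.7+ for generated columns and the hypothetical table name articles), the in-table variant could look like:

ALTER TABLE articles
  ADD COLUMN content_md5 CHAR(32)
    GENERATED ALWAYS AS (MD5(content)) STORED,
  ADD INDEX idx_content_md5 (content_md5);

With that in place, the duplicate check becomes a self-join on articles.content_md5, with no separate table to maintain.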
Finally, there are also Sha(), Sha1(), and Sha2() functions that might be more collision-resistant. However, the md5() will be much faster and the additional collision resistance isn't enough to avoid the need for also comparing the original data. This also isn't a security situation where collision potential will matter, and so md5() is the better choice here. These aren't passwords, after all.

Related

Could a slow join on two tables be avoided by adding the desired columns from one to the other, or do I have bad relationships?

I have Table A
+-------------+---------+----------+
| id          | int     | NOT NULL |
| name        | varchar | NOT NULL |
| number      | varchar | NOT NULL |
| description | varchar | NOT NULL |
| type        | varchar | NOT NULL |
+-------------+---------+----------+
I then create table C
+--------+---------+----------+
| B_id   | int     | NOT NULL |
| number | varchar | NOT NULL |
| qty    | int     | NOT NULL |
+--------+---------+----------+
Our current query looks like the following:
SELECT C.*, A.* FROM C
JOIN A ON A.number = C.number
WHERE C.B_id = '<insert any number here>'
This join seems to be running a little slow even though we've created an INDEX on A.number. My question is: could we simply avoid the join by taking the desired columns we want from A and adding them as columns in table C, or is that bad practice?
I ask this also because at my day job, in our schemas, we have several tables that reference the same column names from table to table. They are indexed, and they pull from millions of rows of data seamlessly. Why can I not achieve this with such small tables? Am I setting up the relationships incorrectly?
Yes, violating normalization is sometimes necessary to resolve performance problems.
But you should try to avoid this, and only do it when absolutely necessary. Adding the redundant columns means you need to ensure that the columns in C are always in sync with A. You may be able to do this with triggers, but it adds complexity and performance impacts to all queries that update the tables.
This shouldn't normally be needed for individual columns that can be fetched using a simple join on indexed columns. It can be more useful for aggregated data, since queries that perform grouping and aggregation can be very expensive for large datasets. For instance, if you frequently need transaction totals by date, you could use the Event Scheduler to update a table with these totals every night. Past transactions are not usually changed, so you don't have to worry about this getting out of sync with the raw transactions table.
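As an illustration only (table and column names here are assumptions, not from your schema), a nightly refresh of such an aggregate table might look like:

CREATE TABLE daily_totals (
  tx_date DATE NOT NULL PRIMARY KEY,
  total   DECIMAL(12,2) NOT NULL
);

-- requires the Event Scheduler to be enabled: SET GLOBAL event_scheduler = ON;
CREATE EVENT refresh_daily_totals
ON SCHEDULE EVERY 1 DAY
STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 2 HOUR
DO
  REPLACE INTO daily_totals (tx_date, total)
  SELECT DATE(created_at), SUM(amount)
  FROM transactions
  WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
  GROUP BY DATE(created_at);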
Your particular query would benefit from this index:
C: INDEX(B_id)
The query then would:
find the index rows in C for the given B_id
reach over to C's data BTree to get C.*
use C.number to reach into A's INDEX(number)
reach over to A's data BTree to get A.*
If you don't need all of *, there may be further optimizations (by using a "covering" index).
Note: The above assumes ENGINE=InnoDB.
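For reference, a hedged sketch of these indexes (the covering variant assumes which columns you actually need, so adjust it to your query):

ALTER TABLE C ADD INDEX idx_b_id (B_id);

-- optional covering indexes, if only a handful of columns are selected
ALTER TABLE C ADD INDEX idx_cover (B_id, number, qty);
ALTER TABLE A ADD INDEX idx_cover (number, name, type);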

(mysql query performance issues) Indexing of large historical share price database

This might be a trivial question for some of you, but I haven't found/understood a solution to the following problem:
I have a large (roughly 60 GB) database structured the following way:
+------------+----------+------+-----+---------+-------+
| Field      | Type     | Null | Key | Default | Extra |
+------------+----------+------+-----+---------+-------+
| date       | datetime | YES  | MUL | NULL    |       |
| chgpct1d   | double   | YES  |     | NULL    |       |
| pair       | text     | YES  |     | NULL    |       |
+------------+----------+------+-----+---------+-------+
The database stores the last 10 years of daily percentage changes for roughly 200k different pair trades, so neither date nor pair is a unique key (a combination of date + pair would be). There are roughly 2,600 distinct date entries and roughly 200k distinct pairs, which generates more than 520 million rows.
The following query takes several minutes to return a result.
SELECT date, chgpct1d, pair FROM db WHERE date = '2018-12-20';
What can I do to speed things up?
I've read about multiple-column indices but I'm not sure if that would help in my case given that all of the WHERE-queries will only ever point to the 'date' field.
MySQL probably does a full table scan to satisfy your query. That's like looking up a word in a dictionary that has its entries in random order: very slow.
Two things:
Create an index on these columns: (date, chgpct1d, pair).
Because the column named date has the DATETIME data type, it can potentially contain values like 2018-12-20 10:17:20. When you say WHERE date = '2018-12-20' it actually means WHERE date = '2018-12-20 00:00:00'. So, use this instead:
WHERE date >= '2018-12-20'
  AND date <  '2018-12-21'
That will capture all the date values at any time on your chosen date.
Why does this help? Because your multicolumn index starts with date, MySQL can do a range scan on it given the WHERE statement you have. And, because the index contains everything needed by your query the database server doesn't have to look anywhere else, but can satisfy the query directly from the index. That index is said to cover the query.
Notice that with half a gigarow in your table, creating the index will take a while. Do it overnight or something.
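A sketch of that index and query, assuming the table is named prices (a hypothetical name); because pair is a TEXT column it needs a key prefix length, and a prefixed column can't be fully covered, so converting pair to a suitably sized VARCHAR would be even better:

ALTER TABLE prices
  ADD INDEX idx_date_chg_pair (date, chgpct1d, pair(32));

SELECT date, chgpct1d, pair
FROM prices
WHERE date >= '2018-12-20'
  AND date <  '2018-12-21';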

How to create an index on a MUL key in MySQL?

We need to create an index on the "source Path" column, which already has a MUL key. For example, it holds values like /src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph and we need to search with a pattern like '%Sal/2016/Jan%'. The table has almost 10 million records.
Please suggest any ideas for improving performance.
+-------------+----------+------+-----+---------+----------------+
| Field       | Type     | Null | Key | Default | Extra          |
+-------------+----------+------+-----+---------+----------------+
| Id          | int(11)  | NO   | PRI | NULL    | auto_increment |
| Name        | char(35) | NO   |     |         |                |
| Country     | char(3)  | NO   | UNI |         |                |
| source Path | char(20) | YES  | MUL |         |                |
| Population  | int(11)  | NO   |     | 0       |                |
+-------------+----------+------+-----+---------+----------------+
Unfortunately, a search pattern that starts with % cannot use an index (this has little to do with the column being part of a composite index).
You have some options though:
The values in your path seem to have actual meaning. The ideal solution would be to use the meta-data, e.g. the month, name, whatever "SAL" stands for, and store it in their own columns or an attribute table, and then query for that meta-data instead. This is obviously only possible in very specific cases where you have the required meta-data for every path, so it is probably not an option here.
You can add a "search table" (e.g. (id, subpath)) that contains all subpaths of your source path, e.g.
'/src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
...
'/Sal/2016/Jan/31-01/Joseph'
...
'/31-01/Joseph'
'/Joseph'
so 11 rows in your example. It's now possible to use an index on that, e.g. in
...
where exists
(select * from subpaths s
where s.subpath like '/Sal/2016/Jan%' and s.id = outerquery.id)
This relies on knowing the start of your search term. If Sal in your example %Sal/2016/Jan could actually be the end of a longer word (e.g. /NoSal/2016/Jan should also match), you would have to modify your input term to remove the first word: %Sal/2016/Jan% would require you to search for /2016/Jan% (with an index) and then recheck the result set afterwards to see if it also matches %Sal/2016/Jan% (see the fulltext option for an example; it has the same "problem" of only looking for the beginnings of words).
You will have to maintain the search table, which is usually done in a trigger (update the subpath table when you insert, update or delete values in your original table).
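A hedged sketch of that maintenance, assuming the original table is named files (a hypothetical name) with columns Id and `source Path`, and covering only INSERT:

CREATE TABLE subpaths (
  id      INT NOT NULL,             -- Id of the row in the original table
  subpath VARCHAR(255) NOT NULL,
  PRIMARY KEY (id, subpath),
  KEY idx_subpath (subpath)
);

DELIMITER //
CREATE TRIGGER files_ai AFTER INSERT ON files
FOR EACH ROW
BEGIN
  DECLARE remaining VARCHAR(255);
  SET remaining = NEW.`source Path`;
  WHILE remaining IS NOT NULL AND remaining <> '' DO
    INSERT IGNORE INTO subpaths (id, subpath) VALUES (NEW.Id, remaining);
    -- drop the first path segment: '/a/b/c' -> '/b/c'
    SET remaining = IF(LOCATE('/', remaining, 2) > 0,
                       SUBSTRING(remaining, LOCATE('/', remaining, 2)),
                       NULL);
  END WHILE;
END//
DELIMITER ;

Similar triggers for UPDATE and DELETE would keep the table in sync.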
Since this is a new table, you cannot (directly) combine it with another index, e.g. to optimize where country = 'A' and subpath like 'Sal/2016/Jan%' when country = 'A' alone would already get rid of 99.99% of the rows. You may have to check EXPLAIN for your query to see whether MySQL actually uses the index (the optimizer can try something different) and then perhaps reorganize your query (e.g. use a join or FORCE INDEX).
You can use a fulltext search. From the userinput, you would have to generate a query like
select * from
(select * from table
where match(`source Path`) against ('+SAL +2016 +Jan' in boolean mode)) subquery
where `source path` like '%Sal/2016/Jan%'
The fulltext search will not care about the order of the words, so you have to recheck the result set to confirm it actually is the correct path, but the fulltext search will use the (fulltext) index to speed things up. It will only look for the beginnings of words, so, as with the "search table" option, if Sal can be the end of a word you have to remove it from the fulltext search. By default, only words with at least 3 or 4 letters (depending on your engine) will be added to the index, so you may have to set the value of either ft_min_word_len or innodb_ft_min_token_size to whatever fits your requirements.
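A hedged sketch of that setup, using the hypothetical table name files (the variable shown is the InnoDB one; ft_min_word_len is its MyISAM counterpart, and changing either requires rebuilding the fulltext index):

-- in my.cnf, then restart and rebuild the index:
-- innodb_ft_min_token_size = 3
ALTER TABLE files ADD FULLTEXT INDEX ft_source_path (`source Path`);

SELECT * FROM (
  SELECT * FROM files
  WHERE MATCH(`source Path`) AGAINST ('+Sal +2016 +Jan' IN BOOLEAN MODE)
) AS candidates
WHERE `source Path` LIKE '%Sal/2016/Jan%';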
The search table approach is probably the most convenient solution, as it can be used very much like your current search: you can plug the user input in directly in one place (without having to interpret it to build the against (...) expression), and you can also use it easily in other situations (e.g. in something like join table2 on concat(table2.Year,'/',table2.Month,'%') like ...). But you will have to set up the triggers (or however else you maintain the table), which is a little more complicated than just adding a fulltext index.

MySQL EXPLAIN 'type' changes from 'range' to 'ref' when the date in the where statement is changed?

I've been testing out different ideas for optimizing some of the tables we have in our system at work. Today I came across a table that tracks every view on each vehicle in our system. Create table below.
SHOW CREATE TABLE vehicle_view_tracking;
CREATE TABLE `vehicle_view_tracking` (
`vehicle_view_tracking_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`public_key` varchar(45) NOT NULL,
`vehicle_id` int(10) unsigned NOT NULL,
`landing_url` longtext NOT NULL,
`landing_port` int(11) NOT NULL,
`http_referrer` longtext,
`created_on` datetime NOT NULL,
`created_on_date` date NOT NULL,
`server_host` longtext,
`server_uri` longtext,
`referrer_host` longtext,
`referrer_uri` longtext,
PRIMARY KEY (`vehicle_view_tracking_id`),
KEY `vehicleViewTrackingKeyCreatedIndex` (`public_key`,`created_on_date`),
KEY `vehicleViewTrackingKeyIndex` (`public_key`)
) ENGINE=InnoDB AUTO_INCREMENT=363439 DEFAULT CHARSET=latin1;
I was playing around with multi-column and single column indexes. I ran the following query:
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-09-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | range | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyCreatedIndex | 50 | NULL | 23086 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .309 seconds)
Then I changed the date in the where clause from '2011-09-07' to '2011-07-07' and got the following EXPLAIN results:
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-07-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | ref | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyIndex | 47 | const | 53676 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .670 seconds)
I see 4 main changes:
type changed from range to ref
key changed from vehicleViewTrackingKeyCreatedIndex to vehicleViewTrackingKeyIndex
key_len changed from 50 to 47 (caused by the change in key)
rows changed from 23086 to 53676 (caused by the change in key)
At this point, the execution time is only .6 seconds for the slower query; however, we only have about 10% of our vehicles in our database.
It's getting late and I may have overlooked something in the MySQL docs, but I can't seem to find why the key (and in turn the type and rows) changes when the date in the where clause is changed.
Any help is greatly appreciated. I searched for someone having the same/similar issue with a date causing this change and was not able to find anything. If I missed a previous post, please link me :-)
Different search strategies make sense for different data. In particular, index scans (such as range) often have to do a seek to actually read the row. At some point, doing all those seeks is slower than not using the index at all.
Take a trivial example, a table with three columns: id (primary key), name (indexed), birthday. Say it has a lot of data. If you ask MySQL to look for Bob's birthday, it can do that fairly quickly: first, it finds Bob in the name index (this takes a few seeks, log(n) where n is the row count), then one additional seek to read the actual row in the data file and read the birthday from it. That's very quick, and far quicker than scanning the entire table.
Next, consider doing a name like 'Z%'. That is probably a fairly small portion of the table, so it's still faster to find where the Zs start in the name index, then for each one seek into the data file to read the row. (This is a range scan.)
Finally, consider asking for all names starting with M-Z. That's probably around half the data. It could do a range scan, and then a lot of seeks, but seeking randomly over the datafile with the ultimate goal of reading half the rows isn't optimal: it'd be faster to just do a big sequential read over the data file. So, in this case, the index will be ignored.
This is what you're seeing, except that in your case there is another key to fall back on. (It's also possible that it would actually use the date index if it didn't have the other one; it should pick whichever index will be quickest. Beware that MySQL's optimizer often gets this wrong.)
So, in short, this is expected. A query doesn't say how to retrieve the data, rather it says what data to retrieve. The database's optimizer is supposed to find the quickest way to retrieve it.
You may find that an index on both columns, in the order (public_key, created_on_date), is preferred in both cases and speeds up your query. This is because MySQL can only ever use one index per table (per query). Also, the date goes at the end because a range scan can only be done efficiently on the last column used in an index.
[InnoDB actually has another layer of indirection, I believe, but it'd just confuse the point. It doesn't make a difference to the explanation.]
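If you want to verify which plan is actually faster for a given date (since the optimizer can guess wrong), you can pin each index in turn and compare timings; this is a diagnostic sketch using the index names from your SHOW CREATE TABLE, not a permanent fix:

SELECT COUNT(*)
FROM vehicle_view_tracking FORCE INDEX (vehicleViewTrackingKeyCreatedIndex)
WHERE public_key = 'ab12c3' AND created_on_date >= '2011-07-07';

SELECT COUNT(*)
FROM vehicle_view_tracking FORCE INDEX (vehicleViewTrackingKeyIndex)
WHERE public_key = 'ab12c3' AND created_on_date >= '2011-07-07';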

Keeping page changes history. A bit like SO does for revisions

I have a CMS system that stores data across tables like this:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
Entries META Table
+----+----------+-------+-------+
| id | entry_id | value | param |
+----+----------+-------+-------+
Files Table
+----+----------+----------+
| id | entry_id | filename |
+----+----------+----------+
Entries-to-Tags Table
+----+----------+--------+
| id | entry_id | tag_id |
+----+----------+--------+
Tags Table
+----+-----+
| id | tag |
+----+-----+
I am trying to implement a revision system, a bit like SO has. If I were doing it just for the Entries Table, I would plan to simply keep a copy of all changes to that table in a separate table. Since I have to do it for at least 4 tables (the Tags Table doesn't need revisions), this doesn't seem like an elegant solution at all.
How would you guys do it?
Please notice that the Meta Tables are modeled in EAV (entity-attribute-value).
Thank you in advance.
Hi, I am currently working on a solution to a similar problem. I am solving it by splitting each table into two: a control table and a data table. The control table contains a primary key and a reference into the data table; the data table contains an auto-increment revision key and the control table's primary key as a foreign key.
Taking your Entries table as an example:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
becomes
entries              entries_data
+----+----------+    +----------+----+-------+------+--------+--------+
| id | revision |    | revision | id | title | text | index1 | index2 |
+----+----------+    +----------+----+-------+------+--------+--------+
to query
select * from entries join entries_data on entries.revision = entries_data.revision;
Instead of updating the entries_data table, you insert a new row into it and then update the entries table's revision column to point at the newly created revision.
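A minimal sketch of that write path, assuming revision is AUTO_INCREMENT in entries_data and using a made-up id of 42:

-- create the new revision row
INSERT INTO entries_data (id, title, text, index1, index2)
VALUES (42, 'New title', 'New text', 'i1', 'i2');

-- point the control row at the revision just created
UPDATE entries SET revision = LAST_INSERT_ID() WHERE id = 42;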
The advantage of this system is that you can move to different revisions simply by changing the revision value in the entries table. The disadvantage is that you need to update your queries. I am currently integrating this into an ORM layer so the developers don't have to worry about writing SQL anyway. Another idea I am toying with is a centralised revision table which all the data tables use. This would allow you to describe the state of the database with a single revision number, similar to how Subversion revision numbers work.
Have a look at this question: How to version control a record in a database
Why not have a separate history_table for each table (as per the accepted answer on the linked question)? It simply has a compound primary key of the original table's PK and the revision number. You will still need to store the data somewhere, after all.
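As a hedged sketch for the Entries table (the column types here are assumptions):

CREATE TABLE entries_history (
  id         INT NOT NULL,          -- PK of the original entries row
  revision   INT NOT NULL,          -- increments per change of that row
  title      VARCHAR(255),
  text       TEXT,
  index1     VARCHAR(255),
  index2     VARCHAR(255),
  changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, revision)
) ENGINE=InnoDB;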
For one of our projects we went the following way:
Entries Table
+----+-----------+---------+
| id | date_from | date_to |
+----+-----------+---------+
EntryProperties Table
+----------+-----------+-------+------+--------+--------+
| entry_id | date_from | title | text | index1 | index2 |
+----------+-----------+-------+------+--------+--------+
It is fairly complicated, but it still lets you keep track of an object's full lifecycle. So for querying active entities we went with:
SELECT
    entry_id, title, text, index1, index2
FROM
    Entries INNER JOIN EntryProperties
    ON Entries.id = EntryProperties.entry_id
    AND Entries.date_to IS NULL
    AND EntryProperties.date_to IS NULL
The only concern was the situation where an entity is removed (so we set its date_to) and is then restored by an admin; with the given scheme there's no way to track that kind of trick.
The overall downside of any approach like this is obvious: you have to write tons of SQL where a non-versioned DB would get away with something like a simple select over A join B.