MySQL: how do I delete duplicate rows from a very large table?

I need to know the most effective way of deleting duplicate rows from a very large table (over 1 billion rows), as an inefficient query could take days to run.
I need to delete all duplicate URLs in the search table, i.e.:
DELETE FROM search WHERE (url) NOT IN
(
    SELECT url FROM
    (
        SELECT url FROM search GROUP BY url
    ) X
);

Depends entirely on your indexes. Do this in two steps: (1) create the highest-selectivity index your DBMS supports on the URL field combined with any other field that can distinguish records with the same URL, such as a primary key or timestamp field; (2) write procedural code (not just a single query) to process a small fraction of the records at a time and commit results in these small batches, e.g. sliced by PK mod 1000, or by the 3 characters of the URL preceding the .TLD part.
This is the best way to have a predictable result, unless you are sure the DB process won't run out of memory, log file space, etc. during the long cycle of deletes a straight query would require.
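A minimal sketch of that batching idea, assuming the question's search table has an integer primary key id and that the row with the lowest id should be kept for each url (the index name and slice size are illustrative):
-- Supporting index: url plus the primary key that distinguishes duplicates.
ALTER TABLE search ADD INDEX idx_url_id (url, id);

-- One small batch: delete duplicates only among rows in slice 0 of 1000.
-- Run this once per slice (0, 1, 2, ... 999), committing after each one,
-- from a stored procedure or an application-side loop.
DELETE s
FROM search AS s
JOIN search AS keeper
  ON keeper.url = s.url
 AND keeper.id < s.id         -- a row with a smaller id exists for the same url
WHERE s.id % 1000 = 0;        -- slice selector; vary 0..999 per batch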

DELETE FROM search
WHERE id NOT IN (
    -- wrap in a derived table so MySQL allows referencing the target table
    SELECT id FROM (
        SELECT MIN(id) AS id FROM search
        GROUP BY url
        HAVING COUNT(*) = 1
        UNION
        SELECT MIN(id) FROM search
        GROUP BY url
        HAVING COUNT(*) > 1
    ) AS keep_ids
);

Related

Slow Query: Data categorization

I currently have a table (AllProducts) which contains product information. It has 16 columns and approximately 125,000 rows.
I need to create a unique value in the database, as there is no unique value present in the table. I cannot use the auto-increment feature, as my database gets emptied out and filled again on a daily basis (and thus the IDs for specific products would change).
I want to use a varchar field (url) as the unique value. In order to do this I created a view (AllProductsCategories) which makes sure the combination of url and shop is unique:
select min(`a`.`insertionTime`) AS `insertionTime`,
`a`.`shop` AS `shop`,
min(`a`.`name`) AS `name`,
min(`a`.`category`) AS `category`,
max(`a`.`description`) AS `description`,
min(`a`.`price`) AS `price`,
`a`.`url` AS `url`,
avg(`a`.`image`) AS `image`,
min(`a`.`fromPrice`) AS `fromPrice`,
min(`a`.`deliveryCosts`) AS `deliveryCosts`,
max(`a`.`stock`) AS `stock`,
max(`a`.`deliveryTime`) AS `deliveryTime`,
max(`a`.`ean`) AS `ean`,
max(`a`.`color`) AS `color`,
max(`a`.`size`) AS `size`,
max(`a`.`brand`) AS `brand`
from `AllProducts` `a`
group by `a`.`url`, `a`.`shop`
order by NULL
This works fine but is quite slow. The query below takes 51 seconds to complete:
SELECT * FROM ProductsCategories ORDER BY NULL LIMIT 50
I am quite new to MySQL and experimented by indexing the following columns: category, name, url, shop and shop/url.
Now my questions:
1) Is this the correct approach if I want to ensure that the url field is unique? I currently use a group by to merge all info about one url. An alternative approach could be to delete duplicates (not sure how to do this though).
2) If the current approach is OK, how can I speed up this process?
If the data is re-loaded every day, then you should just fix it when it is reloaded.
Perhaps that is not possible. I would suggest the following approach, assuming that the triple url, shop, InsertionTime is unique. First, build an index on url, shop, InsertionTime. Then use this query:
select ap.*
from AllProducts ap
where ap.InsertionTime = (select InsertionTime
                          from AllProducts ap2
                          where ap2.url = ap.url and
                                ap2.shop = ap.shop
                          order by InsertionTime
                          limit 1
                         );
MySQL does not allow subqueries in the from clause of a view. It does allow them in the select and where (and having) clauses. This should cycle through the table, doing an index lookup for each row, just returning the ones that have the minimum insertion time.
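For reference, the suggested index could be created like this (the index name is just illustrative); it lets the correlated subquery's ORDER BY InsertionTime ... LIMIT 1 be answered directly from the index:
CREATE INDEX idx_url_shop_time ON AllProducts (url, shop, InsertionTime);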

get last record in file

I have a table (rather ugly design, but anyway) which consists only of strings. The worst part is that there is a script which adds records from time to time. Records are never deleted.
I believe that MySQL stores records in a random-access file, and that I could get the last (or any other) record using C or something similar, since I know the max length of a record and can find the EOF.
When I do something like "SELECT * FROM table" in MySQL I get all the records in the right order, because MySQL reads this file from beginning to end. I need only the last one(s).
Is there a way to get the LAST record (or records) using MySQL query only, without ORDER BY?
Well, I suppose I've found a solution here, so my current query is
SELECT
    @i := @i + 1 AS iterator,
    t.*
FROM
    table t,
    (SELECT @i := 0) i
ORDER BY
    iterator DESC
LIMIT 5
If there's a better solution, please let me know!
The order is not guaranteed unless you use an ORDER BY. It just happens that the records you're getting back are sorted the way you need them.
This is where keys (a primary key, for example) become important.
You can modify your table by adding a primary key column with an AUTO_INCREMENT value.
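A minimal sketch of that change, assuming the table is called your_table as in the query below (the column name id is illustrative):
-- Add a surrogate key so the "last inserted" row can be identified reliably.
ALTER TABLE your_table
  ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id);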
Then you can query:
select * from your_table where id =(select max(id) from your_table);
and get the last inserted row.

mysql query using where clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`, `Stock`.`Description`,
                `TComponent_Status`.`component`, `TComponent_Status`.`certificate`,
                `TComponent_Status`.`status`, `TComponent_Status`.`date_created`
FROM Stock, TBOM, TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
  AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in those columns, so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
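For example, the indexes could be added like this (index names are illustrative; plain, non-unique indexes are fine even though TBOM contains duplicates):
ALTER TABLE TBOM              ADD INDEX idx_tbom_component (Component);
ALTER TABLE TBOM              ADD INDEX idx_tbom_product   (Product);
ALTER TABLE TComponent_Status ADD INDEX idx_tcs_component  (component);
ALTER TABLE Stock             ADD INDEX idx_stock_product  (ProductNumber);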
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't need to go to the full data row of each Stock record; both parts are in the index, so it can use the index alone. Additionally, you are doing a DISTINCT, so having the index available to optimize the DISTINCT should also help.
Now, the other issue is time. Since you are doing a DISTINCT from stock to product to product status, you are asking for all 24 million TBOM items (assuming bill of materials), and since each BOM component could have multiple status rows created, you are getting every BOM entry for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
       Stock.ProductNumber,
       Stock.Description,
       JustThese.component,
       JustThese.certificate,
       JustThese.`status`,
       JustThese.date_created
FROM
     ( select DISTINCT
              TCS.Component,
              TCS.Certificate,
              TCS.`status`,
              TCS.date_created
       from
              TComponent_Status TCS
       where
              TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
  on JustThese.Component = TBOM.Component
JOIN Stock
  on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status ) as the index. This way, the WHERE clause would be optimized, and the DISTINCT would be too, since the pieces are already part of the index.
But, the way you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 or 1,000 entries in your result set. Take this and span it across 24 million rows, and it's definitely not going to look good.

Improve performance for query to delete duplicates

My hosting company recently gave me this entry from the slow-query log. The rows examined seem excessive and might be helping to slow down the server. A test in phpMyAdmin resulted in a duration of 0.9468 seconds.
The Check_in table ordinarily contains 10,000 to 17,000 rows. It also has one index: Num, unique = yes, cardinality = 10852, collation = A.
I would like to improve this query. The first five conditions following WHERE contain the fields to check to throw out duplicates.
# User@Host: fxxxxx_member[fxxxxx_member] @ localhost []
# Query_time: 5 Lock_time: 0 Rows_sent: 0 Rows_examined: 701321
use fxxxxx_flifo;
SET timestamp=1364277847;
DELETE FROM Check_in
USING Check_in, Check_in AS vtable
WHERE ( Check_in.empNum = vtable.empNum )
  AND ( Check_in.depCity = vtable.depCity )
  AND ( Check_in.travelerName = vtable.travelerName )
  AND ( Check_in.depTime = vtable.depTime )
  AND ( Check_in.fltNum = vtable.fltNum )
  AND ( Check_in.Num > vtable.Num )
  AND ( Check_in.accomp = 'NO' )
  AND Check_in.depTime >= TIMESTAMPADD ( MINUTE, 3, NOW() )
  AND Check_in.depTime < TIMESTAMPADD ( HOUR, 26, NOW() );
Edit:
empNum int (6)
lastName varchar (30)
travelerName varchar (40) (99.9% = 'All')
depTime datetime
fltNum varchar (6)
depCity varchar (4)
23 fields total (including one blob, holding 25K images)
Edit:
ADD INDEX deleteQuery (empNum, lastName, travelerName, depTime, fltNum, depCity, Num)
Is this a matter of creating an index? If so, what type and what fields?
The last three conditions limit the number of rows by asking whether the check-in is accomplished and within the time period. Could they be better positioned (earlier) in the query? Is the 5th AND ... necessary?
Open to all ideas. Thanks for looking.
It's hard to know exactly how to help without seeing the table definition.
Don't delete the self-join (the same table mentioned twice) because this query is clearing out duplicates (check_in.Num > vtable.Num).
Do you have an index on depTime? If not, add one.
You may also want to add a compound index on
(empNum,depCity,travelerName,depTime,fltNum)
to optimize the self-join. You probably have to muck about a bit to figure out what works.
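For example (index names are illustrative):
ALTER TABLE Check_in ADD INDEX idx_deptime (depTime);
ALTER TABLE Check_in ADD INDEX idx_dedupe  (empNum, depCity, travelerName, depTime, fltNum);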
If your objective is to delete duplicates, then the solution is to avoid having duplicates in the first place: define a unique index across the fields that you deem to collectively define a duplicate (but you won't be able to create the index while you have duplicates in the database).
The index you need for this query is on (depTime, empNum, depCity, travelerName, fltNum, Num, accomp) in that order. The depTime field has to come first for it to optimize the two accesses on the table. Once you've removed the duplicates, make the index unique.
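Once the duplicates are gone, the unique version might look like this, a sketch assuming the five fields the question's WHERE clause compares are what collectively defines a duplicate:
-- Only possible after the existing duplicates have been deleted.
ALTER TABLE Check_in
  ADD UNIQUE INDEX uniq_checkin (depTime, empNum, depCity, travelerName, fltNum);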
Leaving that aside for now, you've got a whole load of performance problems.
1) You appear to be offering some sort of commercial service, so why are you waiting for your ISP to tell you that your site is running like a dog?
2) While your indexes should be designed to prevent duplicates, there are many cases where other indexes will help with performance, but in order to understand what those are you need to look at all the queries running against your data.
3) The blob should probably be in a separate table.
Could they be better positioned (earlier) in the query?
Order of predicates at the same level in the query hierarchy has no impact on performance.
Is the 5th AND necessary?
If you mean 'AND ( Check_in.Num > vtable.Num )', then yes: without that it will remove all the rows that are duplicated, i.e. it won't leave one row behind.
The purpose of indexes is to speed up searches and filters... the index is (in layman's terms) a sorted table that pinpoints each row of the data (which may itself be unsorted).
So, if you want to speed up your delete query, it would help to know where the data is. As a set of rules of thumb, you will want to add indexes on the following fields:
Every primary or foreign key
Every date on which you perform frequent searches / filters
Every numeric field on which you perform frequent searches / filters
I avoid indexes on text fields, since they are quite expensive (in terms of space), but if you need to perform frequent searches on text fields, you should also index them.

Questions on how to randomly Query multiple rows from Mysql without using "ORDER BY RAND()"

I need to query MySQL with some condition and get five different random rows from the result.
Say I have a table named 'user' and a field named 'cash'. I can compose SQL like:
SELECT * FROM user where cash < 1000 order by RAND() LIMIT 5.
The result is good: totally random, unsorted, and each row different from the others, exactly what I want.
But I learned from Google that the efficiency is bad when the table gets large, because MySQL creates a temporary table with all the result rows and assigns each one of them a random sorting index. The results are then sorted and returned.
Then I went on searching and found a solution like:
SELECT * FROM `user` AS t1
JOIN (SELECT ROUND(RAND() * ((SELECT MAX(id) FROM `user`) - (SELECT MIN(id) FROM `user`)) + (SELECT MIN(id) FROM `user`)) AS id) AS t2
WHERE t1.id >= t2.id AND cash < 1000
ORDER BY t1.id LIMIT 5;
This method uses a JOIN and MAX(id), and its efficiency is better than the first one according to my testing. However, there is a problem: since I also need the condition "cash < 1000", if the RAND() value is so big that no row after it has cash < 1000, then no result will be returned.
Does anyone have a good idea of how to compose SQL that has the same effect as the first one but with better efficiency?
Or should I just do a simple query in MySQL and let PHP randomly pick 5 different rows from the query result?
Your help is appreciated.
To make the first query faster, just SELECT the id: that will make the temporary table rather small (it will contain only IDs and not all the fields of each row), and maybe it will fit in memory (temp tables with text/blob columns are always created on disk, for example). Then, when you get a result, run another query: SELECT * FROM xy WHERE id IN (a,b,c,d,...). As you mentioned, this approach is not very efficient, but as a quick fix this modification will make it several times faster.
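A sketch of that two-step approach, reusing the question's user/cash example (the ids in the second statement are placeholders for whatever the first statement returns):
-- Step 1: randomize only the ids, so the temporary table stays small.
SELECT id FROM user WHERE cash < 1000 ORDER BY RAND() LIMIT 5;

-- Step 2: fetch the full rows for the ids returned by step 1 (placeholder values shown).
SELECT * FROM user WHERE id IN (17, 42, 103, 256, 911);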
One of the best approaches seems to be getting the total number of rows, choosing random numbers, and for each of them running a new query: SELECT * FROM xy WHERE abc LIMIT $random,1. It should be quite efficient for 3-5 random rows, but not good if you want 100 random rows each time :)
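A sketch of that per-row approach, again with the question's user/cash example; the random offsets (1234 is a placeholder) would be picked in application code between 0 and the count minus one:
-- Step 1: how many candidate rows are there?
SELECT COUNT(*) FROM user WHERE cash < 1000;

-- Step 2: repeat once per random offset chosen by the application.
SELECT * FROM user WHERE cash < 1000 LIMIT 1234, 1;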
Also consider caching your results. Often you don't need different random rows to be displayed on each page load; generate your random rows only once per minute. If you generate the data via cron, for example, you can also live with a query that takes several seconds, as users will see the old data while the new data is being generated.
Here are some of my bookmarks for this problem for reference:
http://jan.kneschke.de/projects/mysql/order-by-rand/
http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/