Cannot remove duplicate values from a mysql table - mysql

I have a table ship_details that has no constraints. The data comes from an external data source, and the original designer of the table assumed the incoming data would not contain duplicates. Now I have to remove the duplicate entries. The table currently has 994,184 rows.
The table definition is:
CREATE TABLE `ship_details` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`order_number` varchar(150) DEFAULT NULL,
`delivery_id` varchar(150) DEFAULT NULL,
`transaction_type` varchar(150) DEFAULT NULL,
`pick_date` varchar(150) DEFAULT NULL,
`pn_note_number` varchar(150) DEFAULT NULL,
`item_id` varchar(150) DEFAULT NULL,
`item_code` varchar(150) DEFAULT NULL,
`picked_quantity` varchar(150) DEFAULT NULL,
`lot_number` varchar(150) DEFAULT NULL,
`lot_expiry` varchar(150) DEFAULT NULL,
`name` varchar(150) DEFAULT NULL,
`delivered_date` varchar(150) DEFAULT NULL,
`extra_attrib1` varchar(150) DEFAULT NULL,
`extra_attrib2` varchar(150) DEFAULT NULL,
`extra_attrib3` varchar(150) DEFAULT NULL,
`extra_attrib4` varchar(150) DEFAULT NULL,
`extra_attrib5` varchar(150) DEFAULT NULL,
`extra_attrib6` varchar(150) DEFAULT NULL,
`extra_attrib7` varchar(150) DEFAULT NULL,
`extra_attrib8` varchar(150) DEFAULT NULL,
`extra_attrib9` varchar(150) DEFAULT NULL,
`extra_attrib10` varchar(150) DEFAULT NULL,
`last_updated` varchar(100) DEFAULT NULL,
`outbound_id` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=994222 DEFAULT CHARSET=latin1;
I tried to delete the duplicate entries using the following script:
delete s1
from ship_details s1
inner join ship_details s2
where s1.id < s2.id
and s2.order_number = s1.order_number
and s2.delivery_id = s1.delivery_id
and s2.item_code = s1.item_code
and s2.lot_number = s1.lot_number
and s2.picked_quantity = s1.picked_quantity;
but that failed with a lock wait timeout. Even if I restrict it to a particular order number, it still times out.
So I switched to copying the data into a new table with a unique constraint on order_number, delivery_id, item_code and picked_quantity.
I tried to export the data from the original table with the following query:
SELECT distinct order_number, delivery_id, transaction_type, pick_date, pn_note_number,
item_id, item_code, picked_quantity, lot_number, lot_expiry, name, delivered_date,
extra_attrib10,last_updated, outbound_id
FROM ship_details;
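But this query did not give me a unique result. It returns 154,948 rows, and rows that share the same key columns still appear more than once because they differ in other columns such as outbound_id. For example: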
INSERT INTO clean_ship_details (order_number,delivery_id,transaction_type,pick_date,pn_note_number,item_id,item_code,picked_quantity,lot_number,lot_expiry,name,delivered_date,extra_attrib10,last_updated,outbound_id) VALUES
('181020373','10068965','Shipped','2018-11-11T15:50:48.000+04:00','PN176348','516169','VCH128','73','C34142','2021-02-28T00:00:00.000+04:00','DVT-6410','2019-06-18T15:48:12.000+04:00','','2019-06-18T15:54:40.000+04:00','51616973_73_'),
('181020373','10068965','Shipped','2018-11-11T15:50:48.000+04:00','PN176348','516169','VCH128','73','C34142','2021-02-28T00:00:00.000+04:00','DVT-6410','2019-06-18T15:48:12.000+04:00','','2019-06-18T15:54:40.000+04:00','58719373_73_'),
('181020373','10068965','Shipped','2018-11-11T15:50:48.000+04:00','PN176348','516170','VCH120','12','K33471/A','2020-10-31T00:00:00.000+04:00','DVT-6410','2019-06-18T15:48:12.000+04:00','','2019-06-18T15:54:40.000+04:00','51617012_12_'),
('181020373','10068965','Shipped','2019-06-19T12:22:39.000+04:00','PN239867','587193','VCH128','2','E34284','2021-04-30T00:00:00.000+04:00','DVT-6410','2019-06-18T15:48:12.000+04:00','','2019-06-18T15:54:40.000+04:00','5161692_2_'),
('181020373','10068965','Shipped','2019-06-19T12:22:39.000+04:00','PN239867','587193','VCH128','2','E34284','2021-04-30T00:00:00.000+04:00','DVT-6410','2019-06-18T15:48:12.000+04:00','','2019-06-18T15:54:40.000+04:00','5871932_2_'),
('191002479','10091039','Shipped','2019-02-12T07:50:55.000+04:00','PN186154','544495','VTP048','170','205809','2020-07-31T00:00:00.000+04:00','DVT-6479','2019-07-11T07:30:38.000+04:00','','2019-07-11T09:31:22.000+04:00','544495170_170_'),
('191002479','10091039','Shipped','2019-02-12T07:50:55.000+04:00','PN186154','544495','VTP048','170','205809','2020-07-31T00:00:00.000+04:00','DVT-6479','2019-07-11T07:30:38.000+04:00','','2019-07-11T09:31:22.000+04:00','594447170_170_'),
('191002479','10091039','Shipped','2019-07-18T07:45:49.000+04:00','PN249274','594447','VTP048','11','208744','2021-01-31T00:00:00.000+04:00','DVT-6479','2019-07-11T07:30:38.000+04:00','','2019-07-11T09:31:22.000+04:00','54449511_11_'),
('191002479','10091039','Shipped','2019-07-18T07:45:49.000+04:00','PN249274','594447','VTP048','11','208744','2021-01-31T00:00:00.000+04:00','DVT-6479','2019-07-11T07:30:38.000+04:00','','2019-07-11T09:31:22.000+04:00','59444711_11_'),
('191006312','10188037','Shipped','2019-03-31T12:17:39.000+04:00','PN201490','560373','VTP048','26','207783','2020-12-31T00:00:00.000+04:00','DVT-6694','2019-10-08T07:08:45.000+04:00','','2019-10-08T07:11:44.000+04:00','56037326_26_');
I cannot insert this into the new table.
Update: I tried to insert using a script, without success. I get "lock wait timeout exceeded" even with a limit of just 1 record:
INSERT IGNORE INTO clean_ship_details (order_number,delivery_id,transaction_type,pick_date,pn_note_number,item_id,item_code,picked_quantity,lot_number,lot_expiry,name,delivered_date,last_updated,outbound_id) SELECT order_number,delivery_id,transaction_type,pick_date,pn_note_number,item_id,item_code,picked_quantity,lot_number,lot_expiry,name,delivered_date,last_updated,outbound_id FROM ship_details order by order_number,delivery_id,item_id limit 10;

Your second approach does dedupe if, as you say, the new table has a unique constraint on order_number, delivery_id, item_code and picked_quantity. You also don't need DISTINCT, because the unique key will detect the duplicates and INSERT IGNORE will skip the rows that violate it.
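A minimal sketch of that approach, assuming clean_ship_details does not exist yet. The key name uq_ship_line and the id range of 50,000 are illustrative choices, not something from the question. Copying in id ranges keeps each transaction short, which may help with the lock wait timeouts. A timeout on even a one-row insert usually suggests that another session (possibly the earlier DELETE, still running or rolling back) is holding locks, so it is worth checking for open transactions first.

```sql
-- Look for long-running transactions that may be holding locks on ship_details.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.innodb_trx;

-- Empty copy of the table, plus the unique key described in the question.
-- uq_ship_line is a hypothetical name.
CREATE TABLE clean_ship_details LIKE ship_details;

ALTER TABLE clean_ship_details
  ADD UNIQUE KEY uq_ship_line (order_number, delivery_id, item_code, picked_quantity);

-- Copy one id range at a time; repeat with the next range until the
-- highest id in ship_details has been covered.
-- Rows that collide with the unique key are skipped by INSERT IGNORE.
INSERT IGNORE INTO clean_ship_details
  (order_number, delivery_id, transaction_type, pick_date, pn_note_number,
   item_id, item_code, picked_quantity, lot_number, lot_expiry, name,
   delivered_date, last_updated, outbound_id)
SELECT order_number, delivery_id, transaction_type, pick_date, pn_note_number,
       item_id, item_code, picked_quantity, lot_number, lot_expiry, name,
       delivered_date, last_updated, outbound_id
FROM ship_details
WHERE id BETWEEN 1 AND 50000
ORDER BY id;
```

Two caveats. A unique key treats NULL values as distinct, so rows with NULL in any of the four key columns are not deduplicated. INSERT IGNORE also downgrades other errors (for example data truncation) to warnings, so SHOW WARNINGS is worth a look after each batch. Note too that this key does not include lot_number, which the original DELETE compared, so decide which definition of "duplicate" you actually want.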

Related

Long running Mysql Query on Indexes and sort by clause

I have a very long-running MySQL query. The query simply joins two very large tables:
bizevents - Nearly 34 Million rows
bizevents_actions - Nearly 17 million rows
Here is the query:
select
bizevent0_.id as id1_37_,
bizevent0_.json as json2_37_,
bizevent0_.account_id as account_3_37_,
bizevent0_.createdBy as createdB4_37_,
bizevent0_.createdOn as createdO5_37_,
bizevent0_.description as descript6_37_,
bizevent0_.iconCss as iconCss7_37_,
bizevent0_.modifiedBy as modified8_37_,
bizevent0_.modifiedOn as modified9_37_,
bizevent0_.name as name10_37_,
bizevent0_.version as version11_37_,
bizevent0_.fired as fired12_37_,
bizevent0_.preCreateFired as preCrea13_37_,
bizevent0_.entityRefClazz as entityR14_37_,
bizevent0_.entityRefIdAsStr as entityR15_37_,
bizevent0_.entityRefIdType as entityR16_37_,
bizevent0_.entityRefName as entityR17_37_,
bizevent0_.entityRefType as entityR18_37_,
bizevent0_.entityRefVersion as entityR19_37_
from
BizEvent bizevent0_
left outer join BizEvent_actions actions1_ on
bizevent0_.id = actions1_.BizEvent_id
where
bizevent0_.createdOn >= '1969-12-31 19:00:01.0'
and (actions1_.action <> 'SoftLock'
and actions1_.targetRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
and actions1_.targetRefIdAsStr = '1'
or actions1_.action <> 'SoftLock'
and actions1_.objectRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
and actions1_.objectRefIdAsStr = '1')
order by
bizevent0_.createdOn;
Below are the table definitions. As you can see, I have defined indexes on these two tables for all the search columns plus the sort column, but my queries still run for a very long time. I'd appreciate any further ideas, including on indexing.
-- bizevent definition
CREATE TABLE `bizevent` (
`id` bigint(20) NOT NULL,
`json` longtext,
`account_id` int(11) DEFAULT NULL,
`createdBy` varchar(50) NOT NULL,
`createdon` datetime(3) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`iconCss` varchar(50) DEFAULT NULL,
`modifiedBy` varchar(50) NOT NULL,
`modifiedon` datetime(3) DEFAULT NULL,
`name` varchar(255) NOT NULL,
`version` int(11) NOT NULL,
`fired` bit(1) NOT NULL,
`preCreateFired` bit(1) NOT NULL,
`entityRefClazz` varchar(255) DEFAULT NULL,
`entityRefIdAsStr` varchar(255) DEFAULT NULL,
`entityRefIdType` varchar(25) DEFAULT NULL,
`entityRefName` varchar(255) DEFAULT NULL,
`entityRefType` varchar(50) DEFAULT NULL,
`entityRefVersion` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDXk9kxuuprilygwfwddr67xt1pw` (`createdon`),
KEY `IDXsf3ufmeg5t9ok7qkypppuey7y` (`entityRefIdAsStr`),
KEY `IDX5bxv4g72wxmjqshb770lvjcto` (`entityRefClazz`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
-- bizevent_actions definition
CREATE TABLE `bizevent_actions` (
`BizEvent_id` bigint(20) NOT NULL,
`action` varchar(255) DEFAULT NULL,
`objectBizType` varchar(255) DEFAULT NULL,
`objectName` varchar(255) DEFAULT NULL,
`objectRefClazz` varchar(255) DEFAULT NULL,
`objectRefIdAsStr` varchar(255) DEFAULT NULL,
`objectRefIdType` int(11) DEFAULT NULL,
`objectRefVersion` int(11) DEFAULT NULL,
`targetBizType` varchar(255) DEFAULT NULL,
`targetName` varchar(255) DEFAULT NULL,
`targetRefClazz` varchar(255) DEFAULT NULL,
`targetRefIdAsStr` varchar(255) DEFAULT NULL,
`targetRefIdType` int(11) DEFAULT NULL,
`targetRefVersion` int(11) DEFAULT NULL,
`embedJson` longtext,
`actions_ORDER` int(11) NOT NULL,
PRIMARY KEY (`BizEvent_id`,`actions_ORDER`),
KEY `IDXa21hhagjogn3lar1bn5obl48gll` (`action`),
KEY `IDX7agsatk8u8qvtj37vhotja0ce77` (`targetRefClazz`),
KEY `IDXa7tktl678kqu3tk8mmkt1mo8lbo` (`targetRefIdAsStr`),
KEY `IDXa22eevu7m820jeb2uekkt42pqeu` (`objectRefClazz`),
KEY `IDXa33ba772tpkl9ig8ptkfhk18ig6` (`objectRefIdAsStr`),
CONSTRAINT `FKr9qjs61id11n48tdn1cdp3wot` FOREIGN KEY (`BizEvent_id`) REFERENCES `bizevent` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
By the way, we are using MySQL 5.7.33 on Amazon RDS, with 16 GB RAM and 4 vCPUs.
I also ran EXPLAIN EXTENDED on the query. I'd appreciate any help.
Initially the search columns of bizevent_actions had no indexes. I have since defined them and retried the query, but it made no difference.
One technique that worked for me in a similar situation was to abandon the JOIN completely and switch to queries by primary key. In more detail: work out which table in the join returns fewer rows on average when queried on its own with its filter, fetch the primary keys from that table, and then query the other table using WHERE pk IN (...).
In your case one example would be:
SELECT
bizevent0_.id as id1_37_,
bizevent0_.json as json2_37_,
bizevent0_.account_id as account_3_37_,
...
FROM BizEvent bizevent0_
WHERE
bizevent0_.createdOn >= '1969-12-31 19:00:01.0'
AND bizevent0_.id IN (
SELECT BizEvent_id
FROM BizEvent_actions actions1_
WHERE
actions1_.action <> 'SoftLock'
and actions1_.targetRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
and actions1_.targetRefIdAsStr = '1'
or actions1_.action <> 'SoftLock'
and actions1_.objectRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
and actions1_.objectRefIdAsStr = '1')
ORDER BY
bizevent0_.createdOn;
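This assumes that you don't actually want to select all 33+ million rows from BizEvent. Note that the conditions on actions1_ in your WHERE clause discard the NULL-extended rows, so your LEFT OUTER JOIN already behaves like an inner join. Unlike the join, the IN form returns each BizEvent row once, even if several of its actions match.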
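Building on the "fetch the keys first" idea above: the OR between the targetRef* and objectRef* conditions makes it hard for MySQL to use one index for the whole filter, and the existing indexes are all single-column. One option is to split the OR into two lookups and combine the keys with UNION, each backed by a composite index. This is only a sketch: the index names are hypothetical, the select list is shortened, and the table names follow the CREATE TABLE statements (adjust the case if your server is case-sensitive). Compare EXPLAIN before and after.

```sql
-- Hypothetical index names. The two equality columns lead, action comes last.
-- InnoDB secondary indexes also carry the primary key (BizEvent_id, actions_ORDER),
-- so each branch of the UNION can be answered from the index alone.
ALTER TABLE bizevent_actions
  ADD INDEX idx_actions_target (targetRefClazz, targetRefIdAsStr, action),
  ADD INDEX idx_actions_object (objectRefClazz, objectRefIdAsStr, action);

SELECT e.id, e.name, e.createdOn   -- add the remaining columns as needed
FROM bizevent e
JOIN (
    SELECT BizEvent_id
    FROM bizevent_actions
    WHERE targetRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
      AND targetRefIdAsStr = '1'
      AND action <> 'SoftLock'
    UNION   -- UNION (not UNION ALL) removes duplicate event ids
    SELECT BizEvent_id
    FROM bizevent_actions
    WHERE objectRefClazz = 'com.biznuvo.core.orm.domain.org.EmployeeGroup'
      AND objectRefIdAsStr = '1'
      AND action <> 'SoftLock'
) m ON m.BizEvent_id = e.id
WHERE e.createdOn >= '1969-12-31 19:00:01.0'
ORDER BY e.createdOn;
```

As with the IN version, this returns one row per event, whereas the original join returns one row per matching action.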

Update column value for each row by the values of other columns, avoiding deadlocks

Experts, need help here.
I have a single table tbl_stock_data, for which I am trying to update a column "INBOUND_STOCK_LVL" using the values from columns "INBOUND_STOCK_LVL" and "OUTBOUND_STOCK_LVL"
such that:
INBOUND_STOCK_LVL = INBOUND_STOCK_LVL - (INBOUND_STOCK_LVL - OUTBOUND_STOCK_LVL)
I've written a query as follows:
UPDATE tbl_stock_data t1,
( SELECT STORE_EXT_ID, INBOUND_STOCK_LVL, OUTBOUND_STOCK_LVL,
ORG_ID, INTEGRATION_PARTNER
FROM tbl_stock_data
WHERE STORE_EXT_ID = 'STOCK1'
AND INTEGRATION_PARTNER = 'fortnox'
AND ORG_ID = 'asdsg-23ewfsd-2342'
) t2
SET t1.INBOUND_STOCK_LVL = t2.INBOUND_STOCK_LVL - (t2.INBOUND_STOCK_LVL - t2.OUTBOUND_STOCK_LVL)
WHERE t1.STORE_EXT_ID=t2.STORE_EXT_ID
AND t1.INTEGRATION_PARTNER=t2.INTEGRATION_PARTNER
AND t1.ORG_ID=t2.ORG_ID
But this query sets the column to 0.
Before this, I was using the simple query below, which worked fine, but it started causing deadlocks on the larger data set:
UPDATE tbl_stock_data
SET INBOUND_STOCK_LVL = INBOUND_STOCK_LVL - (INBOUND_STOCK_LVL - OUTBOUND_STOCK_LVL)
WHERE STORE_EXT_ID = 'STOCK1'
AND INTEGRATION_PARTNER = 'fortnox'
AND ORG_ID = 'asdsg-23ewfsd-2342'
Note:
single record can be obtained by PRODUCT_ID, STORE_EXT_ID, ORG_ID and INTEGRATION_PARTNER
CREATE TABLE `tbl_stock_data` (
`ORG_ID` varchar(100) NOT NULL,
`PRODUCT_ID` varchar(100) NOT NULL,
`PRODUCT_EXT_ID` varchar(100) DEFAULT NULL,
`WAREHOUSE_ID` varchar(100) NOT NULL,
`WAREHOUSE_EXT_ID` varchar(100) DEFAULT NULL,
`STORE_ID` varchar(100) DEFAULT NULL,
`STORE_EXT_ID` varchar(100) DEFAULT NULL,
`INBOUND_STOCK_LVL` varchar(100) DEFAULT NULL,
`OUTBOUND_STOCK_LVL` varchar(100) DEFAULT NULL,
`OUTBOUND_STOCK_DELTA` varchar(100) DEFAULT NULL,
`INTEGRATION_PARTNER` varchar(100) DEFAULT NULL,
`DATE_CREATED` varchar(100) NOT NULL,
`LAST_MODIFIED` varchar(100) DEFAULT NULL,
PRIMARY KEY (`ORG_ID`,`PRODUCT_ID`,`WAREHOUSE_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
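Two observations on the queries above, offered as a sketch rather than a confirmed fix. First, the expression INBOUND_STOCK_LVL - (INBOUND_STOCK_LVL - OUTBOUND_STOCK_LVL) simplifies algebraically to OUTBOUND_STOCK_LVL. Second, the two-table UPDATE joins on STORE_EXT_ID, INTEGRATION_PARTNER and ORG_ID, which do not identify a single row, so every t1 row matches every t2 row of the same store and MySQL applies just one of those matches, with no guarantee which. The simple single-table form avoids that. The index below (hypothetical name) is an assumption aimed at the deadlocks: without it, the only usable index is the primary key prefix ORG_ID, so InnoDB would likely scan and lock every row of the organisation.

```sql
-- Hypothetical index so the UPDATE touches (and locks) only one store's rows.
ALTER TABLE tbl_stock_data
  ADD INDEX idx_stock_store (ORG_ID, INTEGRATION_PARTNER, STORE_EXT_ID);

-- Same effect as INBOUND - (INBOUND - OUTBOUND) for non-NULL numeric values.
UPDATE tbl_stock_data
SET INBOUND_STOCK_LVL = OUTBOUND_STOCK_LVL
WHERE ORG_ID = 'asdsg-23ewfsd-2342'
  AND INTEGRATION_PARTNER = 'fortnox'
  AND STORE_EXT_ID = 'STOCK1';
```

The two forms differ in edge cases: both columns are varchar, so the arithmetic version converts the strings to numbers (non-numeric text becomes 0) and yields NULL when INBOUND_STOCK_LVL is NULL, while the plain copy stores the OUTBOUND_STOCK_LVL string unchanged.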

Query large table with 50 million rows

I'm trying to query a large table (senddb.order_histories) that has close to 50M rows. These are the MySQL queries I am using:
FIRST APPROACH- inner join:
select a.id,
a.order_number,
a.sku_id,
a.fulfillment_status,
a.modified_by,
a.created_at,
a.updated_at
from senddb.order_line_items a
inner join (
select order_line_item_id,
order_number,
order_status,
order_status_description,
action,
modified_by,
created_at,
max(updated_at) as updated_at
from senddb.order_histories
where order_status in ('x','y','z')
and fulfillment_location = 'abcd'
group by order_line_item_id) as b
on a.id = b.order_line_item_id
and a.fulfillment_status = '2';
SECOND APPROACH- nested select:
select a.id,
a.order_number,
a.sku_id,
a.fulfillment_status,
a.modified_by,
a.created_at,
a.updated_at
from senddb.order_line_items a
where a.fulfillment_status = '2'
and a.id in (
select b.order_line_item_id from(
select order_line_item_id,
order_number,
order_status,
order_status_description,
action,
modified_by,
created_at,
max(updated_at) as updated_at
from senddb.order_histories
where
order_status in ('x','y','z')
and fulfillment_location = 'abcd'
group by order_line_item_id) as b);
I believe a nested select is a bad approach on large data, but I added it here anyway because it worked on my sample set. Both queries eventually time out after 600 seconds with the message: Error Code: 2013. Lost connection to MySQL server during query.
I would like to know if there is any way to alter the query to make it run faster. I have already tried reducing the columns in the inner select / inner join, but that should not really be the issue, IMO. I also found a suggestion to "create a clustered index", but I wasn't really able to follow it. Any help is appreciated.
TABLE order_histories :
order_histories CREATE TABLE `order_histories` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`order_status_description` varchar(255) DEFAULT NULL,
`datetime_stamp` datetime DEFAULT NULL,
`action` varchar(32) DEFAULT NULL,
`fulfillment_location` int(8) DEFAULT NULL,
`order_status` int(8) DEFAULT NULL,
`user_id` int(8) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`order_line_item_id` int(11) DEFAULT NULL,
`pooled` tinyint(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `order_histories_ecash_idx` (`order_number`),
KEY `order_line_item_id` (`order_line_item_id`)
) ENGINE=InnoDB AUTO_INCREMENT=454738178 DEFAULT CHARSET=latin1
TABLE order_line_items :
order_line_items CREATE TABLE `order_line_items` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`order_number` varchar(24) DEFAULT NULL,
`sku_id` int(8) DEFAULT NULL,
`original_price` float DEFAULT NULL,
`dept_description` varchar(100) DEFAULT NULL,
`description` varchar(100) DEFAULT NULL,
`quantity_ordered` int(8) DEFAULT NULL,
`gift_indicator` char(1) DEFAULT NULL,
`gift_wrap_flag` char(1) DEFAULT NULL,
`shipping_record_flag` char(1) DEFAULT NULL,
`gift_comments` varchar(100) DEFAULT NULL,
`item_status` char(1) DEFAULT NULL,
`tax_amount` float DEFAULT NULL,
`tax_rate` float DEFAULT NULL,
`upc` varchar(20) DEFAULT NULL,
`final_price` float DEFAULT NULL,
`line_number` int(8) DEFAULT NULL,
`master_line_number` int(8) DEFAULT NULL,
`gift_wrap_flag_type` char(1) DEFAULT NULL,
`color_code` varchar(4) DEFAULT NULL,
`size_id` varchar(6) DEFAULT NULL,
`width_id` varchar(6) DEFAULT NULL,
`brand` varchar(15) DEFAULT NULL,
`vpn` varchar(30) DEFAULT NULL,
`dept_number` int(8) DEFAULT NULL,
`class_number` int(8) DEFAULT NULL,
`non_merch_item` char(1) DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`modified_by` varchar(32) DEFAULT NULL,
`chain_id` int(11) DEFAULT NULL,
`fulfillment_location` int(11) DEFAULT NULL,
`fulfillment_date` datetime DEFAULT NULL,
`fulfillment_status` int(11) DEFAULT NULL,
`fulfillment_sales_associate` int(11) DEFAULT NULL,
`gift_wrap_line_number` int(11) DEFAULT NULL,
`shipping_type` int(11) DEFAULT NULL,
`order_track_info_id` int(11) DEFAULT NULL,
`store_tlog_updated` varchar(1) DEFAULT NULL,
`shipping_tlx_code` int(11) DEFAULT NULL,
`store_closed` tinyint(1) DEFAULT NULL,
`flags` int(11) DEFAULT NULL,
`deal_based_index` int(11) DEFAULT NULL,
`tlog_calc_ret_price` float DEFAULT NULL,
`tlog_amount` float DEFAULT NULL,
`tlog_retail_price` float DEFAULT NULL,
`tlog_ext_amount` float DEFAULT NULL,
`tlog_flag_1` int(11) DEFAULT NULL,
`tlog_flag_2` int(11) DEFAULT NULL,
`tlog_flag_3` int(11) DEFAULT NULL,
`time_remaining` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `order_line_items_ecash_idx` (`order_number`),
KEY `order_line_item_fulfillment_location_idx` (`fulfillment_location`),
KEY `order_line_item_fulfillment_status_idx` (`fulfillment_status`),
KEY `upc_idx` (`upc`),
KEY `sku_id_idx` (`sku_id`),
KEY `order_line_items_idx001` (`order_number`,`id`,`fulfillment_status`),
KEY `order_track_info_id` (`order_track_info_id`),
KEY `shipping_type_idx` (`shipping_type`,`non_merch_item`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=11367052 DEFAULT CHARSET=latin1
This query can be simplified:
select a.id,
a.order_number,
a.sku_id,
a.fulfillment_status,
a.modified_by,
a.created_at,
a.updated_at
from senddb.order_line_items a
inner join senddb.order_histories b on a.id = b.order_line_item_id
where b.order_status in ('x','y','z')
and b.fulfillment_location = 'abcd'
and a.fulfillment_status = '2';
Since you're only selecting values from table a, you don't need to select specific values from table b and can instead just apply your conditions. Beyond this, you need to ensure that b.order_line_item_id has an index on it; see the MySQL documentation on CREATE INDEX for details. I'm not an expert in MySQL, but something similar to this should work if senddb.order_histories.order_line_item_id isn't already indexed:
CREATE INDEX IX_order_histories_order_line_item_id ON order_histories (order_line_item_id);
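A variation worth trying, sketched under some assumptions. Per the CREATE TABLE in the question, order_line_item_id already has an index, so the columns that lack one are the filter columns fulfillment_location and order_status. A plain join also returns one row per matching history row, whereas the original GROUP BY returned one row per line item. The sketch below adds a composite index (hypothetical name) and joins against the distinct list of matching ids. The literals 'abcd', 'x', 'y', 'z' are the question's placeholders; both columns are int, so the real values should be integers.

```sql
-- Hypothetical index name. The filter columns lead; order_line_item_id is
-- included so the derived table can be built from the index alone.
CREATE INDEX ix_order_histories_loc_status_item
    ON order_histories (fulfillment_location, order_status, order_line_item_id);

SELECT a.id,
       a.order_number,
       a.sku_id,
       a.fulfillment_status,
       a.modified_by,
       a.created_at,
       a.updated_at
FROM (
    -- One row per line item that has at least one matching history row.
    SELECT DISTINCT order_line_item_id
    FROM senddb.order_histories
    WHERE fulfillment_location = 'abcd'        -- placeholder from the question
      AND order_status IN ('x', 'y', 'z')      -- placeholders from the question
) b
INNER JOIN senddb.order_line_items a
        ON a.id = b.order_line_item_id
WHERE a.fulfillment_status = '2';
```

Building an index on a table of this size takes time and disk space, so it is worth checking the plan with EXPLAIN on a copy first.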
You need to read up on the optimization section of the MySQL docs. It contains a lot of information on how you can optimize your queries and data sets. The main idea here is to add indexes to the fields that are used as criteria in the WHERE clause of the SQL statements.
Basically, both of your alternatives are using a sub-SELECT, not an INNER JOIN.
The syntax of a true JOIN is one of the following:
SELECT ...
FROM X INNER JOIN Y USING (field_list)
... or ...
SELECT ...
FROM X INNER JOIN Y ON (x.field1 = y.field2) ...
But in both cases the objects being joined are tables or views.
I'm going to presume ... admittedly, without checking ... that Nick Larsen's answer #1 adequately re-expresses your original query using JOINs.
(Notice how, in his answer, the shorthand identifiers A and B are introduced as referring to each of the two table-names mentioned in his query.)
Firstly, you need to decide whether a result set of 50 million rows is really what you are asking for. Big data tables are not there so that you can select all their rows. They are there so that you can ask them questions using SQL queries. SQL is a query language, not a data loading language.
What's your purpose? If you want to copy the data, you can do that by loading it in chunks, for example 1000 rows per query in a loop. If you are loading the data for processing, you can do it the same way.
If you want to derive statistical information, you can use an outer join with aggregate functions and return a small number of rows. But you shouldn't do that either. What you should do is decide what you want from the table and, preferably, run aggregate functions to store the useful information in a different table (mostly INSERT ... SELECT queries). You should never need to join a table of 50 million records in the first place.
Telling you how to do the wrong thing with indexes wouldn't be the right answer here.

Join data from one table based on 2 values

I have the following SQL statement:
SELECT user_accounts.uacc_id,
user_accounts.uacc_username,
ride_rides.ride_type,
ride_rides.ride_num_seats,
ride_rides.ride_price_seat,
ride_rides.ride_accept_nm,
ride_rides.ride_split_cost,
ride_rides.ride_from,
ride_rides.ride_from_lat,
ride_rides.ride_from_lng,
ride_rides.ride_to,
ride_rides.ride_to_lat,
ride_rides.ride_to_lng,
user_profiles.upro_image_name,
ride_times.ridetms_id,
ride_times.ridetms_return,
ride_times.ridetms_depart_date,
ride_times.ridetms_depart_time,
ride_times.ridetms_return_date,
ride_times.ridetms_return_time,
depart_times.dpttme_text
FROM ride_times
LEFT JOIN ride_rides
ON ride_rides.ride_id = ride_times.ridetms_ride_fk
LEFT JOIN user_accounts
ON ride_rides.ride_uacc_fk = user_accounts.uacc_id
LEFT JOIN user_profiles
ON user_profiles.upro_uacc_fk = user_accounts.uacc_id
LEFT JOIN depart_times
ON depart_times.dpttme_id = ride_times.ridetms_depart_time
WHERE ride_times.ridetms_id = ?
Right now, I have the query pulling a text representation of the data from ride_times.ridetms_depart_time in the last join, and it works fine. However, I need to do the same with another column in the ride_times table. I think I need to use an alias, but after reading several sources on aliases, I can't wrap my head around how to change the call.
Also, 100 brownie points for any feedback about any glaring mistakes in this call. It is my first attempt at using JOINs.
take care,
lee
Thanks for the responses I've received so far. Here is the structure of the tables involved:
CREATE TABLE `depart_times` (
`dpttme_id` int(11) NOT NULL,
`dpttme_text` varchar(50) DEFAULT NULL,
PRIMARY KEY (`dpttme_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `ride_rides` (
`ride_id` int(11) NOT NULL AUTO_INCREMENT,
`ride_uacc_fk` int(11) NOT NULL,
`ride_date_added` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`ride_type` tinyint(4) DEFAULT NULL,
`ride_from` varchar(200) DEFAULT NULL,
`ride_from_lat` float(10,6) DEFAULT NULL,
`ride_from_lng` float(10,6) DEFAULT NULL,
`ride_to` varchar(200) DEFAULT NULL,
`ride_to_lat` float(10,6) DEFAULT NULL,
`ride_to_lng` float(10,6) DEFAULT NULL,
`ride_num_seats` tinyint(4) DEFAULT NULL,
`ride_price_seat` float DEFAULT NULL,
`ride_accept_nm` tinyint(1) DEFAULT '0' COMMENT 'accept non-monetary items',
`ride_split_cost` tinyint(1) DEFAULT '0',
`ride_notes` longtext,
PRIMARY KEY (`ride_id`)
) ENGINE=MyISAM AUTO_INCREMENT=34 DEFAULT CHARSET=latin1;
CREATE TABLE `ride_times` (
`ridetms_id` int(11) NOT NULL AUTO_INCREMENT,
`ridetms_ride_fk` int(11) DEFAULT NULL,
`ridetms_date_added` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`ridetms_depart_date` date NOT NULL DEFAULT '0000-00-00',
`ridetms_depart_time` tinyint(4) DEFAULT '0',
`ridetms_return` tinyint(1) DEFAULT '0',
`ridetms_return_date` date NOT NULL DEFAULT '0000-00-00',
`ridetms_return_time` tinyint(4) DEFAULT '0',
PRIMARY KEY (`ridetms_id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=latin1;
CREATE TABLE `user_accounts` (
`uacc_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`uacc_group_fk` smallint(5) unsigned NOT NULL,
`uacc_email` varchar(100) NOT NULL,
`uacc_username` varchar(15) NOT NULL,
`uacc_password` varchar(60) NOT NULL,
`uacc_ip_address` varchar(40) NOT NULL,
`uacc_salt` varchar(40) NOT NULL,
`uacc_activation_token` varchar(40) NOT NULL,
`uacc_forgotten_password_token` varchar(40) NOT NULL,
`uacc_forgotten_password_expire` datetime NOT NULL,
`uacc_update_email_token` varchar(40) NOT NULL,
`uacc_update_email` varchar(100) NOT NULL,
`uacc_active` tinyint(1) unsigned NOT NULL,
`uacc_suspend` tinyint(1) unsigned NOT NULL,
`uacc_fail_login_attempts` smallint(5) NOT NULL,
`uacc_fail_login_ip_address` varchar(40) NOT NULL,
`uacc_date_fail_login_ban` datetime NOT NULL COMMENT 'Time user is banned until due to repeated failed logins',
`uacc_date_last_login` datetime NOT NULL,
`uacc_date_added` datetime NOT NULL,
PRIMARY KEY (`uacc_id`),
UNIQUE KEY `uacc_id` (`uacc_id`),
KEY `uacc_group_fk` (`uacc_group_fk`),
KEY `uacc_email` (`uacc_email`),
KEY `uacc_username` (`uacc_username`),
KEY `uacc_fail_login_ip_address` (`uacc_fail_login_ip_address`)
) ENGINE=InnoDB AUTO_INCREMENT=56 DEFAULT CHARSET=latin1;
CREATE TABLE `user_profiles` (
`upro_id` int(11) NOT NULL AUTO_INCREMENT,
`upro_uacc_fk` int(11) NOT NULL,
`upro_name` varchar(100) DEFAULT NULL,
`upro_blackberry_id` varchar(200) DEFAULT NULL,
`upro_yahoo_id` varchar(200) DEFAULT NULL,
`upro_skype_id` varchar(200) DEFAULT NULL,
`upro_gmail_id` varchar(200) DEFAULT NULL,
`upro_image_name` varchar(200) DEFAULT 'default.jpg',
PRIMARY KEY (`upro_id`),
UNIQUE KEY `upro_id` (`upro_id`),
KEY `upro_uacc_fk` (`upro_uacc_fk`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=25 DEFAULT CHARSET=latin1;
So, to clarify:
Right now I am pulling some text from depart_times based on ride_times.ridetms_depart_time. I also need to pull some text from depart_times based on ride_times.ridetms_return_time.
I think that if you have joined all the tables already, you can just put the condition in the WHERE clause, like:
WHERE
ride_times.ridetms_id = ?
AND
depart_times.column_name1 = ride_times.return_time
Does this answer your question?
Well, I must point out this line:
LEFT JOIN depart_times ON depart_times.dpttme_id = ride_times.ridetms_depart_time
It just feels like something is wrong with the schema. But that is only a guess on my side, going by the names.
Ok,
After LOTS more searching, I found this post:
MySQL alias for SELECT * columns
I have revised my statement to the following and it is working properly:
SELECT user_accounts.uacc_id,
user_accounts.uacc_username,
ride_rides.*,
user_profiles.upro_image_name,
ride_times.*,
dpt1.dpttme_text AS dep_text,
dpt2.dpttme_text AS ret_text
FROM ride_times
LEFT JOIN ride_rides
ON ride_rides.ride_id = ride_times.ridetms_ride_fk
LEFT JOIN user_accounts
ON ride_rides.ride_uacc_fk = user_accounts.uacc_id
LEFT JOIN user_profiles
ON user_profiles.upro_uacc_fk = user_accounts.uacc_id
LEFT JOIN depart_times AS dpt1
ON dpt1.dpttme_id = ride_times.ridetms_depart_time
LEFT JOIN depart_times AS dpt2
ON dpt2.dpttme_id = ride_times.ridetms_return_time
WHERE ride_times.ridetms_id = ?
Thanks very much to everyone who attempted to help me.
take care,
lee

Slow update of one table when comparing multiple fields across two tables

The following query is timing out after 600 seconds.
update placed p
,Results r
set p.position = r.position
where p.competitor = r.competitor
AND p.date = r.date
AND REPLACE(p.time,":","") = r.time;
The structure is as follows:
CREATE TABLE `placed` (
`idplaced` varchar(50) DEFAULT NULL,
`date` decimal(8,0) DEFAULT NULL,
`time` varchar(45) DEFAULT NULL,
`field1` varchar(45) DEFAULT NULL,
`competitor` varchar(45) DEFAULT NULL,
`field2` int(2) DEFAULT NULL,
`field3` varchar(45) DEFAULT NULL,
`field4` varchar(45) DEFAULT NULL,
`field5` decimal(6,2) DEFAULT NULL,
`field6` decimal(10,2) DEFAULT NULL,
`field7` decimal(6,2) DEFAULT NULL,
`field8` char(1) DEFAULT NULL,
`field9` varchar(45) DEFAULT NULL,
`position` char(4) DEFAULT NULL,
`field10` decimal(6,2) DEFAULT NULL,
`field11` char(1) DEFAULT NULL,
`field12` char(1) DEFAULT NULL,
`field13` decimal(6,2) DEFAULT NULL,
`field14` decimal(6,2) DEFAULT NULL,
`field15` decimal(6,2) DEFAULT NULL,
`field16` decimal(6,2) DEFAULT NULL,
`field17` decimal(6,2) DEFAULT NULL,
`field18` char(1) DEFAULT NULL,
`field19` char(20) DEFAULT NULL,
`field20` char(1) DEFAULT NULL,
`field21` char(5) DEFAULT NULL,
`field22` char(5) DEFAULT NULL,
`field23` int(11) DEFAULT NULL,
PRIMARY KEY (`idplaced`),
UNIQUE KEY `date_time_competitor_field18_combo` (`date`,`time`,`competitor`,`field18`)
) ENGINE=InnoDB AUTO_INCREMENT=100688607 DEFAULT CHARSET=latin1;
CREATE TABLE `results` (
`idresults` int(11) NOT NULL AUTO_INCREMENT,
`date` char(8) DEFAULT NULL,
`time` char(4) DEFAULT NULL,
`field1` varchar(45) DEFAULT NULL,
`competitor` varchar(45) DEFAULT NULL,
`position` char(4) DEFAULT NULL,
`field2` varchar(45) DEFAULT NULL,
`field3` decimal(2,0) DEFAULT NULL,
PRIMARY KEY (`idresults`)
) ENGINE=InnoDB AUTO_INCREMENT=6644 DEFAULT CHARSET=latin1;
The PLACED table has 65,000 records, the RESULTS table has 9,000 records.
I am assuming the solution involves a JOIN statement of some description, and I have tried several suggestions from this site, but I am simply not finding the answer I am looking for. I would be grateful for any suggestions. I can put up example tables / CREATE TABLE code if required.
The index cannot be used efficiently to perform the join because of your REPLACE operation.
I'd suggest creating an index with the columns in the following slightly different order:
(date, competitor, time, position)
It may also help to add this index on both tables.
It would be even better if you could modify the data in the database so that the data in the time column was stored in the same format in both tables.
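If changing the stored time format is not an option, one way to get a similar effect is a stored generated column that holds the colon-free time and can be indexed. This is a sketch that assumes MySQL 5.7 or later (generated columns are not available before that). The names time_compact and idx_placed_match are hypothetical. The explicit CAST is there because placed.date is decimal(8,0) while results.date is char(8).

```sql
-- Hypothetical column: the time value with the colons stripped, kept in sync by MySQL.
ALTER TABLE placed
  ADD COLUMN time_compact varchar(45)
    GENERATED ALWAYS AS (REPLACE(`time`, ':', '')) STORED;

-- Hypothetical index covering the three columns used to match rows.
ALTER TABLE placed
  ADD INDEX idx_placed_match (`date`, competitor, time_compact);

-- The join now compares plain columns, so the index on placed can be used
-- for each of the ~9,000 rows in results.
UPDATE placed p
JOIN results r
  ON p.`date` = CAST(r.`date` AS DECIMAL(8,0))
 AND p.competitor = r.competitor
 AND p.time_compact = r.`time`
SET p.position = r.position;
```

Running EXPLAIN on the UPDATE should show whether idx_placed_match is actually chosen.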
First of all, please post your full table descriptions, using
show create table
Second, you'd be better off using join syntax:
update placed p
join Results r on r.competitor = p.competitor
set p.position = r.position
where p.date = r.date
AND REPLACE(p.time,":","") = r.time;
Hope this will help.