How to handle data set without unique ID - mysql

I'm working on a data import routine from one source into another and I've got one table that doesn't have its own unique identifier. Instead it uses a combination of four fields to determine the record to be modified. My source table structure is below:
feed_hcp_leasenote table:
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000),
tempid int PRIMARY, AUTONUMBER
The first four are the fields which, when evaluated altogether, make a record unique in the source database. I'm importing this data into two tables, one for the note and another for the other fields. Here is my structure for the new database:
lease_note table:
lnid int PRIMARY AUTONUMBER,
notetext longtext,
lid int (lease ID, links to lease table)
customfield_data table (holds other data):
cfdid int PRIMARY AUTONUMBER,
data_date datetime,
data_smtext varchar(1000),
linkid int (links the data to its source ID)
cfid int (links the data to its field type)
The problem I'm running into is that when I try to identify the records that exist in the source database without a match in the new database, my query seems to duplicate records to the point that it never finishes and locks up my server. I can successfully query based on BLDGID and LEASID and limit the query to the proper records, but when I try to JOIN the customfield_data table aliased to the NOTEDATE and REF1 fields, it starts to duplicate records exponentially. Here's my query:
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.cfid = 1 AND coid.data_smtext = feed_hcp_leasenote.BLDGID
JOIN customfield_data status ON leases.lid = status.linkid AND status.cfid = 27 AND status.data_smtext <> 'I'
WHERE tempid NOT IN (
SELECT tempid
FROM feed_hcp_leasenote
JOIN customfield_data mrileaseid ON feed_hcp_leasenote.LEASID = mrileaseid.data_smtext AND mrileaseid.cfid = 36
JOIN leases ON mrileaseid.linkid = leases.lid
JOIN suites ON leases.sid = suites.sid
JOIN floors ON suites.fid = floors.fid
JOIN customfield_data coid ON floors.bid = coid.linkid AND coid.data_smtext = feed_hcp_leasenote.BLDGID AND coid.cfid = 1
JOIN customfield_data notedate ON STR_TO_DATE(feed_hcp_leasenote.NOTEDATE, '%e-%b-%Y') = notedate.data_date AND notedate.cfid = 55
JOIN customfield_data ref1 ON feed_hcp_leasenote.REF1 = ref1.data_smtext AND ref1.cfid = 56
JOIN lease_notes ON leases.lid = lease_notes.lid AND notedate.linkid = lease_notes.lnid AND ref1.linkid = lease_notes.lnid )
At the moment, I've narrowed the problem down to the NOT IN subquery -- running just that part crashes the server. I imagine the problem is that because there can be multiple notes with the same BLDGID, LEASID, NOTEDATE, and REF1 (but not all 4), the query keeps selecting back on itself and effectively creating an infinite loop.
Short of modifying the source database to include a unique ID (which I can't do) does anyone see a solution to this? Thanks in advance!
(Edits based on feedback)
Sorry for the lack of information, I was worried about that. Basically I'm importing the data in feed_hcp_leasenote from a CSV file dumped from another database that I have no control over. I add a tempid field once the data is imported into my server with the idea of using it in the SELECT WHERE tempid NOT IN query, though I'm not married to that approach.
My goal is to split the data in feed_hcp_leasenote into two tables: lease_note, which holds the primary record (with a unique ID) and the note itself; and customfield_data, which holds other data related to the record.
The source data feed consists of about 65,000 records, of which I'm importing about 25,000 since the remainder are connected to records that have been deactivated.
(2nd Edit)
Visual Schema of relevant tables: http://www.tentenstudios.com/clients/relynx/schema.png
EXPLAIN query:
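+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+
| 1 | PRIMARY | status | ref | data_smtext,linkid,cfid | cfid | 4 | const | 928 | Using where |
| 1 | PRIMARY | mrileaseid | ref | data_smtext,linkid,cfid | linkid | 5 | rl_hpsi.status.linkid | 19 | Using where |
| 1 | PRIMARY | leases | eq_ref | PRIMARY,sid | PRIMARY | 4 | rl_hpsi.mrileaseid.linkid | 1 | Using where |
| 1 | PRIMARY | suites | eq_ref | PRIMARY,fid | PRIMARY | 4 | rl_hpsi.leases.sid | 1 | |
| 1 | PRIMARY | floors | eq_ref | PRIMARY,bid | PRIMARY | 4 | rl_hpsi.suites.fid | 1 | |
| 1 | PRIMARY | feed_hcp_leasenote | ref | BLDGID,LEASID | LEASID | 768 | rl_hpsi.mrileaseid.data_smtext | 19 | Using where |
| 1 | PRIMARY | coid | ref | data_smtext,linkid,cfid | data_smtext | 1002 | rl_hpsi.feed_hcp_leasenote.BLDGID | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | feed_hcp_leasenote | eq_ref | PRIMARY,BLDGID,LEASID | PRIMARY | 4 | func | 1 | |
| 2 | DEPENDENT SUBQUERY | mrileaseid | ref | data_smtext,linkid,cfid | data_smtext | 1002 | rl_hpsi.feed_hcp_leasenote.LEASID | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | leases | eq_ref | PRIMARY,sid | PRIMARY | 4 | rl_hpsi.mrileaseid.linkid | 1 | |
| 2 | DEPENDENT SUBQUERY | suites | eq_ref | PRIMARY,fid | PRIMARY | 4 | rl_hpsi.leases.sid | 1 | |
| 2 | DEPENDENT SUBQUERY | floors | eq_ref | PRIMARY,bid | PRIMARY | 4 | rl_hpsi.suites.fid | 1 | |
| 2 | DEPENDENT SUBQUERY | ref1 | ref | data_smtext,linkid,cfid | data_smtext | 1002 | rl_hpsi.feed_hcp_leasenote.REF1 | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | lease_notes | eq_ref | PRIMARY | PRIMARY | 4 | rl_hpsi.ref1.linkid | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | coid | ref | data_smtext,linkid,cfid | data_smtext | 1002 | rl_hpsi.feed_hcp_leasenote.BLDGID | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | notedate | ref | linkid,cfid | linkid | 5 | rl_hpsi.ref1.linkid | 19 | Using where |
+----+--------------------+--------------------+--------+-------------------------+-------------+---------+-----------------------------------+------+-------------+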

doesn't have its own unique identifier. Instead it uses a combination of four fields to determine the record to be modified
No: if the four fields in combination constitute a unique key, then you have a unique identifier - just one with four parts.
BLDGID varchar(255),
LEASID varchar(255),
NOTEDATE varchar(255),
REF1 varchar(255),
NOTETEXT varchar(8000)
So you've no idea how the data is actually structured, or you got this from an MS Access programmer who doesn't know either.
SELECT NOTEDATE, REF1, REF2, LASTDATE, USERID, NOTETEXT, lid
FROM feed_hcp_leasenote
OMG. If that's the answer then you're asking the wrong questions.
Short of modifying the source database to include a unique ID (which I can't do) does anyone see a solution to this?
Find another job? Seriously. If you can't add a primary key to the import table / can't import it into a temporary table with a primary key defined, then you will spend a stupid amount of time trying to fix this.
BTW: While InnoDB will handle keys up to 3072 bytes (1024 on 32-bit), this will continue to run like a dog until you reduce your column sizes or use a hash of the actual PK data as the primary key.
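A minimal sketch of the hash idea, assuming the import table from the question can be altered after the CSV load (the asker already adds tempid that way). The column name row_hash and the index name idx_row_hash are made up for illustration. CONCAT_WS skips NULL arguments, so each field is wrapped in IFNULL to keep its position in the hashed string.

```sql
-- Hypothetical column: a fixed-width digest of the four identifying fields.
ALTER TABLE feed_hcp_leasenote
  ADD COLUMN row_hash CHAR(32) CHARACTER SET ascii NULL;

-- Fill it once after each import. The '|' separator keeps ('ab','c') and ('a','bc') apart.
UPDATE feed_hcp_leasenote
SET row_hash = MD5(CONCAT_WS('|',
      IFNULL(BLDGID, ''),
      IFNULL(LEASID, ''),
      IFNULL(NOTEDATE, ''),
      IFNULL(REF1, '')));

-- A 32-byte key instead of four VARCHAR(255) columns.
-- Plain (non-unique) index, in case the four fields turn out not to be unique after all.
ALTER TABLE feed_hcp_leasenote
  ADD INDEX idx_row_hash (row_hash);
```

If the same digest is stored on the destination side when a note is imported, the "does this note already exist" check becomes a single indexed comparison on row_hash rather than a chain of joins through customfield_data.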
It's not clear from your question how many rows you are adding / how many rows are already in the database. Nor have you provided the structure of the other tables. Nor have you provided an explain plan which should be your starting point for any performance problems.
It might be possible to get this running a lot faster - it's impossible to say from the information you have provided. But given the ridiculous constraint that you have to make it faster without changing the schema, I wonder what other horrors await.
I did think that, without knowing the details of the current schema, it would be possible to break down the current query into several components and check each one, maintaining a score in the import table, then use the score to determine what had unmatched data. However, this requires schema changes too.
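A rough sketch of what that scoring could look like, assuming the import table can be altered. The column name match_score is hypothetical, and the cfid values 36 (lease ID) and 56 (REF1) are taken from the question's query. Each check adds a different bit, so the score also shows which check failed.

```sql
-- Hypothetical score column on the import table.
ALTER TABLE feed_hcp_leasenote
  ADD COLUMN match_score TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- Check 1 (bit 1): the lease ID exists as a custom field value.
UPDATE feed_hcp_leasenote f
SET f.match_score = f.match_score + 1
WHERE EXISTS (SELECT 1
              FROM customfield_data c
              WHERE c.cfid = 36
                AND c.data_smtext = f.LEASID);

-- Check 2 (bit 2): the REF1 value exists as a custom field value.
UPDATE feed_hcp_leasenote f
SET f.match_score = f.match_score + 2
WHERE EXISTS (SELECT 1
              FROM customfield_data c
              WHERE c.cfid = 56
                AND c.data_smtext = f.REF1);

-- Rows that failed at least one check.
SELECT tempid, match_score
FROM feed_hcp_leasenote
WHERE match_score < 3;
```

Each statement is a simple semi-join that can be timed and EXPLAINed on its own. Note that independent checks like these do not prove that the same destination note matched every field, so a full score only narrows the candidates.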
BTW have a google for the DISTINCT keyword in SQL.

Related

How to Optimized performance of JOIN query on large table

I am using Server version: 5.5.28-log MySQL Community Server (GPL).
I have a big table, called table A, consisting of 279703655 records. I have to join this table with one of my changelog tables, B, and then insert the matching records into a new tmp table, C.
Table B has an index on column type.
Table A consists of prod_id, his_id and other columns. Table A has indexes on both columns prod_id and history_id.
When I perform the following query
INSERT INTO C(prod,his_id,comm)
SELECT DISTINCT a.product_id,a.history_id,comm
FROM B as b INNER JOIN A as a ON a.his_id = b.his_id AND b.type="applications"
GROUP BY prod_id
ON DUPLICATE KEY UPDATE
`his_id` = VALUES(`his_id`);
it takes 7 to 8 min to insert the records.
Even a simple count on table A takes 15 min to return.
I have also tried a procedure that inserts the records in chunks using LIMIT, but because the count query takes 15 min, it is even slower than before.
BEGIN
DECLARE n INT DEFAULT 0;
DECLARE i INT DEFAULT 0;
SELECT COUNT(*) FROM A INTO n;
SET i=5000000;
WHILE i<n DO
INSERT INTO C(product_id,history_id,comments)
SELECT a.product_id,a.history_id,a.comments FROM B as b
INNER JOIN (SELECT * FROM A LIMIT i,1) as a ON a.history_id=b.history_id;
SET i = i + 5000000;
END WHILE;
End
But the above code also takes 15 to 20 min to execute.
Please suggest how I can make it faster.
Below is EXPLAIN result:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
| 1 | SIMPLE | a | ALL | (NULL) | (NULL) | (NULL) | (NULL) | 279703655 | |
| 1 | SIMPLE | b | eq_ref | PRIMARY | PRIMARY | 8 | DB.a.history_id | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+--------------+-------------+
(from Comment)
CREATE TABLE B (
history_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
history_hash char(32) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
type enum('products','brands','partnames','mc_partnames','applications') NOT NULL,
stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (history_id),
UNIQUE KEY history_hash (history_hash),
KEY type (type),
KEY stamp (stamp)
);
Let's first look at the tables.
What you call table B is really a history table. Its primary key is the history_id.
What you call table A is really a product table with one product per row and product_id its primary key. Each product also has a history_id. Thus you have created a 1:n relation. A product has one history row; one history row relates to multiple products.
You are selecting the product table rows that have an 'applications' type history entry. This should be written as:
select product_id, history_id, comm
from product
where history_id in
(
select history_id
from history
where type = 'applications'
);
(A join would work just as well, but isn't as clear. As there is only one history row per product, you can't get duplicates. Both GROUP BY and DISTINCT are completely superfluous in your query and should be removed in order not to give the DBMS unnecessary work to do. But as mentioned: better not to join at all. If you want rows from table A, select from table A. If you want to look up rows in table B, look them up in the WHERE clause, where all criteria belong.)
Now, we would have to know how many rows may be affected. If only 1% of all history rows are 'applications', then an index should be used. Preferably
create index idx1 on history (type, history_id);
… which finds rows by type and gets their history_id right away.
If, say, 20% of all history rows are 'applications', then reading the table sequentially might be more efficient.
Then, how many product rows may we get? Even with a single history row, we might get millions of related product rows. Or vice versa, with millions of history rows we might get no product row at all. Again, we can provide an index, which may or may not be used by the DBMS:
create index idx2 on product (history_id, product_id, comm);
This is about as fast as it gets: two indexes offered and a properly written query without an unnecessary join. There were times when MySQL had performance problems with IN, and people rewrote the clause with EXISTS then. I don't think this is still necessary.
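For reference, this is roughly what the EXISTS form looks like. The miniature product and history tables below are stand-ins invented only so the sketch runs on its own; the real tables have more columns and different data.

```sql
-- Hypothetical miniature versions of the two tables.
CREATE TABLE history (
  history_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  type ENUM('products','brands','partnames','mc_partnames','applications') NOT NULL,
  PRIMARY KEY (history_id),
  KEY idx1 (type, history_id)
);

CREATE TABLE product (
  product_id BIGINT UNSIGNED NOT NULL,
  history_id BIGINT UNSIGNED NOT NULL,
  comm VARCHAR(100),
  PRIMARY KEY (product_id),
  KEY idx2 (history_id, product_id, comm)
);

INSERT INTO history (history_id, type) VALUES (1, 'applications'), (2, 'brands');
INSERT INTO product (product_id, history_id, comm)
VALUES (10, 1, 'a'), (11, 2, 'b'), (12, 1, 'c');

-- Same result as the IN version: products whose history row is of type 'applications'.
SELECT p.product_id, p.history_id, p.comm
FROM product AS p
WHERE EXISTS (
  SELECT 1
  FROM history AS h
  WHERE h.history_id = p.history_id
    AND h.type = 'applications'
);
-- returns products 10 and 12
```

Comparing EXPLAIN for both forms on the real data is the only reliable way to tell whether the rewrite changes anything on a given MySQL version.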
As of MySQL 8.0.3, you can create histogram statistics for tables.
analyze table history update histogram on type;
analyze table product update histogram on history_id;
This is an important step to help the optimizer to find the optimal way to select the data.
Indexes needed (assuming it is history_id, not his_id):
B: INDEX(type, history_id) -- in this order. Note: "covering"
A: INDEX(history_id, product_id, comm)
What column or combination of columns provides the uniqueness constraint that IODKU (INSERT ... ON DUPLICATE KEY UPDATE) needs?
Really, provide SHOW CREATE TABLE.
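To illustrate why the uniqueness question matters: the UPDATE branch of IODKU only fires when an incoming row collides with a PRIMARY KEY or UNIQUE index on the target table. The sketch below assumes C is keyed on product_id, which the question never states; the miniature A, B and C tables are invented so the example runs on its own.

```sql
-- Hypothetical miniature tables; the real A, B and C differ.
CREATE TABLE B (
  history_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  type ENUM('products','applications') NOT NULL,
  KEY type_history (type, history_id)
);
CREATE TABLE A (
  product_id BIGINT UNSIGNED NOT NULL,
  history_id BIGINT UNSIGNED NOT NULL,
  comments VARCHAR(100),
  KEY hist (history_id, product_id)
);
CREATE TABLE C (
  product_id BIGINT UNSIGNED NOT NULL,
  history_id BIGINT UNSIGNED NOT NULL,
  comments VARCHAR(100),
  PRIMARY KEY (product_id)          -- assumed: this is what makes a row a "duplicate"
);

INSERT INTO B VALUES (1, 'applications'), (2, 'applications'), (3, 'products');
INSERT INTO A VALUES (10, 1, 'first'), (10, 2, 'second'), (11, 3, 'ignored');

-- No GROUP BY or DISTINCT: a second row for product 10 hits the primary key
-- and takes the UPDATE branch instead of being inserted.
INSERT INTO C (product_id, history_id, comments)
SELECT a.product_id, a.history_id, a.comments
FROM A AS a
WHERE a.history_id IN (SELECT b.history_id FROM B AS b WHERE b.type = 'applications')
ORDER BY a.history_id
ON DUPLICATE KEY UPDATE
  history_id = VALUES(history_id),
  comments   = VALUES(comments);
```

Without a unique key on C, every selected row would simply be inserted and the ON DUPLICATE KEY UPDATE clause would never run.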

Mysql count() with Left Join taking over 10 seconds on 5 million records

I have the following two tables that I am joining like this.
I am using MySQL v. 5.7.
Table: contacts about 5 mil
id : auto inc (int) primary Key
status : Int (Index)
Table : contact_lists about 10 mil
id : auto inc (int) primary Key
contactId : index
listId: index
Table: Lists about 30
id: auto inc (int) primary key
Here is my query. I have 10 million records in the contacts table.
SELECT cl.listId, count(c.id) active from `contact_lists` cl
LEFT JOIN `contacts` c ON c.id = cl.contactId and c.status = 1
group by cl.listId
Here is my Explain
1 SIMPLE cl NULL listId contact_lists 8 NULL 9062524 100.00 Using index
1 SIMPLE c NULL eq_ref PRIMARY PRIMARY 4 cl.contactId 1 100.00 Using where
This query takes over 11 seconds when I run it. Any idea how I can speed it up?
I have tried adding indexes, but nothing really worked. Can I rewrite this somehow to make it faster, ideally under 2 seconds? The issue is that count(c.id) is very slow with that much data.
The result:
listId active
1 100
2 3000
3 500010
and so on
Based on your EXPLAIN results, there is no index on id and status. Create one:
ALTER TABLE `contacts`
ADD INDEX `id_status` (`id`, `status`);
And you'll also need a foreign key on contact_lists:
ALTER TABLE `contact_lists`
ADD CONSTRAINT `FK_contact_lists_contacts` FOREIGN KEY (`contactId`) REFERENCES `contacts` (`id`);
Note that making these changes might lock your tables for a little while.

Slow execution of a subquery when no matches

Please note that I have asked this question on dba.stackexchange.com, but I thought I'd post it here too:
In MySQL, I have two basic tables - Posts and Followers:
CREATE TABLE Posts (
id int(11) NOT NULL AUTO_INCREMENT,
posted int(11) NOT NULL,
body varchar(512) NOT NULL,
authorId int(11) NOT NULL,
PRIMARY KEY (id),
KEY posted (posted),
KEY authorId (authorId,posted)
) ENGINE=InnoDB;
CREATE TABLE Followers (
userId int(11) NOT NULL,
followerId int(11) NOT NULL,
PRIMARY KEY (userId,followerId),
KEY followerId (followerId)
) ENGINE=InnoDB;
I have the following query, which seems to be optimized enough:
SELECT p.*
FROM Posts p
WHERE p.authorId IN (SELECT f.userId
FROM Followers f
WHERE f.followerId = 9
ORDER BY authorId)
ORDER BY posted
LIMIT 0, 20
EXPLAIN output:
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
| 1 | PRIMARY | p | index | NULL | posted | 4 | NULL | 20 | Using where |
| 2 | DEPENDENT SUBQUERY | f | unique_subquery | PRIMARY,followerId | PRIMARY | 8 | func,const | 1 | Using index; Using where |
+------+--------------------+-------+-----------------+--------------------+---------+---------+------------+------+--------------------------+
When followerId is a valid id (meaning, it actually exists in both tables), the query execution is almost immediate. However, when the id is not present in the tables, the query only returns results (empty set) after a 7 second delay.
Why is this happening? Is there some way to speed up this query for cases where there are no matches (without having to do a check ahead of time)?
Is there some way to speed up this query ...???
Yes. You should do two things.
First, you should use EXISTS instead of IN (cross reference: SQL Server IN vs. EXISTS Performance). It'll speed up the instances where there is a match, which will come in handy as your data set grows (it may be fast enough now, but that doesn't mean you shouldn't follow best practices, and in this case EXISTS is a better practice than IN).
Second, you should modify the keys on your second table just a little bit. You were off to a good start using the compound key on (userId,followerId), but in terms of optimizing this particular query, you need to keep in mind the "leftmost prefix" rule of MySQL indices, e.g.:
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows. http://dev.mysql.com/doc/refman/5.6/en/multiple-column-indexes.html
What your Query Execution Plan from EXPLAIN is telling you is that SQL thinks it makes more sense to join Followers to Posts (using the Primary Key on Posts) and filter the results for a given followerId off of that index. Think of it like saying "Show me all the possible matches, then reduce that down to just the ones that match followerId = {}"
If you replace your followerId key with a compound key (followerId,userId), you should be able to quickly zoom in to just the user ids associated with a given followerID and do the existence check against those.
I wish I knew how to explain this better... it's kind of a tough concept to grasp until you have an "Aha!" moment and it clicks. But if you look into the leftmost prefix rules on indices, and also change the key on followerId to be a key on (followerId,userId), I think it'll speed it up quite a bit. And if you use EXISTS instead of IN, that'll help you maintain that speed even as your data set grows.
Try this one:
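Put together, the two suggestions might look like the sketch below, run against the Posts and Followers tables from the question. The index name follower_user is made up. One caveat: on InnoDB a secondary index already carries the primary key columns, so the explicit compound key may make little practical difference; it is spelled out here to mirror the advice above.

```sql
-- Swap the single-column key for a compound key led by followerId.
ALTER TABLE Followers
  DROP INDEX followerId,
  ADD INDEX follower_user (followerId, userId);

-- EXISTS form of the original query.
SELECT p.*
FROM Posts p
WHERE EXISTS (
  SELECT 1
  FROM Followers f
  WHERE f.followerId = 9
    AND f.userId = p.authorId
)
ORDER BY p.posted
LIMIT 0, 20;
```

Running EXPLAIN on this with a followerId that has no rows should show whether the plan still walks the posted index on Posts, which is the likely source of the slow no-match case.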
SELECT p.*
FROM Posts p
inner join Followers f On f.userId = p.authorId
WHERE f.followerId = 9
ORDER BY posted
LIMIT 0, 20

Merge data from 2 tables, use only unique rows

I have 2 tables in my database.
primary table:
primary_id
primary_date
primary_measuredData
temporary table:
temporary_id
temporary_date
temporary_measuredData
Well, the tables have other columns, but these are the important ones.
What I want is the following.
Table "primary" consists of verified measuredData. If data is available there, the output should take it from primary; if it is not available in primary, it should take it from temporary.
In about 99.99% of the cases all old data is in the primary, and only the last day is from the temporary table.
Example:
primary table:
2013-02-05; 345
2013-02-07; 123
2013-02-08; 3425
2013-02-09; 334
temporary table:
2013-02-06; 567
2013-02-07; 1345
2013-02-10; 31
2013-02-12; 33
I am looking for the SQL query that outputs:
2013-02-05; 345 (from primary)
2013-02-06; 567 (from temporary, no value available from prim)
2013-02-07; 123 (from primary, both prim & temp have this date so primary is used)
2013-02-08; 3425 (primary)
2013-02-09; 334 (primary)
2013-02-10; 31 (temp)
2013-02-12; 33 (temp)
You see, no duplicate dates, and if data is available in the primary table then that data is used.
I have no idea how to solve this, so I can't give you any "this is what I've done so far :D"
Thanks!
EDIT:
The value of "measuredData" can differ from temp and primary. This is because temp is used to store a temporary value, and later when the data is verified it goes into the primary table.
EDIT 2:
I changed the primary table and added a new column "temporary", so that I store all the data in the same table. When the primary data is updated, it overwrites the temporary data with the new numbers. This way I don't need to merge 2 tables into one.
You should start with a UNION QUERY like this:
SELECT p.primary_date AS dt, p.primary_measuredData as measured
FROM
`primary` p
UNION ALL
SELECT t.temporary_date, t.temporary_measuredData
FROM
`temporary` t LEFT JOIN `primary` p
ON p.primary_date=t.temporary_date
WHERE p.primary_date IS NULL
A LEFT JOIN where there's no match (p.primary_date IS NULL) will return all rows from the temporary table that are not present in the primary table. And using UNION ALL you can return all rows available in the first table.
You might want to add an ORDER BY clause to the whole query.
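Here is the same query as a runnable sketch with the sample data from the question and the ORDER BY added. The column types and the extra src column are assumptions made for illustration.

```sql
-- Table and column names follow the question; the types are assumed.
CREATE TABLE `primary` (
  primary_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  primary_date DATE NOT NULL,
  primary_measuredData INT NOT NULL
);
CREATE TABLE `temporary` (
  temporary_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  temporary_date DATE NOT NULL,
  temporary_measuredData INT NOT NULL
);

INSERT INTO `primary` (primary_date, primary_measuredData) VALUES
  ('2013-02-05', 345), ('2013-02-07', 123), ('2013-02-08', 3425), ('2013-02-09', 334);
INSERT INTO `temporary` (temporary_date, temporary_measuredData) VALUES
  ('2013-02-06', 567), ('2013-02-07', 1345), ('2013-02-10', 31), ('2013-02-12', 33);

-- All primary rows, plus temporary rows whose date has no primary row.
SELECT p.primary_date AS dt, p.primary_measuredData AS measured, 'primary' AS src
FROM `primary` p
UNION ALL
SELECT t.temporary_date, t.temporary_measuredData, 'temporary'
FROM `temporary` t
LEFT JOIN `primary` p ON p.primary_date = t.temporary_date
WHERE p.primary_date IS NULL
ORDER BY dt;   -- a trailing ORDER BY applies to the whole UNION
```

With this data the result is the seven rows from the question's expected output, with 2013-02-07 coming from primary (123) rather than temporary (1345).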

MYSQL: Optimize Order By in Table Sort

I am developing an application for my college's website and I would like to pull all the events in ascending date order from the database. There is a total of four tables:
Table Events1
event_id, mediumint(8), Unsigned
date, date,
Index -> Primary Key (event_id)
Index -> (date)
Table events_users
event_id, smallint(5), Unsigned
user_id, mediumint(8), Unsigned
Index -> PRIMARY (event_id, user_id)
Table user_bm
link, varchar(26)
user_id, mediumint(8)
Index -> PRIMARY (link, user_id)
Table user_eoc
link, varchar(8)
user_id, mediumint(8)
Index -> Primary (link, user_id)
Query:
EXPLAIN SELECT * FROM events1 E INNER JOIN event_users EU ON E.event_id = EU.event_id
RIGHT JOIN user_eoc EOC ON EU.user_id = EOC.user_id
INNER JOIN user_bm BM ON EOC.user_id = BM.user_id
WHERE E.date >= '2013-01-01' AND E.date <= '2013-01-31'
AND EOC.link = "E690"
AND BM.link like "1.1%"
ORDER BY E.date
EXPLANATION:
The query above does two things.
1) Searches and filters out all students through the user_bm and user_eoc tables. The "link" columns are denormalized columns to quickly filter students by major/year/campus etc.
2) After applying the filter, MYSQL grabs the user_ids of all matching students and finds all events they are attending and outputs them in ascending order.
QUERY OPTIMIZER EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | EOC | ref | PRIMARY | PRIMARY | 26 | const | 47 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | BM | ref | PRIMARY,user_id-link | user_id-link | 3 | test.EOC.user_id | 1 | Using where; Using index |
| 1 | SIMPLE | EU | ref | PRIMARY,user_id | user_id | 3 | test.EOC.user_id | 1 | Using index |
| 1 | SIMPLE | E | eq_ref | PRIMARY,date-event_id | PRIMARY | 3 | test.EU.event_id | 1 | Using where |
QUESTION:
The query works fine but can be optimized. Specifically - using filesort and using temporary is costly and I would like to avoid this. I am not sure if this is possible because I would like to 'Order By' events by date that have a 1:n relationship with the matching users. The Order BY applies to a joined table.
Any help or guidance would be greatly appreciated. Thank you and Happy Holidays!
Ordering can be done in two ways: by index or by temporary table. You are ordering by date in table Events1, but the query is using the PRIMARY KEY, which doesn't contain date, so in this case the result needs to be ordered in a temporary table.
It is not necessarily expensive though. If the result is small enough to fit in memory it will not be a temporary table on disk, just in memory and that is not expensive.
Neither is filesort. "Using filesort" doesn't mean it will use any file, it just means it's not sorting by index.
So, if your query executes fast you should be happy. If the result set is small it will be sorted in memory and no files will be created.
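If you want to check rather than guess, one option is to compare the session status counters before and after running the query. This is only a sketch: the counters are standard MySQL status variables, but SHOW STATUS itself may use an internal temporary table, so look at how much Created_tmp_disk_tables changes rather than at the absolute values.

```sql
-- Snapshot the session counters before running the query.
SHOW SESSION STATUS LIKE 'Created_tmp%';
SHOW SESSION STATUS LIKE 'Sort%';

-- ... run the events query from the question here, without the EXPLAIN prefix ...

-- Snapshot again and compare with the first pair of results.
SHOW SESSION STATUS LIKE 'Created_tmp%';
SHOW SESSION STATUS LIKE 'Sort%';
```

If Created_tmp_disk_tables does not grow and Sort_merge_passes stays at zero, the temporary table and the sort most likely stayed in memory, which is the cheap case described above.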