I'm having a problem with a query running on SQL Server 2008 Enterprise.
The query inserts into one table from another, but it checks that each record is inserted only once.
The query does something like this:
insert into A(...)
--complex select from table B as b
WHERE NOT EXISTS (SELECT 1 FROM A WHERE id = b.id)
Edit: this query does the following:
If the "complex select" from B selects the record 45 (i.e. the record with id = 45) twice then the where is true for the first time record 45 appears, so it gets inserted in A.
Then the second time record 45 appears, the where condition is false, so it does not get inserted in A twice.
This query works fine on SQL Server 2008 Standard edition, so I think the problem is a difference between the two editions (such as a default setting that differs).
I'm reading about the Maximum Insert Commit Size, but I'm not sure if that can be the issue.
There is no error message; the only visible symptom is that on Standard I get record 45 once and on Enterprise I get it twice.
Any Ideas?
I'm pretty sure that the behaviour you say you are getting on Standard Edition is for some other reason than you think it is.
You seem to be expecting that if your values to be INSERTed contain duplicates that one will be INSERTED and then the NOT EXISTS will evaluate to false because of the existence of the newly added row. However AFAIK that is not the way it is supposed to work. Looking at a simple INSERT .. SELECT as below.
CREATE TABLE A(id INT PRIMARY KEY)
CREATE TABLE B(id INT PRIMARY KEY)
INSERT INTO A
SELECT *
FROM B
Gives the following plan
Adding the NOT EXISTS clause
INSERT INTO A
SELECT *
FROM B
WHERE NOT EXISTS (SELECT 1 FROM A WHERE id = B.id)
Changes the plan as follows
As well as the plan now including an anti semi join, SQL Server has added an eager spool to the plan before the clustered index insert on A. This is a blocking operator, and its purpose is to ensure that the entire SELECT is evaluated before any rows are inserted into A at all (this is related to Halloween Protection).
You might not necessarily see a spool in your plans however. e.g. SQL Server might also choose to use another blocking operator such as a SORT or a hash anti semi join.
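The point about the SELECT being evaluated before any row is inserted can be checked with a small experiment. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server (an assumption made purely so the example is runnable; it says nothing about SQL Server's actual plans), and drops the primary key on A so that duplicates are not rejected outright.

```python
import sqlite3

# SQLite stands in for SQL Server here; table names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER)")  # no key, so duplicates are allowed
conn.execute("CREATE TABLE B (id INTEGER)")
conn.executemany("INSERT INTO B VALUES (?)", [(45,), (45,), (46,)])

# NOT EXISTS is checked against A as it was before the statement started,
# so both copies of 45 pass the filter.
conn.execute(
    "INSERT INTO A SELECT id FROM B AS b "
    "WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.id = b.id)"
)
rows_plain = [r[0] for r in conn.execute("SELECT id FROM A ORDER BY id")]

# De-duplicating the source itself is what guarantees one row per id.
conn.execute("DELETE FROM A")
conn.execute(
    "INSERT INTO A SELECT DISTINCT id FROM B AS b "
    "WHERE NOT EXISTS (SELECT 1 FROM A WHERE A.id = b.id)"
)
rows_distinct = [r[0] for r in conn.execute("SELECT id FROM A ORDER BY id")]

print(rows_plain)
print(rows_distinct)
```

If the complex select can return the same id more than once, de-duplicating it (DISTINCT, GROUP BY, or similar) is the part that prevents the double insert; the NOT EXISTS only protects against rows that were already in A.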
Please post the execution plan for at least the Standard edition, and preferably both. Also post the queries so we can see whether you are using any non-deterministic constructs.
Try restructuring your query and see if it works. Instead of the NOT EXISTS, join the tables in the insert query:
INSERT A(...)
SELECT ... FROM B LEFT JOIN A ON B.id = A.id
WHERE B.id = 45 AND A.id IS NULL
Then you are performing the select only once.
Say you have 2 tables with email in common between them and you want to set a binary value of 1 in the match field. This query works just fine...
...but it is very SLOW and puts a huge load on the server if you're dealing with a large number of records (and we are):
UPDATE table1 a
INNER JOIN table2 b
ON a.email = b.email
SET a.match = 1;
Does anyone know of the exact same functionality but written with syntax that would put less load on the server and process the query faster?
Thanks!
Before running the query be sure to index both tables. In our case Table 1 has unique values whereas Table 2 has multiple records with the same email (i.e. different websites)
CREATE UNIQUE INDEX email ON table1 ( email );
CREATE INDEX email ON table2 ( email );
Once these indexes were created on those two tables, the query runs very fast.
We have an old FoxPro DB that still has active data being entered into it. I am in the process of writing a series of .bat files that will update a MySQL database for our web applications that I'm working on.
Our FoxPro databases were never set up with unique IDs or anything useful like that so I'm having to have the query look at a few different fields.
Here's my query thus far:
//traininghistory = MySQL DB
//traininghistory_test = FoxPro DB
INSERT INTO traininghistory
WHERE traininghistory_test.CLASSID NOT IN(SELECT CLASSID FROM traininghistory)
AND traininghistory_test.EMPID NOT IN(SELECT EMPID FROM traininghistory)
What I'm After is this:
I need a query that looks at the 600,000+ entries in the FoxPro DB (traininghistory_test in my code), compares them to the 600,000+ entries in the MySQL DB (traininghistory in my code), and only inserts the ones where the columns CLASSID and EMPID are new, that is, they are NOT in the traininghistory table.
Any thoughts on this (or if you know a simpler/more efficient way to execute this query in MySQL) are greatly appreciated.
One option is to use an outer join / null check:
insert into traininghistory
select tht.*  -- or list the columns explicitly
from traininghistory_test tht
left join traininghistory th on tht.empid = th.empid
and tht.classid = th.classid
where th.empid is null
It's also worth noting, your current query may leave out records since it's not comparing empid and classid in the same records.
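To make that last point concrete, here is a small sketch using Python's sqlite3 module (SQLite is only a stand-in for MySQL here, and the sample rows are invented for illustration). It compares the two independent NOT IN checks from the question with the paired outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traininghistory      (CLASSID INTEGER, EMPID INTEGER);
    CREATE TABLE traininghistory_test (CLASSID INTEGER, EMPID INTEGER);
    -- Already loaded: employee 1 took class 10, employee 2 took class 20.
    INSERT INTO traininghistory      VALUES (10, 1), (20, 2);
    -- New in the source: employee 1 took class 20 (a pair not yet loaded).
    INSERT INTO traininghistory_test VALUES (10, 1), (20, 2), (20, 1);
""")

# Two independent NOT IN checks, as in the question.
separate = conn.execute("""
    SELECT CLASSID, EMPID FROM traininghistory_test
    WHERE CLASSID NOT IN (SELECT CLASSID FROM traininghistory)
      AND EMPID   NOT IN (SELECT EMPID   FROM traininghistory)
""").fetchall()

# Outer join on both columns at once, as in the answer.
paired = conn.execute("""
    SELECT tht.CLASSID, tht.EMPID
    FROM traininghistory_test AS tht
    LEFT JOIN traininghistory AS th
           ON tht.EMPID = th.EMPID AND tht.CLASSID = th.CLASSID
    WHERE th.EMPID IS NULL
""").fetchall()

print(separate)
print(paired)
```

The pair (20, 1) is new, but class 20 and employee 1 each already appear somewhere in traininghistory, so the separate checks miss it while the paired join finds it.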
One way is to create a unique index on the columns (CLASSID, EMPID), then run:
INSERT IGNORE INTO traininghistory SELECT * FROM traininghistory_test;  -- or use an explicit field list instead of *
That's all.
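A rough sketch of the unique-index-plus-ignore idea, using Python's sqlite3 module. Note the substitution: SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE, and the index name uq_class_emp and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE traininghistory      (CLASSID INTEGER, EMPID INTEGER);
    CREATE TABLE traininghistory_test (CLASSID INTEGER, EMPID INTEGER);
    -- Hypothetical index name; the unique pair is what matters.
    CREATE UNIQUE INDEX uq_class_emp ON traininghistory (CLASSID, EMPID);
    INSERT INTO traininghistory      VALUES (10, 1);
    INSERT INTO traininghistory_test VALUES (10, 1), (20, 1), (20, 1);
""")

# Rows that would violate the unique index are skipped instead of raising.
conn.execute(
    "INSERT OR IGNORE INTO traininghistory "
    "SELECT CLASSID, EMPID FROM traininghistory_test"
)
loaded = conn.execute(
    "SELECT CLASSID, EMPID FROM traininghistory ORDER BY CLASSID, EMPID"
).fetchall()
print(loaded)
```

The existing pair (10, 1) and the repeated pair (20, 1) are both skipped, so each pair ends up in the target exactly once.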
I have 3 tables-
--server 1
CREATE TABLE TableA (GROUP_ID INT
,STATUS VARCHAR(10))
--server 2
CREATE TABLE TableB (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
-- server 1
CREATE TABLE TableC (GROUP_ID INT
,NAME VARCHAR(10)
,STATE VARCHAR(50)
,COMPANY VARCHAR(50))
Sample data
INSERT INTO TableA VALUES (1, 'READY'), (2, 'NOT READY'), (3, 'READY'), (4, 'NOT READY')
INSERT INTO TableB VALUES (1, 'Mike', 'NY', 'aaa'), (1, 'Rick', 'OK', 'bbb'), (2, 'Smith', 'TX', 'ccc'), (3, 'Nancy', 'MN', 'bbb'), (4, 'Roger', 'CA', 'aaa')
I am trying to build an SSDT (SSIS 2012) package to load data into TableC from TableB for only those GROUP_IDs which have STATUS = 'READY' in TableA, and then change the STATUS to 'LOADED'.
I need to accomplish this using project-level parameters or variables for the TableA GROUP_ID and STATUS, because I will be doing this for about 60 tables and those values might change.
I must build an SSIS package; it is a requirement.
Using a linked server is not preferred, unless it's impossible to achieve this through SSIS.
Any help would be appreciated.
As the two tables are on separate servers, you could create a Data Flow with two Sources. You'll need to set up Connection Managers to both databases, then point one Source to the database holding TableA, and the other to the database holding TableB. Once this is done, you can join the two with a Merge Join, and then discard the records which don't have the value or values you want using a Conditional Split. It would ultimately look a bit like this:
First you'll need to set up the Sources as already discussed. However, since you want to use a Merge Join, you'll need to sort the output from the sources. You can do this in SSIS with a Sort transform, but you're better off just building an ORDER BY clause into your SELECT statement that you have in the source, and then telling SSIS that the output is sorted:
Right click on each Source, and select Show Advanced Editor.
Go to the Input and Output Properties tab.
Select OLE DB Source Output, then set IsSorted on the right-hand side to True.
Expand OLE DB Source Output, then expand Output Columns.
Click on the column you're sorting by (presumably GROUP_ID), and set SourceKeyPosition to 1.
Here's an image of that last bit in case you're at all lost - it can be a little fiddly getting around the properties in SSIS if you're not used to it:
Since the STATUS value you want to load might change, you could set this up in the Project Parameters. Just go to that page from the Solution Explorer, and click to add a new parameter. You should end up with something like this:
As you're using 2012, you'll be able to configure this value after release in SSMS, avoiding the need to re-work this or create a configuration file.
When you set up the Conditional Split, you have a couple of options. If you might want to send rows with other STATUS values into other tables in future, then you should look for cases where the STATUS has a value of READY, but if you only care about the READY rows you can also do it the way I have here:
When you drag the output of the Conditional Split to the destination, it'll ask which output you want to use. If you've set it up the same way I have, use Conditional Split Default Output, and it'll pass through all rows which don't meet one of the conditions you've stated.
If you need to update the values of the data while you're loading it, it depends where you want the updates to show. If you want to leave TableA and TableB alone, but change the value in TableC, then you could set up a Derived Column transform after the Conditional Split and before the Destination. You could then replace the value in the STATUS column with one you set (this can be parameterised, as above):
If you want to update the STATUS field in TableA, then you should go back to the Control Flow, and after the Data Flow you've been working on, add an Execute SQL Task which is connected to the database holding TableA, and which runs a simple SQL update statement.
If this is going to be running outside of business hours and you know there won't be any new rows during this time, you can simply update all rows which currently have a STATUS of READY. If you need to update the rows more precisely because the situation might be continuing to change while you work, then you might need to re-think this - one option would be to grab all of the GROUP_ID values you want to update at the beginning, store that in a variable, and use the variable as a parameter in the Source select statements and Execute SQL Task update statement. You could also choose to work in a loop instead, but that would obviously be a lot slower than operating on the rows in bulk.
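The "grab the GROUP_ID values first" option can be sketched outside SSIS to show the order of operations. This is only an illustration of the logic, not an SSIS package: it uses Python's sqlite3 module, with two in-memory databases standing in for the two servers, and the sample data from the question.

```python
import sqlite3

server1 = sqlite3.connect(":memory:")  # stands in for the server with TableA and TableC
server2 = sqlite3.connect(":memory:")  # stands in for the server with TableB
server1.executescript("""
    CREATE TABLE TableA (GROUP_ID INTEGER, STATUS TEXT);
    CREATE TABLE TableC (GROUP_ID INTEGER, NAME TEXT, STATE TEXT, COMPANY TEXT);
    INSERT INTO TableA VALUES (1,'READY'),(2,'NOT READY'),(3,'READY'),(4,'NOT READY');
""")
server2.executescript("""
    CREATE TABLE TableB (GROUP_ID INTEGER, NAME TEXT, STATE TEXT, COMPANY TEXT);
    INSERT INTO TableB VALUES (1,'Mike','NY','aaa'),(1,'Rick','OK','bbb'),
        (2,'Smith','TX','ccc'),(3,'Nancy','MN','bbb'),(4,'Roger','CA','aaa');
""")

# 1. Capture the GROUP_IDs to process once, up front.
ready_ids = [r[0] for r in server1.execute(
    "SELECT GROUP_ID FROM TableA WHERE STATUS = ? ORDER BY GROUP_ID", ("READY",))]
marks = ",".join("?" * len(ready_ids))

# 2. Load only those groups from TableB into TableC.
rows = server2.execute(
    f"SELECT GROUP_ID, NAME, STATE, COMPANY FROM TableB "
    f"WHERE GROUP_ID IN ({marks}) ORDER BY GROUP_ID, NAME",
    ready_ids).fetchall()
server1.executemany("INSERT INTO TableC VALUES (?, ?, ?, ?)", rows)

# 3. Update exactly the captured groups, not whatever happens to be READY by now.
server1.execute(
    f"UPDATE TableA SET STATUS = 'LOADED' WHERE GROUP_ID IN ({marks})", ready_ids)
server1.commit()

loaded_names = [r[0] for r in server1.execute(
    "SELECT NAME FROM TableC ORDER BY GROUP_ID, NAME")]
statuses = server1.execute(
    "SELECT GROUP_ID, STATUS FROM TableA ORDER BY GROUP_ID").fetchall()
print(loaded_names)
print(statuses)
```

In the package itself, step 1 would correspond to an Execute SQL Task filling a variable, and steps 2 and 3 to the Data Flow and the final Execute SQL Task using that variable.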
This part is from my original answer before the question was updated, but I'll leave it here in case it's useful to anyone else:
If the tables (A and B) are in the same database, instead of the Conditional Split you could set the source up to be a select statement which joins Table A to Table B, and has a WHERE clause that only selects the rows with a STATUS of READY:
select b.GROUP_ID, b.NAME, b.STATE, b.COMPANY
from TableA a
inner join TableB b
on a.GROUP_ID = b.GROUP_ID
where a.STATUS = 'READY';
I got a table with a normal setup of auto inc. ids. Some of the rows have been deleted so the ID list could look something like this:
(1, 2, 3, 5, 8, ...)
Then, from another source (Edit: Another source = NOT in a database) I have this array:
(1, 3, 4, 5, 7, 8)
I'm looking for a query I can use on the database to get the list of ID:s NOT in the table from the array I have. Which would be:
(4, 7)
Does such exist? My solution right now is either creating a temporary table so the command "WHERE table.id IS NULL" works, or probably worse, using the PHP function array_diff to see what's missing after having retrieved all the ids from table.
Since the list of ids is closing in on millions of rows, I'm eager to find the best solution.
Thank you!
/Thomas
Edit 2:
My main application is a rather simple table which is populated with a lot of rows. This application is administered using a browser, and I'm using PHP as the interpreter for the code.
Everything in this table is to be exported to another system (which is a 3rd-party product), and there's as yet no way of doing this besides manually using the import function in that program. It's also possible to insert new rows in the other system, although the agreed routine is to never ever do this.
The problem is then that my system cannot be 100% sure that the user did everything correctly after he/she pressed the "export" key, or that no rows have ever been created in the other system.
From the other system I can get a CSV file containing all the rows that system has. So, by comparing the CSV file and my table I can see if:
* There are any rows missing in the other system that should have been imported
* If someone has created rows in the other system
The problem isn't "solving it". It's finding the best solution, since there is so much data in the rows.
Thanks again!
/Thomas
We can use MySQL's NOT IN operator.
SELECT id
FROM table_one
WHERE id NOT IN ( SELECT id FROM table_two )
Edited
If you are getting the source from a CSV file, then you can simply put these values directly into the query.
I am assuming that the CSV values are like 1,2,3,...,n
SELECT id
FROM table_one
WHERE id NOT IN ( 1,2,3,...,n );
EDIT 2
Or, if you want to select the other way around, you can use mysqlimport (or LOAD DATA) to import the data into a temporary table in the MySQL database, retrieve the result, and then drop the table.
Like:
Create table
CREATE TABLE my_temp_table(
ids INT
);
load .csv file
LOAD DATA LOCAL INFILE 'yourIDs.csv' INTO TABLE my_temp_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(ids);
Selecting records
SELECT ids FROM my_temp_table
WHERE ids NOT IN ( SELECT id FROM table_one )
dropping table
DROP TABLE IF EXISTS my_temp_table
What about using a left join ; something like this :
select second_table.id
from second_table
left join first_table on first_table.id = second_table.id
where first_table.id is null
You could also go with a sub-query ; depending on the situation, it might, or might not, be faster, though :
select second_table.id
from second_table
where second_table.id not in (
select first_table.id
from first_table
)
Or with a not exists :
select second_table.id
from second_table
where not exists (
select 1
from first_table
where first_table.id = second_table.id
)
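As a quick check that the three forms agree, here is a sketch using Python's sqlite3 module with the ids from the question. SQLite stands in for MySQL, and it is assumed that first_table is the database table and second_table holds the externally supplied list (loaded here into a temporary table).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_table (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TEMP TABLE second_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO first_table VALUES (?)",
                 [(i,) for i in (1, 2, 3, 5, 8)])
conn.executemany("INSERT INTO second_table VALUES (?)",
                 [(i,) for i in (1, 3, 4, 5, 7, 8)])

queries = {
    "left join": """
        SELECT second_table.id FROM second_table
        LEFT JOIN first_table ON first_table.id = second_table.id
        WHERE first_table.id IS NULL
        ORDER BY second_table.id""",
    "not in": """
        SELECT second_table.id FROM second_table
        WHERE second_table.id NOT IN (SELECT first_table.id FROM first_table)
        ORDER BY second_table.id""",
    "not exists": """
        SELECT second_table.id FROM second_table
        WHERE NOT EXISTS (
            SELECT 1 FROM first_table WHERE first_table.id = second_table.id)
        ORDER BY second_table.id""",
}

# Run each variant and collect the ids that are missing from first_table.
missing = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in queries.items()}
print(missing)
```

All three return the ids 4 and 7 for this data; which one is fastest on millions of rows depends on the engine and the indexes, so it is worth timing them on the real tables.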
The function you are looking for is NOT IN (an alias for <> ALL)
The MYSQL documentation:
http://dev.mysql.com/doc/refman/5.0/en/all-subqueries.html
An Example of its use:
http://www.roseindia.net/sql/mysql-example/not-in.shtml
Enjoy!
The problem is that T1 could have a million rows or ten million rows, and that number could change, so you don't know how many rows your comparison table, T2, the one that has no gaps, should have, for doing a WHERE NOT EXISTS or a LEFT JOIN testing for NULL.
But the question is, why do you care if there are missing values? I submit that, when an application is properly architected, it should not matter if there are gaps in an autoincrementing key sequence. Even an application where gaps do matter, such as a check register, should not be using an autoincrementing primary key as a synonym for the check number.
Care to elaborate on your application requirement?
OK, I've read your edits/elaboration. Synchronizing two databases where the second is not supposed to insert any new rows, but might do so, sounds like a problem waiting to happen.
Neither approach suggested above (WHERE NOT EXISTS or LEFT JOIN) is air-tight and neither is a way to guarantee logical integrity between the two systems. They will not let you know which system created a row in situations where both tables contain a row with the same id. You're focusing on gaps now, but another problem is duplicate ids.
For example, if both tables have a row with id 13887, you cannot assume that database1 created the row. It could have been inserted into database2, and then database1 could insert a new row using that same id. You would have to compare all column values to ascertain that the rows are the same or not.
I'd suggest therefore that you also explore GUID as a replacement for autoincrementing integers. You cannot prevent database2 from inserting rows, but at least with GUIDs you won't run into a problem where the second database has inserted a row and assigned it a primary key value that your first database might also use, resulting in two different rows with the same id. CreationDateTime and LastUpdateDateTime columns would also be useful.
However, a proper solution, if it is available to you, is to maintain just one database and give users remote access to it, for example, via a web interface. That would eliminate the mess and complication of replication/synchronization issues.
If a remote-access web-interface is not feasible, perhaps you could make one of the databases read-only? Or does database2 have to make updates to the rows? Perhaps you could deny insert privilege? What database engine are you using?
I have the same problem: I have a list of values from the user, and I want to find the subset that does not exist in another table. I did it in Oracle by building a pseudo-table in the select statement, as shown here. Try it in MySQL without the "from dual":
-- find ids from user (1,2,3) that *don't* exist in my person table
-- build a pseudo table and join it with my person table
select pseudo.id from (
select '1' as id from dual
union select '2' as id from dual
union select '3' as id from dual
) pseudo
left join person
on person.person_id = pseudo.id
where person.person_id is null
I am running many instances of a webcrawler in parallel.
Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.
Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.
I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).
How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.
This code seems like a good solution (see the error below, however):
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT companies.id FROM companies
LEFT OUTER JOIN crawlLog
ON companies.id = crawlLog.companyId
WHERE crawlLog.companyId IS NULL
LIMIT 1
),
now()
)
but I keep getting the following mysql error:
You can't specify target table 'crawlLog' for update in FROM clause
Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT id
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
),
now()
)
You can lock tables using the MySQL LOCK TABLES command like this:
LOCK TABLES tablename WRITE;
# Do other queries here
UNLOCK TABLES;
See:
http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.
BEGIN;
SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE;
# do whatever you have to do
COMMIT;
Please note that you will need an index on domainname (or whatever column you use in the WHERE-clause) for this to work, but this makes sense in general and I assume you will have that anyway.
You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."
Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:
START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;
Or something close to that.
[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:
INSERT INTO loggingtable (url) SELECT u.url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url=u.url WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;
I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.
INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
SELECT companyId
FROM crawlLog AS crawlLogAlias
)
LIMIT 1
I wouldn't use locking, or transactions.
The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.
Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.
You would run the following query if crawler 1 wants to start crawling url 2:
INSERT INTO tbllogging (log_cra_id, log_url_id)
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id=log_url_id
WHERE url_id=2 AND log_url_id IS NULL;
The next step is to check whether this record has been inserted.
SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1
If you get any results then crawler 1 can crawl this url. If you don't get any results, this means that another crawler has already inserted a row for that url and is crawling it.
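The insert-then-check sequence can be sketched as a small helper. This uses Python's sqlite3 module as a stand-in for MySQL, runs the two crawlers one after the other on a single connection, and the function name try_claim is made up for the example, so it shows the logic of the two queries rather than real concurrent behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblurl (url_id INTEGER PRIMARY KEY);
    CREATE TABLE tbllogging (log_cra_id INTEGER, log_url_id INTEGER);
    INSERT INTO tblurl VALUES (1), (2);
""")

def try_claim(crawler_id, url_id):
    """Insert a log row only if the url has none, then check who holds it."""
    conn.execute(
        "INSERT INTO tbllogging (log_cra_id, log_url_id) "
        "SELECT ?, url_id FROM tblurl "
        "LEFT JOIN tbllogging ON url_id = log_url_id "
        "WHERE url_id = ? AND log_url_id IS NULL",
        (crawler_id, url_id),
    )
    conn.commit()
    row = conn.execute(
        "SELECT 1 FROM tbllogging WHERE log_url_id = ? AND log_cra_id = ?",
        (url_id, crawler_id),
    ).fetchone()
    return row is not None

first = try_claim(1, 2)   # crawler 1 asks for url 2
second = try_claim(2, 2)  # crawler 2 asks for the same url afterwards
print(first, second)
```

Crawler 1 gets the url and crawler 2 is turned away, because the second insert finds an existing log row and adds nothing.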
It's better to use a row lock or a transaction-based query, so that other parallel requests can still access the table.