I have a LEFT JOIN query that shows all the fields from a primary table (tblMarkers) and the values from a second table (tblLocations) where there is a matching record.
tblLocations does not have a record for every ID in tblMarkers.
$query ="SELECT `tblMarkers`.*,`tblLocation`.*,`tblLocation`.`ID` AS `markerID`
FROM
`tblMarkers`
LEFT JOIN `tblLocation` ON `tblMarkers`.`ID` = `tblLocation`.`ID`
WHERE
`tblMarkers`.`ID` = $id";
I am comfortable with using UPDATE to update the tblMarkers fields, but how do I UPDATE or INSERT a record into tblLocations if the record does not exist there yet?
Also, how do I lock the record I am working on to prevent someone else from doing an update at the same time?
Can I also use UPDATE tblMarkers *, or do I have to list every field in the UPDATE statement?
Unfortunately you might have to implement some validation in your outside script. There is an IF statement in SQL, but I'm not sure if you can trigger different commands based on its outcome.
Locking
In terms of locking, you have two options. For MyISAM tables, you can only lock the entire table (see http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html):
LOCK TABLES users WRITE;
For InnoDB tables, there is no explicit 'lock' for single rows; however, you can use transactions to get exclusive rights during the operation: http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html
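A minimal sketch of that pattern, assuming the tblLocation table from the question, a placeholder column called Location, and a placeholder ID of 123:

START TRANSACTION;
-- FOR UPDATE takes an exclusive row lock, so other sessions wait here
SELECT * FROM tblLocation WHERE ID = 123 FOR UPDATE;
-- ... examine the row, then write the change ...
UPDATE tblLocation SET Location = 'new value' WHERE ID = 123;
COMMIT;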
Update
There might be some shorthand notation, but I think you have to list every field in your query. Alternatively, you can always read the entire row, delete it, and insert it again using the shorthand INSERT syntax. It all depends on how many fields you've got.
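For the insert-or-update half of the question, MySQL also has INSERT ... ON DUPLICATE KEY UPDATE, which covers both cases in a single statement. A minimal sketch, assuming tblLocation has a primary key on ID and a placeholder column called Location:

INSERT INTO tblLocation (ID, Location)
VALUES (123, 'new value')
ON DUPLICATE KEY UPDATE Location = VALUES(Location);

If a row with ID 123 already exists it is updated; otherwise a new row is inserted.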
I have a MySQL database with just 1 table:
Fields are: blocknr (not unique), btcaddress (not unique), txid (not unique), vin, vinvoutnr, netvalue.
Indexes exist on both btcaddress and txid.
The data looks like this (the example screenshot isn't reproduced here): I need to delete all "deletable" record pairs; an example pair was marked in red in that screenshot.
Conditions are:
txid must be the same (there can be more than 2 records with the same txid)
vinvoutnr must be the same
vin must be different (it can only have 2 values, 0 and 1, so one record must have 0 and the other 1)
In a table of 36M records, about 33M records will be deleted.
I've used this:
delete t1
from registration t1
inner join registration t2
where t1.txid=t2.txid and t1.vinvoutnr=t2.vinvoutnr and t1.vin<>t2.vin;
It works but takes 5 hours.
Maybe this would work too (not tested yet):
delete t1
from registration as t1, registration as t2
where t1.txid=t2.txid and t1.vinvoutnr=t2.vinvoutnr and t1.vin<>t2.vin;
Or should I forget about a delete query and instead make a new table containing all the non-deletables, then drop the original?
Database can be offline for this delete query.
Based on your question, you are deleting most of the rows in the table. That is just really expensive. A better approach is to empty the table and re-populate it:
create table temp_registration as
<query for the rows to keep here>;
truncate table registration;
insert into registration
select *
from temp_registration;
Your logic is a bit hard to follow, but I think the logic on the rows to keep is:
select r.*
from registration r
where not exists (select 1
from registration r2
where r2.txid = r.txid and
r2.vinvoutnr = r.vinvoutnr and
r2.vin <> r.vin
);
For best performance, you want an index on registration(txid, vinvoutnr, vin).
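A minimal sketch of creating that index (the index name is just a placeholder):

create index idx_txid_vinvoutnr_vin on registration (txid, vinvoutnr, vin);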
Given that you expect to remove the majority of your data, it does sound like the simplest approach would be to create a new table with the correct data and then drop the original table, as you suggest. Otherwise, ADyson's corrections to the JOIN query might help to alleviate the performance issue.
I can't seem to find the answer to this anywhere. I am reading a CSV into a data frame using the read.csv function. Then I am writing the data frame contents to a MySQL table using dbWriteTable. This works great for the initial run that creates the table, but each run after this needs to do either an insert or an update depending on whether the record already exists in the table.
The 1st column in the data frame is the primary key, and the other columns contain data that might change every time I pull a new copy of the CSV. Each time I pull the CSV, if the primary key already exists, I want it to update that record with the new data, and if the primary key does not exist (e.g. a new key since the last run), I want it to just insert the record into the table.
This is my current dbWriteTable. This creates the table just fine the 1st time it's run, and also inserts a "Timestamp" column into the table that is set to "on update CURRENT_TIMESTAMP" so that I know when each record was last updated.
dbWriteTable(mydb, value=csvData, name=Table, row.names=FALSE,
             field.types=list(PrimaryKey="VARCHAR(10)",
                              Column2="VARCHAR(255)",
                              Column3="VARCHAR(255)",
                              Timestamp="TIMESTAMP"),
             append=TRUE)
Now the next time I run this, I simply want it to update any PrimaryKeys that are already in the table, and add any new ones. I also don't want to lose any records in the event a PrimaryKey disappears from the CSV source.
Is it possible to do this kind of update using dbWriteTable, or some other R function?
If that's not possible, is it possible to just run a mysql query that would delete any duplicate PrimaryKey records and keep just the 1 record with the most current timestamp? So I would run a dbWriteTable to append the new data, and then run a MySQL query to prune out the older records.
Obviously I couldn't define that 1st column as an actual primary key in the DB, since my append/delete solution wouldn't work due to duplicate keys. That's fine; I can always add an auto-increment integer column to the table as the "real" primary key if needed.
Thoughts?
Consider using a temp table (an exact replica of the final table but with fewer records), and then run an INSERT and an UPDATE query against the final table; these handle both cases without overlap (plus, primary keys are constraints, so the queries will error out if an attempt is made to duplicate one):
records to append if they do not exist - using the LEFT JOIN ... IS NULL query
records to update if they do exist - using the UPDATE ... INNER JOIN query
Concerning the former, there is a regular debate among SQL coders over whether LEFT JOIN ... IS NULL, NOT IN, or NOT EXISTS is the optimal solution, which of course "depends". The LEFT JOIN used here does avoid subqueries, but consider those avenues if needed.
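For comparison, a NOT EXISTS version of the append step (using the same hypothetical finalTable, tempTable, and PrimaryKey names as the queries below) might look like:

INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
SELECT tmp.PrimaryKey, tmp.Column2, tmp.Column3, tmp.Timestamp
FROM tempTable tmp
WHERE NOT EXISTS (SELECT 1
                  FROM finalTable fin
                  WHERE fin.PrimaryKey = tmp.PrimaryKey);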
# DELETE LAST SET OF TEMP DATA
dbSendQuery(mydb, "DELETE FROM tempTable")
# APPEND R DATA FRAME TO TEMP DATA
dbWriteTable(mydb, value=csvData, name=tempTable, row.names=FALSE,
field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
append=TRUE, overwrite=FALSE)
# LEFT JOIN ... NULL QUERY TO APPEND NEW RECORDS NOT IN TABLE
dbSendQuery(mydb, "INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
                   SELECT tmp.PrimaryKey, tmp.Column2, tmp.Column3, tmp.Timestamp
                   FROM tempTable tmp
                   LEFT JOIN finalTable fin
                     ON tmp.PrimaryKey = fin.PrimaryKey
                   WHERE fin.PrimaryKey IS NULL;")
# UPDATE INNER JOIN QUERY TO UPDATE MATCHING RECORDS
dbSendQuery(mydb, "UPDATE finalTable fin
                   INNER JOIN tempTable tmp
                     ON fin.PrimaryKey = tmp.PrimaryKey
                   SET fin.Column2 = tmp.Column2,
                       fin.Column3 = tmp.Column3,
                       fin.Timestamp = tmp.Timestamp;")
For the most part, the queries above will be compliant in most SQL backends should you ever need to change databases. Some RDBMSs do not support UPDATE ... INNER JOIN, but equivalent alternatives are available. Finally, the beauty of this route is that all processing is handled in the SQL engine and not in R.
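For instance, a portable rewrite of the update step using correlated subqueries (same hypothetical table and column names) could look like this:

UPDATE finalTable
SET Column2 = (SELECT tmp.Column2 FROM tempTable tmp
               WHERE tmp.PrimaryKey = finalTable.PrimaryKey),
    Column3 = (SELECT tmp.Column3 FROM tempTable tmp
               WHERE tmp.PrimaryKey = finalTable.PrimaryKey)
    -- repeat for Timestamp and any other columns
WHERE EXISTS (SELECT 1 FROM tempTable tmp
              WHERE tmp.PrimaryKey = finalTable.PrimaryKey);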
Sounds like you're trying to do an upsert.
I'm kind of rusty with MySQL, but the general idea is that you need to have a staging table to upload the new CSV into, and then do the insert/update in the database itself.
For that you need to use dbSendQuery with INSERT ... ON DUPLICATE KEY UPDATE.
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
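A minimal sketch of that statement, with stagingTable and finalTable as placeholder names and the PrimaryKey/Column2/Column3 columns from the question; it assumes PrimaryKey is declared as a primary or unique key on the final table, and it would be passed to dbSendQuery after the CSV has been appended to the staging table:

INSERT INTO finalTable (PrimaryKey, Column2, Column3)
SELECT PrimaryKey, Column2, Column3
FROM stagingTable
ON DUPLICATE KEY UPDATE
    Column2 = VALUES(Column2),
    Column3 = VALUES(Column3);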
I have two tables T1 and T2 and want to update one field of T1 from T2 where T2 holds massive data.
What is more efficient?
Updating T1 in a for loop, iterating over the values,
or
Left joining it with T2 and updating.
Please note that I'm updating these tables in a shell script.
In general, the JOIN will always work much better than a loop. The size should not be an issue if the tables are properly indexed.
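A minimal sketch of the join-based update, assuming T1.id matches T2.t1_id and that T2.val holds the new values (all of these names are placeholders):

UPDATE T1
JOIN T2 ON T2.t1_id = T1.id
SET T1.field1 = T2.val;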
There is no simple answer as to which will be more effective; it will depend on the table size and on how much data you are going to update in one go.
Suppose you are using the InnoDB engine and trying to update 1,000 or more rows in one go with a join of two heavy tables, and you do this quite frequently: that will not be a good idea on a production server, as it will lock the rows for some time, and other operations on the server can be affected by that locking.
Option 1: If you are updating only a few rows and the join is based on properly indexed fields (preferably the primary key), then you can go with the join.
Option 2: If you are updating a large amount of data based on a multi-table join, then the approach below will be better:
Step1: Create a stored procedure.
Step2: Keep the results of the query below in a cursor.
Suppose you want to update field1 of table1 with the corresponding field2 data from table2:
SELECT a.primary_key,b.field2 FROM table1 a JOIN table2 b ON a.primary_key=b.foreign_key WHERE [place CONDITION here IF any...];
Step3: Now update the rows one by one, based on the primary key, using the values stored in the cursor.
Step4: You can call this stored procedure from your script.
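A rough sketch of such a procedure, using the table, column, and variable names from the example query above (treat them all as placeholders):

DELIMITER $$
CREATE PROCEDURE update_table1_from_table2()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE v_key INT;
    DECLARE v_field2 VARCHAR(255);
    -- cursor over the join described in Step2
    DECLARE cur CURSOR FOR
        SELECT a.primary_key, b.field2
        FROM table1 a
        JOIN table2 b ON a.primary_key = b.foreign_key;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_key, v_field2;
        IF done THEN
            LEAVE read_loop;
        END IF;
        -- Step3: update one row at a time by primary key
        UPDATE table1 SET field1 = v_field2 WHERE primary_key = v_key;
    END LOOP;
    CLOSE cur;
END$$
DELIMITER ;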
I make two queries in a transaction: a SELECT (containing a JOIN clause) and an UPDATE. It is required that the data in the selected rows doesn't change before the update is done, so I'm using the FOR UPDATE clause. My question is: does FOR UPDATE work only on the data selected from the table specified in the FROM clause, or on the data from the joined tables as well? My DBMS is MySQL.
The documentation simply says that the lock is on the rows read, without making an exception for joined tables, so it should be on all records in all the joined tables. If you want to lock only the rows in one of the tables, you can do that separately: SELECT 1 FROM keytable WHERE ... FOR UPDATE.
That said, this is not needed simply to prevent an update between the SELECT and the UPDATE. The read lock on the SELECT already does this. The purpose of FOR UPDATE would be to prevent another transaction from reading the rows and thus potentially causing a deadlock, because the UPDATE cannot be applied until the other transaction releases its read lock.
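A minimal sketch of limiting the lock to one table inside the transaction (keytable, othertable, and the literal id value are placeholders):

START TRANSACTION;
-- lock only the keytable rows we intend to modify
SELECT 1 FROM keytable WHERE id = 42 FOR UPDATE;
-- a plain SELECT on the join reads the related data without locking othertable
SELECT k.*, o.*
FROM keytable k
JOIN othertable o ON o.key_id = k.id
WHERE k.id = 42;
UPDATE keytable SET status = 'done' WHERE id = 42;
COMMIT;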
Does anyone know what would be more efficient and use less resources:
Method 1-- Using a single SELECT statement to get data from one table and then iterating through it to execute multiple UPDATEs on another table. E.G. (pseudo-code, execute() runs query):
Query1_resultset = execute("SELECT item_id, SUM(views) AS view_count FROM tableA WHERE condition=1 GROUP BY item_id");
while(Query1_resultset as row) {
execute("UPDATE tableB SET view_count=row.view_count WHERE id=row.item_id");
}
Method 2-- Use a single INSERT.. ON DUPLICATE KEY UPDATE statement with a nested SELECT statement. E.G.:
INSERT INTO tableB (id, view_count)
SELECT item_id, SUM(views) AS view_count
FROM tableA
WHERE condition=1
GROUP BY item_id
ON DUPLICATE KEY UPDATE view_count = VALUES(view_count);
Note: ID on tableB is a primary key. There actually won't be any INSERTS because I know the key will exist. So it's all UPDATEs. Just using this statement to pass in a single query rather than multiple.
I'm really curious as to why either would be more efficient. Is it the number of queries that determines how quickly it will run? Where is the bottleneck?
I'm looking for something that will scale (the number of rows being updated grows daily).
Any ideas?
Thanks
It depends on your update/insert ratio. If you have lots of inserts and only a couple of updates, then the INSERT ... ON DUPLICATE KEY UPDATE statement will be faster.
If you mainly have updates, then you would be better off with an UPDATE statement and an insert as a fallback (if there was no update). By the way, you could use the multi-table UPDATE syntax to do it with a single update instead of a select followed by an update. If you're doing both a SELECT and an UPDATE, then the INSERT will definitely be faster.
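A minimal sketch of that multi-table update, reusing the tableA/tableB names from the question (illustration only):

UPDATE tableB b
JOIN (SELECT item_id, SUM(views) AS view_count
      FROM tableA
      WHERE condition = 1
      GROUP BY item_id) a ON a.item_id = b.id
SET b.view_count = a.view_count;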
I think INSERT ... ON DUPLICATE KEY UPDATE is more efficient (otherwise, it wouldn't make much sense to add such an extension). By the way, your first example is not exactly the same as the second one - you neither use transactions nor lock the table, so it's possible that a record returned by the SELECT will no longer exist by the time you execute the UPDATE.