I have a table, say Table1, in MySQL. Our application stores data in Table1, and millions of new records are saved daily. We need to extract data from this table, transform it, and load it into a new table, say Table2 (a kind of ETL process), and this should happen live at intervals of a few seconds. How can I do this efficiently and without copying duplicate records from Table1?
I thought of introducing a new field in Table1, say Extracted, to keep track of extraction. If a row has already been extracted, Extracted will have the value Y; if not, it will have the value N, meaning the row still needs to be extracted. This means the ETL job has to update Extracted in Table1 after extraction. What I am wondering is: would it be efficient to update records in such a huge table, where millions of new rows are saved daily? Please suggest!
Thank You Guys!!
If you need to keep the data in Table2 in sync (and slightly modified) with the data in Table1, you have a couple of options at the MySQL level:
Triggers - create AFTER INSERT, UPDATE and DELETE triggers that transfer the data to Table2 immediately and do the transformation for you.
Views - if the data in Table2 is read-only, create a view whose SELECT definition does the required transformation.
The advantage of both approaches is that Table2 is always up to date with Table1 and no extra fields are required.
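A minimal sketch of the trigger option, assuming Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The columns (name, amount) and the transformation (upper-casing, converting to cents) are invented for illustration. In MySQL the trigger definition would additionally need FOR EACH ROW, and UPDATE and DELETE triggers would follow the same pattern.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
trigger_conn = sqlite3.connect(":memory:")
trigger_conn.executescript("""
    -- Hypothetical schemas: only the table names come from the question.
    CREATE TABLE Table1 (id INTEGER PRIMARY KEY, name TEXT, amount REAL);
    CREATE TABLE Table2 (id INTEGER PRIMARY KEY, name_upper TEXT, amount_cents INTEGER);

    -- Every insert into Table1 is transformed and copied within the same write.
    CREATE TRIGGER table1_after_insert AFTER INSERT ON Table1
    BEGIN
        INSERT INTO Table2 (id, name_upper, amount_cents)
        VALUES (NEW.id, UPPER(NEW.name), CAST(ROUND(NEW.amount * 100) AS INTEGER));
    END;
""")

# The application only writes to Table1; Table2 is filled by the trigger.
trigger_conn.execute("INSERT INTO Table1 (id, name, amount) VALUES (1, 'alice', 12.5)")
trigger_rows = trigger_conn.execute(
    "SELECT id, name_upper, amount_cents FROM Table2"
).fetchall()
print(trigger_rows)
```

Because the copy happens inside the write to Table1, no Extracted flag is needed, at the cost of slightly slower inserts.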
Use this to ignore duplicate records:
INSERT IGNORE INTO Table2 (field1,field2) VALUES (x,y);
Or use this to update the record if a duplicate already exists in the table:
INSERT INTO Table2 (field1,field2) VALUES (x,y) ON DUPLICATE KEY UPDATE primarykey=primarykeyvalues;
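To see the difference between the two statements, here is a small sketch using Python's sqlite3 module in place of MySQL (an assumption made so the example is self-contained). SQLite spells the same ideas INSERT OR IGNORE and ON CONFLICT ... DO UPDATE; the latter needs SQLite 3.24 or newer. Treating field1 as the primary key is also an assumption.

```python
import sqlite3

upsert_conn = sqlite3.connect(":memory:")
# Assumption: field1 is the primary key that defines what counts as a duplicate.
upsert_conn.execute("CREATE TABLE Table2 (field1 INTEGER PRIMARY KEY, field2 TEXT)")
upsert_conn.execute("INSERT INTO Table2 (field1, field2) VALUES (1, 'old')")

# SQLite counterpart of MySQL's INSERT IGNORE: the duplicate is silently skipped.
upsert_conn.execute("INSERT OR IGNORE INTO Table2 (field1, field2) VALUES (1, 'ignored')")
after_ignore = upsert_conn.execute(
    "SELECT field2 FROM Table2 WHERE field1 = 1"
).fetchone()[0]

# SQLite counterpart of ON DUPLICATE KEY UPDATE: the existing row is overwritten.
upsert_conn.execute(
    "INSERT INTO Table2 (field1, field2) VALUES (1, 'new') "
    "ON CONFLICT(field1) DO UPDATE SET field2 = excluded.field2"
)
after_upsert = upsert_conn.execute(
    "SELECT field2 FROM Table2 WHERE field1 = 1"
).fetchone()[0]
row_count = upsert_conn.execute("SELECT COUNT(*) FROM Table2").fetchone()[0]

print(after_ignore, after_upsert, row_count)
```

Either way the table ends up with a single row for the key; the two statements only differ in whether the old or the new values survive.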
My Python application generates a CSV file containing a few hundred unique records, one unique row per line. It runs hourly and very often the data remains the same from one run to another. If there are changes, then they are small, e.g.
one record removed.
a few new records added.
occasional update to an existing record.
Each record is just four simple fields (name, date, id, description), and there will be no more than 10,000 records by the time the project is at maximum, so it can all be contained in a single table.
What is the best way to merge changes into the table?
A few approaches I'm considering are:
1) empty the table and re-populate on each run.
2) write the latest data to a staging table and run a DB job to merge the changes into the main table.
3) read the existing table data into my python script, collect the new data, find the differences, run multiple 'CRUD' operations to apply the changes one by one
Can anyone suggest a better way?
thanks
I would do this in the following way:
Load the new CSV file into a second table.
DELETE rows in the main table that are missing from the second table:
DELETE m FROM main_table AS m
LEFT OUTER JOIN new_table AS t ON m.id = t.id
WHERE t.id IS NULL;
Use INSERT ... ON DUPLICATE KEY UPDATE to insert new rows and update rows that need to be updated. This becomes a no-op on each row that already contains the same values.
INSERT INTO main_table (id, name, date, description)
SELECT id, name, date, description FROM new_table
ON DUPLICATE KEY UPDATE
name = VALUES(name), date = VALUES(date), description = VALUES(description);
Drop the second table once you're done with it.
This is assuming id is the primary key and you have no other UNIQUE KEY in the table.
Given a data set size of 10,000 rows, this should be quick enough to do it in one batch. Once the data set gets 10x larger, you may have to reconsider the solution, for example do batches of 10,000 rows at a time.
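The whole procedure can be sketched end to end as follows. This assumes Python's sqlite3 module instead of MySQL so it runs as-is: the multi-table DELETE becomes a NOT IN subquery, ON DUPLICATE KEY UPDATE becomes ON CONFLICT ... DO UPDATE (SQLite 3.24+), and the sample rows are made up. The function name sync_from_staging is hypothetical.

```python
import sqlite3

def sync_from_staging(conn: sqlite3.Connection) -> None:
    """Make main_table match new_table: delete missing rows, upsert the rest."""
    with conn:  # run both statements in one transaction
        # Step 1: remove rows that no longer appear in the fresh load.
        conn.execute("DELETE FROM main_table WHERE id NOT IN (SELECT id FROM new_table)")
        # Step 2: insert new rows and overwrite changed ones.
        # SQLite needs "WHERE true" here so ON CONFLICT is not parsed as a join clause.
        conn.execute("""
            INSERT INTO main_table (id, name, date, description)
            SELECT id, name, date, description FROM new_table WHERE true
            ON CONFLICT(id) DO UPDATE SET
                name = excluded.name,
                date = excluded.date,
                description = excluded.description
        """)

sync_conn = sqlite3.connect(":memory:")
sync_conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, name TEXT, date TEXT, description TEXT);
    CREATE TABLE new_table  (id INTEGER PRIMARY KEY, name TEXT, date TEXT, description TEXT);
    INSERT INTO main_table VALUES (1, 'a', '2020-01-01', 'removed in next CSV'),
                                  (2, 'b', '2020-01-01', 'will change'),
                                  (3, 'c', '2020-01-01', 'unchanged');
    INSERT INTO new_table  VALUES (2, 'b2', '2020-01-02', 'changed'),
                                  (3, 'c', '2020-01-01', 'unchanged'),
                                  (4, 'd', '2020-01-02', 'brand new');
""")
sync_from_staging(sync_conn)
synced = sync_conn.execute("SELECT id, name FROM main_table ORDER BY id").fetchall()
print(synced)
```

Running both steps in one transaction means readers never see the table half-synchronized.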
I can't seem to find the answer to this anywhere. I am reading a csv into a data frame using the read.csv function. Then I am writing the data frame contents to a mysql table using dbWriteTable. This works great for the initial run to create the table, but each run after this needs to do either an insert or an update depending on whether the record already exists in the table.
The 1st column in the data frame is the primary key, and the other columns contain data that might change every time I pull a new copy of the csv. Each time I pull the CSV, if the primary key already exists, I want it to update that record with the new data, and if the primary key does not exist (e.g. a new key since the last run), I want it to just insert the record into the table.
This is my current dbWriteTable. This creates the table just fine the 1st time it's run, and also inserts a "Timestamp" column into the table that is set to "on update CURRENT_TIMESTAMP" so that I know when each record was last updated.
dbWriteTable(mydb, value=csvData, name=Table, row.names=FALSE, field.types=list(PrimaryKey="VARCHAR(10)",Column2="VARCHAR(255)",Column3="VARCHAR(255)",Timestamp="TIMESTAMP"), append=TRUE)
Now the next time I run this, I simply want it to update any PrimaryKeys that are already in the table, and add any new ones. I also don't want to lose any records in the event a PrimaryKey disappears from the CSV source.
Is it possible to do this kind of update using dbWriteTable, or some other R function?
If that's not possible, is it possible to just run a mysql query that would delete any duplicate PrimaryKey records and keep just the 1 record with the most current timestamp? So I would run a dbWriteTable to append the new data, and then run a MySQL query to prune out the older records.
Obviously I couldn't define that 1st column as an actual PrimaryKey in the DB as my append/delete solution wouldn't work due to duplicate keys, and that's fine, I can always add an auto increment integer column to the table for the "real" primary key if needed.
Thoughts?
Consider using a temp table (an exact replica of the final table but with fewer records) and then running an INSERT and an UPDATE query against the final table. Together they handle both cases without overlap (plus, primary keys are constraints, so queries will error out if they attempt to duplicate any):
records to append if they do not exist, using the LEFT JOIN ... NULL query
records to update if they do exist, using the UPDATE INNER JOIN query
Concerning the former, there is a regular debate among SQL coders over whether LEFT JOIN ... NULL, NOT IN, or NOT EXISTS is the optimal solution, which of course "depends". The LEFT JOIN used here does avoid subqueries, but consider the other avenues if needed.
# DELETE LAST SET OF TEMP DATA
dbSendQuery(mydb, "DELETE FROM tempTable")
# APPEND R DATA FRAME TO TEMP DATA
dbWriteTable(mydb, value=csvData, name="tempTable", row.names=FALSE,
field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
append=TRUE, overwrite=FALSE)
# LEFT JOIN ... NULL QUERY TO APPEND NEW RECORDS NOT IN TABLE
dbSendQuery(mydb, "INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
SELECT t.PrimaryKey, t.Column2, t.Column3, t.Timestamp
FROM tempTable t
LEFT JOIN finalTable f
ON t.PrimaryKey = f.PrimaryKey
WHERE f.PrimaryKey IS NULL;")
# UPDATE INNER JOIN QUERY TO UPDATE MATCHING RECORDS
dbSendQuery(mydb, "UPDATE finalTable f
INNER JOIN tempTable t
ON f.PrimaryKey = t.PrimaryKey
SET f.Column2 = t.Column2,
f.Column3 = t.Column3,
f.Timestamp = t.Timestamp;")
For the most part, the queries above will be compliant with most SQL backends should you ever need to change databases. Some RDBMSs do not support UPDATE INNER JOIN, but equivalent alternatives are available. Finally, the beauty of this route is that all processing is handled in the SQL engine and not in R.
Sounds like you're trying to do an upsert.
I'm kind of rusty with MySQL but the general idea is that you need to have a staging table to upload the new CSV, and then in the database itself do the insert/update.
For that you need to use dbSendQuery with INSERT ... ON DUPLICATE KEY UPDATE.
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
I have two tables T1 and T2 and want to update one field of T1 from T2 where T2 holds massive data.
What is more efficient?
Updating T1 in a for loop iteration over the values
or
Left join it with T2 and update.
Please note that I'm updating these tables from a shell script.
In general, the JOIN will always work much better than a loop. The size should not be an issue if it is properly indexed.
There is no simple answer as to which will be more efficient; it depends on the table size and on how much data you are going to update in one go.
Suppose you are using the InnoDB engine and frequently update 1,000 or more rows in one go through a join of two heavy tables. That is not a good idea on a production server, because it will hold locks on your table for some time and other operations on the server can be held up by that locking.
Option 1: If you are updating only a few rows and the join uses properly indexed fields (preferably the primary key), you can go with the join.
Option 2: If you are updating a large amount of data based on a multi-table join, the following approach will be better:
Step1: Create a stored procedure.
Step2: Keep below query results in a cursor.
Suppose you want to update field1 of table1 with the corresponding field2 data from table2:
SELECT a.primary_key,b.field2 FROM table1 a JOIN table2 b ON a.primary_key=b.foreign_key WHERE [place CONDITION here IF any...];
Step3: Now update all rows one by one based on primary key using stored values in cursor.
Step4: You can call this stored procedure from your script.
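The idea behind Option 2 (collect the key/value pairs once, then apply small updates by primary key) can be sketched as below. Note the assumptions: this is a client-side loop in Python with sqlite3, not a MySQL stored procedure with a cursor, and it commits after each small batch rather than after every row. The function name update_in_batches, the batch size and the sample data are hypothetical.

```python
import sqlite3

def update_in_batches(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Copy table2.field2 into table1.field1 in short transactions; return batch count."""
    # Step 2 of the answer: run the join once and keep the result.
    pairs = conn.execute(
        "SELECT a.primary_key, b.field2 "
        "FROM table1 a JOIN table2 b ON a.primary_key = b.foreign_key"
    ).fetchall()

    batches = 0
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        with conn:  # commit after each chunk so locks are held only briefly
            # Step 3 of the answer: update by primary key.
            conn.executemany(
                "UPDATE table1 SET field1 = ? WHERE primary_key = ?",
                [(field2, pk) for pk, field2 in chunk],
            )
        batches += 1
    return batches

batch_conn = sqlite3.connect(":memory:")
batch_conn.executescript("""
    CREATE TABLE table1 (primary_key INTEGER PRIMARY KEY, field1 TEXT);
    CREATE TABLE table2 (foreign_key INTEGER, field2 TEXT);
    INSERT INTO table1 (primary_key) VALUES (1), (2), (3), (4), (5);
    INSERT INTO table2 VALUES (1, 'v1'), (2, 'v2'), (3, 'v3'), (4, 'v4'), (5, 'v5');
""")
batch_count = update_in_batches(batch_conn, batch_size=2)
updated = batch_conn.execute(
    "SELECT primary_key, field1 FROM table1 ORDER BY primary_key"
).fetchall()
print(batch_count, updated)
```

The trade-off is more round trips in exchange for shorter lock times; with few rows, the single JOIN update from Option 1 is simpler.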
I have a table that stores the summed values of a large table. I'm not calculating them on the fly as I need them frequently.
What is the best way to update these values?
I could delete the relevant rows from the table, do a full group by sum on all the relevant lines and then insert the new data.
Or I could index a timestamp column on the main table, and then only sum the latest values and add them to the existing data. This is complicated because some sums won't exist so both an insert and an update query would need to run.
I realize that the answer depends on the particulars of the data, but what I want to know is if it is ever worth doing the second method; if there are millions of rows being summed in the first example and only tens in the second, would the second be significantly faster to execute?
You can try triggers on insert/update/delete. In the trigger you check the inserted or deleted value and modify the sum in the second table accordingly.
http://dev.mysql.com/doc/refman/5.0/en/triggers.html
As I see it, there are several ways:
Make a view, which will always be up to date (I don't know whether MySQL supports materialized views).
Make a table that is kept up to date using a trigger (on update/delete/insert, for example) or using a nightly batch (so the data will be up to one day old).
Make a stored procedure that retrieves and computes only the data needed.
I would do something like this (INSERT UPDATE):
mysql_query("
INSERT INTO sum_table (col1, col2)
SELECT id, SUM(value)
FROM `table`
GROUP BY id
ON DUPLICATE KEY UPDATE col2 = VALUES(col2)
");
Please let me know if you need more examples.
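For the second method from the question (summing only the latest rows and adding them to the stored totals), a single upsert can cover both the insert and the update case. The sketch below assumes Python's sqlite3 module rather than MySQL, a hypothetical detail table with an integer created_at column, and a caller that remembers the last timestamp it processed. In MySQL the last clause would read ON DUPLICATE KEY UPDATE col2 = col2 + VALUES(col2).

```python
import sqlite3

def apply_new_rows(conn: sqlite3.Connection, last_seen: int) -> int:
    """Add sums of rows newer than last_seen to sum_table; return the new watermark."""
    with conn:
        # Fix the upper bound first so rows arriving later wait for the next run.
        new_mark = conn.execute(
            "SELECT COALESCE(MAX(created_at), ?) FROM detail", (last_seen,)
        ).fetchone()[0]
        conn.execute("""
            INSERT INTO sum_table (col1, col2)
            SELECT id, SUM(value) FROM detail
            WHERE created_at > ? AND created_at <= ?
            GROUP BY id
            ON CONFLICT(col1) DO UPDATE SET col2 = col2 + excluded.col2
        """, (last_seen, new_mark))
    return new_mark

sum_conn = sqlite3.connect(":memory:")
sum_conn.executescript("""
    CREATE TABLE detail (id INTEGER, value INTEGER, created_at INTEGER);
    CREATE INDEX detail_created_at ON detail (created_at);
    CREATE TABLE sum_table (col1 INTEGER PRIMARY KEY, col2 INTEGER);
    INSERT INTO detail VALUES (1, 10, 100), (1, 5, 101), (2, 7, 102);
""")
mark = apply_new_rows(sum_conn, 0)       # first run sums everything
sum_conn.execute("INSERT INTO detail VALUES (1, 3, 103)")
sum_conn.execute("INSERT INTO detail VALUES (3, 4, 104)")
sum_conn.commit()
mark = apply_new_rows(sum_conn, mark)    # second run touches only the two new rows
totals = sum_conn.execute("SELECT col1, col2 FROM sum_table ORDER BY col1").fetchall()
print(mark, totals)
```

This only works when rows in the main table are never updated or deleted after being summed; otherwise the stored totals drift and a full recomputation is needed.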
I have a table containing about 500,000 rows. Once a day, I will try to synchronize this table with an external API. Most of the time, few or no changes have been made since the last update. My question is basically: how should I construct my MySQL query for best performance? I have thought about using INSERT IGNORE, but it doesn't feel like the best way to go, since only a few rows will be inserted and MySQL must loop through all rows in the table. I have also thought about using LOAD DATA INFILE to insert all rows into a temporary table, then selecting the rows not already in my original table, and then removing the temporary table. Maybe someone else has a better suggestion?
Thank you in advance!
I usually use a temporary table and the LOAD DATA INFILE bulk loader. The bulk loader is much more efficient than trying to insert records using a dynamically created query.
If you index your permanent tables with appropriate unique keys that relate to the keys in the API, then you should find that the INSERT and UPDATE statements work pretty fast. An example of the type of INSERT query I use is as follows:
INSERT INTO keywords(api_adgroup_id, api_keyword_id, keyword_text, match_type, status)
SELECT a.api_adgroup_id, a.api_keyword_id, a.keyword_text, a.match_type, a.status
FROM tmp_keywords a LEFT JOIN keywords b ON a.api_adgroup_id = b.api_adgroup_id AND a.api_keyword_id = b.api_keyword_id
WHERE b.api_keyword_id IS NULL
In this example, I perform an OUTER JOIN on the keywords table to check if it already exists. Only new rows in the temporary table where there isn't a match in the main table (the api_keyword_id in the keywords table is NULL) are inserted.
Also note that in this example I need to use both the ad group id AND the keyword id to uniquely identify the keyword because the AdWords API gives the same keyword/match type combination the same id when it exists in more than one ad group.
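Here is a runnable sketch of that anti-join insert with a two-column key. It assumes Python's sqlite3 module in place of MySQL and plain INSERTs in place of LOAD DATA INFILE; the sample rows are invented.

```python
import sqlite3

kw_conn = sqlite3.connect(":memory:")
kw_conn.executescript("""
    CREATE TABLE keywords (
        api_adgroup_id INTEGER, api_keyword_id INTEGER,
        keyword_text TEXT, match_type TEXT, status TEXT,
        PRIMARY KEY (api_adgroup_id, api_keyword_id)
    );
    CREATE TEMP TABLE tmp_keywords (
        api_adgroup_id INTEGER, api_keyword_id INTEGER,
        keyword_text TEXT, match_type TEXT, status TEXT
    );
    -- Already stored from a previous run.
    INSERT INTO keywords VALUES (1, 10, 'shoes', 'exact', 'active');
    -- Fresh download: one known row, the same keyword id in another ad group, one new keyword.
    INSERT INTO tmp_keywords VALUES (1, 10, 'shoes', 'exact', 'active'),
                                    (2, 10, 'shoes', 'exact', 'active'),
                                    (1, 11, 'boots', 'broad', 'paused');
""")

with kw_conn:
    # Insert only staging rows with no match on BOTH key columns.
    kw_conn.execute("""
        INSERT INTO keywords (api_adgroup_id, api_keyword_id, keyword_text, match_type, status)
        SELECT a.api_adgroup_id, a.api_keyword_id, a.keyword_text, a.match_type, a.status
        FROM tmp_keywords a
        LEFT JOIN keywords b
               ON a.api_adgroup_id = b.api_adgroup_id
              AND a.api_keyword_id = b.api_keyword_id
        WHERE b.api_keyword_id IS NULL
    """)

stored_keys = kw_conn.execute(
    "SELECT api_adgroup_id, api_keyword_id FROM keywords "
    "ORDER BY api_adgroup_id, api_keyword_id"
).fetchall()
print(stored_keys)
```

Joining on the keyword id alone would wrongly treat the row for ad group 2 as already present, which is why both columns appear in the ON clause.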