Dealing with large overlapping sets of data - updating just the delta - MySQL

My Python application generates a CSV file containing a few hundred unique records, one record per line. It runs hourly, and very often the data stays the same from one run to the next. When there are changes, they are small, e.g.:
- one record removed,
- a few new records added,
- an occasional update to an existing record.
Each record is just four simple fields (name, date, id, description), and there will be no more than 10,000 records by the time the project reaches full size, so it can all be contained in a single table.
What's the best way to merge the changes into the table?
A few approaches I'm considering are:
1) empty the table and re-populate on each run.
2) write the latest data to a staging table and run a DB job to merge the changes into the main table.
3) read the existing table data into my Python script, collect the new data, compute the differences, and run individual insert/update/delete ('CRUD') operations to apply the changes one by one
Can anyone suggest a better way?
thanks

I would do this in the following way:
Load the new CSV file into a second table.
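For the load step, LOAD DATA is usually the quickest route. A minimal sketch, assuming the staging table is called new_table (as below), the CSV has a header row, and its columns come in the order id, name, date, description - adjust the path, delimiters and column list to match your file:
LOAD DATA LOCAL INFILE '/path/to/export.csv'
INTO TABLE new_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, name, date, description);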
DELETE rows in the main table that are missing from the second table:
DELETE m FROM main_table AS m
LEFT OUTER JOIN new_table AS t ON m.id = t.id
WHERE t.id IS NULL;
Use INSERT ... ON DUPLICATE KEY UPDATE to insert the new rows and update the rows that need updating. It becomes a no-op for each row that already contains the same values:
INSERT INTO main_table (id, name, date, description)
SELECT id, name, date, description FROM new_table
ON DUPLICATE KEY UPDATE
name = VALUES(name), date = VALUES(date), description = VALUES(description);
Drop the second table once you're done with it.
This is assuming id is the primary key and you have no other UNIQUE KEY in the table.
Given a data set size of 10,000 rows, this should be quick enough to do in one batch. Once the data set gets 10x larger, you may have to reconsider the solution, for example by working in batches of 10,000 rows at a time.
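If you do eventually need to batch the DELETE step, one way (a sketch reusing the table names above; the batch size is arbitrary) is to pick a limited set of missing ids in a derived table and join against it, repeating until no rows are affected:
DELETE m FROM main_table AS m
JOIN (
    SELECT m2.id
    FROM main_table AS m2
    LEFT OUTER JOIN new_table AS t ON m2.id = t.id
    WHERE t.id IS NULL
    LIMIT 10000
) AS batch ON batch.id = m.id;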

Related

How to transfer data from one table to another table

I have a table, say Table1, in MySQL. We have an application that stores data in Table1, and millions of new records get saved daily. We have a requirement to extract data from this table, transform it, and then load it into a new table, say Table2 (a kind of ETL process), and this should happen live at an interval of a few seconds. How can I do this efficiently and without copying duplicate records from Table1?
I thought of introducing a new field in Table1, say Extracted, to keep track of extraction. If a particular row has already been extracted, Extracted will have the value Y; if not, it will have the value N, meaning the row still needs to be extracted. That means the ETL job needs to update the Extracted field in Table1 after extraction. What I am wondering is: would it be efficient to update records in such a huge table where millions of new rows get saved daily? Please suggest!
Thank you, guys!
If you need to keep the data in Table2 in sync (and slightly modified) with the data in Table1, you have a couple of options at the MySQL level:
Triggers - create AFTER INSERT, UPDATE and DELETE triggers that transfer the data to Table2 immediately and do the transformation for you.
Views - if the data in Table2 is read-only, create a view whose SELECT definition does the required transformation.
The advantage of both approaches is that Table2 is always up to date with Table1 and no extra fields are required.
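As a rough sketch of the trigger route (the column names and the transformation here are placeholders, not your real schema):
CREATE TRIGGER table1_after_insert
AFTER INSERT ON Table1
FOR EACH ROW
    INSERT INTO Table2 (id, transformed_value)
    VALUES (NEW.id, UPPER(NEW.some_value));
Similar AFTER UPDATE and AFTER DELETE triggers keep modifications and removals in sync.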
Use this to ignore duplicate records:
INSERT IGNORE INTO Table2 (field1, field2) VALUES (x, y);
or use this to update the existing record if there is a duplicate in the table:
INSERT INTO Table2 (field1, field2) VALUES (x, y) ON DUPLICATE KEY UPDATE field1 = VALUES(field1), field2 = VALUES(field2);

Handling multiple MySQL queries (Deleting and Copying)

Good morning.
I have a table in a MySQL database.
In this table, 5 robots each write about 10 records per hour.
Every 3 months a script that I have created makes a copy of the table and then deletes all the table entries (this way I can keep the IDs in a certain order).
My question concerns these two statements:
CREATE TABLE omologationResult_{date} AS SELECT * FROM omologationResult
DELETE FROM omologationResult
If the script copies the table at point 0 and a record is added by a robot, there's no problem, because the SQL statement goes from the lowest ID to the end. But what happens if the script deletes from the table while a robot is writing to it? Will I lose the robot's latest records?
And if so, what can I do to make a copy of the table and then remove only the data that I've copied?
Thank you
Yes, this is not a safe operation because it's not atomic. It's quite possible for another thread to insert values into that table in between your CREATE ... SELECT and the DELETE. One option you have is to use a multi-table DELETE:
CREATE TABLE omologationResult_{date} AS SELECT * FROM omologationResult;
DELETE omologationResult FROM omologationResult
INNER JOIN omologationResult_{date} ON omologationResult_{date}.id = omologationResult.id
This will ensure that only rows that exist in both tables are deleted from omologationResult.

R - dbWriteTable or a MySQL DELETE query?

I can't seem to find the answer to this anywhere. I am reading a CSV into a data frame using the read.csv function. Then I am writing the data frame contents to a MySQL table using dbWriteTable. This works great for the initial run that creates the table, but each run after this needs to do either an insert or an update depending on whether the record already exists in the table.
The 1st column in the data frame is the primary key, and the other columns contain data that might change every time I pull a new copy of the CSV. Each time I pull the CSV, if the primary key already exists, I want it to update that record with the new data, and if the primary key does not exist (e.g. a new key since the last run), I want it to just insert the record into the table.
This is my current dbWriteTable. This creates the table just fine the 1st time it's run, and also inserts a "Timestamp" column into the table that is set to "on update CURRENT_TIMESTAMP" so that I know when each record was last updated.
dbWriteTable(mydb, value=csvData, name=Table, row.names=FALSE, field.types=list(PrimaryKey="VARCHAR(10)",Column2="VARCHAR(255)",Column3="VARCHAR(255)",Timestamp="TIMESTAMP"), append=TRUE)
Now the next time I run this, I simply want it to update any PrimaryKeys that are already in the table, and add any new ones. I also don't want to lose any records in the event a PrimaryKey disappears from the CSV source.
Is it possible to do this kind of update using dbWriteTable, or some other R function?
If that's not possible, is it possible to just run a MySQL query that would delete any duplicate PrimaryKey records and keep just the one record with the most current timestamp? Then I would run dbWriteTable to append the new data, followed by a MySQL query to prune out the older records.
Obviously I couldn't define that 1st column as an actual primary key in the DB, as my append/delete solution wouldn't work due to duplicate keys, and that's fine; I can always add an auto-increment integer column to the table for the "real" primary key if needed.
Thoughts?
Consider using a temp table (an exact replica of the final table but with fewer records) and then run an INSERT and an UPDATE query into the final table, which will handle both cases without overlap (plus, primary keys are constraints, and queries will error out if attempts are made to duplicate any):
records to append if they do not exist - using the LEFT JOIN ... NULL query
records to update if they do exist - using the UPDATE ... INNER JOIN query
Concerning the former, there is a regular debate among SQL coders over whether LEFT JOIN ... NULL, NOT IN, or NOT EXISTS is the optimal solution, which of course "depends". The LEFT JOIN used here does avoid subqueries, but consider those avenues if needed.
# DELETE LAST SET OF TEMP DATA
dbSendQuery(mydb, "DELETE FROM tempTable")
# APPEND R DATA FRAME TO TEMP DATA
dbWriteTable(mydb, value=csvData, name="tempTable", row.names=FALSE,
             field.types=list(PrimaryKey="VARCHAR(10)", Column2="VARCHAR(255)",
                              Column3="VARCHAR(255)", Timestamp="TIMESTAMP"),
             append=TRUE, overwrite=FALSE)
# LEFT JOIN ... NULL QUERY TO APPEND NEW RECORDS NOT IN TABLE
dbSendQuery(mydb, "INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
                   SELECT t.PrimaryKey, t.Column2, t.Column3, t.Timestamp
                   FROM tempTable t
                   LEFT JOIN finalTable f
                        ON t.PrimaryKey = f.PrimaryKey
                   WHERE f.PrimaryKey IS NULL;")
# UPDATE INNER JOIN QUERY TO UPDATE MATCHING RECORDS
dbSendQuery(mydb, "UPDATE finalTable f
                   INNER JOIN tempTable t
                        ON f.PrimaryKey = t.PrimaryKey
                   SET f.Column2 = t.Column2,
                       f.Column3 = t.Column3,
                       f.Timestamp = t.Timestamp;")
For the most part, the queries above will work in most SQL backends should you ever need to change databases. Some RDBMSs do not support UPDATE with INNER JOIN, but equivalent alternatives are available. Finally, the beauty of this route is that all the processing is handled in the SQL engine and not in R.
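For reference, a NOT EXISTS version of the same append (a sketch using the same assumed table and column names) would be:
INSERT INTO finalTable (PrimaryKey, Column2, Column3, Timestamp)
SELECT t.PrimaryKey, t.Column2, t.Column3, t.Timestamp
FROM tempTable t
WHERE NOT EXISTS (SELECT 1 FROM finalTable f WHERE f.PrimaryKey = t.PrimaryKey);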
Sounds like you're trying to do an upsert.
I'm kind of rusty with MySQL, but the general idea is that you need a staging table to upload the new CSV into, and then do the insert/update in the database itself.
For that you need to use dbSendQuery with INSERT ... ON DUPLICATE KEY UPDATE.
http://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html
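A sketch of the statement you would send through dbSendQuery, assuming the staging table is called stagingTable and PrimaryKey is defined as the primary key of finalTable (both names are placeholders here):
INSERT INTO finalTable (PrimaryKey, Column2, Column3)
SELECT PrimaryKey, Column2, Column3 FROM stagingTable
ON DUPLICATE KEY UPDATE
    Column2 = VALUES(Column2),
    Column3 = VALUES(Column3);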

Synchronize local database with external API

I have a table containing about 500,000 rows. Once a day, I will try to synchronize this table with an external API. Most of the time there are few or no changes since the last update. My question is basically: how should I construct my MySQL query for the best performance? I have thought about using INSERT IGNORE, but it doesn't feel like the best way to go since only a few rows will be inserted and MySQL would have to work through all the rows in the table. I have also thought about using LOAD DATA INFILE to insert all rows into a temporary table, then selecting the rows not already in my original table, and then removing the temporary table. Maybe someone else has a better suggestion?
Thank you in advance!
I usually use a temporary table and the LOAD DATA INFILE bulk loader. The bulk loader is much more efficient than trying to insert records using a dynamically created query.
If you index your permanent tables with appropriate unique keys that relate to the keys in the API, then you should find that the INSERT and UPDATE statements work pretty fast. An example of the type of INSERT query I use is as follows:
INSERT INTO keywords (api_adgroup_id, api_keyword_id, keyword_text, match_type, status)
SELECT a.api_adgroup_id, a.api_keyword_id, a.keyword_text, a.match_type, a.status
FROM tmp_keywords a LEFT JOIN keywords b ON a.api_adgroup_id = b.api_adgroup_id AND a.api_keyword_id = b.api_keyword_id
WHERE b.api_keyword_id IS NULL
In this example, I perform an OUTER JOIN on the keywords table to check if it already exists. Only new rows in the temporary table where there isn't a match in the main table (the api_keyword_id in the keywords table is NULL) are inserted.
Also note that in this example I need to use both the ad group id AND the keyword id to uniquely identify the keyword because the AdWords API gives the same keyword/match type combination the same id when it exists in more than one ad group.
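The matching UPDATE side can be handled with a join as well; a sketch using the same table and column names:
UPDATE keywords b
INNER JOIN tmp_keywords a
    ON a.api_adgroup_id = b.api_adgroup_id
   AND a.api_keyword_id = b.api_keyword_id
SET b.keyword_text = a.keyword_text,
    b.match_type = a.match_type,
    b.status = a.status;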

Index counter shared by multiple tables in MySQL

I have two tables, each one has a primary ID column as key. I want the two tables to share one increasing key counter.
For example, when the two tables are empty, and counter = 1. When record A is about to be inserted to table 1, its ID will be 1 and the counter will be increased to 2. When record B is about to be inserted to table 2, its ID will be 2 and the counter will be increased to 3. When record C is about to be inserted to table 1 again, its ID will be 3 and so on.
I am using PHP as the outside language. Now I have two options:
Keep the counter in the database as a single-row, single-column table. But every time I add something to either table, I need to update this counter table.
Keep the counter as a global variable in PHP. But then I need to initialize the counter from the maximum key of the two tables when Apache starts, which I have no idea how to do.
Any suggestion for this?
The background is, I want to display a mix of records from the two tables in either ASC or DESC order of the records' creation time. Furthermore, the records will be displayed in pages, say 50 records per page. Records are only added to the database, never removed. With the implementation above, I can just perform a "select ... where key between 1 and 50" on the two tables, merge the result sets, sort the 50 records by ID, and display them.
Is there any other idea of implementing this requirement?
Thank you very much
Well, you will gain next to nothing with this setup; if you just keep the datetime of the insert you can easily do
SELECT * FROM
(
    SELECT columnA, columnB, inserttime
    FROM table1
    UNION ALL
    SELECT columnA, columnB, inserttime
    FROM table2
) AS combined
ORDER BY inserttime
LIMIT 0, 50
And it will perform decently.
Alternatively (if chasing the last drop of performance), if you are merging the results it can be an indicator that you should merge the tables (why have two tables anyway if you are merging the results?).
Or do it as an SQL "subclass" (supertype/subtype) design: one table maintains the IDs and the other common attributes, and the other two tables reference that common ID sequence as a foreign key.
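A rough sketch of that supertype/subtype layout (table and column names are illustrative only):
CREATE TABLE record_base (
    id INT AUTO_INCREMENT PRIMARY KEY,
    inserttime DATETIME NOT NULL
);
CREATE TABLE table1 (
    id INT PRIMARY KEY,
    columnA VARCHAR(255),
    FOREIGN KEY (id) REFERENCES record_base (id)
);
CREATE TABLE table2 (
    id INT PRIMARY KEY,
    columnB VARCHAR(255),
    FOREIGN KEY (id) REFERENCES record_base (id)
);
Both child tables borrow their id from record_base, so the IDs interleave in insertion order and the creation time lives in one place.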
If you need the creation time, wouldn't it be easier to add a timestamp field to your DB and sort according to that field?
I believe using IDs as a reference for creation time is bad practice.
If you really must do this, there is a way. Create a one-row, one-column table to hold the last-used row number, and set it to zero. On each of your two data tables, create a BEFORE INSERT trigger that reads that table, increments it, and sets the newly inserted row's ID to that value. I can't remember the exact syntax because I haven't created a trigger for years; see http://dev.mysql.com/doc/refman/5.0/en/triggers.html
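A sketch of that approach, using MySQL's LAST_INSERT_ID(expr) sequence-emulation trick so the increment and the read stay consistent (all table, column and trigger names here are illustrative):
CREATE TABLE id_counter (last_id INT UNSIGNED NOT NULL);
INSERT INTO id_counter VALUES (0);

DELIMITER //
CREATE TRIGGER table1_before_insert BEFORE INSERT ON table1
FOR EACH ROW
BEGIN
    -- bump the shared counter and remember the new value for this session
    UPDATE id_counter SET last_id = LAST_INSERT_ID(last_id + 1);
    SET NEW.id = LAST_INSERT_ID();
END//
-- an identical trigger on table2 makes both tables share the counter
CREATE TRIGGER table2_before_insert BEFORE INSERT ON table2
FOR EACH ROW
BEGIN
    UPDATE id_counter SET last_id = LAST_INSERT_ID(last_id + 1);
    SET NEW.id = LAST_INSERT_ID();
END//
DELIMITER ;
Note that the UPDATE takes a row lock on the counter, so concurrent inserts into either table are serialized until the inserting transaction commits.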