Add only new records in MySQL via script

I have a large database which I am trying to update via perl. The information to be added comes from a csv file which I do not control (but which is trusted—it comes from a different part of our company). For each record in the file, I need to either add it (if it does not exist) or do nothing (if it exists). Adding a record consists of the usual INSERT INTO, but before that can run for a particular entry a specific UPDATE must be run.
Let's say for the sake of concreteness that the file has 10,000 entries, but 90% of them are already in the database. What is the most efficient way to import the records? I can see a few obvious approaches:
Pull all records of this type from the database, then check each of the entries from the file for membership. Downside: lots of data transfer, possibly enough to time the server out.
Read in the entries from the file and send a query for just those records with an RLIKE 'foo|bar|baz|...' query (or a stuff = 'foo' || stuff = 'bar' || ... query, but that seems even worse). Downside: huge query, probably enough to choke the server.
Read in the file, send a query for each entry, then add it if appropriate. Downside: tens of thousands of queries, very slow.
Apart from the UPDATE requirement, this seems like a fairly standard issue that presumably has a standard solution. If there is, it can probably be adapted to my case with appropriate use of tests on the auto_increment primary key.

The standard solution is to use INSERT IGNORE, which won't raise an error if the insertion would fail because of a constraint. This isn't much use to you, as it doesn't give you a chance to do the UPDATE before you know the INSERT is going to work. If you can do the update afterwards, however, this is ideal: just INSERT IGNORE each record and then do the UPDATE if it succeeded.
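For illustration, here is a minimal sketch of that pattern. The question uses Perl, but the logic is shown here in Python with mysql.connector since the SQL is identical; the table names (items, related_table), the column names, and the particular UPDATE are placeholders, not anything taken from the question.

import csv
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="mydb")
cur = cnx.cursor()

with open("import.csv", newline="") as fh:
    for rec in csv.DictReader(fh):
        # INSERT IGNORE never errors on a duplicate key; rowcount tells us whether a row was really added
        cur.execute(
            "INSERT IGNORE INTO items (item_key, description) VALUES (%s, %s)",
            (rec["key"], rec["description"]),
        )
        if cur.rowcount == 1:
            # the record was new, so run the required follow-up UPDATE
            cur.execute(
                "UPDATE related_table SET counter = counter + 1 WHERE group_id = %s",
                (rec["group"],),
            )

cnx.commit()
cur.close()
cnx.close()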
If a record already exists, that means a record with a matching unique key is already in the database, so I don't understand the RLIKE proposal, which is bound to be slow.
I would use Perl to filter the CSV file: run SELECT count(*) FROM table WHERE key = ? for each record and remove anything where the result is non-zero.
Then just do your UPDATE and INSERT for everything left in the filtered CSV data.
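A sketch of this filter-then-import approach, again in Python for brevity (the question uses Perl) and with the same placeholder table and column names as the sketch above:

import csv
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="mydb")
cur = cnx.cursor()

with open("import.csv", newline="") as fh:
    for rec in csv.DictReader(fh):
        cur.execute("SELECT COUNT(*) FROM items WHERE item_key = %s", (rec["key"],))
        (already_there,) = cur.fetchone()
        if already_there:
            continue  # record exists, nothing to do
        # record is new: run the required UPDATE first, then the INSERT
        cur.execute("UPDATE related_table SET counter = counter + 1 WHERE group_id = %s",
                    (rec["group"],))
        cur.execute("INSERT INTO items (item_key, description) VALUES (%s, %s)",
                    (rec["key"], rec["description"]))

cnx.commit()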

There is no need for the server to time out if you keep flushing data while iterating over the list.

Check if a record from the database exists in a CSV file

Today I come to you for inspiration, or maybe ideas, on how to solve a task without killing my laptop with massive and repetitive code.
I have a CSV file with around 10k records. I also have a database with the corresponding records in it. Both structures have four fields: destination, countryCode, prefix and cost.
Every time I update the database from this .csv file, I have to check whether a record with the given destination, countryCode and prefix exists, and if so, update the cost. That is pretty easy and it works fine.
But here comes the tricky part: there is a possibility that the destination may be deleted from one .csv file to another and I need to be aware of that and delete that unused record from the database. What is the most efficient way of handling that kind of situation?
I really wouldn't want to check every record from the database with every row in a .csv file: that sounds like a very bad idea.
I was thinking about a time_stamp or just a bool variable which would tell me whether the record was modified during the last update of the DB, BUT there is also a chance that none of the params within the record change, and thus there would be no need to touch that record and mark it as modified.
For that task, I use Python 3 and mysql.connector lib.
Any ideas and advice will be appreciated :)
If you're keeping a time stamp, why do you care if it's updated even when nothing was changed in the record? If the reason is that you want to save the date of the latest change, you can add another column holding a time stamp of the last time the record appeared in the CSV, and afterwards delete all the records whose value in that column is older than the date of the last CSV import.
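A rough sketch of that idea using the asker's Python 3 + mysql.connector stack. The column names (destination, countryCode, prefix, cost) come from the question; the table name rates, the last_seen column, the file name, and the assumption that (destination, countryCode, prefix) form a unique key are all mine.

import csv
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="billing")
cur = cnx.cursor()

# remember when this import started, so we can find rows it did not touch
cur.execute("SELECT NOW()")
(import_started,) = cur.fetchone()

with open("rates.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        # insert new rows, refresh cost and last_seen on existing ones
        # (requires a unique key over destination, countryCode, prefix)
        cur.execute(
            """INSERT INTO rates (destination, countryCode, prefix, cost, last_seen)
               VALUES (%s, %s, %s, %s, NOW())
               ON DUPLICATE KEY UPDATE cost = VALUES(cost), last_seen = NOW()""",
            (row["destination"], row["countryCode"], row["prefix"], row["cost"]),
        )

# anything not seen during this import is gone from the CSV: delete it
cur.execute("DELETE FROM rates WHERE last_seen < %s", (import_started,))
cnx.commit()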
If the .CSV is a replacement for the existing table:
CREATE TABLE new LIKE real;
load the .csv into `new` (Probably use LOAD DATA...)
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
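For reference, here is roughly what that swap could look like when driven from the asker's Python + mysql.connector setup. The table names (rates, rates_new, rates_old), the column list, and the file name are assumptions; LOAD DATA LOCAL INFILE also needs local_infile enabled on the server and allow_local_infile=True on the client.

import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret",
                              database="billing", allow_local_infile=True)
cur = cnx.cursor()

cur.execute("CREATE TABLE rates_new LIKE rates")
cur.execute(
    """LOAD DATA LOCAL INFILE 'rates.csv'
       INTO TABLE rates_new
       FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
       IGNORE 1 LINES
       (destination, countryCode, prefix, cost)"""
)
# atomic swap, then throw the old data away
cur.execute("RENAME TABLE rates TO rates_old, rates_new TO rates")
cur.execute("DROP TABLE rates_old")
cnx.commit()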
If you have good reason to keep the old table and patch it, then...
load the .csv into a table
add suitable indexes
do one SQL to do deletes (no loop needed). It is probably a multi-table DELETE.
do one SQL to update the prices (no loop needed). It is probably a multi-table UPDATE.
You can probably do the entire task (either way) without touching Python.
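As a concrete illustration of the patch variant, here is a hedged sketch, shown in Python with mysql.connector only because that is what the asker already uses (as noted above, the two statements themselves are plain SQL). The staging table rates_incoming, its column types, and the assumption that (destination, countryCode, prefix) identify a row are mine.

import csv
import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="billing")
cur = cnx.cursor()

# 1. load the CSV into a staging table with a suitable index
cur.execute(
    """CREATE TEMPORARY TABLE rates_incoming (
           destination VARCHAR(100),
           countryCode VARCHAR(8),
           prefix      VARCHAR(16),
           cost        DECIMAL(10,4),
           KEY (destination, countryCode, prefix)
       )"""
)
with open("rates.csv", newline="") as fh:
    rows = [(r["destination"], r["countryCode"], r["prefix"], r["cost"]) for r in csv.DictReader(fh)]
cur.executemany("INSERT INTO rates_incoming VALUES (%s, %s, %s, %s)", rows)

# 2. one multi-table DELETE: remove rows that no longer appear in the CSV
cur.execute(
    """DELETE r FROM rates AS r
       LEFT JOIN rates_incoming AS i
         ON  r.destination = i.destination
         AND r.countryCode = i.countryCode
         AND r.prefix      = i.prefix
       WHERE i.destination IS NULL"""
)

# 3. one multi-table UPDATE: refresh the cost of rows that do appear
cur.execute(
    """UPDATE rates AS r
       JOIN rates_incoming AS i
         ON  r.destination = i.destination
         AND r.countryCode = i.countryCode
         AND r.prefix      = i.prefix
       SET r.cost = i.cost"""
)
cnx.commit()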

SSIS Redirect on Error - Too many rows

I have an SSIS package that imports a flat CSV file; there are approximately 200,000 records in the file. I've set up the table that the data imports into with a unique primary key on the account number. There shouldn't be any duplicates in the source data (application controlled, outside of my influence).
However, there is one duplicate row in the CSV, and when I add the primary key it redirects 7k rows... these aren't duplicate rows; it just appears to redirect a load of them for no reason.
If I manually remove the single duplicate row, it works perfectly. There is nothing special about the data or the files; it should just import the data and redirect the error row.
This behavior is due to the OLE DB Destination and the fast load mode it uses.
With fast load, the OLE DB Destination issues an INSERT BULK command and inserts in batches. If one of the rows within a batch violates a table constraint, the whole batch fails and gets redirected to the Error Output. This accounts for the behavior that looks strange at first glance: more than one row gets rejected.
What you can do about it depends on your goal and limitations:
If you simply want to filter out the duplicates, switch the OLE DB Destination to regular insert mode, at the cost of a significant performance decrease. This is the simplest way.
If the performance drop is not an option and you need to keep it simple, use a Sort component in the data flow and tick the flag to discard duplicate rows. Caveat: you have no control over which row will be discarded.
If you need to implement business rules on which data should pass, then you have to implement a scoring column and use it to filter rows. See Todd McDermid's article on this.

SSIS OLE DB conditional "insert"

I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a Derived Column object due to some character set issues (there might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange. I took the Lookup object out to see whether it was causing it, and it seems that is the case. Is there any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no-match output, and then connect only the match-found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is the lookup would go against the existing destination, and so the lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so a row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command. It is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and then write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.

Most efficient way to periodically delete all entries in database older than a month

I'm trying to manage a database table as efficiently as possible and get rid of old entries that will never be accessed. Yes, they could probably easily be persisted for many years, but I'd just like to get rid of them, maybe once every month. Would it be more efficient to copy the entries I want to keep into a new table and then simply delete the old table? Or should a query manually delete each entry past the threshold that I set?
I'm using MySQL with JPA/JPQL JEE6 with entity annotations and Java persistence manager.
Thanks
Another solution is to design the table with range or list PARTITIONING, and then you can use ALTER TABLE to drop or truncate old partitions.
This is much quicker than using DELETE, but it may complicate other uses of the table.
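To make the suggestion concrete, here is a rough sketch of monthly range partitioning; the table, columns, and partition names are invented, and it is driven through mysql.connector purely for illustration (the DDL is what matters). Note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key.

import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="mydb")
cur = cnx.cursor()

cur.execute(
    """CREATE TABLE events (
           id         BIGINT NOT NULL AUTO_INCREMENT,
           created_at DATETIME NOT NULL,
           payload    TEXT,
           PRIMARY KEY (id, created_at)
       )
       PARTITION BY RANGE (TO_DAYS(created_at)) (
           PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
           PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
           PARTITION pmax     VALUES LESS THAN MAXVALUE
       )"""
)

# dropping a month is a quick metadata operation, unlike a row-by-row DELETE
cur.execute("ALTER TABLE events DROP PARTITION p2024_01")

New monthly partitions have to be added over time (for example by reorganizing pmax), which is part of the "may complicate other uses" caveat above.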
A single delete query will be the most efficient solution.
Copying data from one database to another can be lengthy if you have a lot of data to keep. It means you have to retrieve all the data with a single query (or multiple, if you want to batch), and issue a lot of insert statements in the other database.
Using JPQL, you can issue a single query to delete all old statements, something like
DELETE FROM Entity e WHERE e.date < :threshold
This will be translated to a single SQL query, and the database will take care of deleting all the unwanted records.

MySQL: Dump a database from a SQL query

I'm writing a test framework in which I need to capture a MySQL database state (table structure, contents etc.).
I need this to implement a check that the state was not changed after certain operations. (Autoincrement values may be allowed to change, but I think I'll be able to handle this.)
The dump should preferably be in a human-readable format (preferably an SQL code, like mysqldump does).
I wish to limit my test framework to using a MySQL connection only. To capture the state it should not call mysqldump or access the filesystem (like copying *.frm files or doing SELECT ... INTO OUTFILE; pipes are fine though).
As this would be test-only code, I'm not concerned about performance. I do need reliable behavior, though.
What is the best way to implement the functionality I need?
I guess I should base my code on some of the existing open-source backup tools... Which is the best one to look at?
Update: I'm not specifying the language I'm writing this in (no, it's not PHP), as I don't think I would be able to reuse code as is; my case is rather special (for practical purposes, let's assume the MySQL C API). The code would run on Linux.
Given your requirements, I think you are left with something like this (pseudo-code + SQL):
tables = mysql_fetch "SHOW TABLES"
foreach table in tables
    create = mysql_fetch "SHOW CREATE TABLE table"
    print create
    rows = mysql_fetch "SELECT * FROM table"
    foreach row in rows
        // or could use the VALUES (v1, v2, ...), (v1, v2, ...), ... syntax (maybe preferable for smaller tables)
        insert = "INSERT INTO table (field1, field2, field3, ...) VALUES (value1, value2, value3, ...)"
        print insert
Basically, fetch the list of all tables, then walk each table and generate INSERT statements for each row by hand (most APIs have a simple way to fetch the list of column names; otherwise you can fall back to calling DESC table_name).
SHOW CREATE TABLE is done for you, but I'm fairly certain there's nothing analogous, like a SHOW INSERT ROWS.
And of course, instead of printing the dump you could do whatever you want with it.
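For what it's worth, a small runnable version of that pseudo-code, using Python and mysql.connector rather than the C API the asker mentions. The quoting of values is deliberately naive (fine for diffing two captures of the same database, not for re-importing arbitrary binary data), and the connection details are placeholders.

import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="test", password="secret", database="testdb")
cur = cnx.cursor()

cur.execute("SHOW TABLES")
tables = [name for (name,) in cur.fetchall()]

dump = []
for table in tables:
    cur.execute(f"SHOW CREATE TABLE `{table}`")
    dump.append(cur.fetchone()[1] + ";")   # second column holds the CREATE TABLE statement

    cur.execute(f"SELECT * FROM `{table}`")
    columns = ", ".join(d[0] for d in cur.description)
    for row in cur.fetchall():
        values = ", ".join("NULL" if v is None else repr(str(v)) for v in row)
        dump.append(f"INSERT INTO `{table}` ({columns}) VALUES ({values});")

print("\n".join(dump))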
If you don't want to use command-line tools (in other words, you want to do it completely within, say, PHP or whatever language you are using), then why not iterate over the tables using SQL itself? For example, to check the table structure, one simple technique would be to capture a snapshot of it with SHOW CREATE TABLE table_name, store the result, and then later make the call again and compare the results.
Have you looked at the source code for mysqldump? I am sure most of what you want would be contained within that.
Unless you build the export yourself, I don't think there is a simple solution to export and verify the data. If you do it table by table, LOAD DATA INFILE and SELECT ... INTO OUTFILE may be helpful.
I find it easier to rebuild the database for every test. At least I know the exact state of the data. Of course, it takes more time to run those tests, but it's a good incentive to abstract away the operations and write fewer tests that depend on the database.
Another alternative I use on some projects, where the design does not allow such a clean division, is a transactional storage engine such as InnoDB. As long as you keep track of your transactions, or disable them during the test, you can simply start a transaction in setUp() and roll back in tearDown().
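A minimal sketch of that setUp()/tearDown() pattern with Python's unittest and mysql.connector (the question doesn't name its language, so treat this purely as an illustration). It assumes the tables are InnoDB and that the test issues no DDL, since DDL would implicitly commit; the table and connection details are placeholders.

import unittest
import mysql.connector

class DatabaseStateTest(unittest.TestCase):
    def setUp(self):
        self.cnx = mysql.connector.connect(
            host="localhost", user="test", password="secret", database="testdb"
        )
        self.cnx.start_transaction()
        self.cur = self.cnx.cursor()

    def tearDown(self):
        self.cnx.rollback()   # undo everything the test wrote
        self.cnx.close()

    def test_insert_is_rolled_back(self):
        self.cur.execute("INSERT INTO items (name) VALUES (%s)", ("temporary",))
        self.cur.execute("SELECT COUNT(*) FROM items WHERE name = %s", ("temporary",))
        self.assertEqual(self.cur.fetchone()[0], 1)   # visible inside the transaction only

if __name__ == "__main__":
    unittest.main()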