In Kettle, I use the following logic in a transformation, given some Strings X and Y as input:
[User Defined Java Expression] Generate ID
[Insert / Update] Update/Insert table set id = generatedId, name=X, company=Y where name = X; don't update the ID column
[Database Value Lookup] select id from table where name = X
The idea is to update existing entries in the table or create new ones, and then get the ID of the relevant row in the next step (which may be an existing one or the newly generated one).
This works fine when executed on MySQL + MyISAM but fails on MySQL + InnoDB, with all other parameters being identical. The last step fails when the row is just being inserted in the second step but works for rows already existing in the database. It seems as if the connection tries to execute the SELECT of the last step before the actual insert happened.
All parameters are set to default in the MySQL settings (MySQL 5.1 and 5.5 show the same behavior).
So my questions are: What are the relevant parameters in Kettle and/or MySQL? How can I guarantee that this works as expected? I cannot switch back to MyISAM.
Just use the blocking step between the insert step and the next step. Then the step before the block will complete before the next step starts.
After evaluating the different possibilities, three seem feasible:
Write my own step which performs the select/insert in a transaction
Serialize the whole transformation in its properties (makes everything REALLY slow)
Use Codek's idea of the blocking step
I went with the third option for now as everything else is not possible for the moment.
Make sure the transaction generated by the Insert / Update step is committed and its locks are released before the SELECT takes place. It looks like a locking problem.
Related
I am working with an application which needs to function with any of 300+ different MySQL databases on the same server. The databases all have nearly identical table structures, with slight variations. For example, a particular column might be present in a table for only some of the databases.
I'm wondering if there is a way that, when performing an update on a table, I can update a specific column if it exists, but still successfully execute if the column does not exist.
For example, say I have a basic update statement like this:
UPDATE some_table
SET col1 = "some value",
col2 = "another value",
col3 = "a third value"
WHERE id = 567
What can I do to make it so that, if col3 doesn't actually exist when that query is run, the statement still executes and col1 and col2 are still updated with the new values?
I have tried using IF and CASE, but those seem to only allow changing the value based on some condition, not whether or not a column actually gets updated.
I know I can query the database for the existence of the column, then use a simple if condition in the application code to choose a different query. However, that requires me to query the database twice: once to see if the column exists, and again to actually update it. I'd prefer to do it with one SQL query if possible. I feel like the application code might start to get unwieldy with lots of extra code to check the existence of this-or-that column and conditionally build queries, instead of just having one query which works regardless of which database the application happens to be running against at the time.
To clarify, any given instance of the application is ever only running against one database; there is a different application instance for each database, but the instances will all be running the same code. These are legacy databases that legacy code is also relying on, so I don't want to modify the actual structures in the database to make them more consistent, for fear of breaking the legacy code.
No, the syntax of your SQL query, including all column identifiers you reference, must be fixed at the time it is parsed, before it validates that the columns exist.
A given UPDATE will either succeed fully or fail fully. There is no way to update some of the columns if the query fails to update all of them.
You have two choices:
Query INFORMATION_SCHEMA.COLUMNS first, to check what columns exist in the table for a given schema. Then format your UPDATE query, including clauses to set each column only if the column exists in that instance of the table.
Or...
Run several UPDATE statements, one for each column you want to update. Each statement will succeed or fail independently, but you can catch the error and continue on to the remaining statements. You can put all these statements in a transaction, so the set of changes is committed atomically, regardless of how many succeed (a single failed statement does not roll back a transaction).
Either way, it requires you to write more code. That's the unavoidable cost of supporting such variable table structure.
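The first choice can be kept fairly small if the column list is fetched once per application instance and reused, since each instance only talks to one database. Below is a minimal sketch in Python, not a definitive implementation: `build_update` and `COLUMNS_QUERY` are hypothetical names, and the `%s` placeholder style is an assumption that matches common MySQL drivers for Python. Table and column names are expected to come from trusted code, not user input.

```python
from typing import Any, Iterable, List, Mapping, Tuple

# Run once at startup and cache the result (hypothetical helper query).
COLUMNS_QUERY = (
    "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = %s"
)


def build_update(
    table: str,
    values: Mapping[str, Any],
    existing_columns: Iterable[str],
    key_column: str,
    key_value: Any,
) -> Tuple[str, List[Any]]:
    """Build a parameterized UPDATE that only sets columns present in this schema."""
    present = set(existing_columns)
    # Keep the caller's column order, dropping columns this database lacks.
    usable = [(col, val) for col, val in values.items() if col in present]
    if not usable:
        raise ValueError("none of the requested columns exist in " + table)
    set_clause = ", ".join(f"`{col}` = %s" for col, _ in usable)
    sql = f"UPDATE `{table}` SET {set_clause} WHERE `{key_column}` = %s"
    params = [val for _, val in usable] + [key_value]
    return sql, params


# Example: this database has no col3, so it is silently left out.
sql, params = build_update(
    "some_table",
    {"col1": "some value", "col2": "another value", "col3": "a third value"},
    existing_columns=["id", "col1", "col2"],
    key_column="id",
    key_value=567,
)
```

With the column list cached, the conditional logic lives in one helper instead of being repeated around every query.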
I have a table with a huge amount of data. The source of the data is an external API. Every few hours, I need to sync the database so that it reflects the latest changes from the external API. I am doing a full sync (the API doesn't allow a delta sync).
While the sync happens, I want to make sure that the data in the database is still available for reading. So I am following the steps below:
1. I have a column in the table which acts as a flag for whether or not the data is readable. Only rows with the flag set are read.
2. I insert all the data from the API into the table.
3. Once all the data is written, I delete all the rows in the table that have the flag set.
4. After the deletion, I update the table and set the flag for all the rows.
The table has about 50 million rows and is expected to grow. There is a customerId field in the table. The sync usually happens per customerId, by passing it to the API.
My problem is that steps 3 and 4 above take a lot of time. The queries are something like:
Step 3 --> delete from foo where customer_id=12345678 and flag=1
Step 4 --> update foo set flag=1 where customer_id=12345678
I have tried partitioning the table by customer_id, and it works well when a customer_id has few rows, but for some customer_id values a single partition holds up to ~5 million rows.
Around 90% of data doesn't change between two syncs. How can I make this fast?
I was thinking of using only update queries instead of insert queries, and then checking whether anything was updated. If not, I can issue an insert query for that row. This way any updates will be taken care of along with the inserts. But I am not sure whether this will block read queries while the update is in progress.
For your setup (read only data, full sync), the fastest way to update the table is to not update at all, but to import the data into a different table and to rename it afterwards to make it the new table.
Create a table like your original table, e.g. use
create table foo_import like foo;
If you have e.g. triggers, add them too.
From now on, let the import api write its (full) sync to this new table.
After a sync is done, swap the two tables:
RENAME TABLE foo TO foo_tmp,
             foo_import TO foo,
             foo_tmp TO foo_import;
It will (literally) just require a second.
This command is atomic: it will wait for transactions that access these tables to finish, it will not present a situation where there is no table foo and it will completely fail (and not do anything) if one of the tables doesn't exist or foo_tmp already exists.
As a final step, empty your import table (that now contains your old data) to be ready for your next import:
truncate foo_import;
This will again just require a second.
The rest of your queries probably assume that flag=1. Until (if at all) you update the code to not use the flag anymore, you can set its default value to 1 to keep it compatible, e.g. use
alter table foo modify column flag tinyint default 1;
Since you don't have foreign keys, it doesn't have to bother you, but for others with a similar problem it might be useful to know that foreign keys will get adjusted, so foreign keys that are referencing foo will reference foo_import after renaming the tables. To make them point to the new table foo again, they have to be dropped and recreated. Everything else (e.g. views, queries, procedures) will resolve by the current name, so they will always access the current foo.
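The swap-and-empty cycle described in this answer can be tried out in a small sketch. It is written in Python with SQLite as a stand-in, which is an assumption made purely so the example is self-contained: SQLite has no multi-table RENAME TABLE and no TRUNCATE, so three renames and a DELETE inside one transaction play the role of MySQL's single atomic RENAME TABLE followed by TRUNCATE. The function names and the payload column are hypothetical.

```python
import sqlite3


def swap_in_import(conn: sqlite3.Connection) -> None:
    """Swap foo and foo_import, then empty the (now old) import table."""
    # One transaction stands in for MySQL's atomic multi-table RENAME TABLE.
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE foo RENAME TO foo_tmp")
        conn.execute("ALTER TABLE foo_import RENAME TO foo")
        conn.execute("ALTER TABLE foo_tmp RENAME TO foo_import")
        conn.execute("DELETE FROM foo_import")  # stand-in for TRUNCATE
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def demo() -> list:
    # isolation_level=None: transactions are controlled explicitly above.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE foo (customer_id INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE foo_import (customer_id INTEGER, payload TEXT)")
    conn.execute("INSERT INTO foo VALUES (12345678, 'old')")
    # The full sync writes into foo_import while readers keep using foo.
    conn.executemany(
        "INSERT INTO foo_import VALUES (?, ?)",
        [(12345678, "new-1"), (12345678, "new-2")],
    )
    swap_in_import(conn)
    live = conn.execute("SELECT payload FROM foo ORDER BY payload").fetchall()
    leftover = conn.execute("SELECT COUNT(*) FROM foo_import").fetchone()[0]
    conn.close()
    return [live, leftover]
```

After `demo()` runs, `foo` holds only the freshly imported rows and `foo_import` is empty again, ready for the next sync.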
CREATE TABLE new LIKE real;
Load `new` by whatever means you have; take as long as needed.
RENAME TABLE real TO old, new TO real;
DROP TABLE old;
The RENAME is atomic and "instantaneous"; real is "always" available.
(I don't see the need for flag.)
OR...
Since you are actually updating a chunk of a table, consider these...
If the chunk is small...
Load the new data into a tmp table
DELETE the old rows
INSERT ... SELECT ... to move the new rows in. (Having the new data already in a table is probably the fastest way to achieve this.)
If the chunk is big, and you don't want to lock the table for "too long", there are some other tricks. But first, is there some form of unique row number for each row for the customer? (I'm thinking about batch-moving a bunch of rows at a time, but need more specifics before spelling it out.)
I have a cronjob that loops through and updates a MySQL table row by row. After the table is 'completed', I would like to execute the cronjob exactly 1 more time, to perform various cleanup activities.
In "execute a cronjob exactly once", thaJeztah states:
It's best to set that value in the mysql database, e.g. needs_cleanup = 1. That way you can always find those records at a later time. Keeping it in the database allows you to recover, for example, if a cron-job wasn't executed or failed halfway through the loop. – thaJeztah
I think this would be a good solution if it's possible, as in my case I only need to set the flag once a day. If it is possible, could someone point me to the SQL commands necessary to set a simple binary flag, with values 0 and 1, in a MySQL table?
UPDATE mytable SET needs_cleanup = 1
does it for all records of mytable. If you need for a single record, add a WHERE condition, e.g.
UPDATE mytable SET needs_cleanup = 1
WHERE id = 1
Using a MySQL DB, I am having trouble with a stored procedure and event timer that I created.
I made an empty table that gets populated with data from another via SELECT INTO.
Prior to populating, I TRUNCATE the current data. It's used to track only log entries that occur within 2 months from the current date.
This turns a 350k+ log table into about 750 which really speeds up reporting queries.
The problem is that if a client sends a query precisely between the TRUNCATE statement and the SELECT INTO statement (which has a high probability considering the EVENT is set to run every 1 minute), the query returns no rows...
I have looked into locking the table for reads while this PROCEDURE is run, but locks are not allowed in STORED PROCEDURES.
Can anyone come up with a workaround that (preferably) doesn't require a remodel?
I really need to be pointed in the right direction here.
Thanks,
Max
I'd suggest an alternate approach instead of truncating the table, and then selecting into it...
You can instead select your new data set into a new table. Next, using a single RENAME command, rename the new table to the existing table and the existing table to some backup name.
RENAME TABLE existing_table TO backup_table, new_table TO existing_table;
This is a single, atomic operation... so it wouldn't be possible for the client to read from the data after it is emptied but before it is re-populated.
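Alternatively, you could change your TRUNCATE to a DELETE FROM and wrap it in a transaction together with the statement that repopulates the table. In MySQL that statement is INSERT INTO ... SELECT, since SELECT ... INTO a table is not supported, and the table needs a transactional storage engine such as InnoDB:
START TRANSACTION;
DELETE FROM YourTable;
INSERT INTO YourTable SELECT ...;
COMMIT;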
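To see why the transactional variant closes the gap, here is a rough sketch in Python using SQLite as a stand-in (an assumption for the sake of a self-contained example; InnoDB reaches a similar result through different locking and versioning mechanics). A second connection plays the reporting client and counts rows at each stage of the refresh. The table names `log` and `recent_log`, the `age_days` column, and the 60-day cutoff are all hypothetical.

```python
import os
import sqlite3
import tempfile


def refresh_recent_log(path: str) -> dict:
    """Refresh recent_log in one transaction while a second connection reads it."""
    writer = sqlite3.connect(path, isolation_level=None)
    reader = sqlite3.connect(path, isolation_level=None)
    writer.execute("CREATE TABLE log (id INTEGER, age_days INTEGER)")
    writer.execute("CREATE TABLE recent_log (id INTEGER, age_days INTEGER)")
    writer.executemany("INSERT INTO log VALUES (?, ?)", [(1, 5), (2, 30), (3, 400)])
    writer.execute("INSERT INTO recent_log VALUES (1, 5)")  # previous refresh

    def count() -> int:
        # fetchall() finishes the statement so the reader holds no lock afterwards.
        return reader.execute("SELECT COUNT(*) FROM recent_log").fetchall()[0][0]

    seen = {"before": count()}
    writer.execute("BEGIN")
    writer.execute("DELETE FROM recent_log")
    seen["after_delete"] = count()  # uncommitted delete is not visible
    writer.execute(
        "INSERT INTO recent_log SELECT id, age_days FROM log WHERE age_days <= 60"
    )
    seen["after_insert"] = count()  # still the old contents
    writer.execute("COMMIT")
    seen["after_commit"] = count()  # new contents appear all at once
    writer.close()
    reader.close()
    return seen


with tempfile.TemporaryDirectory() as tmp:
    observed = refresh_recent_log(os.path.join(tmp, "demo.db"))
```

In this sketch the reader never observes an empty table: it sees the old contents until the commit and the new contents afterwards.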
I have a data transfer tool that transfers information from one database to another. Every hour it will issue an UPDATE on all the rows in a table. I already have an INSERT trigger to dump the data from that one table into a number of other tables. I added an UPDATE trigger to edit the other tables, but the extra processing is making the entire UPDATE process run slowly.
I'd like to wrap the body of the UPDATE trigger in an IF statement that compares the old and new rows, and skips processing if nothing has changed. Is it possible to compare an entire row against another, like this?
IF new = old THEN ...
Or is there no other option than to check each column individually?
If speed is the issue here, I would either save a timestamp of when it was last edited or a checksum.
Using the latter approach, if you have a table with three columns A, B and C, I would modify the schema to also include a new column, cksum.
Whenever you insert something, you would store in cksum a value generated using a fast hashing algorithm, for instance MD5. This checksum could be something like
cksum = MD5(CONCAT_WS('|', A, B, C));
This way, whenever you have to write a row, you would only have to compare with the cksum field.
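The same idea can also be applied on the side of the tool that sends the data, before the UPDATE is even issued. The following is a minimal sketch in Python under that assumption; `row_checksum` and `needs_processing` are hypothetical helper names, and the length-prefixed encoding is one possible way to keep different column splits from hashing to the same value.

```python
import hashlib
from typing import Iterable, Optional


def row_checksum(values: Iterable[Optional[object]]) -> str:
    """Return an MD5 hex digest over the column values of one row."""
    h = hashlib.md5()
    for value in values:
        if value is None:
            # Mark NULL explicitly so it differs from an empty string.
            h.update(b"N")
        else:
            data = str(value).encode("utf-8")
            # Length prefix keeps ('ab', 'c') distinct from ('a', 'bc').
            h.update(b"V" + len(data).to_bytes(8, "big") + data)
    return h.hexdigest()


def needs_processing(stored_cksum: str, new_row: Iterable[Optional[object]]) -> bool:
    """True when the incoming row differs from what the stored checksum describes."""
    return row_checksum(new_row) != stored_cksum


# Example: compare an incoming row against the checksum saved with the old row.
stored = row_checksum(["a", "b", None])
unchanged = needs_processing(stored, ["a", "b", None])
changed = needs_processing(stored, ["a", "b", "x"])
```

Only rows for which `needs_processing` returns True would then need the expensive follow-up work.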
Sadly, no, you're going to need to compare each column individually. Probably not the answer you were hoping for.