Pentaho Kettle insert Error Handling of step - mysql

I am new to GeoKettle (Spoon) of Pentaho and I am currently inserting rows from an Excel file into my database. Now I want to avoid duplicates in my database table. That is why I want to insert only those rows which aren't there yet, so that the table contains only unique records.
And as far as I know, there are two ways to realize that. The first way I tried was with the Insert/Update step (I have disabled the Update functionality) and defined all the columns which have to be equal in order to insert the record or not. But it does not work. All records are still inserted into the database.
That is why I am now trying the (according to Pentaho) much faster option, which is a "Table Output" step with an "Update" step as its error handling, as shown in the picture.
As shown in the picture, the arrow pointing from "Table Output" to "Update" is black. But I need a red dotted one for the error handling of the step, and I do not know how to create it. In tutorials I often see a little window pop up with 2 options, like in the picture:
But I do not get that popup. To create a hop, I have to select both steps and right-click on one of them.
So in which possible ways can I create such a red dotted arrow? In the end, it has to look like this:
Thank you so much in advance!!

You have a problem with your setup, or with your version of PDI. The error handling step functionality was introduced in V4 but was fully implemented for all steps only around V6.
Download a fresh PDI from SourceForge. V7.1 is a really robust and stable edition. Unzip and test.
By the way, what you want to achieve is known as the CRUD pattern (CRUD for Create, Read, Update, Delete). The step doing this is Merge Rows (diff), in the Joins family. You tell the step which columns to check, and it produces a new column with the value identical, changed, new, or deleted. You can then redirect the flow with a Switch / Case step to do the appropriate action. Further information here (V4).
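To illustrate what such a flag column contains, here is a minimal Python sketch of the comparison idea. It is not Kettle's implementation; the function name merge_rows_diff and the sample data are made up for the example, and rows are assumed to be identified by a single key.

```python
from typing import Dict, Hashable, Tuple

Row = Tuple  # non-key column values of one record


def merge_rows_diff(reference: Dict[Hashable, Row],
                    incoming: Dict[Hashable, Row]) -> Dict[Hashable, str]:
    """Flag every key as identical, changed, new, or deleted."""
    flags = {}
    for key, values in incoming.items():
        if key not in reference:
            flags[key] = "new"
        elif reference[key] == values:
            flags[key] = "identical"
        else:
            flags[key] = "changed"
    for key in reference:
        if key not in incoming:
            flags[key] = "deleted"
    return flags


# Hypothetical data: rows already in the table vs. rows read from the Excel file.
in_database = {1: ("Berlin",), 2: ("Paris",), 3: ("Rome",)}
from_excel = {2: ("Paris",), 3: ("Roma",), 4: ("Oslo",)}

flags = merge_rows_diff(in_database, from_excel)
# Only rows flagged "new" would be routed to the insert.
rows_to_insert = [key for key, flag in flags.items() if flag == "new"]
```

For the original question, only the rows flagged new would be sent on to the Table Output step; the other flags can be ignored or handled separately.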

Related

Making sure that a table is constructed correctly

I have a schema of a database and a web application. I want to have the web application be able to select, insert and remove rows to a table, but the table may not exist, maybe in a testing environment, and the table may be missing columns, most likely because the web application has updated.
I want to be able to make sure that the table is ready to accept the data that the web application sends to it during the time the application is alive.
The idea I had is the application (written in Java) will have a table structure embedded into it, and when the application starts, just copy all of the data in the table (if it exists) to a temporary table, delete the old table and make a new one with the temporary table's data, and then drop the temporary table. As you can tell, it's nowhere near innovative.
Another idea I had is use the SHOW COLUMNS command to correct any missing columns parallel with the SHOW TABLES LIKE to check if it exists, but I feel like Stack Overflow would've had a better solution. Is that all I can do?
There are many ways to solve the problem of consistency of the database version and the version of the application.
However, in the production database, this situation is unacceptable.
I think that the simplest ways are the best.
To ensure such compliance, it is enough to execute a script that updates the database before performing the testing.
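-- Note: in MySQL, DROP TABLE and CREATE TABLE commit implicitly,
-- so the transaction does not make this sequence atomic.
START TRANSACTION;
DROP TABLE IF EXISTS ...;
CREATE TABLE ...
COMMIT;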
Remember the IF EXISTS clause, and make sure the database user has the DROP privilege!
Such a script can be easily managed by placing it in RCS and controlling the version number needed in the application.
You can also save this version number in a table in the database itself and check, when the application starts, whether the number matches the expected one; if it does not, call the database update script.
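A minimal sketch of that startup check, written in Python with SQLite as a stand-in database so it can run anywhere. The table name schema_version, the MIGRATIONS dictionary and the items table are assumptions made for this example, not part of the question.

```python
import sqlite3

EXPECTED_SCHEMA_VERSION = 2  # hypothetical version the application was built against

# Hypothetical upgrade scripts, keyed by the version they upgrade *to*.
MIGRATIONS = {
    1: "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)",
    2: "ALTER TABLE items ADD COLUMN price REAL",
}


def current_version(conn):
    """Return the stored schema version, or 0 for a fresh database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def ensure_schema(conn):
    """Apply every missing upgrade script in order and record each version."""
    version = current_version(conn)
    for target in range(version + 1, EXPECTED_SCHEMA_VERSION + 1):
        conn.execute(MIGRATIONS[target])
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (target,))
    conn.commit()
    return current_version(conn)


conn = sqlite3.connect(":memory:")
final_version = ensure_schema(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(items)")]
```

Calling ensure_schema a second time applies nothing, because the stored version already matches the expected one.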
Have a look at JPA and Hibernate. There is an hbm2ddl.auto property, and it looks like its "update" option does what you want.
For more details, see:
What are the possible values of the Hibernate hbm2ddl.auto configuration and what do they do

Preserving data integrity in a multi-step application

I'm currently working on a PHP web application with Symfony 2/Doctrine and MySQL as the DBMS.
I have multiple steps (about 12). At the end of each step, I store some data in the database and go to the next step, and so on.
The user can return to a specific step with a 'go back' button. If he decides to do that, I need to update my stored data. For example, if a user is in step 6 and he returns to step 1, I need to clear some of my columns values.
My SQL model is light: 3 tables, one of which has a state column that keeps the current step (step 1, step 2, etc.). I don't know how to implement this.
Maybe it's a good idea to create stored procedures and call them before each save. As I imagine it, the stored procedure would clean up my tables (perform an update) to restore the data to a given step.
Any ideas?
Thanks
This sounds like an app design problem. If you are working with a framework, my advice is to stay away from stored procedures and use your framework/DMS to interact with the database.
My suggestion would be to use a state machine. You need to:
1) Define all your steps
2) Define all possible transitions from one step to another
If you tell us more about the context of your problem, we might be able to give you better advice. There are some great implementations of the state pattern for some frameworks.
For Symfony 2, I found these libraries:
https://github.com/yohang/Finite
https://github.com/winzou/StateMachineBundle
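The state machine idea can be sketched independently of any library. The Python sketch below is only an illustration: the four steps, the column names and the rule "going back clears everything stored after the target step" are assumptions, and the record is a plain dictionary rather than a Doctrine entity.

```python
# Hypothetical wizard with 4 steps; each step owns the columns it fills in.
STEP_COLUMNS = {
    1: ["name"],
    2: ["address"],
    3: ["payment"],
    4: ["confirmation"],
}


def allowed_transitions(step):
    """From a step you may advance by one or go back to any earlier step."""
    targets = set(range(1, step))
    if step + 1 in STEP_COLUMNS:
        targets.add(step + 1)
    return targets


def go_to(record, target):
    """Move the record to `target`, clearing data stored by steps after it."""
    current = record["state"]
    if target not in allowed_transitions(current):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    if target < current:
        for step in range(target + 1, current + 1):
            for column in STEP_COLUMNS[step]:
                record[column] = None
    record["state"] = target
    return record


# The user is on step 3 (steps 1 and 2 are saved) and goes back to step 1.
record = {"state": 3, "name": "Ann", "address": "Main St",
          "payment": None, "confirmation": None}
go_to(record, 1)
```

Keeping the transition table and the per-step columns in one place means the clean-up rule lives in application code, which is the reason for avoiding stored procedures here.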

Sybase to MySQL automatic exportation

I have two databases: Sybase and MySQL. I need to export records to MySQL when they are inserted in Sybase, or export them in some scheduled event.
I've tried the OUTPUT statement, but it cannot be used in triggers or procedures.
Any suggestion to solve this problem?
(Disclaimer: I've done similar things previously, but by no means would I consider the answer below state of the art; it is just one possible approach. Google something like 'cross-database replication' or 'cross rdbms replication' to see who's done this before.)
First of all, I would see whether an ETL tool can do the job without too much work. There are free open source ones, and even things like Microsoft SSIS might work on non-MS databases.
If not, I would split this into different steps.
1) Find an appropriate Sybase output command that exports a subset of rows from one or more tables. By subset I mean you need to be able to add a WHERE clause, not just do a full table dump.
2) Use an appropriate MySQL import script/command to load the data gotten out of step #1. You may need to cycle back and forth between the 2 till you have something that works manually.
3) Write a Sybase trigger to insert lookup keys into a to-export table. You want to store at least the tablename & source Sybase table's keys for each inserted row. Use column names like key1_char, key2_char, not the actual column names, that makes it easier to extend to other source tables as needed. Keep trigger processing as light as possible. What about updates btw?
4) Write a scheduled batch on Sybase side to run step #1 for the rows flagged in #3.
5) Write a scheduled batch on MySQL to import, via #2, the results of #4. Or kick it off from #4.
Another approach is to do the #3 flagging bit as needed, but use it to drive one scheduled batch that SELECTs data from Sybase and INSERTs it into MySQL directly.
You'll have to pick up the data from Sybase's SELECT and bind it manually to MySQL's INSERT. But you probably get finer control over what's going on, and you don't have to juggle 2 batches. That's what I think a clever ETL tool would already be doing on your behalf. Any half-clever scripting language like PHP, Python or Ruby ought to handle it easily. This is especially important if you have things like surrogate/auto-generated keys.
Keep in mind that in both cases you'll have to either delete the to-export rows that you've successfully inserted or flag them as done.
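Here is a rough Python sketch of that second approach. Two in-memory SQLite databases stand in for Sybase and MySQL so the example is runnable; with the real databases you would use their own drivers and trigger syntax. The customers table, the done flag and the export_batch function are invented for the illustration.

```python
import sqlite3

# Two in-memory SQLite databases stand in for Sybase (source) and MySQL (target).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE to_export (
    tablename TEXT NOT NULL,
    key1_char TEXT NOT NULL,
    done      INTEGER NOT NULL DEFAULT 0
);
-- Light trigger: it only records which row needs exporting.
CREATE TRIGGER customers_ai AFTER INSERT ON customers
BEGIN
    INSERT INTO to_export (tablename, key1_char) VALUES ('customers', NEW.id);
END;
""")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")


def export_batch():
    """Copy flagged rows to the target, then flag them as done."""
    pending = source.execute(
        "SELECT rowid, key1_char FROM to_export "
        "WHERE tablename = 'customers' AND done = 0"
    ).fetchall()
    for queue_id, key in pending:
        row = source.execute(
            "SELECT id, name FROM customers WHERE id = ?", (int(key),)
        ).fetchone()
        target.execute("INSERT INTO customers (id, name) VALUES (?, ?)", row)
        target.commit()
        # Flag only after the target insert has been committed.
        source.execute("UPDATE to_export SET done = 1 WHERE rowid = ?", (queue_id,))
        source.commit()
    return len(pending)


source.execute("INSERT INTO customers (name) VALUES ('Ann')")
source.execute("INSERT INTO customers (name) VALUES ('Bob')")
source.commit()
first_run = export_batch()
second_run = export_batch()
```

The second run exports nothing because every queued row is already flagged, which is what makes it safe to run the batch on a schedule.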

SSIS OLE DB conditional "insert"

I have no idea whether this can be done or not, but basically, I have the following data flow:
1) Extracts the data from an XML file (works fine)
2) Simply splits the records based on an enclosed condition (works fine)
3) Had to add a derived column object due to some character set issues (might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer: I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange. I took the lookup object out to see whether it was causing it, and that seems to be the case. Is there any reason why it would run the entire flow twice with the addition of the lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no match output, and then connect only the match found connector to the "Navigator Staging Manager Funds" destination.
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is that the lookup would go against the existing destination, so the lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so the row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row.
If that's the case, there is a specially designed transformation, the OLE DB Command. It is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
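To make the staging pattern concrete, here is a small Python sketch with SQLite standing in for the real database. It is not SSIS; it only mirrors the logic (lookup split, staging table, one set-based UPDATE). The table names funds and funds_staging and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE funds (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE funds_staging (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO funds VALUES (10, 'Old name A'), (20, 'Old name B');
""")

# Hypothetical rows parsed from the XML file.
incoming = [(10, "New name A"), (20, "New name B"), (30, "Unmatched"), (40, "Unmatched")]

# Lookup: split the incoming rows into match / no-match outputs.
existing_ids = {row[0] for row in conn.execute("SELECT id FROM funds")}
matched = [row for row in incoming if row[0] in existing_ids]
unmatched = [row for row in incoming if row[0] not in existing_ids]

# Matched rows go to the staging table instead of one UPDATE per row.
conn.executemany("INSERT INTO funds_staging (id, name) VALUES (?, ?)", matched)

# One set-based UPDATE after the data flow has finished.
conn.execute("""
UPDATE funds
SET name = (SELECT s.name FROM funds_staging AS s WHERE s.id = funds.id)
WHERE id IN (SELECT id FROM funds_staging)
""")
conn.commit()
result = conn.execute("SELECT id, name FROM funds ORDER BY id").fetchall()
```

In the package, the final statement would be the one run by the Execute SQL Task, written in the destination database's own UPDATE syntax.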
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.

How to version control data stored in mysql

I'm trying to use a simple MySQL database but tweak it so that every field is backed up, up to an indefinite number of versions. The best way I can illustrate this is by replacing each and every field of every table with a stack of all the values this field has ever had (each of these values should be timestamped). I guess it's kind of like having customized version control for all my data.
Any ideas on how to do this?
The usual method for "tracking any changes" to a table is to add insert/update/delete trigger procedures on the table and have those records saved in a history table.
For example, if your main data table is "ItemInfo" then you would also have an ItemInfo_History table that got a copy of the new record every time anything changed (via the triggers).
This keeps the performance of your primary table consistent, yet gives you access to the history of any changes if you need it.
Here are some examples, they are for SQL Server but they demonstrate the logic:
My Repository table
My Repository History table
My Repository Insert trigger procedure
My Repository Update trigger procedure
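As a self-contained illustration of the same logic, here is a Python sketch that uses SQLite triggers. It reuses the ItemInfo and ItemInfo_History names from above; the columns, the action column and the trigger names are assumptions, and MySQL or SQL Server trigger syntax differs in the details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemInfo (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE ItemInfo_History (
    id INTEGER, name TEXT, price REAL,
    action TEXT NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Each trigger copies the affected row into the history table.
CREATE TRIGGER ItemInfo_ai AFTER INSERT ON ItemInfo
BEGIN
    INSERT INTO ItemInfo_History (id, name, price, action)
    VALUES (NEW.id, NEW.name, NEW.price, 'insert');
END;
CREATE TRIGGER ItemInfo_au AFTER UPDATE ON ItemInfo
BEGIN
    INSERT INTO ItemInfo_History (id, name, price, action)
    VALUES (NEW.id, NEW.name, NEW.price, 'update');
END;
CREATE TRIGGER ItemInfo_ad AFTER DELETE ON ItemInfo
BEGIN
    INSERT INTO ItemInfo_History (id, name, price, action)
    VALUES (OLD.id, OLD.name, OLD.price, 'delete');
END;
""")

conn.execute("INSERT INTO ItemInfo (id, name, price) VALUES (1, 'Widget', 9.5)")
conn.execute("UPDATE ItemInfo SET price = 12.0 WHERE id = 1")
conn.execute("DELETE FROM ItemInfo WHERE id = 1")
conn.commit()

history = conn.execute(
    "SELECT id, price, action FROM ItemInfo_History ORDER BY rowid"
).fetchall()
```

The main table ends up empty after the delete, while the history table still holds one timestamped row per change.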
Hmm, what you're talking about sounds similar to Slowly Changing Dimension.
Be aware that version control on arbitrary database structures is officially a rather Hard Problem. :-)
A simple solution would be to add a version/revision field to the tables, and whenever a record is updated, instead of updating it in place, insert a copy with the changes applied and the version number incremented. Then when selecting, always choose the record with the latest version. That's roughly how most such schemes are implemented (e.g. Wikimedia does it pretty much this exact way).
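A minimal sketch of that scheme in Python, with SQLite as a stand-in for MySQL. The pages table and the save and load_latest functions are invented for the example, and it ignores concurrent writers, which a real implementation would have to handle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE pages (
    page_id  INTEGER NOT NULL,
    version  INTEGER NOT NULL,
    body     TEXT,
    saved_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (page_id, version)
)
""")


def save(page_id, body):
    """Insert a new version instead of updating the record in place."""
    latest = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM pages WHERE page_id = ?", (page_id,)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO pages (page_id, version, body) VALUES (?, ?, ?)",
        (page_id, latest + 1, body),
    )
    conn.commit()
    return latest + 1


def load_latest(page_id):
    """Return (version, body) of the newest version of a page."""
    return conn.execute(
        "SELECT version, body FROM pages WHERE page_id = ? "
        "ORDER BY version DESC LIMIT 1",
        (page_id,),
    ).fetchone()


save(1, "first draft")
save(1, "second draft")
latest = load_latest(1)
all_versions = conn.execute(
    "SELECT version, body FROM pages WHERE page_id = 1 ORDER BY version"
).fetchall()
```

Older versions stay in the table with their timestamps, so any earlier value can still be read back.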
Maybe a tool can do that for you. Have a look at nextep designer:
https://github.com/christophefondacci/nextep-designer
With this IDE you will be able to take snapshots of your database structure and data and put it under version control. After this you can compute the differences between any 2 versions and generate the appropriate SQL that can insert / update / delete your data.
Maybe this is an alternative way to achieve what you wanted.