Waiting in pentaho kettle-spoon - open-source

I am new to the Pentaho Kettle and I want to do multiple operations in a Transformation.
Firstly I am
inserting data from a text file to a main table.
Loading some of the columns from the main table to a 2nd table based on some conditions.
But the problem is only after completing the first step I have to do the 2nd step. Because for the 2nd step i need the 1st step to be completed.
I can say that my first step is taking almost 20 mins..
Also in that same transformation i have to do other data loading from different table too..
I don't know kettle is providing a dedicated option to perform that like any switches or something like that.I have searched a lot in web but I didn't got any ...
So can anyone help me in solving the problem.

that's what exactly the "Blocking Step" do, give a try.
http://www.nicholasgoodman.com/bt/blog/2008/06/25/ordered-rows-in-kettle/

Or split your transform into multiple transforms and orchestrate them in a Job. If your transforms are simple, I would tend towards using the blocking steps. But using them too much I find makes the transforms messy and complex. Wrapping Transforms in Jobs usually gives you more control.
Brian

Related

Data pipeline proposal

Our product has been growing steadily over the last few years and we are now on a turning point as far as data size for some of our tables is, where we expect that the growth of said tables will probably double or triple in the next few months, and even more so in the next few years. We are talking in the range of 1.4M now, so over 3M by the end of the summer and (since we expect growth to be exponential) we assume around 10M at the end of the year. (M being million, not mega/1000).
The table we are talking about is sort of a logging table. The application receives data files (csv/xls) on a daily basis and the data is transfered into said table. Then it is used in the application for a specific amount of time - a couple of weeks/months - after which it becomes rather redundant. That is: if all goes well. If there is some problem down the road, the data in the rows can be useful to inspect for problem solving.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows move them 'somewhere else'.
We currently use MySQL as a database and the 'somewhere else' could be the same, but can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here. It's just some tables where the Master table would need to become shorter and the Slave only bigger, not a one-on-one sync.
The main requirement for the secondary store would be that the data should be easy to inspect/query when need to, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain text format, since that is not as easy to inspect. The logs will then be somewhere on S3 so we would need to download it, grep/sed/awk on it... We'd much rather have something database like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything we prefer to have the simplest solution possible. It's not that we don't want Apache Kafka (example), but then we'd have to learn it, install it, maintain it. Every new piece of technology adds onto our stack, the lighter it remains the more we like it ;).
Thanks!
PS: we are not just being lazy here, we have done some research but we just thought it'd be a good idea to get some more insight in the problem.

SSIS Data Flow: duplicated rule problem after lookup

I have a data flow that I need to get a column value from 'SQL tableA' and do a lookup task in 'SQL tableB' using this column value. If the lookup found a connection between the two tables, I need to get the value of another column from 'SQL tableA' and put this value in 'SQL tableC'( the table that will be persisted ). If lookup fail, this column value will be NULL.
My problem: After this behavior above, the rest of my flow is the same. So I have two duplicated equal flows below lookup. And this is terrible for readability and maintenance.
What do I can do to resolve this situation with little performance loss?
The data model is legacy, so change the data model is impossible.
Best Regards,
Luis
The way I see it, there are really three options:
Use UNION ALL and possibly sacrifice performance for modularity. There may in fact be no performance issue. You should test and see
If possible, implement all of this in a stored procedure. You can implement code reuse there and it will quite possibly run much faster
Build a custom transformation component that implements those last three steps.
This option appeals to all programmers but may have the worst performance and in my opinion will just cause issues down the track. If you're writing reams of C# code inside SSIS then you'll eventually reach a point where it's easier to just build a standalone app.
It would be much easier to answer if you explained
What you're really doing
slowly changing dimension?
data cleansing?
adding reference data?
spamming
What are those three activities?
sending an email?
calling a web service?
calling some other API?
What your constraints are
Is all of this data on one server and can you create stored procs and tables?

DB design question - multiplayer game

I am new to DB design. I am trying to write a board game (4 players max) and was trying to come up with a way to communicate moves among each other.
I am using a DB for this as per suggestions on stackoverflow.
My problem is this - When player A makes a move that move has to be read by B,C and D. So the fact that A made the move needs to be communicated to B,C and D. I am doing it the following way. Please tell me if there is a better way to do it. To me it seems all wrong and incredibly flaky.
I have a table with the following fields -
gameId, userMove, flagA, flagB, flagC, flagD
So when A makes the move I write among other things - (flagA=0, flagB=1, flagC=1, flagD=1)
When B,C or D read A's move they decrement their corresponding flag.
A will not update the table unless all flags are 0.
Same thing happens when others make their moves.
Comments? There has to be a better way for this. The things I am seeing wrong here -
I am looping on a select until all flags are 0 for A
I am looping on a select until the flag for the corresponding user is set to read the move.
That is a lot of server load and client timeouts I need to worry about.
I hope I have been able to explain my problem clearly. Please ask questions if needed.
Any help is appreciated.
EDIT: The game is web based (runs in a browser) and I am using php for the server side development and so I cannot use an in-memory cache though I would have loved to do that if possible.
Thanks,
- Pav
If the players of your game will be interacting with one game server during a single game session, it looks like you can keep all that state in memory.
Databases are great for durable storage of data with guarantees for atomicity, consistency and integrity. However, you don't seem to need any of these features for the temporal state that you are describing.
If flagA,B,C and D are all bits you might consider putting them all into one column and treating that column as a bit mask.
This will allow one column to control all flags. It can make your selects and updates much cleaner.
Read up on using bitmasks here:
http://www.gordano.com/kb.htm?q=966
http://en.wikipedia.org/wiki/Mask_%28computing%29
Have you considered usng a file to store the info?

Loading Hybrid Dimension Table with SCD1 and SCD2 attributes + SSIS

I am just in a process of starting a new task, wherein in i need to load Hybrid Dimension Table with SCD1 and SCD2. This need to be achieved as a SSIS Package. Can someone guide what would be the best way dealing this in SSIS, should i used SCD component or there is other way? What are the best practices for this.
For SCD2 type, am using Merge statement.
Thanks
That's a can of worms :)
There are basically four ways to handle SCDs in SSIS:
1. Using the built-in SCD component
2. "Rolling your own" using Lookups, Conditional Splits, Derived Columns, and various destinations.
3. Using T-SQL MERGE
4. Using the third party Kimball SCD component
I'll alert you to my bias towards #4 - I wrote it. But here's my analysis of the bunch.
1 is a good solution for "small" and "easy" dimensions. Why is it good? It's understandable, handles SCD 1 and 2, and is easy to set up. But why only "small" and "easy" dimensions? Because it uses an internal uncached lookup (RBAR) that can't be improved. Because if you change anything in it (re-run the wizard), it destroys any changes you've made to the data flow. And because it won't handle rows where case sensitivity isn't important, or trailing spaces aren't important.
2 is a good solution for larger dimensions. It's good because it performs pretty well, and is "well documented" in that you can see exactly what it's doing from the names of the components you use and how they're put together. It's also easy to manipulate and change how it operates. The downside is that it takes time to set up and test.
3 is a good solution for huge dimensions. It usually outperforms all other alternatives. But that's about all it has going for it. It's very complex to code, and not very understandable without tons of comments.
4 is a good solution for just about any size except maybe "huge" dimensions. It's "easy" to use like the stock SCD component, performs as good or better than 2, and is as configurable as 2.
More info on 4 here.

Pathing in a non-geographic environment

For a school project, I need to create a way to create personnalized queries based on end-user choices.
Since the user can choose basically any fields from any combination of tables, I need to find a way to map the tables in order to make a join and not have extraneous data (This may lead to incoherent reports, but we're willing to live with that).
For up to two tables, I already managed to design an algorithm that works fine. However, when I add another table, I can't find a way to path through my database. All tables available for the personnalized reports can be linked together so it really all falls down to finding which path to use.
You might be able to try some form of an A* algorithm. Basically this looks at each of the possible next options to choose and applies a heuristic to it, a function that determines roughly how far it is between this node and your goal. It then chooses the one that is closer and repeats. The hardest part of implementing A* is designing a good heuristic.
Without more information on how the tables fit together, or what you mean by a 'path' through the tables, it's hard to recommend something though.
Looks like it didn't like my link, probably the * in it, try:
http://en.wikipedia.org/wiki/A*_search_algorithm
Edit:
If that is the whole database, I'd go with a depth-first exhaustive search.
I thought about using A* or a similar algorithm, but as you said, the hardest part is about designing the heuristic.
My tables are centered around somewhat of a backbone with quite a few branches each leading to at most a single leaf node. Here is the actual map (table names removed because I'm paranoid). Assuming I want to view data from tha A, B and C tables, I need an algorithm to find the blue path.