Informatica PowerCenter pipelines to Azure Data Factory

I am trying to move my Informatica pipelines (PowerCenter 10.1) to Azure Data Factory/Synapse pipelines. Other than rewriting them from scratch, is there a way to migrate them? I am not finding any tools to achieve this either. Has anyone faced this problem? Any leads on how to proceed would be appreciated.
Thanks

There is no out-of-the-box solution for this migration. Unfortunately, you will have to author the pipelines again.

Informatica PowerCenter pipelines are a physical implementation of an Extract, Transform, Load (ETL) process. Each provider takes a different approach to the implementation, and they do not necessarily map well from one to another. Core Azure Data Factory (ADF) is actually more suited to Extract, Load, Transform (ELT), unless of course you use Data Flows.
So what you have to do is:
map out physically what your current pipeline is doing, if you don't have that documentation already. A simple spreadsheet template mapping out the components of the existing pipeline, tracking source, target and any transformations, will suffice
logically map out what the pipeline is doing; i.e. without using PowerCenter-specific terminology, lay out what the "as is" pipeline is doing. A data flow diagram is a great way to do this
logically map out what the "to be" pipeline should do; i.e. without using any ADF-specific terminology, attempt to refine the "as is" pipeline to its simplest form
using expert knowledge of the ADF components (e.g. Copy, Lookup, Notebook, Stored Proc to name but a few), map from the logical "to be" to the physical (in the loosest sense of the word, it's all cloud now right : ). For example, move data from place to place with the Copy activity, transform data in a SQL database using the Stored Proc activity, implement a repeated activity with a For Each loop (bear in mind these execute in parallel), and do sophisticated transformations or processing with Databricks notebooks if required. If you require a low-code approach, consider Data Flows.
So you can see it's just a few simple steps. Good luck!

Related

Boomi integration - Dynamically inject mapping information

We are in the process of evaluating integration solutions and comparing Mule and Boomi.
The use case is to read an Excel file, map the columns to a pre-defined set of JSON attributes, and then use the JSON to insert records into a database. The mapping may vary from one Excel template to another, since the column names in one Excel file may differ from those in another.
How do I inject mapping information (source vs target) from outside the integration flow?
Note: In Mule, I'm able to do that using a mapping variable (value is JSON) that I inject using the Mule DataWeave language.
Boomi's mapping component is static in terms of structure but more versatile solutions are certainly possible.
The data processor component opens up Groovy, JavaScript, and XSLT 3.0 as options. These are Turing-complete languages that can be used to bend Boomi to almost any outcome.
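To make that concrete, here is a rough, language-neutral sketch of the core idea (written in plain Scala rather than Boomi's Groovy; the column names, attribute names and helper names are all made up): the column-to-attribute mapping arrives as data, so the transformation logic itself never changes from one template to the next.

```scala
// Sketch only: the mapping is supplied from outside (in Boomi it could come
// from a process property or a document cache) instead of being baked into
// a static map shape.
object DynamicMapping {
  // Hypothetical mapping for one Excel template: column name -> JSON attribute.
  val columnToJsonAttr: Map[String, String] = Map(
    "Cust Name"  -> "customerName",
    "Cust Email" -> "email"
  )

  // One Excel row, already parsed into column-name -> cell-value pairs.
  def remap(row: Map[String, String]): Map[String, String] =
    row.flatMap { case (col, value) =>
      columnToJsonAttr.get(col).map(attr => attr -> value)
    }

  def main(args: Array[String]): Unit = {
    val row = Map("Cust Name" -> "Ada", "Cust Email" -> "ada@example.com")
    println(remap(row)) // e.g. Map(customerName -> Ada, email -> ada@example.com)
  }
}
```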
You could make the Boomi UI available to those who need to write the maps in JSON. It's a pretty simple interface to learn. Using a route component, there could be one "parent" process that routes to a separate process for each template, with a map for each template. Such a solution would be pretty easy to build and run, allowing the template-specific processes to be deployed independently of the "parent".
You could map to a generic columnar structure and then write a SQL procedure that dynamically alters the target columns.
I've come across attempts to do what you're describing (not using either Boomi or MuleSoft) which were tragic failures: https://www.zdnet.com/article/uk-rural-payments-agency-rpa-it-failure-and-gross-incompetence-screws-farmers/
I draw your attention to the NAO's points:
ensure the system specifications retain a realistic level of flexibility
and
bespoke software is costly to develop, needs to be thoroughly tested, and takes more time to implement
The general goal behind a requirement like yours is usually to make transformation/ETL available to "non-programmers", which denies the reality that delivering an outcome involves many more skills than "programming".

Stream from JMS queue and store in Hive/MySQL

I have the following setup (that I cannot change) and I'd like some advice from people who have been down that road. I'm not sure if this is the right place to ask, but here goes anyway.
Various JSON messages are placed on different channels of a JMS queue (Universal Messaging/webMethods).
Before the data can be stored in relational-style DBs it has to be transformed: fields renamed, arrays flattened, and some structures extracted from nested objects.
Data has to be appended to MySQL (as a serving layer for a visualization tool) and Hive (for long-term storage).
We're stuck on Spark 1.4.1 and may move to 1.6.0 in a few months' time. So, structured streaming is not (yet) an option.
At some point the events will be streamed directly to real-time dashboards, so having something in place that is capable of doing that now would be ideal.
Ideally the coding is done in Scala (we already have a considerable batch-based repo with Spark and Scala), so the minimum requirement is something JVM-based.
I've looked at Spark Streaming, but it does not have a JMS adapter and, as far as I can tell, operating on JSON would be done using a SQLContext instance on the DStream's RDDs. I understand that it's possible to write a custom adapter, but then I'm not sure if Spark is still the best/easiest solution. I've also looked at the docs for Samza and Flink but did not find much for JMS and/or JSON, at least not natively.
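For reference, my rough (and untested) understanding of what such a custom adapter would look like on Spark 1.4's Receiver API; the ConnectionFactory construction is vendor-specific for Universal Messaging, so it is just passed in here, and the names are placeholders:

```scala
import javax.jms.{Connection, ConnectionFactory, Message, MessageListener, Session, TextMessage}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Minimal sketch: wraps a JMS MessageListener in a Spark Streaming receiver,
// handing each raw JSON string to Spark via store().
class JmsReceiver(createFactory: () => ConnectionFactory, queueName: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  private var connection: Connection = _

  override def onStart(): Unit = {
    connection = createFactory().createConnection()
    val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createQueue(queueName))
    consumer.setMessageListener(new MessageListener {
      override def onMessage(msg: Message): Unit = msg match {
        case t: TextMessage => store(t.getText) // one JSON string per DStream element
        case _              =>                  // ignore non-text messages in this sketch
      }
    })
    connection.start()
  }

  override def onStop(): Unit = if (connection != null) connection.close()
}

// Usage (hypothetical): val jsonStream = ssc.receiverStream(new JmsReceiver(factoryFn, "someChannel"))
```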
Apache Camel seems like it might have a substantial set of connectors, but I'm not too familiar with it, and I get the impression it does not do the streaming part, 'just' the bit where you connect to various systems. There's also Akka, although I get the impression it's more of a replacement for messaging systems, and our use of JMS is fixed.
There is an almost bewildering number of available tools, and at this point I'm at a loss as to what to look at or what to look out for. Based on your experience, what do you recommend I use to pick up the messages, transform them, and insert them into Hive and MySQL?

Logging crosscutting concern needs access to data layer

Say I have an architecture similar to the Layered Architecture Sample. Let's also assume each large box is its own project. The Frameworks box and each layer would then be its own project. If we don't use IoC and instead go the traditional layered approach without interfaces, Service would reference Business which would reference Data and all of these would reference Frameworks.
Now the requirement is that logging be done to the database, so presumably Frameworks would need to reference Service to reach Data. There are two issues with this:
You now have a circular dependency (remember, no interfaces). Most of this can be solved if you use interfaces with IoC (Dependency Injection or Service Locator) and a composition root. Is this the only way or can you somehow do it without an interface?
If you're in Business and need to log something, it has to make the jump to service again. Is there any way to only make the jump from Presentation but not from Service/Business/Data without a composition root?
I know I'm somewhat answering my own question, but I basically want to understand if this architecture is feasible at all without IoC.
Without some inversion of control, there is not too much you can do.
Assuming your language supports something like Reflection in .NET, you could try to dynamically invoke code and invert the control at runtime. You don't need interfaces, but you might have to mark/decorate the types you need in the upper layers, or have a convention for them.
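As an illustration of that reflective approach (sketched here in JVM terms rather than .NET, and with hypothetical type and method names), the upper layer never references the data project at compile time; it discovers the concrete log writer by convention and invokes it dynamically:

```scala
// Sketch only: "myapp.data.DatabaseLogWriter" and its "write" method are a
// made-up convention, not a compile-time reference from Frameworks to Data.
object ReflectiveLogger {
  private lazy val writerClass = Class.forName("myapp.data.DatabaseLogWriter")
  private lazy val writer      = writerClass.getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
  private lazy val writeMethod = writerClass.getMethod("write", classOf[String])

  def log(message: String): Unit = {
    writeMethod.invoke(writer, message)
    ()
  }
}
```

The obvious costs are the ones you'd expect: no compile-time checking of the convention, and a runtime failure if the named type is missing.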
I'm just thinking now about crazy, non-pragmatic approaches: you could post-process the binary and inject the logging code into every layer.
For this example, I would call the logging methods in the Business layer where logging is needed. It really doesn't make any sense to call up a level. That would be an unnecessary abstraction, and it sounds like you've gathered as much.
Is there any abstraction provided in the Services layer for logging that you would require when logging from the business layer? If so, perhaps some sort of facade could be created for the purpose of business layer logging. If you do not require this abstraction, though, I would call the Business logging methods directly.
IMO, since logging is a cross-cutting concern, it should not refer to your Data layer. In your question, you assumed that you are logging to the database. Even if that is your requirement, you have to keep the database-connection and log-insert code separate from your application's Data layer. It will be part of your logging library rather than part of the Data layer. Do NOT treat it as part of the Data layer. With this perspective you can continue to develop and enhance the logging framework while keeping it separate from your Data layer.
From my perspective, the Data layer consists only of application data access, not logging. For a concrete example, look at the NLog or log4net libraries and see how they are not concerned with the application's data access strategy.
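A minimal sketch of that separation (Scala/JDBC purely for illustration; the table and column names are made up): the logging library owns its own persistence, so the application's Data layer is never involved.

```scala
import java.sql.DriverManager

// Lives in the logging library, not in the application's Data layer.
trait AppLogger {
  def info(message: String): Unit
}

class DatabaseLogger(jdbcUrl: String) extends AppLogger {
  override def info(message: String): Unit = {
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.prepareStatement("INSERT INTO app_log (level, message) VALUES (?, ?)")
      stmt.setString(1, "INFO")
      stmt.setString(2, message)
      stmt.executeUpdate()
      stmt.close()
    } finally conn.close()
  }
}
```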
Hope this helps.

Build system that is not file-centric

We have a software infrastructure which works pretty much like a software build system: Information is gathered from different sources and used to generate some outputs. Like in traditional software builds we have different types of output, dependency trees, etc.
The main difference is that our sources, intermediate results and outputs are not inherently file-based. Rather, they're (uniquely addressable) data objects.
Right now we're mapping our data structure to files and directories in combination with a traditional build system (SCons), but that does not scale, both w.r.t. performance and (more importantly) w.r.t. maintainability. Hence I'm looking for an infrastructure that's built for this purpose from the ground up.
As an illustration, assume you have 3 XML documents A, B and C. Let's say that B/foo/bar is to be calculated from A/x/y and A/x/z, and that similarly C/a/b is calculated from A/x/y. I need an infrastructure to
Implement these relationships (i.e. the transformations and their dependencies)
Automatically re-build the relevant parts after changes are made
One major problem with using files is that, if I map A, B and C to some files A.xml, B.xml and C.xml and use a traditional build system, then any change to A.xml will trigger a rebuild of B.xml and C.xml, even if A/x/y and A/x/z (the original dependencies of B) are not modified. For a fine-grained dependency resolution I therefore would need to map each of A, B and C not to a file, but to a directory where each sub-directory represents an element, files represent attributes, etc. As I said, this does not scale for us.
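To make the granularity point concrete, this is roughly the kind of node-level dependency checking I am after (a sketch with made-up names, not our actual system): a target is rebuilt only when one of its declared inputs actually changed, regardless of which file those inputs happen to live in.

```scala
import java.security.MessageDigest

// An addressable element, e.g. NodeId("A", "x/y").
final case class NodeId(doc: String, path: String)

// A transformation: how to (re)build one target from a set of input nodes.
final case class Rule(target: NodeId, inputs: Set[NodeId], build: () => Unit)

class IncrementalBuilder(contentOf: NodeId => String) {
  private var lastHashes = Map.empty[NodeId, String]

  private def hash(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  def run(rules: Seq[Rule]): Unit =
    for (rule <- rules) {
      val current = rule.inputs.map(id => id -> hash(contentOf(id))).toMap
      val changed = current.exists { case (id, h) => !lastHashes.get(id).contains(h) }
      if (changed) rule.build() // B/foo/bar is rebuilt only if A/x/y or A/x/z changed
      lastHashes ++= current
    }
}
```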
(Please note that our system is not actually based on XML)
Right now I'm looking for any existing software, infrastructure or concept which points into this direction, regardless of implementation language and underlying data structures.
It sounds like you need an active object database management system (ODBMS) like GemStone/S. ODBMSs provide the traditional persistence services without the old cost of mapping data structures to files, plus the well-known benefits of object technology. Since you've mentioned dependency trees and addressable objects: in an ODBMS, navigational references are stored as part of the data, allowing arbitrarily complex interaction patterns among objects to be represented and accessed. This is especially true when you anticipate a system that makes use of inheritance, object nesting and cross-referencing.
Although an object engine may seem oversized for your requirements, it is common for large-scale production business systems to store data and execute methods using an OODBMS, within a concurrent, multiuser environment. It doesn't come for free, because you have to invest in the human part of the equation (education and experience), but once the initial fear is overcome it will pay a return on the investment.
For re-building (subscribed) parts after changes (notifications from announcers) are made, you may use the Observer design pattern, or one of its variants (SASE or the Announcements framework), to implement your announce/subscribe architecture. Under this type of event framework there are intrinsic problems which are hard to solve with traditional file-based solutions, as you have noticed already. For example, it is typical for a dependency mechanism to have to manage the replacement of an object (in your example, an XML document) by another one. Any modern events framework should ensure that when an object is replaced, all dependents attached to the old object are updated to the new reference.
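A tiny sketch of that announce/subscribe idea (generic code, not GemStone/S specific): dependents register against a subject and are re-run when it changes, including when the subject's value is replaced wholesale.

```scala
trait Subscriber[A] {
  def onChange(newValue: A): Unit
}

class Subject[A](private var value: A) {
  private var subscribers = List.empty[Subscriber[A]]

  def subscribe(s: Subscriber[A]): Unit = subscribers ::= s

  // Replacing the whole value (e.g. an entire document) still notifies every
  // dependent, which is the case that is awkward with file-based builds.
  def replace(newValue: A): Unit = {
    value = newValue
    subscribers.foreach(_.onChange(newValue))
  }

  def current: A = value
}
```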
Finally, there is a free GemStone/S stack which includes an object dependency framework, so you can experiment with a real object database.
So nothing comes to mind that solves exactly your problem, but there are a few tools that might get you a little closer than you are now:
1) You might be able to throw something together using FUSE, which would give you better control of how your data objects are mapped out to files. FUSE basically allows you to construct arbitrary file systems from whatever backing data you want. (The Python bindings are pretty friendly, but there are a number of other language interfaces available as well.) Then you could use a traditional build tool and take advantage of file-like objects better associated with your data.
2) CMake has a pretty extensible language for writing custom targets that you might be able to press into service. Unfortunately its language is pretty idiosyncratic and has something of a steep learning curve, so it wouldn't be my first choice.

Testing and mocking with Flex

I am developing a "dumb" front-end, it's an AIR application that interacts with a "smart" LiveCycle server. There are currently about 20 request & response pairs for the application. For many reasons (testing, developing outside the corporate network, etc), we have several XML files of fake data, and if a certain configuration flag is set, the files are loaded, a specific file is parsed and used to create a mock response. Each XML file is a set of responses for different situation, all internally consistent. We currently have about 10 XML files, each corresponding to different situation we can run into. This is probably going to grow to 30-50 XML files.
The current system was developed by me during one of those 90-hour-week release cycles, when we were under duress because LiveCycle was down again and we had a deadline to meet. Most of the minor crap has been cleaned up.
The fake data is in an object called FakeData, with properties like customerType1:XML, customerType2:XML, overdueCustomer1:XML, etc. Then in the FakeData constructor, all of the properties are set like this:
customerType1 = FileUtil.loadXML(File.applicationDirectory.resolvePath("fakeData/customerType1.xml"));
And whenever you need some fake data (this happens in special FakeDelegates that extend the real LiveCycle Delegates), you get it from an instance of FakeData.
This is awful, for many reasons, but it works. One embarrassing part is that every time you create an instance of FakeData, it reloads all the XML files.
I'm trying to figure out if there's a design pattern that is not Singleton that can handle this more elegantly. The constraints are:
No global instances can be required (currently, all the code dealing with the fake data, including the fake delegates, is pulled out of production builds without any side-effects, and it needs to stay that way). This puts the Factory pattern out of the running.
It can handle multiple objects using the XML data without performance issues.
The XML files are read centrally so that the other code doesn't have to know where the XML files are, and so some preprocessing can be done (like creating a map of certain tag values and the associated XML file).
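To make the last constraint concrete, this is roughly the shape I have in mind (sketched in Scala rather than ActionScript, with placeholder names): one object reads and preprocesses each file at most once, and the fake delegates just ask it for responses.

```scala
import scala.collection.mutable
import scala.io.Source

// Sketch only: a shared, lazily-populated cache of fake responses.
class FakeDataStore(baseDir: String) {
  private val cache = mutable.Map.empty[String, String]

  // Each file is loaded (and could be preprocessed/indexed) on first request
  // only, rather than on every construction of a fake delegate.
  def response(name: String): String =
    cache.getOrElseUpdate(name, Source.fromFile(s"$baseDir/$name.xml").mkString)
}
```

The idea would be to create a single instance wherever the fake delegates are wired up and hand it to them, so nothing global is required and the whole thing can still be stripped from production builds.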
Design patterns, or other architecture suggestions, would be greatly appreciated.
Take a look at ASMock, which was developed by a good friend of mine (and a member here, Richard Szalay) and is based on .NET's Rhino Mocks. We've used it in several production environments now, so I can vouch for its stability.
You should be able to get rid of any fake tests (more like integration tests) by using mock objects instead.
Wouldn't it make more sense to do traditional mocking with a mocking framework? Depending on your implementation, it might be possible to set up the Expects by reading the fake-data XML files.
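For example (a sketch only; the XML shape and the mock call shown are hypothetical, not ASMock's actual syntax), each entry in a fake-data file could become one expectation on a mocked delegate:

```scala
import scala.xml.XML

object MockSetupFromXml {
  // Parse one fake-data file into (responseName -> responseBody) pairs,
  // assuming elements shaped like <response name="...">...</response>.
  def expectations(path: String): Seq[(String, String)] =
    (XML.loadFile(path) \ "response").map { node =>
      (node \@ "name") -> node.text.trim
    }

  // For each (name, body): mockDelegate.expect(name).returns(body)  // hypothetical mock API
}
```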
Here is a Google Code project that offers mocking for ActionScript.