Integration between Foundry Workshop and Code Repositories - palantir-foundry

I ran into this technical issue while developing what-if scenario applications in Palantir Workshop. Our users make their changes in Workshop, and once all changes are complete they should be able to trigger a build in Code Repositories from Workshop that executes a Python transform function. The transform function takes multiple datasets as inputs. What would be the recommended technique for making this integration happen?
Thank you for your attention!
I tried using the build schedules manager together with "apply scenario". Although it can work, the biggest drawbacks are that:
the build pipeline is more manual and ad hoc than automated.
"apply scenario" in Workshop works against our expectations, because the data was not supposed to override the ontology objects. However, for the trigger to happen, I have to apply the changes made in Workshop back to the ontology to chain-trigger the build.

Related

Is there a function in SSIS to extract functionality from a package to be added to other packages?

I just added error handling functionality to an SSIS package that I am upgrading, and I need to add this same error handling to about 30 more packages. Is there a way to extract the error handling control flow, parameters, variables, etc. so that I can easily add them to the rest of the packages?
I am using Visual Studio Enterprise 2019 and SSIS 15.0.
I found a bunch of articles on Biml, but it looks like that is only for creating new packages. I am aware that copy and paste exists, but I would like to find a solution that is easy to apply across future packages as well as the current packages being updated. Apologies if this question has already been asked; I searched, but I'm not sure I even know what search terms would be applicable.
Yes, Biml is an excellent choice for creating consistent packages going forward. Even if you're only generating empty packages with error handling logic, that's a pattern and that's the power of Biml.
With the change to BimlExpress and the now-free ability to reverse engineer packages, an approach could be to reverse engineer the packages to Biml. That would all be static Biml, but you can select it all and then, in a new BimlScript file, add the error handling like so:
<#
// Walk every package that was reverse engineered into this project
foreach (AstPackageNode apn in this.RootNode.Packages)
{
    // Only touch packages that don't already have an OnError event handler
    if (!apn.Events.Where(x => x.EventType == EventType.OnError).Any())
    {
        AstTaskEventHandlerNode onError = new AstTaskEventHandlerNode(null);
        onError.EventType = EventType.OnError;
        onError.Name = "OnError";
        // TODO: add tasks and such
        apn.Events.Add(onError);
    }
    //WriteLine(apn.GetBiml());
}
#>
Once that's looking good, you right-click everything at once and generate the packages.
A non-Biml approach is going to test your C# (or VB.NET) skills. I've not touched this type of SSIS dev in more than a decade but the concept will remain the same. https://billfellows.blogspot.com/2016/10/what-packages-still-use-configuration.html
You'll need to find all the SSIS packages. For each one of those, add a reference to the DTS runtime assembly and use it to load the package. Then look at the package's EventHandlers collection; if there isn't an OnError handler, you're going to have to add one to the collection, add all the associated tasks, configure them, and then save.
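Here is a minimal sketch of that approach, assuming a reference to Microsoft.SqlServer.Dts.Runtime; the folder path and the stub logging task are hypothetical placeholders for your real error-handling pattern:

using System.IO;
using Microsoft.SqlServer.Dts.Runtime;

class AddOnErrorHandlers
{
    static void Main()
    {
        Application app = new Application();

        // Find all the SSIS packages under a (placeholder) root folder
        foreach (string path in Directory.EnumerateFiles(@"C:\SsisPackages", "*.dtsx", SearchOption.AllDirectories))
        {
            Package pkg = app.LoadPackage(path, null);

            // Skip packages that already have an OnError event handler
            if (pkg.EventHandlers.Contains("OnError")) continue;

            DtsEventHandler onError = (DtsEventHandler)pkg.EventHandlers.Add("OnError");

            // Add a stock Execute SQL Task to the handler and name it;
            // configure its properties to match your logging pattern, e.g.
            // th.Properties["SqlStatementSource"].SetValue(th, "EXEC dbo.LogError ...");
            TaskHost th = (TaskHost)onError.Executables.Add("STOCK:SQLTask");
            th.Name = "Log error";

            app.SaveToXml(path, pkg, null);
        }
    }
}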

Informatica PowerCenter pipelines to Azure Data Factory

I am trying to move my Informatica pipelines in PowerCenter 10.1 to Azure Data Factory / Synapse pipelines. Other than rewriting them from scratch, is there a way to migrate them somehow? I am not finding any tools to achieve this either. Has anyone faced this problem? Any leads on how to proceed?
Thanks
There are no out-of-the-box solutions available to complete this migration. Unfortunately, you will have to author them again.
Informatica PowerCenter pipelines are a physical implementation of an Extract Transform Load (ETL) process. Each provider has different approaches to the implementations and they do not necessarily map well from one to another. Core Azure Data Factory (ADF) is actually more suited to Extract, Load and Transform (ELT), unless of course you use Data Flows.
So what you have to do is:
map out physically what your current pipeline is doing, if you don't have that documentation already. A simple spreadsheet template mapping out the components of the existing pipeline, tracking source and target plus any transformations, will suffice
logically map out what the pipeline is doing; i.e. without using PowerCenter-specific terminology, lay out what the "as is" pipeline is doing. A data flow diagram is a great way to do this
logically map out what the "to be" pipeline should do; i.e. without using any ADF-specific terminology, attempt to refine the "as is" pipeline to its simplest form
using expert knowledge of the ADF components (e.g. Copy, Lookup, Notebook, Stored Proc to name but a few), map from the logical "to be" to the physical (in the loosest sense of the word, it's all cloud now right : ). For example, move data from place to place with the Copy activity, transform data in a SQL database using the Stored Proc activity, implement a repeated activity with a ForEach loop (bear in mind these execute in parallel), and do sophisticated transformations or processing with Databricks notebooks if required. If you require a low-code approach, consider Data Flows. A sketch of this final step follows the list.
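To make that last step concrete, here is a minimal sketch using the ADF .NET SDK (Microsoft.Azure.Management.DataFactory). Every resource name, both datasets, and the stored procedure are hypothetical placeholders, and authentication and linked-service setup are glossed over:

using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.Rest;

class AdfPipelineSketch
{
    static void Main()
    {
        // Placeholder credentials and resource names
        var client = new DataFactoryManagementClient(new TokenCredentials("<access-token>"))
        {
            SubscriptionId = "<subscription-id>"
        };

        // Extract + load with a Copy activity...
        var copy = new CopyActivity
        {
            Name = "CopySourceToStaging",
            Inputs = new List<DatasetReference> { new DatasetReference { ReferenceName = "SourceDataset" } },
            Outputs = new List<DatasetReference> { new DatasetReference { ReferenceName = "StagingDataset" } },
            Source = new SqlSource(),
            Sink = new SqlSink()
        };

        // ...then transform inside the database with a Stored Proc activity (ELT)
        var transform = new SqlServerStoredProcedureActivity
        {
            Name = "TransformStaging",
            StoredProcedureName = "dbo.usp_TransformStaging",
            LinkedServiceName = new LinkedServiceReference { ReferenceName = "SqlDbLinkedService" },
            DependsOn = new List<ActivityDependency>
            {
                new ActivityDependency
                {
                    Activity = "CopySourceToStaging",
                    DependencyConditions = new List<string> { "Succeeded" }
                }
            }
        };

        var pipeline = new PipelineResource { Activities = new List<Activity> { copy, transform } };
        client.Pipelines.CreateOrUpdate("<resource-group>", "<factory-name>", "MigratedPipeline", pipeline);
    }
}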
So you can see it's just a few simple steps. Good luck!

SSRS 2008 - Create a chart of a directed graph to visualise ETL jobs

I can't find anything that hints at native support for charting graph data structures (otherwise known as "network maps" by some), and in my case, a directed graph. I want to create a visualisation of our ETL dependency chain at work, showing the steps each different 'job' relies on before being able to proceed.
Questions:
Has anybody been able to 'simulate\hack\workaround' this lack of out-of-the-box functionality in SSRS?
Any ideas on how to possibly achieve this if no-one has thought of doing this before?
EDIT - 2014-10-30
Two years and no answer so I've accepted the most promising advice on a workaround to get what is needed, as no direct functionality has been found.
From left field:
You could wrap an SSIS package around your "ETL jobs". The SSIS Control Flow surface has a GUI for expressing task dependencies. It's functional, if not visually outstanding. Your "ETL jobs" could be Execute SQL Task or Execute Process Task objects. You can connect the precedence constraints to show dependencies.
This could either be for real use or just for documentation purposes. If you use it for real, you'll find it's a great way to control ETL dependencies and parallelism.
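If you go the documentation route, you could even generate that package from a dependency list with the DTS runtime API. A minimal sketch, where the job names and edges are made-up placeholders:

using System.Collections.Generic;
using Microsoft.SqlServer.Dts.Runtime;

class EtlGraphPackage
{
    static void Main()
    {
        // Hypothetical ETL jobs and their dependency edges (upstream -> downstream)
        var jobs = new[] { "LoadCustomers", "LoadOrders", "BuildFacts" };
        var edges = new[] { ("LoadCustomers", "BuildFacts"), ("LoadOrders", "BuildFacts") };

        Package pkg = new Package { Name = "EtlDependencyMap" };
        var tasks = new Dictionary<string, Executable>();

        // One placeholder Execute SQL Task per ETL job
        foreach (string job in jobs)
        {
            Executable exec = pkg.Executables.Add("STOCK:SQLTask");
            ((TaskHost)exec).Name = job;
            tasks[job] = exec;
        }

        // One precedence constraint per dependency edge
        foreach (var (upstream, downstream) in edges)
        {
            PrecedenceConstraint pc = pkg.PrecedenceConstraints.Add(tasks[upstream], tasks[downstream]);
            pc.Value = DTSExecResult.Success; // downstream runs only if upstream succeeds
        }

        new Application().SaveToXml(@"C:\temp\EtlDependencyMap.dtsx", pkg, null);
    }
}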

Open alternatives to Windows Workflow

Pre-warning: There are some other questions similar to this but don't quite answer the question (these include: Alternatives to Windows Workflow Foundation?, Can anyone recommend a .Net open source alternative to Windows Workflow?)
We are developing a system that is an event-based state machine, and we are currently investigating Windows Workflow. Our system needs to be low-latency in its response to events from a multitude of sources (XMPP, HTTP, SMS, phone calls, email, etc.) coming into the system, scalable and resilient, and most importantly customisable. For a variety of reasons (and due diligence) I am looking for open workflow engines that support functions similar to Windows Workflow Foundation (and more, if possible), mainly (but it doesn't matter too much if there are engines that don't support some features):
Persistence of long running tasks, and resumption of tasks on external events
High performance, low latency
Ability to develop custom actions
The ability to specify workflows dynamically
Tracking and tracing
I am not constrained to platform or language, and I would love some help and tips from you guys so that I can start to investigate the engines more closely and any experiences you had with the engines.
Paul.
I invite you to examine Stateless further, as suggested in the answer to my SO question can-anyone-recommend-a-net-open-source-alternative-to-windows-workflow. Achieving the goal of a long-running state machine is very simple: you can store the current state in a database and re-sync the state machine when needed. Consider the following from the Stateless site:
Stateless has been designed with encapsulation within an ORM-ed domain model in mind. Some ORMs place requirements upon where mapped data may be stored. To this end, the StateMachine constructor can accept function arguments that will be used to read and write the state values:
var stateMachine = new StateMachine<State, Trigger>(
    () => myState.Value,
    s => myState.Value = s);
With very little effort you can persist your state, then retrieve that state easily later on.
With respect to updating the workflow dynamically: if you create a state machine such as
var stateMachine = new StateMachine<string, int>("Start"); // Stateless requires an initial state
and maintain a separate file of states and triggers in XML, you can perform the configuration at runtime by looping through the string/int value pairs.
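A minimal sketch of that idea, assuming a hypothetical transitions.xml whose <transition> elements carry from/trigger/to attributes:

using System;
using System.Xml.Linq;
using Stateless;

class DynamicWorkflow
{
    static void Main()
    {
        var machine = new StateMachine<string, int>("Start");

        // Each <transition from="..." trigger="..." to="..."/> row becomes one
        // Permit rule, so the workflow definition can change without recompiling.
        foreach (var t in XDocument.Load("transitions.xml").Descendants("transition"))
        {
            machine.Configure((string)t.Attribute("from"))
                   .Permit((int)t.Attribute("trigger"), (string)t.Attribute("to"));
        }

        machine.Fire(1); // fire trigger 1 from the "Start" state
        Console.WriteLine(machine.State);
    }
}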
"Java side":
Apache ODE (Orchestration Director Engine) executes business processes written following the WS-BPEL standard. It talks to web services, sending and receiving messages, handling data manipulation and error recovery as described by your process definition. It supports both long and short living process executions to orchestrate all the services that are part of your application.
http://ode.apache.org/
OSWorkflow can be considered a "low level" workflow implementation. Situations like "loops" and "conditions" that might be represented by a graphical icon in other workflow systems must be "coded" in OSWorkflow.
http://www.opensymphony.com/osworkflow/
Shark is an extendable workflow engine framework including a standard implementation completely based on WfMC specifications, using XPDL (without any proprietary extensions!) as its native workflow process definition format and the WfMC "ToolAgents" API for server-side execution of system activities.
http://www.enhydra.org/workflow/shark/index.html
Python side:
http://bika.sourceforge.net/
http://www.vivtek.com/wftk/
I hope this will help you :-)
You might consider implementing your flow as an actual state machine. Tools like State Machine Compiler and Ragel can help with this. State machines, in many circumstances, are just what you need to implement insanely complex behavior that is testable and rock-solid. I don't claim to be a Windows Workflow expert, but from what I have seen, I question its superiority over coding your own state machine, either by hand or using a tool.
You might want to check out Simple State Machine.
If you feel like you want to have more control over things and want to roll your own it might be helpful to check out the Saga support that projects like NServiceBus and MassTransit use. Sagas look to be very similar to WF workflows but are POCO objects and I believe both projects just use NHibernate for Saga persistence.
I'm going to recommend you take a few hours to look at the book Open-Source ESBs in Action. "Orchestration" and "Choreography" are the key buzzwords to look at when dealing with enterprise service buses. The systems for .NET are quite expensive (BizTalk is in the price range of a decent car; Tibco is in the price range of a decent house).
Other links:
Open ESB project
Comparison of OpenESB and ServiceMix (both of which are the subject of the "In Action" book above).
Try Drools for Java. I personally have never tried it, but I know several commercial applications are based on Drools.
http://www.jboss.org/drools/
You could also upgrade to .NET 4.0; there are major improvements to Workflow in the new framework. If I were writing a new workflow application, I would jump to 4.0.
Good Luck
JBoss JBPM
Consider Workflow Engine, a lightweight all-in-one component that enables you to add custom executable workflows of any complexity to any .NET or Java software, be it your own creation or a third-party solution, with minimal changes to existing code. It supports custom actions and commands, has timers and supports parallel workflows. And there's a free version.
You can take a look at Imixs-Workflow, which is an event-driven state machine approach based on BPMN 2.0. It focuses especially on human-centric, long-running tasks.

Best practices for version information?

I am currently working on automating/improving the release process for packaging my shop's entire product. Currently the product is a combination of:
Java server-side codebase
XML configuration and application files
Shell and batch scripts for administrators
Statically served HTML pages
and some other stuff, but that's most of it
All or most of which have various versioning information contained in them, used for varying purposes. Part of the release packaging process involves doing a lot of finding, grep'ing and sed'ing (in scripts) to update the information. This glue that packages the product seems to have been cobbled together in an organic, just-in-time manner, and is pretty horrible to maintain. For example, some Java methods create Date objects for the time of release, the arguments for which are updated by a textual replacement, without compiler validation... just, urgh.
I'm trying to avoid giving examples of actual software used (i.e. CVS, SVN, ant, etc.) because I'd like to avoid the "use xyz's feature to do this" responses and concentrate more on general practices. I'd like to blame shoddy design for the problem, but if I had to start again, still using varying technologies, I'd be unsure how best to go about handling this beyond laying down conventions.
My question is: are there any best practices or hints and tips for maintaining and updating versioning information across different technologies, filetypes, platforms and version control systems?
Create a properties file that contains the version number, and have all of the different components reference the properties file:
Java code can read the properties through java.util.Properties
XML can use includes (e.g. XInclude)
HTML can use JavaScript to write the version number from the properties into the page
Shell scripts can read in the file
Indeed, to complete Craig Angus's answer, the rule of thumb here should be not to include any meta-information in your normal delivery files, but to record that metadata (version number, release date, and so on) in one special file included in the release.
That helps when you use one VCS (Version Control System) tool from development through homologation to pre-production.
That means whenever you load a workspace (whether for developing, testing, or preparing a release for production), it is the versioning tool that gives you all the details.
When you prepare a delivery (a set of packaged files), you should ask that VCS tool for every piece of meta-information you want to keep, and write it into a special file that is itself included in the said set of files.
That delivery should be packaged in an external directory (outside any workspace) and:
copied to a shared directory (or a Maven repository) if it is a non-official release (just a quick packaging to help the team next door who is waiting for your delivery). That way you can make 10 or 20 deliveries a day; it does not matter: they are easily disposable.
imported into the VCS in order to serve as an official delivery, and in order to be deployed easily, since all you need is to ask the versioning tool for the right version of the right delivery, and you can begin to deploy it.
Note: I just described a release management process mostly used for many inter-dependant projects. For one small single project, you can skip the import in the VCS tool and store your deliveries elsewhere.
In addition to Craig Angus's points, include the versions of the tools used.