Get status of 'newly-launched' EMR cluster programmatically - aws-sdk

I'm following the official docs to write a Scala script that launches an EMR cluster using the AWS Java SDK. I've identified three major steps:
1. Instantiating an EMR client
   I do this using AmazonElasticMapReduceClientBuilder.defaultClient().
2. Creating a job flow request
   I create a RunJobFlowRequest object and supply it with a JobFlowInstancesConfig (both objects are given the appropriate parameters for the requirement).
3. Running the job flow request
   This is done by calling emrClient.runJobFlow(runJobFlowRequest), which returns a RunJobFlowResult object.
But the RunJobFlowResult object doesn't give any indication of whether the cluster was launched successfully (with all the given configurations).
I'm aware that the emrClient's listClusters() method can be used to find the cluster id of the newly launched cluster, and that the id can then be used to query the cluster's state with a describeCluster() call. However, since I'm doing all of this from a Scala script, I need the process to be automated, whereas picking the right cluster id out of the getClusters() result would have to be done manually.
Is there any way this could be achieved?

You have all the pieces there but haven't quite stitched them together.
The cluster's id can be retrieved from RunJobFlowResult.getJobFlowId(). (It is a string starting with "j-".) You can then pass this jobFlowId to DescribeCluster as the cluster id.
I don't blame you for your confusion though, since it's called "jobFlowId" for some methods (mainly older API methods) and "clusterId" in other methods. They are really the same thing though.
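Stitched together, the automated check might look like the sketch below. To keep it runnable without the AWS SDK on the classpath, the state lookup is passed in as a function. ClusterStateWaiter, waitForStartup, launchedSuccessfully, stateOf and the polling parameters are illustrative names, not part of the SDK. With the real client, the id would come from runJobFlowResult.getJobFlowId(), and stateOf would wrap emrClient.describeCluster(new DescribeClusterRequest().withClusterId(id)).getCluster().getStatus().getState().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

class ClusterStateWaiter {
    // EMR cluster states in which the startup phase is over, successfully or not.
    private static final Set<String> SETTLED = new HashSet<>(Arrays.asList(
            "RUNNING", "WAITING", "TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"));

    /**
     * Calls stateOf(clusterId) until the cluster has left STARTING/BOOTSTRAPPING
     * or stateOf has been called maxPolls times, then returns the last state seen.
     * stateOf is a stand-in (assumption) for the describeCluster() lookup.
     */
    static String waitForStartup(String clusterId,
                                 Function<String, String> stateOf,
                                 int maxPolls,
                                 long sleepMillis) {
        String state = stateOf.apply(clusterId);
        for (int i = 1; i < maxPolls && !SETTLED.contains(state); i++) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up with the last known state.
                Thread.currentThread().interrupt();
                return state;
            }
            state = stateOf.apply(clusterId);
        }
        return state;
    }

    /** RUNNING (executing steps) and WAITING (idle) both mean the launch worked. */
    static boolean launchedSuccessfully(String state) {
        return "RUNNING".equals(state) || "WAITING".equals(state);
    }
}
```

From Scala the same calls can be made directly on the Java client. A real launch typically takes minutes, so a sleep interval of tens of seconds is a reasonable starting point.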

Informatica PowerCenter pipelines to Azure Data Factory

I am trying to move my Informatica pipelines in PowerCenter 10.1 to Azure Data Factory/Synapse pipelines. Other than rewriting them from scratch, is there a way to migrate them? I have not found any tools for this either. Has anyone faced this problem? Any leads on how to proceed would be appreciated.
Thanks
There are no out-of-the-box solutions available to complete this migration. Unfortunately, you will have to author the pipelines again.
Informatica PowerCenter pipelines are a physical implementation of an Extract Transform Load (ETL) process. Each provider has different approaches to the implementations and they do not necessarily map well from one to another. Core Azure Data Factory (ADF) is actually more suited to Extract, Load and Transform (ELT), unless of course you use Data Flows.
So what you have to do is:
1. map out physically what your current pipeline is doing, if you don't have that documentation already. A simple spreadsheet template mapping out the components of the existing pipeline, tracking source, target and any transformations, will suffice
2. logically map out what the pipeline is doing; i.e. without using PowerCenter-specific terminology, lay out what the "as is" pipeline is doing. A data flow diagram is a great way to do this
3. logically map out what the "to be" pipeline should do; i.e. without using any ADF-specific terminology, attempt to refine the "as is" pipeline to its simplest form
4. using expert knowledge of the ADF components (e.g. Copy, Lookup, Notebook, Stored Proc, to name but a few), map from the logical "to be" to the physical (in the loosest sense of the word, it's all cloud now right : ). For example: move data from place to place with the Copy activity; transform data in a SQL database using the Stored Proc activity; use a For Each loop for a repeated activity (bear in mind its iterations execute in parallel by default); do sophisticated transformations or processing using Databricks notebooks if required; and so on. If you require a low-code approach, consider Data Flows.
So you can see it's just a few simple steps. Good luck!

How to create a custom node in KNIME?

I have added all the KNIME plugins to Eclipse and I want to create my own custom node, but I can't work out how to pass data from one node to another.
I looked at the "File Reader" node, which is provided by KNIME itself. I would like to see the source code or the jar file for this node, but I can't find it.
I searched the Eclipse plugin folder for a similar name, but still couldn't find it.
Can someone please tell me how to pass data from one node to another, and how to identify the classes, jar and source code for any node provided by KNIME?
Assuming that your data is a standard datatable, then you need to subclass NodeModel, with a call to the supertype constructor:
    public MyNodeModel() {
        // One incoming table, one outgoing table
        super(1, 1);
    }
You need to override the default #execute(BufferedDataTable[] inData, ExecutionContext exec) method - this is where the meat of the node work is done and the output table is created. Ideally, if your input and output tables have a one-to-one row mapping, use a ColumnRearranger (because this reduces disk IO considerably and, if you need it, allows simple parallelisation of your node); otherwise your execute method needs to iterate through the incoming data table and generate an output table.
The #configure(DataTableSpec[] inSpecs) method needs to be implemented to, at the least, provide a spec for the output table if this can be determined before the node is executed (it normally can, and this allows downstream nodes to be configured too, but the 'Transpose' node is an example of a node which cannot do so).
There are various other methods which you also need to implement, but in some cases these will be empty methods.
In addition to the NodeModel, you need to implement some other classes too - a NodeFactory, optionally a NodeSettingsPane and optionally a NodeView.
In Eclipse you can view the sources for many nodes, and the KNIME community 'book' pages all have a link to their source code. Take a look at https://tech.knime.org/developer-guide and https://tech.knime.org/developer/example for a step-by-step guide. Also, questions to the KNIME forums (including a developer forum) generally get rapid responses - and KNIME run a Developer Training Course a few times a year if you want to spend a few days learning more. And last but not least, it is worth familiarising yourself with the noding guidelines, which describe the best practice for how your node should behave.
Source code for KNIME nodes is now available on GitHub.
Alternatively, in the Eclipse KNIME SDK you can look under your project > Plug-in Dependencies > knime-base.jar > org.knime.base.node.io.filereader for the File Reader source code.
knime-base.jar is added to your project by default when it is created with the KNIME SDK.

SpringJUnit4ClassRunner customized for multitenant environments

I work on a multi-tenant application. The current tenant information is managed via thread locals, which are set by a filter for each request.
During integration tests (non-web) this filter does not apply, so I'm looking for a way to set this thread local in tests.
I started thinking about an annotation on the test class or its methods (including @Before and @After). This could be something like @AsTenant("tenantId") (let's assume we only need the tenant id).
So I'm basically looking for a way to extend SpringJUnit4ClassRunner so that it is aware of the annotations and sets the thread locals at the right time. Does anybody have experience or ideas on where to hook this in? (I am not very familiar with test runners.)
Thanks in advance!

How can I add/remove instances from GCE load balancers with Ansible?

I see that there is a gce_lb Ansible module, but it is unclear to me whether or not I can actually use this to change the instances assigned to that LB or whether the module just creates and destroys LBs.
In contrast, EC2 clearly has one module just for creating and destroying ELBs, and another module explicitly for [de]registering instances to/from an existing ELB.
Currently, the gce_lb module is only for creating/destroying the LB. It does not support adding/removing instances.
The GCE modules in Ansible are built on top of the Python libcloud library, which does have support for add/remove. I think an approach similar to the one taken by the EC2 modules would be a good solution here as well.

Handling "Internal server error" in Groovy-Console

I have a Groovy script which takes about 5 hours to complete (it restarts many workflows, i.e. deletes the old one and starts a new one). Unfortunately, some workflows can't be processed and throw an "internal server error", which ends the Groovy call.
All I can do at the moment is look at the logs, exclude the problematic workflow id and restart the Groovy script.
It would save a lot of time if I could catch this "internal server error" in the HAC and continue with the next workflow instead of aborting the script.
I already tried to put the call in a try/catch, but this doesn't work.
Is there any way to skip the entries in my list that cause an "internal server error" and carry on with the rest?
Thanks for any help!
Run the Groovy script natively, not through the HAC. The Groovy/Beanshell consoles are handy for quick prototypes, but running a 5-hour process through a browser interface seems kludgy at best. You have at least a couple of options:
Dynamic Beans
Did you know that Spring beans can be implemented in a number of different languages using dynamic language beans?
Define interfaces for your processes and wire them up to Groovy implementations using the Spring configuration. Since the scripts are interpreted at runtime, you can swap out code without needing to recompile the entire platform.
Now you have the full power of Java, Spring, Groovy, and hybris. Properly sequester each process so that exceptions don't bubble up and crash the entire thing.
This option would be the cleanest way to go, since you'd be integrating the code directly into the project's codebase. And you can keep all your existing [ Groovy | JRuby | Beanshell | ... ] code.
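One way to read "sequester each process" is to run every workflow restart in its own guarded call, record the failure, and move on to the next id. The sketch below is a minimal plain-Java illustration of that idea, assuming the failure surfaces as an exception inside the running code. WorkflowBatch, restartAll and the restart callback are hypothetical names; the actual hybris restart logic is not shown.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class WorkflowBatch {
    /**
     * Calls restart for every id. A failure for one id is recorded and does not
     * stop the remaining ids. Returns failed id -> error message, in input order.
     * The restart callback is a placeholder (assumption) for the real restart logic.
     */
    static Map<String, String> restartAll(List<String> workflowIds, Consumer<String> restart) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String id : workflowIds) {
            try {
                restart.accept(id);
            } catch (Exception e) {
                // Deliberately Exception rather than Throwable, so that errors
                // such as OutOfMemoryError still abort the whole batch.
                failures.put(id, String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }
}
```

The returned map can be logged at the end of the run, which replaces the manual step of reading the logs and excluding the problematic workflow ids.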
Roll your own
Another thing you might try is examining hybris' Groovy API. I was able to leverage hybris' Beanshell interpreter classes to create my own test harness. It is a simple standalone Eclipse project that allows me to write and run Beanshell within Eclipse, with output to the console. I use it on a daily basis for quick scripting tasks like batch updates, FlexibleSearch queries, etc. I'd imagine you could do the same thing with Groovy. Search the hybris API for the HAC code that interprets the Groovy requests from the browser.
The sky's the limit, but first get out of the browser console for heavy scripting tasks.
My short answer would be: don't use scripts for time-consuming processes.
Although you mentioned that it is not possible to define standard scripts because the business is working in parallel, I cannot recommend maintaining a live system in this manner.
Integrate that logic into a custom CronJob and add everything configurable/dynamic as properties of that job.
The benefits of this approach would be:
- you have a proper logging mechanism (Sysout in the HAC Groovy console sucks)
- you can trace your execution (time consumed, started, stopped, etc.)
- it can be triggered automatically (CronJob Trigger) or by another instructed user (e.g. Operations)
- you get a more stable workflow as a whole (that is, no need to keep track of those magic scripts; how do you version them? in the resource folder?)
The downside, admittedly, is that you need a redeploy.
From my experience, dynamically changed code (Dynamic Beans as an example) works on projects with comparably low complexity, but tends to get messy pretty quickly.