Can you add/remove KafkaSpouts to Apache Storm Topology Dynamically - configuration

I have an Apache Storm Topology that accepts messages from multiple Kafka topics.
Currently "multiple" equates to "Two".
As I currently have only two KafkaSpouts to listen to, I have hard-coded both into my Topology class as follows:
builder.setSpout(SPOUT_ONE_ID, kafkaSpout_A, 1);
builder.setSpout(SPOUT_TWO_ID, kafkaSpout_B, 1);
builder.setBolt(BOLT_ID, myBolt, 1).shuffleGrouping(SPOUT_ONE_ID).shuffleGrouping(SPOUT_TWO_ID);
However, the number of KafkaSpouts will increase over time; each new KafkaSpout will listen to its own unique topic. Each time a new topic appears I will have to make a code change to my Topology and redeploy it.
I would much rather have my Topology controlled by an external configuration "mechanism" like a disk file or database table. By adding (or removing) Kafka topic details I would like my topology to start (or stop) "listening" to those topics.
Does Apache Storm support this type of dynamic configuration?

Apache Storm currently doesn't support changing topology configuration dynamically. There is an open JIRA item to support this as part of JStorm. See link: https://issues.apache.org/jira/browse/STORM-1335
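If the goal is just to avoid editing the topology class for every new topic, one workaround is to build the spouts from an external configuration read at submission time. It still requires killing and resubmitting the topology when the list changes, but no code change. A rough sketch, reusing BOLT_ID and myBolt from the question; the file name and the createKafkaSpoutFor(topic) helper are made up:
// topics.conf contains one Kafka topic name per line
List<String> topics = Files.readAllLines(Paths.get("topics.conf"));

TopologyBuilder builder = new TopologyBuilder();
BoltDeclarer boltDeclarer = builder.setBolt(BOLT_ID, myBolt, 1);

for (String topic : topics) {
    String spoutId = "kafka-spout-" + topic;
    // createKafkaSpoutFor(topic) is a hypothetical helper that builds a
    // KafkaSpout configured to read from that single topic
    builder.setSpout(spoutId, createKafkaSpoutFor(topic), 1);
    boltDeclarer.shuffleGrouping(spoutId); // the bolt subscribes to every registered spout
}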


JGit TransportConfigCallback implementation with Apache MINA

JGit used to rely on JSch as its transport provider, but changed because JSch does not accept certain keys, including some OpenSSH ones.
If you are using a passphrase-encrypted key with JGit, the recommended approach with JSch used to be to override a configuration callback, as shown in this blog article.
The alternative is to set a shared session factory, and this is also what the Apache MINA team recommends, with a pointer to the cost of creating sessions on demand.
However, my issue with this is that it sets a specific provider for the whole system scope. Consequently, I would like to implement TransportConfigCallback. Ultimately, this requires adapting SshSessionFactory to MINA's session initialization code and producing a RemoteSession.
Has anyone done this or seen any boilerplate code?
Self-answer after some research: the interface to MINA in JGit is deliberately kept closed. The driver only accepts a regular Git configuration directory, and this is the key to configuring the system:
Provide a different git configuration directory for each user.
Provide regular git configuration (git config) files underneath for all special settings
JGit has classes to write Git configurations, so you do not need to write any parsing and serialisation code.
I will update this answer with pertinent links to the classes and source code locations involved.
From version 5.5 the factory can be configured for this use case. See NoFilesSshTest.java for details on how to use a database to provide connection information.
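As a sketch of what per-command (rather than system-wide) configuration can look like in recent JGit releases: the sshd-based SshdSessionFactory can be pointed at a per-user home/.ssh directory and passed to a single command via TransportConfigCallback. The builder lives in the org.eclipse.jgit.transport.sshd package; paths and URLs below are illustrative, so treat this as an untested outline:
File userHome = new File("/home/alice");
// Per-user configuration directory, as suggested above: keys, config and
// known_hosts all live under this user's .ssh directory.
SshdSessionFactory factory = new SshdSessionFactoryBuilder()
        .setHomeDirectory(userHome)
        .setSshDirectory(new File(userHome, ".ssh"))
        .build(new JGitKeyCache());

Git git = Git.cloneRepository()
        .setURI("ssh://git@example.com/project.git")
        .setDirectory(new File("/tmp/project"))
        // scope the factory to this command only, instead of the whole JVM
        .setTransportConfigCallback(transport ->
                ((SshTransport) transport).setSshSessionFactory(factory))
        .call();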

How to create a custom node in KNIME?

I have added all the KNIME plugins in Eclipse and I want to create my own custom node, but I am not able to understand how to pass the data from one node to another.
I looked at one of the nodes provided by KNIME itself, the "File Reader" node. Now I want the source code or the jar file for this node, but I am not able to find it.
I searched for a similar name in the Eclipse plugin folder but still could not find it.
Can someone please tell me how to pass data from one node to another, and how to identify the classes, the jar, and the source code for any node provided by KNIME?
Assuming that your data is a standard datatable, you need to subclass NodeModel, with a call to the supertype constructor:
public MyNodeModel() {
    // One incoming table, one outgoing table
    super(1, 1);
}
You need to override the default #execute(BufferedDataTable[] inData, ExecutionContext exec) method - this is where the meat of the node's work is done and the output table created. Ideally, if your input and output tables have a one-to-one row mapping, use a ColumnRearranger class (because this reduces disk IO considerably and, if you need it, allows simple parallelisation of your node); otherwise your execute method needs to iterate through the incoming datatable and generate an output table.
The #configure(DataTableSpec[] inSpecs) method needs to be implemented to at least provide a spec for the output table if this can be determined before the node is executed (it normally can, and this allows downstream nodes to be configured as well, but the 'Transpose' node is an example of a node which cannot do so).
There are various other methods which you also need to implement, but in some cases these will be empty methods.
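As a rough, untested sketch of the execute/configure pair described above (the row handling is illustrative; it simply copies the input rows and reuses the input spec):
@Override
protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
        throws InvalidSettingsException {
    // In this sketch the output table has the same structure as the input
    return new DataTableSpec[]{inSpecs[0]};
}

@Override
protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
        final ExecutionContext exec) throws Exception {
    BufferedDataContainer out = exec.createDataContainer(inData[0].getDataTableSpec());
    for (DataRow row : inData[0]) {
        // Transform or filter the row here; this sketch passes it through unchanged
        out.addRowToTable(row);
        exec.checkCanceled();
    }
    out.close();
    return new BufferedDataTable[]{out.getTable()};
}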
In addition to the NodeModel, you need to implement some other classes too - a NodeFactory, optionally a NodeSettingsPane and optionally a NodeView.
In Eclipse you can view the sources for many nodes, and the KNIME community 'book' pages all have a link to their source code. Take a look at https://tech.knime.org/developer-guide and https://tech.knime.org/developer/example for a step-by-step guide. Also, questions to the KNIME forums (including a developer forum) generally get rapid responses, and KNIME runs a Developer Training Course a few times a year if you want to spend a few days learning more. And last but not least, it is worth familiarising yourself with the noding guidelines, which describe best practice for how your node should behave.
Source code for KNIME nodes is now available on GitHub.
Alternatively, you can check under your project > plugin dependencies > knime-base.jar > org.knime.base.node.io.filereader for the File Reader source code in the Eclipse KNIME SDK.
knime-base.jar will be added to your project by default when it is created with the KNIME SDK.

Stream from JMS queue and store in Hive/MySQL

I have the following setup (that I cannot change) and I'd like some advice from people who have been down that road. I'm not sure if this is the right place to ask, but here goes anyway.
Various JSON messages are placed on different channels of a JMS queue (Universal Messaging/webMethods).
Before the data can be stored in relational-style DBs it has to be transformed: fields renamed, arrays flattened, and some structures extracted from nested objects.
Data has to be appended to MySQL (as a serving layer for a visualization tool) and Hive (for long-term storage).
We're stuck on Spark 1.4.1 and may move to 1.6.0 in a few months' time. So, structured streaming is not (yet) an option.
At some point the events will be streamed directly to real-time dashboards, so having something in place that is capable of doing that now would be ideal.
Ideally coding is done in Scala (because we already have considerable batch-based repo with Spark and Scala), so the minimal requirement is JVM-based.
I've looked at Spark Streaming but it does not have a JMS adapter and as far as I can tell operating on JSON would be done using a SQLContext instance on the DStream's RDDs. I understand that it's possible to write a custom adapter, but then I'm not sure if Spark is still the best/easiest solution. I've also looked at the doc for Samza and Flink but did not find much for JMS and/or JSON, at least not natively.
Apache Camel seems like it might have a substantial set of connectors, but I'm not too familiar with it, and I get the impression it does not do the streaming part, 'just' the bit where you connect to various systems. There's also Akka, although I get the impression it's more of a replacement for messaging systems, and JMS is a given in our setup.
There is an almost bewildering amount of available tools and I'm at this point at a loss what to look at or what to look out for. What do you recommend based on your experience that I use to pick up the messages, transform, and insert into Hive and MySQL?
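For context, this is roughly the shape I imagine a custom receiver for the DStream API would take if we went the Spark route (untested sketch; ActiveMQConnectionFactory merely stands in for the Universal Messaging connection factory, and all names are made up):
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class JmsReceiver extends Receiver<String> {
    private final String brokerUrl;
    private final String queueName;

    public JmsReceiver(String brokerUrl, String queueName) {
        super(StorageLevel.MEMORY_AND_DISK_2());
        this.brokerUrl = brokerUrl;
        this.queueName = queueName;
    }

    @Override
    public void onStart() {
        new Thread(this::receive).start();
    }

    @Override
    public void onStop() {
        // connection cleanup omitted in this sketch
    }

    private void receive() {
        try {
            Connection connection = new ActiveMQConnectionFactory(brokerUrl).createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createQueue(queueName));
            while (!isStopped()) {
                Message message = consumer.receive(1000);
                if (message instanceof TextMessage) {
                    store(((TextMessage) message).getText()); // hand the JSON string to Spark
                }
            }
        } catch (JMSException e) {
            restart("JMS connection failed", e);
        }
    }
}
The resulting stream of JSON strings (via jssc.receiverStream(new JmsReceiver(...))) would then still need the transformation step and the writes to MySQL and Hive.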

Open alternatives to Windows Workflow

Pre-warning: there are some other questions similar to this one, but they don't quite answer it (these include: Alternatives to Windows Workflow Foundation?, Can anyone recommend a .Net open source alternative to Windows Workflow?)
We are developing a system that is an event-based state machine, and currently we are investigating Windows Workflow. Our system needs to be low latency in its response to events from a multitude of sources (XMPP, HTTP, SMS, phone calls, email, etc.) coming into the system, scalable and resilient, and most importantly customisable. For a variety of reasons (and due diligence) I am looking for open workflow engines that support functions similar to Windows Workflow Foundation (and more, if possible), mainly (but it doesn't matter too much if there are engines that don't support some features):
Persistence of long running tasks, and resumption of tasks on external events
High performance, low latency
Ability to develop custom actions
The ability to specify workflows dynamically
Tracking and tracing
I am not constrained to a platform or language, and I would love some help and tips from you guys so that I can start to investigate the engines more closely, as well as hearing about any experiences you have had with them.
Paul.
I invite you to examine Stateless further, as suggested in the answer to my SO question can-anyone-recommend-a-net-open-source-alternative-to-windows-workflow. Achieving the goal of a long-running state machine is very simple: you can store the current state of your state machine in a database and re-sync the state machine when needed. Consider the following code from the Stateless site:
Stateless has been designed with encapsulation within an ORM-ed domain model in mind. Some ORMs place requirements upon where mapped data may be stored. To this end, the StateMachine constructor can accept function arguments that will be used to read and write the state values:
var stateMachine = new StateMachine<State, Trigger>(
    () => myState.Value,
    s => myState.Value = s);
With very little effort you can persist your state, then retrieve that state easily later on.
With respect to updating the workflow dynamically, if you configure a state machine such as
var stateMachine = new StateMachine<string, int>("initial");
and maintain a separate file of states and triggers in XML, you can perform the configuration at runtime by looping through the string/int value pairs.
"Java side":
Apache ODE (Orchestration Director Engine) executes business processes written following the WS-BPEL standard. It talks to web services, sending and receiving messages, handling data manipulation and error recovery as described by your process definition. It supports both long and short living process executions to orchestrate all the services that are part of your application.
http://ode.apache.org/
OSWorkflow can be considered a "low level" workflow implementation. Situations like "loops" and "conditions" that might be represented by a graphical icon in other workflow systems must be "coded" in OSWorkflow.
http://www.opensymphony.com/osworkflow/
Shark is an extendable workflow engine framework including a standard implementation completely based on WfMC specifications, using XPDL (without any proprietary extensions!) as its native workflow process definition format and the WfMC "ToolAgents" API for server-side execution of system activities.
http://www.enhydra.org/workflow/shark/index.html
Python side:
http://bika.sourceforge.net/
http://www.vivtek.com/wftk/
I hope this will help you :-)
You might consider implementing your flow as an actual state machine. Tools like State Machine Compiler and Ragel can help with this. State machines, in many circumstances, are just what you need to implement insanely complex behavior that is testable, and rock-solid. I don't claim to be a Windows work flow expert, but from what I have seen, I question its superiority over coding your own state machine, either by hand or using a tool.
You might want to check out Simple State Machine.
If you feel like you want to have more control over things and want to roll your own it might be helpful to check out the Saga support that projects like NServiceBus and MassTransit use. Sagas look to be very similar to WF workflows but are POCO objects and I believe both projects just use NHibernate for Saga persistence.
I'm going to recommend you take a few hours to look at the book Open-Source ESBs in Action. "Orchestration" and "Choreography" are the key buzzwords to look at when dealing with "enterprise service busses." The systems for .NET are quite expensive (BizTalk is in the price range of a decent car; Tibco is in the price range of a decent house).
Other links:
Open ESB project
Comparison of OpenESB and ServiceMix (both of which are the subject of the "In Action" book above).
Try Drools for Java. I personally have never tried it, but I know several commercial applications are based on Drools.
http://www.jboss.org/drools/
You could also upgrade to .NET 4.0; there are major improvements to Workflow in the new framework. I know if I were writing a new workflow application I would jump to 4.0.
Good Luck
JBoss JBPM
Consider Workflow Engine, a lightweight all-in-one component that enables you to add custom executable workflows of any complexity to any .NET or Java software, be it your own creation or a third-party solution, with minimal changes to existing code. It supports custom actions and commands, has timers and supports parallel workflows. And there's a free version.
You can take a look at Imixs-Workflow, which is an event-driven state machine approach based on BPMN 2.0. It focuses especially on human-centric, long-running tasks.

What are the best practices to log an error?

Many times I have seen errors logged like this:
System.out.println("Method aMethod with parameters a:"+a+" b: "+b);
print("Error in line 88");
So, what are the best practices for logging an error?
EDIT:
This is Java, but it could be C/C++, Basic, etc.
Logging directly to the console is horrendous and, frankly, the mark of an inexperienced developer. The only reasons to do this sort of thing are that 1) he or she is unaware of other approaches, and/or 2) the developer has not thought one bit about what will happen when his/her code is deployed to a production site, and how the application will be maintained at that point. Dealing with an application that is logging 1GB/day or more of completely unneeded debug logging is maddening.
The generally accepted best practice is to use a Logging framework that has concepts of:
Different log objects - Different classes/modules/etc can log to different loggers, so you can choose to apply different log configurations to different portions of the application.
Different log levels - so you can tweak the logging configuration to only log errors in production, to log all sorts of debug and trace info in a development environment, etc.
Different log outputs - the framework should allow you to configure where the log output is sent to without requiring any changes in the codebase. Some examples of different places you might want to send log output to are files, files that roll over based on date/size, databases, email, remoting sinks, etc.
The log framework should never never never throw any Exceptions or errors from the logging code. Your application should not fail to load or fail to start because the log framework cannot create its log file or obtain a lock on the file (unless this is a critical requirement, maybe for legal reasons, for your app).
The eventual log framework you will use will of course depend on your platform. Some common options:
Java:
Apache Commons Logging
log4j
logback
Built-in java.util.logging
.NET:
log4net
C++:
log4cxx
Apache Commons Logging is not intended for an application's general logging. It's intended to be used by libraries or APIs that don't want to force a logging implementation on the API's user.
There are also classloading issues with Commons Logging.
Pick one of the [many] logging APIs, the most widely used probably being log4j or the Java Logging API.
If you want implementation independence, you might want to consider SLF4J, by the original author of log4j.
Having picked an implementation, then use the logging levels/severity within that implementation consistently, so that searching/filtering logs is easier.
The easiest way to log errors in a consistent format is to use a logging framework such as Log4j (assuming you're using Java). It is useful to include a logging section in your code standards to make sure all developers know what needs to be logged. The nice thing about most logging frameworks is they have different logging levels so you can control how verbose the logging is between development, test, and production.
A best practice is to use the java.util.logging framework.
Then you can log messages in either of these formats:
log.warning("..");
log.fine("..");
log.finer("..");
log.finest("..");
Or
log.log(Level.WARNING, "blah blah blah", e);
Then you can use a logging.properties (example below) to switch between levels of logging, and do all sorts of clever stuff like logging to files, with rotation etc.
handlers = java.util.logging.ConsoleHandler
.level = WARNING
java.util.logging.ConsoleHandler.level = ALL
com.example.blah.level = FINE
com.example.testcomponents.level = FINEST
Frameworks like log4j and others should be avoided in my opinion; Java has everything you need already.
EDIT
This can apply as a general practice for any programming language. Being able to control all levels of logging from a single property file is often very important in enterprise applications.
Some suggested best-practices
Use a logging framework. This will allow you to:
Easily change the destination of your log messages
Filter log messages based on severity
Support internationalised log messages
If you are using java, then slf4j is now preferred to Jakarta commons logging as the logging facade.
As stated, slf4j is a facade, and you then have to pick an underlying implementation: either log4j, java.util.logging, or 'simple'.
Follow your framework's advice on ensuring expensive logging operations are not needlessly carried out (see the sketch below).
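With slf4j, for instance, the parameterized form defers building the message string until the level is known to be enabled, and an explicit guard covers genuinely expensive work (class and method names here are made up):
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReportJob {
    private static final Logger LOG = LoggerFactory.getLogger(ReportJob.class);

    void run(int batchSize) {
        // Parameterized logging: the message is only assembled if DEBUG is enabled
        LOG.debug("Starting report job with batchSize={}", batchSize);

        // Explicit guard for genuinely expensive work, e.g. dumping a large object graph
        if (LOG.isTraceEnabled()) {
            LOG.trace("Full job state: {}", buildExpensiveStateDump());
        }
    }

    private String buildExpensiveStateDump() {
        return "..."; // placeholder for an expensive computation
    }
}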
The Apache Commons Logging API, as mentioned above, is a great resource. Referring back to Java, there is also a standard error output stream (System.err).
Directly from the Java API:
This stream is already open and ready to accept output data.
Typically this stream corresponds to display output or another output destination specified by the host environment or user. By convention, this output stream is used to display error messages or other information that should come to the immediate attention of a user even if the principal output stream, the value of the variable out, has been redirected to a file or other destination that is typically not continuously monitored.
Aside from the technical considerations in other answers, it is advisable to log a meaningful message and perhaps some steps to avoid the error in the future, depending on the error, of course.
You get more out of an I/O error when the message states something like "Could not read from file X; you don't have the appropriate permission."
See more examples on SO or search the web.
There really is no best practice for logging an error. It basically just needs to follow a consistent pattern (within the software/company/etc.) that provides enough information to track the problem down. For example, you might want to keep track of the time, the method, the parameters, the calling method, etc.
So long as you don't just print something like "Error in line 88" and nothing else.
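A sketch of what that can look like in practice (slf4j; the class and parameters are made up). The framework supplies the timestamp, level, and thread, while the call site supplies the parameters and the exception with its stack trace:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderService {
    private static final Logger LOG = LoggerFactory.getLogger(OrderService.class);

    public void process(String orderId, int quantity) {
        try {
            // ... business logic ...
        } catch (RuntimeException e) {
            // Logs the parameters and the exception (with stack trace) in one entry;
            // timestamp, level, thread and logger name come from the framework.
            LOG.error("Failed to process order {} (quantity={})", orderId, quantity, e);
            throw e;
        }
    }
}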