Stream from JMS queue and store in Hive/MySQL

I have the following setup (that I cannot change) and I'd like some advice from people who have been down that road. I'm not sure if this is the right place to ask, but here goes anyway.
Various JSON messages are placed on different channels of a JMS queue (Universal Messaging/webMethods).
Before the data can be stored in relational-style DBs it has to be transformed: fields renamed, arrays flattened, and some structures extracted from nested objects.
Data has to be appended to MySQL (as a serving layer for a visualization tool) and Hive (for long-term storage).
We're stuck on Spark 1.4.1 and may move to 1.6.0 in a few months' time. So, structured streaming is not (yet) an option.
At some point the events will be streamed directly to real-time dashboards, so having something in place that is capable of doing that now would be ideal.
Ideally coding is done in Scala (because we already have a considerable batch-based repo with Spark and Scala), so at a minimum the solution has to be JVM-based.
I've looked at Spark Streaming but it does not have a JMS adapter and as far as I can tell operating on JSON would be done using a SQLContext instance on the DStream's RDDs. I understand that it's possible to write a custom adapter, but then I'm not sure if Spark is still the best/easiest solution. I've also looked at the doc for Samza and Flink but did not find much for JMS and/or JSON, at least not natively.
Apache Camel seems like it might have a substantial set of connectors but I'm not too familiar with it, and I get the impression it does not do the streaming part, 'just' the bit where you connect to various systems. There's also Akka although I get the impression it's more of a replacement for messaging systems and JMS is set.
There is an almost bewildering number of available tools, and at this point I'm at a loss as to what to look at or what to look out for. Based on your experience, what would you recommend I use to pick up the messages, transform them, and insert them into Hive and MySQL?
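The transformation step described above (renaming fields, flattening arrays, extracting nested structures) does not depend on the streaming framework chosen. Below is a minimal sketch of that step only, written in Python for brevity; the same logic ports to Scala. The field names, the `RENAMES` mapping, and the helper names `flatten` and `to_rows` are hypothetical and not taken from the actual messages.

```python
import json

# Hypothetical mapping from flattened source names to target column names.
RENAMES = {"orderId": "order_id", "items_sku": "sku", "items_qty": "quantity"}


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict, joining key paths with '_'."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_rows(message, array_field):
    """Turn one JSON message into one flat row per element of `array_field`."""
    doc = json.loads(message)
    # A missing or empty array still yields a single row with the parent fields.
    items = doc.pop(array_field, None) or [None]
    base = flatten(doc)
    rows = []
    for item in items:
        row = dict(base)
        if isinstance(item, dict):
            row.update(flatten(item, array_field))
        elif item is not None:
            row[array_field] = item
        # Apply the renames last; unmapped names pass through unchanged.
        rows.append({RENAMES.get(name, name): value for name, value in row.items()})
    return rows
```

Each returned row is a flat dictionary that can be appended to a relational table; how the rows reach MySQL and Hive is left open here.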

Related

Informatica PowerCenter pipelines to Azure Data Factory

I am trying to move my Informatica pipelines in PowerCenter 10.1 to Azure Data Factory/Synapse pipelines. Other than rewriting them from scratch, is there a way to migrate them? I have not found any tools that do this either. Has anyone faced this problem? Any leads on how to proceed?
Thanks

There are no out-of-the-box solutions available to complete this migration. Unfortunately, you will have to author them again.

Informatica PowerCenter pipelines are a physical implementation of an Extract Transform Load (ETL) process. Each provider has different approaches to the implementations and they do not necessarily map well from one to another. Core Azure Data Factory (ADF) is actually more suited to Extract, Load and Transform (ELT), unless of course you use Data Flows.
So what you have to do is:
map out physically what your current pipeline is doing, if you don't have that documentation already. A simple spreadsheet template mapping out the components of the existing pipeline, tracking source, target plus any transformations will suffice
logically map out what the pipeline is doing; i.e. without using PowerCenter-specific terminology, lay out what the "as is" pipeline is doing. A data flow diagram is a great way to do this
logically map out what the "to be" pipeline should do; i.e. without using any ADF-specific terminology, attempt to refine the "as is" pipeline to its simplest form
using expert knowledge of the ADF components (e.g. Copy, Lookup, Notebook, Stored Proc, to name but a few), map from the logical "to be" to the physical (in the loosest sense of the word, it's all cloud now, right :) ). For example: move data from place to place with the Copy activity; transform data in a SQL database using the Stored Proc activity; handle a repeated activity with a For Each loop (bear in mind these execute in parallel by default); and do sophisticated transformations or processing using Databricks notebooks if required. If you require a low-code approach, consider Data Flows.
So you can see it's just a few simple steps. Good luck!

Messaging library for jeroMQ

I have chosen jeroMQ to build an asynchronous message channel for publishing content from multiple clients. On the other end, server-side workers process each request and notify the client only if the server decides to, based on the message received.
While digging deeper, I went looking for a messaging library to marshal/unmarshal messages. I found the kvpmsg class, which does the job for simple key-value pairs.
I don't want to reinvent the wheel if a standard library exists that can be applied to bigger objects.

It seems like you are asking for data serialization libraries. Check Wikipedia for a list and a comparison of data serialization formats.
Also there is a relevant entry in ZeroMQ FAQ explaining why ZeroMQ doesn't include any serialization format:
Does ØMQ include APIs for serializing data to/from the wire representation?
No. This design decision adheres to the UNIX philosophy of "do one thing and do it well". In the case of ØMQ, that one thing is moving messages, not marshaling data to/from binary representations.
Some middleware products do provide their own serialization API. We believe that doing so leads to bloated wire-level specifications like CORBA (1055 pages). Instead, we've opted to use the simplest wire formats possible which ensure easy interoperability, efficiency and reduce the code (and bug) bloat.
If you wish to use a serialization library, there are plenty of them out there. See for example
Google Protocol Buffers
MessagePack
JSON-GLib
C++ BSON Library
Note that serialization implementations might not be as performant as you might expect. You may need to benchmark your workloads with several serialization formats and libraries in order to understand performance and which format/implementation is best for your use case (ease of development must also be considered).
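The benchmarking advice above can be tried with a small harness. The following is a minimal sketch in Python that compares only two standard-library formats (`json` and `pickle`); the sample `payload`, the `candidates` table, and the `benchmark` helper are illustrative assumptions. To compare Protocol Buffers or MessagePack, add their encode/decode functions to `candidates`. Note that `pickle` is Python-specific and should not be used on untrusted input; it appears here only as a second data point.

```python
import json
import pickle
import timeit

# Hypothetical sample payload; substitute a message representative of your workload.
payload = {"id": 42, "tags": ["a", "b", "c"], "values": list(range(100))}

# Each candidate is a (serialize, deserialize) pair operating on bytes.
candidates = {
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda blob: json.loads(blob.decode("utf-8")),
    ),
    "pickle": (pickle.dumps, pickle.loads),
}


def benchmark(obj, runs=1000):
    """Return encoded size and total encode/decode time per candidate format."""
    results = {}
    for name, (dump, load) in candidates.items():
        encoded = dump(obj)
        # Sanity check: the round trip must reproduce the original object.
        assert load(encoded) == obj
        results[name] = {
            "bytes": len(encoded),
            "encode_s": timeit.timeit(lambda: dump(obj), number=runs),
            "decode_s": timeit.timeit(lambda: load(encoded), number=runs),
        }
    return results


if __name__ == "__main__":
    for name, stats in benchmark(payload).items():
        print(name, stats)
```

Timings vary by machine and payload shape, so the numbers are only meaningful relative to each other on the same host.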

Serialization format common to node.js and ActionScript?

Some of my friends are designing a game, and I am helping them out by implementing the game's backend server. The game is written in Flash, and I plan to develop the server in node.js because (a) it would be a cool project for learning node.js, and (b) it's fast, which is important for games.
The server's architecture is based on messages sent between the server and client (sort of like Minecraft's server protocol). The message format I have so far is a byte (the packet type), two bytes (the message length) and that many bytes (the message data, which is a mapping of key-value pairs). Problem is, I really don't want to develop my own serialization format (because while I probably could, implementing it would be a pain compared to using an existing solution).
Unfortunately, I am having problems finding a good candidate for the message data serialization format.
ActionScript's own remoting format might work, but I don't like it much.
JSON has support in node.js (obviously) and in ActionScript, but it's also textual and I would prefer binary for enhanced speed.
MessagePack looked like a good candidate, but I can't find an ActionScript implementation. (There's one called as3-msgpack on Google Code, but I get weird errors and can't access it.)
BSON has an ActionScript implementation, but no node.js support besides their MongoDB library (and I'm planning on using Redis).
So, can anyone offer any other serialization formats that I might have missed? Or should I just stick with one of these (or roll my own)?
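The framing described above (one byte of packet type, two bytes of length, then the payload) is independent of whichever serialization format fills the payload. Here is a minimal sketch in Python of that framing; the big-endian byte order and the names `encode_packet` and `decode_packets` are assumptions, since the question does not specify them.

```python
import struct

# 1-byte packet type + 2-byte unsigned length, network (big-endian) byte order.
# The byte order is an assumption; the question does not state it.
HEADER = struct.Struct("!BH")


def encode_packet(packet_type, data):
    """Prefix `data` (bytes) with the 3-byte header."""
    if len(data) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length field")
    return HEADER.pack(packet_type, len(data)) + data


def decode_packets(buffer):
    """Split a receive buffer into complete packets.

    Returns (packets, remainder), where packets is a list of
    (packet_type, data) tuples and remainder holds the bytes of a
    packet that has not fully arrived yet.
    """
    packets = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        packet_type, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete packet: wait for more bytes
        packets.append((packet_type, buffer[offset + HEADER.size:end]))
        offset = end
    return packets, buffer[offset:]
```

Because a TCP read can end in the middle of a packet, the decoder hands back the unconsumed tail so the caller can prepend it to the next read.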

Isn't that why HTTP supports gzipped content? Just use JSON and gzip the content when you send it. The time spent gzipping is more than recovered by the reduced latency of the transmission.
Check this article for more on gzip with Actionscript. On node.js I think that gzip-compress is fairly popular.
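Whether gzip pays off depends on message size, so it is worth measuring. The following is a minimal sketch in Python (standing in for the node.js side) that packs JSON with gzip and reports sizes; the helper names `pack`, `unpack`, and `sizes` are hypothetical. It measures bytes only, not latency.

```python
import gzip
import json


def pack(obj):
    """Serialize to compact JSON, then gzip the UTF-8 bytes."""
    text = json.dumps(obj, separators=(",", ":"))
    return gzip.compress(text.encode("utf-8"))


def unpack(blob):
    """Inverse of pack()."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def sizes(obj):
    """Return (raw_json_bytes, gzipped_bytes) for one message."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(raw), len(pack(obj))


if __name__ == "__main__":
    # A tiny message and a large repetitive one, both made up for illustration.
    print("small:", sizes({"x": 1}))
    print("large:", sizes({"strings": ["hello world"] * 500}))
```

The gzip container adds a fixed header and trailer of at least 18 bytes, so very small packets can come out larger than the raw JSON, while large repetitive payloads shrink considerably.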

Actually, if I were in your shoes I would implement two methods and time them. Use JSON because it is common and easy to do. But then implement AMQP instead and compare them. If you want to massively scale this then you might find that AMQP makes it easier. Also, message queuing is just such a nice fit for the node.js world view.
AMQP on Actionscript, and someone doing similar on node.js.

Leverage JSAMF in Node.js for AMF communications with Flash.
http://www.jamesward.com/2010/07/07/amf-js-a-pure-javascript-amf-implementation/

If you wanted to, you could create your entire API in client side JavaScript, and use JSON as the data exchange format, then call ExternalInterface by AS to communicate with the client JavaScript API, which would make for an elegant server side solution.
It is worth noting that Flash Player has built-in support for decompressing gzip-compressed data. It may be worth compressing some of your JSON objects, such as localised string tables and game configuration data, which can grow to a few hundred KB but are only loaded once on game load.

I'm working on a version of MessagePack for AS3.
The current version handles the basics (encoding/decoding). Stream support is planned for the future.
Check the project page: https://github.com/loteixeira/as3-msgpack

RPC for java/python with rest support, HTML monitoring and goodies

Here's my set of requirements: I'm looking for an RPC framework such as thrift, avro, protobuf (when adding services to it) which supports:
Easy and intuitive IDL. No serial numbers, no manual versioning, simple... avro is a good example for this.
Works with Java and Python
Supports both a fast binary protocol and an HTTP-based RESTful style. I'd like to be able to use it for both backend-to-backend communication (java-java or python-java) as well as frontend-to-backend communication (javascript to java).
The REST support needs to accept &param=value input as GET/POST requests (configurable per request) and produce output in three possible formats: JSON, JSONP, XML.
Compact, fast, backward compatible, easy to upgrade etc...
Provides some nice monitoring interfaces such as: JMX, web page status reports (e.g. packets in, packets out, error rate etc)
Ops friendly... no need to take the whole site down to release new versions
Both sync and async communication
... other goodies are welcome...
Is there something out there?
So far I've looked at thrift and avro and they are both nice in some ways, but neither ticks every box on my list.
Thanks

That's a pretty tall order - some of the requirements are met by:
Avro, Thrift, Protobuf and Ice from ZeroC.
Ice is probably the most performant.

Testing and mocking with Flex

I am developing a "dumb" front-end: an AIR application that interacts with a "smart" LiveCycle server. There are currently about 20 request & response pairs for the application. For many reasons (testing, developing outside the corporate network, etc.), we have several XML files of fake data, and if a certain configuration flag is set, the files are loaded, a specific file is parsed and used to create a mock response. Each XML file is a set of responses for a different situation, all internally consistent. We currently have about 10 XML files, each corresponding to a different situation we can run into. This is probably going to grow to 30-50 XML files.
The current system was developed by me during one of those 90-hour-week release cycles, when we were under duress because LiveCycle was down again and we had a deadline to meet. Most of the minor crap has been cleaned up.
The fake data is in an object called FakeData, with properties like customerType1:XML, customerType2:XML, overdueCustomer1:XML, etc. Then in the FakeData constructor, all of the properties are set like this:
customerType1 = FileUtil.loadXML(File.applicationDirectory.resolvePath("fakeData/customerType1.xml"));
And whenever you need some fake data (this happens in special FakeDelegates that extend the real LiveCycle Delegates), you get it from an instance of FakeData.
This is awful, for many reasons, but it works. One embarrassing part is that every time you create an instance of FakeData, it reloads all the XML files.
I'm trying to figure out if there's a design pattern that is not Singleton that can handle this more elegantly. The constraints are:
No global instances can be required (currently, all the code dealing with the fake data, including the fake delegates, is pulled out of production builds without any side-effects, and it needs to stay that way). This puts the Factory pattern out of the running.
It can handle multiple objects using the XML data without performance issues.
The XML files are read centrally so that the other code doesn't have to know where the XML files are, and so some preprocessing can be done (like creating a map of certain tag values and the associated XML file).
Design patterns, or other architecture suggestions, would be greatly appreciated.
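One possible shape for this, without a Singleton, is a repository object that loads each file lazily, caches the parsed result, and is passed to the fake delegates explicitly. Here is a minimal sketch of that idea in Python for brevity (the same structure carries over to ActionScript); the class name `FakeDataRepository` and its `get` method are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


class FakeDataRepository:
    """Loads fake-data XML files on first use and caches the parsed result.

    Instances are created by the test/fake wiring code and handed to the
    fake delegates, so no global instance is needed.
    """

    def __init__(self, directory):
        self._directory = directory
        self._cache = {}

    def get(self, name):
        """Return the parsed root element for `<directory>/<name>.xml`."""
        if name not in self._cache:
            path = os.path.join(self._directory, name + ".xml")
            # Parse only once; later calls reuse the cached element.
            self._cache[name] = ET.parse(path).getroot()
        return self._cache[name]
```

Only this class knows where the files live, each file is read at most once per repository instance, and any preprocessing (such as building a map of tag values to files) can be added inside it.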

Take a look at ASMock, which was developed by a good friend of mine (and a member here, Richard Szalay) and is based on .NET's Rhino Mocks. We've used it in several production environments now, so I can vouch for its stability.
You should be able to get rid of any fake tests (more like integration tests) by using the mock object instead.

Wouldn't it make more sense to do traditional mocking with a mocking framework? Depending on your implementation, it might be possible to set up the Expects by reading the fake-data XML files.
Here is a Google Code project that offers mocking for ActionScript.
