The blurring line between Biztalk and SSIS - ssis

I'm building an application that receive positional flat files from a legacy application and for each detail row I need to search inside a third application for some data and then fill up my database. And in case of any mal formed line in file, I need to stop de processing and log the line and position of malformed string.
At least for now, the files have, max. 50MB.
I'm very confuse who is best suitable for this scenario, Biztalk and SSIS have similar features and, as far I can see, both are suitable in this scenario. This is a task I could make a good use of Biztalk or I should go with a ETL solution (Integration Services) ?

I am normally recommend BizTalk left, right and centre, however in this case I would go with SSIS for two reasons:
On a file 50Mb plus you will get much better performance out of SSIS, no matter how many resources you throw at BizTalk given the way that BTS will process each record within the file. There are strategies here of course, but SSIS will win hands-down (Although I would imagine you web-service will probably be your bottleneck irrespective of which solution you choose); and
Unless you write a custom flat-file disassembler (which is almost rocket-science, BizTalk God territory), the standard disassembler will simply stop when it reaches a malformed line, logging the error into the Event Log and no further message processing will take place.
As an aside, I have been dropped into far too many projects where customers have a solution written in BizTalk where batch operations are being performed. The original development and testing was completed on flat-files c. 1Mb - 10Mb. Customers are then confused when a 50Mb - 100Mb+ files take so long to process!
Its much better to choose the right solution to the problem (IMHO, SSIS) at the beginning of the project, rather than crowbar a solution onto a product that isn't suitable.

I would probably do this in SSIS. It appears to be a ETL job. BizTalk might give you better flexibility given the source of the data long term but if as you say it's a web service this is something that can be accomplished in SSIS.
Generally speaking SSIS = batch process and direct data translations. BizTalk = Messaging / horizontal systems requests / responses that may or may not need to be synchronized.
But don't take my word for it. Include effort, software costs if they matter, and longevity of this process.

Related

Really streaming large MySQL result sets

There's a similar question about streaming large results but the answer just points at docs and no clear answer emerges.
I believe that merely treating a full result set as a stream still takes a lot of memory on the jdbc driver side..
I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
And in particular I am not sure why setFetchSize(Integer.MIN_VALUE) is a very good idea, as it seems far from optimal if that means each row is sent on its own on the wire.
I believe libraries like jooq and slick already take care of that... and am curious as to how to accomplish it with and without them.
Thanks!
I am wondering if there's any clear cut pattern, or best practice, for making it work, especially on the jdbc driver side.
The best practice is not to do synchronous streaming but rather fetch in moderate size chunks. However avoid using OFFSET (also see). If your doing a batch process this can be facilitated by first selecting and pushing the data into a temporary table (ie turn your original results you want into a table first and then select chunks from the table... databases are really fast at copying data internally).
Synchronous streaming in general does not scale (aka iterator). It does not scale well for batch processing and it certainly does not scale for handling loads of clients. This is why the drivers vary and do so many different things because its fairly difficult to figure out how much resources to load because it is a pull model. Async streaming (push model) would probably help but unfortunately the JDBC standard does not support async streaming.
You might notice but this is one of the reasons why many of the wrappers around JDBC such as Spring JDBC do not return Iterators (along with fact that the resource also needs to be manually cleaned up). Some of the wrappers provide iterators but really they just turn the results into a list .
Your link to the Scala version is rather disturbing that its upvoted given the stateful nature of managing a ResultSet... its very un-Scala like... I'm not sure those folks know they have to consume the iterator or close the connection/ResultSet properly which requires a fair amount of imperative programming.
While it may seem inefficient to let the database decide how much to buffer just remember that most database connections are extremely heavy memory wise (at least on postgres they are). So if you take a long time streaming and have many clients your going to have to create more connections and put serious burden on the database. Not to mention the default buffers have probably been highly optimized (ie the resultset size that client ends up with).
Finally for batch processing chunks can be done in parallel which is obviously more efficient than a synchronous pipeline as well as being restarted (with out having to rework already processed data) if a problem occurs.

Realtime synchronization of live data over network

How do you sync data between two processes (say client and server) in real time over network?
I have various documents/datasets constructed on the server, which are downloaded and displayed by clients. Once downloaded, the document receives continuous updates in order to remain fresh.
It seems to be a simple and commonly occurring concept, but I cannot find any tools that provide this level of abstraction. I am not even sure what I am looking for. Perhaps there is a similar concept with solid tool support? Perhaps there is a chain of different tools that must be put together? Here's what I have considered so far:
I am required to propagate every change in a single hop (0.5 RTT), which rules out polling (typically >10 RTT) and cache invalidation techniques (1.5 RTT).
Data replication and simple notification broadcasts are not an option, because there is too much data and too many changes. Clients must be able to select specific documents to download and monitor for changes.
I am currently using message passing pattern, which does the job, but it is hopelessly unproductive. It works at way too low level of abstraction. It is laborious, error-prone, and it doesn't scale well with increasing application complexity.
HTTP and other RPC-like techniques are good for the initial fetch, but they encourage polling for subsequent synchronization. When performing reverse requests (from data source to data consumer), change notifications are possible, but it's even more complicated than message passing.
Combining RPC (for the initial fetch) with message passing (for updates) turned out to be a nightmare due to the complexity involved in coordinating communication over the two parallel connections as well as due to the impedance mismatch between the two paradigms. I need something unified.
WebSocket & Comet are popular methods to implement change notification, but they need additional libraries to be productive and I am not aware of any libraries suitable for my application.
Message queues merely put an intermediary on the network while maintaining the basic message passing pattern. Custom message filters/routers allow me to get closer to the live document concept, but I feel like I am implementing custom middleware layer on top of the MQ.
I have tons of additional requirements (native observable data structure API on both ends, incremental updates, custom message filters, custom connection routing, cross-platform, robustness & scalability), but before considering those requirements, I need to find some tools that at least attempt to do what I need. I am trying to avoid in-house frameworks for the standard reasons - cost, time to market, long-term maintenance, and keeping developers happy.
My conclusion at the moment is that there is no such live document synchronization framework. In-house solution is the way to go, but many existing components can be used as part of the solution.
It is pretty simple to layer live document logic on top of WebSocket or any other message passing platform. Server just sends the document as a separate message when the connection is initiated and then after every change. Automated reconnection and some connection monitoring must be added to handle network failures.
Serialization at both ends is a separate problem targeted by many existing libraries. Detecting changes in server-side data structures (needed to initiate push) is yet another separate problem that has its own set of patterns and tools. Incremental updates and many other issues can be solved by intermediaries intercepting the connection.
This approach will work with current technology at the cost of extensive in-house glue code. It can be incrementally substituted with standard components as they become available.
WebSocket already includes resource URIs, routing, and a few other nice features. Useful intermediaries and libraries will likely emerge in the future. HTTP with text/event-stream MIME type is a possible future alternative to WebSocket. The advantage of HTTP is that existing tools can be reused with little modification.
I've completely thrown away the pattern of combining RPC pull with separate push channel despite rich tool support. Pushing everything in 0.5 RTT requires the push channel to use exactly the same technology as the pull channel, i.e. reverse RPC. Reverse RPC is like message passing except it introduces redundant returns, throws away useful connection semantics, and makes it hard to insert content-agnostic intermediaries into the stream.

How to manage or convert extremely large program

I have a Microsoft Access business critical database that was originally created in the 90's and has been enlarged and upgraded up to Access 2007 at this point. We have been using this database as a front end for a custom written ERP system essentially. We have moved most of the data over to an SQL server long ago, but we are still using MS Access as a front end. AS the project grows, we have a full time developer, we have started having stability problems and extremely frequent crashes of unknown causes.
As an example: 1 time out of 10 or so, a certain form will crash if I change the data in 1 specific field. There is no code firing at the time, the data is in a local temporary table that typically only has 5 rows most of the time. If I change the data in the table nothing goes wrong, but if I change it on the form Access will hard crash and dump me to the desktop. There are other examples I could provide of unexplained crashes
I am looking for advice on where to go at this point -- the access front end has all of the business logic for running our company essentially so I can't just abandon it. Ideally we would re-write the entire front end in some other language. The problem is that as a small company we don't have the resources to re-write the entire system in anything resembling a good time frame, and don't have the cash flow to pay someone else to do it. My ideal solution would be a conversion of some sort from the access front end to another end point -- whether web or local windows -- but my searches here and on google make that seem like a non-starter.
So essentially every avenue I look at seems to be a dead-end:
We can't find the source of the crashes to stabilize our current system,
We can't stop production in our current system for as long as it would take to re-write it,
We can't afford to pay someone else to write a new system,
Automated conversion tools seem like a waste of money
Are there other options or which of the options that I have thought of seems best?
We have an enterprise level program with an Access front end and an SQL Server back end. I wonder if it might not help to split the program up into different pieces for diagnostics. For instance if you have Order Entry and Inventory Management you could have one front end for each function. (Yes I can hear the howling in the background but if it was only for the purpose of diagnosis maybe it would help... )
You can also export the Access Database objects to text files and then import them into a fresh new database to get rid of weird errors in some occasions.
Well, I guess I will rephrase the basic issue here. If you have two Computing Science graduates on staff then they should have long ago anticipated that they reached the limits of Access. I fail to see this as any different than an overworked, oven in a restaurant that now have too many customers or a delivery truck that does not have the capacity to deliver goods to customers.
Since funds don't exist to re-write then your staff failed to put aside funds on a monthly bases to deal with this situation and now your choices as a result of this delayed action to deal with this growing problem places you in a difficult situation.
The computing science people you have on staff should have long ago seen this wall and limit you hit coming. In my experience is with most CS people is they consider Access rather limited, and thus even MORE alarm bells should have been ringing here and this means even less excuse exists for you to be placed in this unfortunate predicament.
So, assuming the computing science staff you have are well maintaining this application (and I graciously accept this is the case), then then a logical conclusion is this application has reached or exceeded the limits of Access. As noted such limits should have been long ago anticipated.
As you well point out that funds now do not exist for a re-write, then few choices exist without such funds.
However, you case may not be so bad since we NOW know you have experienced developers on staff. Given this case, then my suggestion would be to consider breaking out modules or small manageable parts and features one at a time from the application and having your well trained and experienced developers build either a web interface, or perhaps even using something like .net if you wish to stay 100% desktop. So this "window" of opportunity is a great chance to consider a change in architecture)
Since the data limits are SQL server and NOT Access, then both applications (existing access front end) and the new parts can BOTH well and easy operate on the same data. As you do this, you then break out and remove the existing parts from the Access application. This would suggest you eventaully return to accetable stability in the Access applicaiton. At that point , you could continue, or stop to save funds.
As noted, without funds to re-write, then the only choice is to find some means to free up SOME limited resources on a monthly basis to solve this problem.
At the end of the day, the solution to this problem is more resources, but without such resources then few technical choices and options exist here. Based on the information given so far YOU have made it clear you don't have resources here. However, the solution to this problem here requires resource allocation and planning.
In other words a technical fix to this problem without resources allocated is not likely an available option for you.
I apologize sincerely for not being able to give you a technology solution here, but this looks to be a solution that will require resources to be allocated to the problem and no simple shortcut or trick or magic silver bullet exists here.

Is it a good idea to collect data from devices into a message queue?

Let's say you have 1000's of devices all sending data in all the time, would a message queue be a good data collection tool for this data?
Obviously, it depends:
Do you need to process every set of data, regardless of how old it
is?
Will data arive in a stead stream, or be bursty?
Can a single
application process all the data, or will it need to be load
balanced?
Do you have any need for the "messaging" features, such as
topics?
If your devices would be the clients, is there a client
implementation that works on them?
If you have a stream of data that can be processed by a single application, and you are tolerant of occasional lost data, I would keep it simple and post the data via REST or equivalent. I would only look to messaging once you needed either scalability, durability, fault tolerance, or the ability to level out load of time.
You can't go wrong design-wise with employing queues, but as Chris (other responder) stated, it may not be worth your effort in terms of infrastructure as web servers are pretty good at handling reasonable load.
In the "real world" I have seen commercial instruments report status into a queue for processing, so it is certainly a valid solution.

Flow Based Programming

I have been doing a little reading on Flow Based Programming over the last few days. There is a wiki which provides further detail. And wikipedia has a good overview on it too. My first thought was, "Great another proponent of lego-land pretend programming" - a concept harking back to the late 80's. But, as I read more, I must admit I have become intrigued.
Have you used FBP for a real project?
What is your opinion of FBP?
Does FBP have a future?
In some senses, it seems like the holy grail of reuse that our industry has pursued since the advent of procedural languages.
1. Have you used FBP for a real project?
We've designed and implemented a DF server for our automation project (dispatcher, component iterface, a bunch of components, DF language, DF compiler, UI). It is written in bare C++, and runs on several Unix-like systems (Linux x86, MIPS, avr32 etc., Mac OSX). It lacks several features, e.g. sophisticated flow control, complex thread control (there is only a not too advanced component for it), so it is just a prototype, even it works. We're now working on a full-featured server. We've learnt lot during implementing and using the prototype.
Also, we'll make a visual editor some day.
2. What is your opinion of FBP?
2.1. First of all, dataflow programming is ultimate fun
When I met dataflow programming, I was feel like 20 years ago, when I met programming first. Altough, DF programming differs from procedural/OOP programming, it's just a kind of programming. There are lot of things to discover, even sooo simple ones! It's very funny, when, as an experienced programmer, you met a DF problem, which is a very-very basic thing, but it was completely unknown for you before. So, if you jump into DF programming, you will feel like a rookie programmer, who first met the "cycle" or "condition".
2.2. It can be used only for specific architectures
It's just a hammer, which are for hammering nails. DF is not suitable for UIs, web server and so on.
2.3. Dataflow architecture is optimal for some problems
A dataflow framework can make magic things. It can paralellize procedures, which are not originally designed for paralellization. Components are single-threaded, but when they're organized into a DF graph, they became multi-threaded.
Example: did you know, that make is a DF system? Try make -j (see man, what -j is used for). If you have multi-core machine, compile your project with and without -j, and compare times.
2.4. Optimal split of the problem
If you're writing a program, you often split up the problem for smaller sub-problems. There are usual split points for well-known sub-problems, which you don't need to implement, just use the existing solutions, like SQL for DB, or OpenGL for graphics/animation, etc.
DF architecture splits your problem a very interesting way:
the dataflow framework, which provides the architecture (just use an existing one),
the components: the programmer creates components; the components are simple, well-separated units - it's easy to make components;
the configuration: a.k.a. dataflow programming: the configurator puts the dataflow graph (program) together using components provided by the programmer.
If your component set is well-designed, the configurator can build such system, which the programmer has never even dreamed about. Configurator can implement new features without disturbing the programmer. Customers are happy, because they have personalised solution. Software manufacturer is also happy, because he/she don't need to maintain several customer-specific branches of the software, just customer-specific configurations.
2.5. Speed
If the system is built on native components, the DF program is fast. The only time loss is the message dispatching between components compared to a simple OOP program, it's also minimal.
3. Does FBP have a future?
Yes, sure.
The main reason is that it can solve massive multiprocessing issues without introducing brand new strange software architectures, weird languages. Dataflow programming is easy, and I mean both: component programming and dataflow configuration building. (Even dataflow framework writing is not a rocket science.)
Also, it's very economic. If you have a good set of components, you need only put the lego bricks together. A DF program is easy to maintain. The DF config building requires no experienced programmer, just a system integrator.
I would be happy, if native systems spread, with doors open for custom component creating. Also there should be a standard DF language, which means that it can be used with platform-independent visual editors and several DF servers.
Interesting discussion! It occurred to me yesterday that part of the confusion may be due to the fact that many different notations use directed arcs, but use them to mean different things. In FBP, the lines represent bounded buffers, across which travel streams of data packets. Since the components are typically long-running processes, streams may comprise huge numbers of packets, and FBP applications can run for very long periods - perhaps even "perpetually" (see a 2007 paper on a project called Eon, mostly by folks at UMass Amherst). Since a send to a bounded buffer suspends when the buffer is (temporarily) full (or temporarily empty), indefinite amounts of data can be processed using finite resources.
By comparison, the E in Grafcet comes from Etapes, meaning "steps", which is a rather different concept. In this kind of model (and there are a number of these out there), the data flowing between steps is either limited to what can be held in high-speed memory at one time, or has to be held on disk. FBP also supports loops in the network, which is hard to do in step-based systems - see for example http://www.jpaulmorrison.com/cgi-bin/wiki.pl?BrokerageApplication - notice that this application used both MQSeries and CORBA in a natural way. Furthermore, FBP is natively parallel, so it lends itself to programming of grid networks, multicore machines, and a number of the directions of modern computing. One last comment: in the literature I have found many related projects, but few of them have all the characteristics of FBP. A list that I have amassed over the years (a number of them closer than Grafcet) can be found in http://www.jpaulmorrison.com/cgi-bin/wiki.pl?FlowLikeProjects .
I do have to disagree with the comment about FBP being just a means of implementing FSMs: I think FSMs are neat, and I believe they have a definite role in building applications, but the core concept of FBP is of multiple component processes running asynchronously, communicating by means of streams of data chunks which run across what are now called bounded buffers. Yes, definitely FSMs are one way of building component processes, and in fact there is a whole chapter in my book on FBP devoted to this idea, and the related one of PDAs (1) - http://www.jpaulmorrison.com/fbp/compil.htm - but in my opinion an FSM implementing a non-trivial FBP network would be impossibly complex. As an example the diagram shown in
is about 1/3 of a single batch job running on a mainframe. Every one of those blocks is running asynchronously with all the others. By the way, I would be very interested to hearing more answers to the questions in the first post!
1: http://en.wikipedia.org/wiki/Pushdown_automaton Push-down automata
Whenever I hear the term flow based programming I think of LabView, conceptually. Ie component processes who's scheduling is driven primarily by a change to its input data. This really IS lego programming in the sense that the labview platform was used for the latest crop of mindstorm products. However I disagree that this makes it a less useful programming model.
For industrial systems which typically involve data collection, control, and automation, it fits very well. What is any control system if not data in transformed to data out? Ie what component in your control scheme would you not prefer to represent as a black box in a bigger picture, if you could do so. To achieve that level of architectural clarity using other methodologies you might have to draw a data domain class diagram, then a problem domain run time class relationship, then on top of that a use case diagram, and flip back and forth between them. With flow driven systems you have the luxury of being able to collapse a lot of this information together accurately enough that you can realistically design a system visually once the components are build and defined.
One question I never had to ask when looking at an application written in labview is "What piece of code set this value?", as it was inherent and easy to trace backwards from the data, and also mistakes like multiple untintended writers were impossible to create by mistake.
If only that was true of code written in a more typically procedural fashion!
1) I build a small FBP framework for an anomaly detection project, and it turns out to have been a great idea.
You can also have a look at some of the KNIME videos, that give a good idea of what a flow based framework feels like when the framework is put together by a great team. Admittedly, it is batch based and not created for continuous operation.
By far the best example of flow based programming, however, is UNIX pipes which is one of the oldest, most overlooked FBP framework. I don't think I have to elaborate on the power of nix pipes...
2) FBP is a very powerful tool for a large set of problems. The intrinsic parallelism is a great advantage, and any FBP framework can be made completely network transparent by using adapter modules. Smart frameworks are also absurdly fault tolerant, and able to dynamically reload crashed modules when necessary. The conceptual simplicity also allows cleaner communication with everybody involved in a project, and much cleaner code.
3) Absolutely! Pipes are here to stay, and are one of the most powerful feature of unix. The power inherent in a FBP framework compared to a static program are many, and trivialise change, to the point where some frameworks can be reconfigured while running with no special measures.
FBP FTW! ;-)
In automotive development, they have a language agnostic messaging protocol which is part of the MOST specification (Media Oriented Systems Transport), this was designed to communicate between components over a network or within the same device. Systems usually have both a real and visualized message bus - therefore you effectively have a form of flow based programming.
That was what made the light bulb go on for me several years ago and brought me here. It really is a fantastic way to work and so much more fun than conventional programming. The message catalog form the central specification and point of reference. It works well for both developers and management. i.e. Management are able to browse the message catalog instead of looking at source.
With integrated logging also referencing the catalog to produce intelligible analysis things can get really productive. I have real world experience of developing commercial products in this way. I am interested in taking things further, particularly with regards to tools and IDEs. Unfortunately I think many people within the automotive sector have missed the point about how great this is and have failed to build on it. They are now distracted by other fads and failed to realize that there was far more to most development than the physical bus.
I've used Spring Web Flow extensively in Java Web applications to model (typically) application processes, which tend to be complex wizard-like affairs with lots of conditional logic as to which pages to display. Its incredibly powerful. A new product was added and I managed to recut the existing pieces into a completely new application process in an hour or two (with adding a couple of new views/states).
I also looked into using OS Workflow to model business processes but that project got canned for various reasons.
In the Microsoft world you have Windows Workflow Foundation ("WWF"), which is becoming more popular, particularly in conjunction with Sharepoint.
FBP is just a means of implementing a finite state machine. It's nothing new.
I realize that it is not exactly the same thing, but this model has been used for years in PLC programming. ISO calls it Sequential Flow Chart, but many people call it Grafcet after a popular implementation. It offers parallel processing and defines transitions between states.
It's being used in the Business Intelligence world these days to mashup and process data. Data processing steps like ETL, querying, joining , and producing reports can be done by the end-user. I'm a developer on an open system - ComposableAnalytics.com In CA, the flow-based apps can be shared and executed via the browser.
This is what MQ Series, MSMQ and JMS are for.
This is cornerstone of Web Services and Enterprise Service Bus implementations.
Products like TIBCO and Sun's JCAPS are basically flow-based without using this particular buzz-word.
Most of the work of the application is done with small modules that pass messages through a processing network.