SSIS - Script Component pulling information from RabbitMQ - ssis

A question that might be mostly theoretical, but I'd love to have my concerns put to rest (or confirmed).
I built a Script Component to pull data from RabbitMQ. On RabbitMQ, we basically set up a durable queue. This means messages will continue to be added to the queue, even when the server reboots. This construction allows us to periodically execute the package and grab all "new" messages since the last time we did so.
(We know RabbitMQ isn't set up to accommodate to this kind of scenario, but rather it expects there to be a constant listener to process messages. However, we are not comfortable having some task start when SQL Server starts, and pretty much running 24/7 to handle that, so we built something we can schedule to run every n minutes and empty the queue that way. If we'd not be able to run the task, we most likely are dealing with a failed SQL Server, and have different priorities).
The component sets up a connection, and then connects to the specific exchange + queue we are pulling messages from. Messages are in JSON format, so we deserialize the message into a class we defined in the script component.
For every message found, we disable auto-acknowledge, so we can process it and only acknowledge it once we're done with it (which ensures the message will be processed, and doesn't slip through). Then we de-serialize the message and push it onto the output buffer of the script component.
There's a few places things can go wrong, so we built a bunch of Try/Catch blocks in the code. However, seeing we're dealing with the queue aspect, and we need the information available to us, I'm wondering if someone can explain how/when a message that is sent to the output buffer is processed.
Is it batched up and then pushed? Is it sent straight away, and is the SSIS component perhaps not updating information back to SSIS in a timely fashion?
Would there be a chance for us to acknowledge a message, but that it somehow ends up not getting committed to our database, yet popped from the queue (as I think happens once a message is acknowledged)?

Related

How to make polling from database scalable?

I am try to find a scalable way to allow for my desktop application to run command when a change in the database is made.
The application is for running a remote command on your PC. The user logs into the website and can choose the run the command. Currently, users have to download a desktop application that checks the database every few seconds to see if a value has changed. The value can only be changed when they login to a website and press a button.
For now it seems to be working fine since there aren't many users. But when I hit 100+ users hitting the database 100+ times every few seconds is not good. What might be a better approach?
It's true that polling for changes is too expensive, especially if you have many clients. The queries are often very costly, and it's tempting to run the queries frequently to make sure the client gets notified promptly after a change. It's better to avoid polling the database.
One suggestion in the comments above is to use a UDF called from a trigger. But I don't recommend this, because a trigger runs when you do an INSERT/UPDATE/DELETE, not when you COMMIT the change. So a client could be notified of a change, and then when they check the database the change appears to not be there, because either the transaction was rolled back, or else the transaction simply hasn't been committed yet.
Another reason the trigger solution is not good is that MySQL triggers execute once for each row changed, not once for each INSERT/UPDATE/DELETE statement. So you could cause notification spam, if you do an UPDATE that affects thousands of rows.
A different solution is to use a message queue like RabbitMQ or ActiveMQ or Amazon SQS (there are many others). When a client commits their INSERT/UPDATE/DELETE, they confirm the commit succeeded, then post a message on a message queue topic. Many clients can be notified efficiently this way. But it requires that every client who commits changes to the database write code to post to the message queue.
Another solution is for clients to subscribe to MySQL's binary log and read it as a change data capture log. Every committed change to the database is logged in the binary log. You can make clients read this, and it has no more impact to the database server than a replication client (MySQL can easily support hundreds of replicas).
A hybrid solution is to consume the binary log, and turn those changes into events in a message queue. This is how a product like Debezium works. It reads the binary log, and posts events to an Apache Kafka message queue. Then other clients can wait for events on the Kafka queue and respond to them.

Will IIS ever terminate the thread if a POST gets canceled by the browser [duplicate]

Environment:
Windows Server 2003 - IIS 6.x
ASP.NET 3.5 (C#)
IE 7,8,9
FF (whatever the latest 10 versions are)
User Scenario:
User enters search criteria against large data-set. After initiating the request, they are navigated to a results page, where they wait until the data is loaded and can then refine the data.
Technical Scenario:
After user sends search criteria (via ajax call), UI calls back-end service. Back-end service queries transactional system(s) and puts the resulting data into a db "cache" - a denormalized table, set-up for further refining the of the data (i.e. sorting, filtering). UI waits until the data is cached and then upon getting notified that the process is done, navigates to a resulting page. The resulting page then makes a call to get the data from the denormalized table.
Problem:
The search is relatively slow (15-25 seconds) for large queries that end up having to query many systems based on the criteria entered. It is relatively fast for other queries ( <4 seconds).
Technical Constraints:
We can not entirely re-architect this search / results system. There are way to many complexities here between how the UI and the back-end is tied together. The page is required (because of constraints that can not be solved on StackOverflow) to turn after performing the search criteria.
We also can not ask the organization to denormalize the data prior to searching because the data has to be real-time, i.e. if a user makes a change in other systems, the data has to show up correctly if they do a search afterwards.
Process that I want to follow:
I want to cheat a little. I want to issue the "Cache" request via an async HttpHandler in a fire-forget model.
After issuing the query, I want to transition the page to the resulting page.
On the transition page, I want to poll the "Cache" table to see if the data has been inserted into it yet.
The reason I want to do this transition right away, is that the resulting page is expensive on itself (even without getting the data) - still 2 seconds of load time before even getting to calling the service that gets the data from the cache.
Question:
Will the ASP.NET thread that is called via the async handler reliably continue processing even if I navigate away from the page using a javascript redirect?
Technical Boundaries 2:
Yes, I know... This search process does not sound efficient. There is nothing I can do about that right now. I am trying to do whatever I can to get it to perform a little better while we continue researching how we are going to re-architect it.
If your answer is to: "Throw it away and start over", please do not answer. That is not acceptable.
Yes.
There is the property Response.IsClientConnected which is used to know if a long running process is still connected. The reason for this property is a processes will continue running even if the client becomes disconnected and must be manually detected via the property and manually shut down if a premature disconnect occurs. It is not by default to discontinue a running process on client disconnect.
Reference to this property: http://msdn.microsoft.com/en-us/library/system.web.httpresponse.isclientconnected.aspx
update
FYI this is a very bad property to rely on these days with sockets. I strongly encourage you to do an approach which allows you to quickly complete a request that notes in some database or queue of some long running task to complete, probably use RabbitMQ or something like that, that in turns uses socket.io or similar to update the web page or app once completed.
How about don't do the async operation on an ASP.NET thread at all? Let the ASP.NET code call a service to queue the data search, then return to the browser with a token from the service, where it will then redirect to the result page that awaits the completed result? The result page will poll using the token from the service.
That way, you won't have to worry about whether or not ASP.NET will somehow learn that the browser has moved to a different page.
Another option is to use Threading (System.Threading).
When the user sends the search criteria, the server begins processing the page request, creates a new Thread responsible for executing the search, and finishes the response getting back to the browser and redirecting to the results page while the thread continues to execute on the server background.
The results page would keep verifying on the server if the query execution had finished as the started Thread would share the progress information. When it does finish, the results are returned when the next ajax call is done by the results page.
It could also be considered using WebSockets. In a sense that the Webserver itself could tell the browser when it is done processing the query execution as it offers full-duplex communications channels.

Push Data onto Queue vs Pull Data by Workers

I am building a web site backend that involves a client submitting a request to perform some expensive (in time) operation. The expensive operation also involves gathering some set of information for it to complete.
The work that the client submits can be fully described by a uuid. I am hoping to use a service oriented architecture (SOA) (i.e. multiple micro-services).
The client communicates with the backend using RESTful communication over HTTP. I plan to use a queue that the workers performing the expensive operation can poll for work. The queue has persistence and offers decent reliability semantics.
One consideration is whether I gather all of the data needed for the expensive operation upstream and then enqueue all of that data or whether I just enqueue the uuid and let the worker fetch the data.
Here are diagrams of the two architectures under consideration:
Push-based (i.e. gather data upstream):
Pull-based (i.e. worker gathers the data):
Some things that I have thought of:
In the push-based case, I would be likely be blocking while I gathered the needed data so the client's HTTP request would not be responded to until the data is gathered and then enqueued. From a UI standpoint, the request would be pending until the response comes back.
In the pull based scenario, only the worker needs to know what data is required for the work. That means I can have multiple types of clients talking to various backends. If the data needs change I update just the workers and not each of the upstream services.
Any thing else that I am missing here?
Another benefit of the pull based approach is that you don't have to worry about the data getting stale in the queue.
I think you already pretty much explained that the second (pull-based) approach is better.
If a user's request should anyway be processed asynchronously, why wait for the data to be gathered and then return a response. You need just to queue a work item and return HTTP response.
Passing data via queue is not a good option. If you get the data upstream, you will have to pass it somehow other than via queue to the worker (usually BLOB storage). That is additional work that is not really needed in your case.
I would recommend Cadence Workflow instead of queues as it supports long running operations and state management out of the box.
Cadence offers a lot of other advantages over using queues for task processing.
Built it exponential retries with unlimited expiration interval
Failure handling. For example it allows to execute a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long running heartbeating operations
Ability to implement complex task dependencies. For example to implement chaining of calls or compensation logic in case of unrecoverble failures (SAGA)
Gives complete visibility into current state of the update. For example when using queues all you know if there are some messages in a queue and you need additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over Cadence programming model.

How to write an event trigger which send alerts to a JMS Queue

Is there any example where, we can trigger an event to send messages to JMS Queue when a table is updated/inserted ect for MYSQL/Postgre?
This sounds like a good task for pg_message_queue (which you can get off Google Code or PGXN), which allows you to queue requests. pg_message_queue doesn't do a great job of parallelism yet (in terms of parallel queue consumers), but I don't think you need that.
What you really want to do (and what pg_message_queue provides) is a queue table to hold the jms message, and then a trigger to queue that message. Then the question is how you get it from there to jms. You have basically two options (both of which are supported):
LISTEN for notifications, and when those come in handle them.
Periodically poll for notifications. You might do this if you have a lot of notifications coming in, so you can batch them every minute or so, or if you have few notifications coming in and you want to process them at midnight.
Naturally that is PostgreSQL only. Doing the same on MySQL? I don't know how to do that. I think you would be stuck with polling the table, but you could use pg_message_queue to understand basically how to do the rest. Note that in all cases this is fully transactional so the message would not be sent until after transaction commit, which is probably what you want.

How to retract a message in RabbitMQ?

I've got something like a job queue over RabbitMQ and, upon a request to cancel a job, I'd like to retract the tasks that have not yet started processing (their messages have not been ack'd), which corresponds to retracting these messages from the queues that they've been routed to.
I haven't found this functionality in AMQP or in the RabbitMQ API; perhaps I haven't searched well enough? Or will I have to use a workaround (it's not hard, but still)?
I would solve this scenario by having the worker check some sort of authoritative data source to determine if the the job should proceed or not. For example, the worker would check the job's status in a database to see if the job was canceled already.
For scenarios where the speed of processing jobs may be faster than the speed with which the authoritative store can be updated and read, a less guaranteed data store that trades speed for other characteristics may be useful.
An example of this would be to use Redis as the store for canceling processing of a message instead of a relational DB like MySQL. Redis is very fast, but makes fewer guarantees regarding the data it holds, whereas MySQL is much slower, but offers more guarantees about the data it holds.
In the end, the concept of checking with another source for whether or not to process a message is the same, but the way you implement that depends on your particular scenario.
RabbitMQ doesn't let you modify or delete messages after they've been enqueued. For that, you want some kind of database to hold the state of each job, and to use RabbitMQ to notify interested parties of changes in that state.
For lowish volumes, you can kludge it together with a queue per job. Create the queue, post the job description to the queue, announce the name of the queue to the workers. If the job needs to be cancelled before it is processed, deleted the job's queue; when the workers come to fetch the job description, they'll notice the queue has vanished.
Lighterweight and generally better would be to use redis or another key/value store to hold the job state (with a deleted or absent record meaning a cancelled or nonexistent job) and to use rabbitmq to notify about new/removed/changed records in the key/value store.
At least two ways to achieve your target:
basic.reject will requeue message if requeue=true is set (otherwise it will reject message).
(supported since RabbitMQ 2.0.0; see http://www.rabbitmq.com/blog/2010/08/03/well-ill-let-you-go-basicreject-in-rabbitmq/).
basic.recover will ask broker to redeliver unacked messages on channel.
You need to subscribe to all the queues to which messages have been routed, and consume them with ack.
For instance if you publish to a topic exchange with "test" as the routing key, and there are 3 persistent queues which subscribe to "test" you would need to consume those three queues. It might be better to add another queue which your consumer processes would also listen too, and tell them to ignore those messages.
An alternative, since you are using RabbitMQ, is to write a custom exchange plugin that will accept some out of band instruction to clear all queues. For instance you might have that exchange read a special message header that tells it to clear all queues to which this message is destined. This does require writing Erlang code, but there are 4 different exchange types implemented so you would only need to copy the most similar one and write the code for the new bahaviours. If you only use custom headers for this, then the body of the message can be a normal message for the consumers.
To sum up:
1) the publisher needs to consume the messages itself
2) the publisher can send a special message in a special queue to tell consumers to ignore the message
3) the publisher can send a special message to a custom exchange that will clear any existing messages from the queues before sending this special message to consumers.