Is using a WebServer for executing methods remotely a good architectural decision? - json

A brief explanation of what I am talking about is presented below.
In my organization a somewhat strange approach to load distribution is implemented; I will attempt to explain it below.
A JBoss server runs on a machine with some IP xxx.xxx.xxx.xxx
There is an app which has a lot of heavy, time- and resource-consuming work
to do (heavy in terms of I/O operations, e.g. large file uploads &
downloads - in gigabytes)
Methods are written in a REST WebApp and accessible by URL; parameters
to the methods are passed as URL params on requests of type POST, and the
return values of the methods come back as JSON output.
The app in question has a framework written just to make POST calls
to the aforementioned WebApp, wrapped nicely so that the calls can be
made without the programmer knowing what happens in the background.
This framework has externalized parameters which take the IPs of
the machines running the WebApps, configure a list of the
machines available, and route each method call to the one that is least busy
whenever a method call to the framework is made.
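Roughly, such a wrapper amounts to something like the following. This is only a minimal sketch using the third-party requests library; the endpoint layout, port, and the least-busy policy shown here are illustrative assumptions, not the actual framework.

# Minimal sketch of a client-side wrapper (illustrative only; the real
# framework, endpoint names, and least-busy selection are assumptions here).
import requests

class RemoteMethodClient:
    def __init__(self, hosts):
        # Externalized parameter: list of IPs of machines running the WebApp.
        self.hosts = hosts

    def _least_busy_host(self):
        # Placeholder policy; the real framework tracks which node is least busy.
        return self.hosts[0]

    def call(self, method_name, **params):
        host = self._least_busy_host()
        url = "http://%s:8080/methods/%s" % (host, method_name)
        # Method parameters go as URL params on a POST request;
        # the return value comes back as JSON.
        response = requests.post(url, params=params, timeout=300)
        response.raise_for_status()
        return response.json()

# Usage (hypothetical method name): client.call("processFile", fileId="abc")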
Everything looks good, but I suspect that wrapping things in an HTTP web server, doing the processing there, and sending the outputs as JSON may slow things down, and that collecting the logs in case of failures may be tough.
Questions
I want to know the views of other programmers on this & whether this is a nice approach or not.
Also, do any existing commercial applications follow something similar when trying to distribute the load?

In my opinion, the efficiency of load balancing lies in the tricky process of choosing the right node that will process your request.
A usual approach is to monitor the CPU usage of the nodes that have to take the load and send the load to the one that is least busy. This process should be both accurate and efficient.
In your case, wrapping up the request and transferring JSON data should be the least of your concerns, as they seem to be required activities and fairly light ones too. The focus should be on the load balancing activity.
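As a rough illustration of "send the load to the one least busy": the balancer can poll a small status endpoint on each node and pick the minimum. This sketch assumes each node exposes its CPU usage at /status returning {"cpu": percent}, which is an assumption for illustration, not an existing API.

# Sketch of least-busy selection based on reported CPU usage.
import requests

def pick_least_busy(nodes):
    loads = {}
    for node in nodes:
        try:
            status = requests.get("http://%s:8080/status" % node, timeout=2).json()
            loads[node] = status["cpu"]
        except requests.RequestException:
            continue  # skip nodes that do not answer in time
    return min(loads, key=loads.get) if loads else None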
Answer to comment
If I understood correctly, the HTTP method servers are there to process the requests from the central app. Load balancing will not be done by them. There has to be a central load-balancing mechanism/tool that decides which method server will take which request.
All the method servers (slaves) should be identical. That reduces maintenance effort, as a fix can be made on one node and propagated to the other nodes. This is how it is done in my organization.
If a similar operation is done repeatedly, an application server may also reduce some duplication here.

Related

real time data web page

I am new to programming and am working on pushing real-time data from a PLC to a web page, either by deploying HTML5 on the WAGO or by using a Modbus driver wrapper. I honestly have tried to research this but don't know where to start. It will be a closed private network with little to no influence from the outside web. I am simply looking to display a single piece of live information as a proof of concept. Basically, I'm trying to custom-design a Groov program.
You might want to look into using OPC. Kepware & SoftwareToolbox are just 2 of many vendors that offer tools to help you get your data the way you want it.
There is an existing tool to do what you want, but I am under the impression you have to write one from scratch. The existing tool is http://www.softwaretoolbox.com/cogentdatahub/ if you are interested in looking at it for ideas.
I've been able to interface with a PLC using Python and Modbus TCP, with a Raspberry Pi as the web server. Python is a quick and easy-to-learn language. WebSockets are the HTML5 component best suited to real-time data.
Simple connect code (after you install everything):
from pymodbus.client.sync import ModbusTcpClient as ModbusClient
from time import sleep

# Connect to the Modbus TCP I/O module.
client = ModbusClient('ip_address_of_modbus_IO')

if client.connect():
    # Read one discrete input starting at address 200 and print its state.
    print(client.read_discrete_inputs(200, 1).bits[0])
    # Switch coil 0 on, wait, then switch coil 2 on.
    client.write_coil(0, True)
    sleep(100)
    client.write_coil(2, True)
found here:
http://simplyautomationized.blogspot.com/2013/09/home-automation-project-2-rpi-light.html
You can create a websocket broadcast server using the example here:
http://simplyautomationized.blogspot.com/2015/09/raspberry-pi-create-websocket-api.html
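For orientation, the broadcast idea boils down to something like this minimal sketch. It assumes the third-party websockets package (version 10 or newer), and read_plc_value() is a hypothetical helper you would implement with pymodbus as above; this is not code from the linked post.

# Minimal sketch of a websocket broadcast server.
import asyncio
import websockets

CLIENTS = set()

def read_plc_value():
    return 42  # placeholder for a real pymodbus read

async def handler(websocket):
    CLIENTS.add(websocket)          # register each browser connection
    try:
        await websocket.wait_closed()
    finally:
        CLIENTS.discard(websocket)

async def broadcaster():
    while True:
        websockets.broadcast(CLIENTS, str(read_plc_value()))
        await asyncio.sleep(1)      # push the latest value once a second

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await broadcaster()

asyncio.run(main())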
Fortunately you can not push data to a browser.
The Internet would become an even greater mess if you could.
To solve this, have your webpage contain a timer, written in JavaScript.
Every second or so it does an AJAX request (e.g. using the jQuery implementation) to the server, which then delivers (almost) real-time data.
The webpage then displays that in some DOM element, e.g. an empty DIV.
So it's the browser polling your server.
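On the server side, the endpoint being polled can be very small. Here is a sketch assuming Flask (not part of the original answer); read_plc_value() is again a hypothetical helper returning the current value.

# Sketch of the polled endpoint.
from flask import Flask, jsonify

app = Flask(__name__)

def read_plc_value():
    return 42  # placeholder for the real read

@app.route("/latest")
def latest():
    # The browser's JavaScript timer requests this URL every second or so.
    return jsonify(value=read_plc_value())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)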
@BlueDog
The data is "almost" realtime because sampling once a second gives a delay of at least one second. In the ideal case, as soon as data changes, it would be pushed to the browser. Unfortunately the browser has no way of knowing that anything changed, so the best it can do is frequently "ask" for updates (polling).
How long the delay is depends on your poll frequency. If it's once per second, one has to add the delays for the transmission of the page request and the reply of the server. The transmission time depends on your network (which may be the Internet, with all the uncertainty involved). If the backbones involved have enough capacity, I expect the overall delay to be between 1 and 1.5 seconds. With a dedicated network and even more frequent polling, I expect that 0.5 seconds should be possible. These are, however, estimated averages. If I request a page over the Internet and my provider (again) has a problem, it may be hours before I receive what I want. Also, things like virus scanners and OS updates may spoil your game.
So, practically: with a good broadband connection, a stable browser and the right process priorities, it should be possible to get below 1 second overall delay (incl. the poll interval) 95% of the time. Be prepared to reboot the client every few days. Most browsers leak memory and most OSes do so too.

Mvvmcross - Best way to retrieve updates

I'm not after any code in particular, but I want to know the most efficient way to build a function that constantly checks for updates for things such as messages, e.g. a chat conversation window where I want live updates like Facebook.
Currently I have implemented it by putting a while loop in my core code that, while the view is currently visible, runs a Task every 5 seconds to get new messages. This works, but I don't believe it's the most efficient way to do it, and I need to consider battery life. *Note: I do change visibility when the view goes away, e.g. on iOS I do
public override void ViewDidDisappear(bool animated)
{
    base.ViewDidDisappear(animated);  // keep the base behaviour
    Model.SetVisible(false);
}
Has anyone implemented some sort of polling on a cross platform app?
There are many different possible solutions here - which one you prefer depends a lot on your requirements in terms of latency, reliability, efficiency, etc - and it depends on how much you can change server side.
If your server is fixed as a normal HTTP server, then frequent polling may be your best route forward, although you could choose to lengthen the 5-second interval at times when you think updates aren't likely.
One step up from this is to try long-polling HTTP requests against your server.
Another step beyond that is using socket (TCP, UDP or WebSocket) communications to provide "real-time" messaging.
And in parallel to these things, you could also consider using PUSH notifications both within your app and in the background.
Overall, this is a big topic - I'd recommend reading up about PushSharp from @Redth and about SignalR from Microsoft - @gshackles has some blog posts about using this in Xamarin. Also, services like Azure Mobile Services, Urban Airship, Buddy, Parse, etc. may help.
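To make the long-polling option above concrete, here is a minimal, language-agnostic sketch of the client side, written in Python with the requests library for brevity; the /messages endpoint and the "since" parameter are illustrative assumptions, not an existing API.

# Sketch of a long-polling client loop.
import time
import requests

def long_poll(url, last_id=0):
    while True:
        try:
            # The server holds this request open until new messages exist
            # (or a ~30 s timeout elapses) instead of answering immediately.
            resp = requests.get(url, params={"since": last_id}, timeout=35)
            resp.raise_for_status()
            for message in resp.json():
                last_id = message["id"]
                print("new message:", message["text"])
        except requests.RequestException:
            time.sleep(5)  # network hiccup: back off briefly, then reconnect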

Realtime synchronization of live data over network

How do you sync data between two processes (say client and server) in real time over network?
I have various documents/datasets constructed on the server, which are downloaded and displayed by clients. Once downloaded, the document receives continuous updates in order to remain fresh.
It seems to be a simple and commonly occurring concept, but I cannot find any tools that provide this level of abstraction. I am not even sure what I am looking for. Perhaps there is a similar concept with solid tool support? Perhaps there is a chain of different tools that must be put together? Here's what I have considered so far:
I am required to propagate every change in a single hop (0.5 RTT), which rules out polling (typically >10 RTT) and cache invalidation techniques (1.5 RTT).
Data replication and simple notification broadcasts are not an option, because there is too much data and too many changes. Clients must be able to select specific documents to download and monitor for changes.
I am currently using the message passing pattern, which does the job, but it is hopelessly unproductive. It works at way too low a level of abstraction. It is laborious, error-prone, and it doesn't scale well with increasing application complexity.
HTTP and other RPC-like techniques are good for the initial fetch, but they encourage polling for subsequent synchronization. When performing reverse requests (from data source to data consumer), change notifications are possible, but it's even more complicated than message passing.
Combining RPC (for the initial fetch) with message passing (for updates) turned out to be a nightmare due to the complexity involved in coordinating communication over the two parallel connections as well as due to the impedance mismatch between the two paradigms. I need something unified.
WebSocket & Comet are popular methods to implement change notification, but they need additional libraries to be productive and I am not aware of any libraries suitable for my application.
Message queues merely put an intermediary on the network while maintaining the basic message passing pattern. Custom message filters/routers allow me to get closer to the live document concept, but I feel like I am implementing custom middleware layer on top of the MQ.
I have tons of additional requirements (native observable data structure API on both ends, incremental updates, custom message filters, custom connection routing, cross-platform, robustness & scalability), but before considering those requirements, I need to find some tools that at least attempt to do what I need. I am trying to avoid in-house frameworks for the standard reasons - cost, time to market, long-term maintenance, and keeping developers happy.
My conclusion at the moment is that there is no such live document synchronization framework. An in-house solution is the way to go, but many existing components can be used as part of it.
It is pretty simple to layer live document logic on top of WebSocket or any other message passing platform. The server just sends the document as a separate message when the connection is initiated and then after every change. Automated reconnection and some connection monitoring must be added to handle network failures.
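A minimal sketch of that "send on connect, then push after every change" pattern, assuming the third-party websockets package (version 10+) and JSON payloads; the document store and the apply_change hook are illustrative, not an existing framework.

# Sketch of a live-document server: full document on connect, push on change.
import asyncio
import json
import websockets

DOCUMENT = {"title": "example", "rows": []}   # the server-side "live document"
SUBSCRIBERS = set()

async def handler(websocket):
    # Initial fetch: ship the whole document in the first message.
    await websocket.send(json.dumps(DOCUMENT))
    SUBSCRIBERS.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        SUBSCRIBERS.discard(websocket)

def apply_change(row):
    # Called from server-side code running on the event loop: mutate the
    # document, then immediately push the new state to every subscriber.
    DOCUMENT["rows"].append(row)
    websockets.broadcast(SUBSCRIBERS, json.dumps(DOCUMENT))

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8080):
        await asyncio.Future()  # serve forever; changes arrive via apply_change()

asyncio.run(main())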
Serialization at both ends is a separate problem targeted by many existing libraries. Detecting changes in server-side data structures (needed to initiate push) is yet another separate problem that has its own set of patterns and tools. Incremental updates and many other issues can be solved by intermediaries intercepting the connection.
This approach will work with current technology at the cost of extensive in-house glue code. It can be incrementally substituted with standard components as they become available.
WebSocket already includes resource URIs, routing, and a few other nice features. Useful intermediaries and libraries will likely emerge in the future. HTTP with text/event-stream MIME type is a possible future alternative to WebSocket. The advantage of HTTP is that existing tools can be reused with little modification.
I've completely thrown away the pattern of combining RPC pull with separate push channel despite rich tool support. Pushing everything in 0.5 RTT requires the push channel to use exactly the same technology as the pull channel, i.e. reverse RPC. Reverse RPC is like message passing except it introduces redundant returns, throws away useful connection semantics, and makes it hard to insert content-agnostic intermediaries into the stream.

Message queuing solution for millions of topics

I'm thinking about system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case, but I'm not sure how to handle the fact that I'll have millions of objects - using a separate topic for every one of the objects does not sound good [or is it just fine?].
Can you please suggest an approach I should take, and maybe even some open-source message queuing system that would be reasonable?
A few more details:
there will be thousands of subscribers [meaning not that many of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million objects,
the events themselves don't have to carry any message; just the information that the object was changed is enough,
the vast majority of objects will never be subscribed to,
events occur at a maximum rate of a few hundred per second,
ideally the server should run under Linux and be able to integrate with the rest of the ecosystem via HTTP long-poll [using node.js? continuations under Jetty?].
Thanks in advance for your feedback and sorry for the somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and from my experience, I think it is very reliable and offers a wide range of configurations. Basically, RabbitMQ is an open-source (Mozilla Public License (MPL)) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user-interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have been working on, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ, but I ended up liking Burrow.NET the most, as it is simpler and more obvious to work with (it doesn't do a lot of magic under the hood) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the number of objects that you are going to work with - I have worked with thousands (not millions). However, no matter how many objects I have been playing around with, RabbitMQ has always been really stable and has never been the source of errors in the system.
So to sum up - RabbitMQ is simple to use and setup, supports AMQP, can be managed via HTTP and what I like the most - it's rock solid.
Break up the topics to carry specific events, e.g. an "object updated" topic, an "object deleted" topic, and so on. Clients then only have to subscribe to the finite number of event-based topics they are interested in.
Inject headers into your messages when you publish them, and put intelligence into the clients to use these headers as message selectors. For example, the client knows the list of objects it is interested in - say you identify each object by an "id" - then the id can be a header, and the client uses the id header to determine whether it is interested in the message.
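For illustration, here is a sketch of that header-based selection with RabbitMQ and the pika client; the exchange name and header keys are assumptions, not a prescribed layout.

# Sketch of publishing object events with headers and binding by object id.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="object-events", exchange_type="headers")

# Publisher side: the event carries only the object id and event type.
channel.basic_publish(
    exchange="object-events",
    routing_key="",
    body=b"",
    properties=pika.BasicProperties(headers={"object-id": "42",
                                             "event": "updated"}),
)

# Subscriber side: bind a private queue only for the objects of interest.
queue = channel.queue_declare(queue="", exclusive=True).method.queue
for object_id in ["42", "1337"]:
    channel.queue_bind(queue=queue, exchange="object-events",
                       arguments={"x-match": "any", "object-id": object_id})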
Depending on your needs, you may also want to consider guaranteed delivery, to make sure that the client will receive the message even if it goes offline and comes back later.
The options that I would recommend off the top of my head are ActiveMQ, RabbitMQ and Redis PUB/SUB (I haven't really worked with Redis pub/sub, so please do your due diligence).
Finally here are some performance benchmarks for RabbitMQ and Redis
I just saw that you only have a few hundred messages getting pushed out per second; this is not a big deal for ActiveMQ. I have been using AMQ on a system that processes 240 messages per second, and it works just fine. I do use a thread pool of workers to process the messages asynchronously, though. Look at a framework like Akka if you are in Java land; if not, stick with node.js and the cool ecosystem around it.
If it has to be open source I'd go for ActiveMQ, plus an application server to provide the JMS functionality for topics; it also has Ajax support so you can access them from your client.
So, you would use the JMS infrastructure to publish the topics for the objects, and you can create topics as you need them.
Besides, by using a Java application server you may be able to take advantage of clustering, load balancing and other high-availability features (depending, obviously, on the selected product).
Hope that helps!!!
Since your messages are very small, you might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. The key consideration is the low overhead - basically a 2-byte header for a small message. You probably can't use just any simple or open-source MQTT server, due to your volume. You probably need a heavy-duty dedicated appliance like a MessageSight to handle your volume.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
Though I'm not sure about your environment, here are my two cents. Can you identify each object with a unique ID in your system? If so, you can have a topic per event type: for example, you may want to track object deletion events, object update events, and so on. These topics would be published with the IDs of objects whenever the corresponding event happens to an object. This limits the number of topics you need.
The second part of your problem is that different subscribers want to subscribe to different objects, so not all subscribers are interested in events for all objects. This maps to the message selector (filtering) mechanism provided by the messaging framework. Basically, you need to decide on what basis a subscriber is interested in a particular object, and use that basis as the message filtering criterion. It could be anything: object type, object state, etc. Ultimately your system would consist of one topic per event type, with publishers sending event messages carrying {object-type:object-id} information, and subscribers subscribing to any topic with a filtering criterion.
If the above solution satisfies your needs, you can use any messaging product: ActiveMQ, WMQ, RabbitMQ.

How do web spiders differ from Wget's spider?

The following sentence in Wget's manual caught my eye
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant for the spider option in wget.
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
I would like to see the differences in code, not abstractly. I love code examples.
How do web spiders differ from Wget's spider in code?
A real spider is a lot of work
Writing a spider for the whole WWW is quite a task --- you have to take care of many "little details" such as:
Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth. (asynchronous socket i/o).
You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work)
You need to be polite to the spidered web sites:
Respect the robots.txt files.
Don't fetch a lot of information too quickly: this overloads the servers.
Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
Besides following links, you may want to parse sitemaps to get URLs of pages.
You may want to evaluate which information is important for you and changes frequently to be refreshed more frequently than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational data bases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc. you get down to the specs of wget.
Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.
Let me add a few more technical remarks on three points
Parallel connections / asynchronous socket communication
You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection. And this amount of parallel processes/threads produces too much overhead.
A better solution is asynchronous input/output: process about 1000 parallel connections in one single thread by opening the sockets in non-blocking mode and using epoll or select to process just those connections that have received data. Since kernel 2.4, Linux has had excellent support for scalability (I also recommend that you study memory-mapped files), continuously improved in later versions.
Note: Using asynchronous I/O helps much more than using a "fast language": it's better to write an epoll-driven process for 1000 connections in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100 Mb connection with processes written in Perl.
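A present-day sketch of this asynchronous fetching, assuming the third-party aiohttp package; the concurrency cap, timeout, and function names are illustrative.

# Sketch: keep many HTTP requests in flight from a single thread.
import asyncio
import aiohttp

async def fetch(session, url, semaphore):
    async with semaphore:                      # cap concurrent connections
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                return url, resp.status, await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None, ""

async def crawl(urls, concurrency=1000):
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, u, semaphore) for u in urls]
        return await asyncio.gather(*tasks)

# results = asyncio.run(crawl(list_of_urls))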
From the original answer:
The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
Edit five years later:
Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous HTTP implementation of node.js --- it saves you all the work mentioned in the original paragraph above. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" this processing to a process written in another programming language. But things are changing quickly, and by the time you read this comment, this problem may already be a thing of the past...
Partitioning the work over several servers
One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest assigning certain "ranges of domain names" to each server: keep a central database of domain names with a reference to a spider computer.
Extract URLs from received web pages in batches: sort them according to their domain names; remove duplicates and send them to the responsible spider computer. On that computer, keep an index of URLs that already are fetched and fetch the remaining URLs.
If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.
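A tiny sketch of that domain-based partitioning; the server list and the use of a stable hash to assign domains are illustrative assumptions.

# Sketch: pick the spider machine responsible for a URL by hashing its domain.
import hashlib
from urllib.parse import urlparse

SPIDER_SERVERS = ["spider-01", "spider-02", "spider-03"]

def responsible_server(url):
    domain = urlparse(url).hostname or ""
    digest = hashlib.sha1(domain.encode("utf-8")).hexdigest()
    return SPIDER_SERVERS[int(digest, 16) % len(SPIDER_SERVERS)]

# Extracted URLs are batched, deduplicated, and forwarded to
# responsible_server(url) for fetching.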
Read the standards
I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forget parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.
It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.
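As a small, concrete example of one of these standards: Python's standard library already parses robots.txt, so a polite fetcher can check permissions and crawl delays before each request. The user-agent string and fallback delay below are assumptions.

# Sketch: respect robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MySpider/1.0", "https://example.com/some/page"):
    delay = rp.crawl_delay("MySpider/1.0") or 1.0  # be polite between fetches
    # ...schedule the fetch no sooner than `delay` seconds after the
    # previous request to this host...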
I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to only use a single thread of execution (at least by what you have shown).
"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.
Again this is all just a guess based on what I know of spiders in general, and what you posted here.
Unfortunately, many of the more well-known 'real' web spiders are closed-source, and indeed closed-binary. However there are a number of basic techniques wget is missing:
Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
Prioritization; some pages are more important to spider than others
Rate limiting; you'll be banned quickly if you keep pulling down pages as quickly as you can
Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months (a sketch of such a prioritized frontier follows below).
There are also various other inputs that can be used such as sitemaps and the like. Point is, wget isn't designed to spider the entire web, and it's not really a thing that can be captured in a small code sample, as it's a problem of the whole overall technique being used, rather than any single small subroutine being wrong for the task.
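A tiny sketch of the prioritization and recheck points above: keep the frontier in a priority queue keyed by the next time each URL should be (re)fetched. The scoring and recheck intervals are illustrative assumptions, not wget's or any real crawler's behaviour.

# Sketch of a prioritized, recheckable crawl frontier.
import heapq
import time

frontier = []  # heap of (next_fetch_time, url)

def schedule(url, importance):
    # Important pages get rechecked hourly, unimportant ones monthly.
    interval = 3600 if importance > 0.5 else 30 * 24 * 3600
    heapq.heappush(frontier, (time.time() + interval, url))

def next_url():
    # Pop the URL whose (re)fetch time comes first; the caller sleeps if needed.
    return heapq.heappop(frontier) if frontier else None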
I'm not going to go into the details of how to spider the internet; I think that wget comment refers to spidering a single website, which is still a serious challenge.
As a spider you need to figure out when to stop, and not go into recursive crawls just because a URL changed, like date=1/1/1900 to 1/2/1900 and so on.
An even bigger challenge is sorting out URL rewriting (I have no clue whatsoever how Google or anyone else handles this). It's a pretty big challenge to crawl enough but not too much. And how can one automatically recognise URL rewriting with some random parameters and random changes in the content? (A partial mitigation is sketched below.)
You need to parse Flash / JavaScript at least to some level.
You need to consider some crazy issues like the base tag. Even parsing the HTML is not easy, considering that most websites are not XHTML and browsers are very flexible about the syntax.
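Here is the partial mitigation mentioned above: fingerprint page content and skip URLs whose (normalized) content has already been seen. Real crawlers use fancier near-duplicate detection; this exact-hash version is only illustrative.

# Sketch of content fingerprinting to detect URL-rewrite duplicates.
import hashlib

seen_fingerprints = set()

def is_duplicate(html):
    # Normalize whitespace so trivial rewrites don't change the fingerprint.
    normalized = " ".join(html.split()).encode("utf-8")
    fingerprint = hashlib.sha1(normalized).hexdigest()
    if fingerprint in seen_fingerprints:
        return True          # same content under a rewritten URL: stop here
    seen_fingerprints.add(fingerprint)
    return False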
I don't know how many of these are implemented or considered in wget, but you might want to take a look at HTTrack to understand the challenges of this task.
I'd love to give you some code examples, but these are big tasks and a decent spider will be about 5000 LOC without third-party libraries.
+ Some of them were already explained by @yaakov-belch, so I'm not going to type them again.