Network stream analysis using Wireshark and Apache Storm - JSON

I want to make a simple application that analyzes network traffic captured with Wireshark and counts how often I access Google. Through this project I want to introduce myself to network traffic processing.
In order to get traffic, I use Wireshark.
In order to analyze the network traffic streams I want to use Apache Storm.
Well, I am new to stream processing, Apache Storm, and Wireshark (this is my very first use), so I want to ask whether it is possible to realize this simple project.
I think Apache Storm can read the data easily from JSON files.
So the things that puzzle me are:
Can I export Wireshark data to JSON files?
Can I somehow read, in real time, the data generated by Wireshark,
and process this data using Apache Storm?
Or, if you can suggest a more appropriate tool with a concrete tutorial, I would be grateful.

Sounds like an interesting project to start with.
To read data from Wireshark in real time, I suppose you can make use of tshark (Wireshark's command-line counterpart, which can export captured packets as JSON) and then create a client that pushes that data to some port on your system.
Create a Storm Spout that reads from this port.
Create a Storm Bolt to parse these messages (if you want to convert them to JSON).
Create a Storm Bolt to filter the traffic (the packets to and from Google, perhaps based on IP address or hostname?).
Optionally, store the results in some datastore. A rough sketch of such a topology follows this list.
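To make those steps a bit more concrete, here is a minimal, untested sketch against Storm's Java API (Storm 2.x). The port number, class names, and the naive string filter are placeholder assumptions; a real bolt would parse the JSON and inspect DNS or IP fields:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.*;
import org.apache.storm.topology.base.*;
import org.apache.storm.tuple.*;

public class WiresharkTopology {

    // Spout that reads newline-delimited JSON packet records from a local TCP port.
    public static class PacketSpout extends BaseRichSpout {
        private transient BufferedReader reader;
        private transient SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
            try {
                // Assumes something (e.g. tshark piped through netcat) serves JSON lines on port 9999.
                Socket socket = new Socket("localhost", 9999);
                reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            } catch (Exception e) {
                throw new RuntimeException("Could not connect to packet source", e);
            }
        }

        @Override
        public void nextTuple() {
            try {
                String line = reader.readLine();
                if (line != null) {
                    collector.emit(new Values(line));
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("packetJson"));
        }
    }

    // Bolt that counts packets whose JSON mentions Google (naive filter, for illustration only).
    public static class GoogleCounterBolt extends BaseBasicBolt {
        private long count = 0;

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (tuple.getStringByField("packetJson").contains("google")) {
                count++;
                System.out.println("Google packets so far: " + count);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, nothing emitted downstream
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("packets", new PacketSpout());
        builder.setBolt("google-counter", new GoogleCounterBolt()).shuffleGrouping("packets");

        // Local mode, good enough for experimenting on a laptop.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("wireshark-demo", new Config(), builder.createTopology());
            Thread.sleep(60_000);
        }
    }
}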
Alternatives: You can also try Apache Flink or Apache Spark, which also provide stream-processing capabilities.

Related

How to generate notifications from MySQL table updates?

I have a full-stack app that uses React, Node.js, Express, and MySQL. I want the React app to respond to database updates similar to Firebase: when data changes, I want a real-time notification sent to my app.
I want to use stock MySQL (no plugins), so that I can use AWS RDS or whatever.
I will use socket.io to push the real-time notifications to the web app.
To avoid off-target responses, I'll summarize various approaches that are not what I am looking for:
The server could poll, or each client could poll. (Not real-time, but included for completeness. When I search, polling is the only solution I find.)
Write a wrapper that handles all MySQL updates, handles subscriptions, and sends the notifications. This is a complicated component that adds complexity. Firebase is popular because it both increases performance and reduces complexity. I like Firebase a lot but want to do the same thing with MySQL.
Use Firebase to handle the real-time notifications. The MySQL wrapper could use Firebase to handle the subscriptions and notifications, but there is still the problem of triggering the notifications in the first place. Also, I don't want to use Firebase. (For example, my application needs to run in an air-gapped environment.)
The question: Using a stock MySQL database, when a table changes, can a notification server discover the change in real-time (no polling), so that it can send notifications?
The approach that works is to listen to the binary logs. This way, any change to the database will be communicated in real-time. The consumer of the binary logs can then publish this information in a number of ways. A common choice is to feed a stream of events to Apache Kafka.
Debezium, Maxwell, and NiFi work this way.
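Those tools differ in packaging, but the underlying trick is the same: act like a replication client and read the binlog. As a rough, untested sketch of what that looks like from plain Java, here is the open-source mysql-binlog-connector-java library tailing row events (host, credentials, and what you do with each event are placeholders); the server must run with binlog_format=ROW and the user needs replication privileges:

import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.UpdateRowsEventData;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;

public class BinlogListener {
    public static void main(String[] args) throws Exception {
        BinaryLogClient client = new BinaryLogClient("localhost", 3306, "repl_user", "repl_password");

        client.registerEventListener(event -> {
            Object data = event.getData();
            if (data instanceof WriteRowsEventData) {
                // A row was inserted: publish a notification (socket.io bridge, Kafka topic, ...).
                System.out.println("INSERT: " + data);
            } else if (data instanceof UpdateRowsEventData) {
                System.out.println("UPDATE: " + data);
            }
        });

        client.connect();  // blocks and receives events as they are written to the binlog
    }
}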

Designing an agnostic configuration service

Just for fun, I'm designing a few web applications using a microservices architecture. I'm trying to determine the best way to do configuration management, and I'm worried that my approach for configuration may have some enormous pitfalls and/or something better exists.
To frame the problem, let's say I have an authentication service written in C++, an identity service written in Rust, an analytics service written in Haskell, some middle tier written in Scala, and a frontend written in JavaScript. There would also be the corresponding identity DB, auth DB, analytics DB (maybe a Redis cache for sessions), etc. I'm deploying all of these apps using Docker Swarm.
Whenever one of these apps is deployed, it necessarily has to discover all the other applications. Since I use Docker Swarm, discovery isn't an issue as long as all the nodes share the requisite overlay network.
However, each application still needs the upstream services' host_addr, maybe a port, the credentials for some DB or sealed service, etc.
I know Docker has secrets, which enable apps to read configuration from the container, but I would then need to write some configuration parser in each language for each service. This seems messy.
What I would rather do is have a configuration service, which maintains knowledge about how to configure all other services. So, each application would start with some RPC call designed to get the configuration for the application at runtime. Something like
int main() {
    AppConfig cfg = configClient.getConfiguration("APP_NAME");
    // do application things... and pass around cfg
    return 0;
}
The AppConfig would be defined in an IDL, so the class would be instantly available and language agnostic.
This seems like a good solution, but maybe I'm really missing the point here. Even at scale, tens of thousands of nodes can be served easily by a few configuration services, so I don't foresee any scaling issues. Again, it's just a hobby project, but I like thinking about the "what-if" scenarios :)
How are configuration schemes handled in microservices architecture? Does this seem like a reasonable approach? What do the major players like Facebook, Google, LinkedIn, AWS, etc... do?
Instead of building a custom configuration management solution, I would use one of these existing ones:
Spring Cloud Config
Spring Cloud Config is a config server written in Java offering an HTTP API to retrieve the configuration parameters of applications. Obviously, it ships with a Java client and a nice Spring integration, but as the server is just an HTTP API, you may use it with any language you like. The config server also features symmetric / asymmetric encryption of configuration values.
Configuration source: the externalized configuration is stored in a Git repository, which must be made accessible to the Spring Cloud Config server. The properties in that repository are then accessible through the HTTP API, so you can even consider implementing an update process for configuration properties.
Server location: Ideally, you make your config server accessible through a domain (e.g. config.myapp.io), so you can implement load-balancing and fail-over scenarios as needed. Also, all you need to provide to all your services then is just that exact location (and some authentication / decryption info).
Getting started: You may have a look at this getting started guide for centralized configuration on the Spring docs or read through this Quick Intro to Spring Cloud Config.
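Because the server is plain HTTP, even a non-Spring service can pull its configuration with a one-off request. A minimal sketch in Java 11+ (the config server URL, application name, and profile are made-up placeholders; the /{application}/{profile} endpoint layout is the standard Spring Cloud Config API):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConfigFetch {
    public static void main(String[] args) throws Exception {
        // Spring Cloud Config exposes properties at /{application}/{profile};
        // "auth-service" and "production" are placeholder names.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://config.myapp.io/auth-service/production"))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The body is JSON describing the property sources; parse it with any JSON library.
        System.out.println(response.body());
    }
}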
Netflix Archaius
Netflix Archaius is part of the Netflix OSS stack and "is a Java library that provides APIs to access and utilize properties that can change dynamically at runtime".
While limited to Java (which does not quite match the polyglot context you describe), the library is capable of using a database as the source for the configuration properties.
confd
confd keeps local configuration files up-to-date using data stored in external sources (etcd, consul, dynamodb, redis, vault, ...). After configuration changes, confd restarts the application so that it can pick up the updated configuration file.
In the context of your question, this might be worthwhile to try as confd makes no assumption about the application and requires no special client code. Most languages and frameworks support file-based configuration so confd should be fairly easy to add on top of existing microservices that currently use env variables and did not anticipate decentralized configuration management.
I don't have a good solution for you, but I can point out some issues for you to consider.
First, your applications will presumably need some bootstrap configuration that enables them to locate and connect to the configuration service. For example, you mentioned defining the configuration service API with IDL for a middleware system that supports remote procedure calls. I assume you mean something like CORBA IDL. This means your bootstrap configuration will not be just the endpoint to connect to (specified perhaps as a stringified IOR or a path/in/naming/service), but also a configuration file for the CORBA product you are using. You can't download that CORBA product's configuration file from the configuration service, because that would be a chicken-and-egg situation. So, instead, you end up with having to manually maintain a separate copy of the CORBA product's configuration file for each application instance.
Second, your pseudo-code example suggests that you will use a single RPC invocation to retrieve all the configuration for an application in a single go. This coarse level of granularity is good. If, instead, an application used a separate RPC call to retrieve each name=value pair, then you could suffer major scalability problems. To illustrate, let's assume an application has 100 name=value pairs in its configuration, so it needs to make 100 RPC calls to retrieve its configuration data. I can foresee the following scalability problems:
Each RPC might take, say, 1 millisecond round-trip time if the application and the configuration server are on the same local area network, so your application's start-up time is 1 millisecond for each of 100 RPC calls = 100 milliseconds = 0.1 second. That might seem acceptable. But if you now deploy another application instance on another continent with, say, a 50 millisecond round-trip latency, then the start-up time for that new application instance will be 100 RPC calls at 50 milliseconds latency per call = 5 seconds. Ouch!
The need to make only 100 RPC calls to retrieve configuration data assumes that the application will retrieve each name=value pair once and cache that information in, say, an instance variable of an object, and then later on access the name=value pair via that local cache. However, sooner or later somebody will call x = cfg.lookup("variable-name") from inside a for-loop, and this means the application will be making an RPC every time around the loop (a tiny sketch of this anti-pattern follows these points). Obviously, this will slow down that application instance, but if you end up with dozens or hundreds of application instances doing that, then your configuration service will be swamped with hundreds or thousands of requests per second, and it will become a centralised performance bottleneck.
You might start off writing long-lived applications that do 100 RPCs at start-up to retrieve configuration data, and then run for hours or days before terminating. Let's assume those applications are CORBA servers that other applications can communicate with via RPC. Sooner or later you might decide to write some command-line utilities to do things like: "ping" an application instance to see if it is running; "query" an application instance to get some status details; ask an application instance to gracefully terminate; and so on. Each of those command-line utilities is short-lived; when they start up, they use RPCs to obtain their configuration data, then do the "real" work by making a single RPC to a server process to ping/query/kill it, and then they terminate. Now somebody will write a UNIX shell script that calls those ping and query commands once per second for each of your dozens or hundreds of application instances. This seemingly innocuous shell script will be responsible for creating dozens or hundreds of short-lived processes per second, and each of those short-lived processes will make numerous RPC calls to the centralised configuration server to retrieve name=value pairs one at a time. That sort of shell script can put a massive load on your centralised configuration server.
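To illustrate the for-loop problem above, here is a tiny hypothetical sketch; ConfigClient and its lookup() method are made up and stand for whatever remote call your configuration service exposes:

public class LookupPatterns {
    // Anti-pattern: one RPC per loop iteration -- N items means N round-trips.
    public static void processOrders(ConfigClient cfg, java.util.List<String> orders) {
        for (String order : orders) {
            int batchSize = Integer.parseInt(cfg.lookup("batch-size"));
            submit(order, batchSize);
        }
    }

    // Better: one RPC up front, then reuse the locally cached value inside the loop.
    public static void processOrdersCached(ConfigClient cfg, java.util.List<String> orders) {
        int batchSize = Integer.parseInt(cfg.lookup("batch-size"));
        for (String order : orders) {
            submit(order, batchSize);
        }
    }

    interface ConfigClient { String lookup(String name); }
    private static void submit(String order, int batchSize) { /* do the real work */ }
}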
I am not trying to discourage you from designing a centralised configuration server. The above points are just warning about scalability issues you need to consider. Your plan for an application to retrieve all its configuration data via one coarse-granularity RPC call will certainly help you to avoid the kinds of scalability problems I mentioned above.
To provide some food for thought, you might want to consider a different approach. You could store each application's configuration files on a web sever. A shell start script "wrapper" for an application can do the following:
Use wget or curl to download "template" configuration files from the web server and store the files on the local file system. A "template" configuration file is a normal configuration file but with some placeholders for values. A placeholder might look like ${host_name}.
Also use wget or curl to download a file containing search-and-replace pairs, such as ${host_name}=host42.pizza.com.
Perform a global search-and-replace of those search-and-replace terms on all the downloaded template configuration files to produce the configuration files that are ready to use. You might use UNIX shell tools like sed or a scripting language to perform this global search-and-replace; alternatively, you could use a templating engine like Apache Velocity. (A small sketch of this substitution step follows the list.)
Execute the actual application, using a command-line argument to specify the path/to/downloaded/config/files.
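The substitution step itself is only a few lines in any language. Here is a hedged Java 11+ sketch; the file names are made up, the ${...} placeholder syntax follows the example above, and a plain properties file (host_name=host42.pizza.com) stands in for the downloaded search-and-replace pairs:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class TemplateExpander {
    public static void main(String[] args) throws Exception {
        // replacements.properties holds lines like: host_name=host42.pizza.com
        Properties replacements = new Properties();
        try (var in = Files.newInputStream(Path.of("replacements.properties"))) {
            replacements.load(in);
        }

        // Read the downloaded template, substitute every ${key} placeholder, write the final config.
        String config = Files.readString(Path.of("app.conf.template"));
        for (String key : replacements.stringPropertyNames()) {
            config = config.replace("${" + key + "}", replacements.getProperty(key));
        }
        Files.writeString(Path.of("app.conf"), config);
    }
}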

Meteor performance comparison of publishing static data vs. getting data via HTTP Get request

I am building an app that receives a bunch of static data that is read only. The user does not change the data, or send any data to the server. The app just gets the data and presents it to the user in various views.
For example, a parts list with part numbers and prices. This data is currently stored in MongoDB.
I have a few options for getting the data to the client. I could just use Meteor's publication system and have the client subscribe to the data it needs.
Or I could map all the data the client needs into one JSON file, save the JSON file to Amazon S3, and have the client make a simple GET request to grab the data.
If we wanted this app to scale to many, many users, would avoiding Meteor publications be the better choice? Or would either method be similar in terms of performance? Using Meteor's publication system would be the easiest, but I am worried that going down this route would lead to performance issues if a lot of clients request the data. If the performance of publishing and GET requests is about the same, I would just stick with the publication, as it's the easiest.
In this case Meteor will provide better performance. If your data flow is mostly server-to-client, then clients do not have to worry about polling the server, and the server does not have to worry about handling the requests.
Also, Meteor requires very few resources to send data to the client, because the connection is persistent. Take an app like CodeFights, which is built on Meteor: it constantly has thousands of connections to and from it, and its performance is great.
As a side note, if you are ready to serve your static data as a JSON file from a separate server (AWS S3), then it means you do not expect that data to be that big, so that it can be handled in a single file and loaded entirely into the client's memory.
In that case, you might even want to reconsider the need to perform any separate request (whether HTTP or Meteor pub/sub).
For instance, you could simply embed the data in your app, or serve it through SSR / the Fast Render package.
Then if you are really concerned about your scalability, you might even reconsider the need to use Meteor, since you do not seem to need any client-server interactivity (no real need for Pub/Sub, no reactivity…). After your prototype is ready, you could rework it as a separate and static SPA, so that you do not even need to serve it through Node / Meteor.

Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipeline to warm up caches and update search engines.
Currently we do that with a rudimentary polling architecture, basically using SQL, but the ideal would be a push architecture reading the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?
I've worked with some customers doing that already, in Oracle. It also seems that LinkedIn makes heavy use of this technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish that in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that is already streaming the binlog from MySQL to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for that - your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
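To make the binlog-to-Kinesis hop concrete, here is a rough sketch using the AWS SDK for Java (v1) that a binlog consumer (Databus, plainview, or your own reader) could call once per change event; the region, stream name, and method shape are placeholder assumptions:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

public class BinlogToKinesis {

    private final AmazonKinesis kinesis = AmazonKinesisClientBuilder.standard()
            .withRegion("us-east-1")   // placeholder region
            .build();

    // Called once per binlog change event by whatever binlog reader you use.
    public void forward(String tableName, String eventJson) {
        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("mysql-change-events")   // placeholder stream name
                .withPartitionKey(tableName)             // keeps a table's events ordered within a shard
                .withData(ByteBuffer.wrap(eventJson.getBytes(StandardCharsets.UTF_8)));
        kinesis.putRecord(request);
    }
}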
Hope this helps.
The AWS DMS service offers data migration from a SQL database to Kinesis.
Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/

What is an appropriate tool for processing JSON Activity Streams for both BI and Hadoop?

I have a number of systems, most of which are capable of generating data using JSON Activity Streams[1] (or can be coerced into doing so), and I want to use this data for analytics.
I want to use both a traditional SQL datamart for OLAP use, and also to dump the raw JSON data into Hadoop for running batch mapreduce jobs.
I've been reading up on Kafka, Flume, Scribe, S4, Storm and a whole load of other tools but I'm still not sure which is best suited to the task at hand. These seem to be either focussed on logfile data, or real-time processing of the activity stream, whereas I guess I'm more interested in doing ETL on activity streams.
The type of setup I'm thinking of is where I provide a configuration for all the streams I'm interested in (URLs, params, credentials), and the tool periodically polls them, dumps the output in HDFS, and also has a hook for me to process and transform the JSON for insertion into the datamart.
Do any of the existing open-source tools fit this case particularly well?
(In terms of scale I expect a max of 30,000 users interacting with ~10 systems - not simultaneously - so not really "Big Data", but not trivial either.)
Thanks!
[1] http://activitystrea.ms/
You should check out streamsets.com
It's an open source tool (and free to use) built exactly for these kinds of use cases.
You can use the HTTP Client source and HDFS destination to achieve your main goals. If you decide you need to use Kafka or Flume as well, support for both is also built in.
You can also do transformations in a variety of ways, including writing Python or JavaScript for more complex transformations (or your own stage in Java if you choose).
You can also check out Logstash (elastic.co) and NiFi to see if one of those works better for you.
*Full disclosure, I'm an engineer at StreamSets.