I'm building a system which requires an Arduino board to send data to the server.
The requirements/constraints of the app are:
The server must receive data and store them in a MySQL database.
A web application is used to graph and plot historical data.
Data consumption is critical.
Web application must also be able to plot data in real time.
So far, the system is working fine, however, optimization is required.
The current adopted steps are:
Accumulate data on the Arduino board for 10 seconds.
Send the data to the server using a POST whose body contains an XML string representing the 10 records.
The server parses the received XML and stores the values in the database.
This approach is good for historical data, but not for realtime monitoring.
My question is: Is there a difference between:
Accumulating the data and sending them as a single XML payload, and
Sending the data each second.
In terms of data consumption, is sending a POST request each second too much?
Thanks
EDIT: Can anybody provide a mathematical formula benchmarking the two approaches in terms of data consumption?
For your data consumption question, you need to figure out how much each POST costs you given your cell phone plan. I don't know of a ready-made formula, but you could easily test and work it out.
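As a starting point, a rough back-of-envelope model is: bytes per interval ≈ number of POSTs × (header + connection overhead) + payload bytes. Here is a sketch with assumed numbers (measure your own requests with Wireshark or tcpdump to refine them):

    # Rough back-of-envelope model, not a benchmark; all byte counts are assumptions.
    HEADER_BYTES = 300     # assumed HTTP request line + headers per POST
    HANDSHAKE_BYTES = 400  # assumed TCP (and possibly TLS) setup cost if the connection isn't reused
    RECORD_BYTES = 60      # assumed size of one record serialized as XML

    def bytes_per_10s(posts, records_per_post):
        """Total bytes sent in a 10-second window."""
        overhead = posts * (HEADER_BYTES + HANDSHAKE_BYTES)
        payload = posts * records_per_post * RECORD_BYTES
        return overhead + payload

    batched = bytes_per_10s(posts=1, records_per_post=10)     # one POST with 10 records
    per_second = bytes_per_10s(posts=10, records_per_post=1)  # ten POSTs with 1 record each
    print(batched, per_second)  # 1300 vs 7600 bytes with these assumed numbers: roughly 6x more

The fixed per-request overhead dominates small payloads, which is why batching wins on data consumption.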
However, over 3G (or even WiFi for that matter), power consumption will be an issue, especially if your circuit runs on a battery; each POST bursts around 1.5 A, which is too much to spend on sending data every second.
But again, why would you send data every second?
Real time doesn't mean sending data every second; it means being at least as fast as the system you are monitoring.
For example, if you are sending temperatures, temperature doesn't change from 0° to 100° in one second. So all those POSTs will be a waste of power and data.
You need to know how fast the parameters change in your system and adapt your POST accordingly.
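To illustrate "adapt your POST rate to how fast the parameter changes", here is a sketch of send-on-change-or-timeout logic (Python for readability; on the Arduino the same idea is a few lines of C, and the threshold/timeout values and read_temperature() are assumptions):

    import time

    THRESHOLD = 0.5    # assumed: only report changes of at least 0.5 degrees
    MAX_SILENCE = 60   # assumed: but never stay silent for more than 60 seconds

    last_value = None
    last_sent = 0.0

    def should_send(value, now):
        """Send only when the reading moved enough, or when we have been quiet too long."""
        global last_value, last_sent
        changed = last_value is None or abs(value - last_value) >= THRESHOLD
        stale = (now - last_sent) >= MAX_SILENCE
        if changed or stale:
            last_value, last_sent = value, now
            return True
        return False

    # In the main loop: only fire the POST when should_send(read_temperature(), time.time())
    # returns True, instead of POSTing unconditionally every second.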
I have a web service which handles a significant volume of traffic, in the range of millions of requests per minute. The service is hosted on AWS EC2 behind an ELB and uses HTTP APIs. As a result, a good chunk of the AWS bill consists of Data Transfer fees. The Data Transfer Out component is the larger one, since about 50% of the responses from the web service are somewhat large and encoded as JSON, in addition to SSL negotiation overhead.
Now, gRPC payloads are smaller than the same data represented as JSON, due to binary serialization. So is it possible to save on data transfer costs by switching from HTTP APIs to gRPC?
I couldn't find any benchmark/article anywhere correlating AWS Data Transfer costs with HTTP APIs/gRPC services. Even 5-10% savings would be beneficial.
PS: The clients accessing the web service are also mine, so it is possible for me to make changes on both the server side and the client side.
Maybe, but probably not. It depends on your actual data.
If you're using HTTP for communications, then there are two components of overall message size: HTTP headers and response body. If the headers represent a significant portion of your overall message size, then it makes more sense to get rid of them by using an alternative layer-7 protocol, such as WebSockets.
If the headers aren't significant, then it depends on what your actual message content is. That's because Protocol Buffers, which is used by gRPC, performs essentially two optimizations:
Replacement of field names with a one- or two-byte value. This can be a big savings, as long as your JSON response doesn't frequently repeat the same field names (i.e., repeated objects). If it does, then using GZip encoding will reduce the average cost of a field name down to somewhere around 5 bytes (my observation with large files, YMMV).
Storage of numeric values in fewer than their normal number of bits. If your message content consists of arrays of numbers, this will be a huge win. If it's mostly text, you won't see much benefit, because the same byte sequence will have to be sent in either case.
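To get a feel for where the savings come from before committing to gRPC, a standard-library-only sketch can compare raw JSON, gzipped JSON, and a crude fixed-width binary packing (the payload shape is made up; real Protobuf results depend on your schema):

    import gzip
    import json
    import struct

    # Made-up payload: many repeated objects with numeric fields.
    records = [{"timestamp": 1700000000 + i, "value": i % 100} for i in range(1000)]

    as_json = json.dumps(records).encode()
    as_gzip = gzip.compress(as_json)  # roughly what Content-Encoding: gzip buys you
    as_packed = b"".join(struct.pack("<Ii", r["timestamp"], r["value"]) for r in records)
    # Crude stand-in for binary serialization: field names gone, integers fixed-width
    # (Protobuf varints are usually smaller still for small values).

    print(len(as_json), len(as_gzip), len(as_packed))

Running this on a sample of your real responses will tell you more than any generic benchmark.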
Personally, I think switching to WebSockets would be the best first step. That assumes, of course, that these messages are coming from a relatively small number of clients. If every message is from a different client, you won't save anything.
I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your MySQL db to AA and run queries against it.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it in your local db. But in general, I have listed them in order of difficulty: how hard it is to set something up and how hard it is to process the file(s) once received to get them into your database.
The first option is to schedule data, within the AA interface, to be FTP'd to you on a regular basis. This can be a scheduled report from the report interface or from Data Warehouse, and it can be delivered in a variety of formats, but most commonly as a CSV file. This will export data to you that has been processed by AA, meaning aggregated metrics, etc. Overall, this is pretty easy to set up, and the exported CSV files are easy to parse. But there are a number of caveats/limitations about it, and whether it works for you largely depends on what specifically you're aiming to do.
The second option is to make use of their API endpoint to make requests and receive responses in JSON format. You can also receive them in XML format, but I recommend not doing that. You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on demand, building interfaces (if relevant to you), etc. But it also comes with some caveats/limitations, the same as the first option, since the data is already processed/aggregated.
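The general shape of that on-demand pull is a JSON request/response cycle like the sketch below; the endpoint, auth header, and request body are placeholders, not the real Adobe Analytics API details, so consult their API docs for the actual values:

    import requests

    # Placeholder endpoint and credentials -- not the real Adobe Analytics API.
    API_URL = "https://analytics.example.invalid/reports"
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}
    BODY = {"rsid": "myreportsuite", "metrics": ["visits"], "granularity": "day"}

    resp = requests.post(API_URL, headers=HEADERS, json=BODY, timeout=30)
    resp.raise_for_status()
    rows = resp.json()  # aggregated, already-processed data comes back as JSON
    # From here you would iterate over `rows` and INSERT them into your MySQL tables.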
The third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe's collection servers without Adobe doing anything to it, but it's not 100% like a server request log. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have other files that are lookup tables for both the columns and the values within them. It's a bit of a headache piecing it all together, but it's doable. The real issue is file size: these are raw hit data files, and even a site with moderate traffic will generate files many gigabytes large, daily and even hourly. So bandwidth, disk space, and your server's processing power are things to consider if you attempt to go this route.
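As a rough sketch of that last piecing-together step, assuming a tab-delimited hit data file and that you have already worked out which columns you care about from the accompanying header/lookup files (the file name, column list, table, and credentials below are placeholders):

    import csv
    import mysql.connector  # pip install mysql-connector-python

    # Placeholder column list -- in a real Data Feed the column order comes from the
    # accompanying header/lookup files, not from hard-coded names like these.
    COLUMNS = ["hit_time_gmt", "visid_high", "visid_low", "pagename"]

    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="pass", database="analytics")
    cur = conn.cursor()
    insert_sql = "INSERT INTO hits ({}) VALUES ({})".format(
        ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS)))

    with open("hit_data.tsv", newline="") as fh:  # assumed file name
        reader = csv.reader(fh, delimiter="\t")
        batch = []
        for row in reader:
            batch.append(row[:len(COLUMNS)])      # keep only the columns mapped above
            if len(batch) >= 1000:                # insert in chunks; these files are huge
                cur.executemany(insert_sql, batch)
                conn.commit()
                batch = []
        if batch:
            cur.executemany(insert_sql, batch)
            conn.commit()

    cur.close()
    conn.close()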
I am facing the problem of parsing large JSON results from a REST endpoint (Elasticsearch).
Leaving aside the fact that the design of the system has its flaws, I am wondering whether there is another way to do the parsing.
The REST response contains 10k objects in a JSON array. I am using the native JSON mapper of Elasticsearch and Jsoniter. Both lack performance and slow down the application. The request duration rises to 10-15 seconds.
I will push for a change of the interface, but the big result list will remain for the next 6 months.
Could anyone give me advice on what to do to speed up performance with Elasticsearch?
Profile everything.
Is Elasticsearch slow in generating the response?
If you perform the query with curl, redirect the output to a file, and time it, what fraction of your app's total time does that account for?
Are you running it locally? You might be dropping packets/being throttled by low bandwidth over the network.
Is the performance hit purely in decoding the response?
How long does it take to decode the same blob of JSON using Jsoniter once loaded into memory from a static file?
Have you considered chunking your query?
What about spinning it off as a separate process and immediately returning to the event loop?
There are lots of options and not enough detail in your question to be able to give solid advice.
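For instance, here is a quick way to separate transfer time from decode time (shown in Python with requests purely to illustrate the measurement; the URL and query body are made up, and the same split applies to a Java/Jsoniter stack):

    import json
    import time
    import requests

    url = "http://localhost:9200/myindex/_search"   # made-up index
    body = {"size": 10000, "query": {"match_all": {}}}

    t0 = time.perf_counter()
    resp = requests.post(url, json=body, timeout=60)
    raw = resp.content                              # bytes as they came off the wire
    t1 = time.perf_counter()
    parsed = json.loads(raw)                        # pure decode cost
    t2 = time.perf_counter()

    print(f"transfer: {t1 - t0:.2f}s, decode: {t2 - t1:.2f}s, "
          f"payload: {len(raw) / 1e6:.1f} MB")

Once you know which side dominates, you can decide between chunking the query, streaming the decode, or moving the work off the request path.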
I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries. So long in fact that our daemon intervenes to kill the queries often.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand their query language that way).
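For example, writing a data point and asking Elasticsearch to bucket values per minute is just two HTTP calls (the index and field names below are made up, and the date_histogram syntax shown is for recent 7.x+ versions):

    import requests

    ES = "http://localhost:9200"

    # Index one event (made-up index and fields); refresh so the search below sees it.
    requests.post(f"{ES}/events/_doc?refresh=true", json={
        "@timestamp": "2024-01-01T12:00:00Z",
        "value": 42,
    })

    # Per-minute average -- the kind of query a time series chart would issue.
    query = {
        "size": 0,
        "aggs": {
            "per_minute": {
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
                "aggs": {"avg_value": {"avg": {"field": "value"}}},
            }
        },
    }
    resp = requests.post(f"{ES}/events/_search", json=query)
    print(resp.json()["aggregations"]["per_minute"]["buckets"])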
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used ElasticSearch for storing events and profiling information from a web site. I even wrote a statsd backend storing stat information in elasticsearch.
After Elasticsearch changed Kibana from version 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric per query, so if you want to chart time, average time, and the 90% average time, you must do three queries instead of one that returns three values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana, a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events as something similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
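For a sense of how lightweight statsd is: its wire protocol is one plain-text datagram per stat over UDP, so you don't even need a client library to send metrics (host/port below are the statsd defaults; the metric names are made up):

    import socket

    def send_statsd(metric, value, kind="c", host="127.0.0.1", port=8125):
        """Fire-and-forget one statsd datagram, e.g. 'page.views:1|c' for a counter."""
        payload = f"{metric}:{value}|{kind}".encode()
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(payload, (host, port))
        sock.close()

    send_statsd("site.events.page_view", 1)            # counter
    send_statsd("site.response_time", 120, kind="ms")  # timer, in milliseconds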
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph based on it (say for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as and when new data is received. The data could be retrieved every minute. Should I use node.js for this?
Ours is a Rails application, and I would like to know what libraries/database I should use to model this capability.
It depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is better. If most of the queries are key->value gets and sets, meaning you are going to fetch the data using one key and you need to support many clients and many sets/gets per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a great review of some of the most popular ones.
There are many ways to do this. If getting an update once a minute is enough, have the client do AJAX calls every minute to get the updated data; you can then build your server side using PHP, .NET, Java servlets, or node.js, again depending on the expected user concurrency. PHP is very easy to develop on, while node.js can support many short I/O requests. Another option you might want to consider is using server push (Node's socket.io, for example) instead of client AJAX calls. That way the client will be notified immediately when there is an update.
Personally, I like both node.js and Redis and have used them in a couple of production applications, supporting many concurrent users on a single server. I like Node since it's easy to develop with and supports many users, and Redis for its amazing speed and handling of concurrent requests. Having said that, I also use MySQL for saving relational data and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are going for low latency and an easy-to-implement solution. You can use Pub/Sub to push updates to clients (e.g. Node's socket.io) in real time and run a second Redis instance to store the JSON data as a sorted set using the timestamp as the score. I've used the same approach with much success for storing time-based statistical data. The downside to this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
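A minimal sketch of that pattern with redis-py (key and channel names are made up; in a Rails app the equivalent calls go through the redis gem):

    import json
    import time
    import redis  # pip install redis

    r = redis.Redis()

    def record_tick(symbol, price, ts=None):
        """Store a tick in a sorted set scored by timestamp and push it to live subscribers."""
        ts = ts or time.time()
        member = json.dumps({"t": ts, "symbol": symbol, "price": price})
        r.zadd(f"ticks:{symbol}", {member: ts})  # history, ordered by time
        r.publish("ticks", member)               # real-time fan-out (e.g. to a socket.io bridge)

    def history(symbol, since, until):
        return [json.loads(m) for m in r.zrangebyscore(f"ticks:{symbol}", since, until)]

    record_tick("MSFT", 20)
    record_tick("AAPL", 30)
    print(history("MSFT", 0, time.time()))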
If you are looking to store a lot of data in JSON format and want to use a pull to fetch data every minute, then using ElasticSearch to store/retrieve data is another possibility. You can use ElasticSearch’s range query to search using a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API.
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
Also, MongoDB might be a good fit too, since JSON maps directly to BSON.
But if it's just serving data, even something as simple as memcached would suffice.
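For example, with Postgres's jsonb type the feed can be stored as-is and still queried per symbol (the table name and connection string below are placeholders):

    import psycopg2  # pip install psycopg2-binary
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=quotes user=app host=localhost")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS ticks "
                "(ts timestamptz DEFAULT now(), payload jsonb)")

    # Store one snapshot of the feed as-is.
    cur.execute("INSERT INTO ticks (payload) VALUES (%s)",
                (Json({"MSFT": 20, "AAPL": 30}),))
    conn.commit()

    # Pull the MSFT series back out for charting.
    cur.execute("SELECT ts, payload->>'MSFT' FROM ticks ORDER BY ts")
    print(cur.fetchall())

    cur.close()
    conn.close()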
If you have a lot of data to keep updated in real time, like stock ticker prices, the solution should involve the server publishing to the client, not the client continually hitting the server for updates. A publish/subscribe (pub/sub) model over WebSockets might be a good choice at the moment, depending on your client requirements.
For plotting the data using data from websockets there is already a question about that here.
Ruby-toolbox has a category called HTTP Pub Sub, which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with it aside from just streaming stock prices. Redis may be the better choice for performance. Note also that websocket-rails assumes Redis, if you were to use that (just as an example).
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
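To show the shape of the push model (illustrated with Python and aiohttp rather than the Rails/websocket-rails stack mentioned above; the route, port, and price values are placeholders):

    import asyncio
    import json
    from aiohttp import web  # pip install aiohttp

    CLIENTS = set()

    async def ws_handler(request):
        ws = web.WebSocketResponse()
        await ws.prepare(request)
        CLIENTS.add(ws)
        try:
            async for _ in ws:   # keep the connection open; ignore client messages
                pass
        finally:
            CLIENTS.discard(ws)
        return ws

    async def push_prices(app):
        while True:
            tick = json.dumps({"MSFT": 20, "AAPL": 30})  # placeholder prices
            for ws in set(CLIENTS):
                await ws.send_str(tick)                  # push to every connected client
            await asyncio.sleep(60)                      # feed updates once a minute

    async def start_pusher(app):
        app["pusher"] = asyncio.create_task(push_prices(app))

    app = web.Application()
    app.add_routes([web.get("/ws", ws_handler)])
    app.on_startup.append(start_pusher)
    web.run_app(app, port=8080)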
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from Square.