How do I store JSON data from websockets?

I’m looking at storing raw JSON data received from a bunch of websockets and am feeling a bit overwhelmed with all the choices available. I found this from 2012 but things have moved on a bit since.
My requirements are to be able to query the data through an API, to get the latest message (ideally in realtime) or a subset of messages (like all messages received on a particular date).
I'll be storing around 10,000 messages daily from different sources with different schemas.
From some basic research I think I need some kind of NoSQL document store? The main ones seem to be:
Elasticsearch
Redis
MongoDB
CouchDB
Am I on the right lines here? What DB service should I use to store JSON data?
Thanks.

I ended up streaming data from Kafka to both Postgres (for persistent storage) and Elasticsearch (you know, for search) if that helps anyone.
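In case it helps anyone further, here is a rough Python sketch of that pipeline, assuming kafka-python, psycopg2 and a recent (8.x-style) elasticsearch Python client; the raw_messages topic, the messages table and all connection details below are made-up placeholders, not part of my actual setup.

import json

from kafka import KafkaConsumer           # pip install kafka-python
import psycopg2
from psycopg2.extras import Json
from elasticsearch import Elasticsearch   # pip install elasticsearch

# Placeholder topic, credentials and index names.
consumer = KafkaConsumer(
    "raw_messages",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
pg = psycopg2.connect("dbname=feeds user=feeds password=secret host=localhost")
es = Elasticsearch("http://localhost:9200")

with pg, pg.cursor() as cur:
    # One JSONB column keeps the messages schemaless; received_at supports
    # "all messages received on a particular date" queries.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id BIGSERIAL PRIMARY KEY,
            received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
            payload JSONB NOT NULL
        )
    """)

for msg in consumer:
    doc = msg.value
    with pg, pg.cursor() as cur:                  # persistent storage
        cur.execute("INSERT INTO messages (payload) VALUES (%s)", [Json(doc)])
    es.index(index="messages", document=doc)      # you know, for search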

Related

Connect Adobe Analytics to MySQL

I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your MySQL db to AA and make queries against it or anything like that.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it within your local db. But in general, I've listed them in order of how difficult it is to set something up and then do something with the file(s) once received, to get them into your database.
The first option is to schedule data to be FTP'd to you on a regular basis from within the AA interface. This can be a scheduled report from the report interface or from Data Warehouse, and it can be delivered in a variety of formats, though most commonly as a CSV file. This exports data that has already been processed by AA, meaning aggregated metrics, etc. Overall, this is pretty easy to set up, and the exported CSV files are easy to parse. But there are a number of caveats/limitations to it, and whether they matter largely depends on what specifically you're aiming to do.
The second option is to make use of their API endpoint to make requests and receive responses in JSON format. You can also receive them in XML format, but I recommend not doing that. You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on demand, building interfaces (if relevant to you), etc. But it also comes with the same caveats/limitations as the first option, since the data is already processed/aggregated.
The third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe's collection servers without Adobe doing anything to it, but it's not 100% like a server request log or anything. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have other files that are lookup tables for both the columns and the values within them. It's a bit of a headache piecing it all together, but it's doable (see the sketch below). The real issue is file size. These are raw hit data files, and even a site with moderate traffic will generate files many gigabytes large, daily or even hourly. So bandwidth, disk space, and your server's processing power are things to consider if you attempt to go this route.
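To make the third option a bit more concrete, here is a rough Python sketch of joining a raw hit file against one of its lookup files and loading the result into MySQL. The file names, delimiter and column list below are made-up placeholders (your actual feed configuration defines them), and mysql-connector-python is assumed.

import csv
import mysql.connector  # pip install mysql-connector-python

# Placeholder file names and columns -- check your own feed configuration.
HIT_FILE = "hit_data.tsv"
BROWSER_LOOKUP = "browser.tsv"   # assumed two-column lookup: id -> readable name
HIT_COLUMNS = ["hit_time_gmt", "visitor_id", "page_url", "browser"]

# Load the lookup table into a dict so hit rows can be translated.
with open(BROWSER_LOOKUP, newline="") as f:
    browsers = dict(csv.reader(f, delimiter="\t"))

db = mysql.connector.connect(host="localhost", user="aa", password="secret",
                             database="analytics")
cur = db.cursor()

with open(HIT_FILE, newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        hit = dict(zip(HIT_COLUMNS, row))
        # Replace the lookup id with its human-readable value.
        hit["browser"] = browsers.get(hit["browser"], "unknown")
        cur.execute(
            "INSERT INTO hits (hit_time_gmt, visitor_id, page_url, browser) "
            "VALUES (%s, %s, %s, %s)",
            (hit["hit_time_gmt"], hit["visitor_id"], hit["page_url"], hit["browser"]),
        )

db.commit()
cur.close()
db.close()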

How to store JSON in DB without schema

I have a requirement to design an app to store JSON via a REST API. I don't want to put a limitation on JSON size (number of keys, etc.). I see that MySQL supports storing JSON, but we have to create a table/schema and then store the records.
Is there any way to store JSON in any type of DB and then query the data by keys?
EDIT: I don't want to use any in-memory DB like Redis.
Use Elasticsearch. In addition to schemaless JSON, it supports fast search (a small indexing sketch follows the list below).
The tagline of Elasticsearch is "You know, for search".
It is built on top of a text-indexing library called Apache Lucene.
The advantages of using Elasticsearch are:
Scales to clusters holding petabytes of data.
Fully open source. No cost to pay.
Enterprise support is available with a Platinum license.
Comes with additional benefits such as analytics using Kibana.
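As a rough illustration of the schemaless part, assuming the official elasticsearch Python client with 8.x-style keyword arguments and a made-up docs index:

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# No schema up front: Elasticsearch derives a mapping from the documents it sees.
es.index(index="docs", document={"user": "alice", "plan": {"tier": "gold"}})
es.index(index="docs", document={"user": "bob", "signup_source": "mobile"})
es.indices.refresh(index="docs")  # make the new docs searchable right away

# Query by any key that happens to exist in the documents.
resp = es.search(index="docs", query={"match": {"plan.tier": "gold"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])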
I believe NoSQL is the best solution here, namely MongoDB. I have tested MongoDB; it looks good and has a Python module that makes it easy to interact with. For a quick overview of the pros: https://www.studytonight.com/mongodb/advantages-of-mongodb
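For the MongoDB route, a minimal sketch with the pymongo driver (a local server and made-up database/collection names assumed) shows the same insert-then-query-by-key pattern:

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
collection = client["appdb"]["json_docs"]   # placeholder db/collection names

# Documents in the same collection can have completely different keys.
collection.insert_one({"user": "alice", "plan": {"tier": "gold"}})
collection.insert_one({"user": "bob", "signup_source": "mobile"})

# Query by key, including nested keys via dot notation.
for doc in collection.find({"plan.tier": "gold"}):
    print(doc)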
I've had great results with Elasticsearch, so I second this approach as well. One question to ask yourself is how you plan to access the JSON data once it is in a repository like Elasticsearch. Will you simply store the JSON doc or will you attempt to flatten out the properties so that they can be individually aggregated? But yes, it is indeed fully scalable by increasing your compute capacity via instance size, expanding your disk space or by implementing index sharding if you have billions of records in a single index.

Real-time migration of data from MySQL to Elasticsearch?

I have tons of data in MySQL, in the form of different databases and their respective tables, and it is all related. But when I have to analyze the data, I have to write different scripts that combine and merge it to produce a result, and this takes a lot of time and effort. I love Elasticsearch for its speed and for visualizing data via Kibana, so I have decided to move my entire MySQL data set to Elasticsearch in real time, while keeping the data in MySQL too. But I want a scalable strategy and process for migrating that data to Elasticsearch.
Please suggest the best tool or method to do the job.
Thank you.
Prior to Elasticsearch 2.x you could write your own Elasticsearch _river plugin and install it into Elasticsearch. You can control how often the data you've munged with your scripts is pulled in by the _river (note: this is not really recommended).
You may also use your favourite queuing/message broker tool, such as ActiveMQ, to push your data into Elasticsearch.
If you want full control to meet your need for real-time migration of data, you may also write a simple app that makes use of Elasticsearch's REST endpoint, by simply writing to it via REST. You can even do bulk POSTs (a rough sketch follows this list of options).
Make use of the Elasticsearch tools such as Beats and Logstash, which are great at shipping almost any type of data into Elasticsearch.
For other alternatives for munging your data into a flat file, or if you want to maintain relationships, see this post here.
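Going back to the "simple app that writes to the REST endpoint" option, a rough Python sketch might look like this (mysql-connector-python and the elasticsearch client's bulk helper assumed; the products table, its id column and the connection details are placeholders). For near real time you would schedule it or trigger it from a queue as suggested above.

import mysql.connector                              # pip install mysql-connector-python
from elasticsearch import Elasticsearch, helpers    # pip install elasticsearch

db = mysql.connector.connect(host="localhost", user="app", password="secret",
                             database="shop")
es = Elasticsearch("http://localhost:9200")

def fetch_rows():
    cur = db.cursor(dictionary=True)        # rows come back as dicts
    cur.execute("SELECT * FROM products")   # placeholder table
    for row in cur:
        yield {
            "_index": "products",
            "_id": row["id"],               # reuse the MySQL primary key
            "_source": row,
        }
    cur.close()

# helpers.bulk wraps the _bulk REST endpoint and batches the requests for you.
helpers.bulk(es, fetch_rows())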

Technology stack for a multiple queue system

I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, so I can get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, and in order to avoid the delay (for example in China), the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish with your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite huge; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, and users work on them depending on the queue for the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the list-of-tasks requests (e.g. get me 5k rows for the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects. These 10k rows, for example, need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a new row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL. It's a relational database and it has the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension from PostgreSQL (or its native json/jsonb types) to store JSON, or dynamic columns from MariaDB (a MySQL fork).
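A minimal sketch of the PostgreSQL route with psycopg2 and a jsonb column (the tasks table, its fields and the filter below are made-up examples):

import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=tasks user=app password=secret host=localhost")

with conn, conn.cursor() as cur:
    # One jsonb column per task keeps the API payload as-is but still filterable.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id BIGSERIAL PRIMARY KEY,
            country TEXT NOT NULL,
            payload JSONB NOT NULL
        )
    """)
    cur.execute("INSERT INTO tasks (country, payload) VALUES (%s, %s)",
                ("china", Json({"id": 42, "description": "review image",
                                "image_urls": ["http://example.com/a.jpg"]})))

    # ->> extracts a JSON field as text, so normal WHERE clauses work on it.
    cur.execute("""
        SELECT payload
        FROM tasks
        WHERE country = %s AND payload ->> 'description' LIKE %s
        LIMIT 5000
    """, ("china", "%image%"))
    rows = cur.fetchall()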
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would let you store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching the JSON content in a slow way.

What would be a preferable approach for rendering time series data

We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph based on it (say for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as and when new data is received. The data could be retrieved every minute. Should I use Node.js for this?
Ours is a Rails application, and I would like to know what libraries/database I should use to model this capability.
Depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is better. If most of the queries are get and set with key->value, meaning you are going to fetch the data using one key, and you need to support many clients and many sets/gets per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a great review of some of the most popular ones.
There are so many ways to do this. If getting an update once a minute is enough, have the client make AJAX calls every minute to get the updated data; then you can build your server side using PHP, .NET, Java servlets or Node.js, again depending on the expected user concurrency. PHP is very easy to develop with, while Node.js can handle many short I/O requests. Another option you might want to consider is server push (Node's socket.io, for example) instead of the client AJAX call. That way the client will be notified immediately on an update.
Personally, I like both Node.js and Redis and have used them in a couple of production applications supporting many concurrent users on a single server. I like Node since it's easy to develop with and supports many users, and Redis for its amazing speed and handling of concurrent requests. Having said that, I also use MySQL for saving relational data, and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are so many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are going for a low-latency solution that is easy to implement. You can use Pub/Sub to push updates to clients (e.g. Node’s socket.io) in real time and run a second Redis instance to store the JSON data as a sorted set, using the timestamp as the score. I’ve used the same approach with much success for storing time-based statistical data. The downside to this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
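A minimal sketch of the sorted-set part with redis-py (key names are made up, and the Pub/Sub plus socket.io plumbing is left out):

import json
import time

import redis  # pip install redis

r = redis.Redis()

def record_tick(symbol, price, ts=None):
    ts = ts or time.time()
    # Score = timestamp, member = the JSON payload, so the set stays time-ordered.
    r.zadd(f"ticks:{symbol}", {json.dumps({"t": ts, "price": price}): ts})

def ticks_between(symbol, start_ts, end_ts):
    raw = r.zrangebyscore(f"ticks:{symbol}", start_ts, end_ts)
    return [json.loads(m) for m in raw]

record_tick("MSFT", 20)
record_tick("AAPL", 30)
last_hour_msft = ticks_between("MSFT", time.time() - 3600, time.time())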
If you are looking to store a lot of data in JSON format and want to poll for data every minute, then using Elasticsearch to store/retrieve the data is another possibility. You can use Elasticsearch’s range query to search on a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API. 
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
MongoDB might also be a good fit, as JSON -> BSON.
But if it's just serving data, even something as simple as memcached would suffice.
If you have a lot of data to keep updated in real-time like stock ticker prices, the solution should involve the server publishing to the client, not the client continually hitting the server for updates. Publish/subscribe (pub/sub) type model with websockets might be a good choice at the moment, depending on your client requirements.
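As a rough illustration of the pub/sub side in Python with redis-py (channel name made up; the websocket layer that actually fans messages out to browsers, e.g. socket.io or websocket-rails, is omitted):

import json

import redis  # pip install redis

r = redis.Redis()

# Publisher: call this whenever a new quote arrives from the feed.
def publish_tick(symbol, price):
    r.publish("ticks", json.dumps({"symbol": symbol, "price": price}))

# Subscriber: a websocket server process would run a loop like this and
# forward each message to its connected clients.
def consume_ticks():
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe("ticks")
    for message in pubsub.listen():
        tick = json.loads(message["data"])
        print("push to clients:", tick)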
For plotting the data using data from websockets there is already a question about that here.
Ruby-toolbox has a category called HTTP Pub Sub which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with it aside from just streaming stock prices. Redis may be a better choice for performance. Note also that websocket-rails assumes Redis, if you were to use that, just as an example.
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from square.