Our cloud service deals with chunks of JSON data (items) that are manipulated all the time; an item can change as often as every second. Now we need to implement versioning of these items as well.
Basically, every time a request to modify the object arrives, it is modified and saved to the DB, and we also need to store that version somewhere, so that later on you will be able to say "give me version 345 of this item".
My question is: what would be the ideal way to store this history? Mind you, we do not need to query or alter the data once saved; all we need is to load a version if necessary (0.01% of the time). The data is basically an opaque blob.
We are researching multiple approaches:
Simple text files (file system)
Cloud storage (e.g. S3)
Version control (e.g. Git)
Database (any)
Vault (e.g. HashiCorp Vault)
The main problem is that since items are updated every second, we end up with a lot of blobs. Consider 100 items, each updated every second: that's 8,640,000 records in a single day, not to mention 100 rps for the DB.
Do you have any recommendation as to what would be the optimal approach? We need it to be scalable, fast and reliable, and encryption out of the box would be a great plus.
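To make the cloud storage option concrete, this is roughly what I picture (a rough sketch with boto3; the bucket name and key scheme are placeholders, not something we've built):

```python
# Rough sketch: store each saved version as its own S3 object,
# keyed by item id and version number. Names are placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "item-version-history"  # placeholder bucket name

def save_version(item_id: str, version: int, item: dict) -> None:
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{item_id}/{version}",
        Body=json.dumps(item).encode("utf-8"),
        ServerSideEncryption="AES256",  # encryption at rest out of the box
    )

def load_version(item_id: str, version: int) -> dict:
    # Rarely called (the "0.01% of the time" case)
    obj = s3.get_object(Bucket=BUCKET, Key=f"{item_id}/{version}")
    return json.loads(obj["Body"].read())
```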
I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your MySQL db to AA, run queries against it, or anything like that.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it within your local db. But in general, I've listed them in order of how difficult it is to set up the export and to process the received file(s) into your database.
The first option is to schedule data to be FTP'd to you on a regular basis from within the AA interface. This can be a scheduled report from the report interface or from Data Warehouse, and can be delivered in a variety of formats, but most commonly as a CSV file. This exports data that has already been processed by AA, meaning aggregated metrics, etc. Overall, this is pretty easy to set up, and the exported CSV files are easy to parse. There are a number of caveats/limitations to it, but whether they matter largely depends on what specifically you're aiming to do.
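As a rough illustration of consuming one of those scheduled CSV exports (the file name, table name and column names below are made up; your actual layout depends on the report you schedule):

```python
# Rough sketch: load a scheduled AA CSV export into a local MySQL table.
# File name, table name and column layout are hypothetical.
import csv
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="analytics"
)
cur = conn.cursor()

with open("daily_report.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = [(r["Date"], r["Page"], int(r["Page Views"])) for r in reader]

cur.executemany(
    "INSERT INTO page_views (report_date, page, views) VALUES (%s, %s, %s)",
    rows,
)
conn.commit()
conn.close()
```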
The second option is to make use of their API endpoint: you make requests and receive responses in JSON format. You can also receive XML, but I recommend not doing that. You will get similar data as above, but it's on-demand rather than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on demand, building interfaces (if relevant to you), etc. It comes with some of the same caveats/limitations as the first option, since the data is already processed/aggregated.
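Very roughly, the request/response cycle looks like this (sketch only; the endpoint path, company id, credentials and report payload are placeholders to verify against Adobe's current API docs, and you need an access token from their developer console first):

```python
# Rough sketch: request a report from the Analytics API and get JSON back.
# Endpoint, company id, credentials and payload shape are placeholders.
import requests

COMPANY_ID = "your-company-id"
URL = f"https://analytics.adobe.io/api/{COMPANY_ID}/reports"  # verify against Adobe's docs
headers = {
    "Authorization": "Bearer <access-token>",
    "x-api-key": "<client-id>",
}
payload = {"rsid": "your-report-suite", "dimension": "variables/page"}  # illustrative only

resp = requests.post(URL, headers=headers, json=payload, timeout=30)
resp.raise_for_status()
report = resp.json()  # JSON you can then map into MySQL rows
```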
The third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe's collection servers without Adobe doing anything to it, though it's not 100% like a server request log. Without knowing any details about what you ultimately want to do with the data, other than put it in a local db, at face value this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache: you get files of raw data with a LOT of columns containing values for various things, plus other files that are lookup tables for both the columns and the values within them. It's a bit of a headache piecing it all together, but it's doable.
The real issue is file size. These are raw hit data files, and even a site with moderate traffic will generate files many gigabytes in size, daily or even hourly. So bandwidth, disk space, and your server's processing power are things to consider if you go this route.
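To give a feel for the "piecing it together" part: a data feed delivery ships the raw hit data alongside a file of column names and separate lookup files. The sketch below assumes the common file names and a two-column lookup; treat them as assumptions to verify against your own feed configuration.

```python
# Rough sketch: stitch a raw data feed file together with its lookup files.
# File names follow the usual delivery layout, but verify for your feed.
import csv

def read_tsv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

columns = read_tsv("column_headers.tsv")[0]   # single row of column names
browsers = dict(read_tsv("browser.tsv"))      # id -> browser name lookup table

with open("hit_data.tsv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        hit = dict(zip(columns, row))         # raw hit keyed by column name
        hit["browser_name"] = browsers.get(hit.get("browser"), "unknown")
        # ...insert `hit` (or the subset you care about) into your database
```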
I am in the process of building my first live Node.js web app. It contains a form that accepts data regarding my client's current stock. When submitted, an object is made and saved to an array of current stock. This stock is then permanently displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure if it is necessary to use MongoDB to store these, or whether there could be a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store it in a JSON file or even in a cache of sorts such as Redis, but I still think MongoDB would be your best bet for a live site.
Storing something in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens) you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file: if two processes try to update the file at the same time, you can potentially lose data. Technically speaking, a JSON file would work, but it's not recommended.
Storing in memory (e.g. Redis) has similar implications: the data is only available on that one server. Also, the data is not persistent, so if your server restarted for whatever reason, you'd lose what was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to test it with a load test. But as you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler way, i.e. a JSON file, use the nedb API, which is plenty fast as well.
I first want to give a little overview of what I'm trying to tackle. My service frequently fetches posts from various sources such as Instagram, Twitter, etc., and I want to store the posts in one large JSON file on S3. The file name would be something like: {slideshowId}_feed.json
My website will display the posts in a slideshow, and the slideshow will simply poll the S3 file every minute or so to get the latest data. It might even poll another file such as {slideshowId}_meta.json that has a timestamp of when the large file last changed, in order to save bandwidth.
The reason I want to keep the posts in a single JSON file is mainly to save cost. I could have each source as its own file, e.g. {slideshowId}_twitter.json, {slideshowId}_instagram.json, etc. but then the slideshow would need to send GET request to every source every minute, thus increasing the cost. We're talking about thousands of slideshows running at once, so the cost needs to scale well.
Now back to the question. There may be more than one instance of the service running that checks Instagram and other sources for new posts, depending on how much I need to scale out. The problem with that is the risk of one service overwriting the S3 file while another one might already be writing to it.
Each service that needs to save posts to the JSON file would first have to GET the file, process it and check that the new posts are not duplicated in the JSON file, and then store the new or updated posts.
Could I have each service write the data to some queue like the Simple Queue Service (SQS) and then have some worker that takes care of writing the posts to the S3 file?
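Something like this is what I imagine the worker doing (rough sketch with boto3; the queue URL, bucket and key are placeholders, and it assumes the feed file already exists):

```python
# Rough sketch: a single worker drains SQS and merges new posts into the S3 file.
# Queue URL, bucket and key are placeholders; assumes the feed file already exists.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-posts"
BUCKET, KEY = "slideshow-feeds", "slideshow123_feed.json"

def drain_once():
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    messages = resp.get("Messages", [])
    if not messages:
        return

    # Load the current feed and merge by post id so duplicates overwrite themselves.
    feed = json.loads(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())
    posts = {p["id"]: p for p in feed["posts"]}
    for m in messages:
        post = json.loads(m["Body"])
        posts[post["id"]] = post

    feed["posts"] = list(posts.values())
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(feed).encode("utf-8"))

    # Only delete messages after the merged file was written successfully.
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```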
I thought about using AWS Kinesis, but it just processes the data from the sources and dumps it to S3. I need to process what has been written to the large JSON file as well, to do some bookkeeping.
I had an idea of using DynamoDB to store the posts (basically to do the bookkeeping), and then I would simply have the service query all the data needed for a single slideshow from DynamoDB and store it to S3. That way the services would simply send the posts to DynamoDB.
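Roughly what I picture for that (sketch with boto3; the table name, key schema and bucket are placeholders):

```python
# Rough sketch: rebuild a slideshow's feed file from DynamoDB.
# Table name, key schema and bucket/key are placeholders.
import json
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("posts")
s3 = boto3.client("s3")

def rebuild_feed(slideshow_id: str) -> None:
    resp = table.query(KeyConditionExpression=Key("slideshowId").eq(slideshow_id))
    feed = {"posts": resp["Items"]}
    s3.put_object(
        Bucket="slideshow-feeds",
        Key=f"{slideshow_id}_feed.json",
        # default=str because DynamoDB returns numbers as Decimal
        Body=json.dumps(feed, default=str).encode("utf-8"),
    )
```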
There must be some clever way to solve this problem.
OK, for your use case:
there are many users of a single large S3 file
the file is updated often
the file path (ideally) should be consistent, to make it easier to GET and cache
the S3 file is generated by a process on an EC2 instance and updated once per minute
If the GET rate is less than 800 per second then AWS is happy with it. If not then you'll have to talk to them and maybe find another way. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
The file updates will be atomic so there are no issues with locking etc. See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Presumably, if a user requests the file "during" an update, they will see the old version. This behaviour is transparent to both parties.
File updates are "eventually" consistent. As you want to keep the URL the same, you will be updating the same object path in S3.
If you are serving across regions then the time it takes to become consistent might be an issue. For the same region it seems to take a few seconds. AWS don't seem to be very open about this, so it's probably best to test it for your use case. As your file is small and the updates are per 60 seconds then I would imagine it would be ok. You might have to assume in your API description that updates actually happen over a greater time than 60 seconds to take this into account
As EC2 and S3 run on different parts of the AWS infrastructure (EC2 in a VPC and S3 behind a public HTTPS endpoint), you will pay for transfer costs from EC2 to S3.
I would imagine that you will be serving the S3 file via the S3 static website hosting ("pretend to be a website") feature. You will have to configure this too, but that is trivial.
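For the updating side, the once-a-minute PUT from the EC2 process is just a plain object write. A minimal sketch with boto3 follows; the bucket, key, cache header and the build_feed helper are all placeholders for whatever you actually do.

```python
# Rough sketch: the EC2 process overwrites the same object once per minute.
# The PUT is atomic; readers see either the old or the new version, never a mix.
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "slideshow-feeds", "slideshow123_feed.json"  # placeholders

def build_feed():
    # Placeholder: assemble the latest posts from wherever your producers put them.
    return {"posts": [], "updated_at": time.time()}

while True:
    feed = build_feed()
    s3.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=json.dumps(feed).encode("utf-8"),
        ContentType="application/json",
        CacheControl="max-age=60",  # hint to clients/CDNs to re-fetch about once a minute
    )
    time.sleep(60)
```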
This is what I would do:
The Kinesis stream would need to have enough capacity to handle writes from all your feed producers. For about $25/month you get to do 2,000 writes per second.
Lambda would simply be fired whenever there are enough new items on your stream. You can configure the trigger to wait for 1,000 new items and then run the Lambda to read all new items from the stream, process them, and write them to Redis (ElastiCache). Your bill for that should be well under $10/month.
Smart key selection would take care of duplicate items. You can also set the items to expire if you need to. According to your description your items should definitely fit into memory, and you can add instances if you need more capacity for reading and/or reliability. Running two Redis instances with enough memory to handle your data would cost around $26/month.
Your service would use Redis instead of S3, so you would only pay for the data transfer, and only if your service is not on AWS (<$10/month?).
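The Lambda wired to the stream would look roughly like this (sketch only; the Redis endpoint and key scheme are placeholders, and it assumes each Kinesis record carries one post as JSON):

```python
# Rough sketch of the Lambda: read new posts from the Kinesis event and
# write them to Redis, keyed so that duplicates simply overwrite themselves.
# Redis endpoint and key scheme are placeholders.
import base64
import json
import redis

r = redis.Redis(host="my-cache.abc123.0001.use1.cache.amazonaws.com", port=6379)

def handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])  # Kinesis data is base64
        post = json.loads(payload)
        # e.g. feed:<slideshowId> is a hash of postId -> post JSON
        key = f"feed:{post['slideshowId']}"
        r.hset(key, post["id"], json.dumps(post))
        r.expire(key, 86400)  # optional: drop stale feeds after a day
```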
I have ~500 JSON files on my disk that represent hotels all over the world, each around 30 MB; all objects have the same structure.
At certain points in my Spring server I need to get the information for a single hotel, say by its code (which is inside the JSON object).
The data is read-only, but I might get updates from the hotel providers at certain times, such as extra JSON files or delta changes.
Now, I don't want to migrate my JSON files to a relational database, that's for sure, so I've been investigating the best solution to achieve what I want.
I tried Apache Drill, because querying straight from JSON files seemed like fewer headaches in dealing with the data. I did a directory query using Drill, something like:
SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474';
but this obviously does not seem to be the most efficient way for me as it takes around 10 seconds to fetch a single hotel.
At the moment I'm trying out CouchDB, but I'm still learning it. Should I migrate all the hotels into a single document (makes a bit of sense to me)? Or should I consider each hotel a document?
I'm just looking for pointers on a good solution to achieve what I want, so I'm here to get your opinions.
The main issue here is that JSON files do not have indexes associated with them, and Drill does not create indexes for them. So whenever you do a query like SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474'; Drill has no choice but to read each JSON file and parse and process all the data in each file. The more files and data you have, the longer this query will take. If you need to do lookups like this often, I would suggest not using Drill for this use case. Some alternatives are:
A relational database where you have an index built for the code column.
A key-value store where code is the key.
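For example, a one-off load into SQLite with an index on code already turns the lookup into an index hit instead of a full scan. This is just a sketch; the paths and the assumption that each file holds a list of hotels with a top-level "code" field are guesses about your JSON layout.

```python
# Rough sketch: load the hotel JSON files once into SQLite with an index on code,
# then look hotels up by code instead of scanning files. Paths are placeholders.
import glob
import json
import sqlite3

conn = sqlite3.connect("hotels.db")
conn.execute("CREATE TABLE IF NOT EXISTS hotels (code TEXT PRIMARY KEY, data TEXT)")

for path in glob.glob(r"C:\hotels\*.json"):
    with open(path, encoding="utf-8") as f:
        for hotel in json.load(f):  # assuming each file holds a list of hotel objects
            conn.execute(
                "INSERT OR REPLACE INTO hotels (code, data) VALUES (?, ?)",
                (hotel["code"], json.dumps(hotel)),
            )
conn.commit()

# Lookup by code is now an index hit instead of a multi-second scan.
row = conn.execute("SELECT data FROM hotels WHERE code = ?", ("1b3474",)).fetchone()
hotel = json.loads(row[0]) if row else None
```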
I'll describe the application I'm trying to build and the technology stack I'm considering at the moment, to get your opinion.
Users should be able to work through a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid the latency, for example in China, the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite huge; that's why there is an API call to get the tasks (~10k rows) and store them in a queue, and users work on them depending on the queue for the country they are in.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task list requests (e.g. get me 5k rows for the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects. These 10k rows, for example, need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option, unless I store every field of the JSON object as a new row. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and an API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database and it gives you the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension or the native json/jsonb types in PostgreSQL to store the JSON, or dynamic columns from MariaDB (a MySQL fork).
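A sketch of the PostgreSQL route with a jsonb column (table name, fields and connection string below are made up for illustration):

```python
# Rough sketch: store each task's raw JSON in a jsonb column and filter on its fields.
# Table name, fields and connection string are placeholders.
import json
import psycopg2

conn = psycopg2.connect("dbname=tasks user=app password=secret host=localhost")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS tasks (
        id      BIGINT PRIMARY KEY,
        queue   TEXT NOT NULL,
        payload JSONB NOT NULL
    )
""")

task = {"id": 42, "country": "CN", "description": "example task"}
cur.execute(
    "INSERT INTO tasks (id, queue, payload) VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING",
    (task["id"], "china", json.dumps(task)),
)

# Filter on a field inside the JSON payload.
cur.execute("SELECT payload FROM tasks WHERE payload->>'country' = %s LIMIT 500", ("CN",))
rows = cur.fetchall()

conn.commit()
cur.close()
conn.close()
```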
If you can move your persistence stack to Java, then many interesting options are available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would allow you to store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching JSON content in a slow way.