Technology stack for a multi-queue system - MySQL

I'll describe the application I'm trying to build and the technology stack I'm currently considering, to get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the relevant information: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid the latency (for example in China) the tasks are stored in a queue.
So there will be different queues depending on your country, and once you finish a task it is sent to another queue, which later writes the result back to the original datacenter.
The list of tasks is quite large, which is why there is an API call to fetch the tasks (~10k rows) and store them in a queue; users then work on them from the queue for their country.
For this system, which could have around 100 queues, I was thinking of Redis to manage the task-list requests (e.g. get 5k rows for the China queue, write 500 rows to the write queue, etc.).
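To make the queue side concrete, here is a rough sketch of the Redis interaction I have in mind (queue names, payload shape, and the redis-py client are placeholders for illustration, not a final design):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Load a batch of tasks fetched from the API into a per-country queue.
    def load_tasks(country, tasks):
        for task in tasks:
            r.rpush("tasks:" + country, json.dumps(task))

    # A worker pops the next task for its country.
    def next_task(country):
        raw = r.lpop("tasks:" + country)
        return json.loads(raw) if raw else None

    # A finished task goes to the write queue, which is drained back
    # to the original datacenter later.
    def finish_task(result):
        r.rpush("tasks:write", json.dumps(result))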
The API response comes as a list of JSON objects, and these ~10k rows need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a separate column. The first thought is a NoSQL DB, but I wasn't too happy with MongoDB in the past, and the shape of the API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database, and it can store JSON and filter by it.
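This is roughly what I imagine the PostgreSQL side looking like; the table and field names are made up, and I'm assuming the jsonb column type available in recent PostgreSQL versions:

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=tasks_db")
    cur = conn.cursor()

    # One row per task; the raw API object is kept in a jsonb column.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id      bigint PRIMARY KEY,
            country text NOT NULL,
            payload jsonb NOT NULL
        )
    """)

    task = {"id": 42, "country": "CN", "status": "open", "description": "verify product photo"}
    cur.execute(
        "INSERT INTO tasks (id, country, payload) VALUES (%s, %s, %s)",
        (task["id"], task["country"], Json(task)),
    )

    # Filter on a field inside the JSON payload.
    cur.execute("SELECT id FROM tasks WHERE payload ->> 'status' = %s", ("open",))
    conn.commit()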
What do you think? Ask me if something isn't clear.

You can use PostgreSQL's hstore extension (flat key/value pairs) or its native JSON type to store this data, or dynamic columns in MariaDB (a MySQL fork).
If you can move your persistence stack to Java, many interesting options become available: MapDB (though it is memory-hungry and its API is changing rapidly), Persistit, or MVStore (the storage engine behind H2).
All of these would let you store JSON with decent performance. I suggest you use a full-text search engine like Lucene so you don't have to search the JSON content in a slow way.
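As an example of the MariaDB route, dynamic columns let you pack the per-task attributes into a blob and still filter on them. This is only a rough sketch; the table, attribute names, and connection details are invented:

    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret", database="tasks_db")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS tasks (
            id    BIGINT PRIMARY KEY,
            attrs BLOB
        )
    """)

    # Pack arbitrary task attributes into the dynamic-columns blob.
    cur.execute(
        "INSERT INTO tasks (id, attrs) VALUES (%s, COLUMN_CREATE('country', %s, 'status', %s))",
        (42, "CN", "open"),
    )

    # Filter on one of the dynamic columns.
    cur.execute("SELECT id FROM tasks WHERE COLUMN_GET(attrs, 'country' AS CHAR) = %s", ("CN",))
    conn.commit()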

Related

Storing/versioning a large number of JSON objects

Our cloud service deals with chunks of JSON data ("items") which are manipulated all the time; an item can change as often as every second.
At the moment an item is a JSON object that is modified constantly. Now we also need to implement versioning of these items.
Basically, every time a request to modify the object arrives, it is modified and saved to the DB, and then we also need to store that version somewhere, so that later on you can say "give me version 345 of this item".
My question is: what would be the ideal way to store this history? Mind you, we do not need to query or alter the data once saved; all we need is to load it if necessary (0.01% of the time). The data is basically an opaque blob.
We are researching multiple approaches:
Simple text files (file system)
Cloud storage (e.g. S3; see the sketch below)
Version control (e.g. Git)
Database (any)
A vault (e.g. HashiCorp Vault)
The main problem is that since items are updated every second, we end up with a lot of blobs. Consider 100 items, each updated every second: that's 8,640,000 records in a single day, not to mention a constant 100 writes per second against the DB.
Do you have any recommendation as to the optimal approach? We need it to be scalable, fast and reliable; encryption out of the box would be a great plus.
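For reference, the cloud-storage option could be as simple as writing each version to its own key. A minimal sketch with boto3 and S3 (the bucket name and key layout are made up; server-side encryption is just one way to get encryption out of the box):

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "item-version-history"  # placeholder bucket name

    # Write one immutable blob per (item, version); versions are never rewritten.
    def save_version(item_id, version, item):
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{item_id}/{version:012d}.json",
            Body=json.dumps(item).encode("utf-8"),
            ServerSideEncryption="AES256",  # encryption at rest
        )

    # The rare (0.01%) read path: "give me version 345 of this item".
    def load_version(item_id, version):
        obj = s3.get_object(Bucket=BUCKET, Key=f"{item_id}/{version:012d}.json")
        return json.loads(obj["Body"].read())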

Is it good to store MySQL event logs in MongoDB?

Earlier in our database design, we used to create mandatory fields for each table; a few important fields were:
created_by
created_time
created_by_ip
updated_by
updated_time
updated_by_ip
Now it's the era of schema-less design, and we prefer MongoDB or some other write-oriented database.
My questions here are:
Is it good practice to maintain logs in a separate database?
Do we need to create a separate log collection in MongoDB for each MySQL table, or is it okay to have a single MongoDB audit collection for all MySQL tables?
What needs to be considered when querying the results from MongoDB?
What should the structure of the MongoDB collection be?
Are there any other alternatives for storing logs?
Consider the situation where we want to delete a registered user if they have not authenticated within a specified time (max of 48 hours).
If all the time logs are kept in MongoDB, how can we query them from MySQL?
You usually want this (audit?) data next to the real data, and definitely not in a different DB engine, as the number of partial-failure cases you have to handle becomes quite a nightmare (e.g. someone registered, but you fail to insert the audit data: is this OK? Should the account become orphaned? What happens if the app goes down half way through?).
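For illustration, keeping the audit record next to the real data can be as simple as writing both rows inside one transaction. A rough sketch, with invented table and column names:

    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret", database="app_db")

    # The registration and its audit record either both commit or both roll back.
    def register_user(email, client_ip):
        with conn.cursor() as cur:
            cur.execute("INSERT INTO users (email) VALUES (%s)", (email,))
            user_id = cur.lastrowid
            cur.execute(
                "INSERT INTO audit_log (entity, entity_id, action, actor_ip, at) "
                "VALUES ('user', %s, 'created', %s, NOW())",
                (user_id, client_ip),
            )
        conn.commit()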
Systems that have this separation usually use messaging, with 2 different listeners responsible for storing the data and storing the audit (e.g. one writing to a relational DB and the other to an event store). This way you have a better chance of achieving eventual consistency.
Edit
There are a few options around using messaging, and the assumption here is that both sources of data must be kept in sync (or as close as possible). Please bear in mind that I still think storing data + audit together is by far the simplest and most sensible approach.
Using messaging, your app can emit a message on certain events (e.g. user created). Then 2 different listeners react to this message: one listener stores the data in one DB engine, and another listener stores the audit data. The problem with this approach is that you might need to ensure ordering of the messages, which makes it really slow.
Another (scary) approach is to use distributed (XA) transactions between MySQL and a messaging system (as MongoDB doesn't support transactions). The MySQL write and the message would then be committed together, and a listener can receive the audit data and store it in MongoDB.
I need to emphasize that the 2 approaches above are horrible and should never be implemented.
There are more sensible approaches, but they might require a different tech stack. For example, using Event Sourcing + CQRS you can store the events (with the audit data) and keep the final read models without the audit data.

How did Facebook or Twitter implement their subscription system

I'm working on an SNS-like mobile app project, where users upload content and can see updates from their subscribed topics or friends on their home page.
I store user content in MySQL, and I build the user-specific home page by first querying who and what the user subscribes to, then querying the content table with a 'WHERE userid IN (....) OR topic IN (....)' clause.
I suspect this will become quite slow as the content table grows or when a user subscribes to tons of users or topics. Our newly released app is already gaining thousands of new users each week, and more over time, so scalability has to be a concern for us right now.
So I wonder how Facebook or Twitter handle this subscription problem with their enormous number of users. Do they maintain a list for each user? I tried searching, but all I found was how to interact with Facebook or Twitter, rather than how they actually implement this feature.
I noticed that on Facebook you only see updates rather than history in your feed, which means that subscribing to a new user won't dump lots of outdated content into your feed, as it would with my current method.
How does Facebook design their database, and how do they dispatch new content to subscribed users?
My backend is currently PHP + MySQL, and I don't mind introducing other backend technologies such as Redis or JMS if that's the way it should be done.
Sounds like you're still at a pretty early stage. There are any number of ways to solve this, depending on what DAU level you expect to hit in the near term, how much money you have to spend on hardware, how much time you have to build it, etc.
You can try an interim table that queues up newly introduced items along with metadata about what each entails (which topic, the list of friend user_ids, etc.). Then use a queue/consumer system like RabbitMQ or Gearman to manage consumption of this growing list and decide who should process each item. Build the consumer program in Scala or on a JVM stack like Maven/Tomcat, something that can run persistently. If you really want to stick with PHP, build a PHP REST API that lives in php5-fpm's memory, managed by the FastCGI process manager, fronted by a proxy like nginx, and triggered by curl calls at an appropriate interval from a cron-executed script.
[EDIT] It's probably better not to use a DB as a queueing system; use an in-memory store like Redis instead. It outperforms a DB in many ways and can persist to disk (look up RDB and AOF). It's not very fault tolerant: if a job fails all of a sudden you might lose the job record, but most likely you won't care about these crash edge cases. Also look up php-resque!
To prepare for notifications to go out efficiently, I'm assuming you're already denormalizing the tables. I'd imagine a "user_topic" table mapping each topic to the users subscribed to it. Create another table, "notification_metadata", describing where users prefer to receive notifications (SMS/push/email/in-app) and the metadata needed to push to those channels (mobile client keys for APNS/GCM, email addresses, user auth tokens). Use JSON blobs for those two fields in notification_metadata, so each user has a single row; this saves I/O hits on the DB.
Use user_id as the primary key for "notification_metadata" and (user_id, topic_id) as the PK for "user_topic". DO NOT add an auto-increment "id" field to either; it's pretty useless in this use case (it takes up space, CPU, index memory, etc.). If both fields are in the PK, queries on user_topic will be served entirely from the (in-memory) index, and the only disk hit is on "notification_metadata" during the JOIN.
So if a user subscribes to 2 topics there will be two rows in "user_topic", and each user will always have a single row in "notification_metadata".
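In SQL terms, those two tables might look roughly like this; the column names and types are only a guess at what is described above:

    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret", database="sns")
    with conn.cursor() as cur:
        # Topic subscriptions: composite PK, no auto-increment surrogate key.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS user_topic (
                user_id  BIGINT NOT NULL,
                topic_id BIGINT NOT NULL,
                PRIMARY KEY (user_id, topic_id)
            )
        """)
        # One row per user; channel preferences and push metadata kept as JSON blobs.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS notification_metadata (
                user_id   BIGINT NOT NULL PRIMARY KEY,
                channels  TEXT NOT NULL,
                push_keys TEXT NOT NULL
            )
        """)
    conn.commit()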
There are more ways to scale, like dynamically creating a new table for each new topic, sharding across MySQL instances by user_id, partitioning, etc. There are many ways to scale, especially with MySQL. Good luck!

What would be a preferable approach for rendering time series data

We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph from it (say, for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as new data is received. The data could be retrieved every minute. Should I use node.js for this?
Ours is a Rails application, and I would like to know which libraries/database I should use to model this capability.
It depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is the better fit. If most of the queries are key->value gets and sets, meaning you fetch the data by a single key and need to support many clients and many set/get operations per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a good review of some of the most popular ones.
There are many ways to do this. If getting an update once a minute is enough, have the client make an AJAX call every minute to fetch the updated data; you can build your server side using PHP, .NET, a Java servlet, or node.js, again depending on the expected user concurrency. PHP is very easy to develop in, while node.js can handle many short I/O requests. Another option you might want to consider is server push (Node's socket.io, for example) instead of client AJAX polling; this way the client is notified immediately when there is an update.
Personally, I like both node.js and Redis and have used them in a couple of production applications, supporting many concurrent users on a single server. I like Node because it's easy to develop with and supports many users, and Redis for its amazing speed with concurrent requests. Having said that, I also use MySQL for saving relational data, and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are going for low latency and something easy to implement. You can use Pub/Sub to push updates to clients in real time (e.g. via Node's socket.io) and run a second Redis instance to store the JSON data as a sorted set, using the timestamp as the score. I've used the same approach with much success for storing time-based statistical data. The downside to this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
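A rough sketch of the sorted-set side (key and channel names are made up; redis-py 3.x is assumed):

    import json
    import time
    import redis

    r = redis.Redis()

    # Store each tick in a per-symbol sorted set, scored by its timestamp,
    # and publish it for any real-time subscribers.
    def record_price(symbol, price, ts=None):
        ts = ts or time.time()
        tick = {"t": ts, "price": price}
        r.zadd("prices:" + symbol, {json.dumps(tick): ts})
        r.publish("ticks", json.dumps({"symbol": symbol, **tick}))

    # Fetch the last hour of ticks for one symbol (e.g. MSFT) for plotting.
    def last_hour(symbol):
        now = time.time()
        raw = r.zrangebyscore("prices:" + symbol, now - 3600, now)
        return [json.loads(x) for x in raw]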
If you are looking to store a lot of data in JSON format and want to pull data every minute, then using ElasticSearch to store/retrieve the data is another possibility. You can use ElasticSearch's range query to search on a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API. 
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
MongoDB might also be a good fit, since JSON maps to BSON.
But if it's just serving data, even something as simple as memcached would suffice.
If you have a lot of data to keep updated in real time, like stock ticker prices, the solution should involve the server publishing to the client, not the client continually hitting the server for updates. A publish/subscribe (pub/sub) model with WebSockets might be a good choice at the moment, depending on your client requirements.
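A minimal sketch of that pub/sub flow using Redis (the channel name is invented, and the WebSocket layer that forwards each message to connected browsers is only hinted at in a comment):

    import json
    import redis

    r = redis.Redis()

    # Publisher: called whenever a new price arrives from the feed.
    def publish_tick(symbol, price):
        r.publish("ticks", json.dumps({"symbol": symbol, "price": price}))

    # Subscriber: each push server listens and forwards ticks to its WebSocket clients.
    def listen():
        pubsub = r.pubsub()
        pubsub.subscribe("ticks")
        for message in pubsub.listen():
            if message["type"] == "message":
                tick = json.loads(message["data"])
                print(tick)  # forward to connected WebSocket clients here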
For plotting data received over WebSockets, there is already a question about that here.
The Ruby Toolbox has a category called HTTP Pub Sub which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with the data aside from just streaming stock prices; Redis may be the better choice for performance. Note also that websocket-rails, just as an example, assumes Redis if you were to use it.
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from Square.

neo4j or neo4j+mysql for partial graph dataset

Even though I read another question here advising not to use both Neo4j and MySQL (neo4j - graph database along with a relational database?), I was wondering which approach would be best for a dataset where some of the data can be modeled as a graph and the rest looks relational. For various reasons, I can't post the kind of data I'm using.
I can shoehorn the relational part into neo4j but it looks ugly and complex, something I would want to avoid.
On the other hand, if I use both together, I'll have to run twice as many queries to get a result, decreasing performance (assume the DBs are in the cloud on separate machines).
I can't use MySQL alone because one of the queries requires a traversal depth of around 20-30, which I assume MySQL can't handle.
Have any of you encountered such a situation before? If so, how did you solve it?
As everyone else says: "give us a better idea of what data you are trying to model so we can best give you a suggestion".
That being said, dealing with 2 DBs is not an issue and is more common than people think: oftentimes you use a full-text store for searches, get back a list of document IDs, and then hit the relational DB for additional metadata. Or you hit Redis to get a list of IDs and then hit the relational DB for more data.
I built a proof of concept of a Neo4j + MySQL system for targeted searching based on your social network ("show me all restaurants my network has recommended, ordered by depth", e.g. 1st-level friend recommendations are weighted higher than 2nd-level, and so on) and it didn't feel awkward. But I also didn't take it to scale.
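As an illustration of that kind of query, a depth-weighted recommendation lookup in Cypher might look roughly like this (the labels, relationship types, and properties are invented, and the official Neo4j Python driver is assumed):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    QUERY = """
    MATCH p = (me:User {id: $user_id})-[:FRIEND*1..3]->(friend:User),
          (friend)-[:RECOMMENDED]->(r:Restaurant)
    RETURN r.name AS restaurant, min(length(p)) AS depth
    ORDER BY depth ASC
    """

    # Restaurants recommended by 1st/2nd/3rd-level friends, closest connections first.
    def recommendations(user_id):
        with driver.session() as session:
            return [record.data() for record in session.run(QUERY, user_id=user_id)]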
You will have to keep both datastores in sync. So in my case, when a user recommends a place in the web app (which inserts it into MySQL), you then need to turn around and make the corresponding insert into Neo4j. You probably want to do this asynchronously as well, so you'll need to set up a message queue with workers.
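A bare-bones version of that asynchronous sync might look like this (the queue name, Cypher, and connection details are all placeholders): the web app pushes a job right after the MySQL insert, and a worker drains the queue into Neo4j.

    import json
    import redis
    from neo4j import GraphDatabase

    r = redis.Redis()
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    # Web app: called right after the recommendation row is inserted into MySQL.
    def enqueue_recommendation(user_id, restaurant_id):
        r.rpush("neo4j:sync", json.dumps({"user_id": user_id, "restaurant_id": restaurant_id}))

    # Worker: blocks on the queue and mirrors each recommendation into Neo4j.
    def run_worker():
        while True:
            _, raw = r.brpop("neo4j:sync")
            job = json.loads(raw)
            with driver.session() as session:
                session.run(
                    "MATCH (u:User {id: $uid}), (rest:Restaurant {id: $rid}) "
                    "MERGE (u)-[:RECOMMENDED]->(rest)",
                    uid=job["user_id"], rid=job["restaurant_id"],
                )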