I am going to work on a distributed application in which data is streamed and analyzed. End users need access to the most recently streamed data as quickly as possible, and I also need to keep a backup of both the raw and the processed data.
My initial idea is as follows:
1) Redis as a cache to hold the latest entries.
2) MySQL for durable storage of the data.
3) Hadoop/HBase as a convenient way of storing the data for analysis.
What do you think of such a setup? Would you recommend anything else?
Thanks!
I think a combination of Spark and Cassandra would be an excellent way to go. Cassandra can easily handle the data throughput and storage, and Spark provides lightning-fast analytics.
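For the "hold the last entries" part specifically, a capped Redis list is a common pattern: push the newest item, trim to a fixed length. A minimal sketch using the redis-py client (the key name and cap size are just placeholders):

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def record_entry(entry: dict, cap: int = 100) -> None:
    """Push the newest entry and trim so only the last `cap` entries remain."""
    pipe = r.pipeline()
    pipe.lpush("stream:latest", json.dumps(entry))
    pipe.ltrim("stream:latest", 0, cap - 1)  # keep indexes 0..cap-1
    pipe.execute()

def latest_entries(n: int = 10) -> list[dict]:
    """Return the n most recent entries, newest first."""
    return [json.loads(raw) for raw in r.lrange("stream:latest", 0, n - 1)]
```

The pipeline keeps the push and trim in one round trip, so the list never grows past the cap even under concurrent writers.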
So I'm building a project and we have fairly large data. My average JSON payload is about 20 KB; sometimes more, sometimes less, but it doesn't fluctuate a lot.
The thing is, I'm using Spring Boot + React on Microsoft Azure, and to render some data I use innerHTML (React's dangerouslySetInnerHTML). My question is: how can I calculate/decide when it is worth putting the data in a JSON file in storage and sending the link through REST, compared to having it as an entity in MySQL? I'm not sure if I'm making myself clear, but I'd appreciate some clarity. Thanks
This is hard to answer without knowing exactly how all the pieces fit together, access frequency, etc.
Aside from the typical "test them and see what works", here are some thoughts that may help you make a choice.
blob/table will be faster than the DB, 100%. I get double-digit millisecond responses from Azure storage almost always; you're talking about a round trip in under 100 ms for most items.
you cannot use query-type lookups. To use blob/table you'll need to know the exact URL: two keys (partition/row), or at least one key (partition) if you want to get more than a single record. This gives super fast access (see the sketch below); if you're going to need SQL-type lookups, stick with the DB.
Azure storage is way cheaper than the DB.
You need a good storage strategy for Azure storage: how do you plan on purging, archiving, cleaning up, etc.? There's no good way to say "all records from 2020" unless you also implement a tracking table. This is a good read on some patterns: https://learn.microsoft.com/en-us/azure/storage/tables/table-storage-design-patterns
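To make the key-based access concrete, here is a rough sketch using the azure-data-tables Python package; the connection string, table name, and key values are placeholders:

```python
from azure.data.tables import TableClient  # pip install azure-data-tables

# Placeholder connection string; use your storage account's real one.
client = TableClient.from_connection_string(
    conn_str="<your-connection-string>", table_name="documents"
)

# Point lookup: both keys known, a single fast round trip.
entity = client.get_entity(partition_key="user123", row_key="doc456")

# Partition scan: only the partition key known, returns every row in it.
rows = client.query_entities("PartitionKey eq 'user123'")
```

Anything beyond these two access shapes (joins, ad hoc filters on arbitrary columns) is where you want the DB instead.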
I really like Azure storage when it's doable. It's so often cheaper and faster, and with some tracking tables it's workable in even more scenarios.
Where it dies: reporting. It's hard to do reporting (to business- and enterprise-level expectations) with data in storage, unless you track it elsewhere.
Hope that helps a bit.
I'm being told this question is subjective, but hey ho.
Am I best storing user activity in a table in a MySQL database, or in an XML file? The aim is for the data to be printed on each user's account page.
I'm worried that I will either end up with a huge/slow database or many, many XML files on the server (one for each user).
Thanks
Use a DB of some sort. Files may have issues regarding I/O, locking, concurrent access and so on.
If you do use files, prefer JSON over XML.
For an RDBMS, MySQL is fine.
I would suggest using a NoSQL store; my choice would be Redis.
Store it in a table. If you're storing billions of records you'll want to investigate partitioning or sharding, but those are problems you should tackle if and only if you will be hitting limits.
Test any design you have by simulating enough user activity to represent a year or two's worth of vigorous use. If it holds up, you're okay; if not, you'll have specific problems to address.
Remember that in tables of this sort indexes are important for retrieval speed, but too many indexes can slow down inserts. There's a balance between too much and too little indexing that you'll have to find, as in the sketch below.
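As a concrete illustration of that balance, here is a sketch of a plausible activity table with a single composite index covering the account-page query (table and column names are invented, not a prescription):

```python
import MySQLdb  # pip install mysqlclient

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="myapp")
cur = conn.cursor()

# One composite index serves "recent activity for this user"; resist adding
# an index per column, since every extra index slows each insert.
cur.execute("""
    CREATE TABLE IF NOT EXISTS user_activity (
        id BIGINT AUTO_INCREMENT PRIMARY KEY,
        user_id INT NOT NULL,
        activity VARCHAR(255) NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        INDEX idx_user_time (user_id, created_at)
    )
""")
conn.commit()
```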
XML files are often extremely expensive to append to unless you do something like what Adium did with their reverse XML parser built to append to XML logs efficiently.
I suggest keeping it in the DB:
1) It is much easier to maintain a database table for log information than separate log files, and it puts less load on the server.
2) With an RDBMS you can query a user's log history, which would be hard with XML files (see the example below).
3) Proper indexing will help with faster data retrieval.
4) XML read/write costs more I/O operations.
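For instance, the per-user history lookup from point 2 is a one-liner in SQL, continuing from the connection and hypothetical table in the sketch above (user_id is a placeholder):

```python
# Fetch the 50 most recent activities for one user; with per-user XML files
# the equivalent would mean opening and parsing a whole file on every request.
cur.execute(
    "SELECT activity, created_at FROM user_activity "
    "WHERE user_id = %s ORDER BY created_at DESC LIMIT 50",
    (user_id,),
)
rows = cur.fetchall()
```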
I am developing a live chat application using Node.js, Socket.IO and a JSON file. I am using the JSON file to read and write the chat data. Now I am stuck on one issue: when I do stress testing, i.e. pushing continuous messages into the JSON file, the JSON format becomes invalid and my application crashes. Although I am using forever.js, which should keep the application up, it still crashes.
Does anybody have an idea about this?
Thanks in advance for any help.
It is highly recommended that you reconsider your approach to persisting data to disk.
Among other things, one really big issue is that you will likely experience data loss. If we both read the file at the same time - {"foo":"bar"} - and we each make a change, and you save yours before I save mine, my change will overwrite yours: I started from the same contents you did, and I never re-read the file after you saved.
What you are possibly seeing now, with an append-only approach, is that we're both adding bits and pieces without regard to valid JSON structure (e.g. {"fo"bao":r":"ba"for"o"} from {"foo":"bar"} x 2).
Disk I/O is also actually pretty slow, even with an SSD. Memory is where it's at.
As recommended, you may want to consider MongoDB, MySQL, or similar. This may be a decent use case for Couchbase, which is an in-memory key/value store based on memcached that persists to disk as soon as possible. It is extremely JSON-friendly (it is largely built around JSON), offers great map/reduce support for querying data, is easy to scale across multiple servers, and has a Node.js module.
This would let you migrate your existing data storage routine to a database very easily. It also provides CAS (check-and-set) support, which protects you from the data loss scenarios outlined earlier.
At minimum, though, you could keep an in-memory object that you save to disk every so often to prevent permanent data loss. That only works well with a single server, however, and then you're back to likely needing a database.
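A rough illustration of that stopgap, sketched in Python for brevity (the same idea carries over to Node): hold the state in memory, and write it out atomically through a temp file so a crash mid-write never leaves invalid JSON on disk.

```python
import json
import os
import tempfile

messages: list[dict] = []  # single in-memory source of truth

def append_message(msg: dict, path: str = "chat.json") -> None:
    messages.append(msg)
    # Write to a temp file first, then atomically swap it into place,
    # so readers never observe a half-written (invalid) JSON file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(messages, f)
    os.replace(tmp, path)
```

This fixes the corruption, but not the multi-process lost-update problem described above; that is what the database (or CAS) buys you.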
I read somewhere that it is better to use Redis as a cache server, because Redis holds its data in memory; so if you are going to save lots of data, Redis is not a good choice. Redis is good for keeping temporary data. Now my question is:
1. Where do the rest of the databases (especially Neo4j and SQL Server) save data?
Don't they save data in memory?
If not, where do they save it?
If yes, why do we use them for saving lots of data?
2. "It is better to save indices/relationships in Neo4j and data in MySQL, and retrieve the index from Neo4j and then take the data related to the index from MySQL" (I read this somewhere). Is this because Neo4j has the same problem that Redis does?
Neo4j and SQL Server both store data on the file system. However, both also implement caching strategies. I am not an expert on the caching in these databases, but usually you can expect recently accessed data to be cached and data that has not been accessed for a while to fall out of the cache. If the DB needs something that is in the cache, it can avoid hitting the file system. Neo4j saves data in a subfolder called "data" by default. This link may help you find the location of a SQL Server database: http://technet.microsoft.com/en-us/library/dd206993.aspx
This will depend a lot on your specific use case and the required performance characteristics. My gut feeling is to put the data in one or the other based on some initial performance tests, and to split the data up only if it solves a specific problem.
This is more of a concept/database-architecture question. In order to maintain data consistency, instead of a NoSQL data store I'm just storing JSON objects as strings/text in MySQL. So a MySQL row will look like this:
ID, TIME_STAMP, DATA
I'll store the JSON data in the DATA field. I won't be updating any rows; instead I'll add new rows with the current timestamp. So when I want the latest data, I just fetch the row with max(TIME_STAMP). I'm using Tornado with the Python MySQLdb driver as my primary backend application.
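Roughly, that pattern looks like this (simplified sketch; the snapshots table name is invented, and error handling is omitted):

```python
import json
import MySQLdb  # the MySQLdb driver mentioned above

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="myapp")
cur = conn.cursor()

def save(data: dict) -> None:
    # Never update; always append a new row with the current timestamp.
    cur.execute(
        "INSERT INTO snapshots (TIME_STAMP, DATA) VALUES (NOW(), %s)",
        (json.dumps(data),),
    )
    conn.commit()

def latest() -> dict:
    # Equivalent to fetching the row with max(TIME_STAMP).
    cur.execute("SELECT DATA FROM snapshots ORDER BY TIME_STAMP DESC LIMIT 1")
    (raw,) = cur.fetchone()
    return json.loads(raw)
```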
I find this approach very straightforward and less prone to errors. The JSON objects are fairly simple and not heavily nested.
Is this approach fundamentally wrong? Are there any issues with storing JSON data as text in MySQL, or should I use file-system-based storage such as HDFS? Please let me know.
MySQL, as you probably know, is a relational database manager. It is designed for data that is related through keys, forming relations which can then be used for complex retrieval. Your method will technically work (and be quite fast), but based on what I've seen so far it will probably considerably impair your ability to leverage the technology you're using, should the complexity of your scope expand.
I would recommend a database like Redis or MongoDB, as they are designed for non-relational document/key-value storage rather than relational architectures.
That said, if you find the approach works fine for what you're building, just go ahead. You might face some blockers ahead if you want to add complexity to your solution, but either way you'll learn something new! Good luck!
Pradeeb, to help answer your question, you need to analyze your use case. What kind of data are you storing? For me, this would be the deciding factor: every technology has a specific use case where it excels.
I think it is safe to assume you use JSON because your data structures need to be very flexible documents, compared to a traditional relational DB. There are certain data stores that natively support such structures, such as MongoDB (which calls it "binary JSON", or BSON), as Phil pointed out. This would give you improved storage and/or improved search capabilities. Again, the utility depends entirely on your use case.
If you are looking for something like a job queue, horizontal scalability is not an issue, and you just need fast access to the latest data, you could use Redis, an in-memory key/value store that has a hash (associative array) data type and lists for this kind of thing; see the sketch below. Alternatively, since you mentioned HDFS and horizontal scalability may very well be an issue, I can recommend queue systems like Apache ActiveMQ or RabbitMQ.
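A sketch of that Redis idea, with invented key names: a list acts as the queue, and a hash gives O(1) access to the latest payload per source.

```python
import json
import redis  # pip install redis

r = redis.Redis()

# Producer: enqueue a job and remember the latest payload per source.
job = {"source": "sensor-1", "value": 42}
r.lpush("jobs", json.dumps(job))
r.hset("latest", job["source"], json.dumps(job))

# Consumer: block until a job is available, then process it.
_, raw = r.brpop("jobs")
print(json.loads(raw))

# Anyone can grab the most recent value for a source without scanning.
print(json.loads(r.hget("latest", "sensor-1")))
```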
Lastly, if you are writing heavily and you are not client-limited but your data storage is your bottleneck, look into distributed, flexible-schema data stores like HBase or Cassandra. They offer flexible data schemas, are heavily write-optimized, and data can be appended and remains in chronological order, so you can fetch the newest data efficiently.
Hope that helps.
This is not a problem. You could also use the InnoDB memcached plugin available in modern MySQL, which might be perfect here, although I have never tried it.
Another approach is to use memcached as a cache: write everything to both memcached and MySQL, and when reading, try memcached first; if the key does not exist, read from MySQL and repopulate the cache. This is a common technique for reducing database bottlenecks.
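A minimal sketch of that read-through pattern, assuming the pymemcache client and a hypothetical get_from_mysql helper you would supply:

```python
from pymemcache.client.base import Client  # pip install pymemcache

mc = Client(("localhost", 11211))

def get_user(user_id: int) -> bytes:
    key = f"user:{user_id}"
    value = mc.get(key)
    if value is None:
        # Cache miss: fall back to the database, then repopulate the cache.
        value = get_from_mysql(user_id)  # hypothetical DB lookup you implement
        mc.set(key, value, expire=300)   # cache for 5 minutes
    return value
```

The expiry keeps stale entries from living forever; on writes you would update MySQL and either set or delete the corresponding cache key.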