Is storing and querying JSON data a good use case for Apache Spark?

Architecture - a brief description of the architecture: I am working on an answering engine where people submit a query and wait for an answer (something different from a search engine). The back-end looks for an automated answer, and if it doesn't find one directly it sends a snippet to the interface along with a confidence score. Whatever snippets and answers get generated are stored in a MongoDB collection. Each query asked gets a unique URL and snippet ID; I save these IDs in MongoDB, and whenever a user lands on the URL from another search engine, a query is made to fetch the data from the MongoDB collection. At the start this architecture ran well, but now that the data is increasing I am seriously in need of a better architecture.
Should I store the data in Hadoop and write a MapReduce program to fetch it?
Should I use Spark (and preferably Shark)?
Should I stick with MongoDB?
Should I go for HBase or Hive?

You are confusing architecture and technology selection. Though they are related, these are separate notions. (You can find a couple of articles I wrote about this in the past here and here, etc.)
Anyway, to your question - generally speaking, JSON is an expensive format that needs re-parsing every time you fetch it (unless you always want it as a "blob"). There are several other formats, like Avro, Google Protocol Buffers, ORC, Parquet, etc., that support schema evolution but also use binary encodings that are more efficient and faster to access.
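For instance, here is a minimal Python sketch of the difference being pointed at (assuming pyarrow is installed; the file name and fields are made up): JSON must be re-parsed wholesale on every fetch, while a columnar binary format like Parquet lets you read just the typed columns you need.

    import json
    import pyarrow as pa                 # assumption: pyarrow is available
    import pyarrow.parquet as pq

    records = [{"query_id": i, "answer": "...", "confidence": 0.9}
               for i in range(1000)]

    # JSON: the whole blob is re-parsed on every fetch.
    blob = json.dumps(records)
    parsed = json.loads(blob)            # pays the full parse cost each time

    # Parquet: typed, compressed, and column-addressable.
    pq.write_table(pa.Table.from_pylist(records), "answers.parquet")
    confidences = pq.read_table("answers.parquet", columns=["confidence"])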
Regarding the choice of persistent store - that highly depends on your intended use and anticipated loads. Note that some of the options you've mentioned are aimed at completely different usages (e.g. HBase, which you can use for real-time queries, vs. Hive, which has a rich analytical interface (via SQL) but is batch oriented).


Best way to store/edit GeoJSON working on WordPress

I want to create a website that will have an AJAX search. It will fetch the data either from a JSON file or from a database. I do not know which technology to use to store the data: a JSON file or MySQL. Based on some quick research, there are going to be about 60,000 entries. So if I use JSON the file size will be around 30-50 MB, and if I use MySQL it will have 60,000 rows. What are the limitations of each technique and what are the benefits?
Thank you
I can't seem to comment since I need 50 rep. for commenting, so I will give it as an answer:
MySQL will be preferable for many reasons, not the least of which being you do not want your web server process to have write access to the filesystem (except for possibly logging) because that is an easy way to get exploited.
Also, the MySQL team has put a lot of engineering effort into things such as replication, concurrent access to data, ACID compliance, and data integrity.
Imagine if, for instance, you add a new field that is required in whatever data structure you are storing. If you store in JSON files, you will have to have some process that opens each file, adds the field, then saves it. Compare this to the difficulty of using ALTER TABLE with a DEFAULT value for the field. (A bit of a contrived example, but how many hacks do you want to leave in your codebase for dealing with old data?) To be really blunt about it: MySQL is a database while JSON is not, so the correct answer is MySQL, without hesitation. JSON is just a language, and barely even that. JSON was never designed to handle anything like concurrent connections or any sort of data manipulation, since its only function is to represent data, not to manage it.
So go with MySQL for storing the data. Then you should use some programming language to read that database, and send that information as JSON, rather than actually storing anything in JSON.
If you store the data in files, whether in JSON format or anything else, you will have all sorts of problems that people stopped worrying about once databases started being used for the same thing. Size limitations, locks, you name it. It's good enough when you have one user, but the moment you add more of them, you'll start solving so many problems that you would probably end up writing an entire database engine just to handle the files for you, when all along you could have simply used an actual database. Do note: don't take my word for granted. I am not an expert in this field, so let others post their answers and then judge by those. I think enough people here on Stack Overflow have more experience than I do, haha. These are NOT entirely my words, but I have kept the parts that were true from what I knew and added some of my own knowledge. :) Have a great time making your website.
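To illustrate the recommended split (data lives in MySQL, JSON is only the wire format), here is a minimal sketch; Flask, the connection details, and the entries table are assumptions for illustration, not anything from the question.

    import json
    import mysql.connector
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route("/search")
    def search():
        term = request.args.get("q", "")
        conn = mysql.connector.connect(user="app", password="...",
                                       host="localhost", database="site")
        cur = conn.cursor(dictionary=True)
        # The database does the filtering and ordering; the query is
        # parameterized to avoid SQL injection.
        cur.execute("SELECT id, title FROM entries WHERE title LIKE %s "
                    "ORDER BY title LIMIT 50", ("%" + term + "%",))
        rows = cur.fetchall()
        conn.close()
        return Response(json.dumps(rows), mimetype="application/json")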
For MySQL: you can select specific rows or specific columns using queries, filter data based on a key, and order alphabetically.
Downside: you need a REST API to fetch the data because it can't be accessed directly; you have to use PHP or Python or whatever programming language for the backend code.
For a JSON file: benefit: no backend code; it can be accessed directly using a GET HTTP request.
Downside: no filtering, ordering, or queries of any kind; you have to do all of that manually.

Best IoT Database?

I have many IoT devices currently sending data to a MySQL database.
I want to port it to some other database, which should be open source and provide me with:
JSON support
Scalability
Flexibility to add multiple columns automatically as per payload
Python and PHP Support
Extremely Fast Read, Write
Ability to export at least 6 months of data in CSV format
Please reply soon.
Any help will be appreciated.
Thanks
Shaping your database based on the input data is a mistake. Suppose tomorrow your data arrives as CSV or XML, or in a slightly different format. Design your database based on your abstract data model, normalize it, and then map the incoming data onto your model. Shape your structure based on what input you have and what output you plan to get; if you only ever retrieve the same content as the input, storing the data in files would be sufficient and you wouldn't need a database at all.
Also, you don't want to store "raw" records in the database. Even if your database can compose a data record out of the raw elements at run time, you cannot run a selection based on a particular extracted element without visiting all the records.
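As a rough illustration of that point, here is a sketch that extracts the queryable fields of a hypothetical IoT payload into real columns at insert time (all table and field names are made up):

    import json
    import mysql.connector

    def store_reading(conn, raw):
        msg = json.loads(raw)
        cur = conn.cursor()
        # The fields you query on become real, indexable columns; the raw
        # payload is kept only as a fallback blob.
        cur.execute(
            "INSERT INTO readings (device_id, msg_type, ts, value, raw_json) "
            "VALUES (%s, %s, %s, %s, %s)",
            (msg["device_id"], msg["type"], msg["timestamp"], msg["value"], raw))
        conn.commit()

A selection such as WHERE device_id = %s can then use an index instead of visiting and parsing every stored payload.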
Most databases allow you to connect from anywhere (there is no such thing as better support for PostgreSQL in Java as compared to Python, though the quality and level of standardization of drivers may vary). The question is what features your driver must support. For example, you may require support for bulk import (so you don't issue large INSERT sets to the database).
What you should actually be looking for is:
scalability: can your database grow with your data? Would the DB benefit from adding additional CPUs (MySQL in particular doesn't for large queries)? Can you shard your database across multiple instances? (MySQL again fails to handle that.)
does your model look like a snowflake? If yes, you may consider NoSQL; otherwise stay away from it. If you manage to model it as a snowflake (and this means you are open to compromises) you may use anything from Lucene-based search products to Mongo, Cassandra, etc. The fact that you have time series doesn't by itself qualify you for NoSQL. For example, you may have 10K devices issuing 5K message types. Specific data is redundantly recorded at the device level and at the message-type level. In that case, because of the n:m relation, you don't have a snowflake anymore.
why do you store the data? What queries are you going to issue?
Why do you want to move away from MySQL? It is open source and can meet all of the criteria you listed above. This is a very subjective question, so it's hard to give a good answer, but MySQL is not a bad option.

Store JSON data as Text in MySQL

This is more of a concept/database-architecture question. In order to maintain data consistency, instead of a NoSQL data store I'm simply storing JSON objects as strings/TEXT in MySQL. So a MySQL row will look like this:
ID, TIME_STAMP, DATA
I'll store JSON data in the DATA field. I won't be updating any rows; instead I'll add new rows with the current timestamp. So when I want the latest data I just fetch the row with max(timestamp). I'm using Tornado with the Python MySQLdb driver as my primary backend application.
I find this approach very straightforward and less prone to errors. The JSON objects are fairly simple and are not heavily nested.
Is this approach fundamentally wrong? Are there any issues with storing JSON data as TEXT in MySQL, or should I use a file-system-based storage such as HDFS? Please let me know.
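For concreteness, a minimal sketch of the append-only pattern the question describes, using the MySQLdb driver the asker mentions (table name and connection details are hypothetical):

    import json
    import time
    import MySQLdb

    conn = MySQLdb.connect(user="app", passwd="...", db="appdata")

    def save_state(obj):
        # Never update in place: always append a new row with a fresh timestamp.
        cur = conn.cursor()
        cur.execute("INSERT INTO states (time_stamp, data) VALUES (%s, %s)",
                    (int(time.time()), json.dumps(obj)))
        conn.commit()

    def latest_state():
        # The "current" state is simply the newest row.
        cur = conn.cursor()
        cur.execute("SELECT data FROM states ORDER BY time_stamp DESC LIMIT 1")
        row = cur.fetchone()
        return json.loads(row[0]) if row else None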
MySQL, as you probably know, is a relational database manager. It is designed to be used in a way where data is related through keys, forming relations which can then be used to yield complex retrieval of data. Your method will technically work (and be quite fast), but will probably (based on what I've seen so far) considerably impair your ability to leverage the technology you're using, should you expand the complexity of your scope!
I would recommend you use a database like Redis or MongoDB as they are designed for document storage rather than relational architectures.
That said, if you find the approach works fine for what you're building, just go ahead. You might face some blockers up ahead if you want to add complexity to your solution but either way, you'll learn something new! Good luck!
Pradeeb, to help answer your question you need to analyze your use case. What kind of data are you storing? For me, this would be the deciding factor: every technology has a specific use case where it excels.
I think it is safe to assume that you use JSON because your data structure needs to be very flexible, compared to a traditional relational DB. There are certain data stores that natively support such data structures, such as MongoDB (they call it "binary JSON", or BSON), as Phil pointed out. This would give you improved storage and/or improved search capabilities. Again, the utility depends entirely on your use case.
If you are looking for something like a job queue, horizontal scalability is not an issue, and you just need fast access to the latest data, you could use Redis, an in-memory key-value store that has a hash (associative array) data type and lists for this kind of thing. Alternatively, since you mentioned HDFS and horizontal scalability may very well be an issue, I can recommend looking at queue systems like Apache ActiveMQ or RabbitMQ.
Lastly, if you are writing heavily, and you are not client-limited but your data storage is your bottleneck, look into distributed, flexible-schema data stores like HBase or Cassandra. They offer flexible data schemas, are heavily write-optimized, and data can be appended and remains in chronological order, so you can fetch the newest data efficiently.
Hope that helps.
This is not a problem. You can also use the memcached storage engine in modern MySQL, which would be perfect, although I have never tried it.
Another approach is to use memcached as a cache. Write everything to both memcached and MySQL. When you go to read data, try reading from memcached first; if it's not there, read from MySQL. This is a common technique for reducing database bottlenecks.
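A minimal sketch of that cache-aside technique, assuming the python-memcached client and a hypothetical records table:

    import json
    import memcache                      # python-memcached client
    import MySQLdb

    mc = memcache.Client(["127.0.0.1:11211"])
    conn = MySQLdb.connect(user="app", passwd="...", db="appdata")

    def get_record(record_id):
        key = "record:%d" % record_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)    # cache hit: MySQL never touched
        cur = conn.cursor()
        cur.execute("SELECT data FROM records WHERE id = %s", (record_id,))
        row = cur.fetchone()
        if row is None:
            return None
        mc.set(key, row[0], time=300)    # populate the cache for next time
        return json.loads(row[0])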

Using MongoDB vs MySQL with lots of JSON fields?

There is a microblogging type of application. The two main database stores we have zeroed in on are MySQL and MongoDB.
I am planning to denormalize a lot of data, i.e., a vote on a post is stored in a voting table, and a count is also incremented in the main posts table. There are other actions involving the post too (e.g. like, vote down).
If I use MySQL, some of the data is better suited to JSON than a fixed schema, for faster lookups.
E.g.
POST_ID   | activity_data
213423424 | {'likes':  {'count': 213, 'recent_likers':  ['john', 'jack', ...fixed list of recent N users]},
             'smiles': {'count': 345, 'recent_smilers': ['mary', 'jack', ...fixed list of recent N users]}}
There are other components of the application as well, where usage of JSON is being proposed.
So, to update a JSON field, the sequence is:
Read the JSON in a Python script.
Update the JSON.
Store the JSON back into MySQL.
In MongoDB this would have been a single operation, using atomic operators like $push, $inc, $pull, etc. MongoDB's document structure also suits my data well.
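For example, the like-counter update from the activity_data example above can be expressed as one atomic operation; this pymongo sketch (database, collection, and field names assumed) uses $inc plus $push with $slice to keep the fixed list of recent N likers:

    from pymongo import MongoClient

    posts = MongoClient().microblog.posts   # hypothetical database/collection

    def record_like(post_id, username, n_recent=5):
        # One round trip, applied atomically on the server: bump the
        # counter and keep only the last n_recent likers in the list.
        posts.update_one(
            {"_id": post_id},
            {"$inc": {"likes.count": 1},
             "$push": {"likes.recent_likers":
                       {"$each": [username], "$slice": -n_recent}}})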
My considerations while choosing the data store.
Regarding MySQL:
Stable and familiar.
Backup and restore is easy.
Some future schema changes can be avoided using some fields as schemaless JSON.
May have to add a memcached layer early on.
JSON blobs will be static in some tables like main posts, but will be updated a lot in other tables like post votes and likes.
Regarding MongoDB:
Better suited to storing schemaless data as documents.
Caching might be avoided till a later stage.
Sometimes the app may become write-intensive; MongoDB can perform better at those points, where unsafe writes are not an issue.
Not sure about stability and reliability.
Not sure about how easy is it to backup and restore.
Questions:
Should we choose MongoDB if half of the data is schemaless and would be stored as JSON if we used MySQL?
Some of the data, like main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on importance of data and write intensiveness, correct?
How easy is it to monitor, back up, and restore MongoDB as compared to MySQL? We need to plan periodic backups (say daily) and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
Stability, backup, snapshots, restoring, wider adoption, i.e. database durability, are the reasons pointing me towards MySQL as RDBMS+NoSQL, even though a NoSQL document store could serve my purpose better.
Please focus your views on the choice between MySQL and MongoDB considering the database design I have in mind. I know there could be better ways to plan database design with either RDBMS or MongoDB documents. But that is not the current focus of my question.
UPDATE: From MySQL 5.7 onwards, MySQL supports a rich native JSON datatype which provides data flexibility as well as rich JSON querying.
https://dev.mysql.com/doc/refman/5.7/en/json.html
So, to directly answer the questions...
Should we choose MongoDB if half of the data is schemaless and would be stored as JSON if using MySQL?
Schemaless storage is certainly a compelling reason to go with MongoDB, but as you've pointed out, it's fairly easy to store JSON in an RDBMS as well. The power behind MongoDB is in the rich queries against schemaless storage.
If I might point out a small flaw in the illustration about updating a JSON field, it's not simply a matter of getting the current value, updating the document and then pushing it back to the database. The process must all be wrapped in a transaction. Transactions tend to be fairly straightforward, until you start denormalizing your database. Then something as simple as recording an upvote can lock tables all over your schema.
With MongoDB, there are no transactions. But operations can almost always be structured in a way that allow for atomic updates. This usually involves some dramatic shifts from the SQL paradigms, but in my opinion they're fairly obvious once you stop trying to force objects into tables. At the very least, lots of other folks have run into the same problems you'll be facing, and the Mongo community tends to be fairly open and vocal about the challenges they've overcome.
Some of the data, like main posts, is critical, so it will be saved using safe writes; the counters etc. will be saved using unsafe writes. Is this policy, based on importance of data and write intensiveness, correct?
By "safe writes" I assume you mean the option to turn on an automatic "getLastError()" after every write. We have a very thin wrapper over a DBCollection that allows us fine grained control over when getLastError() is called. However, our policy is not based on how "important" data is, but rather whether the code following the query is expecting any modifications to be immediately visible in the following reads.
Generally speaking, this is still a poor indicator, and we have instead migrated to findAndModify() for the same behavior. On the occasion where we still explicitly call getLastError() it is when the database is likely to reject a write, such as when we insert() with an _id that may be a duplicate.
How easy is it to monitor, back up, and restore MongoDB as compared to MySQL? We need to plan periodic backups (say daily) and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
I'm afraid I can't speak to whether our backup/restore policy is effective, as we have not had to restore yet. We're following the MongoDB recommendations for backing up; Mark Hillick has done a great job of summarizing those. We're using replica sets, and we have migrated MongoDB versions as well as introduced new replica members. So far we've had no downtime, so I'm not sure I can speak well to this point.
Stability, backup, snapshots, restoring, wider adoption, i.e. database durability, are the reasons pointing me towards MySQL as RDBMS+NoSQL, even though a NoSQL document store could serve my purpose better.
So, in my experience, MongoDB offers storage of schemaless data with a set of query primitives rich enough that transactions can often be replaced by atomic operations. It's been tough to unlearn 10+ years worth of SQL experience, but every problem I've encountered has been addressed by the community or 10gen directly. We have not lost data or had any downtime that I can recall.
To put it simply, MongoDB is hands down the best data storage ecosystem I have ever used in terms of querying, maintenance, scalability, and reliability. Unless I had an application that was so clearly relational that I could not in good conscience use anything other than SQL, I would make every effort to use MongoDB.
I don't work for 10gen, but I'm very grateful for the folks who do.
I'm not going to comment on the comparisons (I work for 10gen and don't feel it's appropriate for me to do so), however, I will answer the specific MongoDB questions so that you can better make your decision.
Back-Up
Documentation here is very thorough, covering many aspects:
Block-Level Methods (LVM makes it very easy and quite a lot of folk do this)
With/Without Journaling
EBS Snapshots
General Snapshots
Replication (technically not back-up, however, a lot of folk use replica sets for their redundancy and back-up - not recommending this but it is done)
Until recently there was no MongoDB equivalent of mylvmbackup, but a nice guy wrote one. :) In his words:
Early days so far: it's just a glorified shell script and needs way more error checking. But it already works for me and I figured I'd share the joy. Bug reports, patches & suggestions welcome.
Get yourself a copy from here.
Restores
Formats, etc.: mongodump is completely documented here and mongorestore is documented here.
mongodump does not contain the indexes, but it does contain the system.indexes collection, so mongorestore can rebuild the indexes when you restore the BSON file. The BSON file is the actual data, whereas mongoexport/mongoimport are not type-safe, so the data could be anything (technically speaking). :)
Monitoring
Documented here.
I like Cacti, but AFAIK the Cacti templates have not kept up with the changes in MongoDB and so rely on old syntax, so I believe there are issues post-2.0.4.
Nagios works well, but it's Nagios, so you either love it or hate it. A lot of folk use Nagios and it seems to provide them with great visibility.
I've heard of some folk looking at Zabbix, but I've never used it, so I can't comment.
Additionally, you can use MMS, which is free and hosted externally. Your MongoDB instances run an agent, and those agents communicate (using Python code) over HTTPS to mms.10gen.com. We use MMS to view all performance statistics on the MongoDB instances, and it is very beneficial for a high-level, wide view as well as offering the ability to drill down. It's simple to install, and you don't have to run any hardware for it. Many customers run it, and some complement it with Cacti/Nagios.
Help information on MMS can be found here (it's a very detailed, inclusive document).
One of the disadvantages of a MySQL solution with stored JSON is that you will not be able to search the JSON data efficiently. If you store it all in MongoDB, you can create indexes and/or queries on all of your data, including the JSON.
Mongo's writes work very well, and really the only thing you lose vs. MySQL is transaction support, and thus the ability to roll back multipart saves. However, if you are able to commit your changes in atomic operations, then there isn't a data-safety issue. If you are replicated, Mongo provides an "eventually consistent" promise such that the slaves will eventually mirror the master.
MongoDB doesn't provide native enforcement or cascading of certain DB constructs such as foreign keys, so you have to manage those yourself (either through composition, which is one of Mongo's strengths, or through the use of DBRefs).
If you really need transaction support and robust 'safe' writes, yet still desire the flexibility provided by NoSQL, you might consider a hybrid solution. This would allow you to use MySQL as your main post store and MongoDB as your 'schemaless' store. Here is a link to a document discussing hybrid Mongo/RDBMS solutions: http://www.10gen.com/events/hybrid-applications The article is from 10gen's site, but you can find other examples with a quick Google search.
Update 5/28/2019
There have been a number of changes to both MySQL and MongoDB since this answer was posted, so the pros/cons between them have become even blurrier. This update doesn't really help with the original question, but I am adding it to make sure any new readers have more recent information.
MongoDB now supports transactions: https://docs.mongodb.com/manual/core/transactions/
MySQL now supports indexing and searching JSON fields:
https://dev.mysql.com/doc/refman/5.7/en/json.html
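A brief sketch of what that native support looks like in practice; the table and column names are illustrative, though the generated-column-plus-index pattern follows the linked 5.7 documentation:

    import mysql.connector               # hypothetical connection details below

    conn = mysql.connector.connect(user="app", password="...", database="blog")
    cur = conn.cursor()

    # A JSON column plus a STORED generated column, so lookups on a field
    # inside the JSON can use an ordinary index.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS posts (
            id BIGINT PRIMARY KEY,
            activity JSON,
            like_count INT GENERATED ALWAYS AS (activity->'$.likes.count') STORED,
            INDEX (like_count)
        )
    """)

    # The indexed generated column answers queries about the JSON contents.
    cur.execute("SELECT id FROM posts WHERE like_count > 100")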

Which database suits my application, MySQL or MongoDB? Using Node.js, Backbone, Now.js

I want to make an application like docs.google.com (without its API, completely on my own server) using:
frontend: Backbone
backend: Node
Which database do you think is better, MySQL or MongoDB? It should support good scalability.
I am familiar with MySQL with PHP, and I will be happy if the answer is MySQL.
But in many tutorials I saw, they used MongoDB. Why did they use MongoDB and not MySQL?
What should I use?
Can anyone give me a link to some sample application (with source) built using Backbone, Node, and MySQL (or Mongo), or at least an app with Node and MySQL?
Thanks
With MongoDB, you can just store JSON objects and retrieve them fully-formed, so you don't really need an ORM layer and you spend less CPU time translating your data back-and-forth. The developers behind MongoDB have also made horizontally scaling the database a higher priority and let you run arbitrary Javascript code to pre-process data on the DB side (allowing map-reduce style filtering of data).
But you give something up for these gains: you can't join records. The JSON structure you store could only be reproduced via joins in SQL, but in MongoDB you only have that one structure to your data, while in SQL you can query differently and get your data represented in alternate ways much more easily. So if you need to do a lot of analytics on your database, MongoDB will make that harder.
The query language in MongoDB is "rougher", in my opinion, than SQL's, partly because it's less familiar, and partly because the querying features "feel" haphazardly put together: partially to make it valid JSON, and partially because there are literally a couple of ways of doing the same thing, some of which are older ways that aren't as useful or as regularly formatted as the others. And there's the added complexity of the array and sub-object types over SQL's simple row-based design, so the syntax has to be able to handle querying for arrays that contain some of the values you defined, contain all of the values you defined, contain only the values you defined, and contain none of the values you defined. The same distinctions apply to object keys and their values, and this makes the query syntax harder to grasp. (And while I can see the need for the edge cases, the $where query parameter, which takes a JavaScript function that is run on every record of the data and returns a boolean, is a siren song: you can easily define which objects to return or not, but it has to run on every record in the database, and no indexes can be used.)
So, it depends on what you want to do, but since you say it's for a Google Docs clone, you probably don't care about any representation but the document representation, itself, and you're probably only going to query based on document ID, document name, or the owner's ID/name, nothing too complex in the querying.
Then, I'd say being able to take the JSON representation of the document your user is editing, and just throw it into the database and have it automatically index the important fields, is worth the price of learning a new database.
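A hedged sketch of that usage with pymongo (all names invented): store the editor's JSON as-is and index the few fields you query by.

    from pymongo import ASCENDING, MongoClient

    docs = MongoClient().docsapp.documents   # hypothetical names throughout

    # Index the handful of fields queries actually use.
    docs.create_index([("owner_id", ASCENDING)])
    docs.create_index([("name", ASCENDING)])

    def save(document):
        # `document` is the JSON structure the editor produces, stored as-is.
        docs.replace_one({"_id": document["_id"]}, document, upsert=True)

    def docs_for_owner(owner_id):
        return list(docs.find({"owner_id": owner_id}))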
I was also struggling with this choice, looking at the hype created by using MongoDB for tasks it was not built for. So here are my 2 cents:
Storing and retrieving hierarchical objects, which your documents probably are, is easier in MongoDB, as David says. It becomes more complicated if you want to store documents bigger than 16 MB, though - MongoDB's answer to that is GridFS.
Organising documents in folders and groups, and keeping track of which user owns which documents and who they have granted access to, is definitely easier with MySQL - you have the advantage of powerful SQL queries with joins etc., built-in EXPLAIN optimization, triggers, functions, stored procedures, and so on. MongoDB is nowhere near.
So what prevents you from using both: MySQL to organize the documents and MongoDB to store one collection of documents identified by ID (or several collections - one for each document type)? It seems to me the best choice, and using two databases in one application is not a problem, really.
MySQL will store users, groups, folders, and permissions - whatever you fancy - and for each document it will store a reference to the collection and the document ID (MongoDB has a special format for this - DBRefs). MongoDB will store the documents themselves in collections, if they are all less than 16 MB, or the previews and metadata of the documents in collections and the whole documents in GridFS.
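A rough sketch of that hybrid layout in Python (all names hypothetical; a plain collection-name-plus-id pair is stored instead of a formal DBRef, for simplicity):

    import MySQLdb
    from bson import ObjectId
    from pymongo import MongoClient

    mongo = MongoClient().docsapp
    sql = MySQLdb.connect(user="app", passwd="...", db="docsapp")

    def create_document(owner_id, folder_id, body):
        # The document body lives in MongoDB...
        doc_id = mongo.documents.insert_one(body).inserted_id
        # ...while MySQL keeps ownership/folder structure plus the reference.
        cur = sql.cursor()
        cur.execute("INSERT INTO documents (owner_id, folder_id, "
                    "mongo_collection, mongo_id) VALUES (%s, %s, %s, %s)",
                    (owner_id, folder_id, "documents", str(doc_id)))
        sql.commit()

    def load_document(mongo_id):
        return mongo.documents.find_one({"_id": ObjectId(mongo_id)})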
David provided a good answer. A few things to add to it.
MongoDB's flexible nature permits easy agile/iterative development.
MongoDB, like Node.js, is asynchronous in nature and works very well within asynchronous environments.
Mongoose is a good ODM (object document mapper) that makes working with MongoDB from Node.js feel very natural. Unlike ORMs, this is a very thin layer.
For Google Docs-like functionality, the flexibility and very rich data structures provided by MongoDB feel like a much better fit.
You can find some good example posts by searching for Mongoose, Node, and MongoDB.
Here's one that also uses Backbone.js and looks good: http://mattkopala.com/blog/2012/02/12/getting-started-with-nodejs/