Structure of MongoDB vs MySQL

As mentioned in the following article: http://www.couchbase.com/why-nosql/nosql-database
"When looking up data, the desired information needs to be collected from many tables (often hundreds in today’s enterprise applications) and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed on many tables."
The example of data in JSON format given there says that the "ease of efficiently distributing the resulting documents and read and write performance improvements make it an easy trade-off for web-based applications".
But what if I capture all my data in a single table in MySQL, as is done in MongoDB [in the linked example]? Would that performance be roughly equivalent to MongoDB [meaning extracting data from MySQL without JOINs]?

It all depends on the structure you require. The main point of splitting data into tables is being able to index pieces of data, which accelerates retrieval.
Another point is that the normalization a relational database offers ties you to a rigid structure. You can, of course, store JSON in MySQL, but the JSON document won't have its pieces indexed. If you want fast retrieval of a JSON document by its pieces, then you are looking at splitting it into parts.
If your data can change, meaning it doesn't require a fixed schema, then use Mongo.
If your data structure doesn't change, then I'd go with MySQL.
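To make that trade-off concrete, here is a minimal sketch (Python with the mysql-connector-python driver; the table names, columns, and credentials are invented for illustration) contrasting a single serialized column with split, indexed columns:

    import json
    import mysql.connector  # assumes the mysql-connector-python package and a local MySQL server

    conn = mysql.connector.connect(user="app", password="secret", database="demo")  # placeholder credentials
    cur = conn.cursor()

    # Variant 1: the whole record lives in one serialized column.
    # Filtering on a field inside the blob means scanning and parsing every row.
    cur.execute("CREATE TABLE IF NOT EXISTS user_blob (id INT PRIMARY KEY, doc TEXT)")
    cur.execute("REPLACE INTO user_blob VALUES (%s, %s)",
                (1, json.dumps({"name": "Ada", "city": "London"})))

    # Variant 2: the pieces are split into columns, so they can be indexed.
    cur.execute("""CREATE TABLE IF NOT EXISTS user_rel (
                       id   INT PRIMARY KEY,
                       name VARCHAR(100),
                       city VARCHAR(100),
                       INDEX idx_city (city)
                   )""")
    cur.execute("REPLACE INTO user_rel VALUES (%s, %s, %s)", (1, "Ada", "London"))
    conn.commit()

    # This lookup can use idx_city; the blob variant would need a full table scan.
    cur.execute("SELECT name FROM user_rel WHERE city = %s", ("London",))
    print(cur.fetchall())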

Related

Storing large JSON data in Postgres is infeasible, so what are the alternatives?

I have large JSON data, greater than 2 kB, in each record of my table, and currently it is stored in a JSONB field.
My tech stack is Django and Postgres.
I don't perform any updates/modifications on this JSON data, but I do need to read it, frequently and fast. However, because the JSON data is larger than 2 kB, Postgres splits it into chunks and puts it into the TOAST table, and hence the read process has become very slow.
So what are the alternatives? Should I use another database like MongoDB to store these large JSON data fields?
Note: I don't want to pull the keys out of this JSON and turn them into columns. This data comes from an API.
It is hard to answer specifically without knowing the details of your situation, but here are some things you may try:
1. Use Postgres 12 (stored) generated columns to maintain the fields or smaller JSON blobs that are commonly needed. This adds storage overhead, but frees you from having to maintain this duplication yourself.
2. Create indexes for any JSON fields you are querying (PostgreSQL allows you to create indexes on JSON expressions). A sketch of points 1 and 2 follows at the end of this answer.
3. Use a composite index, where the first field in the index is the field you are querying on, and the second field (or JSON expression) is the value you wish to retrieve. In this case PostgreSQL should retrieve the value from the index itself.
4. Similar to 1, create a materialised view which extracts the fields you need and allows you to query them quickly. You can add indexes to the materialised view too. This may be a good solution, as materialised views can be slow to update, but in your case your data doesn't update anyway.
5. Investigate why the TOAST reads are slow. I'm not sure what performance you are seeing, but if you really do need to pull back a lot of data then you are going to need fast data access whatever database you choose to go with.
Your mileage may vary with all of the above suggestions, especially as each will depend on your particular use case. (see the questions in my comment)
However, the overall idea is to use the tools that PostgreSQL provides to make your data quickly accessible. Yes, this may involve pulling the data out of its original JSON blob, but this doesn't need to be done manually; PostgreSQL provides some great tools for this.
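Here is the sketch of points 1 and 2 mentioned above (Python with psycopg2; the table, columns, JSON keys, and credentials are all invented for illustration):

    import psycopg2  # assumes the psycopg2 package and a local PostgreSQL 12+ server

    conn = psycopg2.connect(dbname="demo", user="app", password="secret")  # placeholder credentials
    cur = conn.cursor()

    # A JSONB payload column plus a stored generated column (point 1) that
    # pulls out a frequently read field, so reads of that field do not have
    # to detoast and parse the whole document.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS api_payload (
            id     bigserial PRIMARY KEY,
            doc    jsonb NOT NULL,
            status text GENERATED ALWAYS AS (doc ->> 'status') STORED
        )
    """)

    # An expression index (point 2) on a JSON field used in WHERE clauses.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS api_payload_customer_idx
        ON api_payload ((doc ->> 'customer_id'))
    """)
    conn.commit()

    # This query can be answered via the index and the generated column
    # instead of re-reading the full 2 kB+ JSON blob for every row.
    cur.execute("SELECT id, status FROM api_payload WHERE doc ->> 'customer_id' = %s", ("42",))
    print(cur.fetchall())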
If you just need to store and read this JSON object in full, without using the JSON structure in your WHERE clauses, what about simply storing the data as binary in a bytea column? https://www.postgresql.org/docs/current/datatype-binary.html
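A minimal sketch of that approach (Python with psycopg2; the table name, credentials, and payload are invented):

    import json
    import zlib

    import psycopg2  # assumes the psycopg2 package and a local PostgreSQL server

    conn = psycopg2.connect(dbname="demo", user="app", password="secret")  # placeholder credentials
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS raw_payload (id bigserial PRIMARY KEY, doc bytea)")

    payload = {"status": "ok", "items": list(range(100))}  # stand-in for the API response

    # Serialize (and optionally compress) client-side; Postgres just stores an opaque blob.
    cur.execute("INSERT INTO raw_payload (doc) VALUES (%s)",
                (psycopg2.Binary(zlib.compress(json.dumps(payload).encode())),))
    conn.commit()

    # Reading it back is a single column fetch followed by client-side decoding.
    cur.execute("SELECT doc FROM raw_payload ORDER BY id DESC LIMIT 1")
    blob = cur.fetchone()[0]
    print(json.loads(zlib.decompress(bytes(blob))))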

Best Practice for Storing Data from Multiple Sources for Machine Learning Purposes

Currently, I am pulling data from multiple sources and investigating different methods of machine learning to train models using these data sets. Moving forward, I want to come up with the best plan for data storage.
At the moment, I am using plain old CSVs. However, one reason I am motivated to switch is the existence of related fields in the data sets that all belong to the same object. For example, if we are storing data about multiple restaurants, I will number each restaurant and have multiple fields for it. More specifically, I will have fields in the header that are related, i.e. restaurant_1_name, restaurant_1_location, restaurant_2_name, restaurant_2_location... and so on. Furthermore, in particular cases, some data points will have a variable number of restaurants, so I will have to create null entries for many of the potential fields in the CSV. Moreover, to add to this variability, data from different sources will have additional fields and missing fields.
Due to the object-oriented nature of our data, I thought it might be better to consider another form of data storage. As an initial solution, JSON comes to mind, as it allows a variable number of attributes and the grouping of objects as lists of dictionaries. As a bonus, it is fairly compatible with Python dictionaries and the pandas module, the language/module I am using (but so are most data formats).
Based on the nature of this data, what are the best practices and methodologies for choosing the most viable approach among options such as CSV, JSON, NoSQL (i.e. Mongo), and SQL (i.e. Postgres, MySQL), keeping in mind the variability among the data sources/points and the object-oriented nature of the data? Furthermore, is it worth consolidating the data into one format, or better to keep it separate by data source?
I would suggest going with Mongo, as it is flexible enough to let you store unstructured data, and it will be much easier to query, IMO.
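As a rough sketch of what that looks like in practice (Python with pymongo against a local MongoDB instance; the database, collection, and field names are invented):

    from pymongo import MongoClient  # assumes the pymongo package and a local MongoDB instance

    samples = MongoClient("mongodb://localhost:27017")["ml_data"]["samples"]

    # Each record keeps its restaurants as a list, so a source with two
    # restaurants and a source with five need no padding or null columns,
    # and extra per-source fields simply appear where they exist.
    samples.insert_one({
        "source": "api_a",
        "restaurants": [
            {"name": "Blue Door", "location": "Austin"},
            {"name": "Nori", "location": "Seattle", "rating": 4.5},  # extra field is fine
        ],
    })

    # Query across nested fields without caring how many restaurants a record has.
    for doc in samples.find({"restaurants.location": "Austin"}):
        print(doc["source"], len(doc["restaurants"]))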

Data compression in RDBMS like Oracle, MySQL etc

I was reading about an in-memory database that incorporates a feature called data compression. Using it, instead of storing first name, last name, father's name, etc. values as-is in the column (which leads to a lot of data duplication and wasted disk storage), it creates a dictionary and an attribute-vector table for each column, so that only unique values are stored in the dictionary and the corresponding attribute vector is stored in the original table.
The clear advantage of this approach is that it saves a lot of space by removing the overhead of data duplication.
I want to know:
Do RDBMSs like Oracle, MySQL, etc. implicitly follow this approach when they store data on disk? Or, when we use these RDBMSs, do we have to implement it ourselves if we want to take advantage of it?
As we know, there is no free lunch, so I would like to understand the trade-offs if a developer implements the data compression approach explained above. One I can think of is that in order to fetch data from the database, I will have to join my dictionary table with the main table. Isn't that so?
Please share your thoughts and inputs.
This answer is based on my understanding of your query. It appears that you are mixing up two concepts: data normalisation and data storage optimisation.
Data Normalisation: This is a process that needs to be undertaken by the application developer. Here, pieces of data that would otherwise be stored repeatedly are stored only once and referenced by their identifiers, which would typically be integers. This way the database consumes only as much space as is needed to store the repeating data once. This is a common practice when storing strings and variable-length data in database tables. In order to retrieve the data, the application has to perform joins between the related tables, and this process contributes directly to application performance, depending on how the related tables are designed.
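A toy illustration of that trade-off, using Python's built-in sqlite3 (the table and column names are invented; the same idea applies to Oracle or MySQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # "Dictionary" table: each distinct first name is stored exactly once.
    cur.execute("CREATE TABLE first_names (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    # The main table references the dictionary by integer id instead of repeating the string.
    cur.execute("""CREATE TABLE people (
                       id INTEGER PRIMARY KEY,
                       first_name_id INTEGER REFERENCES first_names(id)
                   )""")

    cur.execute("INSERT INTO first_names (name) VALUES ('Rahul')")
    name_id = cur.lastrowid
    cur.executemany("INSERT INTO people (first_name_id) VALUES (?)", [(name_id,)] * 3)

    # The price of the saved space: reads need a join to get the text back.
    for row in cur.execute("""SELECT p.id, f.name
                              FROM people p JOIN first_names f ON f.id = p.first_name_id"""):
        print(row)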
Data storage optimisation: This is handled by the RDBMS itself. It involves various steps like maintaining the B-tree structures that hold the data, compressing data before storage, managing the free space within the data files, etc. Different RDBMS systems handle these in different ways (some patented and proprietary, others more general); however, when we are speaking of RDBMSs like Oracle and MySQL, you can be assured that they follow best-in-class storage algorithms to store this data efficiently.
Hope this helps.

Mongo vs MySql Search Optimization

So I'm in the process of designing a system that is going to store document-type data (i.e. transcribed documents). Immediately, I thought this would be a great opportunity to leverage a NoSQL implementation like MongoDB. However, given that I have zero experience with Mongo, I'm wondering: on each of these documents, I have a number of metadata tags I want to be able to search across: things like date, author, keywords, etc. If I were to use an RDBMS like MySQL, I'd probably store these items in a separate table linked by a foreign key and then index the items most likely to be searched on. Then I could run queries against that table and only pull back the full-text results for the items that matched (saving on disk reads by not having to read through a row that contains a lot of text or BLOB data).
Would something similar be possible with Mongo? I know in Mongo I could simply create one document that holds all the metadata AND the actual transcription, but is it easy and highly performant to search the various fields in the metadata if the document is stored like that? Is there a best practice when needing to perform searches across various items in a document in Mongo? Or is this type of scenario more suited to an RDBMS than a NoSQL implementation?
You can add indexes for individual fields in MongoDB documents. Only when the indexes grow larger than your available memory does the performance of index-based searches become a problem.
When you decide whether to go with MongoDB, keep in mind that there is no join operation. That has to be done by your DB layer or above.
If your primary concern is searching, there is an ElasticSearch river for MongoDB, so you can utilize ElasticSearch on your dataset.
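A rough sketch of that single-document approach (Python with pymongo; the database, collection, and field names are invented):

    from pymongo import ASCENDING, MongoClient  # assumes the pymongo package and a local MongoDB instance

    docs = MongoClient("mongodb://localhost:27017")["archive"]["transcriptions"]

    # Metadata and the full transcription live in the same document, but the
    # searchable fields get their own indexes.
    docs.create_index([("author", ASCENDING)])
    docs.create_index([("date", ASCENDING)])
    docs.create_index([("keywords", ASCENDING)])  # multikey index over the keywords array

    docs.insert_one({
        "author": "J. Smith",
        "date": "1921-04-03",
        "keywords": ["harvest", "letters"],
        "transcription": "...very long text...",
    })

    # The projection excludes the large transcription, so matches return
    # only the metadata rather than the full body.
    for d in docs.find({"author": "J. Smith", "keywords": "harvest"}, {"transcription": 0}):
        print(d["_id"], d.get("date"))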
The NoSQL model is geared toward storing data at scale (the OLTP model); yes, you can create the indexes and run the queries you want. Instead of having related entities spread across tables, you have one complete entity that holds all of its dependent entities within itself.
When you have to extract complex reports with many joins from a relational database holding millions of rows, doing so becomes impractical, because you may end up compromising other applications.
For example:
We have Room and Student entities.
Each room has many students; in the relational model we would have the following:
SELECT * FROM ROOM R
INNER JOIN STUDENT S
ON S.ID = R.STUDENTID
Imagine doing that across some 20 tables, each holding thousands of rows. The performance will be horrible.
With MongoDB you would do:
db.sala.find()
and you will have all the rooms with their students embedded.
MongoDB is a database that scales horizontally.
You can read:
http://openmymind.net/mongodb.pdf
This site also has an interactive tutorial that uses the book's examples. Very nice.
You can also try MongoDB online and test your commands; search for "try mongodb".
Also read about sharding with replica sets. I believe it will greatly open your mind.
You can install Robomongo, which is a graphical interface for tinkering with MongoDB.
http://robomongo.org/

mysql key/value store problem

I'm trying to implement a key/value store with MySQL.
I have a user table with 2 columns, one for the global ID and one for the serialized data.
Now the problem is that every time any bit of the user's data changes, I have to retrieve the serialized data from the DB, alter it, re-serialize it, and write it back to the DB. I have to repeat these steps even for a very, very small change to any of the user's data (since there's no way to update that cell within the DB itself).
Basically, I'm looking for the solutions people normally use when faced with this problem.
Maybe you should preprocess your JSON data and insert it as proper MySQL rows, split into fields.
Since your input is JSON, you have various alternatives for converting the data:
You mentioned that many small changes happen in your case. Where do they occur? In a member of a list? In a top-level attribute?
If updates occur mainly in list members within a part of your JSON data, then perhaps every member should in fact be represented as a separate row in its own table (see the sketch below).
If updates occur in a top-level attribute, then represent it as a field of its own.
I think the cost of preprocessing won't hurt in your case.
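A toy version of that split, using Python's built-in sqlite3 for brevity (the tables, columns, and sample JSON are invented; the same schema applies in MySQL):

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Top-level attributes become columns; list members become rows in a child table.
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    cur.execute("CREATE TABLE user_addresses (user_id INTEGER, label TEXT, city TEXT)")

    incoming = json.loads(
        '{"id": 7, "email": "a@example.com", "addresses": [{"label": "home", "city": "Pune"}]}'
    )
    cur.execute("INSERT INTO users VALUES (?, ?)", (incoming["id"], incoming["email"]))
    cur.executemany("INSERT INTO user_addresses VALUES (?, ?, ?)",
                    [(incoming["id"], a["label"], a["city"]) for a in incoming["addresses"]])

    # A small change is now a one-row UPDATE, with no fetch/deserialize/reserialize cycle.
    cur.execute("UPDATE user_addresses SET city = ? WHERE user_id = ? AND label = ?",
                ("Mumbai", 7, "home"))
    print(cur.execute("SELECT * FROM user_addresses").fetchall())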
When this is a problem, people do not use key/value stores; they design a normalized relational database schema to store the data in separate, single-valued columns which can be updated.
To be honest, your solution is using a database as a glorified file system - I would not recommend this approach for application data that is core to your application.
The best way to use a relational database, in my opinion, is to store relational data - tables, columns, primary and foreign keys, data types. There are situations where this doesn't work - for instance, if your data is really a document, or when the data structures aren't known in advance. For those situations, you can either extend the relational model, or migrate to a document or object database.
In your case, I'd see firstly if the serialized data could be modeled as relational data, and whether you even need a database. If so, move to a relational model. If you need a database but can't model the data as a relational set, you could go for a key/value model where you extract your serialized data into individual key/value pairs; this at least means that you can update/add the individual data field, rather than modify the entire document. Key/value is not a natural fit for RDBMSes, but it may be a smaller jump from your current architecture.
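A toy version of that key/value layout (Python's built-in sqlite3 for brevity; the table and attribute names are invented, and the same schema works in MySQL):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # One row per (user, attribute) instead of one serialized blob per user.
    cur.execute("""CREATE TABLE user_attrs (
                       user_id INTEGER,
                       attr    TEXT,
                       value   TEXT,
                       PRIMARY KEY (user_id, attr)
                   )""")
    cur.executemany("INSERT INTO user_attrs VALUES (?, ?, ?)",
                    [(1, "name", "Ada"), (1, "city", "London"), (1, "plan", "free")])

    # A tiny change touches exactly one row; nothing gets re-serialized.
    cur.execute("UPDATE user_attrs SET value = ? WHERE user_id = ? AND attr = ?",
                ("pro", 1, "plan"))
    print(cur.execute("SELECT attr, value FROM user_attrs WHERE user_id = 1").fetchall())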
When you have a key/value store, assuming your serialized data is JSON, it is effective only when you have memcached along with it, because you don't update the database on the fly every time; instead you update the memcache and then push that to your database in the background. So you definitely have to update the entire value rather than an individual field in your JSON data (like the address alone) in the database, but you can update and retrieve data fast from memcached. Since there are no complex relations in the database, it is fast to push and pull data between the database and memcached.
I would continue with what you are doing and create separate tables for the indexable data. This allows you to treat your database as a single data store which is managed easily through most operation groups including updates, backups, restores, clustering, etc.
The only thing you may want to consider is adding ElasticSearch to the mix if you need to perform anything like a LIKE query, purely for improved search performance.
If space is not an issue for you, I would even make it an insert-only database, so any change adds a new record and you keep the history. Of course, you may want to remove older records, but a background job can delete the superseded records in batches. (Mind you, what I described is basically Kafka.)
There are many alternatives out there now that beat an RDBMS in terms of performance. However, they all add extra operational overhead, in that each is yet another piece of middleware to maintain.
The way around that, if you have a microservices architecture, is to keep the middleware as part of your microservice stack. However, you then have to deal with transmitting the data across the microservices, so you'd still end up switching to something like Kafka underneath it all.