Using a NoSQL database for relational purposes - mysql

Non-relational databases are attracting more attention day by day. Their main limitation is that today's complex data really is interconnected. Wouldn't it be convenient to connect records in a NoSQL store the way we connect tables in an RDBMS? Of course, I only mean simple cases. Imagine three tables: Articles, Tags and Relationships. In an RDBMS like MySQL, we can run three queries to
1. Find ID of a given tag
2. Find the Articles connected to that tag ID
3. Fetch the contents of Articles tagged with the term
In an RDBMS we instead run a single query with a JOIN. My guess is that three lookups in a key/value database like BerkeleyDB would be faster than a JOIN query in MySQL.
Is this idea practical, or are there other issues that make this approach one to avoid?
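For concreteness, here is roughly what the two approaches look like in SQL. The question gives no exact columns, so the schema below is an assumption: Articles(id, title, body), Tags(id, name), Relationships(article_id, tag_id).

    -- Three separate lookups (what you would emulate in a key/value store):
    SELECT id FROM Tags WHERE name = 'nosql';                   -- step 1: tag ID
    SELECT article_id FROM Relationships WHERE tag_id = 42;     -- step 2: article IDs for that tag
    SELECT title, body FROM Articles WHERE id IN (10, 17, 23);  -- step 3: article contents

    -- The equivalent single query in an RDBMS:
    SELECT a.title, a.body
    FROM Articles a
    JOIN Relationships r ON r.article_id = a.id
    JOIN Tags t          ON t.id = r.tag_id
    WHERE t.name = 'nosql';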

NoSQL databases can support relational data models just fine. You're just left to implement the relational mapping yourself in your application, and that effort is typically not insignificant.
In some applications this extra effort will be worthwhile. Perhaps you only have a small number of tables and the joins you need are very simple. Or perhaps you've done some performance evaluation between a traditional relational DBMS and a NoSQL alternative and found that the NoSQL option is more appropriate for your needs for any number of reasons (performance, scalability, flexibility, whatever).
You should keep one thing in mind, however. A typical SQL DBMS is basically a NoSQL DB with an optimized, well-built relational engine in front of it. Some databases even let you bypass the relational layer and treat their system like a pure NoSQL DB.
Therefore, the moment you start to build your own relational mappings and joins on top of a NoSQL DB you should ask yourself, "Didn't someone build this for me already?" The answer may well be "yes", and the solution might be to go with a traditional SQL DBMS.
To answer the "3 query" part of your question specifically, the answer is "maybe". You certainly might be able to make such a query run faster in a NoSQL DB than in an RDBMS, but you need to keep in mind that there are more things to consider here than just the raw speed of your query:
- The technical debt you will incur as you build join-like functionality that you wouldn't have had to build otherwise
- The time it will take you to build, test and optimize your query code, which will likely be more significant than writing a simple SQL query
- Any difference in transactional guarantees or other typical product features (replication, management tools, etc.) which you may lose or gain depending on the NoSQL option you choose
- The ability to hire DBAs who know how to run your database from an operational perspective
You might review that list and say to yourself, "No big deal, I'm running a simple app with only a few thousand DB entries and I'll maintain it myself". If so, knock yourself out - Berkeley (and other NoSQL options) would work fine. I've used Berkeley many times for those kinds of applications. But you may have a different answer if you are building the back-end for a significantly-sized SaaS product which might soon have millions of users and very complex queries.
We can't give a one-size-fits-all answer, unfortunately. You'll have to make the judgement call yourself based on the needs of your application and organization.

Sure, a single record join is pretty speedy in either solution, but that's not the big advantage of joins. Joins are useful when you're joining many, many rows with many, many other rows. Imagine if, in your example, you wanted to do that for 100 different tags. Without joins, you're talking 300 queries to SQL's one.
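To make that concrete, here is a sketch reusing the assumed Articles/Tags/Relationships schema from above - one set-based query in place of roughly 300 key/value lookups:

    SELECT t.name, a.title, a.body
    FROM Tags t
    JOIN Relationships r ON r.tag_id = t.id
    JOIN Articles a      ON a.id = r.article_id
    WHERE t.name IN ('nosql', 'mysql', 'berkeleydb', /* ... up to 100 tags ... */ 'joins');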

Another option on NoSQL systems is playOrm. It does joins, but only within partitions, so a table can grow without bound as long as each partition stays comparable in size to a typical RDBMS table. It also does all the fancy Hibernate-style mapping for you, with the related annotations (though there are some differences), and will be adding Embedded support for use when you denormalize. It makes things much easier. Dealing with NoSQL directly is usually a pain because of all the translation logic, manual indexing, and index updates and removals you have to write yourself; playOrm does all of that for you instead.

Related

How to design and store message history in the database of an IM (instant messaging) system?

By storing historical messages in persistent storage, we can achieve multi-device synchronization and message roaming.
But how should we design the table schema and split the table?
My first thought is that every chat group could have its own table, and the messages sent in that chat group or channel would be appended to it.
That way we end up with lots of tables, like group_123, group_345, ..., group_${gid}. The only question with this method is whether having so many tables is a bad idea.
I have searched for answers before, and most of them store everything in one big table, where $gid is just a column.
I'm also puzzled by the difference between MySQL and MongoDB in this scenario: I can't figure out which one is better, or why I would or wouldn't use each of them.
Be very wary of any design that starts, "I'll create a table per X," because whatever X is, it's likely to become too numerous, and soon you'll have thousands of tables, and discover that just managing the metadata becomes a burden.
In general, the way to approach relational table design is to follow rules of database normalization. Your table to store messages is a set of similar objects. Normalization does not make a distinction between sets that are modest in size versus large in size. If they are the same type of thing, they go in the same table. At least that's what normalization would guide us to do.
There are practical limits of any implementation, though, and you may find the need to bend the rules of normalization, by using partitioning or sharding of various forms. Even defining indexes is not called for by normalization, but it is a good idea to help optimize queries.
That's the key: any optimization strategy must be chosen in the context of specific queries that you need to run in your application. Optimization means to improve efficiency of one type of query, at the expense of other types of queries. You cannot choose which optimization strategy is best for your application without knowing the queries.
This is also the way to choose between relational and non-relational types of databases. Non-relational databases optimize for certain query types, so you need to know which queries are most important in your application before choosing any non-relational technology, or choosing which data model once you have chosen that technology.
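To make the single-table approach concrete, a minimal sketch might look like the following (column names are assumptions, not a prescription), with the index chosen for the most common query - fetching a group's history in time order:

    CREATE TABLE messages (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        group_id   BIGINT UNSIGNED NOT NULL,    -- $gid is just a column, not part of a table name
        sender_id  BIGINT UNSIGNED NOT NULL,
        body       TEXT NOT NULL,
        created_at DATETIME NOT NULL,
        KEY idx_group_time (group_id, created_at)  -- supports fetching one group's history in order
    );

    -- Message roaming / multi-device sync for one group:
    SELECT id, sender_id, body, created_at
    FROM messages
    WHERE group_id = 123 AND created_at >= '2024-01-01'
    ORDER BY created_at;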

When is it time to switch to NoSQL?

I am dealing with a large database that is collecting historical pricing data. The schema is relatively simple and does not change.
Something like:
SKU (char), type(enum), price(double), datetime(datetime)
The issue is that this table now has over 500,000,000 rows, is around 20 GB, and growing. It is already getting a bit difficult to run queries. One common query is to get all SKUs for a specific date range, which may be 500,000 records or so. Add any complexity like GROUP BY and you can forget it.
This DB is mostly writes, but we obviously need to crunch the data and run queries occasionally. I understand that better index planning can help speed up the queries, but I am wondering if this is the type of data that would benefit from a NoSQL solution like MongoDB. Can I expect MySQL (probably moving to MariaDB) to continue to work for us even after it grows beyond 100-200 GB in size, or should I explore alternatives before things get unwieldy?
NoSQL is not a solution to a "large database" problem. NoSQL, and specifically document databases, is designed for scenarios where the nature of the data you're storing varies, so you don't want to define rigid schemas and relationships up front.
What you have is simple, well-defined data. This is ideally suited to a relational database, but for something of that scale I would recommend looking at something commercial (e.g. SQL Server or Oracle, depending on your platform). The databases I work with in SQL Server are around four terabytes in size, with several tables in the hundreds of millions of records, like yours. A relational database can easily accommodate the simple data you've outlined.
You actually have an ideal use case for SQL and a rather bad fit for NoSQL. MySQL developers report people using databases with 5,000,000,000 records, and some other SQL servers scale even further. Without proper index support, however, it would be impossible to manage even a fraction of that.
BTW, what is your table schema, including indices?
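For reference, here is a sketch of what "proper index support" could look like for the schema in the question, using MySQL/MariaDB range partitioning by date. Column names, ENUM values and partition boundaries are assumptions, not taken from the original post:

    CREATE TABLE price_history (
        sku         CHAR(20) NOT NULL,
        type        ENUM('list', 'sale') NOT NULL,
        price       DOUBLE   NOT NULL,
        recorded_at DATETIME NOT NULL,
        KEY idx_time_sku (recorded_at, sku)   -- covers "all SKUs in a date range"
    )
    PARTITION BY RANGE (TO_DAYS(recorded_at)) (
        PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- The common query only touches the partitions and index range it needs:
    SELECT DISTINCT sku
    FROM price_history
    WHERE recorded_at BETWEEN '2023-06-01' AND '2023-06-30';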
You could switch to MariaDB and then use the Spider engine. The Spider engine makes it possible to split your data across multiple MariaDB instances without losing the ability to run queries against your existing instance.
You define your own rules for partitioning and then create one backend instance per partition. In the end you have multiple MariaDB instances, but all your records are virtually combined into one table by the Spider engine.
Your performance gain comes from splitting the data across multiple instances, which reduces the number of records per table and per instance, and of course from using more hardware resources.
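Roughly, a Spider setup follows the pattern in the MariaDB Spider documentation; the server names, hosts, credentials and column list below are placeholders, not a tested configuration:

    -- Register the backend MariaDB instances that will hold the shards:
    CREATE SERVER shard1 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.1', DATABASE 'prices', USER 'spider', PASSWORD 'secret', PORT 3306);
    CREATE SERVER shard2 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.2', DATABASE 'prices', USER 'spider', PASSWORD 'secret', PORT 3306);

    -- The Spider table on the front instance; each partition maps to one backend:
    CREATE TABLE price_history (
        sku         CHAR(20) NOT NULL,
        price       DOUBLE   NOT NULL,
        recorded_at DATETIME NOT NULL
    ) ENGINE=SPIDER
      COMMENT='wrapper "mysql", table "price_history"'
      PARTITION BY KEY (sku) (
          PARTITION p1 COMMENT = 'srv "shard1"',
          PARTITION p2 COMMENT = 'srv "shard2"'
      );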

RDBMS for extremely large data sets - what are people using?

I have to perform some serious data mining on very large data sets stored in a MySQL DB. However, queries that require anything more than a basic SELECT * FROM X WHERE ... tend to become rather inefficient, since they return results on the order of 10e6 rows or more, especially when a JOIN on one or more tables is introduced - think of joining 2 or more tables containing several tens of millions of rows (after filtering), which happens on pretty much every query. More often than not we'd like to run aggregate functions on these (SUM, AVG, COUNT, etc.), but this is impossible because MySQL simply chokes.
I should note that a lot of effort has gone into optimizing the current performance - all tables are indexed properly, queries are tuned, the hardware is top notch, the storage engine is configured, and so on. However, each query still takes very long - to the point of "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
This has to be a solvable problem - many large companies perform very data- and compute-intensive mining and handle it well (without writing their own storage engines; Google aside). I'm willing to accept a time penalty to get the job done, but on the order of hours, not days. My question is: what do people use to counter problems like this? I've heard of storage engines geared to this type of problem (Greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain so, but any thoughts or suggestions are welcome. Thanks.
I suggest PostgreSQL, which I've been working with quite successfully on tables with ~0.5B rows that required some complex join operations. Oracle should be good for that too, but I don't have much experience with it.
It should be noted that switching RDBMSs isn't a magic solution: if you want to scale to those sizes, there's a LOT of hard work to be done in optimizing your queries, optimizing the database structure and indexes, fine-tuning the database configuration, using the right hardware for your usage, replication, and using materialized views (which are extremely powerful when used correctly; see here and here - they're Postgres-specific, but the ideas apply to other RDBMSs too)... and at some point, you just have to throw more money at the problem.
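As a small illustration of the materialized-view point (PostgreSQL syntax; the table and column names below are invented for the example):

    -- Precompute an expensive aggregation once, then query it like a table:
    CREATE MATERIALIZED VIEW daily_sales_summary AS
    SELECT product_id,
           date_trunc('day', sold_at) AS sale_day,
           count(*)                   AS orders,
           sum(amount)                AS revenue
    FROM sales
    GROUP BY product_id, date_trunc('day', sold_at);

    CREATE INDEX ON daily_sales_summary (product_id, sale_day);

    -- Re-run the aggregation on a schedule (e.g. nightly):
    REFRESH MATERIALIZED VIEW daily_sales_summary;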
We have used MS SQL Server to run analytics on financial data with tens of millions of rows and more, using complex JOINs and aggregation. Several things we have done beyond what you have mentioned are:
- We chunk the calculation into a lot of temporary tables instead of using sub-queries, then apply proper keys, indexing and so on to those tables in code (see the sketch after this list). Queries built on sub-queries simply fail for us.
- On the temporary tables we often apply whatever clustered index makes sense for us. Since these temporary tables hold already-filtered results, building the index on the fly is not expensive compared to using a sub-query in place of the temporary table. Note that I am speaking from our experience, and this might not apply to all cases.
- As we also run a lot of aggregation, we index the GROUP BY columns heavily.
- We do a lot of query planning using SQL Query Analyzer, which shows us the execution plan. Based on the plan, we revise the query and change the indexes.
- We give SQL Server hints that we think could help execution, such as the choice of join algorithm (hash, merge or nested loops).
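A minimal T-SQL sketch of the temp-table/clustered-index/hint pattern described above; the table and column names are made up for illustration:

    -- Stage the filtered subset once instead of repeating it as a sub-query:
    SELECT trade_id, account_id, amount, traded_at
    INTO #recent_trades
    FROM trades
    WHERE traded_at >= '2024-01-01';

    -- Index the (much smaller) intermediate result before joining/aggregating:
    CREATE CLUSTERED INDEX ix_recent_trades ON #recent_trades (account_id, traded_at);

    SELECT a.account_name, SUM(t.amount) AS total_amount
    FROM #recent_trades AS t
    INNER HASH JOIN accounts AS a ON a.account_id = t.account_id   -- explicit join-algorithm hint
    GROUP BY a.account_name;

    DROP TABLE #recent_trades;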

Optimizing a database with multiple JOINs

First, some details about the website and the database structure -
With my website you can learn English words; for each word you can add a sentence, an association and an image, and in addition each word has a category, sub-category, group, and so on.
My database includes about 20 tables. Any user who registers on my website 'adds' something like 4,000 rows to the users table - the number of words on the site. I have a serious problem when a user filters words (something like a word 'search', but by character(s), category/categories, group(s), etc.): I have 9 JOINs in my SQL query, and it takes something like 1 minute to display the results.
The purpose of the JOINs: inside the users table (where each user has 4,000 rows, one row per word) there are joins in this style:
$this->db->join('users', 'sentences.id = users.sentence_id' ,'left');
The same goes for associations, groups, images, bindings between words, etc.
The users table includes the IDs of sentences, associations, groups, and so on, and the JOINs make the connections.
I don't know what to do; it takes too much time. Maybe the problem is the structure of the database? The multiple joins? Maybe indexing would help, but how and where? Sometimes it's necessary to retrieve all the words, so I assumed indexing wouldn't help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used in joins most of the time).
There are a few things you can do.
Schema Design
You probably would want to reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, your tables are very normalized, perhaps to third normal form. In that case, consider denormalizing them into a larger table to avoid joins (joins can be more expensive than full table scans!). There is plenty of documentation on this online, but there are always costs: it increases development complexity and data redundancy. On the other hand, by denormalizing your tables you avoid joins and can make better use of indexes.
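As a sketch of what that denormalization could look like here (column names are guesses based on the question's description), the per-word detail tables might be collapsed into one wide table so the filter query touches a single table:

    CREATE TABLE user_words (
        user_id      INT UNSIGNED NOT NULL,
        word_id      INT UNSIGNED NOT NULL,
        word         VARCHAR(100) NOT NULL,
        category     VARCHAR(50),
        sub_category VARCHAR(50),
        group_name   VARCHAR(50),
        sentence     TEXT,
        association  TEXT,
        image_url    VARCHAR(255),
        PRIMARY KEY (user_id, word_id),
        KEY idx_filter (user_id, category, word)
    );

    -- The "filter words" search becomes a single-table query instead of 9 joins:
    SELECT word, sentence, image_url
    FROM user_words
    WHERE user_id = 42 AND category = 'verbs' AND word LIKE 'a%';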
Also, I believe MyISAM is the only storage engine in MySQL that supports FULLTEXT indexes. However, it does not have transactions, uses table-level locking, and has no MVCC, so it depends on what you need.
Resources
I suggest you read the book High Performance MySQL. It is a truly awesome book on tuning MySQL databases.
I also suggest reading the official documentation on your chosen storage engine. This is significant, as each storage engine is VERY different! InnoDB is completely different from MyISAM, which in turn is completely different from PBXT. Each engine has its benefits, and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMSs attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive ones are carried out first. Indexes could help too, but that again requires some analysis to find which attributes you are searching on.
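One concrete way to do that analysis in MySQL is to put EXPLAIN in front of the slow query and see which tables are scanned without an index. The query below is only a stand-in for the real 9-join filter; all names are invented:

    EXPLAIN
    SELECT w.word, s.sentence
    FROM users u
    JOIN words w          ON w.id = u.word_id
    LEFT JOIN sentences s ON s.id = u.sentence_id
    WHERE u.user_id = 42 AND w.category_id = 7;
    -- Rows showing type=ALL (full table scan) and key=NULL are the joins
    -- that need an index or a rethink of the join order.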
Building databases to deal with natural language is a very challenging subject and there is a lot of research on the subject. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem is not scalable enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

Database structure - To join or not to join

We're drawing up the database structure for a new app with the help of MySQL Workbench, and the number of joins required to produce a listing of the data is increasing drastically as the number of many-to-many relationships grows.
The application will be quite read-heavy and have a couple of hundred thousand rows per table.
The questions:
Is it really that bad to merge tables where needed and thereby reducing joins?
Should we start looking at horizontal partitioning? (in conjunction with merging tables)
Is there a better way than pivot tables to take care of many-to-many relationships?
We also discussed storing all the data in serialized text columns and having the application do the sorting instead of the database, but this seems like a very bad idea, even though the database will be heavily cached. What do you think?
Go with the normalized form of the database. For most tasks you won't need more than 3 or 4 joins, and you can still write views for the most common joins. Denormalization forces you to always think about updating fields in multiple places/tables when changing one property, and will surely lead to more problems than benefits.
If you're worried about reporting performance, you can still extract the data in timed batches into separate tables to get the desired performance for your reporting queries. If it's about query simplicity, you can use views.
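A small sketch of the "views for the most common joins" idea (table names are assumptions for illustration):

    -- Hide a frequent many-to-many join behind a view; readers query it like a table:
    CREATE VIEW product_tags AS
    SELECT p.id AS product_id, p.name AS product_name, t.name AS tag_name
    FROM products p
    JOIN product_tag_map m ON m.product_id = p.id
    JOIN tags t            ON t.id = m.tag_id;

    SELECT product_name FROM product_tags WHERE tag_name = 'sale';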
In inverse order:
Forget it. Use the database. People saying "do it in the application" are pretty often ignorant of the amount of work that goes into writing databases.
Depends on exact need.
Depends on exact need. For OLTP (transaction processing), go for fifth normal form. For OLAP (analytical processing), go for a proper star schema and denormalize to get optimal performance. Mixed? Forget it; that does not work for larger installations because the theories are different, unless you keep the database OLTP and then use a separate OLAP cube database (which MySQL does not have).
Databases are designed to handle lots of joins. Use this feature as it will make many kinds of data manipulation in the database much easier. Otherwise, why not just use a flat file?
As always, it depends on your application, but in general, too much denormalisation can come back and bite you later on. A well normalised database means that you should be able to query your data in most ways that you may need later on, particularly for reporting (which often is an afterthought).
If you stick all your data in serialized text columns and your client asks for a report showing all rows that have a particular attribute, then you're going to have to do a bunch of string manipulation to get this data out.
If you're worried about too many joins for your queries, you could consider exposing certain sets of the data as a view...
If you make sure to index the foreign keys (you did set up foreign keys, didn't you?) and have proper WHERE clauses in your queries, 10-15 joins should be easily handled by a database, especially with so few rows. I have queries with that many joins on tables with millions of rows and they run fine.
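For example, a minimal sketch of that kind of foreign-key indexing in MySQL (table and column names are invented; InnoDB also creates an index automatically when the constraint is declared):

    ALTER TABLE order_items
        ADD CONSTRAINT fk_order FOREIGN KEY (order_id) REFERENCES orders (id),
        ADD INDEX idx_product (product_id);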
Usually it is better to partition data than to denormalize.
As far as denormalizing goes, don't do it unless you also institute a strategy for keeping the denormalized data in sync with the parent table.
As to whether you really need that many tables or whether your design is bad, the only way we could comment on that is if we saw the table structure.
Unless you have clear evidence that performance is suffering because of the joins, stay normalised. Otherwise, as others have said, you'll have to worry about multiple updates.
Especially if the database is heavily cached, as you say, you'll be surprised how quick the DBMS is at doing this kind of thing - it is what it's designed for, after all.
Unless it's the sort of monster application, with huge amounts of data, that demands special performance optimisations, you'll find that keeping down the development, testing and, later, maintenance effort will be much more important.
Joins are good, usually, not bad. They allow you to keep the data where it should be, which gives you maximum flexibility.
And as has been said many times, premature optimisation is usually bad, not good.