RDBMS for extremely large data sets - what are people using? - mysql

I have to perform some serious data mining on very large data sets stored in MySQL db. However, queries that require a bit more than a basic SELECT * FROM X WHERE ... tend to become rather inefficient since they return results on the order of 10e6 or more, especially when JOIN on one or more tables is introduced - think of joining 2 or more tables containing several tens of millions rows (after filtering data), which is something that pretty much happens on every query. More than often we'd like to run aggregate functions on these (sum, avg, count, etc), but this is impossible since MySQL simply chokes.
I should note that many efforts were put to optimize the current performance - all tables are indexed properly and queries are tuned, the hardware is top notch, the storage engine was configured and so on. However, still each query takes very long - to the point where "let's run it before we go home and hope for the best when we come to work tomorrow." Not good.
This has to be a solvable problem - many large companies perform very data and computational intensive mining, and handle it well (without writing their own storage engines, google). I'm willing to accept time penalty to get the job done, but on the order of hours, not days. My question is - what do people use to counter problems like this? I've heard of storage engines geared to this type of problem (greenplum, etc.), but I wanted to hear how this problem is typically approached. Our current data store is obviously relational and should probably remain such, but any thoughts or suggestions are welcome. Thanks.

I suggest PostgreSQL, which I've been working with quite successfully on tables with ~0.5B rows that required some complex join operations. Oracle should be good for that too, but I don't have much experience with it.
It should be noted that switching an RDBMS isn't a magic solution, if you want to scale to those sizes there's a LOT of hard work to be done in optimizing your queries, optimizing the database structure and indexes, fine tuning the database configuration, using the right hardware for your usage, replication, using materialized views (which are extremely powerful when used correctly. see here and here - its postgres specific, but applies to other RDBMSs too)... and at some point, you just have to throw more money on the problem.
edited fixed some weird typos (useless android auto correct...) and added some resources about materialized views

We have used MS SqlServer to run analytics on financial data with ten of millions of rows and more using complex JOIN and aggregation. Several things that we have done other than what you have mentioned are:
We chunk the calculation into a lot of temporary tables instead of using sub-query. These tables then we apply proper keys, indexing and so on via the code. Query with sub-query just fails for us
In the temporary tables, we often apply the clustered index that makes sense for us. Note that this temporary tables are filtered results so applying the index on the fly is not expensive compared to use the sub query in place of this temporary tables. Note I am speaking from our experience and might not apply to all cases
As we have done a lot of aggregation function as well, we did a lot indexing on the group columns
We do a lot of query run planning using SQL Query Analyzer that shows us the execution plan. Based on the plan, we revised the query, change the index
We provide hints for the SQL Server that we think could help the execution such as the choice of JOIN Algorithm to take (Hash, Merged or Nested)

Related

Does a complex sql query ever become complex enough that it would be more machine-efficient to do multiple queries?

I have this sense that - inefficiently written queries aside - getting the information you want out of a database is always faster the less queries you make to do so. I don't know where I got that idea from and it gets challenged the more complicated the queries are that I produce (am I really doing MySQL any favors with all these joins?). I'm not asking for an opinion on ease for the programmer or best coding practices, but do conditions exist under which a program would perform faster with a query broken out into multiple steps? If so, how might one make an educated guess a query might reach such an upper limit before going through the effort of coding and comparing?
Yes, although it is less likely with MySQL. The reason is that MySQL doesn't have a really sophisticated cost-based optimizer. The advantage to intermediate tables is that the sizes are known. A cost-based optimizer can take advantage of this information to improve the query plan.
One place where this can help is when a subquery is repeated multiple times in a query. An intermediate table ensures that it is processed only once (although CTEs would normally do the same thing).
Another place where this can really help is when you add indexes to the intermediate tables. Adding the indexes -- and using them -- can be a big cost savings, more than making up for the cost of creating the index.
That said, I generally discourage using intermediate tables for this purpose, unless the results are needed for multiple queries. I find that just the overhead in debugging makes it not worth it -- for some reason, I don't always delete the intermediate tables and then waste time wondering why some modification doesn't work.
More importantly, as the data changes, modifying the queries can be a pain. I find that changing a column name, for instance, is simpler in a single query than when the logic is spread across multiple queries.

Optimizing of database with multiple JOINs

First, some details about the website and the database structure -
With my website you can learn English words, and you can insert on each word a sentence, an association, an image, in addition - each word has a category, sub category, group...
My database includes about 20 tables. any user who registers to my website 'add' to users table something like 4000 rows - the number of the words on my website. I have a serious problem while the user is filtering words (somthing like 'search' word but according char/s & category/s & group/s etc.. I have 9 JOINs in my sql query, and it takes something like 1 MIN to display results..
The target of JOINs - inside the table users (where each user has 4000 rows / each row = word) there are joins on this style:
$this->db->join('users', 'sentences.id = users.sentence_id' ,'left');
The same thing with associations, groups, images, binds between words etc..
The users table includes id of sentences, associations, groups.. and with the JOIN there is a connection.
I don't know what to do.. it takes too much time. maybe the problem is the structure of the database? multiple joins? maybe using indexing? but how and where? because it's necessary sometimes retrieve all the words so indexing wouldn't help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used in joins most of the time).
There are a few things you can do.
Schema Design
You probably would want to reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, it seems your your tables are very normalized, perhaps in 3rd normal form? In that case consider denormalizing your tables into a larger one to avoid joins (joins are more expensive than full table scans!). There are many online documentations on this, however there's always costs to this, as it increases development complexity and data redundancy. Also by denormalizing your tables you avoid joins and can make better use of indexes.
Also I believe MyISAM is the only storage engine in MySQL that supports FULL TEXT indexes. However it does not have transactions and have table level-locking and no MVCC, so it depends on what you need.
Resources
I suggest you have a read at this book High Performance MySQL.
A truly awesome book on tuning MySQL databases
I also suggest having a read at the official documentation on your chosen storage engine. This is significant as each storage engine is VERY DIFFERENT! InnoDB is completely different from MyISAM which is also completely different from PBXT. Each engine has its benefits and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMS's attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive are carried out first. Indexes could help, and again, would require some analysis to find which attributes you are searching on.
Building databases to deal with natural language is a very challenging subject and there is a lot of research on the subject. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem is not scalable enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

Database structure - To join or not to join

We're drawing up the database structure with the help of mySQL Workbench for a new app and the number of joins required to make a listing of the data is increasing drastically as the many-to-many relationships increases.
The application will be quite read-heavy and have a couple of hundred thousand rows per table.
The questions:
Is it really that bad to merge tables where needed and thereby reducing joins?
Should we start looking at horizontal partitioning? (in conjunction with merging tables)
Is there a better way then pivot tables to take care of many-to-many relationships?
We discussed about instead storing all data in serialized text columns and having the application make the sorting instead of the database, but this seems like a very bad idea, even though that the database will be heavily cached. What do you think?
Go with the normalized form of the database. For most part of the tasks you won't need more than 3 or 4 Joins and you still can write views for the most common joins. Denormalization will have you to always think of updating fields in multiple places/tables when changing one property and will surely lead to more problems than benefits.
If you worry about reporting performance then you still can extract the data in timed batches into separate tables to get the desired performance for your reporting queries. If it's for query simplicity you can use views.
In inverse order:
Forget it. Use the database. People saynig "make it in the application" are pretty often those ignorant to the amount of work going into writing databases.
Depends on exact need.
Depends on exact need. OLTP (Transaction processing) - go for for firth normal form. OLAP (Analytical processing) - go for a proper star diagram and denormalize to get optimal performance. Mixed - forget it. Does not work for larger installs because the theories are different... except if you make the database OLTP and then use a special OLAP cube database (which mySQL does not have).
Databases are designed to handle lots of joins. Use this feature as it will make many kinds of data manipulation in the database much easier. Otherwise, why not just use a flat file?
As always, it depends on your application, but in general, too much denormalisation can come back and bite you later on. A well normalised database means that you should be able to query your data in most ways that you may need later on, particularly for reporting (which often is an afterthought).
If you stick all your data in serialized text columns and your client asks for a report showing all rows that have a particular attribute, then you're going to have to do a bunch of string manipulation to get this data out.
If you're worried about too many joins for your queries, you could consider exposing certain sets of the data as a view...
If you make sure to index the foreign keys (you did set up foreign keys didn't you?) and have proper where clauses in your queries, 10-15 joins should be easily handled by a database. Especially with so few rows. I have queries with that many joins on tables with millions of rows and they run fine.
Usually it is better to partition data than to denormalize.
As far as denomalizing goes, don't do it unless you also institute a strategy for keeping the denormalized data in synch with the parent table.
As to whether you really need that many tables or if your design is bad, well the only way we could comment on that is if we saw the table structure.
Unless you have clear evidence that performance is suffering because of the joins, stay normalised. Otherwise, as others have said, you'll have to worry about multiple updates.
Especially if the database is heavily cached, as you say, you'll be surprised how quick the DBMS is at doing this kind of thing - it is what it's designed for, after all.
Unless it's the sort of monster application, with huge amounts of data, that demands special performance optimisations, you'll find that keeping down the development, testing, and later, maintenance effort, will be much more important.
Joins are good, usually, not bad. They allow you to keep the data where it should be, which gives you maximum flexibility.
And as has been said many times, premature optimisation is usually bad, not good.

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by 100s of websites with average of 10,000s to 100,000s page views a day each so the data amount will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with same columns, so from a database normalization standpoint they make sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
If the data is uniform, go with one table. If you ever need to SELECT across all websites
having multiple tables is a pain. However if you write enough scripting you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the Windows hard limit on the number of open files - in Linux you may haveto use ulimit to raise the limit. There's no way to do it in Windows).
I have broken a huge table into many (hundreds) of tables and used MERGE to SELECT. I did this so the I could perform off-line creation and optimization of each of the small tables. (Eg OPTIMIZE or ALTER TABLE...ORDER BY). However the performance of SELECT with MERGE caused me to write my own custom storage engine. (Described http://blog.coldlogic.com/categories/coldstore/'>here)
Use the single data structure. Once you start encountering performance problems there are many solutions like you can partition your tables by website id also known as horizontal partitioning or you can also use replication. This all depends upon the the ratio of reads vs writes.
But for start keep things simple and use one table with proper indexing. You can also determine if you need transactions or not. You can also take advantage of various different mysql storage engines like MyIsam or NDB (in memory clustering) to boost up the performance. Also caching plays a very good role in offloading the load from the database. The data that is mostly read only and can be computed easily is usually put in the cache and the cache serves the request instead of going to the database and only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here cannot answer performance questions, you should just do performance tests yourself to understand, whether having one big table is sufficient.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query takes minute to execute. I ended up making partitions, separating them into a couple tables. What i'm asking is, is there any pattern or design techniques to handle this kind of problem (huge number of records)? Is MSSQL or Oracle better in handling lots of records?
P.S
the COUNT(*) problem stated above is just an example case, in reality the app does crud functionality and some aggregate query (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some these queries because of the table volume
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction ?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL innodb, it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs)
Having the whole table in ram will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
Many performance issues around large tables relate to indexing problems, or lack of indexing all together. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow count(*) on the huge table, i would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total size of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a books worth of answer and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indicies on columns based upon expected lookup AND update needs as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, mysql does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high-price and quality; Postgres (aka PostgreSql) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second #Mark Baker, and say that you need to build indices on your tables.
For other queries than the one you selected, you should also be aware that using constructs such as IN() is faster than a series of OR statements in the query. There are lots of little steps you can take to speed-up individual queries.
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general, avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common where clause fields), avoid cursors (although I think this is less true in Oracle than SQL Server I don't know about mySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for mySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716