Database performance benchmark - MySQL

Any good articles out there comparing Oracle vs SQL Server vs MySQL in terms of performance?
I'd like to know things like:
INSERT performance
SELECT performance
Scalability under heavy load
Based on some real examples, in order to gain a better understanding of the different RDBMSs.

The question is really too broad to be answered, because it all depends on what you want to do: there is no general "X is better than Y" benchmark without qualifying it with "at doing Z" or otherwise giving it some kind of context.
The short answer is: it really doesn't matter. Any of those will be fast enough for your needs. I can say that with 99% certainty. Even MySQL can scale to billions of rows.
That being said, they do vary. As just one example, I wrote a post about a very narrow piece of functionality: join and aggregation performance. See Oracle vs MySQL vs SQL Server: Aggregation vs Joins.

Yes, such benchmarks do exist, but they cannot be published, as Oracle's licensing prohibits publishing such things.
At least, that is the case to the best of my knowledge. I've seen a few published which do not name Oracle specifically, but instead say something like "a leading RDBMS" when they are clearly talking about Oracle, but I don't know whether that gets around it.
On the other hand, Oracle now owns MySQL, so perhaps they won't care so much, or perhaps they will. Who knows.

Related

SQL and fuzzy comparison

Let's assume we have a table of People (name, surname, address, SSN, etc).
We want to find all rows that are "very similar" to specified person A.
I would like to implement some kind of fuzzy-logic comparison of A with all rows from the table People. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules for surname, 5 rules for address).
The question is: which of the following two approaches would be better, and why?
Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A. This approach may include using soundex, similarity metrics, etc.
Implement one or more simpler SELECT statements that return less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside the database) to get the "very similar" rows. So the fuzzy comparison would be implemented in my favorite programming language.
Table People should have up to 500k rows, and I would like to make about 500-1000 queries like this a day. I use MySQL (but this is yet to be considered).
I don't really think there is a definitive answer because it depends on information not available in the question. Anyway, too long for a comment.
DBMSes are good at retrieving information according to indexes. It does not make sense to have a DB server waste time on heavy computations unless it is dedicated to this specific purpose (as answered by @Adrian).
Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.
If the computations are minor, everything could be done on the server. Otherwise, pull the work out into the client system.
The disadvantage of the second approach lies in the amount of data traveling from the server to the client and the number of connections to establish. So it is typically a compromise between computation on the server and data transfer to the client, a balance to be struck depending on the specifics of the fuzzy rules.
Edit: I've seen in a comment that you are almost sure you'll have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e., try to keep related code together rather than spreading it between systems (and languages).
I would say you're best off using simple SELECTs to get the closest matches you can without hammering the database, and then doing the heavy lifting in your application layer. The reason I suggest this is scalability: if you do the heavy lifting in the application layer, your problem is a perfect use case for a map-reduce-style solution, in which you can distribute the similarity processing across nodes and get your results back much faster than if you pushed it all through the database. Plus, this way you're not locking up your database and slowing down any other operations that may be going on at the same time.
Since you're still considering which DB to use: PostgreSQL has the fuzzystrmatch module, which provides Levenshtein and Soundex functions. You might also want to look at the pg_trgm module, as described here. Maybe you could also put an index on the column using soundex() so you won't have to calculate it every time.
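For reference, a minimal sketch of that setup, assuming a reasonably recent PostgreSQL and placeholder table/column names (people, name) loosely following the question; the thresholds are arbitrary examples:
-- Load the modules
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Trigram index so similarity lookups on name don't scan the whole table
CREATE INDEX people_name_trgm_idx ON people USING gin (name gin_trgm_ops);
-- "Rather similar" candidates for a given input, ranked by trigram similarity
-- and filtered by Levenshtein distance
SELECT *
FROM people
WHERE name % 'John Smith'                     -- pg_trgm similarity operator
  AND levenshtein(name, 'John Smith') <= 3    -- fuzzystrmatch
ORDER BY similarity(name, 'John Smith') DESC;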
But you seem to be optimizing prematurely, so my advice would be to test using pg first and then see whether you need to optimize at all; the numbers you provided really don't seem like a lot, considering you have almost two minutes to run each query.
An option I'd consider is to add a column to the People table that holds the Soundex value of the person's name.
I've done joins using something like:
SELECT [Column]
FROM People P
INNER JOIN TableA A ON SOUNDEX(A.ComparisonColumn) = P.SoundexColumn
That'll return anything in TableA that has the same Soundex value as the People table's Soundex column.
I haven't used that kind of query on tables that size, but I see no issues with trying it. You can also index that SoundexColumn to help with performance.
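A minimal sketch of that setup, assuming MySQL-style syntax and the placeholder names used above (name, SoundexColumn):
-- Add and populate the precomputed Soundex column
-- (MySQL's SOUNDEX() can return more than the classic 4 characters, so size the column generously)
ALTER TABLE People ADD COLUMN SoundexColumn VARCHAR(20);
UPDATE People SET SoundexColumn = SOUNDEX(name);
-- Index it so lookups and joins don't scan the whole table
CREATE INDEX idx_people_soundex ON People (SoundexColumn);
-- Find people whose name sounds like a given input
SELECT * FROM People WHERE SoundexColumn = SOUNDEX('John Smith');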

Using a NoSQL database for relational purposes

Non-relational databases are attracting more attention day by day. Their main limitation is that today's complicated data really are connected. Wouldn't it be convenient to connect data in them the way we connect tables in an RDBMS? Of course, I just mean simple cases. Imagine three tables: Articles, Tags, Relationships. In an RDBMS like MySQL, we can run three queries to:
1. Find ID of a given tag
2. Find Articles connected with the captured Tag ID
3. Fetch the contents of Articles tagged with the term
Instead of three queries, we could perform a single query with a JOIN. I think three queries in a key/value database like BerkeleyDB would be faster than a JOIN query in MySQL.
Is this idea practical? Or are there other issues that rule this approach out?
NoSQL databases can support relational data models just fine. You're just left to implement the relational mapping yourself in your application, and that effort is typically not insignificant.
In some applications this extra effort will be worthwhile. Perhaps you only have a small number of tables and the joins you need are very simple. Or perhaps you've done some performance evaluation between a traditional relational DBMS and a NoSQL alternative and found that the NoSQL option is more appropriate for your needs for any number of reasons (performance, scalability, flexibility, whatever).
You should keep one thing in mind, however. A typical SQL DBMS is basically a NoSQL DB with an optimized, well-built relational engine in front of it. Some databases even let you bypass the relational layer and treat their system like a pure NoSQL DB.
Therefore, the moment you start to build your own relational mappings and joins on top of a NoSQL DB you should ask yourself, "Didn't someone build this for me already?" The answer may well be "yes", and the solution might be to go with a traditional SQL DBMS.
To answer the "3 query" part of your question specifically, the answer is "maybe". You certainly might be able to make such a query run faster in a NoSQL DB than in an RDBMS, but you need to keep in mind that there are more things to consider here than just the raw speed of your query:
The technical debt you will incur as you build join-like functionality that you wouldn't have had to build otherwise
The time it will take you to build, test and optimize your query code which will likely be more significant than writing a simple SQL query
Any difference in transactional guarantees or other typical product features (replication, management tools, etc) which you may lose or gain depending on the NoSQL option you choose
The ability to hire DBAs who know how to run your database from an operational perspective
You might review that list and say to yourself, "No big deal, I'm running a simple app with only a few thousand DB entries and I'll maintain it myself". If so, knock yourself out - Berkeley (and other NoSQL options) would work fine. I've used Berkeley many times for those kinds of applications. But you may have a different answer if you are building the back-end for a significantly-sized SaaS product which might soon have millions of users and very complex queries.
We can't give a one-size-fits-all answer, unfortunately. You'll have to make the judgement call yourself based on the needs of your application and organization.
Sure, a single-record join is pretty speedy in either solution, but that's not the big advantage of joins. Joins are useful when you're joining many, many rows with many, many other rows. Imagine if, in your example, you wanted to do that for 100 different tags. Without joins, you're talking 300 queries to SQL's one.
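For reference, the JOIN version of that lookup might look roughly like this (the column names id, name, article_id and tag_id are assumptions, since the question doesn't give a schema):
SELECT a.*
FROM Articles a
JOIN Relationships r ON r.article_id = a.id
JOIN Tags t ON t.id = r.tag_id
WHERE t.name IN ('tag1', 'tag2', 'tag3');  -- ...or all 100 tags
One round trip, and the database can use indexes on Tags and Relationships to do the matching, instead of the client issuing 300 separate key/value lookups.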
Another solution for NoSQL systems is playOrm. It does joins, but only within partitions, so the table can be infinite in size while each partition has to be on a par with the size of an RDBMS table. It also does all the fancy Hibernate-style work for you with the related annotations, though it has some differences and will be adding Embedded for use when you denormalize. It makes things much easier. Typically, dealing with NoSQL means a lot of pain in the translation logic you have to write, plus all the manual indexing, updates, and removals from the index... playOrm does all of this for you instead.

Django Using Annotate Instead of Distinct()

I have read that the distinct() API call has some performance issues at times. I wanted to try to rewrite a query through the ORM that avoids using distinct (or at least profile the difference).
My understanding is that values() performs a GROUP BY under the hood. When I test out the two methods, though, the count of objects differs depending on whether I use distinct() or values()/annotate().
zip_codes = Location.objects.values('zip_code').annotate(zip_count=Count('zip_code')).exclude(zip_code=None).count()
VS.
zip_codes = Location.objects.values_list('zip_code', flat=True).exclude(zip_code=None).distinct()
any thoughts on what is wrong here?
Thanks!
I just quickly checked your queries against a similar database I have. The counts were identical, so I'm not sure what it is about your data that is causing the difference.
I'd also be HIGHLY skeptical of the premise, though. DISTINCT is indeed a CPU-intensive operation. However, so is COUNT(*), and the annotate-based query is going to first run an aggregate with a GROUP BY and then run a COUNT on the results. I'd put money on the single DISTINCT call being faster (I'd also check with whichever database backend you're using to see). All of this has very little to do with Django's ORM and a whole heck of a lot more to do with your database backend.
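Roughly, the SQL those two ORM calls produce looks like the following (an approximation; the exact SQL depends on your Django version and backend, and the table name location is a placeholder):
-- annotate()/count() version: GROUP BY in a subquery, then count the grouped rows
SELECT COUNT(*) FROM (
    SELECT zip_code, COUNT(zip_code) AS zip_count
    FROM location
    WHERE zip_code IS NOT NULL
    GROUP BY zip_code
) subquery;
-- distinct() version: a single de-duplicated projection
SELECT DISTINCT zip_code
FROM location
WHERE zip_code IS NOT NULL;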
Also think about this: the distinct-based query is an order of magnitude clearer about what it's accomplishing than the annotate-based one. Do you have evidence that DISTINCT is going to be slow in your situation, or better still, that it's a bottleneck right now? If not, you're well into the range of premature optimization and should heavily reconsider your path.
Premature Optimization.
Optimization matters only when it matters. When it matters, it matters a lot, but until you know that it matters, don't waste a lot of time doing it. Even if you know it matters, you need to know where it matters. Without performance data, you won't know what to optimize, and you'll probably optimize the wrong thing.
The result will be obscure code that is hard to write, hard to debug, and hard to maintain, and that doesn't solve your problem. Thus it has the dual disadvantage of (a) increasing software development and software maintenance costs, and (b) having no performance effect at all.
In other words: write your software clearly, and then, when you find a problem, trace it to the source and fix it. Anything you do before that is counterproductive. Spend your time worrying about which indexes are going to matter on your DB and where to use select_related. Those are 10000% more effective than what you are worrying about here (unless you are counting zip codes all the time, in which case let me introduce you to caching).

How fast is MySQL compared to a C/C++ program running on the server?

OK, I need to perform some intensive text manipulation operations, like concatenating huge texts (say, 100 pages of standard text) and searching in them. So I am wondering: would MySQL give me better performance for these specific operations, compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, then why not give it a try? It should take much less time than coding your own tool. If performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world one has text mining and Oracle Text.
There are several good responses here that I voted up, but here are some more considerations, in my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text search is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a TEXT column. (AFAIK, the InnoDB engine still doesn't handle full-text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
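A minimal sketch of the MySQL route, with placeholder table and column names:
-- MyISAM table with a full-text index covering title and body
CREATE TABLE documents (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    body TEXT,
    FULLTEXT INDEX ft_title_body (title, body)
) ENGINE=MyISAM;
-- Search the index instead of scanning the raw text
SELECT id, title
FROM documents
WHERE MATCH(title, body) AGAINST('search terms');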
For a bit more implementation difficulty, though, you'll see the best performance by integrating indexing apps like Xapian, Hyper Estraier or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
Relational databases are normally not good at handling large text data. The performance strength of relational DBs lies in indexing and automatically generated query plans. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and trying to manipulate the data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours will be, so it is bound to be slower at simple tasks like string manipulation :-)
Of course, you must use the correct algorithm to get good results. There is a useful e-book about string search algorithms with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark it and give us a report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that, since my application requires the text to be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB be less efficient overall than processing it within the DB?
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indexes, which will be hundreds of times faster than searching directly through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or less records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed, and the same amount transferred over the network, then there probably won't be a big difference either way. If performance is critical, then try both and benchmark them (if you don't have time to benchmark both, then you probably want to go for whichever is easiest to implement anyway).
MySQL is written in C, so it is not correct to compare it to a C program. It is itself a C program.

PostgreSQL, MySQL, Oracle differences

I have to start building the architecture for a database project, but I really don't know the differences between the engines.
Can anyone explain the pros and cons of each of these three engines? We'll have to choose one of them, and the only thing I actually know about them is this:
MySQL & Postgres:
Free, but not as good as Oracle
MySQL has security problems (is this true?)
Oracle:
The best database engine in the world
Expensive
Can someone clarify the other differences between them? This is a medium/large project (we're thinking of around 100 to 200 tables) with a low budget. What would you choose? And with a higher budget?
A few years ago I had to write a translation engine; you feed it one set of SQL and it translates it to the dialect of the currently connected engine. My engine works on Postgres (AKA PostgreSQL), Ingres, DB2, Informix, Sybase, and Oracle - oh, and ANTS. Frankly, Oracle is my least favorite (more on that below)... Unfortunately for you, MySQL and SQL Server are not on the list (at the time, neither was considered a serious RDBMS - but times do change).
Without regard to the quality or performance of the engines - or the ease of making and restoring backups - here are the primary areas of difference:
datatypes
limits
invalids
reserved words
null semantics (see below)
quotation semantics (single quote ', double quote ", or either)
statement completion semantics
function semantics
date handling (including constant keywords like 'now' and input / output function formats)
whether inline comments are permitted
maximum attribute lengths
maximum number of attributes
connection semantics / security paradigm.
Without boring you with all the conversion data, here's a sample for one datatype, lvarchar:
oracle = varchar(%x)
sybase = text
db2 = "long varchar"
informix = lvarchar
postgres = varchar(%x)
ants = varchar(%x)
ingres = varchar(%x,%y)
The biggest deal of all, in my view, is null handling: Oracle SILENTLY converts blank input strings to null values. Somewhere, a LONG time ago, I read a write-up someone had done about "The Seventeen Meanings of Null" or some such, and the real point is that nulls are very valuable, and the distinction between a null string and an empty string is useful and non-trivial! I think Oracle made a huge mistake on this one; none of the others have this behavior (that I've ever seen).
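A small illustration of that behavior, assuming a throwaway Oracle table t with a single VARCHAR2 column:
CREATE TABLE t (name VARCHAR2(50));
INSERT INTO t (name) VALUES ('');            -- Oracle stores the empty string as NULL
SELECT COUNT(*) FROM t WHERE name IS NULL;   -- counts the row just inserted
In the other engines listed above, an empty string and NULL stay distinct, so the equivalent count would be zero.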
My second least favorite was ANTS because, unlike all the others, they ENFORCED the silly rules for perfect syntax that absolutely no one else does; while they may be the only DB company to provide perfect adherence to the standard, they are also a royal pain in the butt to write code for.
Far and away my favorite is Postgres; it's very fast in _real_world_ situations, has great support, and is open source / free.
The differences between SQL implementations are big, at least under the hood. This board won't suffice to list them all.
If you have to ask, you also have to ask yourself whether you are in a position to reach a valid and well-founded decision on the matter.
A comparison of MySQL and Postgres can be found here.
Note that Oracle also offers an Express (XE) edition, reduced in features but free to use.
Also, if you have little knowledge to start with, you will have to teach yourself; I would just choose any one of them and start learning by using it.
See the comparison tables on Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_object-relational_database_management_systems and http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
Oracle may or may not be the best. It's expensive, but that doesn't mean it's the best.
Have you looked at DB2? Sybase? Teradata? MS SQL?
I think that for low budget scenarios Oracle is out of the question.
100-200 tables is not big; generally, the number of tables in a schema is not a measure of scale. Dataset size and throughput are.
You can have a look at http://www.mysqlperformanceblog.com/ (and their superb book) to see how MySQL can handle huge deployments.
Generally nowadays most RDBMSes can do almost anything that you'd need in a very serious application.