I have to start building the architecture for a database project, but I really don't know the differences between the engines.
Can anyone explain the pros and cons of each of these three engines? We'll have to choose one of them, and the only thing I actually know about them is this:
MySQL & Postgres:
Are free but not as good as Oracle
MySQL has security problems (is this true?)
Oracle:
Best database engine in the world
Expensive
Can someone clarify the other differences between them? This is a medium/large project (we're thinking around 100 to 200 tables) with a low budget. What would you choose? And with a higher budget?
A few years ago I had to write a translation engine; you feed it one set of SQL and it translates to the dialect of the currently connected engine. My engine works on Postgres (AKA PostgreSQL), Ingres, DB2, Informix, Sybase, and Oracle - oh, and ANTS. Frankly, Oracle is my least favorite (more on that below)... Unfortunately for you, MySQL and SQL Server are not on the list (at the time neither was considered a serious RDBMS - but times do change).
Setting aside the quality and performance of the engines - and the ease of making and restoring backups - here are the primary areas of difference:
datatypes
limits
invalids
reserved words
null semantics (see below)
quotation semantics (single quote ', double quote ", or either)
statement completion semantics
function semantics
date handling (including constant keywords like 'now' and input / output function formats)
whether inline comments are permitted
maximum attribute lengths
maximum number of attributes
connection semantics / security paradigm.
Without boring you with all the conversion data, here's a sample for one datatype, lvarchar:
oracle=varchar(%x)
sybase=text
db2="long varchar"
informix=lvarchar
postgres=varchar(%x)
ants=varchar(%x)
ingres=varchar(%x,%y)
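To give a feel for just one other area from the list, here is how "give me the current date/time" looks on a few of these engines (an illustration, not an exhaustive list):

    SELECT SYSDATE FROM dual;                         -- Oracle (needs the dummy table "dual")
    SELECT now();                                     -- Postgres
    SELECT CURRENT TIMESTAMP FROM sysibm.sysdummy1;   -- DB2
    SELECT GETDATE();                                 -- Sybase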
The biggest deal of all, in my view, is null handling: Oracle SILENTLY converts blank input strings to null values. Somewhere, a long time ago, I read a write-up someone had done about "The Seventeen Meanings of Null" or some such, and the real point is that nulls are very valuable, and the distinction between a null string and an empty string is useful and non-trivial! I think Oracle made a huge mistake on this one; none of the others have this behavior (that I've ever seen).
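A minimal sketch of the behavior I mean (hypothetical table; results as I describe them above):

    CREATE TABLE t (s VARCHAR(10));
    INSERT INTO t VALUES ('');                -- Oracle silently stores NULL here

    SELECT COUNT(*) FROM t WHERE s IS NULL;   -- 1 on Oracle, 0 on Postgres
    SELECT COUNT(*) FROM t WHERE s = '';      -- 0 on Oracle ('' is NULL, and NULL never compares equal), 1 on Postgres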
My second least favorite was ANTS because, unlike all the others, it ENFORCED the silly rules for perfectly standard syntax that absolutely no one else does. While ANTS may be the only DB company to provide perfect adherence to the standard, it is also a royal pain in the butt to write code for.
Far and away my favorite is Postgres; it's very fast in _real_world_ situations, has great support, and is open source / free.
The differences between SQL implementations are big, at least under the hood. This board won't suffice to count them all.
If you have to ask, you should also ask yourself whether you are in a position to reach a valid and well-founded decision on the matter.
A comparison of MySQL and Postgres can be found here.
Note that Oracle also offers an Express (XE) edition, reduced in features but free to use.
Also, if you have little knowledge to start with, you will have to teach yourself. I would just choose any one of them and start learning by using it.
See the comparison tables on Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_object-relational_database_management_systems and http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
Oracle may or may not be the best. It's expensive, but expensive doesn't mean best.
Have you looked at DB2? Sybase? Teradata? MS SQL?
I think that for low budget scenarios Oracle is out of the question.
100-200 tables is not big, and generally the number of tables in a schema is not a measure of scale. Dataset size and throughput are.
You can have a look at http://www.mysqlperformanceblog.com/ (and their superb book) to see how MySQL can handle huge deployments.
Generally nowadays most RDBMSes can do almost anything that you'd need in a very serious application.
I've been working mostly with Oracle for the past few years, and am quite used to seeing single-character varchar columns used as boolean values.
I can also see (per Stack Overflow answers) that the suggested type for MySQL is TINYINT.
Now I've taken on my little side project using DerbyDB, which supports BOOLEAN columns, but only since version 10 or so.
So, the question is: why is it so hard to incorporate a BOOLEAN column while designing a relational database? Am I missing something, or is it just pushed down the to-do list as unimportant, since you can use another column type in the meantime?
In the case of Derby, specifically, the answer is a bit of strange history: Derby, the open source database, was once called Cloudscape, and was a proprietary product. At that time, it fully supported BOOLEAN.
Subsequently, Cloudscape was purchased by Informix, which was in turn purchased by IBM, and IBM engineering decided to make Derby compatible with DB2. The reason was that, if the two databases were compatible, it would be easier for users to migrate their applications between Derby databases and DB2 databases. The engineering staff, however, did not remove the non-DB2-compatible features from Derby; they simply disabled them in the SQL grammar, leaving most of the implementation in place.
Subsequently, IBM open-sourced Cloudscape to the Apache Software Foundation, naming it Derby. The open source community, no longer bound by the requirement that Derby be completely compatible with DB2, decided to revive the BOOLEAN datatype support. And so Derby now has BOOLEAN datatype support.
Tom Kyte pretty much echoes your last sentence in this blog entry:
"It just isn't a type we have -- I can say no more and no less. ANSI
doesn't have it -- many databases don't have it (we are certainly not
alone). In the grand scheme of things -- I would say the
priotization of this is pretty "low" (thats my opinion there)."
He's speaking from the Oracle perspective, but it applies to any RDBMS.
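For what it's worth, the workarounds mentioned in the question look roughly like this (a sketch; the table and column names are mine, one CREATE per engine):

    -- Oracle: single-character flag constrained to 'Y'/'N'
    CREATE TABLE account (
      id        NUMBER PRIMARY KEY,
      is_active CHAR(1) DEFAULT 'N' NOT NULL CHECK (is_active IN ('Y', 'N'))
    );

    -- MySQL: TINYINT(1), with 0 as false and 1 as true
    CREATE TABLE account (
      id        INT PRIMARY KEY,
      is_active TINYINT(1) NOT NULL DEFAULT 0
    );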
PostgreSQL has had support for boolean for as long as I can remember.
The oldest online docs I can find are for version 6.3, released 1998-03-01. They mention the boolean type:
http://www.postgresql.org/docs/6.3/static/c0805.htm
In later docs they mention SQL99 as the standard they follow.
Since SQL99 seems to mention this type, I would assume that many DBs supported it well before 1999.
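So on Postgres you can simply write (a trivial sketch; table and column names are made up):

    CREATE TABLE task (
      id   integer PRIMARY KEY,
      done boolean NOT NULL DEFAULT false
    );

    SELECT * FROM task WHERE NOT done;   -- booleans work directly in predicates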
I don't know, as I haven't designed one, but my guess would be this: since RDBMSs are about describing and storing sets of things, boolean fields aren't strictly needed. They would denote what is in a set, but they are extraneous, as set membership can be derived from the actual data or structure of the database.
As an example, take a boolean column for roles given to employees, where they're either managers or they're not. You could use a boolean column to describe this, but what you should do is either have a table for managers and a table for non-managers, or (and this would be the more flexible and probably more manageable way) create an extra "lookup" table that gives roles (as a single text column) and a key that is then referred to (as a foreign key) in the employees table, as sketched below.
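In rough SQL, that lookup-table approach would be something like this (names invented for illustration):

    CREATE TABLE role (
      role_id INT PRIMARY KEY,
      name    VARCHAR(50) NOT NULL          -- e.g. 'manager', 'staff'
    );

    CREATE TABLE employee (
      employee_id INT PRIMARY KEY,
      name        VARCHAR(100) NOT NULL,
      role_id     INT NOT NULL REFERENCES role (role_id)   -- replaces an is_manager boolean
    );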
I think I should add that most times you see a boolean field in a table it's a code smell, as it may hit performance - using a boolean in a where clause would invoke a table scan and make any indexes on the table fairly pointless (but see the comments for further discussion of this). I'd hazard another guess that boolean data types have been added to most RDBMSs for use in their procedural language extensions (T-SQL, PL/SQL) to help with the odd conditional statement that's required.
Am I missing a reason not to use the new DateTime2 datatype?
For example, might it cause problems when migrating to another database system or integrating it with another technology?
One of the definitive articles on this is this one from Tibor Karaszi.
In favour:
Better precision
Potentially less storage
But probably best of all judging by frequency of questions here:
Better support for ANSI date formats (yyyy-mm-dd is not safe otherwise; see the sketch below)
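To illustrate that last point, a sketch of the kind of thing Tibor Karaszi's article covers (SQL Server 2008 syntax; the variable names are my own):

    SET DATEFORMAT dmy;                       -- e.g. set by a login's language settings
    DECLARE @old datetime  = '2009-02-03';    -- interpreted year-day-month: March 2nd!
    DECLARE @new datetime2 = '2009-02-03';    -- always interpreted yyyy-mm-dd: February 3rd
    SELECT @old AS old_datetime, @new AS new_datetime2;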
If you're lucky enough to be using nothing but SQL Server 2008, and can guarantee that you will be for a long time to come, then I see no reason why you shouldn't use it if you need to.
I think the replies to this question will explain it better than I can.
However, the reasons for not using it would be pretty much as you describe, i.e. it's not recognised in earlier versions of SQL Server, so moving data between the two would require some conversion.
Similarly, datetime2 has higher precision, and if you write code that depends on that level of precision, then you are locked in to always using that datatype.
I'm new to databases, and this question has to do with how smart I can expect databases to be. Here by "databases" I mean "something like" MySQL or H2 (I actually have no idea if these two are similar, just that they are popular). I'm actually using ScalaQuery, so it abstracts away from the underlying database.
Suppose I have a table with entries of type (String, Int), with lots of redundancy in the String entries. So my table might look like:
(Adam, 18)
(Adam, 24)
(Adam, 34)
... continued ...
(Adam, 3492)
(Bethany, 4)
(Bethany, 45)
... continued ...
(Bethany, 2842)
If I store this table with H2, is it going to be smart enough to realize "Adam" and "Bethany" are repeated lots of times, and can be replaced with enumerations pointing to lookup tables? Or is it going to waste lots of storage?
Related: If H2 is smart in this respect with strings, is it also smart in the same way with doubles? In my probably brain-dead initial table, I happen to have lots of repeated double fields.
Thanks!
The database engine is not built to recognize redundancies in data and fix them. That is the task of the designer / developer.
Databases are designed to store information. There is no way the database can know whether (Adam, 44) and (Adam, 55) can be compressed, and I would be petrified if databases tried to do the things you propose, as this can lead to various performance and/or logical problems.
On the contrary, databases do not minimise storage; they add redundant information, like indexes, keys, and other internal data the DB requires.
DBs are built to retrieve information fast, not to store it space-efficiently. Faced with that trade-off, a database would rather increase storage space than decrease query performance.
There are some storage systems that compress pages, so the question is valid. I can't talk about MySQL, but I believe it is similar to H2. H2 isn't very smart in this regard. H2 does compress data, but only for the following cases:
LOB compression, if enabled.
The following does not affect the storage size of a closed database: H2 currently compresses the undo log with LZF when writing, so repeated data in a page will result in slightly improved write performance (but only after a checkpoint). This may change in the future, however.
Also, H2 uses an encoding similar to UTF-8 to store text, but I wouldn't call that compression.
MySQL and other SQL products based on contiguous storage are not smart at this kind of thing at all.
Consider two logical sets, one referencing the other (i.e. via a foreign key). One possible implementation is to physically store the value common to both sets just once, with both tables storing a pointer to the value (think of reference-type variables in 3GL programming languages such as C#). However, most SQL products physically store the value in both tables; if you want pointers, you have to implement them yourself, typically using autoincrementing integer 'surrogate' keys, which sadly get exposed in the logical model.
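To make that concrete with the asker's data, the "do it yourself pointers" version looks something like this (a sketch; table and column names are invented):

    CREATE TABLE person (
      person_id INT PRIMARY KEY,              -- autoincrementing surrogate key in practice
      name      VARCHAR(100) NOT NULL UNIQUE  -- 'Adam' and 'Bethany' stored once
    );

    CREATE TABLE measurement (
      person_id INT NOT NULL REFERENCES person (person_id),  -- the 'pointer'
      value     INT NOT NULL
    );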
Either you are talking about data compression, which can be done by the database engine and shouldn't be your concern.
Or you are talking about data normalization. Then you should read up on database design.
Databases are meant to store data, so there's no need to worry about a bit of redundancy. If you are getting into several million rows and gigabytes of data, then you can start considering your options, but up to that level you will not have any performance problems.
Any good articles out there comparing Oracle vs SQL Server vs MySQL in terms of performance?
I'd like to know things like:
INSERT performance
SELECT performance
Scalability under heavy load
Based on some real examples in order to gain a better understanding about the different RDBMS.
The question is really too broad to be answered because it all depends on what you want to do as there is no general "X is better than Y" benchmark without qualifying "at doing Z" or otherwise giving it some kind of context.
The short answer is: it really doesn't matter. Any of those will be fast enough for your needs. I can say that with 99% certainty. Even MySQL can scale to billions of rows.
That being said, they do vary. As just one example, I wrote a post about a very narrow piece of functionality: join and aggregation performance. See Oracle vs MySQL vs SQL Server: Aggregation vs Joins.
Yes, such benchmarks do exist, but they cannot be published, as Oracle's licensing prohibits publishing such things.
At least, that is the case to the best of my knowledge. I've seen a few published which do not name Oracle specifically, but instead say something like "a leading RDBMS" when they are clearly talking about Oracle, but I don't know whether that gets around it.
On the other hand, Oracle now own MySQL, so perhaps they won't care so much, or perhaps they will. Who knows.
OK, I need to perform some intensive text manipulation operations.
Think concatenating huge pieces of text (say, 100 pages of standard text), searching in them, etc. So I am wondering: would MySQL give me better performance for these specific operations than a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, why not give it a try? That should take much less time than coding your own tool. If performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world one has text mining and Oracle Text.
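For reference, an Oracle Text query looks roughly like this (a sketch; the table and column are hypothetical):

    -- Build a full-text (CONTEXT) index, then search with CONTAINS:
    CREATE INDEX doc_text_idx ON docs (body) INDEXTYPE IS CTXSYS.CONTEXT;

    SELECT id FROM docs WHERE CONTAINS(body, 'manipulation') > 0;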
There are several good responses that I voted up, but here are more considerations from my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text search is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a TEXT column. (AFAIK, the InnoDB engine still doesn't handle full-text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
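In MySQL terms that setup is roughly the following (a sketch, assuming a version from the era when full-text required MyISAM; table and index names invented):

    CREATE TABLE documents (
      id   INT AUTO_INCREMENT PRIMARY KEY,
      body TEXT
    ) ENGINE=MyISAM;                          -- InnoDB lacked full-text indexing at the time

    CREATE FULLTEXT INDEX documents_body_ft ON documents (body);

    SELECT id FROM documents
    WHERE MATCH (body) AGAINST ('search terms');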
Though it's a bit more difficult to implement, you'll see the best performance by integrating indexing apps like Xapian, Hyper Estraier, or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
Relational databases are normally not good at handling large text data. The performance strength of relational DBs lies in indexing and automatically generated query plans. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and trying to manipulate the data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours would be, so it is bound to be slower at simple tasks like string manipulation :-)
Of course, you must use the correct algorithm to get good results. There is a useful e-book about string search algorithms, with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that, since my application requires the text to be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB be less efficient overall than processing it within the DB?
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indices, which will be hundreds of times faster than directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or fewer records being accessed to get the final result, and whether more or less data would need to be transferred over the network.
If both solutions would result in the same number of records being accessed and the same amount of data transferred over the network, then there probably won't be a big difference either way. If performance is critical, then try both and benchmark them (if you don't have time to benchmark both, you should probably go with whichever is easiest to implement anyway).
MySQL is written in C, so it is not correct to compare it to a C program. It is itself a C program.