Simple question - what would be better for a medium/large database with a requirement for ACID compliance in 2012?
I have read it all (well, most of it) about MySQL vs PostgreSQL, but most of those posts relate to versions 4/5.1 and 7/8 respectively and are quite dated (2008, 2009). It's almost 2012 now, so I guess we could take a fresh look at the issue.
Basically I would like to know if there is anything in PostgreSQL that outweighs the ease of use, availability and larger developer/knowledge base of MySQL.
Is MySQL's query optimizer still stupid? Is it still super slow on very complicated queries?
Hit me! :)
PS. And don't send me to Google or a wiki. I am looking for a few specific points, not an overview, and I trust Stack Overflow more than some random page with a 'smart guy' shining his light.
Addendum
Size of the project: Say an ordering system with roughly 10-100 orders/day per account, a couple of thousand accounts eventually, each of which can have several hundred to several thousand users.
Better at: being future-proof and flexible when it comes to growing and changing requirements. Performance is also important, to keep costs in the hardware department low. Availability of a skilled workforce would also be a factor.
OLTP or OLAP: OLTP
PostgreSQL is a lot more advanced when it comes to SQL features.
Things that MySQL still doesn't have (and PostgreSQL has) - a few of these are sketched in the example after this list:
deferrable constraints
check constraints (MySQL 8.0.16 added them, MariaDB 10.2 has them)
full outer join
MySQL silently uses an inner join with some syntax variations:
https://rextester.com/ADME43793
lateral joins
regular expressions don't work with UTF-8 (Fixed with MySQL 8.0)
regular expressions don't support replace or substring (Introduced with MySQL 8.0)
table functions ( select * from my_function() )
common table expressions (Introduced with MySQL 8.0)
recursive queries (Introduced with MySQL 8.0)
writeable CTEs
window functions (Introduced with MySQL 8.0)
function based index (supported since MySQL 8.0.15)
partial index
INCLUDE additional columns in an index (e.g. for unique indexes)
multi column statistics
full text search on transactional tables (MySQL 5.6 supports this)
GIS features on transactional tables
EXCEPT or INTERSECT operator (MariaDB has them)
you cannot use a temporary table twice in the same select statement
you cannot use the table being changed (update/delete/insert) in a sub-select
you cannot create a view that uses a derived table (Possible since MySQL 8.0)
create view x as select * from (select * from y);
statement level read consistency. Needed for e.g.: update foo set x = y, y = x or update foo set a = b, a = a + 100
transactional DDL
DDL triggers
exclusion constraints
key/value store
Indexing complete JSON documents
SQL/JSON Path expressions (since Postgres 12)
range types
domains
arrays (including indexes on arrays)
roles (groups) to manage user privileges (MariaDB has them, Introduced with MySQL 8.0)
parallel queries (since Postgres 9.6)
parallel index creation (since Postgres 11)
user defined data types (including check constraints)
materialized views
custom aggregates
custom window functions
proper boolean data type
(treating any expression that can be converted to a non-zero number as "true" is not a proper boolean type)
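To give a feel for a few of these, here is a minimal sketch in PostgreSQL syntax; the orders table and every column in it are made up purely for illustration:

-- check constraint, range type and deferrable constraint in one table definition
create table orders (
    id          serial primary key,
    account_id  bigint not null,
    status      text   not null,
    order_date  date   not null,
    total       numeric(12,2) check (total >= 0),          -- check constraint
    period      daterange,                                  -- range type
    unique (account_id, id) deferrable initially deferred   -- deferrable constraint
);

-- partial index: only index the rows that are queried often
create index orders_open_idx on orders (account_id) where status = 'open';

-- function based (expression) index
create index orders_status_lower_idx on orders (lower(status));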
When it comes to Spatial/GIS features Postgres with PostGIS is also much more capable. Here is a nice comparison.
Not sure what you call "ease of use", but there are several modern SQL features that I would not want to miss (CTEs, window functions), and those would define "ease of use" for me.
Now, PostgreSQL is not perfect, and probably the most obnoxious thing is having to tune the dreaded VACUUM process for a write-heavy database.
Is MySQL's query optimizer still stupid? Is it still super slow on
very complicated queries?
All query optimizers are stupid at times. PostgreSQL's is less stupid in most cases. Some of PostgreSQL's more recent SQL features (windowing functions, recursive WITH queries etc) are very powerful but if you have a dumb ORM they might not be usable.
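For illustration, here is roughly what a window function and a recursive WITH query look like; the orders and employees tables and their columns are hypothetical, not anything from the question:

-- running total per account with a window function
select account_id,
       order_date,
       sum(total) over (partition by account_id order by order_date) as running_total
from orders;

-- recursive CTE walking a simple parent/child hierarchy
with recursive subordinates as (
    select id, manager_id, name
    from employees
    where id = 1
    union all
    select e.id, e.manager_id, e.name
    from employees e
    join subordinates s on s.id = e.manager_id
)
select * from subordinates;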
Size of the project: Say an ordering system with roughly 10-100
orders/day per account, couple of thousand accounts, eventually, each
can have several hundred to several thousand users.
Doesn't sound that large - well within reach of a big box.
Better at: being future proof and flexible when it comes to growing
and changing requirements.
PostgreSQL has a strong developer team, with an extended community of contributors. Release policy is strict, with bugfixes-only in the point releases. Always track the latest release of 9.1.x for the bugfixes.
MySQL has had a somewhat more relaxed attitude to version numbers in the past. That may change with Oracle being in charge. I'm not familiar with the policies of the various forks.
Performance is also important as to keep costs low in hardware department.
I'd be surprised if hardware turned out to be a major component in a project this size.
Also availability of skilled workforce would be a factor.
That's your key decider. If you've got a team of experienced Perl + PostgreSQL hackers sitting around idle, use that. If your people know Lisp and MySQL, then use that.
OLTP or OLAP: OLTP
PostgreSQL has always been strong on OLTP.
My personal viewpoint is that the PostgreSQL mailing lists are full of polite, helpful, knowledgeable people. You have direct contact with users with terabyte databases and hackers who have built major parts of the code. The quality of the support is truly excellent.
As an addition to @a_horse_with_no_name's answer, I want to name some features which I like so much in PostgreSQL:
array data type;
hstore extension - very useful for storing key->value data; it is possible to create an index on columns of that type (see the sketch after this list);
various language extensions - I find Python very useful when it comes to unstructured data handling;
distinct on syntax - I think this one should be an ANSI SQL feature; it looks very natural to me (as opposed to MySQL's group by syntax);
composite types;
record types;
inheritance;
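A minimal sketch of hstore and distinct on, assuming hypothetical products and orders tables:

-- hstore: schema-less key->value column with a GIN index for containment queries
create extension if not exists hstore;

create table products (
    id         serial primary key,
    attributes hstore
);
create index products_attributes_idx on products using gin (attributes);

select * from products where attributes @> 'color => red'::hstore;

-- distinct on: the newest order per account, without a self-join or group-by dance
select distinct on (account_id) account_id, id, order_date
from orders
order by account_id, order_date desc;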
Version 9.3 features:
lateral joins - one thing I miss from SQL Server (where it's called outer/cross apply) - sketched after this list;
native JSON support;
DDL triggers;
recursive, materialized, updatable views;
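A rough sketch of the 9.3-era features, using made-up accounts, orders and api_events tables:

-- lateral: for each account, fetch its three most recent orders (like cross/outer apply)
select a.id, o.*
from accounts a
cross join lateral (
    select id, order_date, total
    from orders
    where orders.account_id = a.id
    order by order_date desc
    limit 3
) o;

-- native JSON: pull a field out of a stored document
select payload ->> 'customer' as customer
from api_events
where (payload ->> 'type') = 'order.created';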
PostgreSQL is a more mature database: it has a longer history, it is more ANSI SQL compliant, and its query optimizer is significantly better. MySQL has different storage engines (MyISAM, InnoDB, in-memory), and they are incompatible in the sense that an SQL query which runs on one engine may produce a syntax error when executed on another. Stored procedures are also better in PostgreSQL.
Related
The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64 GB of memory, a 2 TB disk, and a 2.30 GHz CPU with 8 cores and 16 logical, and the query is running on localhost). The same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table has been set to ENGINE=MyISAM and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried as shown below.
EDIT:
Untagged MS SQL Server. I had tagged it hoping for help from our MS SQL Server colleagues, with information on whether SQL Server has some way of structuring data and/or queries that would make this type of query run faster on that platform versus others such as MySQL.
Removed image of code to more closely conform with community standards
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySQL). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution here?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback
The MS SQL Server tag was used here because one of the systems I'm comparing is MS SQL Server. We're really working with very limited information here. I have two systems: my MySQL system, which is knowable as I'm running it, and the MS SQL Server running the same database in someone else's system that I have very little information about (all I have is a read-only SQL prompt). I am comparing apples and oranges: the same query runs well on the orange (MS SQL Server) and does not run well on the apple (my MySQL instance). I'd like to know why so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS SQL Server? Do I need to deploy on different hardware? Is the other system running some kind of in-memory caching system on top of their database instance? Most of these possibilities would require a non-trivial amount of time to explore and validate. So yes, I would like help from MS SQL Server experts who might know if there are caching options, transactional vs. warehouse options, etc. that could be set that would make a world of difference, that would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was included because I was just trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table without explicitly specifying ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases, help with performance. Suggest you change the engine as you add the indexes.
Show us some more queries; we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(visit_start_date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; do not use it without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server groups by the entire table.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
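For what it's worth, this is roughly how you would verify the effect with EXPLAIN, using the table from the question and the index suggested above (the exact plan output varies by MySQL version):

-- before adding the index: expect a full table scan (type: ALL)
EXPLAIN select min(visit_start_date) from visit_occurrence;

CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);

-- with the index in place, MIN() can be answered from the index alone;
-- the plan typically reports "Select tables optimized away"
EXPLAIN select min(visit_start_date) from visit_occurrence;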
I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-topology-based data warehouse. The table sizes range from 700 GB (fact table) down to 1 GB, and the whole database goes up to 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database could be "find the number of smokers per state and display it in descending order". This requirement could be converted into a simple query like:
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a lot of time to execute on this database; the time taken is on the order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it is going to be as easy for running analytical queries on the database, especially when I have to use the where clause and group by construct.
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries there. I am also not sure I want to invest right away in at least three machines for the name-node, ZooKeeper and data-nodes. Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data in simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the MySQL environment setup:
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for these kinds of ad-hoc aggregated values keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too) - see the sketch after this list.
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
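Here is a rough sketch of the "move the strings out of the fact table" step; smoking_status_dim and the column names are made up, and backfilling the new key from the old string column is omitted:

-- dimension table holding the smoking status strings once
create table smoking_status_dim (
    smoking_status_id tinyint unsigned primary key,
    label             varchar(32) not null
);

insert into smoking_status_dim values
    (1, 'current smoker'), (2, 'recently quit'),
    (3, 'quit 1+ years'),  (4, 'never smoked');

-- the fact table then carries only the small integer key (and an index on it)
alter table abc
    add column smoking_status_id tinyint unsigned,
    add index idx_smoking_status (smoking_status_id);

-- the analytical query now filters and joins on integers instead of strings
select f.state, count(*) as smokers
from abc f
join smoking_status_dim d using (smoking_status_id)
where d.label = 'current smoker'
group by f.state
order by smokers desc;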
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ and based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done) - or at Vertica, if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care and feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate the number of new smokers every day? (See the sketch after this list.)
Can you use Hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically, use Hadoop just for daily or batch processing and store the results back in the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on string columns? Can you make them ints, etc.?
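As a sketch of the pre-aggregation idea (daily_smokers_by_state is a made-up name, and the scheduling of the daily batch job is left out):

-- summary table refreshed once per day by a batch job
create table daily_smokers_by_state (
    stat_date date        not null,
    state     varchar(64) not null,
    smokers   int         not null,
    primary key (stat_date, state)
);

insert into daily_smokers_by_state (stat_date, state, smokers)
select current_date, state, count(*)
from abc
where smokingStatus = 'current smoker'
group by state;

-- ad-hoc reporting then reads the small summary table, not the 700 GB fact table
select state, smokers
from daily_smokers_by_state
where stat_date = (select max(stat_date) from daily_smokers_by_state)
order by smokers desc;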
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: a lack of parallel query capability - it cannot utilize multiple CPUs in a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating your queries, written in HiveQL (an SQL-like language), into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating daily data to it, and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive - each query will take at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema onto Cassandra, and you still would not get the flexibility and indexing capabilities of an RDBMS.
As a bottom line: Hive alongside MySQL should be a good solution.
Without getting into the details, I have a very ugly SQL Server 2008 database that is used by a very ugly piece of software developed by a 3rd party vendor. Through the software interface it allows me to design and build the SQL Server tables and write queries. Great for business users, awful for performance and database design.
None of the tables have primary key columns or indexes and the 'keys' the software generates - as well as pretty much every other important field - are all large varchars and text columns. In short, performance is awful and I'm being forced to import upwards of 100,000 rows of data and possibly much more.
My question - since I can't control the software or its queries, perhaps I could tweak the way SQL Server processes those queries? Is there a way to modify the query engine or insert a pre-select trigger to map the text-based key to a column I create myself, which I could properly index? In a perfect world I could simply replace any instance of ColumnA (for example) in a WHERE clause with ColumnAIndexed before SQL Server attempts to process it. Any chance something like that is possible?
Thanks.
It sounds like you can't place indexes on the table yourself. If that's the case then you can use an indexed view. If you're using SQL Server Enterprise or higher you might not have to do anything.
You can put non-clustered indexes on your own fields (computed in the view) after creating the clustered index.
Indexed views can be created in any edition of SQL Server. In SQL Server Enterprise, the query optimizer automatically considers the indexed view.
Then it's a matter of querying the view (with NOEXPAND if you're not on Enterprise).
Further: look at the example here which demonstrates that the Query Optimizer can consider indexed views on queries that don't reference them. You might be able to design an indexed view around predicates that are commonly used and gain some performance.
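A rough sketch of what that could look like; dbo.VendorTable, ColumnA and ColumnB stand in for the vendor's real objects, and it assumes ColumnA is unique and short enough to serve as an index key (text columns and keys over 900 bytes won't work):

-- schema-bound view over the vendor's table
CREATE VIEW dbo.VendorTable_Indexed
WITH SCHEMABINDING
AS
SELECT ColumnA, ColumnB
FROM dbo.VendorTable;
GO

-- the unique clustered index materializes the view...
CREATE UNIQUE CLUSTERED INDEX IX_VendorTable_Indexed
    ON dbo.VendorTable_Indexed (ColumnA);

-- ...after which further non-clustered indexes can be added
CREATE NONCLUSTERED INDEX IX_VendorTable_Indexed_ColumnB
    ON dbo.VendorTable_Indexed (ColumnB);
GO

-- outside Enterprise edition, query the view directly with NOEXPAND
SELECT ColumnA, ColumnB
FROM dbo.VendorTable_Indexed WITH (NOEXPAND)
WHERE ColumnA = 'some-key-value';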
I have some very large databases (some up to 150M rows) I'm working with, and after initially inserting the data there aren't many INSERTs going on; just a lot of SELECTs and JOINs.
I've been messing around with Infobright a lot (the community version), and whilst I believe it is a good engine, I personally have been having some problems getting it to run like it should (fast).
So I was wondering if anyone else could recommend any other fast free storage engine for MySQL?
I'm just now checking out tokudb; is there anything else out there to check out as well?
You should look at InfiniDB too. http://infinidb.org/ (one of the fastest)
There are a lot of considerations you need to make before benchmarking any engine. Hardware stuff like multicore processors, memory, configuration. Design stuff related to your schema etc etc. and how all this impacts the engine performance.
Do check this blog out for how they do benchmarking of engines (it names other engine types) - http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
Note that this comparison is for a star schema design. If a columnar DB engine doesn't suit your requirements, you can look into XtraDB, which is an extended version of InnoDB (not the fastest, but ACID compliant).
PS - Always track the properties (important to you) of each engine, like referential integrity checks, ACID compliance, etc. Sometimes these limitations can be bigger deal-breakers than a 10% increase in query performance.
Have you looked at Sphinx at all? While it is a search engine, it also supports query-less searches, which are similar to standard SELECT queries with indexes. I found it to be a huge help when dealing with large datasets. It's very fast, and is used heavily by high-traffic forums that are up in the millions (or hundreds of millions) of posts arena.
There is also a plugin for MySQL called SphinxSE which allows Sphinx to act as a MySQL storage engine, making integration very easy to set up. You build your indexes by supplying the indexer program a query, and then once it's all set up, you can query it as if it were a normal table.
http://sphinxsearch.com/docs/2.0.1/sphinxse-overview.html (note, I haven't used it much since pre 1.0)
Besides taking into consideration which DBMS you use, you should also focus on optimizing your tables, indices and queries.
Whenever you have multiple joins, join first on the most selective relation and then on the less selective.
Analyze your query execution plans.
Create indices on columns that are hit often in your QEPs.
Brett -
When using Infobright, you get the best performance gains by:
1) Utilizing the Knowledge Grid as much as possible
2) reducing joins
3) creating 'lookup' columns
Since the Knowledge-Grid is in-memory, you can kill off a lot of query time just by adding additional filters. Also, consider using a nested select instead of a join. By doing so, you can use an already-created knowledge node (instead of generating a pack-to-pack node on the fly).
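To make the join-vs-nested-select point concrete, here is a rough illustration with made-up table names; whether the rewrite actually helps depends on your data and on what is already in the Knowledge Grid:

-- join form: the filter on the dimension table is applied through a join
select f.state, count(*) as cnt
from fact_table f
join dim_status d on d.status_id = f.status_id
where d.label = 'current smoker'
group by f.state;

-- nested-select form: the lookup is resolved first, and the outer query
-- becomes a simple filtered scan of the fact table
select f.state, count(*) as cnt
from fact_table f
where f.status_id in (select status_id from dim_status where label = 'current smoker')
group by f.state;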
If you have some queries that you think should be faster, post them, and I can help with potentially modifying the query to make it run faster.
Cheers,
Jeff
I'm thinking about moving from MySQL to Postgres for Rails development and I just want to hear what other developers that made the move have to say about it.
I'm looking for personal experiences, not a MySQL vs Postgres shootout, just the pros and cons that you yourself have arrived at. Stuff that folks might not necessarily think of.
Feel free to explain why you moved in the first place as well.
I made the switch and frankly couldn't be happier. While Postgres lacks a few things MySQL has (INSERT IGNORE, REPLACE, upsert stuff, and LOAD DATA INFILE, mainly, for me), the features it does have more than make up for it. Its stored procedures are so much more powerful, and it's far easier to write complex functions and aggregates in Postgres.
Performance-wise, if you're comparing to InnoDB (which is only fair because of MVCC), then it feels at least as fast, possibly faster - we weren't able to do some real measurements here due to some constraints, but there certainly hasn't been a performance issue. The complex queries with several joins are certainly faster, MUCH faster.
I find you're more likely to get the correct answer to your issue from the Postgres community. Everybody and their grandmother has 50 different ways to do something in MySQL. With Postgres, hit up the mailing list and you're likely to get lots of very very good help.
Any of the syntax and the like differences are a bit trivial.
Overall, Postgres feels a lot more "grown-up" to me. I used MySQL for years and I now go out of my way to avoid it.
Oh dear, this could end in tears.
Speaking from personal experience only, we moved from MySQL solely because our production system (Heroku) is running PostgreSQL. We had custom-built-for-MySQL queries which were breaking on PostgreSQL. So I guess the moral of the story here is to run the same DBMS over everything, otherwise you may run into problems.
We also sometimes need to insert records über-quick-like. For this, we use PostgreSQL's built-in COPY function, used similarly to this in our app:
query = "COPY users(email) FROM STDIN WITH CSV"
values = users.map! do |user|
# Be wary of the types of the objects here, they matter.
# For instance if you set the id to a string it will error.
%Q{#{user["email"]}}
end.join("\n")
raw_connection.exec(query)
raw_connection.put_copy_data(values)
raw_connection.put_copy_end
This inserts ~500,000 records into the database in just under two minutes. Around about the same time if we add more fields.
Another couple of nice things PostgreSQL has over MySQL:
Full text searching
Geographical querying (PostGIS)
Regex matching: email ~ 'hotmail|gmail', and the negated form is email !~ 'hotmail|gmail'. The | indicates an or.
In summary: PostgreSQL is like bricks & mortar, where MySQL is Lego. Go with whatever "feels" right to you. This is only my personal opinion.
We switched to PostgreSQL for several reasons in early 2007 (or was it the year before?). The main reasons were:
SQL support - PostgreSQL is much better for complex SQL-queries, for example with lots of joins and aggregates
MySQL's stored procedures didn't feel very mature
MySQL license changes - dual licensed, open source and commercial, a split that made me wonder about the future. With PG's BSD license you can do whatever you want.
Faulty behaviour - when MySQL was counting rows, sometimes it just returned an approximated value, not the actual counted rows.
Constraints behaved a bit oddly, inserting truncated/adapted values. See http://use.perl.org/~Smylers/journal/34246
The administrative interface PgAdminIII felt more stable and mature than the MySQL counterpart
PostgreSQL is very solid and crash safe in case of an outage
// John
Haven't made the switch myself, but I got bitten a few times by MySQL's lack of transactional schema changes, which apparently Postgres supports.
This would solve those nasty problems you get when you move from your dev environment with SQLite to your MySQL server and realise your migrations screwed up and were left half-done! (No, I didn't do this on a production server, but it did make a mess of our shared testing server!)
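For anyone wondering what transactional DDL buys you in practice, here is a minimal sketch against a hypothetical users table:

-- PostgreSQL: DDL runs inside the transaction, so a failed migration
-- can be rolled back cleanly
begin;

alter table users add column nickname text;
create index users_nickname_idx on users (nickname);

-- something later in the migration blows up? roll everything back,
-- including the ALTER and CREATE INDEX above
rollback;

-- in MySQL, those DDL statements would have committed implicitly and
-- survived the rollback, leaving the schema half-migrated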