Run analytics on huge MySQL database

Run analytics on huge MySQL database - mysql

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and whole database goes upto 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
having smokingstatus='current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
Have Also looked into Hadoop but I am not sure how can I implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option but since requirements for this kind of ad-hoc aggregated values keeps changing.

Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the InfoBright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - its pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of you system it may take more care & feeding than other options.

1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many read so you do per day and how many writes? Can you live with some lag, e.g write to a new table each day and merge it to the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
Hope that helps

MySQL is have a serious limitation what prevent him to be able to perform good on such scenarious. The problem is a lack of parralel query capability - it can not utilize multiple CPUs in the single query.
Hadoop has an RDMBS like addition called Hive. It is application capable of translate your queries in Hive QL (sql like engine) into the MapReduce jobs. Since it is actually small adition on top of Hadoop it inherits its linear scalability
I would suggest to deploy hive alongside MySQL, replicate daily data to there and run heavy aggregations agains it. It will offload serious part of the load fro MySQL. You still need it for the short interactive queries, usually backed by indexes. You need them since Hive is iherently not-interactive - each query will take at least a few dozens of seconds.
Cassandra is built for the Key-Value type of access and does not have scalable GroupBy capability build-in. There is DataStax's Brisk which integrate Cassandra with Hive/MapReduce but it might be not trivial to map your schema into Cassandra and you still not get flexibility and indexing capabiilties of the RDBMS.
As a bottom line - Hive alongside MySQL should be good solution.

Related

Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database

I have a Redshift datawarehouse that is pulling data in from multiple sources.
One is my from MySQL and the others are some cloud based databases that get pulled in.
When querying in redshift, the query response is significantly slower than the same mysql table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In mysql this takes .4 seconds. In Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in mysql and in redshift it is not since it is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and Mysql is a relational database that is able to utilize indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems and it is currently at 90 GB.
We have a reporting tool sitting on top that does native queries to pull data. They are pretty slow but are also pulling a ton of data from multiple tables. I would expect some slowness with these, but with a simple statement like above, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run vacuum and analyze with no improvement.
We have 4 nodes, dc2.large. Currently only using 14% storage. CPU utilization is frequently near 100%. Database connections averages about 10 at any given time.
The datawarehouse just has exact copies of the tables from our integration with the other sources. We are trying to do near real-time reporting with this.
Just looking for advice on how to improve performance of our redshift via configuration changes, some sort of view or dim table architecture, or any other tips to help me get the most out of redshift.

I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question is wide this size of cluster for a table of only 11M rows? I'd guess it is that there are other much larger data sets on the database and that this table isn't setting the size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load - is the cluster busy when this query runs? WLM is one place that things can interfere but disk IO bandwidth is a share resource and if some other queries are abusing the disks this will make every query's access to disk slow. (Same is true of network bandwidth and leader node workload but these don't seem to be central to your issue at the moment.)
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.

(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004 as a worst-case (with SSDs and `PRIMARY KEY(id)).
(If leads is a VIEW, then let's see the tables. 0.4s may be be reasonable!)
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.

What data quantity is considered as too big for MySQL?

I am looking for a free SQL database able to handle my data model. The project is a production database working in a local network not connected to the internet without any replication. The number of application connected at the same times would be less than 10.
The data volume forecast for the next 5 years are:
3 tables of 100 millions rows
2 tables of 500 millions rows
20 tables with less than 10k rows
My first idea was to use MySQL, but I have found around the web several articles saying that MySQL is not designed for big database. But, what is the meaning of big in this case?
Is there someone to tell me if MySQL is able to handle my data model?
I read that Postgres would be a good alternative, but require a lot of hours for tuning to be efficient with big tables.
I don't think so that my project would use NOSQL database.
I would know if someone has some experience to share with regarding MySQL.
UPDATE
The database will be accessed by C# software (max 10 at the same times) and web application (2-3 at the same times),
It is important to mention that only few update will be done on the big tables, only insert query. Delete statements will be only done few times on the 20 small tables.
The big tables are very often used for select statement, but the most often in the way to know if an entry exists, not to return grouped and ordered batch of data.

I work for Percona, a company that provides consulting and other services for MySQL solutions.
For what it's worth, we have worked with many customers who are successful using MySQL with very large databases. Terrabytes of data, tens of thousands of tables, tables with billions of rows, transaction load of tens of thousands of requests per second. You may get some more insight by reading some of our customer case studies.
You describe the number of tables and the number of rows, but nothing about how you will query these tables. Certainly one could query a table of only a few hundred rows in a way that would not scale well. But this can be said of any database, not just MySQL.
Likewise, one could query a table that is terrabytes in size in an efficient way. It all depends on how you need to query it.
You also have to set specific goals for performance. If you want queries to run in milliseconds, that's challenging but doable with high-end hardware. If it's adequate for your queries to run in a couple of seconds, you can be a lot more relaxed about the scalability.
The point is that MySQL is not a constraining factor in these cases, any more than any other choice of database is a constraining factor.
Re your comments.
MySQL has referential integrity checks in its default storage engine, InnoDB. The claim that "MySQL has no integrity checks" is a myth often repeated over the years.
I think you need to stop reading superficial or outdated articles about MySQL, and read some more complete and current documentation.
MySQLPerformanceBlog.com
High Performance MySQL, 3rd edition
MySQL 5.6 manual

MySQL has a two important (and significantly different) database engines - MyISAM and InnoDB. A limits depends on usage - MyISAM is nontransactional - there is relative fast import, but it is too simple (without own memory cache) and JOINs on tables higher than 100MB can be slow (due too simple MySQL planner - hash joins is supported from 5.6). InnoDB is transactional and is very fast on operations based on primary key - but import is slower.
Current versions of MySQL has not good planner as Postgres has (there is progress) - so complex queries are usually much better on PostgreSQL - and really simple queries are better on MySQL.
Complexity of PostgreSQL configuration is myth. It is much more simple than MySQL InnoDB configuration - you have to set only five parameters: max_connection, shared_buffers, work_mem, maintenance_work_mem and effective_cache_size. Almost all is related to available memory for Postgres on server. Usually work for 5 minutes. On my experience a databases to 100GB is usually without any problems on Postgres (probably on MySQL too). There are two important factors - how speed you expect and how much memory and how fast IO you have.
With large databases you have to have a experience and knowledges for any database technology. All is fast when you are in memory, and when ratio database size/memory is higher, then much more work you have to do to get good results.

First of all, MySQLs table size is only limited by the allowed file size limit of your OS which is I. The terra bytes on any modern OS. That would pose no problems. Most important are questions like this:
What kind of queries will you run?
Are the large table records updated frequently or basically archives for history data?
What is your hardware budget?
What is the kind of query speed you need?
Are you familiar with table partitioning, archive tables, config tuning?
How fast do you need to write (expected inserts per second)
What language will you use to connect to the db (Java, .net, Ruby etc)
What platform are you most familiar with?
Will you run queries which might cause table scans such like '%something%' which would have to go through every single row and take forever
MySQL is used by Facebook, google, twitter and others with large tables and 100,000,000 is not much in the age of social media. MySQL has very little drawbacks (even though I prefer postgresql in most cases) like altering large tables by adding a new index for example. That might send your company in a couple days forced vacation if you don't have a replica in the meantime. Is there a reason why NoSQL is not an option? Sometimes hybrid approaches are a good choice like having your relational business logic in MySQL and huge statistical tables in a NoSQL database like MongoDb which can scale by adding new servers in minutes (MySQL can too but it's more complicated). Now MongoDB can have a indexed column which can be searched by in blistering speed.
Bejond the bottom line: you need to answer the above questions first to make a very informed decision. If you have huge tables and only search on indexed keys almost any database will do - if you expect many changes to the structure down the road you want to use a different approach.
Edit:
Based on your update you just posted I doubt you would run into problems.

Optimizing mysql/postgresql for create and update

As far as I know most of the RDBMS packages are built keeping in mind 99% of the queries will be select queries. However, I am in a situation where we have at least 50 % of the queries as create/update queries. Since we also need persistence, we can not go for NoSQL solutions. Essentially, whenever there is an update it should be immediately stored permanently. So, I was wondering if the performance with MySQL will be hampered because of that. Our current MySQL engine is InnoDb. Is any other MySQL engine more preferable? I plan to use Amazon RDS so my focus is MySQL; but just out of curiousity I would like to know if postgresql can help in this.
N.B. - Just to give an idea of the scale, we are talking about create/update queries on tables with at least a million entries within a couple of months of going into production.

If your working set fits in memory, your inserts and updates will tend to be quite fast. Partitioning can help here, as others have mentioned. Most NoSQL solutions have persistence so you shouldn't exclude them outright. Cassandra has a storage model specifically tuned for writes and might be worth a look.
If you go with MySQL, there are tuning parameters to trade some durability for insert performance, and various other hardware and software settings:
https://serverfault.com/questions/118504/how-to-improve-mysql-insert-and-update-performance
You can probably expect around 100 inserts / sec using full durability on standard disks. If that's not going to cut it, setup benchmarks and start tweaking parameters or get ready for some re-architecting. Benchmark testing is important using realistic amounts of data in your tables. It's much better to find a problem now than to discover it 6 months down the road when your tables start to fill in. Synthetic data is fine, just make sure the indexed fields are distributed similarly.

Having as few as possible indexes increases speed of inserts and updates, because all indexes have to get updated when inserting/updating rows to the tables.
But of course, keep in mind that some indexes might increase your updates as weel.

Hadoop (+HBase/HDFS) vs Mysql (or Postgres) - Loads of independent, structured data to be processed and queried

Hi there at SO,
I would like some ideas/comments on the following from you honorable and venerable bunch.
I have a 100M records which I need to process. I have 5 nodes (in a rocks cluster) to do this. The data is very structured and falls nicely in the relational data model. I want to do things in parallel since my processing takes some time.
As I see it I have two main options:
Install mysql on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query Capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q. Is this what they call mysql/postgres cluster?). The really bad part is that the processing of the records is left up to me now to take care of (how to distribute across machines etc)...
Alternatively install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we loose the "real time" query capabilities (I know you can use Hive, but that is not suggested for real time queries - which I need) - since I also have some normal sql queries to execute at times "select * from wine where colour = 'brown'".
Note that in theory - if I had 100M machines I could do the whole thing instantly since for each record the processing is independent of the other. Also - my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it) so keeping it in BOTH mysql/postgres and Hadoop/HBase/HDFS. is not a real option.
Many Thanks

Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like that you're not performing complex queries. Without knowing exactly what kind of processing, here is what I would do, in this order:
Keep the 100M in MySQL. Take a look at Cloudera's Sqoop utility to import records from the database and process them in Hadoop.
If MySQL is the bottleneck in (1), consider setting up slave replication, which will let you parallelize reads, without the complexity of a sharded database. Since you've already stated that you don't need to write back to the database, this should be a viable solution. You can replicate your data to as many servers as needed.
If you are running complex select queries from the database, and (2) is still not viable, then consider using Sqoop to import your records and do whatever query transformations you require in Hadoop.
In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.

There are a few questions to ask, before suggesting.
Can you formulate your queries to access by primary key only? In other words - if you can avoid all joins and table scans. If so - HBase is an option, if you need very high rate of read/write accesses.
I do noth thing that Hive is good option taking into consideration low data volume. If you expect them to grow significantly - you can consider it. In any case Hive is good for the analytical workloads - not for the OLTP type of processing.
If you do need relational model with joins and scans - I think good solution might be one Master Node and 4 slaves, with replication between them. You will direct all writes to the master, and balance reads among whole cluster. It is especially good if you have much more reads then writes.
In this schema you will have all 100M records (not that match) on each node. Within each node you can employ partitioning if appropriate.

You may also want to consider using Cassandra. I recently discovered this article on HBase vs. Cassandra which I was reminded of when I read your post.
The gist of it is that Cassandra is a highly scallable NoSQL solution with fast querying, which sort of sounds like the solution you're looking for.
So, it all depends on whether you need to maintain your relational model or not.

HI,
I had a situation where I had many tables which I created in parallel using sqlalchemy and the python multiprocessing library. I had multiple files, one per table, and loaded them using parallel COPY processes. If each process corresponds to a separate table, that works well. With one table, using COPY would be difficult. You could use tables partitioning in PostgreSQL, I guess. If you are interested I can give more details.
Regards.

database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:
Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. Its approximately 60MB while in flat file form, but grows steeply in the DB with the addition of suitable indexes.
Various statistics will be generated from the current day's data.
Reports from these statistics will be generated and stored.
Current data set will get copied into a partitioned history table.
Throughout the day, the current data set (which was copied, not moved) can be queried by end users for information that is not likely to include constants, but relationships between fields.
Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
Before the next day's upload, the current data table is truncated.
This will essentially be version 2 of our existing system.
Right now, we're using MySQL 5.0 MyISAM tables (Innodb was killing on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned tabled as 5.0 doesn't support it. In order to get around the tremendous amount of time (hours and hours) its taking to insert records into history, we're writing each day to an unindexed history_queue table, and then on the weekends during our slowest time, writing the queue to the history table. The problem is that any historical queries generated in the week are possibly several days behind then. We can't reduce the indexes on the historical table or its queries become unusable.
We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release but strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering if anybody had any advice relevant to this situation. Most of the research is revolving around web site usage. Indexing is really our main beef with MySQL and it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.
NOTE on MyISAM vs Innodb for us:
We were running Innodb and for us, we found it MUCH slower, like 3-4 times slower. BUT, we were also much newer to MySQL and frankly I'm not sure we had db tuned appropriately for Innodb.
We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.
In regards to 5.1:
I've heard the stability issues concern with 5.1. Generally I assume that any recently (within last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up given the chance to re-engineer the project.
In regards to PostgreSQL gotchas:
COUNT(*) without any where clause is a pretty rare case for us. I don't anticipate this being an issue.
COPY FROM isn't nearly as flexible as LOAD DATA INFILE but an intermediate loading table will fix that.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think its used just infrequently enough for the lack of it to be tolerable.

My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60Gbyte, covering over ~ 21 million rows, the largest table having 16 million rows. There's an additional ~15 million rows waiting to come into the pipe but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "Job" facility to schedule queries that would regenerate data on a daily basis suitable for use in analytics.
Running simple aggregates over the large 16-million record table, the first thing I noticed is how sensitive it is to the amount of RAM available. An increase in RAM at one point allowed for a year's worth of aggregates without resorting to sequential table scans.
If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also requires a little effort but once set you'll find it to be effective. Partial indexes are a nice feature for isolating those records that have "edge-case" data in them, I highly recommend this feature if you are looking for exceptions in a sea of similar data.
Lastly, use the table space feature to relocate the data onto a fast drive array.

In my practical experience I have to say, that postgresql had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases in some instances 2x-3x faster), from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement too, I havent tested it so far.
Regarding indices, I would recommend to drop those, and only create them again after filling the database with your data, it is much faster.
Further improve the crap out of your postgresql settings, there is so much gain from it. The default settings are at least sensible now, in pre 8.2 times pg was optimized for running on a pda.
In some cases, especially if you have complicated queries it can help to deactivate nested loops in your settings, which forces pg to use better performing approaches on your queries.
Ah, yes, did I say that you should go for postgresql?
(An alternative would be firebird, which is not so flexible, but in my experience it is in some cases performing much better than mysql and postgresql)

In my experience Inodb is slighly faster for really simple queries, pg for more complex queries. Myisam is probably even faster than Innodb for retrieval, but perhaps slower for indexing/index repair.
These mostly varchar fields, are you indexing them with char(n) indexes?
Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.
ON EDIT:
OK, so you have two problems, query time against the daily, and updating the history, yes?
As to the second: in my experience, mysql myism is bad at re-indexing. On tables the size of your daily (0.5 to 1M records, with rather wide (denormalized flat input) records), I found it was faster to re-write the table than to insert and wait for the re-indexing and attendant disk thrashing.
So this might or might not help:
create new_table select * from old_table ;
copies the tables but no indices.
Then insert the new records as normally. Then create the indexes on new table, wait a while. Drop old table, and rename new table to old table.
Edit: In response to the fourth comment: I don't know that MyIsam is always that bad. I know in my particular case, I was shocked at how much faster copying the table and then adding the index was. As it happened, I was doing something similar to what you were doing, copying large denormalized flat files into the database, and then renormalizing the data. But that's an anecdote, not data. ;)
(I also think I found that overall InnoDb was faster, given that I was doing as much inserting as querying. A very special case of database use.)
Note that copying with a select a.*, b.value as foo join ... was also faster than an update a.foo = b.value ... join, which follows, as the update was to an indexed column.

What is not clear to me is how complex the analytical processing is. In my oppinion, having 500K records to process should not be such a big problem, in terms of analytical processing, it is a small recordset.
Even if it is a complex job, if you can leave it over night to complete (since it is a daily process, as I understood from your post), it should still be enough.
Regarding the resulted table, I would not reduce the indexes of the table. Again, you can do the loading over night, including indexes refresh, and have the resulted, updated data set ready for use in the morning, with quicker access than in case of raw tables (non-indexed).
I saw PosgreSQL used in a datawarehouse like environment, working on the setup I've described (data transformation jobs over night) and with no performance complaints.

I'd go for PostgreSQL. You need for example partitioned tables, which are in stable Postgres releases since at least 2005 - in MySQL it is a novelty. I've heard about stability issues in new features of 5.1. With MyISAM you have no referential integrity, transactions and concurrent access suffers a lot - read this blog entry "Using MyISAM in production" for more.
And Postgres is much faster on complicated queries, which will be good for your #6.
There is also a very active and helpful mailing list, where you can get support even from core Postgres developers for free. It has some gotchas though.

The Infobright people appear to be doing some interesting things along these lines:
http://www.infobright.org/
-- psj

If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.

Check your hardware. Are you maxing the IO? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.
If you have too many indexes, it'll slow inserts down substantially.
How are you doing your inserts? If you're doing one record per INSERT statement:
INSERT INTO TABLE blah VALUES (?, ?, ?, ?)
and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:
INSERT INTO TABLE blah VALUES
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?)
If you're doing one insert per web requests, you should consider logging to the file system and doing bulk imports on a crontab. I've used that design in the past to speed up inserts. It also means your webpages don't depend on the database server.
It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
The other thing I can suggest is be wary of the SQL hammer -- you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?
EDIT
If you're having troubles batch importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.

Have you tried playing with the myisam_key_buffer parameter ? It is very important in index update speed.
Also if you have indexes on date, id, etc which are correlated columns, you can do :
INSERT INTO archive SELECT .. FROM current ORDER BY id (or date)
The idea is to insert the rows in order, in this case the index update is much faster. Of course this only works for the indexes that agree with the ORDER BY... If you have some rather random columns, then those won't be helped.
but strongly considering PostgreSQL.
You should definitely test it.
it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
Yep.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Well that depends. As with any database,
IF YOU DONT KNOW HOW TO CONFIGURE AND TUNE IT IT WILL BE SLOW
If your hardware is not up to the task, it will be slow
Some people who know mysql well and want to try postgres don't factor in the fact that they need to re-learn some things and read the docs, as a result a really badly configured postgres is benchmarked, and that can be pretty slow.
For web usage, I've benchmarked a well configured postgres on a low-end server (Core 2 Duo, SATA disk) with a custom benchmark forum that I wrote and it spit out more than 4000 forum web pages per second, saturating the database server's gigabit ethernet link. So if you know how to use it, it can be screaming fast (InnoDB was much slower due to concurrency issues). "MyISAM is faster for small simple selects" is total bull, postgres will zap a "small simple select" in 50-100 microseconds.
Now, for your usage, you don't care about that ;)
You care about the ways your database can compute Big Aggregates and Big Joins, and a properly configured postgres with a good IO system will usually win against a MySQL system on those, because the optimizer is much smarter, and has many more join/aggregate types to choose from.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think its used just infrequently enough for the lack of it to be tolerable.
You can use a GROUP BY, but if you want to insert into a table only records that are not already there, you can do this :
INSERT INTO target SELECT .. FROM source LEFT JOIN target ON (...) WHERE target.id IS NULL
In your use case you have no concurrency problems, so that works well.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008