DB with best inserts/sec performance? [closed] - mysql

We deploy an AJAX-based instant messenger which is serviced by a Comet server. To meet legal retention requirements, we have to store the sent messages in a DB for long-term archival.
Which DB engine provides the best performance for this write-once, read-never (with rare exceptions) requirement?
We need at least 5,000 inserts/sec. I am assuming neither MySQL nor PostgreSQL can meet this requirement.
Any proposals for a higher-performance solution? HamsterDB, SQLite, MongoDB ...?

Please ignore the benchmark above; we had a bug in it.
We inserted 1M records with the following columns: id (int), status (int), message (140 chars, random).
All tests were done with the C++ driver on a desktop PC (i5, 500 GB SATA disk).
Benchmark with MongoDB:
1M records inserted without an index
time: 23 s, inserts/s: 43478
1M records inserted with an index on id
time: 50 s, inserts/s: 20000
Next we added 1M records to the same collection, which already had an index and 1M records
time: 78 s, inserts/s: 12820
All of that resulted in close to 4 GB of files on the filesystem.
Benchmark with MySQL:
1M records inserted without an index
time: 49 s, inserts/s: 20408
1M records inserted with an index
time: 56 s, inserts/s: 17857
Next we added 1M records to the same table, which already had an index and 1M records
time: 56 s, inserts/s: 17857
Exactly the same performance; no loss on MySQL as the table grows.
During this test MongoDB ate around 384 MB of RAM and loaded 3 CPU cores; MySQL was happy with 14 MB and loaded only 1 core.
Edorian was on the right track with his proposal. I will do some more benchmarking, and I'm sure we can reach 50K inserts/sec on a 2x quad-core server.
I think MySQL is the right way to go.

If you are never going to query the data, then I wouldn't store it in a database at all; you will never beat the performance of simply writing it to a flat file.
What you might want to consider is scaling: what happens when it's too slow to write the data to a flat file? Will you invest in faster disks, or something else?
Another thing to consider is how to scale the service so that you can add more servers without having to coordinate the logs of each server and consolidate them manually.
Edit: You wrote that you want to have it in a database, so I would also consider the security implications of keeping the data online. What happens if your service gets compromised? Do you want your attackers to be able to alter the history of what has been said?
It might be smarter to store it temporarily in a file, and then dump it to an off-site location that isn't accessible if your Internet-facing servers get hacked.
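A minimal sketch of the flat-file approach (the JSON-lines format and field names here are my own choices, not anything from the question):

```python
import json
import time

def append_message(log_path, sender, text):
    """Append one chat message as a JSON line. An append-only flat
    file has no index maintenance or transaction overhead, which is
    why it beats a database for write-once, read-never data."""
    record = {"ts": time.time(), "from": sender, "msg": text}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

A real archiver would keep the file handle open (or batch writes) instead of reopening per message, and rotate the file before shipping it off-site.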

If you don't need to run queries, then a database is not what you need. Use a log file.

it's only stored for legal reasons.
What about the detailed requirements? You mention NoSQL solutions, but these can't promise the data is really stored on disk. In PostgreSQL everything is transaction-safe, so you're 100% sure the data is on disk and available (just don't turn off fsync).
Speed has a lot to do with your hardware, your configuration, and your application. PostgreSQL can insert thousands of records per second on good hardware with a correct configuration, and it can be painfully slow on the same hardware with a plain stupid configuration and/or the wrong approach in your application. A single INSERT is slow, many INSERTs in a single transaction are much faster, prepared statements are faster still, and COPY does magic when you need speed. It's up to you.
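As an illustration of the COPY path, here is a sketch (my own helper, not from the answer) that formats a batch of rows into PostgreSQL's COPY text format; with psycopg2 the resulting buffer could then be handed to `cursor.copy_from` in a single call instead of thousands of INSERTs:

```python
import io

def rows_to_copy_buffer(rows):
    """Format rows as PostgreSQL COPY text format: tab-separated
    fields, \\N for NULL, with backslash/tab/newline escaped, so a
    whole batch can be loaded with one COPY ... FROM STDIN."""
    buf = io.StringIO()
    for row in rows:
        fields = []
        for v in row:
            if v is None:
                fields.append(r"\N")
            else:
                s = str(v)
                # escape characters that are special in COPY text format
                s = s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
                fields.append(s)
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    return buf

# With psycopg2, the buffer would be used roughly like:
#   cur.copy_from(buf, "messages", columns=("id", "status", "message"))
```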

I don't know why you would rule out MySQL. It can handle a high insert rate. If you really want high insert throughput, use the BLACKHOLE table type with replication: writes are essentially appended to a log file that eventually gets replayed into a regular database table on the replica. You could even query the slave without affecting insert speeds.

Firebird can easily handle 5,000 inserts/sec if the table has no indices.

Depending on your system setup, MySQL can easily handle over 50,000 inserts per second.
In tests on a current system I am working on, we got to over 200k inserts per second, with 100 concurrent connections on 10 tables (just a few columns each).
Not saying this is the best choice, since other systems like CouchDB could make replication/backups/scaling easier, but dismissing MySQL solely on the grounds that it can't handle such minor amounts of data is a little too harsh.
I guess there are better (read: cheaper, easier to administer) solutions out there.

Use Event Store (https://eventstore.org); you can read (https://eventstore.org/docs/getting-started/which-api-sdk/index.html) that when using the TCP client you can achieve 15,000-20,000 writes per second. If you ever need to do anything with the data, you can use projections, or do transformations based on the streams, to populate any other datastore you wish.
You can even create a cluster.

If money plays no role, you can use TimesTen:
http://www.oracle.com/timesten/index.html
A complete in-memory database with amazing speed.

I would use a log file for this, but if you must use a database, I highly recommend Firebird. I just tested the speed: it inserts about 10k records per second on quite average hardware (a 3-year-old desktop computer). The table has one compound index, so I guess it would work even faster without it:
milanb#kiklop:~$ fbexport -i -d test -f test.fbx -v table1 -p **
Connecting to: 'LOCALHOST'...Connected.
Creating and starting transaction...Done.
Create statement...Done.
Doing verbatim import of table: TABLE1
Importing data...
SQL: INSERT INTO TABLE1 (AKCIJA,DATUM,KORISNIK,PK,TABELA) VALUES (?,?,?,?,?)
Prepare statement...Done.
Checkpoint at: 1000 lines.
Checkpoint at: 2000 lines.
Checkpoint at: 3000 lines.
...etc.
Checkpoint at: 20000 lines.
Checkpoint at: 21000 lines.
Checkpoint at: 22000 lines.
Start : Thu Aug 19 10:43:12 2010
End : Thu Aug 19 10:43:14 2010
Elapsed : 2 seconds.
22264 rows imported from test.fbx.
Firebird is open source, and completely free even for commercial projects.

I believe the answer will also depend on the hard disk type (SSD or not) and the size of the data you insert. I was inserting single-field rows into MongoDB on a dual-core Ubuntu machine and was hitting over 100 records per second. I introduced some quite large data into a field, and it dropped down to about 9 records per second, with the CPU running at about 175%! The box doesn't have an SSD, so I wonder if I'd have done better with one.
I also ran MySQL, and it was taking 50 seconds just to insert 50 records into a table with 20m records (and about 4 decent indexes, too), so with MySQL as well it will depend on how many indexes you have in place.


Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database

I have a Redshift data warehouse that pulls in data from multiple sources.
One is my MySQL database; the others are some cloud-based databases.
When querying in Redshift, the query response is significantly slower than against the same MySQL table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In MySQL this takes 0.4 seconds; in Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in MySQL; in Redshift it is not, since Redshift is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and MySQL is a relational database that can use indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems, and it is currently at 90 GB.
We have a reporting tool sitting on top that runs native queries to pull data. They are pretty slow, but they are also pulling a ton of data from multiple tables, so I would expect some slowness there. With a simple statement like the one above, though, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run VACUUM and ANALYZE with no improvement.
We have 4 nodes, dc2.large, currently only using 14% of storage. CPU utilization is frequently near 100%. Database connections average about 10 at any given time.
The data warehouse just holds exact copies of the tables from our integrations with the other sources. We are trying to do near-real-time reporting with this.
Just looking for advice on how to improve the performance of our Redshift cluster via configuration changes, some sort of view or dimension-table architecture, or any other tips to help me get the most out of Redshift.
I've worked with clients on this type of issue many times, and I'm happy to help, but this may take some back and forth to narrow in on what is happening.
First, I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct me if this assumption isn't right.
Next, I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question: why a cluster of this size for a table of only 11M rows? I'd guess there are other, much larger data sets in the database, and this table isn't what's driving the cluster size.
The first step in narrowing this down is to go to the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table, but you never know.
You should also look at STL_WLM_QUERY for the query in question and see how much wait time there was when it ran. Queueing can take time, and if you have interactive queries that need faster response times, some WLM configuration may be needed.
It could also be compile time, but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and lots of mostly empty blocks are being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors are likely in play as well. Cluster load: is the cluster busy when this query runs? WLM is one place where things can interfere, but disk I/O bandwidth is a shared resource, and if other queries are abusing the disks, every query's access to disk will be slow. (The same is true of network bandwidth and leader-node workload, but these don't seem central to your issue at the moment.)
As I mentioned, resolving this will likely take some back and forth, so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a UNIQUE (or PRIMARY) key, 0.4 sec sounds like a long network delay. I would expect 0.004 s as a worst case (with SSDs and PRIMARY KEY(id)).
(If leads is a VIEW, then let's see the tables; 0.4 s may be reasonable!)
That query works well for an RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle ad-hoc queries on various columns. See also MariaDB's implementation of "ColumnStore", which would give you both RDBMS and columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
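A toy illustration of why the point lookup behaves so differently (pure Python, nothing Redshift-specific): a B-tree index resolves an id in O(log n) steps, while an index-less columnar lookup has to scan the id column:

```python
from bisect import bisect_left

def indexed_lookup(sorted_ids, target):
    """B-tree-style point lookup: O(log n) comparisons, which is
    why the indexed MySQL query answers almost instantly."""
    i = bisect_left(sorted_ids, target)
    return i if i < len(sorted_ids) and sorted_ids[i] == target else None

def column_scan(ids, target):
    """Columnar-style lookup without an index: every block of the
    id column must be read, O(n), which is why the same query is
    comparatively slow on a columnar store."""
    for i, v in enumerate(ids):
        if v == target:
            return i
    return None
```

In Redshift, making id the sort key at least lets the zone maps skip most blocks, which is the closest analogue to an index for this kind of query.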
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.

MySQL server very high load

I run a website with ~500 real-time visitors, ~50k daily visitors, and ~1.3 million total users. I host my server on AWS, where I use several instances of different kinds. When I started the website, the different instances cost roughly the same. When the website started to gain users, the RDS instance (MySQL DB) CPU constantly kept hitting the roof; I had to upgrade it several times, and now it has started to take up the main part of the performance and monthly cost (around 95% of ~$2.8k/month). I currently use a database server with 16 vCPUs and 64 GiB of RAM, and I also use Multi-AZ deployment to protect against failures. I wonder if it is normal for the database to be that expensive, or if I have done something terribly wrong?
Database Info
At the moment my database has 40 tables; most of them have ~100k rows, some have ~2 million, and one has 30 million.
I have a system that archives rows older than 21 days, when they are not needed anymore.
Website Info
The website mainly uses PHP, but also some NodeJS and Python.
Most of the functions of the website works like this:
Start transaction
Insert row
Get last inserted id (lastrowid)
Do some calculations
Update the inserted row
Update the user
Commit transaction
I also run around 100 bots which poll the database at 10-30 second intervals; they also insert into and update the database sometimes.
Extra
I have done several things to try to lower the load on the database, such as enabling the database cache, using a Redis cache for some queries, removing very slow queries, and upgrading the storage type to "Provisioned IOPS SSD". But nothing seems to help.
These are the changes I have made to the parameter settings:
I have thought about creating a MySQL cluster of several smaller instances, but I don't know if this would help, and I also don't know if it works well with transactions.
If you need any more information, please ask; any help on this issue is greatly appreciated!
In my experience, as soon as you ask the question "how can I scale up performance?" you know you have outgrown RDS (edit: I admit the experience that leads me to this opinion may be outdated).
It sounds like your query load is pretty write-heavy: lots of inserts and updates. You should increase innodb_log_file_size if you can on your version of RDS. Otherwise you may have to abandon RDS and move to an EC2 instance where you can tune MySQL more easily.
I would also disable the MySQL query cache. On every insert/update, MySQL has to scan the query cache to see if there are any cached results that need to be purged. This is a waste of time with a write-heavy workload, and increasing your query cache to 2.56 GB makes it even worse! Set the cache size to 0 and the cache type to 0.
I have no idea what queries you run, or how well you have optimized them. MySQL's optimizer is limited, so it's frequently the case that you can get huge benefits from redesigning SQL queries; that is, changing the query syntax, as well as adding the right indexes.
You should do a query audit to find out which queries account for your high load. A great free tool for this is pt-query-digest (https://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html), which can give you a report based on your slow query log. Download the RDS slow query log with the download-db-log-file-portion CLI command (http://docs.aws.amazon.com/cli/latest/reference/rds/download-db-log-file-portion.html).
Set long_query_time=0, let it run for a while to collect information, then change long_query_time back to the value you normally use. It's important to collect all queries in this log, because you might find that 75% of your load comes from queries under 2 seconds that run so frequently that they are a burden on the server.
After you know which queries are accounting for the load, you can make some informed strategy about how to address them:
Query optimization or redesign
More caching in the application
Scale out to more instances
I think the answer is "you're doing something wrong". It is very unlikely you have reached an RDS limitation, although you may be hitting limits on some parts of it.
Start by enabling detailed monitoring. This will give you some OS-level information which should help determine what your limiting factor really is. Look at your slow query logs and database stats - you may have some queries that are causing problems.
Once you understand the problem - which could be bad queries, I/O limits, or something else - then you can address them. RDS allows you to create multiple read replicas, so you can move some of your read load to slaves.
You could also move to Aurora, which should give you better I/O performance. Or use PIOPS (or allocate more disk, which should increase performance). You are using SSD storage, right?
One other suggestion: if your calculations (step 4 above) take a significant amount of time, you might want to look at breaking them into two or more transactions.
A query_cache_size of more than 50M is bad news. You are writing often -- many times per second per table? That means the QC needs to be scanned many times per second to purge the entries for the table that changed. This is a big load on the system when the QC is 2.5 GB!
query_cache_type should be DEMAND if you can justify it being on at all. And in that case, pepper the SELECTs with SQL_CACHE and SQL_NO_CACHE.
Since you have the slowlog turned on, look at the output with pt-query-digest. What are the first couple of queries?
Since your typical operation involves writing, I don't see an advantage of using readonly Slaves.
Are the bots running at random times? Or do they all start at the same time? (The latter could cause terrible spikes in CPU, etc.)
How are you "archiving" "old" records? It might be best to use PARTITIONing and "transportable tablespaces". Use PARTITION BY RANGE with 21 partitions (plus a couple of extras).
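A sketch of what that partition layout could look like; the helper below (table and column names are placeholders, not from the question) generates the ALTER TABLE for one partition per day plus a couple of spares, so old days can be detached with ALTER TABLE ... DROP PARTITION instead of slow row-by-row DELETEs:

```python
from datetime import date, timedelta

def daily_partition_ddl(table, col, start, days=21, extra=2):
    """Build MySQL PARTITION BY RANGE DDL for rolling daily
    archival: one partition per day plus spares, ending with a
    MAXVALUE catch-all. Placeholder names throughout."""
    parts = []
    for i in range(days + extra):
        d = start + timedelta(days=i + 1)  # partition holds rows < this day
        parts.append(
            "PARTITION p%s VALUES LESS THAN (TO_DAYS('%s'))"
            % (d.strftime("%Y%m%d"), d.isoformat())
        )
    parts.append("PARTITION pmax VALUES LESS THAN MAXVALUE")
    return ("ALTER TABLE %s PARTITION BY RANGE (TO_DAYS(%s)) (\n  %s\n)"
            % (table, col, ",\n  ".join(parts)))
```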
Your typical transaction seems to work with one row. Can it be modified to work with 10 or 100 rows at once? (More than 100 is probably not cost-effective.) SQL is much more efficient at doing lots of rows at once than lots of queries of one row each. Show us the SQL; we can dig into the details.
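The 10-to-100-rows-at-a-time idea can be sketched like this (a hypothetical helper; real code should use the driver's parameter binding rather than string building):

```python
def batched_insert_sql(table, columns, rows, batch_size=100):
    """Yield multi-row INSERT statements, batch_size rows each:
    one round trip and one index-update pass per batch instead of
    per row. Values are assumed numeric / pre-escaped here."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ",".join(
            "(" + ",".join(str(v) for v in row) + ")" for row in batch
        )
        yield "INSERT INTO %s (%s) VALUES %s;" % (table, ",".join(columns), values)
```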
It seems strange to insert a new row, then update it, all in one transaction. Can't you completely compute it before doing the insert? Hanging onto the inserted_id for so long probably interferes with others doing the same thing. What is the value of innodb_autoinc_lock_mode?
Do the "users" interact with each other? If so, in what way?

How do I properly stagger thousands of inserts so as not to lock a table?

I have a server that receives data from thousands of locations all over the world. Periodically, that server connects to my DB server and inserts records with multi-row inserts, some 11,000 rows at a time per statement, in up to 6 insert statements. When this happens, all 6 processes lock the table being inserted into.
What I am trying to figure out is what causes the locking. Am I better off limiting my multi-row inserts to, say, 100 rows at a time and running them end to end? What should I use as guidelines?
The DB server has 100 GB of RAM and 12 processors. It is very lightly used, but when these inserts come in, everything freezes up for a couple of minutes, which disrupts people running reports, etc.
Thanks for any advice. I know I need to stagger the inserts; I am just asking what the recommended way to do this is.
UPDATE: I was incorrect. I spoke to the programmer, and he said that there is a Perl program running that sends single inserts to the server as rapidly as it can -- NOT multi-row inserts. There are (currently) 6 of these Perl processes running simultaneously. One of them is doing 91,000 inserts, one at a time. Perhaps, since we have a lot of RAM, a multi-row insert would be better?
Your question lacks a number of details about how the system is structured. Besides, if you have a database running on a server with 100 GB of RAM, you should have access to a professional DBA rather than relying on internet forums.
But, as lad2025 suggests in a comment, staging tables can probably solve your problem. Your locking is probably caused by indexes, or possibly by triggers. The suggestion is to load the data into a staging table, then leisurely load the data from the staging table into the final table.
One possibility is doing the 11,000 inserts at, say, one per second (which would take about three hours). Although there is more overhead in doing the inserts this way, each would be its own transaction and the locking times would be very short.
Of course, inserting only 1 record at a time may not be optimal. Perhaps 10 or 100 or even 1,000 would suffice. You can manage the inserts using the event scheduler.
And this assumes that the locking scales with the volume of the input data. That is an assumption, but I think a reasonable one in the absence of other information.

IOPS or Throughput? - Determining Write Bottleneck in Amazon RDS Instance

We have nightly load jobs that write several hundred thousand records to a MySQL reporting database running in Amazon RDS.
The load jobs are taking several hours to complete, but I am having a hard time figuring out where the bottleneck is.
The instance is currently running with General Purpose (SSD) storage. By looking at the cloudwatch metrics, it appears I am averaging less than 50 IOPS for the last week. However, Network Receive Throughput is less than 0.2 MB/sec.
Is there any way to tell from this data whether I am being bottlenecked by network latency (we are currently loading the data from a remote server... this will change eventually) or by write IOPS?
If IOPS is the bottleneck, I can easily upgrade to Provisioned IOPS. But if network latency is the issue, I will need to redesign our load jobs to load raw data from EC2 instances instead of our remote servers, which will take some time to implement.
Any advice is appreciated.
UPDATE:
More info about my instance. I am using an m3.xlarge instance. It is provisioned for 500GB in size. The load jobs are done with the ETL tool from pentaho. They pull from multiple (remote) source databases and insert into the RDS instance using multiple threads.
You aren't using much CPU. Your memory usage is very low. An instance with more memory should be a good win.
You're only doing 50-150 IOPS. That's low; you should get 3,000 in a burst on standard SSD-level storage. However, if your database is small, that is probably hurting you, since you get 3 IOPS per GB; so if you are on a 50 GB or smaller database, consider paying for provisioned IOPS.
You might also try Aurora; it speaks MySQL and supposedly has great performance.
If you can spread out your writes, the spikes will be smaller.
A very quick test is to buy provisioned IOPS, but be careful, as you may get fewer than you currently do during a burst.
Another quick way to determine your bottleneck is to profile your job-execution application with a profiler that understands your database driver. If you're using Java, JProfiler will show the characteristics of your job and its use of the database.
A third is to configure your database driver to print statistics about the database workload. This might tell you that you are issuing far more queries than you would expect.
Your most likely culprit when accessing the database remotely is round-trip latency. The impact is easy to overlook or underestimate.
If the remote database has, for example, a 75-millisecond round-trip time, you can't possibly execute more than 1000 ms/s / 75 ms/round trip = 13.3 queries per second on a single connection. There's no getting around the laws of physics.
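That back-of-the-envelope bound as a tiny helper (my own function, just restating the arithmetic above):

```python
def max_queries_per_sec(rtt_ms, connections=1):
    """Upper bound imposed by round-trip latency alone: a
    synchronous client must wait one full round trip per query,
    so each connection can issue at most 1000/rtt_ms per second."""
    return connections * 1000.0 / rtt_ms
```

Multi-row INSERTs, pipelining, or more connections raise that ceiling; faster disks on the server do not.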
The spikes suggest inefficiency in the loading process: it gathers for a while, then loads for a while, then gathers, then loads.
Separate but related: if you don't have the MySQL client/server compression protocol enabled on the client side, find out how to enable it. (The server always supports compression, but the client has to request it during the initial connection handshake.) This won't fix the core problem, but it should improve the situation somewhat, since less data to physically transfer can mean less time wasted in transit.
I'm not an RDS expert, and I don't know if my own particular case can shed any light on yours. Anyway, I hope this gives you some kind of insight.
I have a db.t1.micro with 200 GB provisioned (that gives me 600 IOPS baseline performance) on General Purpose SSD storage.
The heaviest workload is when I aggregate thousands of records from a pool of around 2.5 million rows, drawn from a 10-million-row table and another of 8 million rows. I do this every day. This is what I average (it is steady performance, unlike yours, where I see a pattern of spikes):
Write/ReadIOPS: +600 IOPS
NetworkTrafficReceived/Transmit throughput: < 3,000 Bytes/sec (my queries are relatively short)
Database connections: 15 (workers aggregating on parallel)
Queue depth: 7.5 counts
Read/Write Throughput: 10MB per second
The whole aggregation task takes around 3 hours.
Also check the "10 Tips to Improve the Performance of Your App in AWS" slides from AWS Summit 2014.
I don't know what else to say since I'm not an expert! Good luck!
In my case it was the number of records. I was writing only 30 records per minute and had write IOPS of about the same, 20 to 30. But this was eating at the CPU, which drained the CPU credits quite steeply. So I took all the data in that table, moved it to another "historic" table, and cleared all data from the original table.
CPU dropped back down to normal, while write IOPS stayed about the same, which was fine. The problem: indexes. I think that because so many records needed to be indexed when inserting, it took far more CPU to do the indexing with that number of rows, even though the only index I had was the primary key.
Moral of my story: the problem is not always where you think it lies. Although I had increased write IOPS, that was not the root cause of the problem; rather, it was the CPU being used to do index work on insert that caused the CPU credits to fall.
Not even X-Ray on the Lambda could catch the increased query time. That is when I started to look at the DB directly.
Your queue depth graph shows > 2, which clearly indicates that the IOPS are under-provisioned. (If queue depth is < 2, then the IOPS are over-provisioned.)
I think you have used the default AUTOCOMMIT = 1 (autocommit mode). It performs a log flush to disk for every insert, which exhausted the IOPS.
So it is better (for performance tuning) to use AUTOCOMMIT = 0 before bulk inserts of data in MySQL, with the insert queries looking like this:
SET AUTOCOMMIT = 0;
START TRANSACTION;
-- first 10000 recs
INSERT INTO SomeTable (column1, column2) VALUES (vala1,valb1),(vala2,valb2) ... (val10000,val10000);
COMMIT;
-- next 10000 recs
START TRANSACTION;
INSERT INTO SomeTable (column1, column2) VALUES (vala10001,valb10001),(vala10002,valb10002) ... (val20000,val20000);
COMMIT;
-- next 10000 recs
.
.
.
SET AUTOCOMMIT = 1;
Using the above approach on a t2.micro, I inserted 300,000 records in 15 minutes using PHP.
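The same batching pattern sketched with Python's stdlib sqlite3 (SQLite stands in for MySQL purely to keep the example self-contained; the point is the shape of the batching, since on a durable server each COMMIT costs a log flush to disk):

```python
import sqlite3

def load_rows(n, commit_every):
    """Insert n rows into an in-memory SQLite table, committing
    every `commit_every` rows instead of after every row, which
    is what AUTOCOMMIT = 0 plus explicit transactions buys you."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, msg TEXT)")
    for i in range(n):
        conn.execute("INSERT INTO t VALUES (?, ?)", (i, "x" * 140))
        if (i + 1) % commit_every == 0:
            conn.commit()  # one log flush per batch, not per row
    conn.commit()  # flush any trailing partial batch
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count
```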

database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:
Approximately 400k-500k rows -- about 30 columns, mostly varchar(5-30) and datetime -- will be uploaded each morning. It's approximately 60 MB in flat-file form, but it grows steeply in the DB with the addition of suitable indexes.
Various statistics will be generated from the current day's data.
Reports from these statistics will be generated and stored.
Current data set will get copied into a partitioned history table.
Throughout the day, the current data set (which was copied, not moved) can be queried by end users for information that is not likely to involve constants, but rather relationships between fields.
Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
Before the next day's upload, the current data table is truncated.
This will essentially be version 2 of our existing system.
Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned table, as 5.0 doesn't support that. To get around the tremendous amount of time (hours and hours) it takes to insert records into history, we write each day to an unindexed history_queue table, and then on the weekends, during our slowest time, write the queue to the history table. The problem is that historical queries generated during the week can be several days behind. We can't reduce the indexes on the historical table, or its queries become unusable.
We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release, but we're also strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering if anybody had advice relevant to this situation. Most of the research revolves around website usage. Indexing is really our main beef with MySQL, and it seems like PostgreSQL may help us out through partial indexes and function-based indexes.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" -- is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3, or is it more balanced now?
Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.
NOTE on MyISAM vs Innodb for us:
We were running InnoDB, and for us we found it MUCH slower, like 3-4 times slower. But we were also much newer to MySQL then, and frankly I'm not sure we had the DB tuned appropriately for InnoDB.
We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.
In regards to 5.1:
I've heard the stability concerns with 5.1. Generally, I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up, given the chance to re-engineer the project.
In regards to PostgreSQL gotchas:
COUNT(*) without any WHERE clause is a pretty rare case for us; I don't anticipate this being an issue.
COPY FROM isn't nearly as flexible as LOAD DATA INFILE, but an intermediate loading table will fix that.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building a processing table so that we could avoid inserting records twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used infrequently enough that its absence is tolerable.
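One workaround is to drop the duplicates in the loader before the data ever hits the table; a sketch (my own helper, standing in for INSERT IGNORE's keep-the-first behavior):

```python
def dedupe_rows(rows, key_cols):
    """Drop duplicate rows on the way into a load file, keeping
    the first occurrence of each key -- a loader-side stand-in
    for MySQL's INSERT IGNORE, avoiding the giant GROUP BY."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```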
My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60 GB, covering ~21 million rows, the largest table having 16 million rows. There are an additional ~15 million rows waiting to come into the pipe, but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "job" facility to schedule queries that would regenerate data on a daily basis, suitable for use in analytics.
Running simple aggregates over the large 16-million-record table, the first thing I noticed is how sensitive it is to the amount of RAM available. An increase in RAM at one point allowed a year's worth of aggregates without resorting to sequential table scans.
If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also requires a little effort but once set you'll find it to be effective. Partial indexes are a nice feature for isolating those records that have "edge-case" data in them, I highly recommend this feature if you are looking for exceptions in a sea of similar data.
Lastly, use the table space feature to relocate the data onto a fast drive array.
In my practical experience I have to say that PostgreSQL had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases, in some instances 2x-3x faster); from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement there too; I haven't tested it so far.
Regarding indices: I would recommend dropping them and only creating them again after filling the database with your data; it is much faster.
Furthermore, tune the crap out of your PostgreSQL settings, there is so much to gain from it. The default settings are at least sensible now; in pre-8.2 times pg was optimized for running on a PDA.
In some cases, especially if you have complicated queries, it can help to deactivate nested loops in your settings, which forces pg to use better-performing approaches on your queries.
Ah, yes, did I say that you should go for postgresql?
(An alternative would be Firebird, which is not so flexible, but in my experience it in some cases performs much better than MySQL and PostgreSQL.)
In my experience InnoDB is slightly faster for really simple queries, pg for more complex queries. MyISAM is probably even faster than InnoDB for retrieval, but perhaps slower for indexing/index repair.
These mostly varchar fields, are you indexing them with char(n) indexes?
Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.
ON EDIT:
OK, so you have two problems, query time against the daily, and updating the history, yes?
As to the second: in my experience, MySQL MyISAM is bad at re-indexing. On tables the size of your daily (0.5 to 1M records, with rather wide, denormalized flat-input records), I found it was faster to re-write the table than to insert and wait for the re-indexing and attendant disk thrashing.
So this might or might not help:
CREATE TABLE new_table SELECT * FROM old_table;
copies the table but no indices.
Then insert the new records as normal. Then create the indexes on the new table and wait a while. Drop the old table, and rename the new table to the old table's name.
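The recipe above can be sketched end-to-end. This is a toy version using Python's sqlite3 as a stand-in (SQLite spells the copy CREATE TABLE ... AS SELECT, MySQL just CREATE TABLE ... SELECT); the table names and sizes are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE daily (id INTEGER, payload TEXT)")
cur.execute("CREATE INDEX idx_daily_id ON daily (id)")
cur.executemany("INSERT INTO daily VALUES (?, ?)",
                [(i, "x") for i in range(1000)])

# 1. Copy the data, leaving the index behind.
cur.execute("CREATE TABLE new_daily AS SELECT * FROM daily")
# 2. Append the new batch to the index-free copy (cheap inserts).
cur.executemany("INSERT INTO new_daily VALUES (?, ?)",
                [(i, "y") for i in range(1000, 1500)])
# 3. Build the index once, at the end.
cur.execute("CREATE INDEX idx_new_daily_id ON new_daily (id)")
# 4. Swap: drop the old table, rename the new one into place.
cur.execute("DROP TABLE daily")
cur.execute("ALTER TABLE new_daily RENAME TO daily")

print(cur.execute("SELECT COUNT(*) FROM daily").fetchone()[0])  # → 1500
```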
Edit: In response to the fourth comment: I don't know that MyISAM is always that bad. I know in my particular case, I was shocked at how much faster copying the table and then adding the index was. As it happened, I was doing something similar to what you were doing, copying large denormalized flat files into the database, and then renormalizing the data. But that's an anecdote, not data. ;)
(I also think I found that overall InnoDB was faster, given that I was doing as much inserting as querying. A very special case of database use.)
Note that copying with a select a.*, b.value as foo join ... was also faster than an update a.foo = b.value ... join, which follows, as the update was to an indexed column.
What is not clear to me is how complex the analytical processing is. In my opinion, having 500K records to process should not be such a big problem; in terms of analytical processing, it is a small recordset.
Even if it is a complex job, if you can leave it overnight to complete (since it is a daily process, as I understood from your post), it should still be enough.
Regarding the resulting table, I would not reduce the indexes on the table. Again, you can do the loading overnight, including the index refresh, and have the resulting, updated data set ready for use in the morning, with quicker access than raw (non-indexed) tables.
I saw PostgreSQL used in a data-warehouse-like environment, working with the setup I've described (data transformation jobs overnight), with no performance complaints.
I'd go for PostgreSQL. You need, for example, partitioned tables, which have been in stable Postgres releases since at least 2005; in MySQL they are a novelty. I've heard about stability issues in the new features of 5.1. With MyISAM you have no referential integrity, and transactions and concurrent access suffer a lot; read the blog entry "Using MyISAM in production" for more.
And Postgres is much faster on complicated queries, which will be good for your #6.
There is also a very active and helpful mailing list, where you can get support even from core Postgres developers for free. It has some gotchas though.
The Infobright people appear to be doing some interesting things along these lines:
http://www.infobright.org/
If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.
Check your hardware. Are you maxing the IO? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.
If you have too many indexes, it'll slow inserts down substantially.
How are you doing your inserts? If you're doing one record per INSERT statement:
INSERT INTO TABLE blah VALUES (?, ?, ?, ?)
and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:
INSERT INTO TABLE blah VALUES
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?)
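The difference is easy to see with any driver's batch API. Here is a sketch using Python's sqlite3 as a stand-in (the blah table and its columns are just placeholders from the example above); executemany plays the role of the multi-row VALUES list:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE blah (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
rows = [(i, i, i, i) for i in range(100_000)]

# Slow path: one statement (and, on a real server, one round trip)
# per row -- this is the shape that takes hours at 500K rows.
# for row in rows:
#     cur.execute("INSERT INTO blah VALUES (?, ?, ?, ?)", row)

# Fast path: one prepared statement, all rows batched inside a
# single transaction.
t0 = time.perf_counter()
cur.executemany("INSERT INTO blah VALUES (?, ?, ?, ?)", rows)
conn.commit()
print(f"inserted {len(rows)} rows in {time.perf_counter() - t0:.2f}s")
```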
If you're doing one insert per web requests, you should consider logging to the file system and doing bulk imports on a crontab. I've used that design in the past to speed up inserts. It also means your webpages don't depend on the database server.
It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
The other thing I can suggest is be wary of the SQL hammer -- you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?
EDIT
If you're having troubles batch importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.
Have you tried playing with the key_buffer_size parameter (the MyISAM key buffer)? It is very important for index update speed.
Also, if you have indexes on date, id, etc. (which are correlated columns), you can do:
INSERT INTO archive SELECT .. FROM current ORDER BY id (or date)
The idea is to insert the rows in order; in this case the index update is much faster. Of course this only works for the indexes that agree with the ORDER BY; if you have some rather random columns, those won't be helped.
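A sketch of that ordered archive insert, again with Python's sqlite3 standing in for MySQL (the current/archive tables are hypothetical): the ORDER BY makes the index updates append-only rather than scattered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE current (id INTEGER, msg TEXT)")
cur.execute("CREATE TABLE archive (id INTEGER, msg TEXT)")
cur.execute("CREATE INDEX idx_archive_id ON archive (id)")
cur.executemany("INSERT INTO current VALUES (?, ?)",
                [(i, "m") for i in (5, 3, 9, 1, 7)])

# Rows arrive at the indexed table sorted on the indexed column,
# so each index update lands at the right-hand edge of the B-tree
# instead of at a random page.
cur.execute("INSERT INTO archive SELECT * FROM current ORDER BY id")

print(cur.execute("SELECT COUNT(*) FROM archive").fetchone()[0])  # → 5
```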
but strongly considering PostgreSQL.
You should definitely test it.
it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
Yep.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Well that depends. As with any database,
IF YOU DON'T KNOW HOW TO CONFIGURE AND TUNE IT, IT WILL BE SLOW
If your hardware is not up to the task, it will be slow
Some people who know MySQL well and want to try Postgres don't factor in the fact that they need to re-learn some things and read the docs; as a result, a really badly configured Postgres gets benchmarked, and that can be pretty slow.
For web usage, I've benchmarked a well configured postgres on a low-end server (Core 2 Duo, SATA disk) with a custom benchmark forum that I wrote and it spit out more than 4000 forum web pages per second, saturating the database server's gigabit ethernet link. So if you know how to use it, it can be screaming fast (InnoDB was much slower due to concurrency issues). "MyISAM is faster for small simple selects" is total bull, postgres will zap a "small simple select" in 50-100 microseconds.
Now, for your usage, you don't care about that ;)
You care about the ways your database can compute Big Aggregates and Big Joins, and a properly configured postgres with a good IO system will usually win against a MySQL system on those, because the optimizer is much smarter, and has many more join/aggregate types to choose from.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting the same records in twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used just infrequently enough for the lack of it to be tolerable.
You can use a GROUP BY, but if you want to insert into a table only records that are not already there, you can do this :
INSERT INTO target SELECT .. FROM source LEFT JOIN target ON (...) WHERE target.id IS NULL
In your use case you have no concurrency problems, so that works well.
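Here is a concrete (hypothetical-schema) version of that anti-join insert, runnable against SQLite via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, val TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])
cur.executemany("INSERT INTO source VALUES (?, ?)",
                [(2, "b"), (3, "c"), (4, "d")])

# The LEFT JOIN leaves target.id NULL exactly for the source rows
# that target does not have yet, so only those get inserted --
# the same effect INSERT IGNORE gives you in MySQL.
cur.execute("""
    INSERT INTO target
    SELECT source.id, source.val
    FROM source LEFT JOIN target ON source.id = target.id
    WHERE target.id IS NULL
""")

print(cur.execute("SELECT id FROM target ORDER BY id").fetchall())
# → [(1,), (2,), (3,), (4,)]
```

Only ids 3 and 4 are added; the duplicate id 2 is filtered out by the anti-join, with no GROUP BY cleanup pass needed afterwards.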