I made an API and am logging all requests to it.
When someone hits the API, there is an insert and then an update (to record the API response code).
During testing the API log has grown to around 200k records, and it could reach a few million very quickly.
Does this kind of logging, i.e. an insert plus an update per request, put a lot of pressure on the server?
My concern is that MySQL will get overloaded by the logging, so I'm not sure whether I should trim the logs every 7 days or so.
A few million records won't hurt the database storage-wise. I've got MySQL tables with hundreds of millions of rows and they work fine when indexed. I would consult the MySQL manual for Data Manipulation Statements and Optimization to see if what you're doing is going to stretch performance limits.
See MySQL's indexing guide if you'd like to know more.
If you're worried about overloading your DB, you could write a cron job to back up your database each day and then truncate/reload the log table after it hits a certain row count, time of day, or whatever threshold suits you. You would then still have a backup of all the records you need.
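As a rough illustration of the trim step, here is a minimal sketch that moves rows older than a cutoff into an archive table and deletes them from the live log. It is an age-based variant of the idea, not the exact truncate/load routine described above. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; the table names `api_log` and `api_log_archive` and their columns are assumptions, and against MySQL you would swap in your own driver and placeholder style.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_and_trim(conn, cutoff):
    """Copy log rows older than `cutoff` into the archive table, then delete them."""
    with conn:  # one transaction: both statements are committed together
        conn.execute(
            "INSERT INTO api_log_archive SELECT * FROM api_log WHERE created_at < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM api_log WHERE created_at < ?", (cutoff,)
        ).rowcount
    return deleted


def demo():
    """Build a tiny in-memory log (hypothetical schema) and trim it to 7 days."""
    conn = sqlite3.connect(":memory:")
    schema = "(id INTEGER PRIMARY KEY, created_at TEXT, status INTEGER)"
    conn.execute("CREATE TABLE api_log " + schema)
    conn.execute("CREATE TABLE api_log_archive " + schema)

    now = datetime(2024, 1, 15)
    rows = [((now - timedelta(days=d)).isoformat(), 200) for d in range(10)]
    conn.executemany("INSERT INTO api_log (created_at, status) VALUES (?, ?)", rows)

    cutoff = (now - timedelta(days=7)).isoformat()
    removed = archive_and_trim(conn, cutoff)
    remaining = conn.execute("SELECT COUNT(*) FROM api_log").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM api_log_archive").fetchone()[0]
    return removed, remaining, archived
```

A cron entry could call something like `archive_and_trim` once a day; on a very large table you would probably want to delete in smaller chunks rather than in one statement.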
Hope this helps.
I'm working on a migration from MySQL to Postgres on a large Rails app, and most operations are performing at a normal rate. However, we have one particular operation that generates job records every 30 minutes or so. Usually about 200 records are generated and inserted, after which separate workers on another server pick up the jobs and work on them.
Under MySQL it takes about 15 seconds to generate the records, and then another 3 minutes for the worker to perform and write back the results, one at a time (so 200 more updates to the original job records).
Under Postgres it takes around 30 seconds, and then another 7 minutes for the worker to perform and write back the results.
The table being written to has roughly 2 million rows, and 1 sequence column under ID.
I have tried tweaking checkpoint timeouts and sizes with no luck.
The table is heavily indexed and really shouldn't be any different than it was before.
I can't post code samples because it's a huge codebase, and without posting pages and pages of code it wouldn't make sense.
My question is, can anyone think of why this would possibly be happening? There is nothing in the Postgres log and the process of creating these objects has not changed really. Is there some sort of blocking synchronous write behavior I'm not aware of with Postgres?
I've added all sorts of logging in my code to spot errors or transaction failures but I'm coming up with nothing, it just takes twice as long to run, which doesn't seem correct to me.
The Postgres instance is hosted on AWS RDS on an M3.Medium instance type.
We also use New Relic, and it's showing nothing of interest here, which is surprising.
Why does your job queue contain 2 million rows? Are they all live, or have you just not moved the old ones to an archive table in order to keep your reporting simpler?
Have you used EXPLAIN on your SQL from a psql prompt or your preferred SQL IDE/tool?
Postgres is a completely different RDBMS than MySQL. It allocates and manipulates space differently, so the table may need to be indexed differently.
Additionally there's a tool called pgtune that will suggest configuration changes.
edit: 2014-08-13
Also, Rails comes with a profiler that might add some insight. Here's a StackOverflow thread about Rails profiling.
You also want to watch your DB server at the disk I/O level. Does your job fulfillment do a large number of updates? Postgres creates a new row version when you update an existing row and marks the old version as available for reuse, instead of overwriting the existing row in place. So you may be seeing a lot more I/O as a result of your RDBMS switch.
I have a large quantity of data in a production database that I want to update with batches of data while the table remains available to end users. The updates could be insertions of new rows or updates of existing rows. The table is approximately 50M rows, and each "batch" will contain between 100k and 1M rows. What I would like is an insert/replace with low priority. In other words, I want the database to work through the batch import slowly, without impacting the performance of other queries running concurrently against the same disk spindles. To complicate this, the data being updated is heavily indexed: 8 b-tree indexes across multiple columns to facilitate various lookups, which adds quite a bit of overhead to the import.
I've thought about breaking the inserts into blocks of 1-2k records and having the external script that loads the data pause for a couple of seconds between each insert, but that's really kind of hokey IMHO. Plus, during a 1M record batch, I really don't want 500-1000 two-second pauses to add 20-40 minutes of extra load time if it's not needed. Anyone have ideas on a better way to do this?
I've dealt with a similar scenario using InnoDB and hundreds of millions of rows. Batching with a throttling mechanism is the way to go if you want to minimize risk to end users. I'd experiment with different pause times and see what works for you. With small batches you have the benefit that you can adjust accordingly. You might find that you don't need any pause if you run this all sequentially. If your end users are using more connections then they'll naturally get more resources.
If you're using MyISAM there's a LOW_PRIORITY option for UPDATE. If you're using InnoDB with replication, be sure to check that the replica isn't getting too far behind because of the extra load. Apparently replication runs in a single thread, and that turned out to be the bottleneck for us. Consequently we programmed our throttling mechanism to check how far behind replication was and pause as needed.
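The batching-with-throttling approach could look roughly like the sketch below. It is only an outline under assumptions: `write_batch` and `is_busy` are hypothetical callables you would supply yourself (for example, one that runs a multi-row INSERT and one that reports whether replication lag is above your threshold), and the default batch size and pause are placeholders to tune.

```python
import time
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_throttled(rows, write_batch, batch_size=1000,
                   is_busy=lambda: False, pause_seconds=2.0, sleep=time.sleep):
    """Write rows in small batches, pausing only while the server looks busy.

    `write_batch` and `is_busy` are caller-supplied (hypothetical) hooks;
    `sleep` is injectable so the pause logic can be tested without waiting.
    """
    batches = 0
    for batch in chunked(rows, batch_size):
        while is_busy():          # e.g. replication lag above a threshold
            sleep(pause_seconds)  # back off, then check again
        write_batch(batch)
        batches += 1
    return batches
```

Because the pause happens only while `is_busy()` returns true, a quiet server gets the whole batch at full speed instead of paying a fixed delay after every block.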
An INSERT DELAYED might be what you need. From the linked documentation:
Each time that delayed_insert_limit rows are written, the handler checks whether any SELECT statements are still pending. If so, it permits these to execute before continuing.
Check this link: http://dev.mysql.com/doc/refman/5.0/en/server-status-variables.html What I would do is write a script that executes your batch updates only when MySQL is showing Threads_running or Connections under a certain number. Hopefully you have some sort of test server where you can determine a good threshold for either of those status variables. There are plenty of other server status variables to look at in there as well. Maybe control the executions by the Innodb_data_pending_writes number? Let us know what works for you, it's an interesting question!
I have three large MySQL tables. They are approaching 2 million records. Two of the tables are InnoDB and are currently around 500 MB in size. The other table is MyISAM and is about 2.5 GB.
We run an import script from FileMaker to insert and update records in these tables but lately it has become very slow - only inserting a few hundred records per hour.
What can I do to increase performance to make inserts and updates happen faster?
For INSERT it could have to do with the indexes you have defined on the tables (they have to be updated after each INSERT). Could you post more information about them? And are there triggers set on the tables?
For UPDATE it is a different story: it may be that finding the record is slow rather than the update itself. Could you try changing the UPDATE into a SELECT with the same WHERE clause and see if it is still slow? If yes, then you should investigate your indexes.
For the InnoDB tables, if it's an acceptable risk, I'd consider changing the innodb_flush_log_at_trx_commit level. Some more details in this blog post, along with some more InnoDB tuning pointers.
For both engines, batching INSERTs together can speed things up to a point. See doc.
What version of MySQL are you running? There have been many improvements with the new InnoDB "Plugin" engine and concurrency of operations on servers with multiple processors.
Is the query slow when executed on MySQL from the command line?
If you're using the Execute SQL Script step from FileMaker, that connects and disconnects after every call, causing major slowdowns when executing large numbers of queries. We've had clients switch to our JDBC plugin (self-promotion disclaimer here) to avoid this, resulting in major speedups.
It turns out the reason for the slowness was from the FileMaker side of things. Exporting the FileMaker records to a CSV and running INSERT/UPDATE commands resulted in very fast execution.
I have an application where I receive 40,000 rows each day. I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Currently, those rows are all stored in the same table, which makes it slow to update, hard to back up, etc.
What kind of scheme is used in such applications to allow long-term access to the data without the problems of overly large tables, while keeping backups easy and reads/writes fast?
Is PostgreSQL better than MySQL for such a purpose?
1 - 40,000 rows / day is not that big.
2 - Partition your data by insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a data mart step (compute frequently requested metrics in intermediary tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant.
We have log tables of 100-200 million rows now, and it is quite painful.
Backup is impossible; it requires several days of downtime.
Purging old data is becoming too painful; it usually ties up the database for several hours.
So far we've only seen these solutions:
Backup: set up a MySQL slave. Backing up the slave doesn't impact the main DB. (We haven't done this yet, because the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)
Purging old data: the only painless way we've found is to introduce a new integer column that identifies the current date and partition the tables on that key, per day (requires MySQL 5.1). Dropping old data is then a matter of dropping a partition, which is fast.
If, in addition, you need to run transactions on these tables continuously (as opposed to loading data every now and then and mostly querying it), you probably need to look into InnoDB rather than the default MyISAM tables.
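To make the partition-per-day purge concrete, here is a small sketch of the housekeeping logic: given the partitions that exist, work out which ones fall outside the retention window and build the statement that drops them. The `pYYYYMMDD` naming scheme, the `request_log` table name, and the 7-day window are assumptions for illustration; the code only builds the SQL string and does not talk to a server.

```python
from datetime import date, timedelta


def partition_name(day):
    """Hypothetical naming scheme: one partition per day, e.g. p20240102."""
    return "p" + day.strftime("%Y%m%d")


def expired_partitions(existing, today, keep_days):
    """Return the partitions older than the retention window, oldest first."""
    cutoff = partition_name(today - timedelta(days=keep_days))
    # Names sort chronologically because the date part is zero-padded YYYYMMDD.
    return sorted(name for name in existing if name < cutoff)


def drop_statement(table, names):
    """Build a MySQL ALTER TABLE ... DROP PARTITION statement for the given names."""
    return "ALTER TABLE {} DROP PARTITION {}".format(table, ", ".join(names))
```

For example, with partitions `p20240101`, `p20240102` and `p20240110`, a `today` of 2024-01-10 and `keep_days=7`, the first two are reported as expired and can be dropped in a single statement.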
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
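The roll-up idea can be sketched in a few lines. This assumes each sale is a `(day, location, amount)` tuple with the amount in whole cents, which is an invented shape for illustration; in practice the same aggregation would more likely be a GROUP BY query that fills the summary tables.

```python
from collections import defaultdict


def daily_sales(sales):
    """Collapse individual sales into one total per day."""
    totals = defaultdict(int)
    for day, _location, amount in sales:
        totals[day] += amount
    return dict(totals)


def daily_sales_by_location(sales):
    """Collapse individual sales into one total per (day, location) pair."""
    totals = defaultdict(int)
    for day, location, amount in sales:
        totals[(day, location)] += amount
    return dict(totals)
```

Each function returns one record per key, so a month of detail rows shrinks to about thirty daily rows, or thirty times the number of locations for the per-location table.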
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert into, update, or delete from a table, all of its indexes also need to be updated, which slows down the process. If you have a lot of indexes on your log table, you should take a critical look at them and decide whether they are really necessary. If not, drop them.
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/
I need to log all post and get requests on web site in the database.
There will be two tables:
requests with time stamp, user id and requested URI
request parameters with name, value and request id
I will use it only for analytical reports once per month. No regular usage of this data.
I have about one million requests a day and the request parameters table will be very huge.
Can I handle such a large table in MySQL with no problems?
I'd avoid writing to the DB on each request, or you'll be vulnerable to the Slashdot effect. Parse your web logs during quiet times to update the DB.
The usual solution to this type of problem is to write a program that parses the logs for the whole month. If you don't need sophisticated MySQL capabilities, you should consider this approach.
If you really need the database, then consider parsing the logs offline. Otherwise, if your database goes down, you will lose data. Log files are known to be pretty safe.
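A minimal sketch of the offline-parsing approach follows. It assumes the web server writes the common access-log layout (timestamp in square brackets, then the quoted request line); your format may differ, and the field names in the returned dictionary are made up for this example. Note that a standard access log carries neither POST bodies nor a user id, so those would need a custom log format.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Matches: [timestamp] "GET|POST <uri> <protocol>"
LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>GET|POST) (?P<uri>\S+) [^"]*"')


def parse_line(line):
    """Extract timestamp, method, path and query parameters from one log line.

    Returns None for lines that do not look like a GET or POST request.
    """
    match = LINE.search(line)
    if match is None:
        return None
    parts = urlsplit(match.group("uri"))
    return {
        "timestamp": match.group("ts"),
        "method": match.group("method"),
        "path": parts.path,
        "params": parse_qsl(parts.query),  # list of (name, value) pairs
    }
```

Each parsed record maps onto the two tables from the question: one row for the request, plus one row per `(name, value)` pair in `params`, which can then be bulk-inserted during quiet hours.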
Table indexes are not free. The more indexes you have, the faster the queries run, but the slower inserting data becomes.
Yes, MySQL will handle millions of rows without trouble, but depending on what you want to do with the data later and on the indexes on those tables, performance may not be very high.
PS. In my project we have a huge price list with a few million products in it, and it works without any problems.