High frequency insert in MySQL

I have a problem with high-frequency inserts in MySQL. I've searched a lot on the Internet but haven't found a good answer to my problem.
I need to log a lot of events at a very high frequency (~3000 inserts/s => 260 million rows per day). These events are stored in an InnoDB table like this:
log_events :
- id_user : BIGINT
- id_event : SMALLINT
- date : INT
- data : BIGINT (data associated to this event)
My problems are:
- How do I speed up inserts? Events are sent by thousands of visitors and we are not able to bulk insert.
- How do I limit write IO? We are on 6×600 GB SSD drives and have write IO problems.
Do you have any ideas for this kind of problem?
Thanks
François

Do you have any foreign keys on that table? If so, I would consider removing them and adding indexes only on the columns used for reads. This should improve writes.
The second idea is to use an in-memory store (e.g. redis, memcache) as a queue: a worker pulls data from it and inserts it in bulk (for example, every 2 seconds) into MySQL storage.
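As a rough illustration of the bulk side of that idea (assuming the log_events columns from the question; the values are made up), the worker would drain the queue and issue one multi-row statement instead of thousands of single-row ones:
-- one multi-row INSERT issued by the worker every couple of seconds
INSERT INTO log_events (id_user, id_event, `date`, data)
VALUES
  (1001, 7, 1334793600, 42),
  (1002, 3, 1334793601, 17),
  (1003, 7, 1334793602, 99);
-- ...extend the VALUES list to a few hundred or thousand rows per statement
Each multi-row INSERT amortizes the parsing, network and commit overhead across all the rows it carries.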
Another option, if you don't need frequent reads, is to use the ARCHIVE storage engine instead of InnoDB: http://dev.mysql.com/doc/refman/5.5/en/archive-storage-engine.html. But it looks like that's not an option for you, since it has no indexes at all (which means full table scans on reads).
Yet another option is to reorganize your db structure, e.g. using partitioning (http://dev.mysql.com/doc/refman/5.5/en/partitioning.html). But it depends on what your SELECTs look like.
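For illustration only (assuming your date column holds unix timestamps, as the INT type suggests; partition names and boundaries are invented), daily range partitioning could look like:
CREATE TABLE log_events (
  id_user  BIGINT,
  id_event SMALLINT,
  `date`   INT,
  data     BIGINT
) ENGINE=InnoDB
PARTITION BY RANGE (`date`) (
  PARTITION p20120418 VALUES LESS THAN (UNIX_TIMESTAMP('2012-04-19 00:00:00')),
  PARTITION p20120419 VALUES LESS THAN (UNIX_TIMESTAMP('2012-04-20 00:00:00')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
Queries that filter on `date` then only touch the matching partitions, and old days can be dropped cheaply.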
My additional questions are:
could you show the whole table definition?
which fields are used for reads? could you show them?
do you need all the data for your reads, or maybe only recent data? If so, how recent must it be? (e.g. only from the last day/week/month/year)
id_event is an event type, right? Is the number of possible events static, or could it change in the future?

Events are sent by thousands of visitors and we are not able to bulk insert
You need to either bulk insert or shard the data. I would be tempted to try the bulk insert route first.
The fact that you think you can't suggests these events are being created by autonomous processes; you just need to funnel them through an intermediary rather than sending them directly to the database. And it would be easiest to implement that funnel as an event-based server (rather than a threaded or forking server).
You don't say what the events are or where they originate, which has some impact on the details of implementing a solution.
Both rsyslog and syslog-ng will talk to a MySQL backend, so you can eliminate the overhead of establishing a new connection per message, but I don't know whether either implements buffering / bulk inserts. It would certainly be possible to tail the files they produce with a single process and create bulk inserts from there.
It would be relatively simple to write a funnel using an event-based server and a buffering tool, along with a bit of code to implement asynchronous mysqli calls and a watchdog. Or you could use node.js with an async MySQL lib. There are also tools like statsd (again using node.js) which can perform some aggregation on the data.
Or you could just write something from scratch.
A write-only database is a useless piece of hardware, though. You've not provided any details of how this data will be used, which has some relevance to designing a solution. Also, since ideally the data feed would be a single process / DB session, it might be a better idea to use MyISAM rather than InnoDB (I see in a later comment you said you had problems with MyISAM; presumably that was with multiple clients).

Related

Store common queries on disk with MySQL and Windows

I have a huge person database and run common searches by name on it.
SELECT * FROM tbl_person WHERE full_name LIKE 'Sparow%Jack%';
SELECT * FROM tbl_person WHERE full_name LIKE 'Sparow%';
I rarely insert new data in this table.
I want to store common last_name queries on hard disk. Queries are already stored in RAM, but I lose them all each time the server reboots.
I have 1.7 billion rows in my table and each row (with index) takes 1 KB, so yes, it's a 1.7 TB database.
That's the main reason why I want to store common selects on disk.
Variable_name,Value
query_alloc_block_size,8192
query_cache_limit,1048576
query_cache_min_res_unit,1024
query_cache_size,4294966272
query_cache_type,ON
query_cache_wlock_invalidate,OFF
query_prealloc_size,8192
Edit:
SELECT * FROM tbl_person WHERE full_name LIKE 'Savard%';
takes 1000 seconds to execute the first time and 2 seconds after that.
If I reboot the system and execute it again, the query takes 1000 seconds again.
I simply want to avoid MySQL taking another 1000 seconds to run the same query I already ran before the reboot.
Why not consider something like Redis for caching?
It's an in-memory data store and it's very popular right now. Sites using Redis:
http://blog.togo.io/redisphere/redis-roundup-what-companies-use-redis
Redis also can persist data to disk: http://redis.io/topics/persistence
For caching, though, saving to disk shouldn't be absolutely critical. The idea is that if some data is not in the cache, the worst case is not catastrophic: the request simply falls through to your database.
If you are performing many such queries on your data, I suggest you index your table using Apache Lucene or Sphinx. Databases are fast, but they are not very efficient (especially MySQL) at performing partial matches on millions of rows.
I already answered a similar question about Zend Framework and Lucene, and I favor Zend's solution as I believe it is the easiest to set up and use in a PHP environment.
Luckily, Zend Framework can be used module by module, and you can easily use the Zend Search Lucene module by itself without the entire class library.
Edit:
The role of an indexer is not to replace your DB, but to improve its search functionality by providing a way to perform partial searches. For example, given your table, you might index only a few of your fields (make them "queryable") and keep other static (non-indexed) fields to reference your rows in the database.
The advantage in using an indexer is that you can also index pre-computations and directly search them, instead of querying the database.

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a MySQL server to store some data, but realized (after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although it's very small data: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I could load the data onto each of them and then somehow try to merge all the separate data into one server?
I was going to use MySQL with InnoDB (I can use other settings if that helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first; it seems more widely used and easier to get help with in case I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and a very fast disk IO subsystem on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use the MEMORY access method for your big table, which will be very fast indeed. (But if that would work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
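A sketch of the table-per-day idea (table names here are hypothetical):
-- prepare an empty table and atomically swap it in at the day boundary
CREATE TABLE daily_data_new LIKE daily_data;
RENAME TABLE daily_data TO daily_data_20120418,
             daily_data_new TO daily_data;
-- yesterday's rows now live in daily_data_20120418, which can be
-- archived or dropped without touching today's loads
RENAME TABLE swaps both names in one atomic operation, so writers never see a missing table.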
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
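Putting the last two suggestions together, a minimal sketch might be (file path, table and column names are assumptions):
ALTER TABLE big_table DISABLE KEYS;   -- skip per-row index maintenance (MyISAM non-unique indexes)
LOAD DATA INFILE '/var/load/batch_0001.tsv'
  INTO TABLE big_table
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
  (id, payload);
ALTER TABLE big_table ENABLE KEYS;    -- rebuild the indexes in one pass at the end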

How do you optimize a MySQL database for writes?

I have a write-intensive application running on EC2. Any thoughts on how to optimize it to be able to handle several thousand concurrent writes on the MySQL DB?
Write scaling is a hard problem. Perhaps the secret to write scaling is in read scaling. That is, cache reads as much as possible, so that the writes get all the throughput.
Having said that, there are a bunch of things one can do:
1) Start with the data model. Design a data model so that you never delete or update a table; the only operation is an insert. Use an Effective Date, Effective Sequence and Effective Status to implement Insert, Update and Delete operations using just the insert command. This concept is called the Append Only model. Check out RethinkDB.
2) Set the concurrent_insert flag to 1. This makes sure that the tables keep accepting inserts while reads are in progress.
3) When you have only inserts at the tail, you may not need row-level locks. So use MyISAM (this is not to take anything away from InnoDB, which I will come to later).
4) If all this does not do much, create a replica table using the MEMORY engine. If you have a table called MY_DATA, create a table called MY_DATA_MEM as a MEMORY table.
5) Redirect all inserts to the MEM table. Create a view that UNIONs both tables and use that view as your read source (a sketch follows at the end of this answer).
6) Write a daemon that periodically moves the MEM table's contents to the main table and deletes them from the MEM table. It may be ideal to implement the MOVE operation as a delete trigger on the MEM table (I am hoping triggers are possible on the MEMORY engine, not entirely sure).
7) Do not do any deletes or updates on the MEM table (they are slow); also pay attention to the cardinality of the keys in your table (HASH vs B-Tree: low cardinality -> HASH, high cardinality -> B-Tree).
8) If even all the above does not work, ditch JDBC/ODBC. Move to InnoDB and use the HandlerSocket interface to do direct inserts (Google for Yoshinori-san MySQL).
I have not used HandlerSocket myself, but the benchmarks are impressive. There is even a Java HS project on Google Code.
Hope that helps.
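A rough sketch of steps 4-6 (table and view names are invented; note that MEMORY tables cannot hold BLOB/TEXT columns):
CREATE TABLE MY_DATA_MEM LIKE MY_DATA;
ALTER TABLE MY_DATA_MEM ENGINE=MEMORY;
-- read through a view that unions the two tables
CREATE VIEW MY_DATA_ALL AS
  SELECT * FROM MY_DATA
  UNION ALL
  SELECT * FROM MY_DATA_MEM;
-- daemon, run periodically: move the buffered rows into the main table
INSERT INTO MY_DATA SELECT * FROM MY_DATA_MEM;
DELETE FROM MY_DATA_MEM;
-- in practice the move needs some guard (e.g. a marker column or a table swap)
-- so rows inserted between the two statements are not lost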

What is the most efficient way to keep track of all user traffic inside a database

Currently I am using MySQL to log all traffic from all users coming into a website that I manage. The database has grown to almost 11 million rows in a month, and queries are getting quite slow. Is there a more efficient way to log user information? All we are storing is their request, user agent, and IP, and associating it with a certain website.
Why not try Google Analytics? Even if you might not think it would be sufficient for you, I bet you it can track 99% of what you want to be tracked.
The answer depends completely on what you expect to retrieve on the query side. Are you looking for aggregate information? Are you looking at all of history or only a portion? Often, if you need to look at every row to find out what you need, storing in basic text files is quickest.
What kind of queries do you want to run on the data? I assume most of your queries are over data in the current or a recent time window. I would suggest using time-based partitioning of the table (see the sketch below). This will make such queries faster, as they will hit only the partitions holding the data, so fewer disk seeks. Also, regularly purge old data and put it into summary tables. Some useful links are:
http://forge.mysql.com/w/images/a/a2/FOSDEM_2009-Giuseppe_Maxia-Partitions_Performance.pdf
http://www.slideshare.net/bluesmoon/scaling-mysql-writes-through-partitioning-3397422
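A minimal sketch of that, assuming a DATETIME column called logged_at on your traffic table (all names here are guesses, not your actual schema):
ALTER TABLE traffic_log
  PARTITION BY RANGE (TO_DAYS(logged_at)) (
    PARTITION p2012_03 VALUES LESS THAN (TO_DAYS('2012-04-01')),
    PARTITION p2012_04 VALUES LESS THAN (TO_DAYS('2012-05-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
  );
-- a query limited to April only reads partition p2012_04
SELECT COUNT(*) FROM traffic_log
WHERE logged_at >= '2012-04-01' AND logged_at < '2012-05-01';
Note that the partitioning column must be part of every unique key on the table.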
The most efficient way is probably to have Apache (assuming that's what the site is running on) simply use its built-in logging to text logs, and configure something like AWStats. This removes the need to log this information yourself, and should provide you with the information you are looking for, probably already configured in existing reports. The benefit of this over something like Google Analytics would be its server-side tracking.
Maybe stating the obvious, but have you got good indexes in relation to the queries that you are making?
1) Look at using Piwik to perform Google Analytics-type tracking, while retaining control of the MySQL data.
2) If you must continue to use your own system, look at using the InnoDB Plugin in order to support compressed table types. In addition, store the IP as an unsigned integer, and convert both the user agent and the request to unsigned int columns referencing lookup tables that are compressed using either InnoDB compression or the ARCHIVE engine (see the sketch below).
3) Skip partitioning and shard the DB by month.
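A rough sketch of point 2, with hypothetical table and column names:
CREATE TABLE useragents (
  id        INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  useragent VARCHAR(255) NOT NULL,
  UNIQUE KEY (useragent)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;  -- needs the InnoDB Plugin with innodb_file_per_table and the Barracuda file format
CREATE TABLE traffic_log (
  ip           INT UNSIGNED NOT NULL,    -- INET_ATON('192.0.2.10') instead of a string
  id_useragent INT UNSIGNED NOT NULL,    -- points into the useragents lookup table
  id_request   INT UNSIGNED NOT NULL,    -- same idea for a requests lookup table
  logged_at    DATETIME NOT NULL
) ENGINE=InnoDB;
INSERT INTO traffic_log (ip, id_useragent, id_request, logged_at)
VALUES (INET_ATON('192.0.2.10'), 1, 1, NOW());
Reading the IP back out is just SELECT INET_NTOA(ip).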
This is what "Data Warehousing" is for. Consider buying a good book on warehousing.
Collect the raw data in some "current activity" schema.
Periodically, move it into a "warehouse" (or "datamart") star schema that's (a) separate from the current activity schema and (b) optimized for count/sum/group-by queries.
Move, BTW, means insert into warehouse schema and delete from current activity schema.
Separate your ongoing transactional processing from your query/analytical processing.
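A minimal sketch of that move (schema and table names are invented here):
-- nightly job: roll yesterday's activity into the warehouse star schema...
INSERT INTO warehouse.page_hits_daily (day, site_id, hits)
SELECT DATE(logged_at), site_id, COUNT(*)
FROM activity.page_hits
WHERE logged_at < CURDATE()
GROUP BY DATE(logged_at), site_id;
-- ...then purge it from the current activity schema
DELETE FROM activity.page_hits
WHERE logged_at < CURDATE();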

Database design for heavy timed data logging

I have an application where I receive about 40,000 rows of data each day. In total I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Currently, those rows are all stored in the same table, which makes it slow to update, hard to back up, and so on.
What kind of schema is used in such applications to allow long-term access to the data without overly large tables, with easy backups and fast reads/writes?
Is PostgreSQL better than MySQL for this purpose?
1 - 40,000 rows/day is not that big.
2 - Partition your data by insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a datamart step (compute frequently requested metrics in intermediate tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant.
We have log tables of 100-200 million rows now, and it is quite painful:
backup is impossible; it requires several days of downtime.
purging old data is becoming too painful; it usually ties up the database for several hours.
So far we've only seen these solutions:
For backups, set up a MySQL slave. Backing up the slave doesn't impact the main db. (We haven't done this yet; as the logs we load and transform come from flat files, we back up those files and can regenerate the db in case of failure.)
For purging old data, the only painless way we've found is to introduce a new integer column that identifies the current date, and partition the tables (requires MySQL 5.1) on that key, per day. Dropping old data is then a matter of dropping a partition, which is fast.
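For illustration (column, table and partition names are made up), partitioning on such a day key and purging looks roughly like this:
ALTER TABLE log_table
  PARTITION BY RANGE (day_id) (
    PARTITION p20120417 VALUES LESS THAN (20120418),
    PARTITION p20120418 VALUES LESS THAN (20120419),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
  );
-- purging a whole day is then nearly instant, with no long-running DELETE:
ALTER TABLE log_table DROP PARTITION p20120417;
(The day_id column has to be part of any primary or unique key on the table.)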
If, in addition, you need to run transactions continuously on these tables (as opposed to just loading data every now and then and mostly querying it), you probably need to look into InnoDB and not the default MyISAM tables.
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
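A hedged sketch of feeding one of those summary tables (all column names are assumptions):
-- run once per day, shortly after midnight
INSERT INTO DailySalesByLocation (sale_date, location_id, total_amount, num_sales)
SELECT DATE(sold_at), location_id, SUM(amount), COUNT(*)
FROM Sales
WHERE sold_at >= CURDATE() - INTERVAL 1 DAY
  AND sold_at <  CURDATE()
GROUP BY DATE(sold_at), location_id;
Reports then read the small daily tables instead of scanning the giant Sales table.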
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert/update/delete from a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes specified on your log table, you should take a critical look at them and decide if they are really necessary. If not, drop them.
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/