I need to log all POST and GET requests on a web site in the database.
There will be two tables:
requests with time stamp, user id and requested URI
request parameters with name, value and request id
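Roughly, the two tables would look something like this (exact column names and types are not decided yet):

    CREATE TABLE requests (
        id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ts      TIMESTAMP NOT NULL,        -- time stamp of the request
        user_id INT UNSIGNED,
        uri     VARCHAR(255) NOT NULL      -- requested URI
    ) ENGINE=InnoDB;

    CREATE TABLE request_params (
        request_id INT UNSIGNED NOT NULL,  -- points back to requests.id
        name       VARCHAR(64)  NOT NULL,
        value      VARCHAR(255),
        KEY (request_id)
    ) ENGINE=InnoDB;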
I will use it only for analytical reports once per month. No regular usage of this data.
I have about one million requests a day, so the request parameters table will be huge.
Can I handle such a large table in MySQL with no problems?
I'd avoid writing to the DB on each request or you'll be vulnerable to the Slashdot effect. Parse your web logs during quiet times to update the DB.
The usual solution to this type of problem is to write a program that parses the logs for the whole month. If you don't need sophisticated MySQL capabilities, you should consider this approach.
If you really do need the database, then consider parsing the logs offline. Otherwise, if your database goes down, you will lose data. Logs are known to be pretty safe.
Table indexes are not free. The more indexes you have, the faster queries run, but the more indexes you have, the slower inserting data becomes.
Yes, MySQL will handle millions of rows without trouble, but depending on what you want to do with the data later and on the indexes on those tables, performance may not be very high.
PS: In my project we have a huge price list with a few million products in it and it works without any problems.
I made an API and am logging all requests to it.
When someone hits the API, there is an insert and an update (the update records the API response code).
During testing the API log is around 200k records; this could grow to a few million records very quickly.
Does this kind of logging, i.e. an insert plus an update, put a lot of pressure on the server?
My concern is that MySQL will get overloaded because of the logging, so I'm not sure whether I should trim the logs every 7 days or so.
A few million records won't hurt the database storage-wise. I've got MySQL tables with hundreds of millions of rows and they work fine when indexed. I would consult the MySQL manual for Data Manipulation Statements and Optimization to see if what you're doing is going to stretch performance limits.
See MySQL's indexing guide if you'd like to know more.
If you're worried about overloading your DB, you could write a cron job to back up your database each day and then just truncate/load after hitting a certain row count or time of day or whatever. You would then have a backup of all the records that you need.
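For example, a nightly cron job could archive and trim with two statements like these (the names api_log, api_log_archive and created_at are just placeholders for whatever you actually use):

    -- api_log_archive has the same columns as api_log
    INSERT INTO api_log_archive
        SELECT * FROM api_log
        WHERE created_at < NOW() - INTERVAL 7 DAY;

    DELETE FROM api_log
    WHERE created_at < NOW() - INTERVAL 7 DAY;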
Hope this helps.
I am in the process of setting up a MySQL server to store some data but realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows (although each row is very small: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60 to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separate data into one server?
I was going to use MySQL with InnoDB (I can use any other settings if it helps), but that's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before but was looking for a MySQL solution first; it seems more widely used and it's easier to get help in case I have problems.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and a very fast disk I/O subsystem on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
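A rough sketch of the rename trick, assuming a table called events that the loaders always write to (the names are made up):

    CREATE TABLE events_new LIKE events;       -- empty table with the same structure
    RENAME TABLE events TO events_20110101,
                 events_new TO events;         -- atomic swap; loaders keep writing to events
    DROP TABLE events_20110101;                -- once yesterday's data is no longer needed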
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
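A minimal example, assuming your queue processors write tab-separated files (table and column names here are invented) to a directory the MySQL server can read:

    LOAD DATA INFILE '/var/lib/mysql-files/batch_0001.tsv'
    INTO TABLE events
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    (id, payload);                             -- columns in the order they appear in the file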
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
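Something like this around the bulk load, assuming a MyISAM table (DISABLE KEYS only skips the non-unique indexes):

    ALTER TABLE events DISABLE KEYS;   -- stop maintaining non-unique indexes during the load
    -- ... LOAD DATA INFILE / bulk INSERTs here ...
    ALTER TABLE events ENABLE KEYS;    -- rebuild the indexes in one pass at the end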
Is it possible to cache recently inserted data in MySQL database internally?
I looked at the query cache etc. (http://dev.mysql.com/doc/refman/5.1/en/query-cache.html) but that's not what I am looking for. I know that SELECT query results will be cached.
Details:
I am inserting lots of data to MySQL DB every second.
I have two kinds of users for this data.
Users who query any random data
Users who query recently inserted data
For the 2nd kind of user, my table has a Unix timestamp as the primary key, which tells me how new the data is. Is there any way to cache the data at the time of insert?
One option is to write my own caching module which caches data and then does the INSERT.
Users can query this module before going to MySQL DB.
I was just wondering if something similar is available.
PS: I am open to other databases providing a similar feature.
Usually you get the best performance from MySQL if you allow a big index cache (config setting key_buffer_size), at least for MyISAM tables.
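For example (the size here is only an illustration; tune it to your RAM and index size):

    SET GLOBAL key_buffer_size = 512 * 1024 * 1024;   -- 512 MB MyISAM index cache
    SHOW VARIABLES LIKE 'key_buffer_size';            -- confirm the new value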
If latency is really an issue (as it seems in your case) have a look at Sphinx which has recently introduced real-time indexes.
Currently I am using mysql to log all traffic from all users coming into a website that I manage. The database has grown to almost 11m rows in a month, and queries are getting quite slow. Is there a more efficient way to log user information? All we are storing is their request, useragent, and their ip, and associating it with a certain website.
Why not try Google Analytics? Even if you might not think it would be sufficient for you, I bet you it can track 99% of what you want to be tracked.
The answer depends completely on what you expect to retrieve in the query side. Are you looking for aggregate information, are you looking for all of history or only a portion? Often, if you need to look at every row to find out what you need, storing in basic text files is quickest.
What kind of queries do you want to run on the data? I assume most of your queries are over data in the current or a recent time window. I would suggest using time-based partitioning of the table (a sketch follows the links below). This will make such queries faster because they hit only the partitions holding that data, so there are fewer disk seeks. Also regularly purge old data and put it in summary tables. Some useful links are:
http://forge.mysql.com/w/images/a/a2/FOSDEM_2009-Giuseppe_Maxia-Partitions_Performance.pdf
http://www.slideshare.net/bluesmoon/scaling-mysql-writes-through-partitioning-3397422
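As a sketch of the time-based partitioning mentioned above (table, column, and partition names are invented; adapt the ranges to your retention window):

    CREATE TABLE access_log (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        ts      DATETIME NOT NULL,
        user_id INT UNSIGNED,
        uri     VARCHAR(255),
        PRIMARY KEY (id, ts)                 -- the partitioning column must be part of the key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(ts)) (
        PARTITION p201001 VALUES LESS THAN (TO_DAYS('2010-02-01')),
        PARTITION p201002 VALUES LESS THAN (TO_DAYS('2010-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );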
The most efficient way is probably to have Apache (assuming that's what the site is running on) simply use its built-in logging to text files, and configure something like AWStats. This removes the need to log this information yourself, and should provide you with the information you are looking for, probably already configured in existing reports. The benefit of this over something like Google Analytics is that the tracking happens server-side.
Maybe stating the obvious, but have you got a good index in relation to the queries that you are making?
1) Look at using Piwik to perform Google Analytic type tracking, while retaining control of the MySQL data.
2) If you must continue to use your own system, look at using the InnoDB Plugin in order to support compressed table types. In addition, convert the IP to an unsigned integer, and convert both the user agent and the request to unsigned ints referencing lookup tables that are compressed using either InnoDB compression or the ARCHIVE engine (see the sketch after this list).
3) Skip partitioning and shard the DB by month.
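A sketch of what (2) could look like; it assumes the InnoDB Plugin with innodb_file_per_table=1 and innodb_file_format=Barracuda, and the table/column names are invented:

    CREATE TABLE request_log (
        ts           TIMESTAMP    NOT NULL,
        ip           INT UNSIGNED NOT NULL,   -- INET_ATON('1.2.3.4') instead of a string
        useragent_id INT UNSIGNED NOT NULL,   -- id in a small useragents lookup table
        request_id   INT UNSIGNED NOT NULL    -- id in a requests lookup table
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

    INSERT INTO request_log (ts, ip, useragent_id, request_id)
    VALUES (NOW(), INET_ATON('192.0.2.10'), 42, 1001);

    SELECT INET_NTOA(ip) FROM request_log;    -- convert back for reporting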
This is what "Data Warehousing" is for. Consider buying a good book on warehousing.
Collect the raw data in some "current activity" schema.
Periodically, move it into a "warehouse" (or "datamart") star schema that's (a) separate from the current activity schema and (b) optimized for count/sum/group-by queries.
Move, BTW, means insert into warehouse schema and delete from current activity schema.
Separate your ongoing transactional processing from your query/analytical processing.
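The "move" step might look roughly like this (the schema and table names activity, warehouse, and daily_requests are placeholders):

    -- insert into the warehouse schema...
    INSERT INTO warehouse.daily_requests (req_date, uri, hits)
    SELECT DATE(ts), uri, COUNT(*)
    FROM activity.requests
    WHERE ts < CURDATE()
    GROUP BY DATE(ts), uri;

    -- ...then delete from the current activity schema
    DELETE FROM activity.requests WHERE ts < CURDATE();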
I have an application where I receive about 40,000 rows of data each day. I have 5 million rows to handle (a 500 MB MySQL 5.0 database).
Actually, those rows are stored in the same table => slow to update, hard to back up, etc.
What kind of schema is used in such applications to keep the data accessible long-term, without the tables growing too big, with easy backups and fast reads/writes?
Is PostgreSQL better than MySQL for this purpose?
1 - 40000 rows / day is not that big
2 - Partition your data by insert date: you can easily delete old data this way.
3 - Don't hesitate to go through a datamart step (compute frequently requested metrics in intermediate tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant.
We have log tables of 100-200 million rows now, and it is quite painful:
Backup is impossible; it requires several days of downtime.
Purging old data is becoming too painful; it usually ties up the database for several hours.
So far we've only seen these solutions:
Backup: set up a MySQL slave. Backing up the slave doesn't impact the main DB. (We haven't done this yet, as the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)
Purging old data: the only painless way we've found is to introduce an integer column that identifies the current date, and partition the tables on that key, one partition per day (this requires MySQL 5.1). Dropping old data is then just a matter of dropping a partition, which is fast.
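Dropping a day then looks something like this (the table and partition names follow whatever convention you used when creating the partitions):

    ALTER TABLE logs DROP PARTITION p20100101;   -- discards that day's rows almost instantly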
If in addition you need to run transactions on these tables continuously (as opposed to just loading data every now and then and mostly querying that data), you probably need to look into InnoDB and not the default MyISAM tables.
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
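A rough sketch of such a summary table and the nightly roll-up that fills it (all names and columns are illustrative):

    CREATE TABLE daily_sales (
        sale_date    DATE NOT NULL PRIMARY KEY,
        order_count  INT UNSIGNED NOT NULL,
        total_amount DECIMAL(12,2) NOT NULL
    );

    INSERT INTO daily_sales (sale_date, order_count, total_amount)
    SELECT DATE(sold_at), COUNT(*), SUM(amount)
    FROM sales
    WHERE sold_at >= CURDATE() - INTERVAL 1 DAY
      AND sold_at <  CURDATE()
    GROUP BY DATE(sold_at);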
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert/update/delete from a table, any indexes that you have also need to be updated, which slows down the process. If you have a lot of indexes specified on your log table, you should take a critical look at them and decide if they are indeed necessary. If not, drop them.
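For example, to see what you have and drop an index that no query actually uses (the table and index names here are hypothetical):

    SHOW INDEX FROM log_table;                       -- list every index on the logging table
    ALTER TABLE log_table DROP INDEX idx_useragent;  -- remove one that isn't needed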
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/