Run thousands of queries on ~60 GB of logfiles - mysql

I have a month's worth of logfiles (~60 GB uncompressed) and I need to run about 1,000 queries against them. Each logfile is ~68 MB compressed with gzip.
For testing purposes I have installed Hadoop and Hive in pseudo-distributed mode on our test server (8 cores, 32 GB RAM) and loaded the logfiles into a Hive table that looks roughly like this:
date, time, userid, channel
And I have a file with about 1000 timeframes like this:
date, time-start, time-end
01_01_2015, 08:05:31, 08:09:54
01_01_2015, 08:54:10, 08:54:30
...
02_01_2015, 08:15:14, 08:20:48
...
[edit:] The timeframes on a single day are non-overlapping, with one-second precision. They can be as short as 10 seconds or as long as several minutes.
I want to find out how many unique users were on my site during each of these exact timeframes.
My question is: what would be the most time-efficient way of handling such a task? Running a thousand different queries in Hive seems like a terrible way to do it.
The alternative would be to bundle, say, 50-100 queries into one to avoid too much overhead from creating jobs etc. Would that work better? And is there a limit to how long a query can be in Hive?
While I'm interested in how this could be done with Hadoop, I'm also open to other suggestions (especially since this runs in pseudo-distributed mode).

Are the timeframes overlapping? If so, would 1-minute chunks of the log be a reasonable way to chunk the data? That is, would there be dozens or hundreds of rows per minute, and do all the timeframes have a resolution of one minute? If not one minute, maybe one hour?
Summarize the data in each 1-minute chunk; put the results in another database table. Then write queries against that table.
That would be the MySQL way to do it, probably in a single machine.
Edit (based on OP's edit showing that ranges are non-overlapping and not conveniently divided):
Given that the ranges are non-overlapping, you should aim for doing the work in a single pass.
I would choose between a Perl/PHP program that does all the work and 1,000 SQL calls like
INSERT INTO SummaryTable
SELECT MIN(ts), MAX(ts), SUM(...), COUNT(...)
FROM ...
WHERE ts BETWEEN...
(This assumes an index on ts.) That would be simple enough and fast enough; it would run only slightly slower than the time it takes to read that much data from disk.
But... why even put the raw data into a database table? That is a lot of work, with perhaps no long-term benefit. So I am back to writing a Perl script that reads the log file and does the work as it goes.
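If you do stay with Hive, one way to avoid 1,000 separate queries is to load the timeframes into their own table and answer everything in a single join. A rough sketch with placeholder names (a logs table with log_date, log_time, userid, channel and a frames table with frame_date, time_start, time_end; the time columns are assumed to be comparable, e.g. HH:MM:SS strings):
-- one pass over the logs; each log row matches at most one frame,
-- since the frames within a day do not overlap
SELECT f.frame_date, f.time_start, f.time_end,
       COUNT(DISTINCT l.userid) AS unique_users
FROM logs l
JOIN frames f ON l.log_date = f.frame_date
WHERE l.log_time BETWEEN f.time_start AND f.time_end
GROUP BY f.frame_date, f.time_start, f.time_end;
Older Hive versions only allow equality conditions in the ON clause, which is why the range check sits in the WHERE clause; the equality join on the date column keeps the intermediate result from exploding.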

Related

How can I improve MySQL database performance?

So I have a MySQL database in my project.
I have a main table that holds the main records for updating and inserting.
There is heavy traffic on this data. What I am mainly doing is reading a .csv file and inserting the rows into the table.
Everything works fine for about 3 days, but once the table goes above 20 million records the database starts responding slowly, and at 60 million it is slower still.
What I have done so far:
I have added indexes on the columns where I think I need them (the WHERE-clause fields, for fast searching).
I don't think query optimisation is the issue, because the database works fine for the first 3 days; it only slows down as the table fills up, and the closer it gets to 60 million rows, the slower it works.
Can you suggest an approach for handling this?
What should I do? Should I move data out every 3 days, or something else? What have you done in such a situation?
The purpose of a database is to store large amounts of information. I don't think the problem is the database itself; it is more likely a poor query, joins, the database buffer, indexing, or caching. These are the usual reasons a response slows down. For more info check this link
I have added indexes on the columns where I think I need them
Yes, indexes improve the performance of SELECT queries, but at the same time they degrade your DML operations, since the index has to be restructured whenever you change an indexed column.
Whether you need an index, and whether you can compromise on SELECT or on DML performance, depends entirely on your business needs.
Currently, many organisations use two different schemas: OLAP for reporting and analytics, and OLTP to store real-time data (including some real-time reporting).
First of all, it would be helpful for us to know what kind of data you want to store.
Normally it makes no sense to store such a huge amount of data every 3 days, because no one will ever be able to use it effectively. So it is better to reduce the data before storing it in the database.
e.g.
If you get measurements from a device that produces one value per millisecond, ask yourself whether any user will ever want the value at a specific millisecond, or whether it makes more sense to store an average per second, minute, hour, or perhaps per day.
If you really need the milliseconds, but only when the user takes a deeper look, you can derive a table from the main table containing only hourly or daily averages and work with that. Only when the user drills into the "milliseconds" view do you hit the main table and live with the worse performance.
All of this is of course only possible if the data is read-only. If the data is changed by the application (and not only appended by the CSV import), then using more than one table will be error-prone.
Which operation do you want to speed up?
insert operation
A good way to speed it up is to insert records in batches. For example, insert 1,000 records with each INSERT statement:
insert into test values (value_list),(value_list)...(value_list);
other operations
If your table has tens of millions of records, everything will slow down. This is quite common.
To speed things up in this situation, here is some advice:
Optimize your table definition. It depends on your particular case; creating indexes is a common way (see the sketch after this list).
Optimize your SQL statements. A good SQL statement will run much faster, and a bad one can be a performance killer.
Data migration. If only part of your data is used frequently, you can move the infrequently used data to another big table.
Sharding. This is more complicated, but commonly used in big-data systems.
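To make the indexing point concrete, here is a hypothetical composite index covering the columns of a frequent WHERE clause (the table and column names are made up, not the OP's schema):
ALTER TABLE main_table ADD INDEX idx_status_created (status, created_at);
As a rule of thumb, put the columns you filter on with equality first and any range column last, so the index can satisfy the whole WHERE clause.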
For the .csv file, use LOAD DATA INFILE ...
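A minimal sketch of that, with a hypothetical file path and table name (adjust the delimiters to whatever your .csv actually uses):
LOAD DATA INFILE '/tmp/import.csv'
INTO TABLE main_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if the file has one
This is usually far faster than row-by-row INSERTs from application code.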
Are you using InnoDB? How much RAM do you have? What is the value of innodb_buffer_pool_size? That may not be set right -- based on queries slowing down as the data increases.
Let's see a slow query. And SHOW CREATE TABLE. Often a 'composite' index is needed. Or reformulation of the SELECT.

Simple query on a large MySQL table takes a very long time at first, much quicker later

We are struggling with a slow query that only happens the first time it is called. Afterwards the query is much faster.
The first time the query is run, it takes anywhere from 15-20 seconds. Subsequent calls take < 1.5 seconds. However, if it is not called again for a few hours, the query takes 15-20 seconds again.
The table holds daily readings for an entity called system (foreign key), with system id, date, sample reading, and a flag indicating whether the reading is done (past). The query asks for a range of one year of samples (365 days) for 200 selected systems.
It looks like this:
SELECT system_id,
sample_date,
reading
FROM Dailyreadings
WHERE past = 1
AND reading IS NOT NULL
AND sample_date < '2014-02-25' AND sample_date >= DATE('2013-01-26')
AND system_id IN (list_of_ids)
list_of_ids represents a list of 200 system ids for which we want the readings.
We have indexes on system_id and sample_date, and a composite index on both. The query usually returns ~70,000 rows, and EXPLAIN shows the index is used and the plan only goes over ~70,000 rows.
MySQL runs on Amazon RDS. The engine for all tables is InnoDB.
The Dailyreadings table has about 60 million rows, so it is quite large. However, I can't understand how a very simple range query can take up to 20 seconds. This is done on a read-only replica, so concurrent writes shouldn't be an issue, I would guess. It also happens on a staging copy of the DB which has very few read/write requests going on at the same time.
After reading many, many questions about slow first-time queries, I assume the problem is that the first time, the data needs to be read from disk, and afterwards it is cached. However, I fail to see why such a simple query would take so long reading from disk. I also tried many tweaks to the InnoDB parameters and couldn't get this to improve. Even doubling the RAM of the system didn't seem to help.
Any pointers as to what could be the problem? and how we can improve the time it takes for the first query? Any ideas how to pinpoint the exact problem?
edit
It seems the problem might be the IN clause, which is slow since the list is big (200 items). Is this a known issue? Is there a way to accelerate it?
The query probably runs fast after the first run because MySQL is caching it. To see how your query behaves with the query cache disabled, try: SELECT SQL_NO_CACHE system_id ...
Also, I found that comparing DATE values on tables with lots of data hurts performance. When possible, I stored the dates as integer Unix timestamps and compared those instead, which was faster.
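For illustration, the comparison could look like this, where sample_ts is a hypothetical INT column holding the Unix timestamp of the reading (this is not the OP's actual schema):
SELECT system_id, sample_ts, reading
FROM Dailyreadings
WHERE past = 1
  AND reading IS NOT NULL
  AND sample_ts >= UNIX_TIMESTAMP('2013-01-26')
  AND sample_ts <  UNIX_TIMESTAMP('2014-02-25')
  AND system_id IN (list_of_ids);
Both UNIX_TIMESTAMP() calls are evaluated once as constants, so an index that includes sample_ts can still be used for the range.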

Reporting with MySQL - Simplest Query taking too long

I have a MySQL table on an Amazon RDS instance with 250,000 rows. When I try to
SELECT * FROM tableName
without any conditions (just for testing; the normal query specifies the columns I need, but I need most of them), the query takes between 20 and 60 seconds to execute. This will be the base query for my report, and the report should run in under 60 seconds, so I think this will not work out (it times out the moment I add the joins). The report runs without any problems in our smaller test environments.
Could it be that the query is taking so long because MySQL is trying to lock the table and waiting for all writes to finish? There might be quite a lot of writes on this table. I am running the query on a MySQL slave, since I do not want to lock up the production system with my queries.
I have no feel for how many rows is a lot for a relational DB. Are 250,000 rows with ~30 columns (varchar, date and integer types) a lot?
How can I speed up this query (hardware, software, query optimization ...)?
Can I tell MySQL that I do not care that the data might be inconsistent (it is a snapshot from a reporting database)?
Is there a chance that this query will run under 60 seconds, or do I have to adjust my goals?
Remember that MySQL has to prepare your result set and transport it to your client. In your case, this could be 200 MB of data it has to shuttle across the connection, so 20 seconds is not bad at all. Most libraries, by default, wait for the entire result to be received before forwarding it to the application.
To speed it up, fetch only the columns you need, or do it in chunks with LIMIT. SELECT * is usually a sign that someone's being super lazy and not optimizing at all.
If your library supports streaming resultsets, use that, as then you can start getting data almost immediately. It'll allow you to iterate on rows as they come in without buffering the entire result.
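As a sketch of the chunked approach, keyset pagination on the primary key avoids the growing cost of large OFFSETs (the id and column names below are assumptions, not the OP's schema):
-- :last_id is a client-side parameter: start at 0, then set it to the
-- largest id returned by the previous chunk
SELECT col_a, col_b, col_c   -- only the columns the report needs
FROM tableName
WHERE id > :last_id
ORDER BY id
LIMIT 10000;
Repeat until a chunk comes back with fewer than 10,000 rows.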
A table with 250,000 rows is not too big for MySQL at all.
However, waiting for those rows to be returned to the application does take time. That is network time, and there are probably a lot of hops between you and Amazon.
Unless your report is really going to process all the data, check the performance of the database with a simpler query, such as:
select count(*) from table;
EDIT:
Your problem is unlikely to be due to the database. It is probably due to network traffic. As mentioned in another answer, streaming might solve the problem. You might also be able to play with the data formats to get the total size down to something more reasonable.
A last-resort step would be to save the data in a text file, compress the file, move it over, and uncompress it. Although this sounds like a lot of work, you might get 5x - 10x compression on the data, saving oodles of time on the transmission and still have a large improvement in performance with the rest of the processing.
I got updated specs from my client and was able to reduce the number of users returned to 250, which goes through (with a lot of JOINs) in 60 seconds.
So maybe the answer really is: try not to dump a whole table with a query; fetch only the exact data you need. The client has SQL access, and he will have to update his queries so that only relevant users are returned.
You should never really use * as a wildcard. Choose the fields you actually want and then create a composite index on those fields.
If you have thousands of rows, another option is to implement pagination.
If the result data is used directly for a report, no one will look at more than 100 rows in a single shot.

MySQL: How many UPDATES per second can an average box support?

We've got a constant stream of simple updates to a single MySQL table (storing user activity information). Let's say we group these into batch updates each second.
I want a ballpark idea of when MySQL on a typical 4-core, 8 GB box will start having trouble keeping up with the updates coming in each second, i.e. roughly how many row updates per second it can sustain.
This is a thought exercise to decide whether I should get going with MySQL in the early days of our application's release (to simplify development), or whether MySQL is likely to bomb so soon that it's not even worth venturing down that path.
The only way you can get a decent figure is through benchmarking your specific use case. There are just too many variables and there is no way around that.
It shouldn't take too long either: knock up a bash script or a small demo app and hammer it with JMeter, and that can give you a good idea.
I used JMeter when trying to benchmark a similar use case. The difference was that I was looking at write throughput for INSERTs. The most useful thing that came out while I was experimenting was the innodb_flush_log_at_trx_commit parameter. If you are using InnoDB and don't need ACID compliance for your use case, change it to 0. This makes a huge difference to INSERT throughput and will likely do the same in your UPDATE use case. Note, though, that with this setting changes only get flushed to disk once per second, so if your server loses power you could lose a second's worth of data.
On my Quad Core 8GB Machine for my use case:
innodb_flush_log_at_trx_commit=1 resulted in 80 INSERTS per second
innodb_flush_log_at_trx_commit=0 resulted in 2000 INSERTS per second
These figures will probably bear no relevance to your use case - which is why you need to benchmark it yourself.
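For reference, the setting can be changed at runtime or made permanent in the config file; a quick sketch, keeping in mind the durability trade-off described above:
-- at runtime (global, lost on server restart)
SET GLOBAL innodb_flush_log_at_trx_commit = 0;
-- or permanently, in my.cnf under the [mysqld] section:
--   innodb_flush_log_at_trx_commit = 0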
A lot of it depends on the quality of the code which you use to push to the DB.
If you write your batch to insert a single value per INSERT request, i.e.
INSERT INTO table (field) VALUES (value_1);
INSERT INTO table (field) VALUES (value_2);
...
INSERT INTO table (field) VALUES (value_n);
then your performance will crash and burn.
If instead you insert multiple values using a single INSERT, i.e.
INSERT INTO table (field) values (value_1),(value_2)...(value_n);
then you'll find that you can easily insert many records per second.
As an example, I wrote a quick app which needed to add the details of a request for an LDAP account to a holding DB. Inserting one field at a time (i.e., LDAP_field, LDAP_value), execution of the whole script took tens of seconds. When I concatenated the values into a single INSERT request, execution time went down to about 2 seconds from start to finish, including the overhead of starting and committing a transaction.
Hope this helps
It's not easy to give a general answer to this question. The numbers you ask for depend heavily not only on the hardware of your database server and on MySQL itself, but also on server/client configuration, the network and, equally important, on your database/table design.
Generally speaking, with a bare MySQL setup on a state-of-the-art server and update statements using unique keys, I have no issues below 200 update statements per second if I fire them from localhost; at least that's what I get on my six-year-old WinXP test environment. A fresh installation on a new system will scale far higher. If you are thinking much bigger, one server isn't the way to go: MySQL can be tweaked and scaled out in several ways, which is why many companies rely heavily on it.
Just some basics:
If the fields you want to update have huge index files, the update statements are a lot slower, since each statement has to write not only the data but also the index information.
If your update statement cannot use an index, it may take the server longer to locate the rows it has to update.
Slow memory and/or slow hard disks can also drag down overall server performance.
A slow network connection slows down communication between client and server.
There are whole books written about it, so I'll stop here and advise some further reading, if you're interested!

Database design for heavy timed data logging

I have an application that receives about 40,000 rows of data each day, and I currently have 5 million rows to handle (a 500 MB MySQL 5.0 database).
At the moment, all of those rows are stored in the same table, which makes it slow to update, hard to back up, and so on.
What kind of schema is used in such applications to keep the data accessible long-term without the problems of overly large tables, with easy backups and fast reads/writes?
Is PostgreSQL better than MySQL for this purpose?
1 - 40,000 rows/day is not that big.
2 - Partition your data by insert date: you can easily delete old data that way.
3 - Don't hesitate to go through a datamart step (compute frequently requested metrics in intermediary tables).
FYI, I have used PostgreSQL with tables containing several GB of data without any problem (and without partitioning). INSERT/UPDATE time was constant
We have log tables of 100-200 million rows now, and it is quite painful:
Backup is impossible; it requires several days of downtime.
Purging old data is becoming too painful; it usually ties up the database for several hours.
So far we've only seen these solutions:
Backup: set up a MySQL slave. Backing up the slave doesn't impact the main DB. (We haven't done this yet, since the logs we load and transform come from flat files; we back up those files and can regenerate the DB in case of failure.)
Purging old data: the only painless way we've found is to introduce a new integer column that identifies the current date, and partition the tables on that key, per day (this requires MySQL 5.1). Dropping old data is then just a matter of dropping a partition, which is fast; see the sketch below.
If, in addition, you need to run transactions continuously against these tables (as opposed to just loading data every now and then and mostly querying it), you probably need to look into InnoDB rather than the default MyISAM tables.
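A sketch of that per-day partitioning scheme (MySQL 5.1+; the table and column names are made up). The integer day key is what the table is partitioned on, and purging a day is a single DROP PARTITION:
CREATE TABLE log_entries (
  day_key INT NOT NULL,          -- e.g. 20150101, derived from the entry date
  logged_at DATETIME NOT NULL,
  message VARCHAR(255)
)
PARTITION BY RANGE (day_key) (
  PARTITION p20150101 VALUES LESS THAN (20150102),
  PARTITION p20150102 VALUES LESS THAN (20150103),
  PARTITION p20150103 VALUES LESS THAN (20150104)
);
ALTER TABLE log_entries DROP PARTITION p20150101;  -- removes that day's data almost instantly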
The general answer is: you probably don't need all that detail around all the time.
For example, instead of keeping every sale in a giant Sales table, you create records in a DailySales table (one record per day), or even a group of tables (DailySalesByLocation = one record per location per day, DailySalesByProduct = one record per product per day, etc.)
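A sketch of how such a summary table might be refreshed each night, assuming the raw Sales table has sale_time and amount columns (both hypothetical, as are the DailySales columns):
INSERT INTO DailySales (sale_date, total_amount, num_sales)
SELECT DATE(sale_time), SUM(amount), COUNT(*)
FROM Sales
WHERE sale_time >= CURDATE() - INTERVAL 1 DAY
  AND sale_time <  CURDATE()
GROUP BY DATE(sale_time);
Reports then hit DailySales, which stays small, instead of scanning the giant Sales table.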
First, huge data volumes are not always handled well in a relational database.
What some folks do is to put huge datasets in files. Plain old files. Fast to update, easy to back up.
The files are formatted so that the database bulk loader will work quickly.
Second, no one analyzes huge data volumes. They rarely summarize 5,000,000 rows. Usually, they want a subset.
So, you write simple file filters to cut out their subset, load that into a "data mart" and let them query that. You can build all the indexes they need. Views, everything.
This is one way to handle "Data Warehousing", which is what your problem sounds like.
First, make sure that your logging table is not over-indexed. By that I mean that every time you insert/update/delete on a table, any indexes you have also need to be updated, which slows down the process. If you have a lot of indexes on your log table, take a critical look at them and decide whether they are really necessary. If not, drop them.
You should also consider an archiving procedure such that "old" log information is moved to a separate database at some arbitrary interval, say once a month or once a year. It all depends on how your logs are used.
This is the sort of thing that NoSQL DBs might be useful for, if you're not doing the sort of reporting that requires complicated joins.
CouchDB, MongoDB, and Riak are document-oriented databases; they don't have the heavyweight reporting features of SQL, but if you're storing a large log they might be the ticket, as they're simpler and can scale more readily than SQL DBs.
They're a little easier to get started with than Cassandra or HBase (different type of NoSQL), which you might also look into.
From this SO post:
http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/