Reporting with MySQL - Simplest Query taking too long - mysql

I have a MySQL Table on an Amazon RDS Instance with 250 000 Rows. When I try to
SELECT * FROM tableName
without any conditions (just for testing; the normal query specifies the columns I need, but I need most of them), the query takes between 20 and 60 seconds to execute. This will be the base query for my report, and the report should run in under 60 seconds, so I think this will not work out (it times out the moment I add the joins). The report runs without any problems in our smaller test environments.
Could it be that the query is taking so long because MySQL is trying to lock the table and waiting for all writes to finish? There might be quite a lot of writes on this table. I am running the query on a MySQL slave, since I do not want to lock up the production system with my queries.
I have no experience with how many rows count as a lot for a relational DB. Are 250 000 rows with ~30 columns (varchar, date and integer types) a lot?
How can I speed up this query (hardware, software, query optimization ...)?
Can I tell MySQL that I do not care that the data might be inconsistent? (It is a snapshot from a reporting database.)
Is there a chance that this query will run under 60 seconds, or do I have to adjust my goals?

Remember that MySQL has to prepare your result set and transport it to your client. In your case, this could be 200MB of data it has to shuttle across the connection, so 20 seconds is not bad at all. Most libraries, by default, wait until the entire result has been received before forwarding it to the application.
To speed it up, fetch only the columns you need, or do it in chunks with LIMIT. SELECT * is usually a sign that someone's being super lazy and not optimizing at all.
If your library supports streaming resultsets, use that, as then you can start getting data almost immediately. It'll allow you to iterate on rows as they come in without buffering the entire result.
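The idea of iterating over rows as they arrive can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for MySQL; the `users` table and the `stream_rows` helper are made up for the example. With a real MySQL driver, whether rows are truly streamed from the server depends on the cursor type the driver offers, so check its documentation.

```python
import sqlite3

def stream_rows(conn, sql, chunk_size=1000):
    """Yield rows one at a time, pulling them from the cursor in chunks."""
    cur = conn.cursor()
    cur.execute(sql)
    while True:
        chunk = cur.fetchmany(chunk_size)
        if not chunk:
            break
        yield from chunk

# In-memory table standing in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(2500)])

total = 0
for row in stream_rows(conn, "SELECT id, name FROM users"):
    total += 1  # process each row here instead of buffering the whole result
```

Only `chunk_size` rows are held in the application at any time, so processing can start before the full result has been read.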

A table with 250,000 rows is not too big for MySQL at all.
However, waiting for those rows to be returned to the application does take time. That is network time, and there are probably a lot of hops between you and Amazon.
Unless your report is really going to process all the data, check the performance of the database with a simpler query, such as:
select count(*) from table;
EDIT:
Your problem is unlikely to be due to the database. It is probably due to network traffic. As mentioned in another answer, streaming might solve the problem. You might also be able to play with the data formats to get the total size down to something more reasonable.
A last-resort step would be to save the data in a text file, compress the file, move it over, and uncompress it. Although this sounds like a lot of work, you might get 5x - 10x compression on the data, saving oodles of time on the transmission and still have a large improvement in performance with the rest of the processing.
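A rough way to check how much compression would buy is to dump a sample of the data to CSV and compare sizes. The sketch below uses Python's csv and gzip modules on made-up rows; the actual ratio depends entirely on the real data, so treat it as a measuring tool, not a prediction.

```python
import csv
import gzip
import io

# Made-up sample rows; replace with a sample exported from the real table.
rows = [(i, f"user{i}@example.com", "2015-01-01", i % 7) for i in range(10000)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
raw = buf.getvalue().encode("utf-8")

packed = gzip.compress(raw)          # what would travel over the network
restored = gzip.decompress(packed)   # what the receiving side reconstructs

ratio = len(raw) / len(packed)
print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes, ratio={ratio:.1f}x")
```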

I got updated specs from my client and was able to reduce the number of users returned to 250, which goes through (with a lot of JOINs) in 60 seconds.
So maybe the answer is really: try not to dump a whole table with a query; fetch only the exact data you need. The client has SQL access, and he will have to update his queries so that only relevant users are returned.

You should never really use * as a wildcard. Choose the fields that you actually want and then create an index of these fields combined.

If you have thousands of rows, another option is to implement pagination.
If the result data is used directly for a report, no one can look at more than 100 rows in a single shot anyway.
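One possible form of pagination is sketched below. It filters on the last id seen and uses LIMIT, rather than an OFFSET; that choice is mine, not something the answers above prescribe. The example runs against sqlite3 as a stand-in for MySQL, and the `users` table and `fetch_page` helper are hypothetical.

```python
import sqlite3

def fetch_page(conn, last_id, page_size=100):
    """Return the next page of rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [(f"user{i}",) for i in range(250)])

pages = []
last_id = 0
while True:
    page = fetch_page(conn, last_id)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]  # remember where this page ended
```

Each request touches at most `page_size` rows, so the report can show the first page without waiting for the rest.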

Related

Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database

I have a Redshift datawarehouse that is pulling data in from multiple sources.
One is from my MySQL database and the others are some cloud based databases that get pulled in.
When querying in redshift, the query response is significantly slower than the same mysql table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In mysql this takes .4 seconds. In Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in mysql and in redshift it is not since it is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and Mysql is a relational database that is able to utilize indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems and it is currently at 90 GB.
We have a reporting tool sitting on top that does native queries to pull data. They are pretty slow but are also pulling a ton of data from multiple tables. I would expect some slowness with these, but with a simple statement like above, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run vacuum and analyze with no improvement.
We have 4 nodes, dc2.large. Currently only using 14% storage. CPU utilization is frequently near 100%. Database connections averages about 10 at any given time.
The datawarehouse just has exact copies of the tables from our integration with the other sources. We are trying to do near real-time reporting with this.
Just looking for advice on how to improve performance of our redshift via configuration changes, some sort of view or dim table architecture, or any other tips to help me get the most out of redshift.
I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question is why this size of cluster for a table of only 11M rows? I'd guess it is that there are other much larger data sets on the database and that this table isn't setting the size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load: is the cluster busy when this query runs? WLM is one place that things can interfere, but disk IO bandwidth is a shared resource, and if some other queries are abusing the disks this will make every query's access to disk slow. (Same is true of network bandwidth and leader node workload, but these don't seem to be central to your issue at the moment.)
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004 sec as a worst case (with SSDs and PRIMARY KEY(id)).
(If leads is a VIEW, then let's see the tables. 0.4s may be reasonable!)
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.
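To make the summary-table idea concrete, here is a minimal sketch using sqlite3 as a stand-in. The `lead_events` fact table, the `daily_summary` table, and their columns are all invented for illustration; the point is only that reports read the small pre-aggregated table instead of the raw rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table with one row per event.
conn.execute("CREATE TABLE lead_events (day TEXT, source TEXT, amount INTEGER)")
conn.executemany("INSERT INTO lead_events VALUES (?, ?, ?)", [
    ("2015-01-01", "web", 10),
    ("2015-01-01", "web", 5),
    ("2015-01-01", "api", 7),
    ("2015-01-02", "web", 3),
])

# Summary table: one row per (day, source), rebuilt or appended periodically.
conn.execute("""
    CREATE TABLE daily_summary (
        day TEXT, source TEXT, n INTEGER, total INTEGER,
        PRIMARY KEY (day, source)
    )
""")
conn.execute("""
    INSERT INTO daily_summary (day, source, n, total)
    SELECT day, source, COUNT(*), SUM(amount)
    FROM lead_events
    GROUP BY day, source
""")

# The report now reads the small summary table instead of the fact table.
summary = conn.execute(
    "SELECT day, source, n, total FROM daily_summary ORDER BY day, source"
).fetchall()
```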

Where should I focus: Optimize the query, changing database config or what else?

I took over a project and have 2 MyISAM tables.
table1 with approx. 1M rows, and
table2 with approx. 100K rows.
In the project these tables are accessed often, and at first it seems ok.
After I installed the project on a Windows 8.1 machine for local development, I found that every day, the first time I access the site, my query takes 14 seconds. A bit too much.
Afterwards it takes less than 0.1 second.
Now, since on dev this, combined with another query, runs into a PHP timeout exception, it got me concerned about whether it's recommended to do anything about it or not. On production it seems not to occur (or is hard to reproduce).
I heard of things like warm cache or optimize query but don't know what is meant by that.
What do experts like you do in this case?
I had another question set up here trying to see whether I can optimize the query.
Changing to InnoDB doesn't seem to have an impact.
The "first" time you run a query, two things may or may not happen:
Lots of disk I/O may be done to fetch the index blocks and/or data blocks from disk. (If other queries happened to have fetched those blocks, the blocks may be cached already.) (14s vs 0.1s is more than I usually see for this cold/warm cache difference.)
If the "Query cache" was on, the first SELECT and its resultset were stored in the QC. The second call may have found it there and returned the result almost instantly. (Usually this is ~1ms, not the 100ms you mentioned.) The QC can be bypassed for a single query by saying SELECT SQL_NO_CACHE ....
Since it is annoying you daily, you may as well go through the exercise of trying to optimize the query. If the tables are growing daily, it may get slower and slower over time. Note that if production needs to be restarted for any reason, that query may timeout on it. So, yes, try to optimize it.
A million rows is beginning to be "big".
The characteristics of this indicate that you are I/O-bound only initially. So it does not indicate that key_buffer_size and innodb_buffer_pool_size are too low.
If you want to discuss the performance of a particular query, start a new thread and provide SHOW CREATE TABLE and EXPLAIN SELECT ....

Run thousands of queries on ~60gb of logfiles

I have a month's worth of logfiles (~60gb uncompressed) and I need to run about 1000 queries on these logfiles. Each logfile is ~68MB compressed with gzip.
For testing purpose I have installed Hadoop and Hive in pseudo-distributed mode on our test server (8core, 32gb ram) and I have loaded the logfiles in a hive table which looks somewhat like this:
date, time, userid, channel
And I have a file with about 1000 timeframes like this:
date, time-start, time-end
01_01_2015, 08:05:31, 08:09:54
01_01_2015, 08:54:10, 08:54:30
...
02_01_2015, 08:15:14, 08:20:48
...
[edit:] The timeframes on a single day are non-overlapping and with precision in seconds. They can be as short as 10 seconds and as long as several minutes.
I want to find out how many unique users were on my site during these exact timeframes.
With each of these timeframes being unique.
My question is what would be the most time efficient way of handling such a task? Running a thousand different queries in Hive seems like a terrible way of doing this.
The alternative would be to bundle say 50-100 queries into one to avoid too much overhead from creating jobs etc., would that work better? And is there a limit how long a query can be in Hive?
While I'm interested in how this could be done with Hadoop, I'm also open to other suggestions (especially considering this runs in pseudo-distributed mode).
Are the timeframes overlapping? If so, would 1-minute chunks of the log be a reasonable way to chunk the data? That is, would there be dozens or hundreds of rows per minute, and do all the timeframes have a resolution of one minute? If not one minute, maybe one hour?
Summarize the data in each 1-minute chunk; put the results in another database table. Then write queries against that table.
That would be the MySQL way to do it, probably in a single machine.
Edit (based on OP's edit showing that ranges are non-overlapping and not conveniently divided):
Given that the ranges are non-overlapping, you should aim for doing the work in a single pass.
I would pick between a Perl/PHP program that does all the work, versus 1000 sql calls with
INSERT INTO SummaryTable
SELECT MIN(ts), MAX(ts), SUM(...), COUNT(...)
FROM ...
WHERE ts BETWEEN...
(This assumes an index on ts.) That would be simple enough and fast enough -- It would run only slightly slower than the time it takes to read that much disk.
But... Why even put the raw data into a database table? That is a lot of work, with perhaps no long-term benefit. So, I am back to writing a Perl script to read the log file, doing the work as it goes.
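The single-pass approach can be sketched in Python rather than Perl. This is only an illustration under the assumptions stated in the question: timeframes on the same day do not overlap, and times are zero-padded HH:MM:SS strings so they compare correctly as text. The `count_unique_users` function and the in-memory sample data are hypothetical; a real script would read the gzip files line by line.

```python
from bisect import bisect_right
from collections import defaultdict

def count_unique_users(frames, log_rows):
    """frames: list of (date, start, end); log_rows: iterable of (date, time, userid)."""
    by_day = defaultdict(list)
    for date, start, end in frames:
        by_day[date].append((start, end))
    for day_frames in by_day.values():
        day_frames.sort()
    starts = {d: [s for s, _ in fs] for d, fs in by_day.items()}

    users = defaultdict(set)
    for date, time, userid in log_rows:  # one pass over the log
        if date not in by_day:
            continue
        # Find the last timeframe that starts at or before this log time.
        i = bisect_right(starts[date], time) - 1
        if i >= 0:
            start, end = by_day[date][i]
            if time <= end:
                users[(date, start, end)].add(userid)
    return {frame: len(users[frame]) for frame in frames}

frames = [
    ("01_01_2015", "08:05:31", "08:09:54"),
    ("01_01_2015", "08:54:10", "08:54:30"),
]
log_rows = [
    ("01_01_2015", "08:05:31", "u1"),
    ("01_01_2015", "08:06:00", "u2"),
    ("01_01_2015", "08:06:30", "u1"),
    ("01_01_2015", "08:10:00", "u3"),  # falls between the two timeframes
    ("01_01_2015", "08:54:20", "u3"),
    ("02_01_2015", "08:06:00", "u4"),  # day without any timeframe
]
counts = count_unique_users(frames, log_rows)
```

Each log line costs one binary search among that day's timeframes, so the whole job is a single read of the files regardless of how many timeframes there are.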

How to effectively store a high amount of rows in a database

What's the best way to store a high amount of data in a database?
I need to store values of various environmental sensors with timestamps.
I have done some benchmarks with SQLCE. It works fine for a few 100,000 rows, but once it gets into the millions, the selects get horribly slow.
My actual tables:
Datapoint:[DatastreamID:int, Timestamp:datetime, Value:float]
Datastream: [ID:int{unique index}, Uint:nvarchar, Tag:nvarchar]
If I query for Datapoints of a specific Datastream and a date range, it takes ages, especially if I run it on an embedded WindowsCE device. And that is the main problem. On my development machine a query took ~1 sec, but on the CE device it took ~5 min.
Every 5 min I log 20 sensors: 12 per hour * 24 h * 365 days = 105,120 readings per sensor, * 20 sensors = 2,102,400 rows per year.
But it could be even more sensors!
I thought about some kind of webservice backend, but the device may not always have a connection to the internet / server.
The data must be able to display on the device itself.
How can I speed things up? Choose another table layout, use another database (sqlite)? At the moment I use .netcf20 and SQLCE3.5
Any advice?
I'm sure any relational database would suit your needs. SQL Server, Oracle, etc. The important thing is to create good indexes so that your queries are efficient. If you have to do a table scan just to find a single record, it will be slow no matter which database you use.
If you always find yourself querying for a specific DataStreamID and Timestamp value, create an index for it. That way it will do an index seek instead of a scan.
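The effect of such an index can be checked with the query planner. The sketch below uses sqlite3 as a stand-in (SQLCE has its own tooling); the column names follow the Datapoint table from the question, while the index name `idx_dp_stream_ts` and the sample data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Datapoint (DatastreamID INTEGER, Timestamp TEXT, Value REAL)"
)
# 20 streams with one reading per day for 28 days (made-up data).
conn.executemany(
    "INSERT INTO Datapoint VALUES (?, ?, ?)",
    [(s, f"2015-01-{d:02d} 00:00:00", float(d))
     for s in range(1, 21) for d in range(1, 29)],
)

query = ("SELECT Timestamp, Value FROM Datapoint "
         "WHERE DatastreamID = ? AND Timestamp BETWEEN ? AND ?")
params = (3, "2015-01-05 00:00:00", "2015-01-10 00:00:00")

def plan(conn):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    return " ".join(row[-1] for row in rows)

plan_before = plan(conn)  # full table scan without an index
conn.execute(
    "CREATE INDEX idx_dp_stream_ts ON Datapoint (DatastreamID, Timestamp)"
)
plan_after = plan(conn)   # index search on (DatastreamID, Timestamp)

rows = conn.execute(query, params).fetchall()
print(plan_before)
print(plan_after)
```

Putting DatastreamID first and Timestamp second lets one index serve the equality filter and the date range together.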
The key to quick access is using one or more indexes.
A Database of two million rows in a year is very manageable.
Adding indexes will slow, somewhat, the INSERTS, but your data isn't coming in all that quickly, so it should not be an issue. If the data were coming in faster, you might have to be more careful, but it would have to be far more data in a far faster rate than you have now in order to be a concern.
Do you have access to SQL Server, or even MySQL?
Your design must have these:
Primary key in the table. Integer PK is faster.
You need to analyze your select queries to see what is going on behind the scene.
Select must do a SEEK instead of a scan
If 100K makes it slow, you must look at the query through a query analyzer.
It might get a little slow if you have 100M rows, not 100K rows
Hope this helps
Can you use SQL Server Express Edition instead? You can create indexes on it just like in the full version. I've worked with databases that are over 100 million rows in SQL Server just fine. SQL Server Express Edition limits your database size to 10 GB, so as long as that's okay the free one should work for you.
http://www.microsoft.com/express/Database/

mySQL Inconsistent Performance

I'm running a mySQL query that joins various tables of 500,000+ rows. Sometimes it takes a second, other times around 15 seconds! This is on my local machine. I have experienced similarly varied times before on other intensive queries, does anyone know why this is?
Thanks
Thanks for the replies - I am using appropriate indexes, inner and left joins and have a WHERE clause range of one week out of possible 2 year period of invoices. If I keep varying it (so presumably query results are not cached) and re-running, time varies a lot, even if no. of rows retrieved is similar. The server is not busy. A few scheduled queries every minute but not intensive, take around 200ms.
The explain plan shows that a table of around 2000 rows is always fully scanned. So maybe these rows are sometimes cached, or maybe indexes are cached - didn't know indexes could be cached. I will try again with caching turned off.
Editing again - query cache is in fact off, I'm using InnoDB so it looks like increasing innodb_buffer_pool_size is the way to go
Same query each time?
It's hard to tell, based on what you've posted. If we assume that the schema and data aren't changing, I'd guess that there's something else running on your machine when the queries are long that would explain the difference. It could be that the state of memory is different, so paging is going on; an anti-virus program is running; some other service has started. It's impossible to answer.
Try to do an
Optimize Table
That should help to refresh some data useful for the query planner.
You have not given us much information. If you're using MyISAM tables, it may be a matter of locks.
Are you using ANSI INNER JOINs? A little basic, but don't use "cross joins". Those are the joins with the comma, like
SELECT * FROM t1, t2 WHERE t1.id_t1=t2.id_t1
Last things you may want to try. Increase your buffers (innodb), your key_buffers (myisam), and some query cache buffers.
Here are some common reasons (barring your server simply being too busy):
The slow query is hitting the hard drive. In the fast case the indexes and data are already cached in MySQL or the OS file cache.
Retrieving the data is blocked by updates/inserts. For MyISAM tables, the whole table gets locked in some cases whenever someone inserts or updates data in it.
Table statistics are out of date and/or the wrong index gets selected. Running ANALYZE or OPTIMIZE on the table can help.
You have the query cache enabled. Fetching the result of a cached query is fast; fetching it when it's not in the cache might be slow. Try turning off the query cache to check whether the query is always slow when it's not fetched from the cache.
In any case, you should show the output of EXPLAIN on your queries to verify indexes are getting used properly. Even if they're not, queries can be fast if everything is in RAM but grind to a halt if they need to hit the hard drive.
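To tell a cold-cache effect apart from random variation, it helps to time the same query several times in a row and compare the first run with the rest. Below is a minimal timing sketch; it uses sqlite3 with an in-memory table as a stand-in, so it will not show a real disk-versus-cache difference, and the `invoices` table and `time_query` helper are hypothetical.

```python
import sqlite3
import time

def time_query(conn, sql, params=(), runs=5):
    """Run the same query several times and return the elapsed seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        timings.append(time.perf_counter() - start)
    return timings

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, week INTEGER)")
conn.executemany("INSERT INTO invoices (week) VALUES (?)",
                 [(i % 104,) for i in range(50000)])

timings = time_query(conn, "SELECT id FROM invoices WHERE week = ?", (10,))
for run, seconds in enumerate(timings, start=1):
    print(f"run {run}: {seconds * 1000:.2f} ms")
```

If only the first run against the real server is slow, caching is the likely cause; if slow runs appear at random, locks or other load are more plausible.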