Running complex concurrent queries on Apache Drill

In a distributed bare-metal Apache Drill cluster, complex concurrent queries raise two issues:
They hog the cluster's resources, especially CPU; this can be controlled to some extent with Linux cgroups.
Drill seems to serve concurrent queries on a first-come, first-served basis. This means that even if the second query is very simple and should take almost no time, it still has to wait for the earlier complex, heavy query to finish, which is not acceptable in a production environment.
My question is: is there a workaround for the second problem? If not, what alternatives in the technology stack might help in this case?
We have tried changing some Apache Drill configuration parameters related to concurrent queries and queue management.

Without query queueing enabled, Drill takes the approach of unlimited concurrent execution (an approach that will quickly exhaust the cluster's resources if new queries arrive rapidly enough). With queueing enabled, concurrency is capped at a configured number of queries, and "small" queries are queued separately from "big" queries. In either case, I'd never expect to find a big query holding back the execution of a small query. The only scenario I can imagine is that both queries are being classified as the same size (both big, or both small) and you have reached the concurrency limit for the respective queue, so the second query stays queued.
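To make those queueing knobs concrete, here is a minimal sketch (not from the original question) of enabling the ZooKeeper-based queues through Drill's REST API; the host/port, the values, and the exact exec.queue.* option names are assumptions that should be checked against the Drill version in use.

```python
# Minimal sketch: turn on Drill's query queues and split small vs. large queries.
# Assumes a Drillbit web UI/REST endpoint at localhost:8047; adjust values to taste.
import requests

DRILL_URL = "http://localhost:8047/query.json"

settings = [
    "ALTER SYSTEM SET `exec.queue.enable` = true",             # enable queueing
    "ALTER SYSTEM SET `exec.queue.small` = 10",                # max concurrent small queries
    "ALTER SYSTEM SET `exec.queue.large` = 2",                 # max concurrent large queries
    "ALTER SYSTEM SET `exec.queue.threshold` = 30000000",      # plan-cost boundary between small and large
    "ALTER SYSTEM SET `exec.queue.timeout_millis` = 300000",   # how long a query may wait in the queue
]

for sql in settings:
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    print(sql, "->", resp.json().get("queryState", "submitted"))
```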
It might be useful to discuss the issue further in the Apache Drill Slack.

Related

Multiple large MySQL SELECT queries - better to run in parallel or in a queue?

I have looked up answers to this question a bunch and couldn't find a specific answer - sorry in advance if I missed something! Also, I'm a SQL optimization noob.
I have an analytics dashboard which pulls data based on users' requests from a large database.
Each page the user loads runs a number of different queries to populate different parts of the page (different charts, tables, etc). Some of these pages can take quite some time to load as the user might request several years of data.
Currently, each part of the page fires off one SELECT query to the SQL server, but as there are several parts of the page, those queries end up running in parallel.
Would it be faster to run these queries in a queue - to allow the server to process one query at a time? Or to keep everything in parallel, as is?
The added benefit of running them one at a time is that we could run the queries to fill in the "above-the-fold" part of the page first...
Hope that all makes sense and take it easy on me please :)
I also say "it depends", but I lean toward parallelism.
You probably should not have more parallelism than the number of CPU cores.
I rarely see a system that chews up all the CPU cores -- unless it does not have good enough indexes. That is, fix the indexes before asking the question.
If the data is bigger than can be cached, it may be faster to queue, since you may have a choke point -- I/O.
If the table(s) are continually being changed, turn off the Query Cache.
If your goal is to get some results on the page early (a likely human-interface goal), add a small delay in all but one AJAX callee (not caller).
If multiple pages could be computing at the same time, things get more complex. For example, you can't really control the parallelism.
Let's see the queries. Perhaps we can speed them up enough to obviate the question.
There is no right answer to this question. Up to a point, running SELECT queries in parallel is (generally) going to be faster than running them one at a time. Whether that point is 2 queries or 200 depends on the nature of the queries, the hardware configuration, the data, and the speeds of the various components.
The situation becomes even more complex when you consider how many different users may be involved and whether or not the data is being updated. You can get into really bad situations with parallel queries and updates if the locks start cascading. Of course, this can happen with multiple simultaneous users as well.
My guess is that you want a throttling mechanism that will run, say, n queries at a time and put the rest into a queue.
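As a rough sketch of that throttling mechanism (Python is used only for illustration; the pymysql driver, host, credentials, and the SELECT 1 stand-in queries are all placeholders): submit every per-page query to a pool of n workers, with the "above-the-fold" queries submitted first so they get workers first.

```python
# Sketch: at most MAX_PARALLEL dashboard queries run at once; the rest wait in the pool's queue.
from concurrent.futures import ThreadPoolExecutor

import pymysql  # placeholder driver; any DB-API connector works the same way

MAX_PARALLEL = 4  # tune to the server's cores and I/O capacity

def run_query(sql):
    conn = pymysql.connect(host="db-host", user="app", password="...", database="analytics")
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        conn.close()

# Above-the-fold queries first, below-the-fold last; SELECT 1 stands in for the real queries.
queries = [
    "SELECT 1",  # chart above the fold
    "SELECT 1",  # table above the fold
    "SELECT 1",  # report further down the page
]

with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_query, queries))
```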

Setting up the database server separately from the Tomcat server: effects on overall performance

I am curious to know the overall effects on performance if I have my database server separate from the Tomcat server (spinning up a MySQL server on Amazon). Actually, I am having some performance issues and am not sure if this might be the cause.
Yes, absolutely. I have found that separating the DB and the application can actually uncover performance issues that were not evident in a co-located setup, for the network-latency reasons mentioned by ck1. In fact, if you capture stack traces by sampling during the slow operations, they will point to the database/application code that is sensitive to network latency. The use cases with performance issues (in non-co-located apps) generally make a lot of round trips to the database. Instead, try offloading the processing into the DB with a more complex query and reduce the number of rows returned.
Pros of having database and app servers co-located:
Network latency will be minimized
You only need to maintain a single server
Cons of co-location:
The app and database servers will contend for a common set of CPU, memory, and disk I/O resources. For example, queries causing a spike in CPU usage will affect the app server's performance
You have more than a single server to maintain
It's difficult to scale horizontally

MySQL performance benchmark

I'm thinking about moving our production environment from a self-hosted solution to Amazon AWS. I took a look at the different services and thought about using RDS as a replacement for our MySQL instances. The hardware we're using for our master seems to be better than the best hardware we can get with RDS (Quadruple Extra Large DB Instance). Since I can't simply move our production environment to AWS and see whether the performance is still good enough, I'd love to run some tests in advance.
I thought about creating a full query log from our current master, configuring the RDS instance, and replaying the full query log against it. Actually, I don't even know if this kind of testing is a good idea, but I guess you'll tell me if there are better ways to make sure MySQL's performance won't drop dramatically when making the move to RDS.
Is there a preferred tool to replay the full query log?
Which metrics should I look at while running the test?
CPU usage?
Memory usage?
Disk usage?
Query time?
Anything else?
Thanks in advance
I'd recommend against replaying the query log - it's almost certainly not going to give you the information you want, and will take a significant amount of effort.
Firstly, you'd need to prepare your database so that replaying the query log won't break constraints when inserting, updating or deleting data, and that subsequent "select" queries will find the records they should find. This is distinctly non-trivial on anything other than a toy database - just taking a back-up and replaying the log doesn't necessarily guarantee the ordering of DML statements will match what happened on production. This may well give you a false sense of comfort - all your select statements return in a few milliseconds, because the data they're looking for doesn't exist!
Secondly, load and performance testing rarely works by replaying what happened on production - that doesn't (usually) reflect the peak conditions that will bring your system to its knees. For instance, most production systems run happily most of the time at <50% capacity, but go through spikes during the day, when they might reach 80% or more of capacity - that's what you care about, can your new environment handle the peaks.
My recommendation would be to use a tool like JMeter to write performance scripts (either directly against the database using the JDBC driver, or through the front end if you've got a web application). Your performance scripts should reflect the behaviour you see from users, and be parameterized so they're not dependent on the order in which records are created.
Set yourself some performance targets (ideally based on current production levels, with a multiplier to cover you against spikes), e.g. "100 concurrent users, with no query taking more than 1 second", and use JMeter to simulate that load. If you reach it first time, congratulations - go home! If not, look at the performance counters to see where the bottleneck is; see if you can alleviate that bottleneck (or tune your queries - your awesome on-premise hardware may be hiding some performance issues). Typical bottlenecks are CPU, RAM, and disk I/O.
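To illustrate the shape of such a load script outside JMeter, here is a hedged Python sketch (the host, credentials, table, column, and the 1-second target are assumptions for illustration): N simulated users each run a parameterized query, and the run reports how many queries exceeded the target.

```python
# Sketch: 100 concurrent simulated users, parameterized so results don't depend on insert order.
import random
import time
from concurrent.futures import ThreadPoolExecutor

import pymysql  # placeholder driver

USERS = 100
TARGET_SECONDS = 1.0

def one_user(_):
    conn = pymysql.connect(host="rds-endpoint", user="bench", password="...", database="prod_copy")
    try:
        with conn.cursor() as cur:
            customer_id = random.randint(1, 1_000_000)  # random parameter per virtual user
            start = time.monotonic()
            cur.execute("SELECT COUNT(*) FROM orders WHERE customer_id = %s", (customer_id,))
            cur.fetchall()
            return time.monotonic() - start
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=USERS) as pool:
    latencies = list(pool.map(one_user, range(USERS)))

print("slowest: %.3fs, queries over target: %d"
      % (max(latencies), sum(t > TARGET_SECONDS for t in latencies)))
```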
Experiment with different test scenarios - "lots of writes", "lots of reads", "lots of reporting queries", and mix them up.
The idea is to understand the bottlenecks on the system, and see how far you are from those bottleneck, and understand what you can do to alleviate them. Once you know that, your decision to migrate will be far more robust.

MySQL scale up or scale out?

I have been tasked with investigating reasons why our internal web application is hitting performance problems.
The web application itself is part written in PHP and part written in Perl, and we have a MySQL database which is where I believe the source of performance hit is occurring.
We have about 400 users of the system, of which, most are spread across different timezones, so generally there are only ever a max of 30 users online at any one time. The performance problems have crept up on us, particularly over the past year as the database keeps growing.
The system is running on one single 32-bit Debian server - 6 GB of RAM, with 8 x 2.4 GHz Intel CPUs. This is probably not hefty enough for the job in hand. However, even at times when I am the only user online, page loading time can still be slow.
I'm trying to determine whether we need to scale up or scale out. Firstly, I'd like to know how well our hardware is coping with the demands placed upon it; and secondly, whether it might be worth scaling out and creating some replication slaves to balance the load.
There are a lot of tools available on the internet - probably a bit too many to investigate. Can anyone recommend any tools that can provide some profiling/performance monitoring that may help me on my quest.
Many thanks,
ns
Your slow-down seems to be related to the data and not to the number of concurrent users.
Properly indexed queries tend to scale logarithmically with the amount of data - i.e. doubling the data increases the query time by some constant C, doubling it again adds the same C, doubling again adds the same C, and so on. Before you know it, you have humongous amounts of data, yet your queries are only a little slower.
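To spell that out (a sketch, assuming an index lookup whose cost grows with the logarithm of the row count $n$):

$$T(n) = C \log_2 n \;\Rightarrow\; T(2n) = C \log_2(2n) = C(\log_2 n + 1) = T(n) + C.$$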
If the slow-down wasn't as gradual in your case (i.e. it was linear to the amount of data, or worse), this might be an indication of badly optimized queries. Throwing more iron at the problem will postpone it, but unless you have unlimited budget, you'll have to actually solve the root cause at some point:
Measure the query performance on the actual data to identify slow queries.
Examine the execution plans for possible improvements.
If necessary, learn about indexing, clustering, covering and other performance techniques.
And finally, apply that knowledge onto queries you have identified in steps (1) and (2).
If nothing else helps, think about your data model. Sometimes a "perfectly" normalized model is not the best-performing one, so a little judicious denormalization might be warranted.
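For steps (1) and (2) above, a minimal sketch of what the measuring might look like (assuming the pymysql driver; the host, account, and thresholds are placeholders, and the account must be allowed to change global variables):

```python
# Sketch of steps (1) and (2): capture slow queries, then inspect one with EXPLAIN.
import pymysql  # placeholder driver

conn = pymysql.connect(host="db-host", user="admin", password="...", database="app")
with conn.cursor() as cur:
    # Step 1: measure -- log anything slower than 1 second to the slow query log.
    cur.execute("SET GLOBAL slow_query_log = 1")
    cur.execute("SET GLOBAL long_query_time = 1")

    # Step 2: examine -- look at the execution plan of a query the log flagged.
    cur.execute("EXPLAIN SELECT 1")  # replace SELECT 1 with a query from the slow log
    for row in cur.fetchall():
        print(row)
conn.close()
```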
The easy (lazy) way if you have budget is just to throw some more iron at it.
A better way, before deciding where or how to scale, would be to identify the bottlenecks. Is every page load slow, or just particular pages? If it is just a few pages, then invest in a profiler (for PHP, both Xdebug and the Zend Debugger can do profiling). I would also (if you haven't already) invest in a test system that is as similar as possible to the live system, for running diagnostics.
You could also look at gathering some stats, either at the server level with a program such as sar (from the sysstat package), or at the DB level (have you got the slow query log running?).

SQL Azure performance considerations

What performance considerations should I keep in mind when planning an SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they are all using one database... it looks like the bottleneck.
I was trying to find numbers on:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application that performs a very high volume of inserts, but I need to return the result of an aggregate function each time (e.g., the sum of all records with the same key in a column), so I cannot go with Table Storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with lots of connections.
Sharding is another option, but even though the volume of inserts is massive, the amount of data is very small - 4 to 6 columns with one PK and no FK - so even a 1 GB DB would be overkill (and an overpayment :D) for a partition.
What performance factors should I keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of roundtrips to the database and optimize connection pooling make sure to batch your inserts as well. So you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (multiple active recordsets) so that you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
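As a minimal sketch of the batching idea (Python with pyodbc is used only for illustration; the connection string, the events table, and the batch size of 100 are placeholders): buffer incoming inserts and send each batch in a single parameterized round trip.

```python
# Sketch: group inserts so each round trip to SQL Azure carries up to 100 rows.
import pyodbc  # placeholder driver; the original worker/web roles would likely use ADO.NET

BATCH_SIZE = 100
conn = pyodbc.connect("Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;UID=...;PWD=...")
cursor = conn.cursor()
cursor.fast_executemany = True  # send the whole batch as one parameterized call

buffer = []

def record_event(key, value):
    buffer.append((key, value))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    if buffer:
        cursor.executemany("INSERT INTO events (event_key, event_value) VALUES (?, ?)", buffer)
        conn.commit()
        buffer.clear()
```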
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when this happens. When throttling, SQL Azure will drop your connection, requiring you to perform a retry. Number of connections supported and bandwidth is not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has happened. Another way to handle this, is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of automatically engaging if throttling occurs.
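A rough sketch of that overflow mechanism (the three helpers are hypothetical placeholders, not a real API; in practice they would wrap your normal insert code, the SQL Azure throttling error codes, and an Azure Queue client):

```python
# Hypothetical sketch: write directly when possible, overflow to a queue only when throttled.
def insert_directly(row):
    ...  # placeholder: normal parameterized INSERT against SQL Azure

def looks_like_throttling(exc):
    ...  # placeholder: inspect the error raised when SQL Azure drops the connection

def enqueue_for_worker(row):
    ...  # placeholder: push the row onto an Azure Queue; a worker role drains it later

def record(row):
    try:
        insert_directly(row)
    except Exception as exc:            # broad catch only for the sketch
        if looks_like_throttling(exc):
            enqueue_for_worker(row)     # the overflow engages automatically under throttling
        else:
            raise
```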
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.