How does Apache Drill handle big result sets? - apache-drill

Let's say you have Drill connected to two separate databases, and you run a query where you would pull a massive amount of data from each and then do a join.
How does Drill handle this without throwing Out of Memory errors? This is assuming that the data you are requesting exceeds the amount of memory available for Drill to use.

Please check the following excerpt from the Drill documentation.
Drill scales from a single laptop to a 1000-node cluster
Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.
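The "spills to disk" behavior the documentation describes is the classic external-sort idea: when the working set exceeds the memory budget, sorted runs are written to temporary files and later merged. This is a toy, stdlib-only sketch of that mechanism, not Drill's actual implementation:

```python
# Toy illustration of "spill to disk": sort a stream larger than the memory
# budget by writing sorted runs to temp files, then merging them (the same
# idea behind Drill's external sort).

import heapq
import os
import tempfile

def _spill(run):
    # Write one sorted run to a temp file and return its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(map(str, run)))
    return path

def external_sort(values, max_in_memory):
    run_files, buf = [], []
    for v in values:
        buf.append(v)
        if len(buf) >= max_in_memory:          # working set full: spill
            run_files.append(_spill(sorted(buf)))
            buf = []
    if buf:
        run_files.append(_spill(sorted(buf)))
    # Read the runs back and do a k-way merge of the sorted runs.
    runs = [[int(line) for line in open(f)] for f in run_files]
    merged = list(heapq.merge(*runs))
    for f in run_files:
        os.remove(f)
    return merged

print(external_sort([5, 3, 9, 1, 8, 2, 7], max_in_memory=3))
```

With a memory budget of 3, the input produces three spilled runs that merge back into a fully sorted result; a real engine would merge the runs in a streaming fashion rather than loading them into lists.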

Related

Running complex concurrent queries on Apache Drill

In a distributed bare-metal Apache Drill deployment, complex concurrent queries raise two issues:
1) They can hog the cluster resources, especially CPU; this can be controlled to some extent with Linux cgroups.
2) Drill appears to serve concurrent queries first-come-first-served, which means that even if the second query is very simple and should run quickly, it has to wait for an earlier complex, heavy query to finish first. This is not acceptable at all in a production environment.
My question is: is there a workaround for the second problem? If not, what alternatives in the technology stack might help in this case?
We tried changing some Apache Drill configuration parameters related to concurrent queries and queue management.
Without query queueing enabled, Drill takes the approach of unlimited concurrent execution (an approach that will soon exhaust the cluster's resources if new queries arrive rapidly enough). With queueing enabled, concurrency is capped at a configured number of queries, where "small" queries are queued separately from "big" queries. In either case, I'd never expect to find that a big query is holding back the execution of a small query. The only scenario I can imagine is that both queries are being classified as the same size (both big, or both small) and you have reached the concurrency limit for the respective queue, so that the second query stays queued.
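The two-queue admission behavior described above can be sketched as follows. This is an illustrative model, not Drill's code; the threshold and slot counts mirror the roles of Drill's `exec.queue.threshold`, `exec.queue.small`, and `exec.queue.large` options, but the numbers here are made up:

```python
# Sketch of Drill-style two-queue admission control (illustrative only).
# Queries whose estimated cost is below a threshold go to the "small" queue,
# the rest to the "large" queue; each queue has its own concurrency cap.

from collections import deque

class TwoQueueAdmission:
    def __init__(self, threshold, small_slots, large_slots):
        self.threshold = threshold
        self.slots = {"small": small_slots, "large": large_slots}
        self.waiting = {"small": deque(), "large": deque()}
        self.running = {"small": [], "large": []}

    def classify(self, cost):
        # Cost below the threshold -> small queue; otherwise large queue.
        return "small" if cost < self.threshold else "large"

    def submit(self, name, cost):
        q = self.classify(cost)
        if len(self.running[q]) < self.slots[q]:
            self.running[q].append(name)
            return f"{name}: running ({q})"
        self.waiting[q].append(name)   # queue is full: wait
        return f"{name}: queued ({q})"

admit = TwoQueueAdmission(threshold=30_000_000, small_slots=2, large_slots=1)
print(admit.submit("heavy_join", 5_000_000_000))   # large queue, runs
print(admit.submit("heavy_scan", 9_000_000_000))   # large slot taken: queued
print(admit.submit("quick_lookup", 1_000))         # small queue, runs immediately
```

Note how the small query is admitted immediately even while a big query is queued; if a small query is stuck, the likely explanation is that it was classified into the same queue as the big ones.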
It might be useful to discuss the issue further in the Apache Drill Slack.

How to identify AWS RDS Scaling conditions

I have an application that is hosted in AWS ECS and having the database in AWS RDS. I'm using a microservice-based container architecture for my application. The frontend of the application is in Angular and Backends are in Java and Python. Right now, the database size is ~1GB. The database size will increase day by day as the scraped data will be inserted daily.
Right now, some queries are taking 4-6 seconds to execute. We need to open this application to the public, and a lot of users will be using it. When we load tested the application with 50 users, I found that the CPU of RDS reached 100% and some queries took more than 60 seconds to execute and then timed out. The CPU and memory of the other microservices (frontend and backend) are normal. I have tried vertically scaling the database up to 64GB RAM and 4 vCPUs, but this condition remains.
Is this an issue with the query or can I do anything with the database server configuration?
The RDS storage I'm using is 100GB with a general-purpose SSD. So, I guess there will be only 300 IOPS, right? I'm planning to use RDS read replicas but before that, I need to know is there anything that I need to do for improving the performance? Any database configurations etc?
I also don't have a good idea about the MySQL connection count. Right now, it is using a total of 24 connections. Do I need to change the connection count as well?
Query Optimisation
As Tim pointed out, try to optimise the queries. Since more data is being inserted every day, consider indexing the table and making the queries use indexed columns where possible. Also consider archiving unused old data.
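A quick way to check whether a query can use an index is the database's query-plan explainer. This sketch uses sqlite3 (from the Python standard library) as a stand-in; MySQL's `EXPLAIN` serves the same purpose, and the table and column names here are made up:

```python
# Compare the query plan for the same filter before and after adding an index.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scraped (id INTEGER PRIMARY KEY, source TEXT, ts TEXT)")

# Without an index, filtering on `source` scans the whole table.
plan_before = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM scraped WHERE source = 'a'"
).fetchone()
print(plan_before[-1])   # e.g. "SCAN scraped"

db.execute("CREATE INDEX idx_source ON scraped(source)")

# With the index, the same filter becomes an index search.
plan_after = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM scraped WHERE source = 'a'"
).fetchone()
print(plan_after[-1])    # e.g. "SEARCH scraped USING INDEX idx_source (source=?)"
```

Running the slowest 4-6 second queries through `EXPLAIN` on the RDS instance and checking for full-table scans is usually the first step before reaching for bigger hardware.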
number of connections
If you have control over the code, you can make use of database pools to control the number of connections that your applications can use.
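A minimal pool sketch, stdlib only, with sqlite3 standing in for MySQL. A real application would use a library pool (e.g. SQLAlchemy's), but the principle is the same: a fixed set of connections is reused instead of opening a new one per request, which caps the connection count the database sees:

```python
# Minimal connection-pool sketch: a fixed number of connections is created
# up front and handed out/returned via a thread-safe queue.
import sqlite3
from contextlib import contextmanager
from queue import Queue

class ConnectionPool:
    def __init__(self, size, dsn=":memory:"):
        self._pool = Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._pool.get()      # blocks when all connections are busy
        try:
            yield conn
        finally:
            self._pool.put(conn)     # return to the pool instead of closing

pool = ConnectionPool(size=4)
with pool.connection() as conn:
    print(conn.execute("SELECT 1 + 1").fetchone()[0])
```

Because `get()` blocks when the pool is exhausted, bursts of requests queue up in the application instead of piling extra connections onto the database.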
CPU usage
The CPU usage is highly related to the performance of the queries; once you optimise the queries, the CPU usage should come down.
disk usage
Use the CloudWatch metrics to monitor disk usage; based on that, you can decide on a provisioned-IOPS disk.
Hope this helps.

HIVE, HBASE which one I have to use for My Data Analytics

I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop, Hive, or HBase.
Currently I have 4 physical machines for a POC. Could someone help me come up with the most efficient architecture?
I will get 5 GB of data per day.
I have to send a daily status report to each customer.
I also have to produce analysis reports on request, for example a 1-week report or a report for the first 2 weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want to get the best performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or newer engines such as Spark. The advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generate your reports once a day. It's batch processing and 5 more GB a day shouldn't produce a big impact. It has a high overhead latency though, but shouldn't be a problem if you do it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with any of those. HBase uses the DFS to store the data and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from these two sources.
For reports based on request, specifying a date range, you should store the data in an efficient way so you don't have to ingest data that's not needed for your report. Hive supports partitioning and that can be done using date (i.e. /<year>/<month>/<day>/). Using partitioning can significantly optimize your job execution times.
If you go to the NoSQL approach, be sure the rowkeys have some date format as prefix (e.g. 20140521...) so that you can select those that start by the dates you want.
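The date-prefix idea can be sketched in a few lines. With keys like `20140521:<id>`, a report for a date range becomes a lexicographic range scan over sorted keys instead of a full-table read (the key format and record names here are illustrative):

```python
# Sketch of date-prefixed rowkeys for range selection (HBase/Cassandra style).
from bisect import bisect_left, bisect_right

# Rowkeys kept in sorted order, as an HBase table would store them.
rows = sorted([
    "20140519:order-17",
    "20140520:order-03",
    "20140521:order-42",
    "20140522:order-08",
])

def scan_range(rows, start_date, end_date):
    # Select keys whose date prefix falls in [start_date, end_date].
    lo = bisect_left(rows, start_date)
    hi = bisect_right(rows, end_date + "~")  # '~' sorts after ':' and digits
    return rows[lo:hi]

print(scan_range(rows, "20140520", "20140521"))
```

The scan touches only the keys in the requested window, which is exactly why the rowkey design matters: the date range maps directly onto a contiguous key range.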
Some questions you should also consider are:
How much data do you want to store in your cluster, e.g. the last 180 days? This will affect the number of nodes/disks. Beware that data is usually replicated 3 times.
How many files do you have in HDFS? When the number of files is high, the Namenode will be hit hard on retrieval of file metadata. Some solutions exist, such as a replicated Namenode or the MapR Hadoop distribution, which doesn't rely on a Namenode per se.

Mysql cluster for dummies

So what's the idea behind a cluster?
1) You have multiple machines with the same copy of the DB where you spread the reads/writes. Is this correct?
2) How does this idea work? When I make a SELECT query, does the cluster analyze which server has fewer reads/writes and point my query to that server?
3) When should you start using a cluster? I know this is a tricky question, but maybe someone can give me an example, like 1 million visits and a 100-million-row DB.
1) Correct. Every data node does not hold a full copy of the cluster data, but every single bit of data is stored on at least two nodes.
2) Essentially correct. MySQL Cluster supports distributed transactions.
3) When vertical scaling is not possible anymore, and replication becomes impractical :)
As promised, some recommended readings:
Setting Up Multi-Master Circular Replication with MySQL (simple tutorial)
Circular Replication in MySQL (higher-level warnings about conflicts)
MySQL Cluster Multi-Computer How-To (step-by-step tutorial, it assumes multiple physical machines, but you can run your test with all processes running on the same machine by following these instructions)
The MySQL Performance Blog is a reference in this field
1) Your first point is correct in a way, but if multiple machines simply shared the same copy of the data, it would be replication rather than clustering.
In clustering, the data is divided among the various machines via horizontal partitioning: the records are split by rows and distributed among the machines by some algorithm.
The data is divided in such a way that each record gets a unique key, as in a key-value pair, and each machine has a unique node ID that determines which key-value pairs it stores.
Each cluster node consists of an individual MySQL server, its own slice of the data, and a cluster manager, and data is shared between the cluster nodes so that all of the data is available to every node at any time.
Retrieval can be sped up through memcached servers, and a cluster can also have a replication server to safeguard its data.
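The key-to-node assignment described above can be sketched as a simple hash-based placement. This is an illustrative model of the idea (MySQL Cluster actually hashes the table's partition key in a similar spirit); the node names and keys are made up:

```python
# Sketch of key-based horizontal partitioning (sharding): each record's key
# is hashed to pick the node that stores it.

def hash_key(key):
    # Simple stable hash (Python's built-in hash() is salted per process,
    # so it wouldn't give a consistent placement across restarts).
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) % (2**32)
    return h

def node_for_key(key, node_ids):
    # Deterministic: the same key always lands on the same node.
    return node_ids[hash_key(key) % len(node_ids)]

nodes = ["node-a", "node-b", "node-c"]
placement = {user_id: node_for_key(user_id, nodes) for user_id in (101, 102, 103, 104)}
print(placement)
```

Because the mapping is deterministic, any node (or the cluster manager) can compute where a record lives without a lookup table; adding nodes, however, reshuffles placements, which is why real systems layer techniques like consistent hashing on top.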
2) Yes, that is possible, because all of the data is shared among the cluster nodes. You can also use a load balancer to spread the load, though load balancers are quite common and used in front of most server setups anyway. If you are just experimenting for your own knowledge, there is no need: you won't see the kind of load that requires a load balancer, and the cluster manager itself can handle the whole thing.
3) RandomSeed is right. You feel the need for a cluster when replication becomes impractical: if you are using the master server for writes and slaves for reads, then at some point the traffic becomes so heavy that the servers can no longer keep up smoothly, and that is when you feel the need for clustering, simply to speed up the whole process.
That is not the only case, just one example scenario.
Hope this is helpful!

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as our primary data store for our application that collects a very large amount of data and requires large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However being very new to this field, I am not sure how well they would scale and also how much load they would put on the Cassandra cluster itself when running? Are there other options as well?
You should take a look at sqoop, it has an integration with Cassandra as shown here.
This also scales easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
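The slicing step can be sketched as follows. This is an illustrative model of how a Sqoop-style export divides a key range into one split per mapper (the key range and mapper count are made up):

```python
# Sketch of slicing a key range into N contiguous, non-overlapping splits,
# one per mapper. Each mapper then transfers only its slice, so throughput
# grows with the number of mappers (and with cluster size).

def make_splits(min_key, max_key, num_mappers):
    span = max_key - min_key + 1
    splits, start = [], min_key
    for i in range(num_mappers):
        # Distribute any remainder across the first few splits.
        size = span // num_mappers + (1 if i < span % num_mappers else 0)
        splits.append((start, start + size - 1))
        start += size
    return splits

print(make_splits(1, 1_000_000, 4))
```

Each tuple is the inclusive key range one mapper handles; doubling the mapper count halves each slice, which is the mechanism behind the "keep increasing your cluster, throughput keeps increasing" claim.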
As far as the load on the Cassandra cluster, I am not certain since I have not used the Cassandra connector with sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could for example do it once a day at a certain time where the traffic is lowest, so that in case your Cassandra availability drops the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or CQL queries; all of these methods time out, regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.