Apache Drill Filter Pushdown Case

I have a requirement to query federated data that resides in different
vendors' databases on different servers. Let's take an example:
SQL: SELECT t1.NAME, t2.AMOUNT FROM server1.mysql.USERS AS t1 INNER JOIN
server2.oracle.PURCHASES AS t2 ON t1.ID = t2.USER_ID WHERE t1.NAME = 'ABC' AND
t2.TYPE = 'Sales';
When I execute this query, the t2.TYPE = 'Sales' filter is not pushed down to the table level, so:
How will Drill avoid the full table scan? It can cause a performance impact.
How can I push the t2.TYPE = 'Sales' filter down to the table level?
Thanks and Regards
Ajay Babu Maguluri.

I have created DRILL-7340 for this issue and described the root cause of the problem there. The complete fix requires changes in both projects: Drill and Calcite.
The good news is that a PR with the fix in Calcite has been created; after it is merged and Drill is rebased onto the new Calcite version, this issue will be fixed.
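As a hedged illustration (not part of the JIRA discussion): you can confirm whether the filter is pushed down by inspecting the plan and, until the fix lands, try moving the predicate into a derived table. Whether the derived-table filter actually reaches the Oracle side depends on the same planner behaviour, so verify with the plan; the storage plugin and table names below come from the question.
-- Inspect the plan; the TYPE filter should appear inside the scan of
-- server2.oracle.PURCHASES if pushdown worked.
EXPLAIN PLAN FOR
SELECT t1.NAME, t2.AMOUNT
FROM server1.mysql.USERS AS t1
INNER JOIN server2.oracle.PURCHASES AS t2 ON t1.ID = t2.USER_ID
WHERE t1.NAME = 'ABC' AND t2.TYPE = 'Sales';
-- Possible interim workaround: pre-filter in a derived table.
SELECT t1.NAME, t2.AMOUNT
FROM server1.mysql.USERS AS t1
INNER JOIN (SELECT USER_ID, AMOUNT
            FROM server2.oracle.PURCHASES
            WHERE TYPE = 'Sales') AS t2
  ON t1.ID = t2.USER_ID
WHERE t1.NAME = 'ABC';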

Related

Efficient way to flag a record with min field value and common fieldX value in mysql

I am trying to flag all records in a table that have the minimum value among all records with a common FieldX value.
My query was as follows:
UPDATE TableA AS T1
INNER JOIN (SELECT ID, Name, MIN(ValueField)
            FROM TableA
            WHERE GroupFlag = 'X'
            GROUP BY CommonTermField) AS T2
  ON T1.ID = T2.ID
SET MainFlag = 'Y';
This worked a while back, but now I keep getting a timeout/table-locked error, and I am assuming that it is because the table is 26 million rows long (with appropriate indexes). Is there a more efficient way to update than using an inner join in this case?
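(For what it's worth, a hedged sketch of how this kind of flagging is often written, joining back to the per-group minimums; the column names come from the query above, and the exact intent is assumed rather than confirmed by the thread.)
UPDATE TableA AS T1
INNER JOIN (SELECT CommonTermField, MIN(ValueField) AS MinValue
            FROM TableA
            WHERE GroupFlag = 'X'
            GROUP BY CommonTermField) AS T2
  ON T1.CommonTermField = T2.CommonTermField
 AND T1.ValueField = T2.MinValue
SET T1.MainFlag = 'Y';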
Update:
After trying to run another update/inner-join that previously worked and also getting a table-locked type error, it occurred to me that we recently migrated to larger servers so we would have headroom to work with these tables. I did some checking while DevOps was out, and it turns out the settings hadn't been migrated yet, so our "innodb_buffer_pool", which had previously been 2GB, was only 128MB. I am waiting until they get in to migrate that and the other settings, but am 99% sure the "inefficiency" in the query (which previously worked fine) is due to that. I will leave the question open until then; if the buffer pool fix works, I will answer my own question with the settings we changed and how, in case anyone else runs into this issue (seeming query inefficiency that is in fact a MySQL settings issue).
OK, so the answer to the question was MySQL settings. Apparently, when we migrated servers, DevOps/SysAdmin did migrate the settings but didn't restart the server, as I jumped right into query mode. We restarted last night and things worked swimmingly.
The issue was that innodb_buffer_pool_size was at its 128MB default, while our custom settings had it at 2GB.
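For reference, a minimal sketch of checking and resizing the buffer pool; the 2GB figure is the value mentioned above, and SET GLOBAL only works on MySQL 5.7+ where the variable is dynamic (on older versions, change my.cnf and restart):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';    -- value is reported in bytes
SET GLOBAL innodb_buffer_pool_size = 2147483648;  -- 2GB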

Oracle best practice

I have to pull some incremental data and do some small and complex calculations after that. But as the days passed, the data grew large, and after the first incremental stage it started taking more time to insert and update the large record sets.
So, what I did was:
CREATE TABLE T1 AS(SELECT (some_conditions) FROM SOME_TABLE);
CREATE TABLE T2 AS(SELECT (some_conditions) FROM T1);
DROP TABLE T1
RENAME T2 TO T
Is this a good practice in a production environment? It works very fast, though.
Normally I'd agree that DDL is a pretty bad thing to do regularly, but we need to be pragmatic.
I think if Tom Kyte (Oracle guru) says it's OK, then it's OK.
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:6407993912330
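One caveat worth adding (a hedged note, not from the linked thread): CREATE TABLE ... AS SELECT does not carry over indexes, grants, or most constraints, so after the DROP/RENAME swap those usually have to be re-created. The index and grant names below are purely illustrative.
-- After the swap above, re-create whatever existed on the old table, e.g.:
CREATE INDEX T_IDX1 ON T (SOME_COLUMN);
GRANT SELECT ON T TO SOME_ROLE;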

speed select vs join

The two queries below do the same thing: basically, show all the IDs of table1 which are present in table2. The thing that puzzles me is that the simple SELECT is way, way faster than the JOIN. I would have expected the JOIN to be a bit slower, but not by that much... 5 seconds vs. 0.2.
Can anyone elaborate on this?
SELECT table1.id
FROM table1, table2
WHERE table1.id = table2.id
Duration/Fetch 0.295/0.028 (MySql Workbench 5.2.47)
SELECT table1.id
FROM table1
INNER JOIN table2
ON table1.id = table2.id
Duration/Fetch 5.035/0.027 (MySql Workbench 5.2.47)
Q: Can anyone elaborate on this?
A: Before we go down the "a bug in MySQL" route that @a_horse_with_no_name seems impatient to race down, we'd really need to ensure that this is repeatable behavior and isn't just a quirk.
And to do that, we'd really need to see the elapsed time result from more than one run of the query.
If the query cache is enabled on the server, we want to run the queries with the SQL_NO_CACHE hint added (SELECT SQL_NO_CACHE table1.id ...) so we know we aren't retrieving cached results.
I'd repeat the execution of each query at least three times, and throw out the result from the first run, and average the other runs. (The purpose of this is to eliminate the impact of the table data not being in the cache, either InnoDB buffer, or the filesystem cache.)
Also, run an EXPLAIN SELECT ... for each query and compare the access plans.
If either of these tables is MyISAM storage engine, note that MyISAM tables are subject to locking by DML operations; while an INSERT, UPDATE or DELETE operation is run on the table, the SELECT statements will be blocked from accessing the table. (But five seconds seems a bit much for that, unless these are really large tables, or really inefficient DML statements).
With InnoDB, the SELECT queries won't be blocked by DML operations.
Elapsed time is also going to depend on what else is going on on the system.
But the total elapsed time is going to include more than just the time spent in the MySQL server. Temporarily turning on the MySQL general_log would let you capture the statements that are actually being processed by the server.
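Putting those suggestions together, an assumed test procedure could look like the following (time each SELECT several times and discard the first run, as described above):
-- Bypass the query cache while timing both forms:
SELECT SQL_NO_CACHE table1.id
FROM table1, table2
WHERE table1.id = table2.id;
SELECT SQL_NO_CACHE table1.id
FROM table1
INNER JOIN table2 ON table1.id = table2.id;
-- Then compare the access plans:
EXPLAIN SELECT table1.id FROM table1, table2 WHERE table1.id = table2.id;
EXPLAIN SELECT table1.id FROM table1 INNER JOIN table2 ON table1.id = table2.id;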
This looks like something that could be further optimized by the database engine if indeed you are running both queries under the exact same context.
SQL is declarative. By successfully declaring what you want, you give the engine free rein to restructure the "how" of your request to bring back the fastest result.
The earliest versions of SQL didn't even have the keyword JOIN. There was only the comma.
There are many coding constructs in SQL that imperatively force one inferior methodology over another, and they should be avoided. JOIN shouldn't be one of them. Something seems amiss: JOIN is the core element of SQL, and it would be a shame to always have to use commas.
There are a zillion factors that go into the performance of a JOIN, all based on your environment, schema, and data. Chances are that your table1 and table2 represent a fringe case that got past the optimization algorithms.
The SQL_NO_CACHE worked; the new results are:
Duration/Fetch 5.065 / 0.027 for the select-where, and
Duration/Fetch 5.050 / 0.027 for the join.
I would have thought that the select-where would be faster, but the join was actually a tad swifter. The difference is negligible, though.
I would like to thank everyone for their response.

Will Hadoop be faster than MySQL

I am facing a big-data problem. I have a large MySQL (Percona) table which is joined to itself once a day and produces about 25 billion rows. I am trying to group and aggregate all the rows to produce a result. The query is a simple join:
-- This query produces about 25 billion rows
SELECT t1.colA AS 'varchar(45)_1', t2.colB AS 'varchar(45)_2', COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
GROUP BY t1.colA, t2.colB
The problem is that this process takes more than a week to complete. I have started reading about Hadoop and am wondering if its map/reduce model can reduce the time needed to process the data. I noticed Hive is a nice little add-on that allows SQL-like queries on Hadoop. This all looks very promising, but I am facing the issue that I will only be running on a single machine:
6-core i7-4930K
16GB RAM
128GB SSD
2TB HDD
When I run the query with MySQL, my resources are barely being used: only about 4GB of RAM, and one core is working at 100% while the others are close to 0%. I checked into this and found that MySQL is single-threaded. This is also why Hadoop seems promising, as I noticed it can run multiple mapper functions and make better use of my resources. My question remains: is Hadoop able to replace MySQL in my situation, so that it can produce results within a few hours as opposed to over a week, even though Hadoop will only be running on a single node (although I know it is meant for distributed computing)?
Some very large hurdles for you are going to be that Hadoop is really meant to run on a cluster and not on a single server. It can make use of multiple cores, but the amount of resources it will consume is very significant. I have a single system that I use for testing that has Hadoop and HBase; it runs the namenode, secondary namenode, datanode, nodemanager, resourcemanager, ZooKeeper, etc., which is a very heavy load for a single system. Plus, Hive is not a truly SQL-compliant replacement for an RDBMS, so it has to emulate some of the work by creating map/reduce jobs. These jobs are considerably more disk-intensive and use the HDFS file system for mapping the data into virtual tables (the verbiage may vary). HDFS also has fairly significant overhead, because the filesystem is meant to be spread over many systems.
With that said, I would not recommend solving your problem with Hadoop. I would recommend checking out what it has to offer in the future, though.
Have you looked into sharding the data, which can take advantage of multiple processors? IMHO this would be a much cleaner solution.
http://www.percona.com/blog/2014/05/01/parallel-query-mysql-shard-query/
You might also look into testing postgres. It has very good parallel query support built in.
Another idea is to look into using an OLAP cube to do the calculations; it can rebuild the indexes on the fly so that only changes are taken into effect. Given that you are really dealing with data analytics, this may be an ideal solution.
Hadoop is not a magic bullet.
Whether anything is faster in Hadoop than in MySQL is mostly a question of how good you are at writing Java code (for mappers and reducers in Hadoop) versus SQL...
Usually, Hadoop shines when you have a problem that runs well on a single host and you need to scale it out to 100 hosts at once. It is not the best choice if you only have a single computer, because it essentially communicates via disk, and writing to disk is not the best way to communicate. The reason it is popular for distributed systems is crash recovery, but you cannot benefit from that here: if you lose your single machine, you have lost everything, even with Hadoop.
Instead:
figure out if you are doing the right thing. There is nothing worse than spending time to optimize a computation that you do not need. Consider working on a subset, to first figure out whether you are doing the right thing at all... (chances are, there is something fundamentally broken with your query in the first place!)
optimize your SQL. Use multiple queries to split the workload. Reuse earlier results, instead of computing them again.
reduce your data. A query that is expected to return 25 billion rows must be expected to be slow! It's just really inefficient to produce results of this size. Choose a different analysis, and double-check that you are doing the right computation, because most likely you aren't; you are doing much too much work.
build optimal partitions. Partition your data by some key and put each date into a separate table, database, file, whatever... then process the joins one such partition at a time (or, if you have good indexes on your database, just query one key at a time), as sketched below.
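A minimal sketch of that last point, using the table and column names from the question (the date literal is only an illustration): run the query once per distinct date and combine the partial results afterwards.
SELECT t1.colA, t2.colB, COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
WHERE t1.date = '2013-06-01'   -- one key (date) per run
GROUP BY t1.colA, t2.colB;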
Yes, you are right: MySQL is single-threaded, i.e. one thread per query.
With only one machine, I don't think Hadoop will help you much, because you may utilize the cores but you will have contention over I/O, since all threads will try to access the disk.
The number of rows you mentioned is a lot, but you have not mentioned the actual size of your table on disk.
How big is your table actually (in bytes on disk, I mean)?
Also, you have not mentioned whether the date column is indexed.
It could help if you removed t2.colB or removed the GROUP BY altogether.
GROUP BY does sorting, and in your case that isn't good. You could try to do the grouping in your application.
Perhaps you should tell us what exactly you are trying to achieve with your query. Maybe there is a better way to do it.
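On the index question above, a minimal sketch of checking for and adding an index on the date column (the placeholder table name is taken from the question's query):
SHOW INDEX FROM `table`;
ALTER TABLE `table` ADD INDEX idx_date (`date`);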
I had a similarly large query and was able to take advantage of all cores by breaking my query up into multiple smaller ones and running them concurrently. Perhaps you could do the same. Instead of one large query that processes all dates, you could run two (or N) queries that each process a subset of dates and write the results into another table.
I.e., if your data spanned 2012 to 2013:
INSERT INTO myResults (colA, colB, colC)
SELECT t1.colA AS 'varchar(45)_1', t2.colB AS 'varchar(45)_2', COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
WHERE t1.date BETWEEN '2012-01-01' AND '2012-12-31'
GROUP BY t1.colA, t2.colB;

INSERT INTO myResults (colA, colB, colC)
SELECT t1.colA AS 'varchar(45)_1', t2.colB AS 'varchar(45)_2', COUNT(*)
FROM table t1
JOIN table t2 ON t1.date = t2.date
WHERE t1.date BETWEEN '2013-01-01' AND '2013-12-31'
GROUP BY t1.colA, t2.colB;
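If the partial counts are written into myResults as above, a final roll-up (a hedged follow-up, not part of the original answer) can simply sum them, since each joined row carries exactly one date and the date ranges do not overlap:
SELECT colA, colB, SUM(colC) AS total
FROM myResults
GROUP BY colA, colB;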

ORM solutions for multi-database queries

In an ORM you can have nice syntax like this:
my $results = Model.objects.all()[10];
And the Django ORM even handles foreign-key and many-to-many relationships quite nicely, all through the ORM.
However, in MySQL you can run a query like this:
SELECT t1.column1
, t2.column2
, t3.column3
FROM db1.table AS t1
, db2.table AS t2
, db3.table AS t3
WHERE t1.id = t2.t1_id
AND t1.id = t3.t1_id
LIMIT 0,10
I'm looking for an ORM that can support these types of queries natively, but can't really see anything that does.
Are there any existing ORMs that can do this? Or are there alternative strategies for tackling this problem?
Whenever I've used a framework like Django to build a site, I've kept everything in the same database because I was aware of this limitation. Now I'm working with data that's spread across many different databases, for no apparent reason other than namespacing.
It might be worth looking at something at a lower level than the ORM. For example, something along the lines of C-JDBC provides a 'virtual' database driver that talks to a cluster of databases behind the scenes. (Tables can be distributed across servers.)
(I realise you're using Python, so this specific example would only be of use if you could run Jython on the JVM and integrate that way; however, I'm guessing similar libraries exist that are closer suited to your specific requirements.)