multiple fragments in apache drill - apache-drill

I'm executing a query on an Apache Drill cluster, but it is creating only 1 minor fragment. I have tried various queries (a union of 2 queries, etc.) and run them against 5 million records, but it still creates only 1 fragment. Is there any configuration change I can make so that multiple fragments are created and executed on each drillbit individually?
How can I confirm whether the query is being executed on 1 drillbit instance or on multiple instances?

You can use a join to increase the number of fragments.
Example:
SELECT t1.name
FROM dfs.`/*` AS t1
JOIN (SELECT name FROM dfs.`/*` WHERE xxx) t2 ON t1.name = t2.name
WHERE yyyy
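If it helps, Drill only parallelizes a scan once the estimated work exceeds the `planner.slice_target` option (100,000 rows by default), and per-node parallelism is capped by `planner.width.max_per_node`, so a small input (or one Drill cannot split) will typically stay at one minor fragment no matter how many drillbits you have. A rough sketch of session-level tweaks (the values here are only illustrative):
ALTER SESSION SET `planner.slice_target` = 1000;       -- parallelize once ~1000 rows are estimated (default 100000)
ALTER SESSION SET `planner.width.max_per_node` = 4;    -- allow up to 4 minor fragments per drillbit
To see where a query actually ran, open its entry on the Profiles tab of the Drill Web UI: the profile lists every minor fragment along with the drillbit host that executed it.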

Related

Finding consecutive sequences with a given filter width in SQL

I am using MySQL to store signals which are sampled over a given period. My task involves identifying a faulty signal based on a filter width. The signal table consists of the signal index along with its value. During primary filtering, I was able to get the indexes of the sequence where there is a mismatch, so the filtered table now consists of the indexes of the signal where there is a mismatch. Now I want to count the number of instances when the signal is faulty.
For example, if the filtered table consists of the indexes 3,4,5,6,9,10,13,16 and I apply a filter of width 3, then there are two instances where the signal is faulty, as indicated by the index sequences 3,4,5 and 4,5,6. If I apply a filter of width 2 then, similarly, there are 4 instances.
I want to count this by using SQL queries on the table which consists of these indexes.
For now, this is what I'm doing for a filter width of 2:
SELECT COUNT(*) FROM table_index AS t1 INNER JOIN table_index AS t2 WHERE t1.id+1=t2.id;
But this approach gets quite costly for a filter width of 3 or more, since one needs to inner join that many copies of the table.
Is there an efficient way of doing this using SQL queries only, or do I need to do this analysis by reading the indexes some other way (e.g. with Python)?
Thank you.
SELECT t1.id as starting_id
FROM test t1
JOIN test t2 ON t1.id BETWEEN t2.id - #filter_count + 1 AND t2.id
GROUP BY t1.id
HAVING COUNT(*) = #filter_count;
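If you are on MySQL 8.0 or later, a window function avoids the self-join entirely. This is only a sketch, assuming the indexes in table_index are distinct integers; here the width is 3, so the row two places ahead must be exactly id + 2:
SELECT COUNT(*) AS faulty_instances
FROM (
    SELECT id, LEAD(id, 2) OVER (ORDER BY id) AS id_ahead   -- LEAD offset = filter width - 1
    FROM table_index
) t
WHERE id_ahead = id + 2;
For the sample indexes 3,4,5,6,9,10,13,16 this returns 2, matching the two runs 3,4,5 and 4,5,6.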

Better way to get 15 tables results at a time in MySql

I have about 20 tables. These tables have only id (primary key) and description (varchar), and each table has roughly 400 rows.
Right now I need to get the data of at least 15 tables at a time.
Currently I am calling them one by one, which means that in one session I am making 15 calls. This is making my process slow.
Can anyone suggest a better way to get the results from the database?
I am using a MySQL database with Java Spring on the server side. Would making a view that combines them all help me?
The application is becoming slow because of this issue and I need a solution that will make my process faster.
It sounds like your schema isn't so great. 20 tables of id/varchar sounds like a broken-up EAV design, which is generally considered broken to begin with. Just the same, I think a UNION query will help out. This would be the "view" to create in the database so you can just SELECT * FROM thisviewyoumade and let it worry about hitting all the tables.
A UNION query works by having multiple SELECT statements "stacked" on top of one another. It's important that each SELECT statement has the same number, order, and types of fields so that when the results are stacked, everything matches up.
In your case, it makes sense to manufacture an extra field so you know which table each row came from. Something like the following:
SELECT 'table1' as tablename, id, col2 FROM table1
UNION ALL
SELECT 'table2', id, col2 FROM table2
UNION ALL
SELECT 'table3', id, col2 FROM table3
... and on and on
The names or aliases of the fields in the first SELECT statement are the field names used in the result set that is returned, so there's no need to do a bunch of AS blahblahblah in the subsequent SELECT statements.
The real question is whether this UNION query will perform faster than 15 individual calls on such a tiny tiny tiny amount of data. I think the better option would be to change your schema so this stuff is already stored in one table, just like this UNION query outputs. Then you would only need a single SELECT against a single table, and 400x20=8000 rows is still a dinky little table to query.
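If you do stick with the view route for now, wrapping that UNION up is straightforward; this is only a sketch, and the view name all_descriptions is a placeholder (description is the varchar column from the question):
CREATE VIEW all_descriptions AS
SELECT 'table1' AS tablename, id, description FROM table1
UNION ALL
SELECT 'table2', id, description FROM table2
UNION ALL
SELECT 'table3', id, description FROM table3;
-- ...and on through the remaining tables
A single round trip then fetches whatever subset you need, e.g.:
SELECT tablename, id, description FROM all_descriptions WHERE tablename IN ('table1', 'table2');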
To get a row of all the descriptions into app code in a single round trip, send a query like:
select t1.description, ..., t15.description
from t -- this should contain all the needed ids
join table1 t1 on t1.id = t.t1id
...
join table15 t15 on t15.id = t.t15id
I can't give you exactly what you need, but here is how you could merge all those table values into a single table:
CREATE TABLE table_name AS (
  SELECT *
  FROM table1 t1
  LEFT JOIN table2 t2 ON t1.ID = t2.ID
  ...
  LEFT JOIN tableN tN ON tN-1.ID = tN.ID
)

order by clause not working after shrinking database

Recently, I shrank my local database and its size was reduced from 6 GB to 1 MB.
But after that, some queries don't work, even though they already work on the development and live servers (the SQL Server version is the same on local, development and live).
One of these queries is:
SELECT a.col1,
b.col2,
isnull(a.intPriority, 100) AS intPriority
FROM tab1 a
INNER JOIN tab2 b
ON a.id = b.id
UNION
SELECT a.col1,
b.col2,
isnull(a.intPriority, 100) AS intPriority
FROM tab1 a
INNER JOIN tab2 b
ON a.id = b.id
ORDER BY a.intPriority
This query gives me an error:
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
The above query runs well on the dev and live servers, so why not locally?
I know that if I change the ORDER BY to intPriority the problem will be solved, but that's not a real solution; I would have to make that change across my entire website.
I think you just need:
ORDER BY intPriority
Also I don't think this has anything at all to do with shrinking your database, but perhaps you upgraded from SQL Server 2000 as well? If so you can "get by" in the meantime by rolling your compat level back to 2000. Just to demonstrate, on SQL Server 2008:
SELECT name = COALESCE(a.name, '') FROM sys.objects AS a
UNION ALL
SELECT name = COALESCE(a.name, '') FROM sys.objects AS a
ORDER BY a.name;
Fails with:
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
But works after setting:
ALTER DATABASE my_db SET COMPATIBILITY_LEVEL = 80;
So you can set the compat level for your database back to 2000, and your invalid code will work in the meantime, but you really should FIX it, because eventually 80 won't be a valid compatibility level (it is no longer valid in SQL Server 2012) and because someone else might upgrade the compatibility level on the servers where this is already working (since this is typically one of the recommended steps after upgrading a database).
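As a quick check before changing anything, you can see which compatibility level your local database is currently running at (sys.databases exists on SQL Server 2005 and later; 'my_db' is a placeholder for your database name):
SELECT name, compatibility_level
FROM sys.databases
WHERE name = 'my_db';
If your local database reports 90 or higher while dev and live report 80, that difference, rather than the shrink itself, explains why the query only fails locally.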

JOIN or INNER SELECT with IN, which is faster?

I was wondering which is faster: an INNER JOIN or an inner SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is a key, these queries mean the same thing, and a decent DBMS will execute them in exactly the same way. Unfortunately MySQL doesn't, as can be seen by expanding the "View Execution Plan" link in this SQL Fiddle. Which one will be faster probably depends on the size of the tables: if test1 has very few rows, then IN has a chance of being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
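For example, prefixing either query from the question works as-is; what plan you actually get will of course depend on your MySQL version and table statistics:
EXPLAIN SELECT t1.* FROM test1 t1 INNER JOIN test2 t2 ON t1.id = t2.id WHERE t2.id = 'blah';
EXPLAIN SELECT t1.* FROM test1 t1 WHERE t1.id IN (SELECT t2.id FROM test2 t2 WHERE t2.id = 'blah');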
If all else is equal, use the INNER JOIN, mostly because it's more predictable and thus easier for a new developer coming in to understand. But of course, if you see a real advantage to the IN (SELECT ...) form, use it!
Though you'd have to check the execution plan on whatever RDBMS you're asking about, I would guess the inner join would be faster, or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario, since it has statistics on each table; but even in the best case it would only match the inner join, not beat it.
In most cases a JOIN is much faster than a subquery, but a subquery is more readable than a JOIN.
The RDBMS creates an execution plan for the JOIN, so it can predict which data should be loaded and processed, which saves time. For the subquery, on the other hand, it runs the whole inner query and loads all of its data before doing the processing.

Finding duplicate records in mysql based on a bitmask

I have a mysql table which stores maintenance logs for sensors. I'd like to design a query that finds instances where a given sensor was repaired/maintained for the same reason. (Recurring problem finder.)
My table (simplified) looks like this:
id name mask
== ==== ====
11 alpha 0011
12 alpha 0010
13 alpha 0100
14 beta 0001
The mask field is a bitmask where each position represents a particular type of repair. I was able to figure out how to compare the bitmasks (per this question), but trying to incorporate that into a query is proving more difficult than I thought.
Given the above sample records, only ids 11 and 12 apply, since they both have a 1 in the third mask position.
Here's what I've tried and why it didn't work:
1. Never finishes...
This query seems to run forever; I don't think it is working the way I want.
SELECT t1.id, t1.name
FROM data t1
LEFT OUTER JOIN data t2
ON (CONV(t1.mask,2,10) & CONV(t2.mask,2,10) > 0)
GROUP BY t1.name
HAVING COUNT(*) >1;
2. Incomplete query...
I was thinking of creating a view to only look at the sensors that actually have more than one entry in the table, but I wasn't sure where to go from there.
SELECT COUNT(t1.name) AS times, t1.name, t1.id, t1.mask
FROM data AS t1
GROUP BY t1.name ASC
HAVING times > 1;
Any suggestions on this?
Since the database structure was not designed with the realities of RDBMSs in mind (probably not your doing, I just have to make the point anyway…), the performance will always be poor, though it is possible to write a query that will finish.
Jim is correct in that the query results in a cartesian product. If that query were to be returned ungrouped and unfiltered, you could expect (SELECT POW(COUNT(*), 2) FROM data) results. Also, any form of outer join is unnecessary, so a standard inner join is what you want here (not that it ought to make a difference in terms of performance, it's just more appropriate).
Another condition on the join, t1.id != t2.id, is also necessary, lest each record match itself.
SELECT t1.id, t1.name
FROM data t1
JOIN data t2
  ON t1.name = t2.name
  AND t1.id != t2.id
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
GROUP BY t1.name
HAVING COUNT(*) > 1;
Your incomplete query:
SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE t1.name IN (SELECT t2.name FROM data t2 GROUP BY t2.name HAVING COUNT(*) > 1);
SELECT t1.id, t1.name, t1.mask
FROM data t1
WHERE EXISTS (SELECT 1 FROM data t2 WHERE t2.name = t1.name GROUP BY t2.name HAVING COUNT(*) > 1);
Off the top of my head I can't tell you which of those would perform best. If data.name is indexed (and I would hope it is), the cost for either query ought to be rather low. The former will cache a copy of the subselect, whereas the latter will perform multiple queries against the index.
One very basic optimization (while leaving the table structure as a whole untouched) would be to convert the mask field to an unsigned integer data type, thereby saving many calls to CONV().
WHERE CONV(t1.mask, 2, 10) & CONV(t2.mask, 2, 10) > 0
becomes
WHERE t1.mask & t2.mask > 0
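The column change itself could look something like this rough sketch; mask_int is a hypothetical temporary column, and you'd want to verify the converted values before dropping the original:
ALTER TABLE data ADD COLUMN mask_int INT UNSIGNED;
UPDATE data SET mask_int = CONV(mask, 2, 10);          -- one-time conversion of the binary string
ALTER TABLE data DROP COLUMN mask;
ALTER TABLE data CHANGE mask_int mask INT UNSIGNED;    -- rename the new column into place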
Of course, breaking the data down further makes even more sense. Instead of storing a bitmask in one record, break out each set bit into its own record:
id name mask
== ==== ====
11 alpha 1101
would become
id name value
== ==== =====
11 alpha 1
12 alpha 4
13 alpha 8
Now, a strategically placed index on name and value makes the query a piece of cake
SELECT name, value
FROM data
GROUP BY name, value
HAVING COUNT(*) > 1;
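The "strategically placed index" mentioned above could be as simple as the following; idx_name_value is just an illustrative name:
CREATE INDEX idx_name_value ON data (name, value);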
I hope that this helps.
Break the mask bits out into real columns. RDBMSs don't like bit fields.
Your join results in a cartesian product of the table with itself. Add t1.name = t2.name to the join, giving a bunch of (much) smaller cartesian products, one per unique name, which will speed things up considerably.