WHERE filename in Apache Drill does a full scan of all files - apache-drill

select distinct filename from dfs.contoso.`folder/CSVs/`
> 2021-01.csv
> 2021-02.csv
> ...
or
select count(*) as cnt from dfs.contoso.`folder/CSVs/`
where filename = '2021-01.csv'
> 4562751239
The problem is that both of these queries take AN HOUR. From the query plan it is obvious that Drill goes through ALL files in the destination folder and only AFTER THAT filters the data by filename. That's absolutely unusable for bigger datasets.
Unfortunately, I cannot change the data structure and I cannot put a single file in the FROM clause (from dfs.contoso.`folder/CSVs/2021-01.csv`) because at that point Drill does not use the created CSV schema, which I need.
Is there any reason why Drill does this?
How can we do this efficiently?
Drill 1.19
UPDATE
The main problem is not enumerating the files in a folder but reading data from a single file among many in a directory.
Having this filesystem:
CsvHistory/2019-01.csv [2GB]
CsvHistory/2019-02.csv [3GB]
...
CsvHistory/2021-09.csv [6GB]
We needed to query one file directly, without reading the others in the folder and without changing the filesystem structure, since that is not allowed.
We needed this query not to traverse all the other files because that is a huge waste of performance.

I'm sorry you gave up on Drill, but I'm going to post this for anyone else who might be reading this.
You do have to understand a bit about how Drill handles schemas. Firstly, Drill attempts to infer the schema from the underlying data. For the queries listed above, it looks like you are trying to find the file names in a given directory and count rows in each file. Neither of these requires a schema at all. As I mentioned, you should use the INFORMATION_SCHEMA to query directories or a SHOW FILES IN <dir> query for that.
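For example, just listing the files without scanning their contents could be as simple as this (a sketch; I'm reusing the dfs.contoso workspace and folder from your question):
SHOW FILES IN dfs.contoso.`folder/CSVs`
This returns the file names, sizes and timestamps without reading the file data itself.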
Providing a Schema
If the schema Drill infers isn't cutting it for you, you can provide a schema to Drill either at query time or by running a CREATE SCHEMA query, which will create a hidden schema file. Here is a link to the docs for that functionality: https://drill.apache.org/docs/create-or-replace-schema/. I've not used this functionality extensively, but I do know that you can certainly provide a schema file for single files. I'm not sure about entire directories, but I believe it was meant to do that.
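As a rough sketch of what that could look like for the directory from your question (the column names are just placeholders, and I haven't verified this against a whole directory myself):
CREATE OR REPLACE SCHEMA (`Year` INT, `Make` VARCHAR)
FOR TABLE dfs.contoso.`folder/CSVs`
The provided schema is stored as a hidden .drill.schema file in the table's root directory.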
The documentation for the inline schema is a bit lacking but you can also do this at query time as follows:
SELECT Year
FROM table(dfs.test.`test.csvh` (schema=>'inline=(`Year` int)'))
WHERE Make = 'Ford'
The result here would be that the Year column would be interpreted as an INT rather than the default VARCHAR.
**UPDATE:** It turns out that you CAN also provide a schema in the table() function above. See below.
SELECT Year
FROM table(dfs.test.`test.csvh` (schema => 'path=`/path/to/schema`'))
WHERE Make = 'Ford'

Possibly Drill shuffles the filenames together with the records in them.
You might move each file into its own directory and GROUP BY dir0:
# hadoop fs -mv /folder/CSVs/2021-01.csv /folder/CSVs/2021-01/2021-01.csv
sql> select dir0 as fileprefix,count(1) from dfs.contoso.`/folder/CSVs/` group by dir0
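Once the files are laid out one per subdirectory like that, a filter on dir0 should let Drill prune down to a single subdirectory instead of scanning everything (a sketch, assuming the layout above):
sql> select count(*) as cnt from dfs.contoso.`/folder/CSVs/` where dir0 = '2021-01'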

I'm a little unclear as to what you're trying to do.
If you are querying an individual file, you should not specify the file name the way you are doing it. As you've noted, Drill will do a full recursive directory scan and then open the file you requested. The better way to query an individual file is to specify the file name in the FROM clause, as shown below.
SELECT count(*) as cnt
FROM dfs.contoso.`folder/CSVs/2021-01.csv`
You can also use globs and wildcards in the file path. You might also want to look at the Drill docs for some more info about efficiently querying directories.
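For instance, something like this might restrict the scan to one year's files (the pattern is just illustrative; I haven't tested it against your layout):
SELECT count(*) as cnt
FROM dfs.contoso.`folder/CSVs/2021-*.csv`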
If you are looking to explore what files are present in various directories, you'll want to use the INFORMATION_SCHEMA for that. Take a look at the INFORMATION_SCHEMA docs, in particular the section about files.
For instance:
SELECT *
FROM information_schema.`files`
WHERE schema_name = 'dfs.contoso'

Related

Export blob column from MySQL DB to disk and replace it with new file name

So I'm working on a legacy database, and unfortunately the performance of the database is very slow. A simple SELECT query can take up to 10 seconds on tables with fewer than 10,000 records.
So I tried to investigate the problem and found out that deleting the column they have used to store files (mostly videos and images) fixes the problem and improves performance a lot.
Along with adding proper indexes, I was able to run the exact same query that used to take 10-15 sec in under 1 sec.
So my question is: is there any existing tool or script I can use to export those blobs (videos) from the database, save them to disk, and update each row with the new file name/path on the file system?
If not, is there any proper way to optimize the database so that those blobs would not impact performance that much?
Hint: some of the clients consuming this database use high-level ORMs, so we don't have much control over the queries the ORM uses to fetch rows and their relations. So I cannot optimize the queries directly.
SELECT column FROM table1 WHERE id = 1 INTO DUMPFILE 'name.png';
How about this way?
There is also INTO OUTFILE instead of INTO DUMPFILE.
13.2.10.1 SELECT ... INTO Statement
The SELECT ... INTO form of SELECT enables a query result to be stored in variables or written to a file:
SELECT ... INTO var_list selects column values and stores them into variables.
SELECT ... INTO OUTFILE writes the selected rows to a file. Column and line terminators can be specified to produce a specific output format.
SELECT ... INTO DUMPFILE writes a single row to a file without any formatting.
Link: https://dev.mysql.com/doc/refman/8.0/en/select-into.html
Link: https://dev.mysql.com/doc/refman/8.0/en/select.html
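If you also need to record the new path back in the row, a hypothetical sketch could look like the following (table and column names are placeholders, and keep in mind that INTO DUMPFILE writes on the database server's filesystem, typically only under the directory allowed by secure_file_priv):
-- video_blob, file_path and table1 are placeholder names, not from the question
SELECT video_blob
  INTO DUMPFILE '/var/lib/mysql-files/video_1.bin'
  FROM table1
 WHERE id = 1;

UPDATE table1
   SET file_path  = '/var/lib/mysql-files/video_1.bin',
       video_blob = NULL
 WHERE id = 1;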

Amazon Athena - How can I exclude the metadata when create table based on query result

In Athena, I want to create a table based on a query result, but every query result consists of 2 files, ".csv" and ".csv.metadata". All of these files end up in my table, and the metadata makes the table look messy. Is there any way to ignore these ".csv.metadata" files and only show the data from the ".csv" files?
Any suggestion or code snippets will be appreciated.
Thank you.
You can exclude input files like this:
select * from your_table where "$PATH" not like '%metadata'
Adding an underscore at the beginning of the filename will cause Athena to ignore the file. For example: _ignoredfile.csv.metadata
It can't be done. From the documentation:
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
A simple workaround that may serve your needs is to create an Athena view that will filter out the "mess" in the table. You can then simply use the view instead of the table itself.
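A sketch of such a view, reusing the "$PATH" filter from above (the view name is made up, and I haven't verified that Athena accepts the "$PATH" pseudo-column inside a view definition):
CREATE OR REPLACE VIEW your_table_clean AS
SELECT *
FROM your_table
WHERE "$PATH" NOT LIKE '%metadata'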

SQL aggregate folder usage

I am trying to write a MySQL or Hive query which can help me aggregate which folders contain the most files. Suppose that I have the following folder paths.
/var/www/mysite/current/...../......
/var/www/mysite/backup/..../......
/var/www/misc/others/...../......
So basically, after aggregation the query should return which folders have the most files. For example, we should be able to look at how many files there are in /mysite and still be able to tell how many came from /mysite/current vs. how many came from /mysite/backup.
Update 1:
Table Schema
CREATE EXTERNAL TABLE hadoop_fs_images(
Path STRING,
Number_of_files DOUBLE
)
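One way to aggregate by folder prefix with this schema might be the following Hive sketch (the three-segment prefix depth is an assumption on my part):
-- group paths by their first three segments, e.g. /var/www/mysite
SELECT regexp_extract(Path, '^(/[^/]+/[^/]+/[^/]+)', 1) AS folder_prefix,
       SUM(Number_of_files)                             AS total_files
FROM hadoop_fs_images
GROUP BY regexp_extract(Path, '^(/[^/]+/[^/]+/[^/]+)', 1)
ORDER BY total_files DESC
Changing the number of path segments captured by the regular expression controls whether the counts roll up at the /mysite level or at the /mysite/current and /mysite/backup level.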

Store a sql query result in pentaho variable

I am new to PDI (coming from SSIS) and I am having some trouble handling variables.
I would like to perform this:
From a sql select query I would like to save the result into a variable.
For that reason I have created one job and two transformations, given that in Pentaho every step is executed in parallel.
The first transformation is going to be on charge of setting the variable and the second transformation is going to use this result as an input.
But in the first transformation I am having trouble setting the variable: I do not understand where I have to instantiate this variable in order to implement the "set season variable" step, and then how to get this result in the next transformation.
If anyone knows about this, or if you could recommend any link with a good example, I'd really appreciate it.
This can indeed be confusing for SSIS users. In PDI, you don't create a recordset variable as you do in SSIS. Simply creating a job creates one for you. Each job has two different types of "Results". One for recordset rows and one for filenames.
These variables are not directly accessible; they are just part of the job. There are steps that interact with them directly. For example under the "Job" branch when you're creating a transform, there is a Get rows from results step and a Copy rows to results step. They work directly with the job's row results.
Be aware that you must manually manage the metadata for the results. This is a pain, but overall I find PDI's method of doing this more intuitive and easier than SSIS, although I find SSIS more flexible in this regard.
There are also Get files from result and Set files in result. These interact with the job's built in file results. This is simply a list of every file touched by any step configured in the job. On the job tab there are tasks that deal with it directly such as Process result filenames, Add filenames to result and Delete filenames from results. These tasks operate on the built in file results list for the job and provide an easy way to, say, archive all the files loaded by the transform you just ran.
Be aware when using these steps that they record EVERY file touched by EVERY step in the job. If you look through most of the steps in transformations (data flows) that deal with files, there's usually an "Add files to results" checkbox that is checked by default. If you uncheck this, it will not add the file names to the job's file results. You can also delete specific files from the file results with the Delete filenames from result step.
From your Job, start a Transformation:
Load the transformation variable into a global variable in your job and use it:

"read.sql" function in R: Using R as an SQL browser?

I have a large .sql file, created as a backup from a MySQL database (containing several tables), and I would like to search elements within it from R.
Ideally, there would be a read.sql function that would turn the tables into some R list with data.frames in it. Is there something that comes close? If not, can RSQLite or RMySQL help? (going through the reference manuals, I don't see a simple function for what I described)
No can do, boss. For R to interpret your MySQL database file, it would have to do a large part of what the DBMS itself does. That's a tall order, infeasible in the general case.
Would this return what you seek (which I think upon review you will admit is not yet particularly well described):
require(RMySQL)
drv <- dbDriver("MySQL")
con <- dbConnect(drv)    # with no arguments, connection details come from the MySQL default options file
dbListTables(con)        # list the tables in the connected database
# Or
names(dbGetInfo(drv))    # inspect what information the driver exposes
If these are just source code then all you would need is readLines. If you are looking for an R engine that can take SQL code and produce useful results, then the sqldf package may provide some help. It parses SQL code embedded in quoted strings and applies it either to data frame objects in memory or to disk-resident tables (or both). Its default driver for disk files is SQLite, but other drivers can be used.
My workaround so far (I am also a newbie with databases) is to export the database as .csv files in phpMyAdmin (you need to tick "Export tables as separate files" in the "custom" method), and then use read_csv() on the tables I want to work with.
It is not ideal because I would like to export the database and work on it on my computer with R (creating functions that will work when accessing the online database) and only access the real database later, when I have done all my testing. But from the answers here, it seems the .sql export would not help with that anyway (?) and that I would need to recreate the db locally...