SQL aggregate folder usage - MySQL

I am trying to write a MySQL or Hive query that aggregates which folders contain the most files. Suppose I have the following folder paths:
/var/www/mysite/current/...../......
/var/www/mysite/backup/..../......
/var/www/misc/others/...../......
So, after aggregation, the query should return which folders have the most files. For example, we should be able to look at how many files there are in /mysite and still be able to tell how many came from /mysite/current vs. how many came from /mysite/backup.
Update 1:
Table Schema
CREATE EXTERNAL TABLE hadoop_fs_images(
Path STRING,
Number_of_files DOUBLE
)
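One possible approach (a sketch only, assuming Path stores absolute paths like the examples above and Number_of_files is the count for that path): truncate each path to a fixed depth with SUBSTRING_INDEX, which exists in both MySQL and Hive, and sum the counts per prefix.

-- Sketch: roll up file counts by the third and fourth path components,
-- so the /var/www/mysite total can still be split into /current vs. /backup.
SELECT
  SUBSTRING_INDEX(Path, '/', 4) AS top_folder,   -- e.g. /var/www/mysite
  SUBSTRING_INDEX(Path, '/', 5) AS sub_folder,   -- e.g. /var/www/mysite/current
  SUM(Number_of_files)          AS total_files
FROM hadoop_fs_images
GROUP BY SUBSTRING_INDEX(Path, '/', 4), SUBSTRING_INDEX(Path, '/', 5)
ORDER BY total_files DESC;

Grouping on both prefixes keeps the /mysite/current vs. /mysite/backup split visible; summing the result again over top_folder (or adding WITH ROLLUP, which both engines support) gives the per-/mysite totals.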

Related

Use of wildcards in external table definition

Thanks for reading! I would like to define an external table on a storage account where the path format is as follows:
flowevents/resourceId=/SUBSCRIPTIONS/<unique>/RESOURCEGROUPS/<unique>/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/<unique>/y=2022/m=05/d=11/h=09/m=00/<unique>/datafiles
I would like to partition the external table by date. The relevant documentation for this is located here. My understanding and experimentation indicate that this might not be possible, given that the URI path above contains unique values before the values I would like to partition on, and given the answer by Slavik here.
Is it possible to create an external table using wildcards to traverse the folders to achieve the partition scheme described above?
Is the only way to solve this to define multiple storage connection strings for all possible values of <unique>? Is there an upper limit to how many values may be provided?
The path traversal functionality I'm looking for can be found in LightIngest:
-prefix:resourceId=/SUBSCRIPTIONS/00-00C-00-00-00/RESOURCEGROUPS/ -pattern:*/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/*/y=2021/m=11/d=10/*.json
It does not seem to be supported when defining external tables. A possible reason is that the engine would get overloaded if you load too many files from external storage. I got the following error message when I defined 50 connection strings:
Partial query failure: Input stream/record/field too large (E_INPUT_STREAM_TOO_LARGE). (message: '', details: '')
It worked as intended when I provided 30 connection strings and used four virtual columns for partitioning. This error message is not described in the documentation, by the way.
Update, for Kusto developers: I attempted to use virtual columns for the whole URI path and then query them to generate the connection string. I verified that the table definition is correct using:
.show external table X artifacts limit 1
It would show the partitions with populated values. However, when attempting to query the external table using the recommended operators ("in" or "has") to navigate it, the query does not work: it runs forever despite fetching a small file and running on a cluster of D14_v2 VMs. If I were to define an external table just for that file, it would load just fine.

WHERE filename in Apache Drill does a full scan of all files

select distinct filename from dfs.contoso.`folder/CSVs/`
> 2021-01.csv
> 2021-02.csv
> ...
or
select count(*) as cnt from dfs.contoso.`folder/CSVs/`
where filename = '2021-01.csv'
> 4562751239
The problem is that both of these queries take AN HOUR. From the plan it is obvious that Drill goes through ALL files in the destination folder and only THEN filters the data by filename. That's absolutely unusable for bigger datasets.
Unfortunately, I cannot change the data structure, and I cannot put a single file in the FROM clause (FROM dfs.contoso.`folder/CSVs/2021-01.csv`) because at that point Drill does not use the created CSV schema, which I need.
Is there any reason why Drill does this?
How can we do it effectively?
Drill 1.19
UPDATE
The main problem is not enumerating the files in a folder, but reading data from a single file among many in a directory.
Having this filesystem:
CsvHistory/2019-01.csv [2GB]
CsvHistory/2019-02.csv [3GB]
...
CsvHistory/2021-09.csv [6GB]
We needed to query one file directly, without reading the others in the folder and without changing the filesystem structure, since that is not allowed.
We needed this query not to traverse all the other files, because that is a huge waste of performance.
I'm sorry you gave up on Drill, but I'm going to post this for anyone else who might be reading this.
You do have to understand a bit about how Drill handles schemas. Firstly, Drill attempts to infer the schema from the underlying data. For the queries listed above, it looks like you are trying to find the file names in a given directory and count rows in each file. Neither of these requires a schema at all. As I mentioned, you should use the INFORMATION_SCHEMA to query directories or a SHOW FILES IN <dir> query for that.
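For example, listing what is in a directory without touching the file contents can be as simple as this (a sketch reusing the dfs.contoso workspace from the question):

SHOW FILES IN dfs.contoso.`folder/CSVs`;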
Providing a Schema
If the schema Drill infers isn't cutting it for you, you can provide a schema to Drill either at query time or by running a CREATE SCHEMA query, which creates a hidden schema file. Here is a link to the docs for that functionality: https://drill.apache.org/docs/create-or-replace-schema/. I've not used this functionality extensively, but I do know that you can certainly provide a schema file for single files. I'm not sure about entire directories, but I believe it was meant to do that.
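As a rough sketch of the persistent variant (based on the docs linked above; the workspace, directory and column names here are illustrative, not taken from your setup):

-- Stores a hidden schema file in the table's root directory.
CREATE OR REPLACE SCHEMA (`Year` INT, `Make` VARCHAR) FOR TABLE dfs.test.`csv_dir`;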
The documentation for the inline schema is a bit lacking but you can also do this at query time as follows:
SELECT Year
FROM table(dfs.test.`test.csvh` (schema=>'inline=(`Year` int)'))
WHERE Make = 'Ford'
The result here would be that the Year column would be interpreted as an INT rather than the default VARCHAR.
UPDATE: It turns out that you CAN also provide a schema file path in the table() function above. See below:
SELECT Year
FROM table(dfs.test.`test.csvh` (schema => 'path=`/path/to/schema`'))
WHERE Make = 'Ford'
Possibly Drill shuffles the filenames together with the records in them.
You might move each file into its own directory and then GROUP BY dir0:
# hadoop fs -mv /folder/CSVs/2021-01.csv /folder/CSVs/2021-01/2021-01.csv
sql> select dir0 as fileprefix,count(1) from dfs.contoso.`/folder/CSVs/` group by dir0
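With that layout, a filter on dir0 should let Drill prune to a single subdirectory instead of scanning every file (a sketch using the same paths):

sql> select count(1) as cnt from dfs.contoso.`/folder/CSVs/` where dir0 = '2021-01'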
I'm a little unclear as to what you're trying to do.
If you are querying an individual file, you should not filter on the file name in the WHERE clause as you are doing. As you've noted, Drill will do a full recursive directory scan and only then open the file you requested. The better way to query an individual file is to specify the file name in the FROM clause, as shown below.
SELECT count(*) as cnt
FROM dfs.contoso.`folder/CSVs/2021-01.csv`
You can also use globs and wildcards in the file path. You might also want to look at the Drill docs for more info about efficiently querying directories.
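For example, a glob that limits the scan to one year's files could look like this (a sketch built on the directory names from the question; I have not checked how it interacts with the provided CSV schema you rely on):

SELECT count(*) as cnt
FROM dfs.contoso.`folder/CSVs/2021-*.csv`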
If you are looking to explore what files are present in various directories, you'll want to use the INFORMATION_SCHEMA for that. Take a look at the documentation, in particular the section about files.
For instance:
SELECT *
FROM information_schema.`files`
WHERE schema_name = 'dfs.contoso'

Best practice for deleting multiple rows across tables in a db

I'm wondering what best practice is when deleting multiple rows and items from a database. I'm working in C# and trying to make the process of deleting items faster, so I want to move this action into a stored procedure rather than calling the database for each item.
My setup is similar to a file system, so I have Folders, SubFolders and Files. Files can be of different types (e.g. XML, TXT, PDF) and certain File types have other attributes stored in different tables connected with the Id of the file. Files are connected to Folders and SubFolders in specific File_To_Folder and File_To_SubFolder tables.
Now, I want to create a script for deleting a Folder, which has to delete from all of these different tables as well. Currently I have stored procedures in place for deleting a single File, which delete all items connected to it. My question is: what are the advantages/disadvantages of
a) using a cursor to loop through all items in the folder and call sp_deletePDF/sp_deleteTXT for each item.
or
b) creating a custom script that figures out all the rows that need to be deleted from each table on its own and issues a single DELETE FROM {table} WHERE id IN (id1, id2, id3) per table (roughly as sketched below)?
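For illustration, option (b) might look roughly like the sketch below as a stored procedure. File_To_Folder is taken from the question; Files, PdfAttributes, TxtAttributes, FileId and FolderId are hypothetical names standing in for your actual schema.

CREATE PROCEDURE sp_DeleteFolder @FolderId INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Collect the ids of every file in the folder once.
    SELECT FileId
    INTO #FilesToDelete
    FROM File_To_Folder
    WHERE FolderId = @FolderId;

    -- One set-based DELETE per table, children before parents.
    DELETE FROM PdfAttributes WHERE FileId IN (SELECT FileId FROM #FilesToDelete);
    DELETE FROM TxtAttributes WHERE FileId IN (SELECT FileId FROM #FilesToDelete);
    DELETE FROM File_To_Folder WHERE FolderId = @FolderId;
    DELETE FROM Files WHERE FileId IN (SELECT FileId FROM #FilesToDelete);

    DROP TABLE #FilesToDelete;
END

Compared with the cursor in (a), this touches each table once per folder instead of once per file, which is usually the main argument for the set-based version.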

Amazon Athena - How can I exclude the metadata when creating a table based on a query result

In Athena, I want to create a table based on a query result, but every query result consists of 2 files: ".csv" and ".csv.metadata". All these files end up in my table, and the metadata makes the table look messy. Is there any way to ignore the ".csv.metadata" files and only show the data from the ".csv" files?
Any suggestion or code snippets will be appreciated.
Thank you.
You can exclude input files like this:
select * from your_table where "$path" not like '%metadata'
Adding an underscore at the beginning of the filename will cause Athena to ignore the file. For example: _ignoredfile.csv.metadata
It can't be done. From the documentation:
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.
A simple workaround that may serve your needs is to create an Athena view that filters out the "mess" in the table. You can then simply use the view instead of the table itself.
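Concretely, such a view might look like the sketch below, reusing the "$path" filter from the earlier answer (I have not verified that Athena accepts the hidden "$path" column inside a view definition, so treat that as an assumption; your_table is a placeholder):

CREATE OR REPLACE VIEW your_table_clean AS
SELECT *
FROM your_table
WHERE "$path" NOT LIKE '%.metadata';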

How do I write results of a MySQL query to a text file inside a loop?

I have a large database of individuals (say 1000 individuals/ 5,000 records per individual). I'd like to write a couple of fields for each individual (in this example let's say lat and long) to a text file (preferably comma separated).
The algorithm would look like this:
foo=select distinct (id) from <table-name>;
for each id in foo
{
smaller_result= select lat,long from <table-name> where id=$id;
write smaller_result to a text file with unique name (e.g. id.txt);
}
I can easily code this up in PHP (which I frequently use to interface with MySQL when I cannot run a command-line SQL query directly). However, in this case I need to share the code with a collaborator and have him run it, and he does not have and cannot install PHP. Also, the database is quite large and cannot easily be uploaded online (which would let me run the query through the web). So how else would I accomplish this?
a) Can this algorithm be written as a SQL query that can be executed from the command line?
b) If not, can this be written in python such that my collaborator would just run a .py file?
We are both on OSX (Lion) and can access mysql and python from our shell / terminal.
You can output the result of any SELECT query with SELECT ... INTO OUTFILE. This will generate a text file with the output (on the server).
So you need to create a single SELECT query that generates the file. I think in your case it can easily be done with a subquery; if not, you can create a stored procedure with a cursor that loops over your 'foo' result set and appends it to a temp table, and then use SELECT ... INTO OUTFILE with that table.
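A minimal sketch of the per-id version for one hard-coded id (my_table stands in for <table-name>, the output path is a placeholder, INTO OUTFILE writes on the MySQL server host, and the MySQL user needs the FILE privilege):

-- Export lat/long for a single id as comma-separated text on the server.
SELECT lat, `long`
INTO OUTFILE '/tmp/42.txt'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
FROM my_table
WHERE id = 42;

As far as I know, INTO OUTFILE needs a literal filename, so a cursor-based procedure that writes one file per id would have to build each statement dynamically with PREPARE/EXECUTE.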