Amazon Athena - How can I exclude the metadata when creating a table based on query results?

In Athena, I want to create a table based on a query result, but every query result consists of 2 files: ".csv" and ".csv.metadata". All of these files end up in my table, and the metadata makes the table look messy. Is there any way to ignore the ".csv.metadata" files and only show the data from the ".csv" files?
Any suggestion or code snippets will be appreciated.
Thank you.

You can exclude input files like this:
select * from your_table where "$PATH" not like '%metadata'

Adding an underscore at the beginning of the filename will cause Athena to ignore the file. For example: _ignoredfile.csv.metadata

It can't be done. From the documentation:
Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.

A simple workaround that may serve your needs is to create an Athena view that filters out the "mess" in the table. You can then simply use the view instead of the table itself.
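For example, a minimal sketch, assuming the table over the query-result location is called your_table; Athena's "$path" pseudo-column lets the view hide the metadata files:
-- keep only rows read from .csv files, hiding the .csv.metadata files
CREATE OR REPLACE VIEW your_table_clean AS
SELECT * FROM your_table
WHERE "$path" NOT LIKE '%.metadata'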

Related

WHERE filename in Apache Drill does a full scan of all files

select distinct filename from dfs.contoso.`folder/CSVs/`
> 2021-01.csv
> 2021-02.csv
> ...
or
select count(*) as cnt from dfs.contoso.`folder/CSVs/`
where filename = '2021-01.csv'
> 4562751239
The problem is that both of these queries take AN HOUR. From the plan it is obvious that Drill goes through ALL the files in the destination folder and ONLY THEN filters the data by filename. That's absolutely unusable for bigger datasets.
Unfortunately, I cannot change the data structure, and I cannot put a single file in the FROM clause (from dfs.contoso.`folder/CSVs/2021-01.csv`) because at that point Drill does not use the created CSV schema, which I need.
Is there any reason why Drill does this?
How can we do it effectively?
Drill 1.19
UPDATE
The main problem is not enumerating the files in a folder but reading data from a single file among many in a directory.
Having this filesystem:
CsvHistory/2019-01.csv [2GB]
CsvHistory/2019-02.csv [3GB]
...
CsvHistory/2021-09.csv [6GB]
We needed to do a query directly from one file without reading the others from the folder and without changing the filesystem structure since it's not allowed.
We needed this query not to traverse all the other files because that is a huge waste of performance.
I'm sorry you gave up on Drill, but I'm going to post this for anyone else who might be reading this.
You do have to understand a bit about how Drill handles schemas. Firstly, Drill attempts to infer the schema from the underlying data. For the queries listed above, it looks like you are trying to find the file names in a given directory and count rows in each file. Neither of these requires a schema at all. As I mentioned, you should use the INFORMATION_SCHEMA to query directories or a SHOW FILES IN <dir> query for that.
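For example, something along these lines (assuming the dfs.contoso workspace from the question) lists the files without scanning their contents:
-- metadata-only listing; no data files are read
SHOW FILES IN dfs.contoso.`folder/CSVs`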
Providing a Schema
If the schema Drill infers isn't cutting it for you, you can provide a schema to Drill either at query time or by running a CREATE SCHEMA query which will create a hidden schema file. Here is a link to the docs for that functionality: https://drill.apache.org/docs/create-or-replace-schema/. I've not used this functionality extensively, but I do know that you can certainly provide a schema file for single files. Not sure for entire directories but I believe it was meant to do that.
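For example, a rough sketch following the linked docs; the column list is illustrative, and you should check the docs for the exact behaviour on single files vs. directories:
-- writes a hidden schema file that Drill picks up on subsequent reads of this table
CREATE OR REPLACE SCHEMA (`Year` INT, `Make` VARCHAR(20))
FOR TABLE dfs.contoso.`folder/CSVs`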
The documentation for the inline schema is a bit lacking but you can also do this at query time as follows:
SELECT Year
FROM table(dfs.test.`test.csvh` (schema=>'inline=(`Year` int)'))
WHERE Make = 'Ford'
The result here would be that the Year column would be interpreted as an INT rather than the default VARCHAR.
UPDATE: It turns out that you CAN also provide a schema file in the table() function above. See below:
SELECT Year
FROM table(dfs.test.`test.csvh` (schema => 'path=`/path/to/schema`'))
WHERE Make = 'Ford'
Possibly Drill shuffles the filenames together with the records in them.
You might move each file into its own directory and group by dir0:
# hadoop fs -mv /folder/CSVs/2021-01.csv /folder/CSVs/2021-01/2021-01.csv
sql> select dir0 as fileprefix,count(1) from dfs.contoso.`/folder/CSVs/` group by dir0
I'm a little unclear as to what you're trying to do.
If you are querying an individual file, you should not filter on the file name in the WHERE clause as you are doing. As you've noted, Drill will do a full recursive directory scan and only then filter down to the file you requested. The better way to query an individual file is to specify the file name in the FROM clause, as shown below.
SELECT count(*) as cnt
FROM dfs.contoso.`folder/CSVs/2021-01.csv`
You can also use globs and wildcards in the file path. You might also want to look at the Drill docs for some more info about efficiently querying directories.
If you are looking to explore what files are present in various directories, you'll want to use the INFORMATION_SCHEMA for that; take a look at the INFORMATION_SCHEMA docs, particularly the section about files.
For instance:
SELECT *
FROM information_schema.`files`
WHERE schema_name = 'dfs.contoso'

Hive - external tables and csv data

I need some help with understanding how Hive references data. The situation is as follows: I have a CSV file data.csv imported into Hadoop. I have found many snippets that use an external table to create a schema on top of the CSV file. My question is: how does Hive know that the schema of the external table is connected to data.csv? In the examples I cannot find a reference to the CSV file.
Where is sample_1.csv referenced in this Hive example, i.e. how does Hive know that the table's data comes from sample_1.csv?
While creating an external table we have to give the list of columns and the HDFS location. Hive stores only the column metadata (column name, datatype, etc.) and the HDFS location.
When we execute a query on the external table, it fetches that metadata and then reads whatever files are available at the HDFS location.
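For example, a minimal sketch (table name, columns, and path are illustrative); Hive never references data.csv by name, it simply reads every file it finds under LOCATION at query time:
-- the LOCATION directory is what ties the table to data.csv
CREATE EXTERNAL TABLE sample_data (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/sample_data';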
Now we've got the answer. The manual recommends storing one file per directory. When we then build an external table on top, the data is identified via the schema.
In my test case I imported 3 CSV files under one schema: 2 files matched the schema, and the third file had one extra column. If I run a query, the data of all three files is shown, but the additional column from the third file is missing.
Everything is fine now - thank you!

Redshift/S3 - Copy the contents of a Redshift table to S3 as JSON?

It's straightforward to copy JSON data on S3 into a Redshift table using the standard Redshift COPY command.
However, I'm also looking for the inverse operation: to copy the data contained within an existing Redshift table to JSON that is stored in S3, so that a subsequent Redshift COPY command can recreate the Redshift table exactly as it was originally.
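For reference, the load direction I mean is roughly this (table, bucket, and role names are placeholders):
-- load JSON files from S3 back into the table
COPY my_table
FROM 's3://my-bucket/json/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS JSON 'auto';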
I know about the Redshift UNLOAD command, but it doesn't seem to offer any option to store the data in S3 directly in JSON format.
I know that I can write per-table utilities to parse and reformat the output of UNLOAD for each table, but I'm looking for a generic solution which allows me to do this Redshift-to-S3-JSON extract on any specified Redshift table.
I couldn't find any existing utilities that will do this. Did I miss something?
Thank you in advance.
I think the only way is to UNLOAD to CSV and write a simple Lambda function that turns the CSV into JSON, taking the CSV header row as the keys and each data row as the values.
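The unload half might look roughly like this; the bucket, role, and the HEADER/PARALLEL options are assumptions about what the Lambda would need:
-- one quoted, comma-delimited file with a header row for the Lambda to key off
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER ',' ADDQUOTES HEADER
PARALLEL OFF;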
There is no built-in way to do this yet, so you might have to hack your query with some hardcoding:
https://sikandar89dubey.wordpress.com/2015/12/23/how-to-dump-data-from-redshift-to-json/
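The "hardcoding" would be something along these lines (a sketch; the columns are illustrative and embedded quotes in the data are not handled):
-- build a JSON document per row by string concatenation, then unload the resulting text
UNLOAD ('SELECT ''{"id": '' || CAST(id AS VARCHAR) || '', "name": "'' || name || ''"}'' FROM my_table')
TO 's3://my-bucket/my_table_json_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';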

Partitioning tables in hive

I am using Hive to load some text files from S3. Currently, the structure is as follows:
bucket/dir/id/text_files
The issue is that the <id> directory does not have the 'user=id' format that Hive seems to like for loading partitions. Typically, if the directory was bucket/dir/user=id, I could just do this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
data STRING
) PARTITIONED BY (user STRING)
LOCATION 'bucket/dir';
However, since I don't have the correct format for the partition directories, how would I go about doing the same thing? That is, I want a partition named user whose value is the id that is already in the path.
Thank you for the help.
I hope this works for you.
load data inpath 'bucket/dir/user=id' overwrite into table table1 partition(user='id');

Can I import tab-separated files into MySQL without creating database tables first?

As the title says: I've got a bunch of tab-separated text files containing data.
I know that if I use 'CREATE TABLE' statements to set up all the tables manually, I can then import the files into the waiting tables using 'load data' or 'mysqlimport'.
But is there any way in MySQL to create tables automatically based on the tab files? Seems like there ought to be. (I know that MySQL might have to guess the data type of each column, but you could specify that in the first row of the tab files.)
No, there isn't. You need to CREATE a TABLE first in any case.
Automatically creating tables and guessing field types is not part of the DBMS's job. That is a task best left to an external tool or application (That then creates the necessary CREATE statements).
If you're willing to type the data types in the first row, why not type a proper CREATE TABLE statement?
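For example (the column names and types are illustrative):
-- define the table by hand once, then load every tab-separated file into it
CREATE TABLE your_table (
  id INT,
  name VARCHAR(100),
  amount DECIMAL(10,2)
);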
Then you can export the data as a tab-separated text file and use
LOAD DATA INFILE 'path/file.txt' INTO TABLE your_table;