I've got 70,000+ CSV files in an S3 bucket. They all have the same headers. I would like to combine the files into one CSV, which I want to download onto my machine.
Using AWS Athena, I seem to be most of the way there. I have created a database from the S3 bucket. I can then run queries like this:
select * from my_table_name limit 100
And see the results of the query (which in my case combines many CSVs from S3) in the Athena console.
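(For context, the table behind a query like this can be defined directly over the CSV prefix in S3. A minimal sketch, with placeholder column names, bucket, and prefix:)
-- One external table over every CSV object under the prefix; all files share these columns
CREATE EXTERNAL TABLE my_table_name (
  col1 string,
  col2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/csv-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');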
However, when I go to "Download results" for that query, I can't open the CSV in Excel (or a text editor).
Doing
file -b my_table_name.csv
returns data, i.e. no recognized file type.
I'm confused because I can visually see the results of my Athena query but can't download them in a usable file format. Am I missing something obvious for how to download this data? Why isn't it giving me a normal (perhaps UTF-8) CSV?
In the Athena settings, I had encryption of query results turned on. Turning that off solved it.
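(With result encryption off, the query output is written as a plain CSV object to the configured S3 result location, so besides the "Download results" button it can also be copied down with the AWS CLI; the bucket, prefix, and query ID below are placeholders:)
# Copy the Athena result object to the local machine (placeholder bucket/prefix/query ID)
aws s3 cp s3://my-athena-results-bucket/some-prefix/query-execution-id.csv ./combined.csv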
What I have tried
As you can see here, this is my configuration for trying to save the query output to S3 as CSV. The query was successful (from what I've checked in the logs), but it doesn't seem to create any file at the S3 output location.
Main Problem
I know about CopyActivity and I tried it successfully, but the problem is that my query is so long that it hits the 10,240-character limit. From what I've read, you can save the SQL file to S3 and then use SqlActivity with a Script URI pointing to the SQL file saved on S3. But it seems that only CopyActivity can create and save a CSV as output on S3.
Questions
Is there any workaround for a long query so I can still use CopyActivity?
Is there a way to use SqlActivity and create a CSV file on S3?
Is breaking down the SQL query the only solution?
I imported a text file from GCS, did some preparation using Dataprep, and wrote it back to GCS as a CSV file. What I want to do is repeat this for all the text files in that bucket. Is there a way to do this for all the files in that bucket (in GCS) at once?
Below is my procedure. I selected a text file from GCS (I can't select more than one text file) and did some preparation (renaming columns, creating new columns, etc.), then wrote it back to GCS as CSV.
You can use the Dataset with parameters feature to load several files at once.
You can then use a wildcard to select all the files that you want to load.
Note that all the files need to have the same schema (same columns) for this to work.
See https://cloud.google.com/dataprep/docs/html/Create-Dataset-with-Parameters_118228628 for more information on how to use this feature.
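For example, a parameterized path with a wildcard could look like this (bucket and folder names are placeholders):
gs://your-bucket/input-files/*.txt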
Another solution is to add all the files to a folder* and use the large + button to load all the files in that folder.
[*] technically under the same prefix on GCS
I have an AWS Athena service in place.
After the query, Athena generates a CSV file.
Let's say I want to see the following result (with headers) when I open that CSV in Excel or Google Sheets.
Lines 7 to 13 are fine; that is the actual result from Athena.
I want to add a header (like in the picture).
How do I accomplish that?
It isn't possible, because Athena doesn't support that, and it also isn't compatible with the CSV format. If you want to add "headers" you can use a workaround with "union all", but it will not give you exactly the result you expect.
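A rough sketch of that workaround, with placeholder table and column names (every column has to be cast to a string so the header row and the data rows have matching types, and the row order of a plain UNION ALL is not guaranteed):
-- Literal header row unioned with the data; everything cast to varchar
SELECT 'id' AS col1, 'name' AS col2
UNION ALL
SELECT CAST(id AS varchar), name
FROM my_table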
About CSV format:
https://en.wikipedia.org/wiki/Comma-separated_values
I currently have a CSV file with 200k lines that look like this
id,path,username,folderid
32423423424,asfasf-232-3,cooluser,234324-234-34324-424
When the Glue crawler finishes, it does say it created the table and I can see the table details. When I try to preview the data in Athena, it returns zero records. The CSV file is stored in an S3 bucket, and all permissions are correct.
Thanks.
Try keeping the file inside a dedicated folder and pointing the crawler at the folder. I think this will work better than pointing it at a particular file.
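For example, something along these lines (bucket and prefix names are placeholders); the crawler's include path would then be the folder rather than the individual object:
# Put the CSV under its own prefix, then point the crawler at s3://your-bucket/csv-data/
aws s3 cp s3://your-bucket/my_file.csv s3://your-bucket/csv-data/my_file.csv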
I have an idea of how to extract table data to Cloud Storage using the bq extract command, but I would rather like to know whether there are any options to extract a BigQuery table as newline-delimited JSON to my local machine.
I could extract table data to GCS via the CLI and also download JSON data from the web UI, but I am looking for a solution using the bq CLI to download table data as JSON to my local machine. Is that even possible?
You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for the different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the file from GCS to your local machine.
You first need to export to GCS, then transfer the files to your local machine.
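A rough sketch of the two steps (dataset, table, and bucket names are placeholders):
# Export the table from BigQuery to GCS as newline-delimited JSON
bq extract --destination_format NEWLINE_DELIMITED_JSON mydataset.mytable gs://your-bucket/export/mytable-*.json
# Download the exported files from GCS to the local machine
gsutil cp gs://your-bucket/export/mytable-*.json .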
If you use the bq CLI tool, you can set the output format to JSON and redirect the output to a file. This way you can achieve a local export of sorts, but it has certain limits.
This exports the first 1,000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
It's possible to extract data without using GCS, directly to your local machine, using the bq CLI.
Please see my other answer for details: BigQuery Table Data Export