Parsing CSV in Athena by column names

I'm trying to create an external table based on CSV files. My problem is that not all the CSV files are the same (some of them have missing columns) and the order of the columns is not always the same.
The question is whether I can make Athena parse the columns by name instead of by their order.

No, Athena cannot parse the columns by name instead of by their order. The data must be in exactly the same order as defined in your table schema. You will need to preprocess your CSVs and change the column order before writing them to S3 (a minimal sketch of this follows the quoted documentation below).
Quoting from the AWS Athena documentation:
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected on to your data at the time you execute a query. This eliminates the need for data loading or transformation.
When you create a database and table in Athena, you are simply describing the schema and the location where the table data are located in Amazon S3 for read-time querying. Database and table, therefore, have a slightly different meaning than they do for traditional relational database systems because the data isn't stored along with the schema definition for the database and table.
Reference: Tables and Databases in Athena
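As a minimal sketch of the preprocessing step suggested above (assuming Python with pandas; the column list, paths, and file layout are placeholders, not part of the original question), you could normalize every CSV to one canonical column order before uploading:

import glob
import pandas as pd

# Canonical column order that must match the Athena table schema (placeholder names).
SCHEMA_COLUMNS = ["id", "name", "price", "created_at"]

for path in glob.glob("raw_csvs/*.csv"):
    df = pd.read_csv(path)
    # reindex() reorders existing columns and adds any missing ones as empty values,
    # so every output file matches the order declared in the table DDL.
    df = df.reindex(columns=SCHEMA_COLUMNS)
    df.to_csv("normalized/" + path.split("/")[-1], index=False)

# Upload the files in normalized/ to the table's S3 location afterwards,
# e.g. with the AWS CLI or boto3.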

Related

Is it possible to use sqlalchemy to reflect table and change data type of column from string to datetime?

I have a web application where users can upload CSVs. I use Python Pandas to actually do the upload. I have to give the users the ability to change the database table's column types, such as from strings to datetimes. Is there a way to do this in SQLAlchemy? I'm working with reflected tables, so I have Table objects with all their columns, but I have a feeling that SQLAlchemy does not have this capability and that I will have to execute raw SQL to do this.
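One possible approach, sketched below under the assumption that raw DDL is indeed required (the connection URL, table, and column names here are hypothetical): run an ALTER TABLE through the engine and then re-reflect so the Table object sees the new type.

from sqlalchemy import MetaData, Table, create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/mydb")  # placeholder URL

# Reflected Table objects only describe the schema; changing a column's type
# is DDL, so issue it as raw SQL (MySQL syntax shown).
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE uploads MODIFY COLUMN uploaded_at DATETIME"))

# Re-reflect so the in-memory Table picks up the new column type.
metadata = MetaData()
uploads = Table("uploads", metadata, autoload_with=engine)
print(uploads.c.uploaded_at.type)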

Accessing large nested JSON data set from Hive

We have large nested JSON data coming into a Kafka topic; it has close to 3000 attributes. There is a business requirement to make all 3000 attributes available for querying in a Hive table.
I am planning to flatten it out and write it to HDFS or a Hive table using Spark SQL (a rough sketch of this follows the questions below).
Questions
Is it a feasible requirement to have a business-facing Hive table with ~3000 attributes?
How can I ensure query performance on this large table (~3000 attributes)?
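A minimal sketch of the flattening step described above, assuming PySpark (the paths, table name, and helper function are illustrative only; arrays would additionally need explode()):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.json("hdfs:///landing/events/")  # placeholder input path

def flatten(schema, prefix=""):
    # Recursively collect "parent.child" paths so nested structs become flat columns.
    paths = []
    for f in schema.fields:
        name = prefix + f.name
        if isinstance(f.dataType, StructType):
            paths += flatten(f.dataType, name + ".")
        else:
            paths.append(name)
    return paths

flat_df = df.select([col(p).alias(p.replace(".", "_")) for p in flatten(df.schema)])
flat_df.write.mode("overwrite").saveAsTable("events_flat")  # placeholder Hive table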

How can I import nested json data into multiple connected redshift subtables?

I have server log data that looks something like this:
2014-04-16 00:01:31-0400,583 {"Items": [
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]}
What I'd like is a relational database that connects two tables - a UsageLog table and a UsageLogItems table, connected by a primary key id.
You can see that the UsageLog table would have fields like:
UsageLogId
Date
Time
and the UsageLogItems table would have fields like
UsageLogId
UsageInfo
Role
RetailPrice
...
However, I am having trouble writing these into Redshift and being able to associate each record with unique and related ids as keys.
What I am currently doing: a Ruby script reads each line of the log file, parses out the UsageLog info (such as the date and time), and writes it to the database (writing single lines to Redshift is VERY slow). It then builds a CSV of the UsageLogItems data and imports that into Redshift via S3, querying the largest id of the UsageLogs table and using that number to relate the two (this is also slow, because many UsageLogs do not contain any items, so I frequently load 0 records from the CSV files).
This currently does work, but it is far too painfully slow to be effective at all. Is there a better way to handle this?
Amazon Redshift supports JSON ingestion using JSONPaths via the COPY command:
http://docs.aws.amazon.com/redshift/latest/dg/copy-usage_notes-copy-from-json.html
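A rough sketch of the bulk-load side (assuming Python with psycopg2; the cluster endpoint, bucket, IAM role, and table/column names are placeholders, and assigning the ids that link UsageLog and UsageLogItems still has to happen before the files are written to S3):

import psycopg2

# Contents of the JSONPaths file uploaded to S3 (shown here for reference):
# {"jsonpaths": ["$.UsageLogId", "$.UsageInfo", "$.Role", "$.RetailPrice", "$.EffectivePrice"]}

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="logs", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    # COPY loads many rows per call, avoiding the slow single-row inserts described above.
    cur.execute("""
        COPY usage_log_items
        FROM 's3://my-bucket/usage-items/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        JSON 's3://my-bucket/jsonpaths/usage_log_items.json';
    """)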

infer table structure from file in MySql

Another posting said there is a way to infer the table columns from a data file using phpMyAdmin. I haven't found documentation on this; can you point me to it? Does it only use the header row, or does it also sample the data to infer the data type?
I'm trying to create several tables in MySQL from data files, which have roughly 100 columns each, so I don't want to write the SQL DDL to create the tables manually.
Thanks!

Can I import tab-separated files into MySQL without creating database tables first?

As the title says: I've got a bunch of tab-separated text files containing data.
I know that if I use 'CREATE TABLE' statements to set up all the tables manually, I can then import them into the waiting tables, using 'load data' or 'mysqlimport'.
But is there any way in MySQL to create tables automatically based on the tab files? Seems like there ought to be. (I know that MySQL might have to guess the data type of each column, but you could specify that in the first row of the tab files.)
No, there isn't. You need to CREATE a TABLE first in any case.
Automatically creating tables and guessing field types is not part of the DBMS's job. That is a task best left to an external tool or application (which then creates the necessary CREATE statements); a rough sketch of such a tool follows the LOAD DATA example below.
If you're willing to type the data types in the first row, why not type a proper CREATE TABLE statement?
Then you can export the Excel data as a txt file and use
LOAD DATA INFILE 'path/file.txt' INTO TABLE your_table;
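As an illustration of the kind of external tool mentioned above (a sketch assuming Python with pandas; the type mapping and file names are placeholders and would need adjusting):

import pandas as pd

# Rough mapping from inferred pandas dtypes to MySQL column types.
TYPE_MAP = {"int64": "BIGINT", "float64": "DOUBLE", "bool": "TINYINT(1)",
            "datetime64[ns]": "DATETIME", "object": "TEXT"}

def create_table_sql(path, table_name):
    # Sample the file so type inference looks at real data, not just the header row.
    df = pd.read_csv(path, sep="\t", nrows=1000)
    cols = ", ".join(
        "`%s` %s" % (name, TYPE_MAP.get(str(dtype), "TEXT")) for name, dtype in df.dtypes.items()
    )
    return "CREATE TABLE `%s` (%s);" % (table_name, cols)

print(create_table_sql("file.txt", "your_table"))
# Run the printed statement, then load the data with:
#   LOAD DATA INFILE 'path/file.txt' INTO TABLE your_table IGNORE 1 LINES;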