I am trying to create an external Impala table over a CSV file on HDFS and it keeps failing. I am not sure what is wrong here, as I have followed the guide and the CSV file is already on HDFS.
CREATE EXTERNAL TABLE gc_imp
(
asd INT,
full_name STRING,
sd_fd_date STRING,
ret INT,
ftyu INT,
qwer INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/Gc_4';
This is the error I am getting (I am using Hue):
> TExecuteStatementResp(status=TStatus(errorCode=None,
> errorMessage='MetaException: hdfs://nameservice1/user/hadoop/Gc_4 is
> not a directory or unable to create one', sqlState='HY000',
> infoMessages=None, statusCode=3), operationHandle=None)
Any leads?
/user/hadoop/Gc_4 must be a directory, not a file. So you need to create a directory, for example /user/hadoop/Gc_4, and then upload your Gc_4 file into it, so that the file path becomes /user/hadoop/Gc_4/Gc_4. After that, you can point LOCATION at the directory path /user/hadoop/Gc_4.
LOCATION must always be a directory; this requirement is the same in Hive and Impala.
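For reference, a minimal sketch of what the corrected statement could look like once the file sits inside that directory (column names taken from the question; this is an illustration, not necessarily the exact DDL you need):
-- Assumes the CSV file now lives at /user/hadoop/Gc_4/Gc_4
CREATE EXTERNAL TABLE gc_imp
(
  asd        INT,
  full_name  STRING,
  sd_fd_date STRING,
  ret        INT,
  ftyu       INT,
  qwer       INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/Gc_4';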
This is not a real answer, but a workaround.
In most cases I have seen, the table was actually created even though the reported "status" was not successful.
Also, if you have loaded the data with the help of Hive (which gives you more control), don't forget to refresh the metadata on the Impala side. That is very important.
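As a sketch of what that refresh looks like from the Impala side (run in impala-shell or the Hue Impala editor; the table name is taken from the question above):
-- Make Impala pick up a table created or populated through Hive
INVALIDATE METADATA gc_imp;
-- Or, if Impala already knows about the table and only the data files changed:
REFRESH gc_imp;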
Related
I am trying to load csv files into a Hive table. I need to have it done through HDFS.
My end goal is to have the Hive table also connected to Impala, which I can then load into Power BI, but I am having trouble getting the Hive tables to populate.
I create a table in the Hive query editor using the following code:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp TIMESTAMP COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value DOUBLE COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
Then I check and see the LOCATION using the following code:
SHOW CREATE TABLE dbname.table_name;
and find that it has gone to the default location:
hdfs://our_company/user/hive/warehouse/dbname.db/table_name
So I go to the above location in HDFS and upload a few csv files manually, all in the same five-column format as the table I created. This is where I expect the data to be loaded into the Hive table, but when I go back to dbname in Hive and open up the table I made, all values are still null, and when I try to open it in the browser I get:
DB Error
AnalysisException: Could not resolve path: 'dbname.table_name'
Then I try the following code:
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' INTO TABLE dbname.table_name;
It runs fine, but the table in Hive still does not populate.
I also tried all of the above using CREATE EXTERNAL TABLE instead, specifying the HDFS path in the LOCATION clause. I also tried creating an HDFS directory first, uploading the csv files to it, and then running CREATE EXTERNAL TABLE with LOCATION pointed at that pre-made directory.
I already made sure I have authorization privileges.
My table will not populate with the csv files, no matter which method I try.
What am I doing wrong here?
I was able to solve the problem using the following, which declares the field delimiter and storage format explicitly:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp STRING COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value STRING COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
and
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' OVERWRITE INTO TABLE dbname.table_name;
I started pulling a Glue table via pyathena last week. However, one annoying thing I noticed is that with the code shown below, sometimes it works and returns a pandas DataFrame, but other times it creates a CSV and a CSV metadata file in the S3 folder where the physical data (Parquet) is stored and registered in Glue.
I know that using the pandas cursor may end up producing these two files, but I wonder whether I can access the data without them, since every time they are generated in S3 my read process fails.
Thank you!
import os
import pandas as pd
from pyathena import connect

access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')

# Athena writes its query results to the staging directory passed here
connect1 = connect(s3_staging_dir='s3://xxxxxxxxxxxxx')
df = pd.read_sql("select * from abc.table_name", connect1)
df.head()
1. Go to Athena.
2. Click Settings -> workgroup name -> Edit workgroup.
3. Update "Query result location".
4. Click "Override client-side settings".
Note: If you have not set up any other workgroups for your Athena environment, you should only find one workgroup, named "Primary".
This should resolve your problem. For more information you can read:
https://docs.aws.amazon.com/athena/latest/ug/querying.html
I am getting the following error when I run a COPY command to copy the contents of a .csv file in S3 to a table in Redshift.
Error: "String length exceeds DDL length".
I am using the following COPY command:
COPY enjoy from 's3://nmk-redshift-bucket/my_workbook.csv' CREDENTIALS 'aws_access_key_id=****;aws_secret_access_key=****' CSV QUOTE '"' DELIMITER ',' NULL AS '\0'
I figured I would open the link S3 gives for my file through the AWS console.
The link for the workbook is:
link to my S3 bucket csv file
The above file is filled with many weird characters that I really don't understand.
The COPY command is taking in these characters instead of the information I entered in my csv file, which leads to the string length exceeded error.
I use SQL Workbench to query. My 'stl_load_errors' table in Redshift has raw_field_value entries similar to the characters in the link I mentioned above; that's how I found out what it is taking in as input.
I am new to AWS and UTF-8 configuration, so I would appreciate any help on this.
The link you provide points to an .xlsx file (saved with a .csv extension instead of .xlsx), which is actually a ZIP archive.
That is why you see those strange characters: the first two are 'PK', the signature of a ZIP file.
So you will have to export the workbook to a real .csv file first, before using it with COPY.
I'm attempting to COPY a CSV file to Redshift from an S3 bucket. When I execute the command, I don't get any error messages; however, the load doesn't work.
Command:
COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;
Response:
Load into table 'temp' completed, 0 record(s) loaded successfully.
I attempted to isolate the issue via the system tables, but there is no indication there are issues.
Table Definition:
CREATE TABLE temp ("id" BIGINT);
CSV Data:
id
123,
The lines in your csv file probably don't end with a Unix newline character, so the COPY command probably sees your file as:
id123,
Given you have the IGNOREHEADER option enabled, and the line endings in the file aren't what COPY is expecting (my assumption based on past experience), the file contents get treated as one line, and then skipped.
I had this occur for some files created from a Windows environment.
I guess one thing to remember is that CSV is not a strict standard, more of a convention, and different products/vendors have different implementations of csv file creation.
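One way to check this is to see how many lines Redshift actually scanned for the load, for example with a query along these lines against the stl_load_commits system table:
-- If the whole file was treated as a single line, lines_scanned will show 1
SELECT query, filename, lines_scanned, status, curtime
FROM stl_load_commits
ORDER BY curtime DESC
LIMIT 10;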
I repeated your instructions, and it worked just fine:
First, the CREATE TABLE
Then, the LOAD (from my own text file containing just the two lines you show)
This resulted in:
Code: 0 SQL State: 00000 --- Load into table 'temp' completed, 1 record(s) loaded successfully.
So, there's nothing obviously wrong with your commands.
At first, I thought that the comma at the end of your data line could cause Amazon Redshift to think that there is an additional column of data that it can't map to your table, but it worked fine for me. Nonetheless, you might try removing the comma, or creating an additional column to store this 'empty' value, along the lines of the sketch below.
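A rough sketch of that second option (the extra column name is just a placeholder):
-- The second column simply absorbs the empty value that follows the trailing comma
CREATE TABLE temp (
    "id"    BIGINT,
    "extra" VARCHAR(1)   -- placeholder column, expected to stay empty
);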
I'm trying to migrate some MySQL tables to Amazon Redshift, but I ran into some problems.
The steps are simple:
1. Dump the MySQL table to a csv file
2. Upload the csv file to S3
3. Copy the data file to RedShift
Error occurs in step 3:
The SQL command is:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' csv;
The error info:
An error occurred when executing the SQL command: copy TABLE_A from
's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx ERROR: COPY CSV is
not supported [SQL State=0A000] Execution time: 0.53s 1 statement(s)
failed.
I don't know if there are any limitations on the format of the csv file, such as the delimiters and quotes; I cannot find this in the documentation.
Can anyone help?
The problem was finally resolved by using:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ','
removequotes;
More information can be found here http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
Amazon Redshift now supports the CSV option for the COPY command. It's better to use this option to import CSV-formatted data correctly. The format is shown below.
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV;
The default delimiter is ( , ) and the default quote character is ( " ). You can also import TSV-formatted data with the CSV and DELIMITER options, like this:
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV DELIMITER '\t';
There is a disadvantage to the old way (DELIMITER and REMOVEQUOTES): REMOVEQUOTES does not support having a newline or a delimiter character within an enclosed field. If the data can include such characters, you should use the CSV option.
See the following link for the details.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
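As an illustration of that point, a row such as 1,"Smith, John",42 loads as three columns with the CSV option, whereas DELIMITER with REMOVEQUOTES would split it on the embedded comma. A rough example (the table and bucket names are made up):
-- The quoted field containing a comma is handled correctly by the CSV option
COPY example_table FROM 's3://example-bucket/people.csv'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
CSV QUOTE '"';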
If you want to save yourself some code, or you have a very basic use case, you can use AWS Data Pipeline.
It starts a spot instance and performs the transformation within the Amazon network, and it's a really intuitive tool (but very simple, so you can't do complex things with it).
You can try this:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' csv;
CSV itself means comma-separated values, so there is no need to provide a delimiter with it. Please refer to the link:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format
I always use this code:
COPY clinical_survey
FROM 's3://milad-test/clinical_survey.csv'
iam_role 'arn:aws:iam::123456789123:role/miladS3xxx'
CSV
IGNOREHEADER 1
;
Description:
1- COPY is followed by the name of the target table (here it happens to match the name of the file stored in S3)
2- FROM gives the S3 address of the file
3- iam_role is a substitute for CREDENTIALS. Note that the IAM role has to be defined in the IAM console, and its trust relationship has to be set up so that Redshift can assume it (that is the hardest part!)
4- CSV tells COPY that the file is comma-delimited
5- IGNOREHEADER 1 is a must here! Otherwise it will throw an error (it skips one row of my CSV and treats it as the header)
Since the resolution has already been provided, I'll not repeat the obvious.
However, in case you receive some more errors which you're not able to figure out, simply execute this in your workbench while you're connected to any of your Redshift accounts:
select * from stl_load_errors [where ...];
stl_load_errors contains a history of all Amazon Redshift load errors; a normal user can view details corresponding to their own account, while a superuser has access to all of them.
The details are documented in:
Amazon STL Load Errors Documentation
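For example, a slightly more targeted query along these lines (the columns are described at the link above) makes the failure reason easier to spot:
-- Most recent load errors, with the fields that usually explain the failure
SELECT starttime, filename, line_number, colname, raw_field_value, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;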
A little late to comment, but it can be useful:
You can use an open-source project, sqlshift, to copy tables directly from MySQL to Redshift.
It only requires Spark, and if you have YARN then that can be used as well.
Benefits: it automatically decides the distkey and interleaved sortkey based on the primary key.
It looks like you are trying to load a local file into a Redshift table.
The CSV file has to be on S3 for the COPY command to work.
If you can extract the data from the table to a CSV file, you have one more scripting option: you can use a Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.
In my MySQL_To_Redshift_Loader I do the following:
Extract data from MySQL into a temp file:
from subprocess import Popen, PIPE

# Build the mysql client invocation from the connection options
loadConf=[ db_client_dbshell ,'-u', opt.mysql_user,'-p%s' % opt.mysql_pwd,'-D',opt.mysql_db_name, '-h', opt.mysql_db_server]
...
# SELECT ... INTO OUTFILE query that dumps the table to a delimited file
q="""
%s %s
INTO OUTFILE '%s'
FIELDS TERMINATED BY '%s'
ENCLOSED BY '%s'
LINES TERMINATED BY '\r\n';
""" % (in_qry, limit, out_file, opt.mysql_col_delim,opt.mysql_quote)
# Pipe the query text into the mysql client
p1 = Popen(['echo', q], stdout=PIPE,stderr=PIPE,env=env)
p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE,stderr=PIPE)
...
Compress and upload the data to S3 using the boto Python module and multipart upload.
import boto
from boto.s3.key import Key

conn = boto.connect_s3(AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
# Stream the (compressed) temp file up to S3, reporting progress along the way
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                         reduced_redundancy=use_rr )
Use psycopg2 to run the COPY command that appends the data to the Redshift table.
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)