Unable to load .csv data from hdfs into Hive table in Hadoop - csv

I am trying to load csv files into a Hive table. I need to have it done through HDFS.
My end goal is to have the Hive table also connected to Impala tables, which I can then load into Power BI, but I am having trouble getting the Hive tables to populate.
I create a table in the Hive query editor using the following code:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp TIMESTAMP COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value DOUBLE COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
Then I check and see the LOCATION using the following code:
SHOW CREATE TABLE dbname.table_name;
and find that it has gone to the default location:
hdfs://our_company/user/hive/warehouse/dbname.db/table_name
So I go to the above location in HDFS and manually upload a few csv files, which are in the same five-column format as the table I created. I expect this data to be loaded into the Hive table, but when I go back to dbname in Hive and open the table I made, all values are still null, and when I try to open it in the browser I get:
DB Error
AnalysisException: Could not resolve path: 'dbname.table_name'
Then I try the following code:
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' INTO TABLE dbname.table_name;
It runs fine, but the table in Hive still does not populate.
I also tried all of the above using CREATE EXTERNAL TABLE instead, specifying the HDFS path in the LOCATION argument. I also tried making an HDFS location first, uploading the csv files, and then running CREATE EXTERNAL TABLE with the LOCATION argument pointed at the pre-made HDFS location.
I already made sure I have authorization privileges.
My table will not populate with the csv files, no matter which method I try.
What am I doing wrong here?

I was able to solve the problem using:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp STRING COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value STRING COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
and
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' OVERWRITE INTO TABLE dbname.table_name;
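As a quick sanity check after the LOAD (a minimal example reusing the table name from above; the LIMIT value is arbitrary):
-- Columns should now be populated instead of returning NULL
SELECT * FROM dbname.table_name LIMIT 10;
SELECT COUNT(*) FROM dbname.table_name;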

Related

Athena AWS bad field name and multiple folders with Hive DDL

I'm new to AWS Athena, and I'm trying to query multiple S3 buckets containing JSON files. I encountered a number of problems that don't have any answer in the documentation (sadly, the error log is not informative enough for me to solve them myself):
How do I query a JSON field whose name contains parentheses? For example, I have a field named "Capacity(GB)", and when I try to include it in the CREATE EXTERNAL TABLE statement I receive an error:
CREATE EXTERNAL TABLE IF NOT EXISTS test-scema.test_table (
`device`: string,
`Capacity(GB)`: string)
Your query has the following error(s):
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
java.lang.IllegalArgumentException: Error: : expected at the position
of 'Capacity(GB):string>' but '(' is found.
My files are located in sub folders in S3 in a following structure:
'location_name/YYYY/MM/DD/appstring/'
and I want to query all the dates for a specific app-string (out of many). Is there any 'wildcard' I can use to replace the date parts of the path?
Something like this:
LOCATION 's3://location_name/%/%/%/appstring/'
Do I have to load the raw data as-is using CREATE EXTERNAL TABLE, and only then query it, or can I add some WHERE statements built in? Specifically, is something like this possible:
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table (
field1:string,
field2:string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://folder/YYYY/MM/DD/appstring'
WHERE field2='value'
What would be the outcome in terms of billing? Right now I'm building this CREATE statement only to re-use the data in a SQL query once again.
Thanks!
1. JSON field named with parentheses
There is no need to create a field called Capacity(GB). Instead, create the field with a different name:
CREATE EXTERNAL TABLE test_table (
device string,
capacity string
)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='device,Capacity(GB)')
LOCATION 's3://xxx';
If you are using nested JSON, you can use the SerDe's mapping property (which I saw in the question "issue with Hive Serde dealing nested structs"):
CREATE external TABLE test_table (
top string,
inner struct<device:INT,
capacity:INT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties
(
"mapping.capacity" = "Capacity(GB)"
)
LOCATION 's3://xxx';
This works nicely with an input of:
{ "top" : "123", "inner": { "Capacity(GB)": 12, "device":2}}
2. Subfolders
You cannot wildcard mid-path (s3://location_name/*/*/*/appstring/). The closest option is to use partitioned data but that would require a different naming format for your directories.
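If the directory layout cannot change, one workaround is to register each existing folder as a partition explicitly. A minimal sketch (the table name and the partition keys dt/appstring are hypothetical):
CREATE EXTERNAL TABLE IF NOT EXISTS test_schema.test_table_partitioned (
  device string,
  capacity string
)
PARTITIONED BY (dt string, appstring string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://location_name/';
-- Register one existing date/app folder as a partition
ALTER TABLE test_schema.test_table_partitioned
  ADD PARTITION (dt='2017-01-01', appstring='myapp')
  LOCATION 's3://location_name/2017/01/01/myapp/';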
3. Creating tables
You cannot specify WHERE statements as part of the CREATE TABLE statement.
If your aim is to reduce data costs, then use partitioned data to reduce the number of files scanned or store in a column-based format such as Parquet.
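To make the two-step pattern concrete (a sketch reusing the hypothetical test_schema.test_table from the question): the CREATE EXTERNAL TABLE statement only registers metadata, and the WHERE filter belongs in the query you run afterwards, which is what gets billed by data scanned.
-- Filtering happens at query time, not at table-creation time
SELECT field1, field2
FROM test_schema.test_table
WHERE field2 = 'value';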
For examples, see: Analyzing Data in S3 using Amazon Athena

Importing a CSV with a timestamp field into MonetDB

I'm importing a CSV into MonetDB. I create a table called fx:
CREATE TABLE fx(ticktime timestamp,broker varchar(6),pair varchar(10),side varchar(1),price float,size tinyint,level tinyint)
and now I am trying to upload a large CSV file that does not have a header.
My sample.csv:
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12437,1,1
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12439,5,2
20150828 00:00:00.023,BRK1,EUR/USD,A,1.12441,9,3
My command:
sql>copy into fx from 'c:\fx\sample.csv' using delimiters ',','\n';
Failed to import table line 1 field 1 'timestamp(7)' expected in '20150828 00:00:00.023'
How do I upload this csv?
The timestamp format in your file is not the one MonetDB likes. So two options:
1) Change the type of ticktime to string:
CREATE TABLE fx(ticktime string, broker varchar(6),pair varchar(10),side varchar(1),price float,size tinyint,level tinyint);
COPY INTO ...
However, you would then need to convert the string column ticktime to a new column ticktimet of type timestamp using string manipulation, for example:
ALTER TABLE fx add column ticktimet timestamp;
UPDATE fx SET ticktimet=str_to_timestamp(ticktime , '%Y%m%d %H:%M:%S');
Note that this solution will discard the subsecond part (e.g. .023) from the timestamp, as this is currently not supported in str_to_timestamp.
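If the raw string column is no longer needed after the conversion, it can be dropped (assuming nothing else references it):
-- Optional cleanup once ticktimet is populated
ALTER TABLE fx DROP COLUMN ticktime;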
2) Change the CSV to use a date format MonetDB likes, e.g.
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12437,1,1
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12439,5,2
2015-08-28 00:00:00.023,BRK1,EUR/USD,A,1.12441,9,3
Then, COPY INTO should work directly.

Creating hive table over complex parquet file

I am trying to put a Hive table on top of a Parquet file that I created based on the following JSON contents:
{"user_id":"4513","providers":[{"id":"4220","name":"dbmvl","behaviors":{"b1":"gxybq","b2":"ntfmx"}},{"id":"4173","name":"dvjke","behaviors":{"b1":"sizow","b2":"knuuc"}}]}
{"user_id":"3960","providers":[{"id":"1859","name":"ponsv","behaviors":{"b1":"ahfgc","b2":"txpea"}},{"id":"103","name":"uhqqo","behaviors":{"b1":"lktyo","b2":"ituxy"}}]}
{"user_id":"567","providers":[{"id":"9622","name":"crjju","behaviors":{"b1":"rhaqc","b2":"npnot"}},{"id":"6965","name":"fnheh","behaviors":{"b1":"eipse","b2":"nvxqk"}}]}
I basically used Spark SQL to read the JSON and write out a Parquet file.
I am running into issues with putting Hive on top of the produced Parquet file. Here is the Hive HQL I have:
create table test (mycol STRUCT<user_id:String, providers:ARRAY<STRUCT<id:String, name:String, behaviors:MAP<String, String>>>>) stored as parquet;
Alter table test set location 'hdfs:///tmp/test.parquet';
The above statements execute fine, but I get errors when I try to do a select * on the table:
Failed with exception java.io.IOException:java.lang.IllegalStateException: Column mycol at index 0 does not exist in {providers=providers, user_id=user_id}
Try changing your query to:
create table test (user_id String, providers ARRAY<STRUCT<id:String, name:String, behaviors:MAP<String, String>>>) stored as parquet;
The root JSON object gets flattened out when the Parquet file is stored, so its fields become top-level columns rather than fields of a wrapping struct.
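With the corrected definition, the nested fields can be queried by unnesting the array, for example (the aliases here are arbitrary):
-- Each provider struct becomes one output row per user
SELECT user_id, p.id, p.name, p.behaviors['b1'] AS b1
FROM test
LATERAL VIEW explode(providers) pv AS p;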

How to create a Hive table with multiple hdfs files

So basically I would like to create a table containing CSV files.
I tried something like this, where the filenames differ from each other only by the last two digits:
CREATE EXTERNAL TABLE pageviews
(page_date string,site string)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hue/201401/pageviews/supersite_1046_201401**.csv';
To me this syntax looks ok but when I execute it I get the following:
Error occurred executing hive query: Unknown exception.
Any help would be appreciated.
The LOCATION parameter of Hive's CREATE TABLE statement takes an *hdfs_path* as its argument (see the Hive documentation). Such a path cannot be a file path; it must be a directory path, hence the error you get.
In your case, you could put the required files under a specific directory and specify that directory in the LOCATION clause of the CREATE TABLE statement.
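For example, once the CSV files sit under a directory such as /user/hue/201401/pageviews (a minimal sketch based on the path in the question), LOCATION points at that directory rather than at a file pattern:
CREATE EXTERNAL TABLE pageviews (page_date string, site string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
-- Directory only: Hive reads every file it finds under this path
LOCATION '/user/hue/201401/pageviews';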

Uploading CSV for Impala

I am trying to upload a CSV file to HDFS for Impala and it keeps failing. I am not sure what is wrong here, as I have followed the guide, and the CSV is already on HDFS.
CREATE EXTERNAL TABLE gc_imp
(
asd INT,
full_name STRING,
sd_fd_date STRING,
ret INT,
ftyu INT,
qwer INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/Gc_4';
This is the error I am getting (I am using Hue):
> TExecuteStatementResp(status=TStatus(errorCode=None,
> errorMessage='MetaException: hdfs://nameservice1/user/hadoop/Gc_4 is
> not a directory or unable to create one', sqlState='HY000',
> infoMessages=None, statusCode=3), operationHandle=None)
Any leads?
/user/hadoop/Gc_4 must be a directory, so you need to create that directory first and then upload your Gc_4 file into it, giving a file path of /user/hadoop/Gc_4/Gc_4. After that, you can use LOCATION to point at the directory path /user/hadoop/Gc_4.
LOCATION must be a directory; this requirement is the same in Hive and Impala.
This is not the answer but a workaround.
In most cases I have seen, the table was uploaded but the "status" was not successful.
Also, if you have stored the data with the help of Hive, which gives you more control, don't forget to REFRESH the metadata in Impala. This is very important.
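If the table or its data was changed through Hive, a minimal example of refreshing Impala's view of it (standard Impala statements, using the table name from the question):
-- Pick up a table created or altered outside Impala
INVALIDATE METADATA gc_imp;
-- Pick up new data files added to an already-known table
REFRESH gc_imp;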