How to load a CSV file into Hive

This is my CSV file:
id,name,address
"1xz","hari","streetno=1-23-2,street name=Lakehill,town=Washington"
"2xz","giri","streetno=5-6-3456,street name=second street,town=canada"
I loaded this data using a row format delimiter of ",", but it did not load properly; I am facing a problem with the address field. The address field holds data in this format: "streetno=1-23-2,street name=Lakehill,town=Washington", so the values inside it are themselves separated by ",". I found a solution in Pig; please help me solve it using Hive.
I am getting this output:
"1xz" "hari" "streetno=1-23-2
"2xz" "giri" "streetno=5-6-3456
This is my schema:
create table emps (id string, name string, address string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

Use the split() function; it returns an array of strings: [0]='streetno', [1]='1-23-2':
split(address,'=')[1] as address --returns '1-23-2'
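A hedged sketch of how that call could be used, assuming (as in the output shown above) the address column only holds the truncated 'streetno=1-23-2' fragment:
SELECT id,
       name,
       split(address, '=')[1] AS streetno  -- returns '1-23-2' when the column holds 'streetno=1-23-2'
FROM emps;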

You already found a working solution in Pig, so why not transfer that relation to a Hive table directly using HCatalog?
STORE pig_relation INTO 'hive_table_name' USING org.apache.hive.hcatalog.pig.HCatStorer();
Make sure you start up Pig using:
>pig -useHCatalog
Table must already exist in Hive.
Hope this helps.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
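Applied to the table from the question, a sketch might look like this (assuming the values stay double-quoted as in the sample; the skip.header.line.count property is used here to skip the header row):
CREATE TABLE emps (id string, name string, address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = '"'
)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count" = "1");
Note that OpenCSVSerde reads every column back as string, regardless of the declared types.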

Related

Export non-varchar data to CSV table using Trino (formerly PrestoDB)

I am working on some benchmarks and need to compare ORC, Parquet and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other datatypes (i.e. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable for the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar), CAST(name AS varchar), CAST(comment AS varchar)
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. TEXTFILE normally uses the ^A character (ASCII 1) as the field separator, but when it is set to ',' the layout resembles the CSV format.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be inserted correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...

How to prevent Hive Create Table from splitting the column that has comma "," inside the data into two columns

I imported a MySQL table using Sqoop. Some of the column values have a comma "," in them, for example "value, ST". I want to store that value in a single column, just as it is in MySQL, but when I create the Hive table, "value" and "ST" are stored in separate columns: "ST" ends up in the next column to the right.
I've tried this
CREATE EXTERNAL TABLE IF NOT EXISTS personal_to_delete
(id_personal string,
no_ktp string,
nama string,
nama_tanpa_gelar string,
alamat1 string,
kodepos string,
id_kabupaten_alamat string,
id_propinsi string,
npwp string,
tgl_update string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\",
"quoteChar" = ","")
FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/hadoop/personal_to_delete';
I get NULL values after I run this query. How do I solve this?
The problem seems to be that the separator character used for separating the fields also appears in the values themselves. This makes it difficult for the CREATE TABLE command to work correctly. To make the file "easy to understand" for the import process, you need to either
escape this character within the fields, or
use a quote character to enclose the fields, or
use a different field separator which does not appear in the fields themselves.
I myself would probably take one of these two options:
As suggested by OneCricketeer: let Sqoop import directly into a Hive table.
When creating a file with Sqoop, try the option --fields-terminated-by, which sets the field separator character. If you set it to something different from a comma "," (e.g. a semicolon ";" or something else not appearing in your data), provide this information in your Hive CREATE statement (e.g. FIELDS TERMINATED BY ';'), and leave out the SERDEPROPERTIES, it should work.
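Alternatively, if the exported fields really are wrapped in double quotes, the OpenCSVSerde from the question can work once the two serde properties are set correctly (they appear swapped/garbled in the statement above) and the FIELDS TERMINATED BY clause is dropped. A hedged sketch, assuming quoted, comma-separated input:
CREATE EXTERNAL TABLE IF NOT EXISTS personal_to_delete
(id_personal string,
no_ktp string,
nama string,
nama_tanpa_gelar string,
alamat1 string,
kodepos string,
id_kabupaten_alamat string,
id_propinsi string,
npwp string,
tgl_update string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = '"'
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/personal_to_delete';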

How to convert string "3.82384E+11" to BIGINT with MySQL?

I'm trying to save some ID values from a CSV that are automatically converted to exponent notation by Excel.
For example, 382383816413 becomes 3.82384E+11. So I'm doing a full import into my MySQL database with:
LOAD DATA LOCAL INFILE
'file.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ';'
ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(@item_id,
@colum2,
@colum3)
SET
item_id = @item_id;
I've tried using cast like:
CAST('3.82384E+11' as UNSIGNED) and it gives me just 3.
CAST('3.82384E+11' as BIGINT) and it doesn't work.
CAST('3.82384E+11' as UNSIGNED BIGINT) and gives me 3 again.
So, what's the best way to convert exponent-notation strings to real big integers in MySQL?
Set the column format to text instead of number in Excel. Refer to the link below.
PHPExcel - set cell type before writing a value in it
My option was to convert the column showing 3.82384E+11 back to a number in the Excel file, so it returned to the original value. Then I exported to CSV and used a SQL query to import it fine.
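If the conversion has to happen in MySQL itself, a minimal sketch (assuming the strings always parse as valid numbers): adding 0 forces a numeric context, which parses the exponent notation as a DOUBLE, and the result can then be cast to an unsigned integer, e.g. in the SET clause of the LOAD DATA statement above:
SET item_id = CAST(@item_id + 0 AS UNSIGNED);
-- For example, SELECT CAST('3.82384E+11' + 0 AS UNSIGNED); returns 382384000000.
Note that Excel's rounding has already discarded the low-order digits, so the original 382383816413 cannot be recovered this way.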

Export Dynamodb to S3 using Hive

I referred to this link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html.
My hive script is like below:
DROP TABLE IF EXISTS hiveTableName;
CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName;
CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucket/test-hive2';
SET dynamodb.throughput.read.percent=0.8;
INSERT OVERWRITE TABLE s3TableName SELECT *
FROM hiveTableName;
The DynamoDB table can be successfully exported to S3, but the file format is not JSON; it is like:
uuid{"s":"db154955-8555-4b49-bf40-ee36605ac510"}num{"n":"1294"}info{"s":"qwefjdkslafjdafl"}
uuid{"s":"d9898564-2b56-42ba-9cfb-fd092e7d0b8d"}num{"n":"100"}info{"s":"qwefjdkslafjdafl"}
Does someone know how to export in JSON format? I know I can use Data Pipeline, and it can export a DynamoDB table to S3 in JSON format, but for some reason I need to use EMR. I tried another tool: https://github.com/awslabs/emr-dynamodb-connector, and used the command:
java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name
but the error was
Error: Could not find or load main class org.apache.hadoop.dynamodb.tools.DynamoDBExport
Can someone tell me how to solve these problems? Thanks.
== update ==
If I use to_json, as Chris suggested, my code is as below:
DROP TABLE IF EXISTS hiveTableName2;
CREATE EXTERNAL TABLE hiveTableName2 (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName2;
CREATE EXTERNAL TABLE s3TableName2 (item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://backup-restore-dynamodb/hive-test';
INSERT OVERWRITE TABLE s3TableName2 SELECT to_json(item)
FROM hiveTableName2;
When I look at the generated file, it's like
{"uuid":"{\"s\":\"db154955-8555-4b49-bf40-ee36605ac510\"}","num":"{\"n\":\"1294\"}","info":"{\"s\":\"qwefjdkslafjdafl\"}"}
What I want is a nested map, like
map<string, map<string, string>>
not
map<string, string>
Can someone give me some suggestions? Thanks.
Your SELECT * query is emitting a serialized form of the Hive map, which isn't guaranteed to be JSON. You may want to consider using the Brickhouse Hive UDFs. In particular, calling the to_json function would be a good fit for guaranteeing a JSON format in your output.
to_json -- Convert an arbitrary Hive structure (list, map, named_struct) into JSON
INSERT OVERWRITE TABLE s3TableName SELECT to_json(item)
FROM hiveTableName;
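To run this, the Brickhouse jar has to be added to the Hive session and the function registered first. A hedged sketch: the jar path is a placeholder, and the UDF class name is the one I believe Brickhouse uses, so check it against the jar you build:
ADD JAR /path/to/brickhouse.jar;  -- placeholder path: point this at your Brickhouse build
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF';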
On November 9, 2020, DynamoDB released a new feature to export your data to an S3 bucket - you can read more about it here:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
It's a native, serverless solution, and currently (as of 11/20) it supports DynamoDB JSON.

How to convert date in .csv file into SQL format before mass insertion

I have a csv file with a couple thousand game dates in it, but they are all in the MM/DD/YYYY format
2/27/2011,3:05 PM,26,14
(26 and 14 are team id #s), and trying to put them into SQL like that just results in 0000-00-00 being put into the date field of my table. This is the command I tried using:
LOAD DATA LOCAL INFILE 'c:/scheduletest.csv' INTO TABLE game
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(`date`, `time`, `awayteam_id`, `hometeam_id`);
but again, it wouldn't do the dates right. Is there a way I can have it convert the date as it tries to insert it? I found another SO question similar to this, but I couldn't get it to work.
Have you tried the following:
LOAD DATA LOCAL INFILE 'c:/scheduletest.csv' INTO TABLE game
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@DATE_STR, `time`, `awayteam_id`, `hometeam_id`)
SET `date` = STR_TO_DATE(@DATE_STR, '%c/%e/%Y');
For more information, the documentation has details about the use of user variables with LOAD DATA (about half-way down - search for "User variables in the SET clause" on the page).
You can load the data from the CSV into variables and run functions on them before inserting, like:
LOAD DATA INFILE 'file.txt'
INTO TABLE t1
(@datevar, @timevar, awayteam_id, hometeam_id)
SET date = STR_TO_DATE(@datevar, '%m/%d/%Y'),
    time = ...;
My suggestion would be to insert the file into a temporary holding table where the date column is a character datatype. Then write a query with the STR_TO_DATE conversion to move the data from the holding table to your final destination.
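A minimal sketch of that holding-table approach (the staging table and its column names are made up for illustration, and the time format string assumes values like "3:05 PM"):
CREATE TABLE game_staging (
  game_date   VARCHAR(20),
  game_time   VARCHAR(20),
  awayteam_id INT,
  hometeam_id INT
);
LOAD DATA LOCAL INFILE 'c:/scheduletest.csv' INTO TABLE game_staging
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
INSERT INTO game (`date`, `time`, `awayteam_id`, `hometeam_id`)
SELECT STR_TO_DATE(game_date, '%c/%e/%Y'),
       STR_TO_DATE(game_time, '%l:%i %p'),
       awayteam_id,
       hometeam_id
FROM game_staging;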
Convert the field that you are using for the date to a varchar type so it will accept any format
Import the CSV
Convert the dates to a valid MySQL date format using something like:
UPDATE table SET field = STR_TO_DATE(field, '%c/%e/%Y %H:%i');
Then revert the field type to date
Use a function to convert the format as needed.
I'm not an expert on MySQL, but http://dev.mysql.com/doc/refman/5.0/en/date-and-time-functions.html#function_str-to-date looks promising.
If you can't do that in the load command directly, you may try creating a table that allows you to load all the values as VARCHAR and then do an insert into your game table with a SELECT statement that applies the appropriate conversion instead.
If your file is not too big, you can use the Excel function TEXT. If, for example, your date is in cell A2, then the formula in a temporary column next to it would be =TEXT(A2,"yyyy-mm-dd hh:mm:ss"). This will do it, and then you can paste the values of the formula's result back into the column and delete the temporary column.