Export Dynamodb to S3 using Hive

Export Dynamodb to S3 using Hive - json

I referred to this link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html.
My hive script is like below:
DROP TABLE IF EXISTS hiveTableName;
CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName;
CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucket/test-hive2';
SET dynamodb.throughput.read.percent=0.8;
INSERT OVERWRITE TABLE s3TableName SELECT *
FROM hiveTableName;
Dynamodb table can be successfully exported to S3, but the file format is not JSON, it is like:
uuid{"s":"db154955-8555-4b49-bf40-ee36605ac510"}num{"n":"1294"}info{"s":"qwefjdkslafjdafl"}
uuid{"s":"d9898564-2b56-42ba-9cfb-fd092e7d0b8d"}num{"n":"100"}info{"s":"qwefjdkslafjdafl"}
Does someone know how to export in JSON format? I know I can use Data Pipeline, and it can export dynamodb table to S3 in JSON format, but for some reason I need to use EMR. I tried another tool: https://github.com/awslabs/emr-dynamodb-connector, and use the command:
java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name
but the error was
Error: Could not find or load main class org.apache.hadoop.dynamodb.tools.DynamoDBExport
Can someone tell me how to solve these problems? Thanks.
== update ==
If I use to_json, as Chris suggested, my code is as below:
DROP TABLE IF EXISTS hiveTableName2;
CREATE EXTERNAL TABLE hiveTableName2 (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");
DROP TABLE IF EXISTS s3TableName2;
CREATE EXTERNAL TABLE s3TableName2 (item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://backup-restore-dynamodb/hive-test';
INSERT OVERWRITE TABLE s3TableName2 SELECT to_json(item)
FROM hiveTableName2;
When I look at the generated file, it's like
{"uuid":"{\"s\":\"db154955-8555-4b49-bf40-ee36605ac510\"}","num":"{\"n\":\"1294\"}","info":"{\"s\":\"qwefjdkslafjdafl\"}"}
What I want is a nested map, like
map<string, map<string, string>>
not
map<string, string>
Can someone give me some suggestions? Thanks.

Your SELECT * query is emitting a serialized form of the Hive map, which isn't guaranteed to be JSON. You may want to consider using the Brickhouse Hive UDF's. In particular, calling the to_json function would be a good fit for guaranteeing a JSON format in your output.
to_json -- Convert an arbitrary Hive structure ( list,map, named_struct ) into JSON
INSERT OVERWRITE TABLE s3TableName SELECT to_json(item)
FROM hiveTableName;

On November 9, 2020, DynamoDB released a new feature to export your data to an S3 bucket - you can read more about it here:
https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/
It's a native, server-less solution, and currently (as of 11/20) supports DynamoDB JSON.

Related

How to deal with JSON with special characters in Column Names in AWS ATHENA

I'm new to athena even though I have some short experience with Hive.
I'm trying to create a table from JSON files, which are exports from MongoDB. My problem is that MongoDB uses $oid, $numberInt, $numberDoble and others as internal references, but '$' is not accepted in a column name in Athena.
This is a one line JSON file that I created to test:
{"_id":{"$oid":"61f87ebdf655d153709c9e19"}}
and this is the table that referes to it:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`$oid`:string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket-name/test/';
When I run a simple SELECT * it returns this error:
HIVE_METASTORE_ERROR: Error: name expected at the position 7 of
'struct<$oid:string>' but '$' is found. (Service: null; Status Code:
0; Error Code: null; Request ID: null; Proxy: null)
Which is related to the fact that the JSON column contains the $.
Any idea on how to handle the situation? My only resolution for now is to create a script which "clean" the json file from the unaccepted characters but I would really prefer to handle it directly in Athena if possible

If you switch to the OpenX SerDe, you can create a SerDe mapping for JSON fields with special characters like $ in the name.
See AWS Blog entry Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe , section "Walkthrough: Handling forbidden characters with mappings".
A mapping that would work for your example:
CREATE EXTERNAL TABLE landing.json_table (
`_id` struct<`oid`:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.oid"="$oid"
)
LOCATION 's3://bucket-name/test/';

How to load json snappy compressed in HIVE

I have a bunch of json snappy compressed files in HDFS.
They are HADOOP snappy compressed (not python, cf other SO questions)
and have nested structures.
Could not find a method to load them into
into HIVE (using json_tuple) ?
Can I get some ressources/hints on how to load them
Previous references (does not have valid answers)
pyspark how to load compressed snappy file
Hive: parsing JSON

Put all files in HDFS folder and create external table on top of it. If files have names like .snappy Hive will automatically recognize them. You can specify SNAPPY output format for writing table:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
CREATE EXTERNAL TABLE mydirectory_tbl(
id string,
name string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' --this is HDFS/S3 location
;
JSONSerDe can parse all complex structures, it is much easier than using json_tuple. Simple attributes in json are mapped to columns as is All in the square brackets [] is an array<>, in {} is a struct<> or map<>, complex types can be nested. Carefully read Readme: https://github.com/rcongiu/Hive-JSON-Serde. There is a section about nested structures and many examples of CREATE TABLE.
If you still want to use json_tuple, then create table with single STRING column then parse using json_tuple. But it is much more difficult.
All JSON records should be in single line (no newlines inside JSON objects, as well as \r) . The same is mentioned here https://github.com/rcongiu/Hive-JSON-Serde

If your data is partitioned (ex. by date)
Create the table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
filename STRING,
cnt BIGINT,
size DOUBLE
) PARTITIONED BY ( \`date\` STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs'
Recover the partition (before the recovery, the table seems to be empty)
MSCK REPAIR TABLE database.table

Unable to load .csv data from hdfs into Hive table in Hadoop

I am trying to load csv files into a Hive table. I need to have it done through HDFS.
My end goal is to have the hive table also connected to Impala tables, which I can then load into Power BI, but I am having trouble getting the Hive tables to populate.
I create a table in the Hive query editor using the following code:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp TIMESTAMP COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value DOUBLE COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
Then I check and see the LOCATION using the following code:
SHOW CREATE TABLE dbname.table_name;
and find that is has gone to the default location:
hdfs://our_company/user/hive/warehouse/dbname.db/table_name
So I go to the above location in HDFS, and I upload a few csv files manually, which are in the same five-column format as the table I created. Here is where I expect this data to be loaded into the Hive table, but when I go back to dbname in Hive, and open up the table I made, all values are still null, and when I try to open in browser I get:
DB Error
AnalysisException: Could not resolve path: 'dbname.table_name'
Then I try the following code:
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' INTO TABLE dbname.table_name;
It runs fine, but the table in Hive still does not populate.
I also tried all of the above using CREATE EXTERNAL TABLE instead, and specifying the HDFS in the LOCATION argument. I also tried making an HDFS location first, uploading the csv files, then CREATE EXTERNAL TABLE with the LOCATION argument pointed at the pre-made HDFS location.
I already made sure I have authorization privileges.
My table will not populate with the csv files, no matter which method I try.
What I am doing wrong here?

I was able to solve the problem using:
CREATE TABLE IF NOT EXISTS dbname.table_name (
time_stamp STRING COMMENT 'time_stamp',
attribute STRING COMMENT 'attribute',
value STRING COMMENT 'value',
vehicle STRING COMMENT 'vehicle',
filename STRING COMMENT 'filename')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
and
LOAD DATA INPATH 'hdfs://our_company/user/hive/warehouse/dbname.db/table_name' OVERWRITE INTO TABLE dbname.table_name;

Add data query with Hive-JSON-SerDe

I'm working with hive and i need to add data in json-format. I use https://github.com/rcongiu/Hive-JSON-Serde library. It loads data in hive from file.
~$ cat test.json
{"text":"foo","number":123}
{"text":"bar","number":345}
$ hadoop fs -put -f test.json /user/data/test.json
$ hive
hive> CREATE DATABASE test;
hive> CREATE EXTERNAL TABLE test ( text string )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/data';
hive> SELECT * FROM test;
OK
foo 123
bar 345
But i need load data from query, like:
insert into table test values {"text": "abc", number: 666}
Who knows how do this?

The question seems old, however, in case someone looking for an answer:
I tried another approach as following:
Create table test (text string);
LOAD data inpath 'path/test.json' INTO TABLE test;
insert into table test values ("{'text':'abc','number':666}");
The only different is when you need to load values it will be something like:
select get_json_object(str,'$.text') as text1, get_json_object(str,'$.number') as number1 from test;

A SerDe is really intended for use with external tables which read the data from files containing the data. So it will not help you directly insert json data and the insert query you give as an example will not work as such. I suggest that you should either write the data to a file on your hdfs and create an external table on the folder containing the file, or parse incoming data such that you can insert it as columns.

how to load csv file into hive

this is my csv file
id,name,address
"1xz","hari","streetno=1-23-2,street name=Lakehill,town=Washington"
"2xz","giri","streetno=5-6-3456,street name=second street,town=canada"
i was loaded this data using row format delimeter "," but it was not loading properley,i am facing the problem with address filed.in address field i have data like this format "streetno=1-23-2,street name=Lakehill,town=Washington" in this address filed values are terminated by again ",".i was found one solution in pig,help me to solve it using hive.
i am getting this output
"1xz" "hari" "streetno=1-23-2
"2xz" "giri" "streetno=5-6-3456
this is my schema
create table emps (id string,name string,addresss string ) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

Use split() function, it returns array of strings: [0]='streetno', [1]='1-23-2':
split(address,'=')[1] as address --returns '1-23-2'

You already found a working solution in Pig so why not transfer that relation to a Hive table directly using HCatalog.
STORE pig_relation INTO 'hive_table_name' USING org.apache.hive.hcatalog.pig.HCatStorer();
Make sure you start up Pig using:
>pig -useHCatalog
Table must already exist in Hive.
Hope this helps.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Export Dynamodb to S3 using Hive - json

Related

How to deal with JSON with special characters in Column Names in AWS ATHENA

How to load json snappy compressed in HIVE

Unable to load .csv data from hdfs into Hive table in Hadoop

Add data query with Hive-JSON-SerDe

how to load csv file into hive

Categories

Resources