JSON parsing issue in Hive

I am getting some issues while querying JSON data.
My sample data looks like this:
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"GGG":"123.dfer","FFF":"3.0","DDD":"Florida","GOP":"fg45","cdc":"QQQ","ZZZ":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"FRG":"123.dfer","CXD":"3.0","FAX":"Florida","SXD":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
I have done the following:
create table src (myjson string);
insert into src values
('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
;
The issue is when I run:
select get_json_object(myjson,'$.Rtype.MOD.Version[0].ABC.fashion') from src where get_json_object(myjson,'$.Rtype.MOD.Version[0].ABC') is not null;
I am getting NULLs for some fields. The count for this query is, say, 2345; without the WHERE condition the count is also 2345. This is the issue.
The observation I have made is that this is because it is also trying to fetch data that sits under $.Rtype.MOD.Version[0].GAP (or another key) instead of ABC.

hive> load data local inpath '/home/satish/s.json' into table sjson;
Loading data to table hivelearning.sjson
Table hivelearning.sjson stats: [numFiles=1, totalSize=216]
hive> select * from sjson;
{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}
Time taken: 1.297 seconds, Fetched: 1 row(s)
hive> select get_json_object(data,'$.Rtype.MOD.Version[0].ABC.fashion') from sjson;
OK
fg45
Time taken: 0.084 seconds, Fetched: 1 row(s)
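Since the first element of Version can be keyed by ABC, GAP, or BOX, one way to avoid the NULLs is to probe each possible path and take the first non-NULL value. A minimal sketch, assuming those three are the only keys that occur in the data:
select coalesce(
    get_json_object(myjson, '$.Rtype.MOD.Version[0].ABC.fashion'),  -- present only on ABC rows
    get_json_object(myjson, '$.Rtype.MOD.Version[0].GAP.fashion'),  -- fallback for GAP rows
    get_json_object(myjson, '$.Rtype.MOD.Version[0].BOX.fashion')   -- fallback for BOX rows
) as fashion
from src;
Rows whose first Version element uses some other key would still return NULL, so keep the WHERE filter on $.Rtype.MOD.Version[0].ABC if only the ABC records are wanted.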

Related

String conversion in Hive

I have a JSON string in which there is a field called version. Version can either be absent or, if present, of the form x.y. I want to convert this to x.0. I am currently doing:
CONCAT(split(get_json_object(json, '$.version'),'[.]')[0],".","0")
but this does not handle cases where version is not there. I want "bad_version" to be returned if version is not there. Can I somehow use COALESCE and do some tweaks?
Yes, you can use either COALESCE or CASE; the syntax is the same as in other SQL databases.
select coalesce(myField, 'bad_version') ....
or
select case when myField is null then 'bad_version' else myField end as x ....
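Combining that with the original expression: in Hive, concat returns NULL if any of its arguments is NULL, so the whole expression comes out NULL when version is absent and COALESCE can supply the fallback. A minimal sketch, assuming a table json_table with a string column json as in the answer below:
select coalesce(
    concat(split(get_json_object(json, '$.version'), '[.]')[0], '.', '0'),  -- NULL when version is absent
    'bad_version'
) as version
from json_table;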
You can conditionally test the result of get_json_object to see if it's NULL and return bad_version accordingly. When the version is valid, you can use a regular expression to replace the minor version with 0.
SELECT
IF(get_json_object(json, "$.version") IS NULL,
"bad_version",
regexp_replace(get_json_object(json, "$.version"), "\\..+$", ".0")
)
FROM json_table; -- The table I loaded with test data
Some simple example data:
hive> SELECT json FROM json_table;
OK
{"id":"001","version":"3.9"}
{"id":"002","notversion":"3.9"}
Time taken: 0.225 seconds, Fetched: 2 row(s)
And then the results of this query against this data:
hive> SELECT
> IF(get_json_object(json, "$.version") IS NULL,
> "bad_version",
> regexp_replace(get_json_object(json, "$.version") , "\\..+$", ".0")
> )
> FROM json_table;
OK
3.0
bad_version
Time taken: 0.225 seconds, Fetched: 2 row(s)

MySQL REPLACE resulted in 0 values on all rows

I've used the following SQL query in a MySQL database to replace part of a string in a cell:
UPDATE TEST.database2014 SET together1 = REPLACE(together1, "/1900", "20")
For some reason all the rows (225,000!) now have a value of 0.
This is the message which I got:
/* Affected rows:225,000 Found rows: 0 Warnings: 0 Duration for 1 query: 16,888 sec. */
ADDITIONAL INFORMATION:
data example contained in field together1:
TESTING^^^19/01/2014^^
Is there a known reason for this happening?
I find it strange that, if no matches were found, it converted all values to 0 anyway.
I think that you must use this:
UPDATE TEST.database2014 SET together1 = REPLACE(together1, "/19", "/20") WHERE together1 LIKE '%/19%'
if you want to update all 19xx years to 20xx.
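Because UPDATE ... REPLACE rewrites rows in place, it can be worth previewing the substitution with a SELECT before touching anything. A minimal sketch, reusing the table and column from the question:
SELECT together1,
       REPLACE(together1, '/19', '/20') AS replaced  -- preview only; no rows are modified
FROM TEST.database2014
LIMIT 10;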

MySQL VARCHAR Type won't CONVERT to Integer

I have a column of data of type VARCHAR that I want to CONVERT or CAST to an integer (my end goal is for all of my data points to be integers). However, all the queries I attempt return values of 0.
My data looks like this:
1
2
3
4
5
If I run either of the following queries:
SELECT CONVERT(data, BINARY) FROM table
SELECT CONVERT(data, CHAR) FROM table
My result is:
1
2
3
4
5
No surprises there. However, if I run either of these queries:
SELECT CONVERT(data, UNSIGNED) FROM table
SELECT CONVERT(data, SIGNED) FROM table
My result is:
0
0
0
0
0
I've searched SO and Google all over for an answer to this problem, with no luck, so I thought I would try the pros here.
EDIT/UPDATE
I ran some additional queries on the suggestions from the comments, and here are the results:
data    LENGTH(data)    LENGTH(TRIM(data))    ASCII(data)
1       3               3                     0
2       3               3                     0
3       3               3                     0
4       3               3                     0
5       3               3                     0
It appears that I have an issue with the data itself. For anyone coming across this post: my solution at this point is to TRIM the excess from the data points and then CONVERT to UNSIGNED. Thanks for all of the help!
FURTHER EDIT/UPDATE
After a little research, turns out there were hidden NULL bytes in my data. The answer to this question helped out: How can I remove padded NULL bytes using SELECT in MySQL
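A minimal sketch of that fix, assuming the padding really is NUL (0x00) bytes, which matches ASCII(data) returning 0 above (names follow the placeholders used in the question):
SELECT CAST(REPLACE(data, CHAR(0), '') AS UNSIGNED) AS data_int  -- strip NUL bytes, then convert
FROM table;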
What does SELECT data, LENGTH(data), LENGTH(TRIM(data)), ASCII(data) FROM table return? It's possible your numeric strings aren't just numeric strings.
Alternately, are you using multi-byte character encoding?
I believe the query you have is fine, as it worked for me: sqlfiddle.com/#!2/a15ec4/1/3.
Makes me think you have a data problem. Are you sure there's not a return or space in the data somewhere?
You can check the data by running ASCII or LENGTH on it to see whether there is more than you expect:
select ascii(data) from foo where ascii(data) not between 48 and 57;
or, for length:
select length(data) as mlen from foo having mlen > 1;
I believe this is the correct form:
SELECT CAST(data AS UNSIGNED) FROM test;
SELECT CAST(data AS SIGNED) FROM test;
Tested here: http://sqlfiddle.com/#!8/8c481/1
Try this syntax:
SELECT CONVERT(data, UNSIGNED INTEGER) FROM table
or
SELECT CAST(data AS UNSIGNED) FROM table

BULK INSERT large CSV file and attach an additional column

I was able to use BULK INSERT on a SQL Server 2008 R2 database to import a CSV file (tab-delimited) with more than 2 million rows. This command is planned to run every week.
I added an additional column named "lastupdateddate" to the generated table to store the datestamp at which a row is updated, via an INSERT trigger. But when I ran the BULK INSERT again, it failed due to a mismatch in columns, as there is no such field in the raw CSV file.
Is there any possibility to configure BULK INSERT to ignore the "lastupdateddate" column when it runs?
Thanks.
-- EDIT:
I tried using a format file but was still unable to solve the problem.
The table looks as below.
USE AdventureWorks2008R2;
GO
CREATE TABLE AAA_Test_Table
(
Col1 smallint,
Col2 nvarchar(50) ,
Col3 nvarchar(50) ,
LastUpdatedDate datetime
);
GO
The csv "data.txt" file is:
1,DataField2,DataField3
2,DataField2,DataField3
3,DataField2,DataField3
The format file is like:
10.0
3
1 SQLCHAR 0 7 "," 1 Col1 ""
2 SQLCHAR 0 100 "," 2 Col2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "," 3 Col3 SQL_Latin1_General_CP1_CI_AS
The SQL command I ran is:
DELETE AAA_Test_Table
BULK INSERT AAA_Test_Table
FROM 'C:\Windows\Temp\TestFormatFile\data.txt'
WITH (formatfile='C:\Windows\Temp\TestFormatFile\formatfile.fmt');
GO
The error received is:
Msg 4864, Level 16, State 1, Line 2
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 2, column 1 (Col1).
Msg 4832, Level 16, State 1, Line 2
Bulk load: An unexpected end of file was encountered in the data file.
Msg 7399, Level 16, State 1, Line 2
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 2
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
Yes, you can, using a format file as documented here, and use that format file with the bcp command via the -f option, like -f format_file_name.fmt.
Another option would be to import all the data (meaning all fields) and then drop the unwanted lastupdateddate column using SQL like:
ALTER TABLE your_bulk_insert_table DROP COLUMN lastupdateddate
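As a side note, the errors above are consistent with the format file's last field still ending in "," rather than the row terminator, so the parser never finds the end of row 1 and then tries to load "DataField2" into the smallint Col1 of row 2. A sketch of the format file with the final terminator set to the line ending (assuming Windows-style \r\n line endings in data.txt):
10.0
3
1 SQLCHAR 0 7 "," 1 Col1 ""
2 SQLCHAR 0 100 "," 2 Col2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "\r\n" 3 Col3 SQL_Latin1_General_CP1_CI_AS
Because only three fields are described and mapped to Col1 through Col3, the unmapped LastUpdatedDate column is left to be filled by its default or by the trigger.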

BulkInsert into table with Identity column (T-SQL)

1) Can I do a BulkInsert from a CSV file, into a table, such that the table has an identity column that is not in the CSV, and gets automatically assigned?
2) Is there any rule that says the table that I'm BulkInsert'ing into, has to have the same columns in the same order as the flat file being read?
This is what I'm trying to do. Too many fields to include everything...
BULK INSERT ServicerStageACS
FROM 'C:\POC\DataFiles\ACSDemo1.csv'
WITH (FORMATFILE = 'C:\POC\DataFiles\ACSDemo1.Fmt');
GO
SELECT * FROM ServicerStageACS;
Error:
Msg 4864, Level 16, State 1, Line 3
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 1 (rowID).
I'm pretty sure the error is because I have an identity.
FMT starts like this:
9.0
4
1 SQLCHAR 0 7 "," 1 Month SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "," 2 Client SQL_Latin1_General_CP1_CI_AS
A co-worker recommended that it was easier to do the bulk insert into a view; the view does not contain the identity field, or any other field not to be loaded.
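A minimal sketch of what such a view might look like, assuming Month and Client are the first flat-file columns (as in the .fmt excerpt above) and that the view name matches the one used in the query below:
CREATE VIEW VW_BulkInsert_ServicerStageACS AS
SELECT Month, Client  -- ...plus the remaining flat-file columns, in file order; rowID (the identity) is omitted
FROM ServicerStageACS;
GO
Since the view is a plain single-table projection it is updatable, so BULK INSERT can target it, and SQL Server assigns rowID automatically for each inserted row.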
truncate table ServicerStageACS
go
BULK INSERT VW_BulkInsert_ServicerStageACS
FROM 'C:\POC\DataFiles\ACSDemo1.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
SELECT * FROM ServicerStageACS;