How to cast and write back into a nested column - apache-drill

I have a nested JSON data file. I need to write it to Parquet with specific numeric data types, so I need to control whether certain fields are Integer, Long, etc. I can cast the columns, but I cannot write them back to their nested location. The data must stay in the nested structure.
Here is what I tried:
CREATE TABLE cdg.`test2.parquet` AS SELECT CAST(t.l1.l2.id AS INTEGER) l1.l2.id FROM cdg.`data.json` t;
The error I get is Error: PARSE ERROR: Encountered ".", pointing at the path after the closing bracket of the CAST expression:
AS INTEGER) l1.l2.id FROM c
^
Analysis #1: If I do not include that nested field name, the expression result is written out fine:
+---------+
| EXPR$0  |
+---------+
| 22222   |
| 22222   |
| 22222   |
| 22222   |
+---------+
Any insights would be greatly appreciated.

This error is complaining that you cannot use a nested field name such as l1.l2.id as the alias for the cast expression. An alias must be a top-level field name.
An alias can be given to a column with or without the AS keyword (the parser implicitly assumes AS).
So, for example, if you want to name the result of the cast expression Id, the following SQL statement can be used:
CREATE TABLE cdg.`test2.parquet` AS SELECT CAST(t.l1.l2.id AS INTEGER) AS Id FROM cdg.`data.json` t;
Hope this helps.

Related

Unexpected result in WHERE clause on AI ID field

I have a table named users in my MySQL database, and I have been using this DB with a Ruby on Rails application (through its ORM) for years. The table has an id field configured as AI (auto-increment), BIGINT.
Example of my users table;
+----+---------+
| id | name |
+----+---------+
| 1 | John |
| 2 | Tommy |
| 3 | ... |
| 4 | ... |
| 5 | ... |
| 6 | ... |
+----+---------+
The problem I am facing is when I execute the following query I get unexpected rows.
SELECT * FROM users WHERE id = '1AW3F4SEFR';
This query is returning the exact same value with the following query,
SELECT * FROM users WHERE id = 1;
I do not know why SQL lets me use a string in a WHERE clause against an INT column. And as we can see from the example, MySQL converts the string to an integer by reading it from position 0. I mean, I search for 1AW3F4SEFR and expect no result, but the statement returns the row for id = 1.
In Oracle SQL, the behavior of this exact same query is completely different. So, I believe there is something different on MySQL. But I am not sure about what causes this.
As has been explained in the question comments, MySQL has a weird way of converting strings to numbers: it simply takes as much of the string from the left as is numeric and ignores the rest. If the string doesn't start with a digit, the conversion defaults to 0.
Examples: '123' => 123, '12.3' => 12.3, '.123' => 0.123, '12A3' => 12, 'A123' => 0, '.1A1.' => 0.1
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=55cd18865fad4738d03bf28082217ca8
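This left-prefix coercion can be sketched in Python. This is a simplified model of MySQL's behavior (real MySQL also handles leading whitespace, exponents, and emits a warning), so treat it as an approximation, not a spec:

```python
import re

def mysql_numeric_prefix(s):
    # Mimic MySQL's implicit string-to-number cast: take the longest numeric
    # prefix (optional sign, digits, optional decimal part) and ignore the
    # rest; if the string does not start numerically, default to 0.
    m = re.match(r'[+-]?(\d+(\.\d*)?|\.\d+)', s)
    return float(m.group(0)) if m else 0.0

mysql_numeric_prefix('12A3')        # -> 12.0
mysql_numeric_prefix('A123')        # -> 0.0
mysql_numeric_prefix('1AW3F4SEFR')  # -> 1.0, which is why id = '1AW3F4SEFR' matches id = 1
```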
That MySQL doesn't raise an error here, as other DBMSs do, can easily lead to undesired query results that go undetected for a long time.
The solution is easy though: Don't let this happen. Don't compare a numeric column with a string. If the ID '1AW3F4SEFR' is entered in some app, raise an error in the app or even prevent this value from being entered. When running the SQL query, make sure to pass a numeric value, so '1AW3F4SEFR' cannot even make it into the DBMS. (Look up how to use prepared statements and pass parameters of different types to the database system in your programming language.)
If for some reason you want to pass a string for the ID instead (I cannot think of any such reason though) and want to make your query fail-safe by not returning any row in case of an ID like '1AW3F4SEFR', check whether the ID string represents an integer value in the query. You can use REGEXP for this.
SELECT * FROM users WHERE id = #id AND #id REGEXP '^[0-9]+$';
Thus you only consider integer ID strings and still enable the DBMS to use an index when looking up the ID.
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=56f8ee902342752933c20b8762f14dbb
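If you would rather reject such IDs in application code before they reach the database, the same guard can be sketched in Python (the function name is mine):

```python
import re

def is_integer_id(id_str):
    # Mirrors the SQL guard: #id REGEXP '^[0-9]+$'
    # Only all-digit strings are allowed through to the query.
    return re.fullmatch(r'[0-9]+', id_str) is not None

is_integer_id('1')           # -> True
is_integer_id('1AW3F4SEFR')  # -> False, so the query is never run with it
```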

Create hive external table with complex data type and load from CSV or TSV having few columns with serialized JSON object

I have a CSV (or TSV) file with one column ('nw_day' in the example below) holding a serialized array and another column ('res_m' in the example below) holding a serialized JSON object. It also has columns with STRING, TIMESTAMP, and FLOAT data types.
The TSV looks somewhat like this (showing the first row):
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | {'Fri','Mon','Sat','Sun','Thurs','Tue','Wed'} | {"some_str":"str_one","some_n":1,"some_t":2019-04-24 06:00:03.700+0000}|
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
I have tried the following statement, but it does not give me the results I expect.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  com_id       STRING,
  w_start_time TIMESTAMP,
  cap          FLOAT,
  nw_day       ARRAY<STRING>,
  res_m        STRUCT<
                 some_str: STRING,
                 some_n:   BIGINT,
                 some_t:   TIMESTAMP
               >)
COMMENT 's_e_s'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/location/to/folder/containing/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
I was expecting those serialized objects to be deserialized into Hive complex data types (ARRAY and STRUCT), but that is not exactly what I get when I run
select * from table_name limit 1;
which gives me
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | ["{'Fri'"," 'Mon'"," 'Sat'"," 'Sun'"," 'Thurs'"," 'Tue'"," 'Wed'}"] | {"some_str":"{\"some_str\":\"str_one\",\"some_n\":1,\"some_t\":2019-04-24 06:00:03.700+0000}\","some_n":null,"some_t":null}|
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
So it considers the whole object a string and splits that string by the delimiter.
I need some help understanding how to load data from CSV/TSV to complex data types in Hive.
I found a similar question, but the requirement there is a little different and no complex data types are involved.
Any help would be much appreciated. If this cannot be done and a preprocessing step has to be included prior to loading, some example of input data to complex datatype loads in hive would help me. Thanks in advance!
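In case preprocessing does turn out to be necessary, here is one possible sketch in Python. The column order and the one-level ',' collection delimiter are taken from the DDL above; the regex handling of the unquoted timestamp inside res_m is an assumption about the data (it is not valid JSON, so json.loads cannot parse it). The idea is to rewrite nw_day and res_m into Hive's delimited text form, where array items and struct field values are both separated by the collection delimiter:

```python
import re

def clean_array(raw):
    # {'Fri','Mon',...} -> Fri,Mon,...
    # i.e. an ARRAY<STRING> cell with ',' as the collection item delimiter.
    return ",".join(re.findall(r"'([^']*)'", raw))

def clean_struct(raw, keys):
    # Pull each declared struct field's value out of the not-quite-JSON text
    # and emit the values in declared field order, separated by ','.
    # The value pattern tolerates the unquoted timestamp (which contains a space).
    vals = []
    for k in keys:
        m = re.search(r'"%s"\s*:\s*"?([^,"}]+(?: [^,"}]+)*)' % re.escape(k), raw)
        vals.append(m.group(1) if m else "")
    return ",".join(vals)
```

You would then stream the TSV through csv.reader/csv.writer with delimiter='\t', applying clean_array to the nw_day column and clean_struct(..., ["some_str", "some_n", "some_t"]) to the res_m column of every row before loading into the external table.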

Count number of objects in JSON array stored as Hive string column

I have a Hive table with a JSON string stored as a string in a column.
Something like this.
Id | Column1 (String)
1 | [{k1:v1,k2:v2},{k3:v3,k4:v4}]
2 | [{k1:v1,k2:v2}]
I want to count the number of JSON objects in the column.
Id | Count
1 | 2
2 | 1
What would be the query to achieve this?
If the JSON objects are simple structs like these, without nested structs, then you can split by '}' and use size()-1:
size(split(column,'[}]'))-1
It works correctly with empty strings; NULLs require special handling if you need them converted to 0:
case when column is null then 0 else size(split(column,'[}]'))-1 end
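The same counting logic, sketched in Python for clarity (it mirrors the Hive expression, including the NULL-to-0 handling):

```python
def count_objects(column):
    # Mirrors: case when column is null then 0
    #          else size(split(column, '[}]')) - 1 end
    # Each top-level object contributes exactly one '}', so splitting on '}'
    # yields (object count + 1) pieces.
    if column is None:
        return 0
    return len(column.split('}')) - 1

count_objects("[{k1:v1,k2:v2},{k3:v3,k4:v4}]")  # -> 2
count_objects("[{k1:v1,k2:v2}]")                 # -> 1
```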

Using MySQL JSON field to join on a table with custom fields

So I made this system to store custom objects with custom fields for an app that I'm developing. First I have object_def where I save the object definitions:
id | name | fields
------------------------------------------------------------
101 | Group 1 | [{"name": "Title", "id": "AbCdE123"}, ...]
102 | Group 2 | [{"name": "Name", "id": "FgHiJ456"}, ...]
So we have ID (INT), name (VARCHAR) and fields (LONGTEXT). In fields are the object fields like this: {id: string, type: string, name: string}[].
Now In the object table, I have this:
id | object_def_id | object_values
------------------------------------------------------------
235 | 101 | {"AbCdE123": "The Object", ... }
236 | 102 | {"FgHiJ456": "John Perez", ... }
Where object_values is a LONGTEXT also. With that system, I'm able to show the objects on a table in my app using JSON.parse().
Now I've learned that there is a JSON type in MySQL and I want to use it to do queries and such (I'm really new to this).
I've changed the LONGTEXT columns to JSON and now I want a SELECT that shows the results like this:
#Select objects in group 1:
id | group | Title | ... | other_custom_field
-------------------------------------------------------
235 | Group 1 | The Object | ... | other_custom_value
#Select objects in group 2:
id | group | Name | ... | other_custom_field
-------------------------------------------------------
236 | Group 2 | John Perez | ... | other_custom_value
The id, then the group name (I can get that with an INNER JOIN), and then all the custom fields with their respective values.
Is this possible? How can I achieve this (hopefully without changing my database structure)? I'm learning MySQL, SQL and databases as I go so I really appreciate your help. Thanks!
Problems I see with your design:
Incorrect JSON format.
[{name: 'Title', id: 'AbCdE123'}, ...]
Should be:
[{"name": "Title", "id": "AbCdE123"}, ...]
You should use the JSON data type instead of LONGTEXT, because JSON will at least reject invalid JSON syntax.
Setting column headings based on data. You can't do this in SQL. Columns and headings must be fixed at the time you prepare the query. You can't do an SQL query that changes its own column headings.
Your object def has an array of attributes, but there's no way in MySQL 5.7 to loop over the "rows" of a JSON array. You'll need the JSON_TABLE() function from MySQL 8.0.
That will get you closer to being able to look up object values, but then you'll still have to pivot the data into the result set you describe, with one attribute in each column, as if the data had been stored in a traditional way. But SQL doesn't allow you to do dynamic pivoting in a single query. You can't make an SQL query that dynamically grows its own select-list based on the data it finds.
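Since SQL can't grow its own select-list, the pivot has to happen client-side. A sketch in Python of the kind of post-processing the app presumably already does with JSON.parse() (the function name and the shape of the row dict are mine):

```python
import json

def pivot_object(group_name, fields_json, values_json):
    # fields_json: the object_def.fields column, e.g.
    #   [{"name": "Title", "id": "AbCdE123"}, ...]
    # values_json: the object.object_values column, e.g.
    #   {"AbCdE123": "The Object", ...}
    fields = json.loads(fields_json)
    values = json.loads(values_json)
    row = {"group": group_name}
    for f in fields:
        # One output column per defined field, keyed by its display name.
        row[f["name"]] = values.get(f["id"])
    return row

pivot_object("Group 1",
             '[{"name": "Title", "id": "AbCdE123"}]',
             '{"AbCdE123": "The Object"}')
# -> {"group": "Group 1", "Title": "The Object"}
```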
This all makes me wonder...
Why don't you just store the data in the traditional way?
Create a table per object type. Add one column to that table per attribute. That way you get column names. You get column types. You get column constraints — for example, how would you simulate NOT NULL or UNIQUE in your current system?
If you don't want to use SQL, then don't. There are alternatives, like document databases or key/value databases. But don't torture poor SQL by using it to implement an Inner-Platform.

Handling #Num! in data field

In an Access database I have a linked table that references an Excel file.
In the Excel file I have 2 columns:
Col1 | Col2
---------------
date1 | =if(Col1="","",Col1+1) -> Evaluates to date1+1
<blank> | =if(Col1="","",Col1+1) -> Evaluates to ""
In Access I see it as
Col1 | Col2
---------------
date1 | date1+1
<null> | #Num!
I can't find a way to deal with the problem. The idea is to end up having <null> instead of the error value. Can I capture this error in Access? I have tried looking for error capturing function but I found nothing. I can think of workaround like returning 0 instead of "" and then filtering it out in Access but it doesn't seem like a proper way of doing it.
I could also use the first column to filter the second but again it doesn't seem proper, because in some other cases I could have just 1 column.
IIf evaluates both expressions, and you shouldn't mix dates and strings, so try this:
=IIf(IsNull(Col1),Null,DateAdd("d",1,Nz(Col1,Date())))
or:
=CVDate(Col1)+1