How can I split an array of structs into columns in Spark?

I have a column containing an array of phone numbers represented as structs, and I need to split them into three columns by their "type" attribute (Phone1, Phone2, Fax).
Here are two sample values of the column.
[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}]
[{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}]
I want to split each into three columns, one for each type.
I want something like this:
Phone1        Phone2        Fax
000-000-0000  000-000-0001  000-000-0002
000-000-1000  000-000-1002  000-000-1001
This answer shows how to put each element of the array into its own column.
How to explode an array into multiple columns in Spark
This gets me halfway there, but I can't rely on the order of items in the array. If I do this I'll get something like the following, where the Phone2 and Fax values in the second row are swapped.
Phone1        Phone2        Fax
000-000-0000  000-000-0001  000-000-0002
000-000-1000  000-000-1001  000-000-1002
How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.

Here's one way: flatten the phone/fax numbers via explode, then pivot on the typeCode, as shown in the following example:
case class Contact(number: String, typeCode: String)

val df = Seq(
  (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))),
  (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2")))
).toDF("user_id", "contacts")

df.
  withColumn("contact", explode($"contacts")).
  groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")).
  show(false)
// +-------+-----------+-----------+-----------+
// |user_id|Fax |Phone1 |Phone2 |
// +-------+-----------+-----------+-----------+
// |1 |111-66-7070|111-22-3333|111-44-5555|
// |2 |222-55-6060|222-33-4444|111-77-8888|
// +-------+-----------+-----------+-----------+
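If you prefer Spark SQL, the same explode-and-pivot can be expressed with the PIVOT clause (available since Spark 2.4). A minimal sketch, assuming the DataFrame above is registered as a temp view; the view name users is illustrative:

df.createOrReplaceTempView("users")

spark.sql("""
  SELECT *
  FROM (
    -- flatten one contact struct per row
    SELECT user_id, contact.typeCode AS typeCode, contact.number AS number
    FROM users
    LATERAL VIEW explode(contacts) c AS contact
  )
  PIVOT (
    first(number) FOR typeCode IN ('Phone1', 'Phone2', 'Fax')
  )
""").show(false)

Listing the expected types explicitly in the IN clause also guarantees all three output columns even when a type is absent from the data.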

Related

Postgres column with string unicode values

I have a column query_params of type TEXT, but the values are stored as Python-style unicode string literals: each value in the column is prefixed with a u, and I'm struggling to remove it.
Is there a way to remove the u from the values and convert each dictionary into columns?
for example, the query SELECT query_params FROM api_log LIMIT 2 returns two rows
{
  u'state': u'CA',
  u'page_size': u'1000',
  u'market': u'Western',
  u'requested_at': u'2014-10-28T00:00:00+00:00'
},
{
  u'state': u'NY',
  u'page_size': u'1000',
  u'market': u'Eastern',
  u'requested_at': u'2014-10-28T00:10:00+00:00'
}
Is it possible to handle this unicode prefix in Postgres and convert the values to columns like this:
state | page_size | market  | requested_at
------+-----------+---------+---------------------------
CA    | 1000      | Western | 2014-10-28T00:00:00+00:00
NY    | 1000      | Eastern | 2014-10-28T00:10:00+00:00
Thanks for any help.
You should remove the u prefixes and replace the single quotes with double quotes to get properly formatted JSON. Then you can use the ->> operator to extract its attributes:
select
  v->>'state' as state,
  v->>'page_size' as page_size,
  v->>'market' as market,
  v->>'requested_at' as requested_at
from (
  select regexp_replace(query_params, 'u\''([^\'']*)\''', '"\1"', 'g')::json as v
  from api_log
) s;
Test the solution in SqlFiddle.
Read about POSIX regular expressions in the documentation.
Find an explanation of the regexp pattern at regex101.com.
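As a quick sanity check, here is a minimal reproduction; the table name and sample value come from the question, and the comment shows what the regexp_replace produces:

create table api_log (query_params text);

insert into api_log values
  ('{u''state'': u''CA'', u''page_size'': u''1000'', u''market'': u''Western'', u''requested_at'': u''2014-10-28T00:00:00+00:00''}');

-- After regexp_replace the stored value becomes valid JSON:
-- {"state": "CA", "page_size": "1000", "market": "Western", "requested_at": "2014-10-28T00:00:00+00:00"}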

insert and fetch strings and matrices to/from MySQL with Matlab

I need to store data in a database. I have installed and configured a MySQL database (and an SQLite database) in Matlab. However I cannot store and retrieve anything other than scalar numeric values.
% create an empty database called test_data base with MySQL workbench.
% connect to it in Matlab
conn=database('test_database','root','XXXXXX','Vendor','MySQL');
% create a table to store values
create_test_table=['CREATE TABLE test_table (testID NUMERIC PRIMARY KEY, test_string VARCHAR(255), test_vector BLOB, test_scalar NUMERIC)'];
curs=exec(conn,create_test_table)
Result is good so far (curs.Message is an empty string)
% create a new record
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',[1,2],1})
% try to read out the new record
sqlquery='SELECT * FROM test_table';
data_to_view=fetch(conn,sqlquery)
Result is bad:
data_to_view =
1 NaN NaN 1
From the documentation for "fetch" I would expect:
data_to_view =
  1×4 table
    testID    test_string    test_vector    test_scalar
    ______    ___________    ___________    ___________
      1       'string1'      1x2 double     1
Until I learn how to read blobs I'd even be willing to accept:
data_to_view =
  1×4 table
    testID    test_string    test_vector    test_scalar
    ______    ___________    ___________    ___________
      1       'string1'      NaN            1
I get the same thing with an SQLite database. How can I store and then read out strings and blobs, and why isn't the data returned in table format?
Matlab does not document that the default options for SQLite and MySQL database retrieval are to attempt to return everything as a numeric array. One only needs this line:
setdbprefs('DataReturnFormat','cellarray')
or
setdbprefs('DataReturnFormat','table')
in order to get results with differing datatypes. However, now my result is:
data_to_view =
1×4 cell array
{[2]} {'string1'} {11×1 int8} {[1]}
If instead I input:
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',typecast([1,2],'int8'),1})
Then I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {16×1 int8} {[1]}
which I can convert like so:
typecast(data_to_view{3},'double')
ans =
1 2
Unfortunately this does not work for SQLite. I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {' �? #'} {[1]}
and I can't convert the third part correctly:
typecast(unicode2native(data_to_view{1,3}),'double')
ans =
0.0001 2.0000
So I still need to learn how to read an SQLite blob in Matlab but that is a different question.
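Independently of Matlab, you can verify what actually landed in the table from the SQL side. A minimal sketch, assuming the test_table created above; with the typecast insert, the two doubles should occupy 16 bytes:

-- HEX shows the raw bytes of the stored blob, LENGTH its size in bytes
SELECT testID,
       test_string,
       HEX(test_vector)    AS vector_hex,
       LENGTH(test_vector) AS vector_len,
       test_scalar
FROM test_table;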

select int column and compare it with Json array column

This is a row in the option column of table oc_cart:
20,228,27,229
Why is no result found when the value is 228, as below:
select 1 from dual
where 228 in (select option as option from oc_cart)
but a result is found when I change the value to 20, like this:
select 1 from dual
where 20 in (select option as option from oc_cart)
The option column data type is TEXT
In SQL, these two expressions are different:
WHERE 228 in ('20,228,27,229')
WHERE 228 in ('20','228','27','229')
The first example compares the integer 228 to a single string value, whose leading numeric characters can be converted to the integer 20. That's what happens. 228 is compared to 20, and fails.
The second example compares the integer 228 to a list of four values, each of which can be converted to a different integer, and 228 matches the second one.
Your subquery is returning a single string, not a list of values. If your oc_cart.option holds a single string, you can't use the IN( ) predicate in the way you're doing.
A workaround is this:
WHERE FIND_IN_SET(228, (SELECT option FROM oc_cart WHERE...))
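If the goal is to filter oc_cart rows directly rather than compare against a subquery, FIND_IN_SET can also be applied per row. A minimal sketch (the backticks are needed because OPTION is a reserved word in MySQL):

SELECT *
FROM oc_cart
WHERE FIND_IN_SET(228, `option`);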
But this is awkward. You really should not be storing strings of comma-separated numbers if you want to search for an individual number in the string. See my answer to Is storing a delimited list in a database column really that bad?
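For illustration, a sketch of that normalized design; the table and column names here are made up:

-- one option per row instead of a comma-separated string
CREATE TABLE oc_cart_option (
  cart_id   INT NOT NULL,
  option_id INT NOT NULL,
  PRIMARY KEY (cart_id, option_id)
);

-- now an ordinary, indexable comparison works
SELECT 1 FROM oc_cart_option WHERE option_id = 228;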

How to search multiple items in JSON array in Postgres 9.3

I have a scenario where I need to search for multiple values in a JSON array. Below is my schema.
ID DATA
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
3 {"bookIds" : [11,12,10,9], "storeIds": [4,3]}
I want all the rows that contain the values 1 and 2. Below is the query I am using (this query was written by fellow Stack Overflow user klin, credit to him).
select t.*
from JSONTest t, json_array_elements(data->'bookIds') books
where books::text::int in (1, 2);
However, I am getting duplicate rows in the output, as shown below.
id data
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
I want only two rows in the output, for ids 1 and 2. How can I do that? I don't want to use DISTINCT due to other constraints.
SQL Fiddle : http://sqlfiddle.com/#!15/6457a/2
Unfortunately there is no direct conversion function from a JSON array to a "real" Postgres array. (data->'bookIds')::text returns something that is nearly a Postgres array literal, e.g. [1,2,3,5]. If you replace the [] with {}, the value can be cast to an integer array. Once we have a proper integer array, we can use the @> operator to test whether it contains another array:
select *
from jsontest
where translate((data ->'bookIds')::text, '[]', '{}')::int[] @> array[1,2];
translate((data ->'bookIds')::text, '[]', '{}') converts [1,2,3,5] to {1,2,3,5}, which is then cast to an integer array by ::int[].
SQLFiddle: http://sqlfiddle.com/#!15/6457a/4
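Note that @> requires the array to contain all of the listed values. If you ever want match-any semantics instead (rows whose bookIds contain 1 or 2), an EXISTS test in the style of the original query also avoids the duplicate rows without DISTINCT:

select t.*
from JSONTest t
where exists (
  select 1
  from json_array_elements(t.data -> 'bookIds') b
  where b::text::int in (1, 2)
);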

MYSQL - Find rows, where part of search string matches part of value in column

I wasn't able to find this anywhere, here's my problem:
I have a string like '1 2 3 4 5' and a MySQL table with a column, let's call it numbers, that looks like this:
numbers
1 2 6 8 9 14
3
1 5 3 6 9
7 8 9 23 44
10
I am trying to find the easiest way (hopefully in a single query) to find the rows where any of the numbers in my search string (1, 2, 3, 4 or 5) is contained in the numbers column. In the given example I am looking for rows 1, 2 and 3 (since they share numbers with my search string).
I am trying to do this with a single query and no loops.
Thanks!
The best solution would be to get rid of the column containing a list of values, and use a schema where each value is in its own row. Then you can use WHERE number IN (1, 2, 3, 4, 5) and join this with the table containing the rest of the data.
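For illustration, a minimal sketch of that design; the table and column names are made up:

-- one number per row instead of a space-separated string
CREATE TABLE row_numbers (
  row_id INT NOT NULL,
  number INT NOT NULL
);

-- rows sharing at least one number with the search list
SELECT DISTINCT row_id
FROM row_numbers
WHERE number IN (1, 2, 3, 4, 5);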
But if you can't change the schema, you can use a regular expression.
SELECT *
FROM yourTable
WHERE numbers REGEXP '[[:<:]](1|2|3|4|5)[[:>:]]'
[[:<:]] and [[:>:]] match the beginning and the end of a word, respectively.
Note that this type of search will be very slow if the table is large, because it's not feasible to index it.
Here is a starting point (a split-string function): http://blog.fedecarg.com/2009/02/22/mysql-split-string-function/ provides SplitString(string, delimiter, position).
Create a function that converts a string to an array: stringSplitted(string, delimiter).
Create a function that compares two arrays: arrayIntersect(array1, array2).
SELECT numbers
FROM table
WHERE arrayIntersect(#argument, numbers)
That is two function definitions with loops, and one single query without any loop.
SELECT * FROM MyTable WHERE (numbers LIKE '%1%' OR numbers LIKE '%2%')
(Note that a bare LIKE '%1%' also matches 10, 14 and so on, so this is only a rough filter.)
Or you can also use REGEXP, something like this:
SELECT * FROM events WHERE id REGEXP '5587$'