Unpack string column with multiple dictionaries in PySpark

In Databricks, using PySpark, I am working on a DataFrame that has a column in which each row is a list containing multiple dictionaries.
I would like to unpack/flatten this column so that there is a separate column for each of the dictionary values.
However, the issue is that the data type of this column is a string.
How can I unpack the column?
For reference, here is an example of a value:
[{"long_name":"Sofia","short_name":"Sofia","types":["locality","political"]},{"long_name":"Sofia City Province","short_name":"Sofia City Province","types":["administrative_area_level_1","political"]},{"long_name":"Bulgaria","short_name":"BG","types":["country","political"]}]

Your string column can be converted to an array of structs using from_json and providing the schema. Then you can use inline to explode it into columns.
df = spark.createDataFrame(
    [('[{"long_name":"Sofia","short_name":"Sofia","types":["locality","political"]},{"long_name":"Sofia City Province","short_name":"Sofia City Province","types":["administrative_area_level_1","political"]},{"long_name":"Bulgaria","short_name":"BG","types":["country","political"]}]',)],
    ['address_components'])

df = df.selectExpr(
    "inline(from_json(address_components, 'array<struct<long_name:string,short_name:string,types:array<string>>>'))"
)
df.show(truncate=0)
# +-------------------+-------------------+----------------------------------------+
# |long_name |short_name |types |
# +-------------------+-------------------+----------------------------------------+
# |Sofia |Sofia |[locality, political] |
# |Sofia City Province|Sofia City Province|[administrative_area_level_1, political]|
# |Bulgaria |BG |[country, political] |
# +-------------------+-------------------+----------------------------------------+
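
If you'd rather not hand-write the schema string, Spark can infer it from a sample value with schema_of_json. A minimal sketch, assuming at least one well-formed row in the original string column (run it in place of the hard-coded schema above, before df is overwritten):

from pyspark.sql import functions as F

# take one sample JSON string and let Spark infer its DDL schema
sample = df.select('address_components').first()[0]
schema = df.select(F.schema_of_json(F.lit(sample))).first()[0]
# reuse the inferred schema in from_json + inline
df.selectExpr(f"inline(from_json(address_components, '{schema}'))").show(truncate=0)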

Related

Is it possible to flip key/value pairs only for certain strings in dictionary and parse dictionary into columns/rows within db table?

I have a SQL table that has dictionaries in it. Some of the key:value pairs in the dictionary are mapped backwards. I would like to flip the key:value pairs only if the value contains test. If the data can be flipped, I would like to parse the dictionary into columns and rows, with values under their respective columns.
The table and data looks like this currently:
dict_col
0 {'aaa':'bbb'}
1 {'123':'test a'}
2 {'345':'test b','ccc':'dd','789':'test c'}
I would like the dictionary objects to look like this with flipping key:value pairs only where value contains test:
dict_col
0 {'aaa':'bbb'}
1 {'test a':'123'}
2 {'test b':'345','ccc':'dd','test c':'789'}
The final table should look like this (spaces in keys shown as underscores for ease of viewing):
  dict_col                                      aaa  ccc  test_a  test_b  test_c
0 {'aaa':'bbb'}                                 bbb
1 {'test a':'123'}                                         123
2 {'test b':'345','ccc':'dd','test c':'789'}         dd            345     789
This may not help, but I can do it in Python; I am just not sure how to do the JSON/dict handling from SQL:
df['dict_col3'] = df['dict_col'].apply(
    lambda x: {k: v
               for d in [{v: k} if 'test' in v else {k: v} for k, v in x.items()]
               for k, v in d.items()})
I have tried something like this, but I don't know how to correct the objects where test is in the value position:
SELECT
    JSON_VALUE(dict_col, '$.345') AS PostCode
FROM TABLE
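
For what it's worth, here is a minimal pandas sketch of the flip-then-expand idea, run on the Python side after reading the table out of SQL. It is not the pure-SQL answer being asked for, but it makes the target transformation concrete (the literal strings below are illustrative, and ast.literal_eval assumes the dictionaries arrive as Python-style strings):

import ast
import pandas as pd

df = pd.DataFrame({'dict_col': [
    "{'aaa':'bbb'}",
    "{'123':'test a'}",
    "{'345':'test b','ccc':'dd','789':'test c'}",
]})

def flip_if_test(d):
    # swap key and value whenever the value contains 'test'
    return {(v if 'test' in v else k): (k if 'test' in v else v)
            for k, v in d.items()}

parsed = df['dict_col'].apply(ast.literal_eval).apply(flip_if_test)
# expand each dictionary into its own columns and join back on the index
result = df.join(pd.json_normalize(parsed.tolist()))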

Extracting a list of dicts for a Pandas column

I have a list of dictionaries within a pandas column to designate landing pages for a particular keyword.
keyword | 07-31-2019 | landing_pages |
cloud api | 50 | [{'url' : 'www.example.com', 'date' : '07-31-2019'}, {'url' ... ]|
database | 14 | [{'url' : 'www.example.com/2', 'date' : '08-30-2019'} ... ]|
*There are actually many date columns, but I've only shown one as an example.
My issue is that I already have columns for each date, so I want to extract the landing pages as a list and have that as a new column.
keyword | 07-31-2019 | landing_pages
cloud api | 50 | www.example.com, www.example.com/other
database | 14 | www.example.com/2, www.example.com/3
So far, I've tried using json_normalize, which gave me a new table of dates and landing pages. I've tried getting the values with a list comprehension, but that gave me the wrong result as well. One way I can think of is to use loops to solve the problem, but I'm concerned that's not efficient. How can I do this efficiently?
Use a generator expression with join to extract the url values (if the data are dictionaries):
df['landing_pages'] = df['landing_pages'].apply(lambda x: ', '.join(y['url'] for y in x))
print (df)
keyword 07-31-2019 landing_pages
0 cloud api 50 www.example.com
1 database 14 www.example.com/2
If that does not work because the values are string reprs of dictionaries, parse them first with ast.literal_eval:
import ast

df['landing_pages'] = df['landing_pages'].apply(
    lambda x: ', '.join(y['url'] for y in ast.literal_eval(x)))
EDIT: If you want the url for the most recent date, build a helper DataFrame that tags each dict with its source index, convert the date strings to datetimes, use DataFrameGroupBy.idxmax to locate the row with the maximum datetime per index, select those rows with DataFrame.loc, and assign the url column back to the original DataFrame:
L = [dict(x, **{'i':k}) for k, v in df['landing_pages'].items() for x in v]
df1 = pd.DataFrame(L)
df1['date'] = pd.to_datetime(df1['date'])
df['url by max date'] = df1.loc[df1.groupby('i')['date'].idxmax()].set_index('i')['url']
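
A quick end-to-end check with illustrative data (the second url/date in each list is invented for the demo):

import pandas as pd

df = pd.DataFrame({
    'keyword': ['cloud api', 'database'],
    '07-31-2019': [50, 14],
    'landing_pages': [
        [{'url': 'www.example.com', 'date': '07-31-2019'},
         {'url': 'www.example.com/other', 'date': '08-15-2019'}],
        [{'url': 'www.example.com/2', 'date': '08-30-2019'},
         {'url': 'www.example.com/3', 'date': '07-01-2019'}],
    ],
})

L = [dict(x, **{'i': k}) for k, v in df['landing_pages'].items() for x in v]
df1 = pd.DataFrame(L)
df1['date'] = pd.to_datetime(df1['date'])
df['url by max date'] = df1.loc[df1.groupby('i')['date'].idxmax()].set_index('i')['url']
# row 0 -> www.example.com/other, row 1 -> www.example.com/2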

How can I split an array of structs into columns in Spark?

I have a column containing an array of phone numbers represented as structs, and I need to put them into three columns keyed by a "type" attribute (Phone1, Phone2, Fax).
Here are two sample values of the column.
[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}]
[{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}]
I want to split each into three columns, one for each type.
I want something like this:
Phone1 Phone2 Fax
000-000-0000 000-000-0001 000-000-0002
000-000-1000 000-000-1002 000-000-1001
This answer shows how to put each element of the array into its own column.
How to explode an array into multiple columns in Spark
This gets me halfway there, but I can't rely on the order of the items in the array. If I do that, I'll get something like this, where the Phone2 and Fax values in the second row are out of place.
Phone1 Phone2 Fax
000-000-0000 000-000-0001 000-000-0002
000-000-1000 000-000-1001 000-000-1002
How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.
Here's one way, which involves flattening the phone/fax numbers via explode, followed by pivoting on the typeCode, as shown in the following example:
case class Contact(number: String, typeCode: String)

val df = Seq(
  (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))),
  (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2")))
).toDF("user_id", "contacts")

df.
  withColumn("contact", explode($"contacts")).
  groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")).
  show(false)
// +-------+-----------+-----------+-----------+
// |user_id|Fax |Phone1 |Phone2 |
// +-------+-----------+-----------+-----------+
// |1 |111-66-7070|111-22-3333|111-44-5555|
// |2 |222-55-6060|222-33-4444|111-77-8888|
// +-------+-----------+-----------+-----------+
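
Since the first question in this thread uses PySpark, here is a rough Python equivalent of the same explode-then-pivot idea; a sketch that assumes the structs are already parsed and an active SparkSession named spark (data and names mirror the Scala example):

from pyspark.sql import functions as F

# assumes an active SparkSession named spark
df = spark.createDataFrame(
    [(1, [('111-22-3333', 'Phone1'), ('111-44-5555', 'Phone2'), ('111-66-7070', 'Fax')]),
     (2, [('222-33-4444', 'Phone1'), ('222-55-6060', 'Fax'), ('111-77-8888', 'Phone2')])],
    'user_id int, contacts array<struct<number:string,typeCode:string>>')

# flatten the array of structs, then pivot one column per typeCode
(df.select('user_id', F.explode('contacts').alias('contact'))
   .select('user_id', 'contact.number', 'contact.typeCode')
   .groupBy('user_id')
   .pivot('typeCode')
   .agg(F.first('number'))
   .show(truncate=False))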

insert and fetch strings and matrices to/from MySQL with Matlab

I need to store data in a database. I have installed and configured a MySQL database (and an SQLite database) in Matlab. However, I cannot store and retrieve anything other than scalar numeric values.
% create an empty database called test_database with MySQL Workbench.
% connect to it in Matlab
conn=database('test_database','root','XXXXXX','Vendor','MySQL');
% create a table to store values
create_test_table=['CREATE TABLE test_table (testID NUMERIC PRIMARY KEY, test_string VARCHAR(255), test_vector BLOB, test_scalar NUMERIC)'];
curs=exec(conn,create_test_table)
Result is good so far (curs.Message is an empty string)
% create a new record
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',[1,2],1})
% try to read out the new record
sqlquery='SELECT * FROM test_table';
data_to_view=fetch(conn,sqlquery)
Result is bad:
data_to_view =
1 NaN NaN 1
From the documentation for "fetch" I would expect:
data_to_view =
1×4 table
testID test_string test_vector test_scalar
_____________ ___________ ______________ ________
1 'string1' 1x2 double 1
Until I learn how to read blobs I'd even be willing to accept:
data_to_view =
1×4 table
testID test_string test_vector test_scalar
_____________ ___________ ______________ ________
1 'string1' NaN 1
I get the same thing with an SQLite database. How can I store and then read out strings and blobs, and why isn't the data returned in table format?
Matlab does not document that the default option for SQLite and MySQL database retrieval is to attempt to return everything as a numeric array. One only needs this line:
setdbprefs('DataReturnFormat','cellarray')
or
setdbprefs('DataReturnFormat','table')
in order to get results with differing datatypes. However, now my result is:
data_to_view =
1×4 cell array
{[2]} {'string1'} {11×1 int8} {[1]}
If instead I input:
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',typecast([1,2],'int8'),1})
Then I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {16×1 int8} {[1]}
which I can convert like so:
typecast(data_to_view{3},'double')
ans =
1 2
Unfortunately this does not work for SQLite. I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {' �? #'} {[1]}
and I can't convert the third part correctly:
typecast(unicode2native(data_to_view{1,3}),'double')
ans =
0.0001 2.0000
So I still need to learn how to read an SQLite blob in Matlab, but that is a different question.

How to search multiple items in JSON array in Postgres 9.3

I have a scenario where I need to search for multiple values in a JSON array. Below is my schema.
ID DATA
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
3 {"bookIds" : [11,12,10,9], "storeIds": [4,3]}
I want all the rows with values 1 and 2. Below is the query I am using (this query was written by fellow Stack Overflow user klin, credit to him).
select t.*
from JSONTest t, json_array_elements(data->'bookIds') books
where books::text::int in (1, 2);
However, I am getting duplicate rows in the output, shown below:
id data
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
1 {"bookIds" : [1,2,3,5], "storeIds": [2,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
2 {"bookIds" : [1,2], "storeIds": [1,3]}
I want only two rows in the output, i.e. ids 1 and 2. How can I do that? I don't want to use DISTINCT due to other constraints.
SQL Fiddle : http://sqlfiddle.com/#!15/6457a/2
Unfortunately there is no direct conversion function from a JSON array to a "real" Postgres array. (data ->'bookIds')::text returns something that is nearly a Postgres array literal, e.g. [1,2,3,5]. If you replace the [] with {}, the value can be cast to an integer array. Once we have a proper integer array, we can use the @> operator to test whether it contains another array:
select *
from jsontest
where translate((data ->'bookIds')::text, '[]', '{}')::int[] @> array[1,2];
translate((data ->'bookIds')::text, '[]', '{}') converts [1,2,3,5] to {1,2,3,5}, which is then cast to an integer array using ::int[].
SQLFiddle: http://sqlfiddle.com/#!15/6457a/4