Extracting a list of dicts for a Pandas column - json

I have a list of dictionaries within a pandas column to designate landing pages for a particular keyword.
keyword   | 07-31-2019 | landing_pages
cloud api | 50         | [{'url': 'www.example.com', 'date': '07-31-2019'}, {'url' ... ]
database  | 14         | [{'url': 'www.example.com/2', 'date': '08-30-2019'} ... ]
(There are actually many date columns; I've only shown one as an example.)
My issue is that I already have columns for each date, so I want to extract the landing pages as a list and have that as a new column:
keyword   | 07-31-2019 | landing_pages
cloud api | 50         | www.example.com, www.example.com/other
database  | 14         | www.example.com/2, www.example.com/3
So far I've tried json_normalize, which gave me a new table of dates and landing pages, and a list comprehension, which also gave the wrong result. I could solve this with loops, but I'm concerned that wouldn't be efficient. How can I do this efficiently?

Use a generator expression with str.join to extract the url values (if the data are dictionaries):
df['landing_pages'] = df['landing_pages'].apply(lambda x: ', '.join(y['url'] for y in x))
print (df)
     keyword  07-31-2019      landing_pages
0  cloud api          50    www.example.com
1   database          14  www.example.com/2
If that fails because the values are string representations of dictionaries, parse them first with ast.literal_eval:
import ast
df['landing_pages'] = df['landing_pages'].apply(
    lambda x: ', '.join(y['url'] for y in ast.literal_eval(x)))
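If you'd rather keep an actual Python list in the new column than a joined string, the same idea with a list comprehension should work:
df['landing_pages'] = df['landing_pages'].apply(lambda x: [y['url'] for y in x])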
EDIT: If you want the url with the most recent date per row: build a DataFrame from the dictionaries, adding the original index values as a new key, convert the date strings to datetimes, use DataFrameGroupBy.idxmax to get the index of the maximum datetime per group, select those rows with DataFrame.loc, and finally assign the url column back to the original DataFrame:
# flatten the list column, tagging each dict with its source row index i
L = [dict(x, **{'i': k}) for k, v in df['landing_pages'].items() for x in v]
df1 = pd.DataFrame(L)
df1['date'] = pd.to_datetime(df1['date'])
# for each original row, keep the url belonging to the latest date
df['url by max date'] = df1.loc[df1.groupby('i')['date'].idxmax()].set_index('i')['url']
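For instance, with sample data shaped like the question's (a minimal sketch; the second url and date in each list are made up to complete the example):
import pandas as pd

df = pd.DataFrame({
    'keyword': ['cloud api', 'database'],
    '07-31-2019': [50, 14],
    'landing_pages': [
        [{'url': 'www.example.com', 'date': '07-31-2019'},
         {'url': 'www.example.com/other', 'date': '06-30-2019'}],
        [{'url': 'www.example.com/2', 'date': '08-30-2019'},
         {'url': 'www.example.com/3', 'date': '07-30-2019'}],
    ],
})

L = [dict(x, **{'i': k}) for k, v in df['landing_pages'].items() for x in v]
df1 = pd.DataFrame(L)
df1['date'] = pd.to_datetime(df1['date'])
df['url by max date'] = df1.loc[df1.groupby('i')['date'].idxmax()].set_index('i')['url']
print(df['url by max date'])
# 0      www.example.com   (07-31-2019 is the latest date for 'cloud api')
# 1    www.example.com/2   (08-30-2019 is the latest date for 'database')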

Related

Postgres column with string unicode values

I have a column query_params of type TEXT, but the values are stored as Python-style unicode string literals: each value in the column is prefixed with a u, and I'm struggling to remove it.
Is there a way to remove the u prefixes and convert the dictionary values to columns?
For example, the query SELECT query_params FROM api_log LIMIT 2 returns two rows:
{
u'state': u'CA',
u'page_size': u'1000',
u'market': u'Western',
u'requested_at': u'2014-10-28T00:00:00+00:00'
},
{
u'state': u'NY',
u'page_size': u'1000',
u'market': u'Eastern',
u'requested_at': u'2014-10-28T00:10:00+00:00'
}
Is it possible to handle this unicode notation in Postgres and convert the values to columns?
state | page_size | market | requested_at
------+-----------+----------+---------------------------
CA | 1000 | Western | 2014-10-28T00:00:00+00:00
NY | 1000 | Eastern | 2014-10-28T00:10:00+00:00
Thanks for any help.
You should remove the u prefixes and replace the single quotes with double quotes to get properly formatted JSON. Then you can use the ->> operator to get its attributes:
select
  v->>'state' as state,
  v->>'page_size' as page_size,
  v->>'market' as market,
  v->>'requested_at' as requested_at
from (
  select regexp_replace(query_params, 'u''([^'']*)''', '"\1"', 'g')::json as v
  from api_log
) s;
Test the solution in SqlFiddle.
Read about POSIX regular expressions in the Postgres documentation.
Find an explanation of the regexp at regex101.com.
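As an aside, if you can clean the data outside the database instead: each stored value is a valid Python dict literal (the u'...' prefixes parse fine in Python 3), so a minimal sketch like this should work once you have fetched the raw text into Python:
import ast

raw = "{u'state': u'CA', u'page_size': u'1000', u'market': u'Western', u'requested_at': u'2014-10-28T00:00:00+00:00'}"
params = ast.literal_eval(raw)            # safely parses the dict literal, u prefixes and all
print(params['state'], params['market'])  # CA Western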

Insert and fetch strings and matrices to/from MySQL with Matlab

I need to store data in a database. I have installed and configured a MySQL database (and an SQLite database) in Matlab. However I cannot store and retrieve anything other than scalar numeric values.
% create an empty database called test_data base with MySQL workbench.
% connect to it in Matlab
conn=database('test_database','root','XXXXXX','Vendor','MySQL');
% create a table to store values
create_test_table=['CREATE TABLE test_table (testID NUMERIC PRIMARY KEY, test_string VARCHAR(255), test_vector BLOB, test_scalar NUMERIC)'];
curs=exec(conn,create_test_table)
Result is good so far (curs.Message is an empty string)
% create a new record
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',[1,2],1})
% try to read out the new record
sqlquery='SELECT * FROM test_table';
data_to_view=fetch(conn,sqlquery)
Result is bad:
data_to_view =
1 NaN NaN 1
From the documentation for "fetch" I would expect:
data_to_view =
1×4 table
testID test_string test_vector test_scalar
_____________ ___________ ______________ ________
1 'string1' 1x2 double 1
Until I learn how to read blobs I'd even be willing to accept:
data_to_view =
1×4 table
testID test_string test_vector test_scalar
_____________ ___________ ______________ ________
1 'string1' NaN 1
I get the same thing with an sqlite database. How can I store and then read out strings and blobs and why isn't the data returned in table format?
Matlab does not document that the default option for SQLite and MySQL database retrieval is to attempt to return everything as a numeric array. One only needs this line:
setdbprefs('DataReturnFormat','cellarray')
or
setdbprefs('DataReturnFormat','table')
in order to get results with differing datatypes. However, now my result is:
data_to_view =
1×4 cell array
{[2]} {'string1'} {11×1 int8} {[1]}
If instead I input:
datainsert(conn,'test_table',{'testID','test_string','test_vector','test_scalar'},{1,'string1',typecast([1,2],'int8'),1})
Then I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {16×1 int8} {[1]}
which I can convert like so:
typecast(data_to_view{3},'double')
ans =
1 2
Unfortunately this does not work for SQLite. I get:
data_to_view =
1×4 cell array
{[2]} {'string1'} {' �? #'} {[1]}
and I can't convert the third part correctly:
typecast(unicode2native(data_to_view{1,3}),'double')
ans =
0.0001 2.0000
So I still need to learn how to read an SQLite blob in Matlab but that is a different question.
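For what it's worth, the corruption is easy to reproduce outside Matlab. A minimal Python sketch (assuming numpy; it mirrors, rather than uses, Matlab's typecast) shows why a raw byte round trip preserves the doubles while a text-codec round trip does not:
import numpy as np

raw = np.array([1.0, 2.0]).tobytes()         # 16 bytes, like typecast([1,2],'int8') in Matlab
back = np.frombuffer(raw, dtype=np.float64)  # raw bytes round-trip cleanly: [1. 2.]

# decoding the bytes as text and re-encoding them (roughly what the char /
# unicode2native round trip does) changes any byte the codec remaps
mangled = raw.decode('latin-1').encode('utf-8')
print(len(raw), len(mangled))                # 16 vs 17: the doubles are no longer intact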

Oracle SQL when querying a range of data

I have a table in which each ID has data in several bucket fields. I want a function that returns a sum of buckets, where the function's parameters include the start and end bucket field.
So, if I had a table like this:
ID  Bucket0  Bucket30  Bucket60  Bucket90  Bucket120
10     5.00     12.00     10.00      0.00       8.00
If I send in the ID and the parameters Bucket0, Bucket0, it would return only the value in the Bucket0 field: 5.00
If I send in the ID and the parameters Bucket30, Bucket120, it would return the sum of the buckets from 30 to 120, or (12+10+0+8) 30.00.
Is there a nicer way to write this other than a huge ugly
if parameter1=bucket0 and parameter2=bucket0
then select bucket0
else if parameter1=bucket0 and parameter2=bucket1
then select bucket0 + bucket1
else if parameter1=bucket0 and parameter2=bucket2
then select bucket0 + bucket1 + bucket2
and so on?
The table already exists, so I don't have a lot of control over that. I can make my parameters for the function however I want. I can safely say that if a set of buckets are wanted, none in the middle will be skipped, so specifying start and end buckets would work. I could have a single comma delimited string of all buckets wanted.
It would have been better if your table had been normalised, like this:
id | bucket | value
---+-----------+------
10 | bucket000 | 5
10 | bucket030 | 12
10 | bucket060 | 10
10 | bucket090 | 0
10 | bucket120 | 8
Also, the buckets should have names that compare correctly in ranges, so that bucket030 comes between bucket000 and bucket120 in normal alphabetical order. That is not the case if you leave out the padded zeroes: without them, 'bucket120' sorts before 'bucket30'.
If the above normalisation is not possible, then use an unpivot clause to turn your current table into the structure depicted above:
select id, sum(value)
from (
  select *
  from mytable
  unpivot (value for bucket_id in (bucket0   as 'bucket000',
                                   bucket30  as 'bucket030',
                                   bucket60  as 'bucket060',
                                   bucket90  as 'bucket090',
                                   bucket120 as 'bucket120'))
) normalised
where bucket_id between 'bucket000' and 'bucket060'
group by id
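With the sample row above, the range 'bucket000' to 'bucket060' would return 5 + 12 + 10 = 27.00 for ID 10, and 'bucket030' to 'bucket120' would return the 30.00 from the question.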
When you do this with parameter variables, make sure those parameters have the padded zeroes as well.
You could for instance ensure that as follows for parameter1:
if parameter1 like 'bucket%' then
  -- +substr(...) turns the numeric part into a number (dropping any
  -- existing padding) before lpad re-pads it to three digits
  parameter1 := 'bucket' || lpad(+substr(parameter1, 7), 3, '0');
end if;
...etc.

How to Convert Query Result to JSON Object Inside Postgres

I have a simple query, SELECT name, grp FROM things; that results in the following table:
name | grp
------+-----
a | y
b | x
c | x
d | z
e | z
f | z
I would like to end up with the following single JSON object:
{y: [a], x: [b,c], z: [d,e,f]}
I feel like I'm closer with the query SELECT grp, array_agg(name) as names FROM things GROUP BY grp; which gives three rows with the "name" condensed into an array, but I don't know where to go from here to get the rows condensed into a single JSON object.
SELECT json_build_object(grp, array_agg(name)) as objects FROM things GROUP BY grp; is maybe slightly closer since that results in a single column result of individual JSON objects like {y: [a]}, but they are still individual objects, so that might not be the right path to go down.
This is using Postgresql 9.4.
It seems the key here is the json_object_agg function, which is not listed with the rest of the JSON functions but under the aggregate functions.
See: http://www.postgresql.org/docs/9.4/static/functions-aggregate.html
The following query gets me exactly what I'm looking for:
SELECT json_object_agg(each.grp, each.names) FROM (
SELECT grp, array_agg(name) as names FROM things GROUP BY grp
) AS each;
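With the sample table above, this should return a single row containing one JSON object along the lines of:
{"y" : ["a"], "x" : ["b", "c"], "z" : ["d", "e", "f"]}
(the key order isn't guaranteed, since it depends on the grouping order).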

Access partition function: Is there a way to make it show bin categories that don't have a count?

I'm trying to use the Access Partition function to generate the bins for a histogram chart showing the frequency distribution of my % utilization data set. However, the Partition function only shows the bin ranges (e.g. 0:9, 10:19, etc.) for the categories that have a count. I would like it to show the ranges up to 100.
Example:
Using this function:
% Utilization: Partition([Max],0,100,10)
The Full SQL is:
SELECT Count([qry].[Max]) AS Actuals, Partition([Max],0,100,10) AS [% Utilization]
FROM [qry]
GROUP BY Partition([Max],0,100,10);
gives me:
Actuals | % Utilization
4 | 0: 9
4 | 10: 19
4 | 20: 29
but I want it to show 0s for the ranges that don't have values, up through 90:99. Can this be done?
Thanks in Advance
The only way I can think of doing this is with an additional Bins table that contains all the bins you wish to illustrate:
SELECT Bins.[% Utilization], t.Actuals
FROM Bins
LEFT JOIN
  (SELECT Count([max]) AS Actuals,
          Partition([max],0,100,10) AS [% Utilization]
   FROM qry
   GROUP BY Partition([max],0,100,10)) t
ON t.[% Utilization] = Bins.[% Utilization];
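For completeness, a sketch of the Bins table this assumes: a single text column [% Utilization] whose rows hold the labels exactly as Partition formats them (including any leading spaces Partition pads them with), i.e. 0: 9, 10: 19, 20: 29, and so on up to 90: 99. If you want literal zeros instead of Nulls for the empty ranges, wrap the joined count, e.g. Nz(t.Actuals, 0) AS Actuals.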