Hadoop Pig with nested JSON

I have a list of movies with ratings by user.
{"_id":59607,"title":"King Corn (2007)",
"genres":["Documentary"],
"ratings":[ {"userId":1860,"rating":3},
{"userId":9970,"rating":3.5},
{"userId":16929,"rating":1.5},
{"userId":23473,"rating":4},
{"userId":23733,"rating":4},
{"userId":27584,"rating":3},
{"userId":28232,"rating":4},
{"userId":29482,"rating":3},
{"userId":40976,"rating":5},
{"userId":44631,"rating":4},
{"userId":47613,"rating":3},
{"userId":49763,"rating":3},
{"userId":58160,"rating":4.5},
{"userId":62249,"rating":3},
{"userId":65923,"rating":4},
{"userId":67507,"rating":4},
{"userId":68259,"rating":3.5},
{"userId":70331,"rating":5},
{"userId":71420,"rating":3.5}
]
}
I need to count how many ratings each user has submitted. This is my attempt to get at the ratings:
a = load '/movies_1m.json' using JsonLoader('id:int, title : chararray, genres : { ( genre : chararray ) }, ratings: { ( userId : int, rating: float) } ');
then
b = FOREACH a GENERATE FLATTEN(ratings);
describe gives me the following:
b: {ratings::userId: int,ratings::rating: float}
To count per user I need to access the fields inside ratings, but this is where I am stuck. I tried this:
c = FOREACH b GENERATE COUNT(ratings);
but it gives me an error.
I need to get something like this:
{userId: int, rating: float}

You need to GROUP in order to COUNT since that is an aggregate operation.
b = FOREACH a GENERATE FLATTEN(ratings);
gr = GROUP b by ratings::userId;
c = FOREACH gr GENERATE group,COUNT($1);
\d c
Output
Note, none of the users in your example repeat, so these are all one.
(1860,1)
(9970,1)
(16929,1)
(23473,1)
(23733,1)
(27584,1)
(28232,1)
(29482,1)
(40976,1)
(44631,1)
(47613,1)
(49763,1)
(58160,1)
(62249,1)
(65923,1)
(67507,1)
(68259,1)
(70331,1)
(71420,1)
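The flatten, group, and count steps can also be mirrored outside Pig to check the logic on a small sample. This is a minimal Python sketch, not Pig code; the function name count_ratings_per_user and the two-movie sample are made up for illustration.

```python
from collections import Counter


def count_ratings_per_user(movies):
    """Count ratings per userId across all movies (FLATTEN + GROUP + COUNT)."""
    counts = Counter()
    for movie in movies:
        # Equivalent of FLATTEN(ratings): visit every rating of every movie
        for rating in movie.get("ratings", []):
            # Equivalent of GROUP BY userId plus COUNT: one increment per rating
            counts[rating["userId"]] += 1
    return dict(counts)


# Hypothetical sample in which user 1860 rated both movies
sample_movies = [
    {"_id": 1, "ratings": [{"userId": 1860, "rating": 3},
                           {"userId": 9970, "rating": 3.5}]},
    {"_id": 2, "ratings": [{"userId": 1860, "rating": 4}]},
]

print(count_ratings_per_user(sample_movies))  # {1860: 2, 9970: 1}
```

Unlike the single-movie example in the question, the sample repeats a user, so the grouping visibly produces a count above one.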

Related

How to truncate double precision value in PostgreSQL by keeping exactly first two decimals?

I'm trying to truncate a double precision value when I build JSON using the json_build_object() function in PostgreSQL 11.8, but with no luck. To be more precise, I'm trying to truncate the number 19.9899999999999984 to ONLY two decimals, making sure it is NOT rounded to 20.00 (which is what happens) but kept at 19.98.
What I've tried so far:
1) TRUNC(found_book.price::numeric, 2), which returns 20.00
2) ROUND(found_book.price::numeric, 2), which returns 19.99 (the closest value so far, but not what I need)
3) ROUND(found_book.price::double precision, 2), which fails with:
[42883] ERROR: function round(double precision, integer) does not exist
Also here is whole code I'm using:
create or replace function public.get_book_by_book_id8(b_id bigint) returns json as
$BODY$
declare
found_book book;
book_authors json;
book_categories json;
book_price double precision;
begin
-- Load book data:
select * into found_book
from book b2
where b2.book_id = b_id;
-- Get assigned authors
select case when count(x) = 0 then '[]' else json_agg(x) end into book_authors
from (select aut.*
from book b
inner join author_book as ab on b.book_id = ab.book_id
inner join author as aut on ab.author_id = aut.author_id
where b.book_id = b_id) x;
-- Get assigned categories
select case when count(y) = 0 then '[]' else json_agg(y) end into book_categories
from (select cat.*
from book b
inner join category_book as cb on b.book_id = cb.book_id
inner join category as cat on cb.category_id = cat.category_id
where b.book_id = b_id) y;
book_price = trunc(found_book.price, 2);
-- Build the JSON response:
return (select json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', book_price,
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
));
end
$BODY$
language 'plpgsql';
select get_book_by_book_id8(186);
How can I keep EXACTLY the first two decimal digits, i.e. 19.98? Any suggestion or help is greatly appreciated.
P.S. PostgreSQL version is 11.8
In PostgreSQL 11.8 or 12.3 I cannot reproduce:
# select trunc('19.9899999999999984'::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984, 2);
trunc
-------
19.98
(1 row)
Actually I can reproduce with the right type and a special setting:
# set extra_float_digits=0;
SET
# select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.99
(1 row)
And a possible solution:
# show extra_float_digits;
extra_float_digits
--------------------
3
(1 row)
# select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.98
(1 row)
But note that:
Note: The extra_float_digits setting controls the number of extra
significant digits included when a floating point value is converted
to text for output. With the default value of 0, the output is the
same on every platform supported by PostgreSQL. Increasing it will
produce output that more accurately represents the stored value, but
may be unportable.
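The effect of the text conversion can be reproduced outside PostgreSQL. The following is a rough Python sketch, not PostgreSQL's implementation: it assumes that formatting with 15 significant digits approximates the shorter text form and that 17 digits approximates a more precise one. The helper name trunc_via_text is made up for illustration.

```python
from decimal import Decimal, ROUND_DOWN


def trunc_via_text(value: float, significant_digits: int) -> Decimal:
    """Format a float as text, then truncate the text value to two decimals."""
    # Assumption: "%.<n>g" stands in for the float-to-text conversion
    text = "%.*g" % (significant_digits, value)
    # ROUND_DOWN truncates toward zero, like trunc(numeric, 2)
    return Decimal(text).quantize(Decimal("0.01"), rounding=ROUND_DOWN)


price = float("19.9899999999999984")

# With fewer digits the text is already rounded to 19.99 before truncation
print(trunc_via_text(price, 15))  # 19.99

# With more digits the text keeps the trailing 9s, so truncation yields 19.98
print(trunc_via_text(price, 17))  # 19.98
```

The point of the sketch is that the rounding happens during the conversion to text, before the truncation is ever applied.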
As @pifor suggested, I got it working by passing trunc(found_book.price::double precision::text::numeric, 2) directly as the value in json_build_object, like this:
json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', trunc(found_book.price::double precision::text::numeric, 2),
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
)
Assigning book_price = trunc(found_book.price::double precision::text::numeric, 2); and passing book_price as the value for the 'price' key didn't work.
Thank you for your help. :)

How to change the structure of a JSON field in PostgreSQL?

I want to convert an array of (name, value) pairs into a map, from
[{'name':'email','value':'email#email.em'},{'name':'phone','value':123123}]
to
{'email':'email#email.em','phone':123123}
I need to search and sort by fields such as email, and I would like to simplify this by creating a view over my table that exposes the data as a map.
with t(id,j) as (
values(1,'[{"name":"email","value":"email#email.em"},{"name":"phone","value":123123}]'::json))
select json_object_agg(a.j->>'name', a.j->>'value')
from t, json_array_elements(j) a(j) group by id;
╔════════════════════════════════════════════════════╗
║ json_object_agg ║
╠════════════════════════════════════════════════════╣
║ { "email" : "email#email.em", "phone" : "123123" } ║
╚════════════════════════════════════════════════════╝
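The reshaping from an array of name/value pairs to a map can be sketched in a few lines of Python to make the intended transformation explicit. This is an illustration only, not a replacement for the SQL; the function name pairs_to_map is hypothetical.

```python
import json


def pairs_to_map(raw_json: str) -> dict:
    """Turn '[{"name": k, "value": v}, ...]' into {k: v, ...}."""
    pairs = json.loads(raw_json)
    # Each array element contributes one key/value entry to the map
    return {pair["name"]: pair["value"] for pair in pairs}


raw = '[{"name":"email","value":"email#email.em"},{"name":"phone","value":123123}]'
print(json.dumps(pairs_to_map(raw)))
```

Note that this sketch keeps the phone value as a number, whereas the ->> operator in the query above extracts values as text, which is why the SQL result shows "123123" in quotes.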
I changed your input data to be valid JSON (double quotes instead of single quotes) and added a key name for the array. After this transformation, here is an example of how to "build" it:
t=# with j as (
select '{"arr":[{"name":"email","value":"email#email.em"},{"name":"phone","value":123123}]}'::json v
)
select
concat('{',v->'arr'->0->'name',':',v->'arr'->0->'value',',',v->'arr'->1->'name',':',v->'arr'->1->'value','}')::json
from j
;
{"email":"email#email.em","phone":123123}

Dividing a list of 2 query statements

I have some query statements and I want to take the average by basically doing top_level_comment_count.fdiv(code_review_assigned_count).round(2)
Here are my 2 query statements:
top_level_comment_count = CrucibleComment.group(:user_id).where(parent_comment_id: nil).count
code_review_assigned_count = Reviewer.group(:user_id).count
Both of these return something that looks like this:
40=>5,
41=>1,
43=>4,
44=>10,
45=>2,
46=>13,
48=>7,
50=>7,
51=>6,
52=>5,
54=>7,
55=>41,
56=>2,
58=>21,
60=>7,
61=>8,
62=>3,
63=>1,
So, for each :user_id that appears in both results, I want to take the average.
My method currently looks like this:
def self.average_top_level_comments
a = CrucibleComment.group(:user_id).where(parent_comment_id: nil).count
b = Reviewer.group(:user_id).count
end
In other words I am wanting to do this statement:
return nil unless code_review_assigned_count && top_level_comment_count
top_level_comment_count.fdiv(code_review_assigned_count).round(2)
for a group of numbers. How can I do this?
For example:
id:40 => 5.0/3.3
id: 41 => 1/2.2
id: 43 => 4 /1.0
I would suggest using inject.
Something like:
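division_result = top_level_comment_count.inject({}) do |result, (id, count)|
  assigned = code_review_assigned_count[id]
  # Skip users with no assigned reviews to avoid dividing by nil or zero
  result[id] = count.fdiv(assigned).round(2) if assigned && assigned > 0
  result
end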
That will return a hash with the IDs as keys and the results of the division as the values.
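The same per-key division can be sketched in Python to show the logic independently of ActiveRecord. The function name divide_counts and the sample numbers are assumptions made for illustration; keys that are missing from the denominator, or whose denominator is zero, are skipped.

```python
def divide_counts(numerators: dict, denominators: dict, digits: int = 2) -> dict:
    """Divide counts key by key, keeping only keys with a usable denominator."""
    result = {}
    for user_id, count in numerators.items():
        denominator = denominators.get(user_id)
        # A missing key gives None and a zero count gives 0; skip both cases
        if denominator:
            result[user_id] = round(count / denominator, digits)
    return result


# Hypothetical counts keyed by user_id
top_level_comments = {40: 5, 41: 1, 43: 4}
reviews_assigned = {40: 2, 41: 4, 99: 7}

print(divide_counts(top_level_comments, reviews_assigned))  # {40: 2.5, 41: 0.25}
```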

How do I sum up properties of a JSON object in CoffeeScript?

I have an object that looks like this one:
object =
  title : 'an object'
  properties :
    attribute1 :
      random_number: 2
      attribute_values:
        a: 10
        b: 'irrelevant'
    attribute2 :
      random_number: 4
      attribute_values:
        a: 15
        b: 'irrelevant'
  some_random_stuff: 'random stuff'
I want to extract the sum of the 'a' values on attribute1 and attribute2.
What would be the best way to do this in Coffeescript?
(I have already found one way to do it but that just looks like Java-translated-to-coffee and I was hoping for a more elegant solution.)
Here is what I came up with (edited to be more generic based on comment):
sum_attributes = (x) =>
  sum = 0
  for name, value of object.properties
    sum += value.attribute_values[x]
  sum
alert sum_attributes('a') # 25
alert sum_attributes('b') # 0irrelevantirrelevant
So, that does what you want... but it probably doesn't do exactly what you want with strings.
You might want to pass in the accumulator seed, like sum_attributes 0, 'a' and sum_attributes '', 'b'
Brian's answer is good. But if you wanted to bring in a functional programming library like Underscore.js, you could write a more succinct version:
sum = (arr) -> _.reduce arr, ((memo, num) -> memo + num), 0
sum _.pluck(_.pluck(object.properties, 'attribute_values'), 'a')
Alternatively, without a library, using a comprehension and the native reduce:
total = (attr.attribute_values.a for key, attr of object.properties).reduce (a,b) -> a+b
or
sum = (arr) -> arr.reduce((a, b) -> a+b)
total = sum (attr.attribute_values.a for k, attr of object.properties)
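For comparison, the comprehension-plus-reduce idea maps onto a generator expression passed to sum. This Python sketch mirrors the object from the question; the function name sum_attribute is made up for illustration.

```python
def sum_attribute(obj: dict, key: str):
    """Sum attribute_values[key] over every entry in obj['properties']."""
    return sum(attr["attribute_values"][key]
               for attr in obj["properties"].values())


# Same shape as the object in the question
data = {
    "title": "an object",
    "properties": {
        "attribute1": {"random_number": 2,
                       "attribute_values": {"a": 10, "b": "irrelevant"}},
        "attribute2": {"random_number": 4,
                       "attribute_values": {"a": 15, "b": "irrelevant"}},
    },
    "some_random_stuff": "random stuff",
}

print(sum_attribute(data, "a"))  # 25
```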

Split column string into multiple columns strings

I have an entry in the table that is a string delimited by semicolons. Is it possible to split the string into separate columns? I've been looking online and on Stack Overflow, and I couldn't find an answer that splits a string into columns.
The entry in the table looks something like this (anything in brackets [] is not actually in my table. Just there to make things clearer):
sysinfo [column]
miscInfo ; vendor: aaa ; bootr: bbb; revision: ccc; model: ddd [string a]
miscInfo ; vendor: aaa ; bootr: bbb; revision: ccc; model: ddd [string b]
...
There are a little over one million entries with strings that look like this. Is it possible in MySQL to have the query return the following
miscInfo, Vendor, Bootr, Revision , Model [columns]
miscInfo_a, vendor_a, bootr_a, revision_a, model_a
miscInfo_b, vendor_b, bootr_b, revision_b, model_b
...
for all of the rows in the table, where the comma indicates a new column?
Edit:
Here's some input and output as Bohemian requested.
sysinfo [column]
Modem <<HW_REV: 04; VENDOR: Arris ; BOOTR: 6.xx; SW_REV: 5.2.xxC; MODEL: TM602G>>
<<HW_REV: 1; VENDOR: Motorola ; BOOTR: 216; SW_REV: 2.4.1.5; MODEL: SB5101>>
Thomson DOCSIS Cable Modem <<HW_REV: 4.0; VENDOR: Thomson; BOOTR: 2.1.6d; SW_REV: ST52.01.02; MODEL: DCM425>>
Some can be longer entries but they all have similar format. Here is what I would like the output to be:
miscInfo, vendor, bootr, revision, model [columns]
04, Arris, 6.xx, 5.2.xxC, TM602G
1, Motorola, 216, 2.4.1.5, SB5101
4.0, Thomson, 2.1.6d, ST52.01.02, DCM425
You could make use of String functions (particularly substr) in mysql: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
Please take a look at how I've split my coordinates column into 2 lat/lng columns:
UPDATE shops_locations L
LEFT JOIN shops_locations L2 ON L2.id = L.id
SET L.coord_lat = SUBSTRING(L2.coordinates, 1, LOCATE('|', L2.coordinates) - 1),
L.coord_lng = SUBSTRING(L2.coordinates, LOCATE('|', L2.coordinates) + 1)
Overall, I followed the UPDATE JOIN advice from "MySQL - UPDATE query based on SELECT Query" and the STR_SPLIT question "Split value from one field to two".
Admittedly I'm only splitting into two columns, and SUBSTRING might not work as well for your case, but I hope this helps :)
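If doing the split outside the database is an option, the key/value layout of the sample rows can be parsed directly. This is a Python sketch rather than a MySQL solution; it assumes every row carries its fields as "KEY: value" pairs separated by semicolons inside << >>, as in the sample input, and the names FIELDS and split_sysinfo are hypothetical.

```python
import re

# Field order taken from the sample rows in the question
FIELDS = ("HW_REV", "VENDOR", "BOOTR", "SW_REV", "MODEL")


def split_sysinfo(sysinfo: str) -> tuple:
    """Return the values of FIELDS from one sysinfo string (None if absent)."""
    # Use only the part between << and >> when those markers are present
    match = re.search(r"<<(.*?)>>", sysinfo)
    body = match.group(1) if match else sysinfo
    pairs = {}
    for part in body.split(";"):
        key, separator, value = part.partition(":")
        if separator:  # ignore fragments that have no "KEY: value" shape
            pairs[key.strip().upper()] = value.strip()
    return tuple(pairs.get(field) for field in FIELDS)


row = "Modem <<HW_REV: 04; VENDOR: Arris ; BOOTR: 6.xx; SW_REV: 5.2.xxC; MODEL: TM602G>>"
print(split_sysinfo(row))  # ('04', 'Arris', '6.xx', '5.2.xxC', 'TM602G')
```

Looking fields up by name rather than by position means a row with reordered or missing fields still lands in the right columns.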