I have a set of tweets where I want to calculate the number of replies a user got using Pig.
My Pig script looks like this (assuming y1 holds the required JSON):
y2 = GROUP y1 BY in_reply_to_user_id_str;
y3 = FOREACH y2 GENERATE group AS in_reply_to_user_id_str, COUNT(y1) AS number_of_replies_to_user;
y4 = FOREACH y3 GENERATE in_reply_to_user_id_str, number_of_replies_to_user;
y5 = JOIN y1 BY user_id LEFT OUTER, y4 BY in_reply_to_user_id_str;
STORE y5 INTO '$DATA_OUTPUT' USING JsonStorage();
Now, my output looks like:
{"y1::user_id":"9642792"............"y4::in_reply_to_user_id_str":"9642792","y4::number_of_replies_to_user":1}
Whereas I was expecting something like:
{"user_id":"9642792"..............."number_of_replies_to_user":1}
I do not want the alias prefixes y1:: and y4:: in the field names. I removed some fields that are not needed to answer the question, just to make the output more readable.
How can I do that? My Pig version (0.15) does not support $0...
Also, is there a better way of calculating this value? SQL seems very straightforward, but Pig is really confusing.
Add a step that generates the fields you need from y5, then store the resulting relation y6:
y5 = JOIN y1 BY user_id LEFT OUTER, y4 BY in_reply_to_user_id_str;
y6 = FOREACH y5 GENERATE y1::$0,y1::$1,y1::$2,..........y4::$0,y4::$1;
STORE y6 INTO '$DATA_OUTPUT' USING JsonStorage();
I have two very simple functions with the same input and output types. Both return a count: function1 when the average of three notes is >= 4, and function2 when the average is < 4. They both work properly on their own, but now I need to combine them into a single function with two outputs. I know this may be a very easy question, but I am only getting started with Haskell:
function1 :: [(String, Int, Int, Int)] -> Int
function1 ((name,note1,note2,note3):xs) =
    if (note1+note2+note3) `div` 3 >=4 then length xs else length xs
function2 :: [(String, Int, Int, Int)] -> Int
function2 ((name,note1,note2,note3):xs) =
    if (note1+note2+note3) `div` 3 <4 then length xs else length xs
Thanks!
You can use &&& from Control.Arrow.
import Control.Arrow ((&&&))
combineFunctions f1 f2 = f1 &&& f2
Then use it like this:
combinedFunc = combineFunctions function1 function2
(res1,res2) = combinedFunc sharedArg
You already use tuples (name,note1,note2,note3) in your input data, so you must be familiar with the concept.
The simplest way to produce two outputs simultaneously is to put the two into one tuple:
combinedFunction f1 f2 input = (out1, out2)
  where
    out1 = f1 input
    out2 = f2 input
It so happens that this can be written shorter as combinedFunction f1 f2 = f1 &&& f2 and even combinedFunction = (&&&), but that's less important for now.
A more interesting way to produce two outputs simultaneously is to redefine what it means to produce an output:
combinedFunWith k f1 f2 input = k out1 out2
  where
    out1 = f1 input
    out2 = f2 input
Here instead of just returning them in a tuple, we pass them as arguments to some other user-specified function k. Let it decide what to do with the two outputs!
As can also be readily seen, our first version can be expressed with the second, as combinedFunction = combinedFunWith (,), so the second one seems to be more general ((,) is just a shorter way of writing a function foo x y = (x,y), without giving it a name).
Consider that I am given an English sentence such as:
"If x1 is greater than x2, set y to 2"
Is there a method to extract the conditions and actions from such a statement into an "action" parse tree or a code format such as the one below?
if x1 > x2:
    y = 2
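As a rough illustration of the kind of output being asked for, here is a minimal sketch that handles only this one sentence template with a regular expression. The names COMPARATORS, PATTERN and sentence_to_code are hypothetical, and the phrase-to-operator table is an assumption; general English sentences would need a real parser rather than a fixed pattern.

```python
import re

# Assumed mapping from comparison phrases to operators (illustrative only).
COMPARATORS = {
    "greater than": ">",
    "less than": "<",
    "equal to": "==",
}

# Matches exactly one template: "If <lhs> is <comparison> <rhs>, set <target> to <value>"
PATTERN = re.compile(
    r"^If (?P<lhs>\w+) is (?P<cmp>greater than|less than|equal to) (?P<rhs>\w+), "
    r"set (?P<target>\w+) to (?P<value>\w+)$"
)


def sentence_to_code(sentence):
    """Return a Python-style if statement for the template, or None if it does not match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    op = COMPARATORS[match["cmp"]]
    condition = f"{match['lhs']} {op} {match['rhs']}"
    action = f"{match['target']} = {match['value']}"
    return f"if {condition}:\n    {action}"


result = sentence_to_code("If x1 is greater than x2, set y to 2")
```

Any sentence that deviates from the template returns None, which makes the limits of the pattern-based approach visible.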
I am trying to store the coefficients from a simulated regression in variables b1 and b2 in the code below, but I'm not quite sure how to go about this. I've tried using return scalar b1 = _b[x1] and return scalar b2 = _b[x2] in an rclass program, but that didn't work. Then I tried scalar b1 = e(x1) and scalar b2 = e(x2) in an eclass program, which also wasn't successful.
The goal is to use these stored coefficients to estimate some value (say rhat) and test the standard error of rhat.
Here's my code:
program montecarlo2, eclass
clear
version 11
drop _all
set obs 20
gen x1 = rchi2(4) - 4
gen x2 = (runiform(1,2) + 3.5)^2
gen u = 0.3*rnormal(0,25) + 0.7*rnormal(0,5)
gen y = 1.3*x1 + 0.7*x2 + 0.5*u
* OLS Model
regress y x1 x2
scalar b1 = e(x1)
scalar b2 = e(x2)
end
I want to do something like,
rhat = b1 + b2, and then test the standard error of rhat.
Let's hack a bit at your program:
Version 1
program montecarlo2
clear
version 11
set obs 20
gen x1 = rchi2(4) - 4
gen x2 = (runiform(1,2) + 3.5)^2
gen u = 0.3*rnormal(0,25) + 0.7*rnormal(0,5)
gen y = 1.3*x1 + 0.7*x2 + 0.5*u
* OLS Model
regress y x1 x2
end
I cut drop _all as unnecessary given the clear. I also cut the eclass. One reason is that regress leaves e-class results in its wake anyway. Also, if you wish, you can add
scalar b1 = _b[x1]
scalar b2 = _b[x2]
scalar r = b1 + b2
either within the program after the regress or immediately after the program runs.
Version 2
program montecarlo2, eclass
clear
version 11
set obs 20
gen x1 = rchi2(4) - 4
gen x2 = (runiform(1,2) + 3.5)^2
gen u = 0.3*rnormal(0,25) + 0.7*rnormal(0,5)
gen y = 1.3*x1 + 0.7*x2 + 0.5*u
* OLS Model
regress y x1 x2
* stuff to add
end
Again, I cut drop _all as unnecessary given the clear. Now the declaration eclass is double-edged. It gives the programmer scope for their program to save e-class results, but you have to say what they will be. That's the stuff to add indicated by a comment above.
Warning: I've tested none of this. I am not addressing the wider context. @Dimitriy V. Masterov's suggestion of lincom is likely to be a really good idea for whatever your problem is.
This might seem like a basic question:
I am loading a JSON file into Pig using Elephant Bird and I want only a few of its fields. I then want to use those fields to generate new fields and add them to the original JSON. I have the following Pig script:
data_input = LOAD '$DATA_INPUT' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []);
x = FOREACH data_input GENERATE json#'user__id_str' AS user_id, json#'user__created_at' AS user_created_at, json#'avl_user_days_active' AS user_days_active, json#'user__notifications' AS user_notifications, json#'user__follow_request_sent' AS user_follow_request_sent, json#'user__friends_count' AS user_following_count, json#'user__name' AS user_name, json#'user__time_zone' AS user_time_zone, json#'user__profile_background_color' AS user_profile_background_color, json#'user__is_translation_enabled' AS user_translation_enabled, json#'user__profile_link_color' AS user_profile_link_color, json#'user__utc_offset' AS user_UTC_offset, json#'user__profile_sidebar_border_color' AS user_profile_sidebar_border_color, json#'user__has_extended_profile' AS user_has_extended_profile, json#'user__profile_background_tile' AS user_profile_background_tile, json#'user__is_translator' AS user_is_tranlator, json#'user__profile_text_color' AS user_profile_text_color, json#'user__location' AS user_location, json#'user__profile_banner_url' AS user_profile_banner_url, json#'user__profile_use_background_image' AS user_profile_use_background_image, json#'user__default_profile_image' AS user_default_profile_image, json#'user__description' AS user_description, json#'user__profile_background_image_url_https' AS user_profile_background_image_url_https, json#'user__profile_sidebar_fill_color' AS user__profile_sidebar_fill_color, json#'user__followers_count' AS user_followers_count, json#'user__profile_image_url' AS user_profile_image_url, json#'user__geo_enabled' AS user_geo_enabled, json#'user__entities__description__urls' AS user_entities_description_urls, json#'user__screen_name' AS user_scren_name, json#'user__favourites_count' AS user_total_liked, json#'user__url' AS user_url, json#'user__statuses_count' AS user_total_posts, json#'user__default_profile' AS user_default_profile, json#'user__lang' AS user_language, json#'user__protected' AS user_protected, json#'user__listed_count' AS 
user_totalPublic_lists, json#'user__profile_image_url_https' AS user_profile_image_url_https, json#'user__contributors_enabled' AS user_contributors_enabled, json#'user__following' AS user_following, json#'user__verified' AS user_verified;
y1 = FOREACH x GENERATE user_total_posts/user_days_active as user_post_frequency;
y2 = GROUP x BY user_id;
z = FOREACH y2 GENERATE COUNT(x);
Now, I want to add the aliases y1 and y2 to x and write it to an output file. That is, I want to add new fields user_post_frequency and user_total_replies to x and store it.
How can I do that?
EDIT:
I tried to join the two aliases:
y1 = FOREACH x GENERATE user_id, user_total_posts/user_days_active as user_post_frequency;
y2 = JOIN x BY user_id, y1 BY user_id;
fs -rmr /tmp/user
STORE y2 INTO '/tmp/user' USING JsonStorage();
But my output looks like:
{"x::user_id":"9642792","x::user_created_at":"Wed Oct 24 02:44:30 +0000 2007","x::user_days_active":"3272","x::user_notifications":"false","x::user_follow_request_sent":"false","x::user_following_count":"500","x::user_name":"Everything Finance","x::user_time_zone":"Eastern Time (US & Canada)","x::user_profile_background_color":"131516","x::user_translation_enabled":"false","x::user_profile_link_color":"992E01","y1::user_id":"9642792","y1::user_post_frequency":10.910452322738386}
I do not want x:: and y1:: in the output, just the field names. What should I do?
You can use Pig's project-range capability to carry over all existing columns and add new columns to the relation.
$0 .. projects every field of the relation being iterated (here, x):
y1 = FOREACH x GENERATE $0 .., user_total_posts/user_days_active AS user_post_frequency;
y2 = GROUP y1 BY user_id;
z = FOREACH y2 GENERATE y1.$0 .., COUNT(y1) AS user_total_replies;
If you do not want the x:: and y1:: prefixes, add another FOREACH that renames the fields:
y3 = FOREACH y2 GENERATE x::user_id AS user_id, ....
I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce constraints for an optimization problem from the information in each row of this data frame, so I don't want to generate an output and add it to the data frame. This is how I have done it:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
But it does not work when I call:
df.apply(Computation, axis = 1)
Could you please let me know if there is any way to do this?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows:
# Sketch: iterrows() yields (index, row) pairs
for _, row in df.iterrows():
    constrained = Computation(row)
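The row-by-row loop can be sketched end to end on the sample data from the question. Here build_constraint is a hypothetical stand-in for Computation: it only returns the (B, C, D, A) index tuple, since the Gurobi objects m, y, tuplelist and quicksum are not available in this snippet.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({"A": [3, 5, 2], "B": [4, 2, 1], "C": [3, 2, 4], "D": [1, 2, 3]})


def build_constraint(row):
    # Hypothetical stand-in for Computation: returns the index tuple
    # that the real function would pass on to the solver.
    return (row["B"], row["C"], row["D"], row["A"])


# iterrows() yields (index, row) pairs; the index is not needed here.
constraints = []
for _, row in df.iterrows():
    constraints.append(build_constraint(row))
```

Collecting the results in a plain list keeps whatever the function returns untouched, instead of letting pandas try to fit it into a Series or DataFrame.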
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A']) # Will work as long as row['A'] is iterable
    # For the next 3 lines, see note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
Note: [<list>] * n creates n references to the same <list>, not n independent lists, so a change made through one of them shows up in all n. If that is not what you want, use a function. See this question and its answers for details. Specifically, this answer.
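The note can be checked with a small sketch. The variable names are illustrative only. It also shows that repeating an immutable scalar, which is what row['B'] appears to be in the sample data frame, is harmless.

```python
# Repeating a list that holds a mutable object copies the reference, not the object.
inner = [0]
shared = [inner] * 3
shared[0].append(1)  # mutates the single underlying list, visible at every position

# A comprehension builds a fresh inner list on each iteration.
independent = [[0] for _ in range(3)]
independent[0].append(1)  # only the first inner list changes

# Repeating an immutable value is safe: assignment rebinds a single slot.
scalars = [4] * 3
scalars[0] = 9
```

After this runs, every element of shared is the same object, while independent and scalars change only at index 0.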