Selecting a row from a dataframe based on condition in R - mysql

I am trying to subset a data frame based on each userid and order_date.
If ecomm_id and pulse_id exist in a row for that userid and order_date, that row should be selected into the new data frame.
Otherwise, only one row (with no ecomm_id) should be kept in the new data frame and all other rows discarded.
Sample data:
userid returning device store_n testid ecomm_id pulse_id order_date
1.00 1 0 9328 Experience E 1 23 7/25/2015
1.00 1 0 NA Experience E NA NA 7/25/2015
2.00 1 1 NA Experience C NA NA 7/14/2015
3.00 1 0 3486 Experience F 2 86 7/23/2015
3.00 1 0 NA Experience F NA NA 7/24/2015
3.00 1 0 NA Experience F NA NA 7/24/2015
Expected Output:
userid returning device store_n testid ecomm_id pulse_id order_date
1.00 1 0 9328 Experience E 1 23 7/25/2015
2.00 1 1 NA Experience C NA NA 7/14/2015
3.00 1 0 3486 Experience F 2 86 7/23/2015
3.00 1 0 NA Experience F NA NA 7/24/2015

Hope this helps!
df <- data.frame(userid = c(1, 1, 2, 3, 3, 3),
                 returning = c(1, 1, 1, 1, 1, 1),
                 device = c(0, 0, 1, 0, 0, 0),
                 store_n = c(9328, NA, NA, 3486, NA, NA),
                 testid = c('Experience E', 'Experience E', 'Experience C', 'Experience F', 'Experience F', 'Experience F'),
                 ecomm_id = c(1, NA, NA, 2, NA, NA),
                 pulse_id = c(23, NA, NA, 86, NA, NA),
                 order_date = c('7/25/2015', '7/25/2015', '7/14/2015', '7/23/2015', '7/24/2015', '7/24/2015'))
library(dplyr)
# count the distinct rows per (userid, order_date)
df1 <- unique(df) %>% group_by(userid, order_date) %>% summarise(count = n())
# attach the group count to every distinct row
df1 <- merge(unique(df), df1, by = c("userid", "order_date"))
# drop the all-NA rows only when the group has more than one row, then drop the count column
final_df <- df1[!(is.na(df1$ecomm_id) & is.na(df1$pulse_id) & df1$count > 1), -ncol(df1)]
Don't forget to let us know if it solved your problem :)

With data.table, this becomes a concise "one-liner":
library(data.table)
setDT(DT)[order(ecomm_id), .SD[1], keyby = .(userid, order_date)]
userid order_date returning device store_n testid ecomm_id pulse_id
1: 1.00 7/25/2015 1 0 9328 Experience E 1 23
2: 2.00 7/14/2015 1 1 NA Experience C NA NA
3: 3.00 7/23/2015 1 0 3486 Experience F 2 86
4: 3.00 7/24/2015 1 0 NA Experience F NA NA
By ordering by ecomm_id, the NA entries are moved to the bottom. Now, for each combination of userid and order_date the first element within that group is picked.
Note that this assumes there is at most one entry with a non-NA ecomm_id per group, because the OP has specified:
If ecomm_id and pulse_id exist in a row for that userid and order_date, that row should be selected into the new data frame.
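Since the snippet above references DT, which is not defined in this excerpt, here is a self-contained sketch that runs the same operation against the df from the dplyr answer above:
library(data.table)
# convert the reproducible df to a data.table without modifying df itself
DT <- as.data.table(df)
# sort so non-NA ecomm_id rows come first, then keep the first row per (userid, order_date)
DT[order(ecomm_id), .SD[1], keyby = .(userid, order_date)]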

How to export a list of dataframes generated as output within R to a single excel sheet

Below is a reproducible sample data frame generated in R. The number of columns is the same for each investor (given by ID); what differs is the number of rows. I would like to export this list of data frames (one per investor) into a single Excel sheet. There are over 3000 investors (with differing numbers of rows). Please help out.
[[993]]
investor asset quantity price datetime RG_count RL_count PG_count PL_count
1 1011 MC 2200 8 2016-03-02 0 0 0 0
2 1011 NIJL 100 50 2016-02-22 NA NA NA NA
3 1011 RPAL 300 2 2016-02-16 0 0 0 0
[[994]]
investor asset quantity price datetime RG_count RL_count PG_count PL_count
1 1156 LOYV 1400 10.54 2010-09-15 01:00:00 0 0 1 0
[[995]]
investor asset quantity price datetime RG_count RL_count PG_count PL_count
1 1140 LPC 13272 551.302 2017-03-27 01:00:00 0 0 1 0
[[996]]
investor asset quantity price datetime RG_count RL_count PG_count PL_count
1 1941 MBK 2700 62.20 2017-04-24 01:00:00 0 0 0 3
[[997]]
investor asset quantity price datetime RG_count RL_count PG_count PL_count
1 1944 JFM -79040 17.00 2011-07-14 01:00:00 0 0 1 0
2 1944 MC -221490 3.00 2010-10-20 01:00:00 0 0 1 0
3 1944 RAPL -59340 1.20 2012-03-13 00:00:00 0 0 0 0
4 1944 XT -56300 1.75 2012-03-22 00:00:00 NA NA NA NA
As was mentioned in the comments, I recommend using bind_rows() from the dplyr package to append your data frames, and write.xlsx from the openxlsx package to create your new file.
install.packages(c("dplyr", "openxlsx"))
library(dplyr)
library(openxlsx)
If all of your data exist in a single data frame, good news! Your adventure ends here.
example <- bind_rows(investors_dataframe)
write.xlsx(example, "/Users/Username/Documents/filename.xlsx")
However, if you are trying to combine multiple data frames, a good approach would be to create a list of all your data frames, then use that list as an argument for bind_rows:
example2 <- bind_rows(list_of_dataframes)
write.xlsx(example2, "/Users/Username/Documents/filename.xlsx")
The 3 most common solutions to bind lists are:
# Base R
do.call("rbind", myList)
# dplyr
dplyr::bind_rows(myList)
# data.table
data.table::rbindlist(myList)
Benchmark is here
The openxlsx or writexl package can be a good choice to write out your data.
Benchmark is here
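If you go the writexl route, a minimal sketch (the file name here is illustrative):
library(writexl)
# bind the list of per-investor data frames, then write them to a single sheet
write_xlsx(dplyr::bind_rows(myList), "investors.xlsx")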

BigQuery/SQL - Split value on specific variants

I need to replace the duplicated value in the market_offers column with a per-row share, so that the shares of all entries in a group sum back to the original value.
But rows can have different rank and country codes, so:
input
country_code rank store_id category_id offers market_offers
se 1 14582 1106 410 504860
se 1 1955 1294 2 504860
se 1 9831 1158 151 504860
se 2 666 11158 536 4000
se 2 6587 25863 6586 4000
se 2 6666 158 536 4000
se 5 65853 76722 1521 302
se 5 6587 25863 6586 302
expected result
country_code rank store_id category_id offers market_offers
se 1 14582 1106 410 168 286
se 1 1955 1294 2 168 286
se 1 9831 1158 151 168 286
se 2 666 11158 536 1333
se 2 6587 25863 6586 1333
se 2 6666 158 536 1333
se 5 65853 76722 1521 151
se 5 6587 25863 6586 151
Consider below
select * except(market_offers),
round(market_offers / count(1) over(partition by market_offers, rank), 2) as market_offers
from `project.dataset.table`
If applied to the sample data in your question, the query splits each market_offers value evenly across the rows of its group, as in the expected result above.
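For example, the three rank 1 rows share market_offers = 504860, so each gets round(504860 / 3, 2) = 168286.67; the three rank 2 rows get round(4000 / 3, 2) = 1333.33; and the two rank 5 rows get 302 / 2 = 151.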

How to reference on a different row when you have a joint key in MySQL workbench?

My code in MySQL Workbench:
SELECT
id,
user_id,
parent_id,
approved_time,
(@new_approved_t := IF(new_product_type != 'B', approved_time, @new_approved_t)) AS new_approved_time
FROM p
My data frame looks like this after I run the code:
id user_id parent_id product_type approved_time New_approved_time
30 11 NA A 8/4/2017 8/4/2017
31 11 30 B 12/1/2017 8/4/2017
54 5 NA A 5/5/2018 5/5/2018
322 5 54 B 7/22/2018 5/5/2018
21 5 NA C 8/1/2018 8/1/2018
13 5 NA C 8/2/2018 8/2/2018
2445 5 NA C 9/25/2018 9/25/2018
111 44 NA A 10/4/2018 10/4/2018
287 44 111 B 10/8/2018 10/4/2018
211 33 NA A 12/5/2018 12/5/2018
277 33 211 B 12/25/2018 12/5/2018
1120 33 NA C 1/1/2019 1/1/2019
1389 33 211 B 1/11/2019 1/1/2019
I would like every row whose product_type is 'B' to use the approved_time of its parent_id row as its new_approved_time. The result should look like below:
id user_id parent_id product_type approved_time New_approved_time
30 11 NA A 8/4/2017 8/4/2017
31 11 30 B 12/1/2017 8/4/2017
54 5 NA A 5/5/2018 5/5/2018
322 5 54 B 7/22/2018 5/5/2018
21 5 NA C 8/1/2018 8/1/2018
13 5 NA C 8/2/2018 8/2/2018
2445 5 NA C 9/25/2018 9/25/2018
111 44 NA A 10/4/2018 10/4/2018
287 44 111 B 10/8/2018 10/4/2018
211 33 NA A 12/5/2018 12/5/2018
277 33 211 B 12/25/2018 12/5/2018
1120 33 NA C 1/1/2019 1/1/2019
1389 33 211 B 1/11/2019 12/5/2018 <- this is where I don't know how to write my code
Thank you!
Found a solution using another column I didn't include in my example; it is basically the reverse of the parent_id column listed above, which only lists the id for product_type A rows and is NULL for the rest.
But I got the hint from Dan's code and the COALESCE() function, thanks Dan.
CASE WHEN c.new_product_type IN ('A', 'B') THEN @new_approved_t := IF(COALESCE(IF((a.is_loc=1 AND b.ploc_RID IS NOT NULL), a.id, NULL)), a.approved_time, @new_approved_t)
ELSE a.approved_time END AS new_approved_time
Because the parent row (pp) holds the value you need, a self join like this is required. Because of the join criteria, pp.approved_time will be NULL for non-B products.
SELECT
p.id,
p.user_id,
p.parent_id,
p.approved_time,
COALESCE(pp.approved_time, p.approved_time) AS new_approved_time
FROM p
LEFT JOIN p AS pp ON p.new_product_type = 'B' AND p.parent_id = pp.id
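As a check against the expected output: for id 1389 (parent_id 211, product_type B) the join matches pp.id = 211, whose approved_time is 12/5/2018, so COALESCE(pp.approved_time, p.approved_time) returns 12/5/2018, exactly the value the OP marked above.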

Rvest R not getting inner table

I'm trying to retrieve the Medals Table inside Wikipedia for Olympics 2012.
library(rvest)
library(magrittr)
url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
xpath0 <- '//*[@id="mw-content-text"]/table[1]'
xpath1 <- '//*[@id="mw-content-text"]/table[2]'
xpath2 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]'
xpath3 <- '//*[@id="mw-content-text"]/table[2]/tbody/tr/td[1]/table'
tb <- url %>%
html() %>%
html_nodes(xpath=xpath0) %>%
html_nodes("") %>%
html_table()
xpath0 or xpath1 return an error
Error in parse_simple_selector(stream) :
Expected selector, got <EOF at 1>
xpath2 and xpath3 return empty lists.
At the same time I tried to use SelectorGadget (https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html) to point to the exact element. I got
//td[(((count(preceding-sibling::*) + 1) = 1) and parent::*)] |
//*[contains(concat( " ", @class, " " ), concat( " ",
"headerSortDown", " " ))]
and the Error
Error in parse_simple_selector(stream) :
Expected selector, got
I really appreciate any help.
Joa
The first table with the names has a complicated structure and seems to be very difficult to convert into a standard format. At least I didn't succeed.
A summary of the number of medals by sport and the total medals can be obtained with
library(rvest) #v.0.2.0.9000
url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
tb <- read_html(url) %>% html_node("table.wikitable:nth-child(2)") %>% html_table(fill=TRUE)
#> head(tb)
# Medals by sport NA NA NA NA NA NA
#1 Sport 01 ! 02 ! 03 ! Total NA NA
#2 Swimming 16 9 6 31 NA NA
#3 Track & field 9 12 7 28 NA NA
#4 Gymnastics 3 1 2 6 NA NA
#5 Shooting 3 0 1 4 NA NA
#6 Tennis 3 0 1 4 NA NA
Then there is another table summarizing all competitors that you can get with
tb2 <- read_html(url) %>% html_node("table.wikitable:nth-child(20)") %>% html_table()
#> head(tb2)
# Sport Men Women Total
#1 Archery 3 3 6
#2 Athletics (track and field) 63 62 125
#3 Badminton 2 1 3
#4 Basketball 12 12 24
#5 Boxing 9 3 12
#6 Canoeing 5 2 7
And this is the table of multiple medalists:
tb3 <- read_html(url) %>% html_node("table.wikitable:nth-child(8)") %>% html_table(fill=TRUE)
#> head(tb3)
# Multiple medalists NA NA NA NA NA NA
#1 Name Sport 01 ! 02 ! 03 ! Total NA
#2 Michael Phelps Swimming 4 2 0 6 NA
#3 Missy Franklin Swimming 4 0 1 5 NA
#4 Allison Schmitt Swimming 3 1 1 5 NA
#5 Ryan Lochte Swimming 2 2 1 5 NA
#6 Allyson Felix Track & field 3 0 0 3 NA
It really depends on which table you want to have, as pointed out by @Metrics. There are many tables on that page.
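If you are not sure which table index you need, a small sketch that pulls every wikitable on the page so you can inspect them (same URL as above):
library(rvest)
url <- "https://en.wikipedia.org/wiki/United_States_at_the_2012_Summer_Olympics"
# grab every table with class "wikitable" and parse them all into a list of data frames
tables <- read_html(url) %>% html_nodes("table.wikitable") %>% html_table(fill = TRUE)
length(tables)   # how many were found; inspect tables[[i]] to pick the one you need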

Storing shipping prices in MySQL table where many columns will be added in future

I'm looking to create a system where users specify shipping prices for their items. The variables I need to cover are the weight band (from weight to weight, in grams), the price, and the countries covered. From this data I can calculate shipping cost by referencing the customer's country and the total weight of the basket.
My first thought is something like this:
id from_weight to_weight price us ca gb fr es de it
------------------------------------------------------------------
1 0g 499g 1.99 Y Y N N N N N
2 500g 999g 2.99 Y Y N N N N N
3 1000g 1999g 4.99 Y Y N N N N N
4 2000g 2999g 7.99 Y Y N N N N N
5 0g 499g 4.99 N N Y Y Y Y Y
6 500g 999g 6.99 N N Y Y Y Y Y
7 1000g 1999g 9.99 N N Y Y Y Y Y
8 2000g 2999g 14.99 N N Y Y Y Y Y
However, the plan would be to add more and more country options, which would mean adding more columns each time. Is this the best way to structure this kind of data? Any other suggestions?
Normally it is preferred practice to leave the table structure the same and just add rows to cater for the case you illustrated (there are reasons, such as optimisation, where you can deviate from this).
I would suggest looking up "3rd Normal Form"; if your database complies with the rules of 3rd normal form, you generally end up with a lot less maintenance and easier extensibility down the track.
table1
id | from_weight | to_weight | price
1 | 0g | 499g | 1.99
table2
id | table1id | countrycode | status
1 | 1 | us | Y
2 | 1 | ca | Y
3 | 1 | gb | N
This is how you would query the data:
select price from table1
join table2 on table1.id=table2.table1id
where countrycode='us' and status='Y' and
300 between from_weight and to_weight
If you want to avoid adding columns, you can add a 2nd table that has priceID and Country as columns and just remove all the country columns. Then you just make priceID have a foreign key referencing the ID from your first table and you can add new countries as needed.
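A minimal sketch of that layout (the table and column names here are illustrative, not taken from the question):
-- hypothetical names; price_id references the id of the existing price table
CREATE TABLE price_country (
  price_id INT NOT NULL,
  country CHAR(2) NOT NULL,
  PRIMARY KEY (price_id, country),
  FOREIGN KEY (price_id) REFERENCES shipping_price(id)
);
A row such as (1, 'fr') would then mean that price row 1 applies to France.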
Yes, I'd say so. Your goal is going to be to find the price for a given weight and country combination, so it should be easy to perform this query:
SELECT price FROM table WHERE $weight > from_weight AND $weight < to_weight AND $country = 'Y';
And your schema allows that easily. I'd recommend that you select your to_weight such that a value of 499.9 fits in one of the categories, unless weight is restricted to be an integer. I wouldn't worry about adding new columns, that's easy and you can default it to false for any new country.
It seems the rows of your table form groups of related pricing plans. If that's the case, I would suggest you have a PricingPlan table and a PlanDetail table.
PricingPlan
-----------
PricingPlanId* PlanTitle
----------------------------------------
1 planA for North America
2 planB for EU
PlanDetail
----------
PricingPlanId* DetailId* FromWeight ToWeight Price
------------------------------------------------------
1 1 0g 499g 1.99
1 2 500g 999g 2.99
1 3 1000g 1999g 4.99
1 4 2000g 2999g 7.99
2 1 0g 499g 4.99
2 2 500g 999g 6.99
2 3 1000g 1999g 9.99
2 4 2000g 2999g 14.99
A third table, PlanCountry, should be used (together with a Country lookup table) so you don't have to add any columns for new countries that you want to relate to a pricing plan; a new row would be added in PlanCountry if, for example, you want Mexico to be included in planA:
Country
-------
CountryCode* CountryName
-----------------------------
us USA
ca Canada
uk United Kingdom
fr France
es Spain
de Germany
it Italy
PlanCountry
-----------
PricingPlanId* CountryCode*
---------------------------
1 us
1 ca
2 uk
2 fr
2 es
2 de
2 it
1 mx --- Mexico added for planA
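A sketch of the lookup against this schema, assuming FromWeight/ToWeight are stored as integer grams rather than the '0g'-style strings shown above:
-- price for a 300 g basket shipped to Mexico
SELECT pd.Price
FROM PlanCountry pc
JOIN PlanDetail pd ON pd.PricingPlanId = pc.PricingPlanId
WHERE pc.CountryCode = 'mx'
  AND 300 BETWEEN pd.FromWeight AND pd.ToWeight;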