Group_by results in duplicate rows

I am grouping monthly data from individual to team level using the following code:
teamdata <- individualdata %>%
  group_by(team_month) %>%
  summarise(individual_count = sum(is_teammember))
Unfortunately, this produces numerous duplicate rows for each team_month with near-identical entries. How can I avoid this?
Thanks for helping me.
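For reference, a minimal sketch of what group_by() plus summarise() normally returns, using made-up data since the original individualdata is not shown: exactly one row per distinct team_month. When duplicates show up in practice, the grouping values often differ invisibly (stray whitespace or case), which is worth ruling out first.
library(dplyr)

# Hypothetical stand-in for individualdata
individualdata <- tibble(
  team_month    = c("A-2023-01", "A-2023-01", "B-2023-01"),
  is_teammember = c(1, 1, 0)
)

individualdata %>%
  group_by(team_month) %>%
  summarise(individual_count = sum(is_teammember))
# 2 rows: one per team_month. If you see duplicates on your real data,
# inspect the raw grouping keys, e.g.:
# individualdata %>% count(trimws(team_month))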

Related

sqlalchemy.orm.subquery does not seem to load only specific columns

How can I load only specific columns when using subquery()?
It seems like subquery() loads all the columns from the table even though I included the load_only option before calling subquery().
Code snippet:
results = session.query(User).options(load_only(User.name, User.fullname))
results = results.subquery()
The first results statement only loads name and fullname from User, but the second loads all the columns.
Any help is greatly appreciated. Thanks so much.
I just found a solution. It works if I do
results = session.query(User.name, User.fullname)
results = results.subquery()
Not sure if anyone has a better solution?

How to query a list of IDs in a database using dplyr in R

I'm new to using R to manipulate data from a database.
I want to know how to query a list of IDs in a database table, so that the query returns all records for any of the IDs it finds.
Previously I queried just one ID with the code below:
start_1 <- tbl(connect, "accountbits") %>%
  filter(Tranx_id == "2022011813250866101336997") %>%
  collect()
This returns the record with all details attached to that ID.
Now I want to pass many IDs, as in the example below:
start_2 <- tbl(connect, "accountbits") %>%
  filter(Tranx_id = c("2022011813250866101336997", "20220115675250866101336997",
                      "202201181325086610143246997", "2022015433250866101336997")) %>%
  collect()
I want it to bring back all records attached to these IDs in the database.
Thank you
The R operator you are looking for is %in%. It checks, for each element on the left, whether it is a member of the vector on the right:
c(1, 3, 5) %in% c(1, 2, 3, 4)
# [1]  TRUE  TRUE FALSE
because 1 and 3 are in c(1,2,3,4), but 5 is not.
You can type ?`%in%` at the console for help info about this operator (` is the backtick, located next to the number 1 in the top-left corner of most keyboards).
There are dbplyr translations defined for %in%, so a command like:
start_2 <- tbl(connect, "accountbits") %>%
  filter(Tranx_id %in% c("1234", "2345", "3456"))
will translate into SQL like:
SELECT *
FROM accountbits
WHERE Tranx_id IN ('1234', '2345', '3456')
and collect() will pull those rows into local R memory as expected.
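If you want to double-check what dbplyr generates before pulling anything, show_query() prints the translated SQL (a quick sketch; connect and accountbits are the connection and table from the question, and the IDs are placeholders):
start_2 <- tbl(connect, "accountbits") %>%
  filter(Tranx_id %in% c("1234", "2345", "3456"))

# Print the translated SQL without executing the full query
show_query(start_2)

# Then fetch the matching rows into a local tibble
collect(start_2)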

Scrape table using rvest - Embedded symbols/links

I tried to scrape the table on the following webpage: http://www.comstats.de/squad/1-FC+Bayern+München
My approach is successful at first glance using the following code:
read_html("http://www.comstats.de/squad/1-FC+Bayern+München") %>%
  html_node("#inhalt > table.rangliste.autoColor.tablesorter.zoomable") %>%
  html_table(header = TRUE, fill = TRUE)
However, the second column contains a varying number of linked symbols, which corrupts the table with differing numbers of elements per row (which is why fill = TRUE is needed).
I have been researching for hours... Who can help me out?
In case someone is searching for an answer to such questions as well: one possible solution is to use the package htmltab (https://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.html):
library(htmltab)
htmltab(doc = "http://www.comstats.de/squad/1-FC+Bayern+München", which = '//*[@id="inhalt"]/table[2]')
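If you would rather stay within rvest, one possible workaround (a rough sketch, untested against the current page since the markup may have changed) is to walk the rows yourself and keep only each cell's text, so the embedded links and symbols collapse into plain strings:
library(rvest)

page <- read_html("http://www.comstats.de/squad/1-FC+Bayern+München")

# Grab every row of the target table, then reduce each cell to its text
rows  <- html_elements(page, "#inhalt table.rangliste tr")
cells <- lapply(rows, function(row) html_text2(html_elements(row, "th, td")))

# cells is a list of character vectors, one per row; once the column
# counts line up, it can be bound into a data frame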

Pulling data from two tables (for a 'related products' query) in MySQL

I'm trying to pull data from two tables for a 'related products' widget.
I've tried all the JOINs and UNIONs I can think of and still get nothing.
The first table (productdocs) stores documents. The second (prodrelated) shows when a product is related to a document:
productdocs
pdid (unique ID for the document)
pdname (name of the uploaded document)
prodrelated
prprodid (the ID for the PRODUCT)
pritemid (the ID for the document)
I am trying to output the productdocs.pdname for any documents that match up with the product's ID. In other words, show the pdname when:
WHERE productdocs.pdid = prodrelated.pritemid
I would post my SQL code, but none of it has worked, so I think it would be pointless. I hope I explained this correctly given my frazzled brain - Any help greatly appreciated.
You can use a simple INNER JOIN for this, e.g.:
SELECT pd.pdid, pd.pdname
FROM productdocs pd
JOIN prodrelated pr ON pd.pdid = pr.pritemid
WHERE pr.prprodid = <any_id>;
If you don't want to filter out any records, you can drop the WHERE clause and it will output all the records.
Here's MySQL's documentation for JOIN.
Wow you guys are fast - thank you so much.
Darshan - thank you above all, I was able to make a few mods to what you wrote and it worked great. I tried to +1 your answer but maybe I don't have enough 'reputation'? Here is what I got working, thanks to you:
SELECT pd.pdid, pd.pdname
FROM productdocs pd
JOIN prodrelated pr
ON pd.pdid = pr.pritemid
WHERE pr.prprodid = '#url.prodid#'
In the future I will try to post some code example, but on this one I honestly tried at least 7 different queries so I had no idea which to post!

Finding correct xpath for a table without an id

I am following a tutorial on R-Bloggers using rvest to scrape table. I think I have the wrong column id value, but I don't understand how to get the correct one. Can someone explain what value I should use, and why?
As @hrbrmstr points out, this is against the WSJ terms of service; however, the answer is useful for those who face a similar issue with a different webpage.
library("rvest")
interest <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="column0"]/table[1]') %>%
  html_table()
The structure returned is an empty list.
For me it is usually trial and error to find the correct table. In this case, the third table is what you are looking for:
library("rvest")
page<-url("http://online.wsj.com/mdc/public/page/2_3020-libor.html")%>%read_html()
tables<-html_nodes(page, "table")
html_table(tables[3])
Instead of using the xpath, I just parsed out the "table" tags and looked at each table to locate the correct one. The piping command is handy, but it makes it harder to debug when something goes wrong.
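Along the same lines, a small sketch for locating the right table by eye (assuming the page is still reachable; the object names are placeholders):
library(rvest)

page   <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>% read_html()
tables <- html_nodes(page, "table")

# Print the dimensions of each candidate table to spot the one you want
lapply(tables, function(t) dim(html_table(t, fill = TRUE)))

# Once identified, extract that table directly (here, the third one)
libor <- html_table(tables[[3]], fill = TRUE)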