Webscraping html table using R - html

I had some help from users of Stackoverflow already, trying to solve this problem. However, I ran into new trouble:
library(rvest)
URL <- "http://karakterstatistik.stads.ku.dk/Histogram/ASOB05038E/Summer-2015"
pg <- read_html(URL)
get_val <- function(x, label) {
xpath <- sprintf(".//table/tr/td[contains(., '%s')][1]/following-sibling::td", label)
html_nodes(x, xpath=xpath) %>%
html_text() %>%
trimws()
}
library("stringr")
trimmed = get_val(pg, "Karakter") %>%
str_replace_all(pattern = "\\n|\\t|\\r" ,
replacement = "")
trimmed
I want to get the exam results for both the retake and the exam, but since both of the headlines for the two tables are the same, R only takes the values from the retake.
To be specific, I would like to get the column "Antal" right next to the grades, 12, 10, 7, 4, 02, 00, -3 in both the tables under the headline Resultater
Any help would be appreciated a lot! :)

results <- html_nodes(pg, xpath=".//td[@style='width: 50%;' and
descendant::h3[contains(text(), 'Resultater')]]/table")
html_table(results[[1]])
## Karakter Antal Antal
## 1 12 11 (9,6 %)
## 2 10 48 (41,7 %)
## 3 7 41 (35,7 %)
## 4 4 4 (3,5 %)
## 5 02 1 (0,9 %)
## 6 00 1 (0,9 %)
## 7 -3 4 (3,5 %)
## 8 Ej mødt 5 (4,3 %)
html_table(results[[2]])
## Karakter Antal Antal
## 1 12 0 (0,0 %)
## 2 10 0 (0,0 %)
## 3 7 1 (9,1 %)
## 4 4 1 (9,1 %)
## 5 02 1 (9,1 %)
## 6 00 1 (9,1 %)
## 7 -3 0 (0,0 %)
## 8 Ej mødt 7 (63,6 %)
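To pull only the count column out of both tables, the same XPath idea can be tried on a small inline page. The HTML below is a made-up reduction of the real page (two side-by-side cells, each with a "Resultater" heading and a nested table), so treat it as a sketch rather than a copy of the site's markup:

```r
library(rvest)

# Hypothetical, heavily reduced stand-in for the real page.
html <- '<html><body><table><tr>
  <td style="width: 50%;"><h3>Resultater</h3>
    <table>
      <tr><td>Karakter</td><td>Antal</td></tr>
      <tr><td>12</td><td>11</td></tr>
      <tr><td>10</td><td>48</td></tr>
    </table></td>
  <td style="width: 50%;"><h3>Resultater</h3>
    <table>
      <tr><td>Karakter</td><td>Antal</td></tr>
      <tr><td>12</td><td>0</td></tr>
      <tr><td>10</td><td>1</td></tr>
    </table></td>
</tr></table></body></html>'

pg <- read_html(html)

# Both inner tables match, because the predicate is evaluated per <td>.
results <- html_nodes(pg, xpath = ".//td[@style='width: 50%;' and descendant::h3[contains(text(), 'Resultater')]]/table")

# html_table() on a node set returns one table per node.
tables <- html_table(results, header = TRUE)

# Take the second column by position from each table.
antal <- lapply(tables, function(tbl) tbl[[2]])
```

Selecting by position is deliberate: on the real page two columns share the name "Antal" (count and percentage), so selecting by name could be ambiguous.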

Related

Retrieving data from HTML in RStudio

I want to retrieve data frame from this HTML : https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1
Is there a simple way to get the table from this site? I tried the approach below, but I don't know what to pass to html_node:
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("responsive-table") %>%
html_text()
You can right-click on the table and choose Inspect to see the relevant selectors:
Use html_node("#yw1 > table") since you want the <table> directly inside the element with id="yw1"
Change html_text() to html_table() since this is tabular data
Add drop_na('#') to remove superfluous rows (rows that have NA values in the # column)
library(rvest)
library(tidyverse)
transfermarkt <- xml2::read_html("https://www.transfermarkt.pl/pko-ekstraklasa/torschuetzenliste/wettbewerb/PL1/saison_id/2020/altersklasse/alle/detailpos//plus/1")
transfermarkt %>%
html_node("#yw1 > table") %>%
html_table() %>%
drop_na('#')
# | Zawodnik | Narodowość | Wiek (obecny) | Klub | Czas na boisku | Gole na mecz
1 | Tomas Pekhart Środkowy napastnik | NA | Tomas Pekhart | Środkowy napastnik | NA | 31 | 19 | 0 | 5 | 1.510' | 79' | 1,00
2 | Jesús Imaz Ofensywny pomocnik | NA | Jesús Imaz | Ofensywny pomocnik | NA | 30 | 19 | 4 | 1 | 1.610' | 161' | 0,53
3 | Flávio Paixão Środkowy napastnik | NA | Flávio Paixão | Środkowy napastnik | NA | 36 | 22 | 3 | 4 | 1.693' | 188' | 0,41
...
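The three steps above (id selector, html_table(), drop_na()) can be tried offline on a tiny inline page. The markup and player names below are invented for illustration and only mimic the shape of the real table, where detail rows have an empty # cell:

```r
library(rvest)
library(tidyr)

# Hypothetical miniature of the page: a <table> directly inside id="yw1".
html <- '<html><body><div id="yw1"><table>
  <tr><th>#</th><th>Zawodnik</th></tr>
  <tr><td>1</td><td>Player A</td></tr>
  <tr><td></td><td>Player A (detail row)</td></tr>
  <tr><td>2</td><td>Player B</td></tr>
</table></div></body></html>'

tbl <- read_html(html) %>%
  html_node("#yw1 > table") %>%  # the table that is a direct child of #yw1
  html_table() %>%               # parse the tabular markup into a data frame
  drop_na("#")                   # drop rows whose "#" cell was empty

tbl
```

The empty # cell is read as NA when the column is converted to a number, which is what lets drop_na() filter out the detail rows.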

R Extracting table from web, without <span class="hidden">

I am trying to scrape the results of the Polish elections that were held this weekend, but I have run into a problem: a seemingly random float is prepended to every integer.
I have tried using htmltab, but it did not work. As you can see, a random number is added:
library(htmltab)
url <- "https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council"
tmp <- htmltab::htmltab(doc = url, which = 1)
tmp
Wyszczególnienie Liczba
2 Mieszkańców 0.972440432 755 957
3 Wyborców 0.977263472 273 653
4 Obwodów 0.99998061 940
I have checked in html what is the problem:
library(xml2)
library(rvest)
webpage <- xml2::read_html(url)
a <- webpage %>%
rvest::html_nodes("tbody")
a[1]
<tbody>\n<tr>\n<td>Mieszkańców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97244043</span>2 755 957</td>\n </tr>\n<tr>\n<td>Wyborców</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.97726347</span>2 273 653</td>\n </tr>\n<tr>\n<td>Obwodów</td>\n <td class=\"table-number\">\n<span class=\"hidden\">0.9999806</span>1 940</td>\n </tr>\n</tbody>"
I assume the problem is with <span class=\"hidden\">, but how to get rid of it?
EDIT
I need the info from the 9th table with results of the parties
Nr listy Komitet wyborczy Liczba % głosów ważnych
Głosów na kandydatów komitetu Kandydatów
12 KOMITET WYBORCZY WYBORCÓW Z DUTKIEWICZEM DLA DOLNEGO ŚLĄSKA 93 260 45 8.29%
9 KOMITET WYBORCZY WYBORCÓW WOLNOŚĆ W SAMORZĄDZIE 15 499 46 1.38%
8 KOMITET WYBORCZY WYBORCÓW KUKIZ'15 53 800 41 4.78%
1 KOMITET WYBORCZY WYBORCÓW BEZPARTYJNI SAMORZĄDOWCY 168 442 46 14.98%
11 KOMITET WYBORCZY WOLNI I SOLIDARNI 9 624 38 0.86%
7 KOMITET WYBORCZY RUCH NARODOWY RP 14 874 38 1.32%
10 KOMITET WYBORCZY PRAWO I SPRAWIEDLIWOŚĆ 320 908 45 28.53%
2 KOMITET WYBORCZY POLSKIE STRONNICTWO LUDOWE 58 820 46 5.23%
6 KOMITET WYBORCZY PARTII RAZEM 18 087 44 1.61%
3 KOMITET WYBORCZY PARTIA ZIELONI 19 783 36 1.76%
5 KOALICYJNY KOMITET WYBORCZY SLD LEWICA RAZEM 61 889 46 5.50%
4 KOALICYJNY KOMITET WYBORCZY PLATFORMA.NOWOCZESNA KOALICJA OBYWATELSKA 289 831 46 25.77%
EDIT 2
I have found a solution, though not the most elegant one:
#https://stackoverflow.com/questions/7963898/extracting-the-last-n-characters-from-a-string-in-r
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
tmp <- htmltab::htmltab(doc = url, which = 9)
tmp2 <- xml2::read_html(url) %>%
rvest::html_nodes("tbody") %>%
magrittr::extract2(9) %>%
rvest::html_nodes("tr") %>%
rvest::html_nodes("td") %>%
rvest::html_nodes("span") %>%
rvest::html_text() %>%
matrix(ncol = 4, byrow = T) %>%
data.frame()
names(tmp) <- c("a", "b", "c", "d", "e", "f", "g")
tmp3 <- cbind(tmp, tmp2) %>%
mutate(n_to_delate = nchar(X1),
c1 = as.character(c),
n_whole = nchar(c1),
c2 = substrRight(c1, n_whole - n_to_delate),
c3 = gsub(" ", "", c2),
c4 = as.numeric(c3)) %>%
select(b, c4)
names(tmp3) <- c("party", "n_of_votes")
Solving the original question:
You can remove those nodes before the conversion to a table:
library(rvest)
pg <- read_html("https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council")
tbl_1 <- html_nodes(pg, xpath=".//table[@class = 'stat_table']")[1]
xml_remove(html_nodes(tbl_1, xpath=".//span[@class='hidden']"))
html_table(tbl_1)
## [[1]]
## Wyszczególnienie Liczba
## 1 Mieszkańców 2 755 957
## 2 Wyborców 2 273 653
## 3 Obwodów 1 940
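A minimal offline sketch of the same idea, assuming a table shaped like the fragment shown in the question (the row labels here are placeholders, not the site's text). It parses the table before and after removing the hidden spans so the effect is visible:

```r
library(rvest)
library(xml2)

# Hypothetical fragment modelled on the <tbody> shown in the question.
html <- '<html><body><table class="stat_table"><tbody>
  <tr><td>Residents</td><td class="table-number"><span class="hidden">0.97244043</span>2 755 957</td></tr>
  <tr><td>Precincts</td><td class="table-number"><span class="hidden">0.9999806</span>1 940</td></tr>
</tbody></table></body></html>'

pg  <- read_html(html)
tbl <- html_node(pg, xpath = ".//table[@class='stat_table']")

# Before: the hidden sort key is glued to the visible number.
before <- html_table(tbl, header = FALSE)

# xml_remove() modifies the parsed document in place.
xml_remove(html_nodes(tbl, xpath = ".//span[@class='hidden']"))

# After: only the visible text remains.
after <- html_table(tbl, header = FALSE)
after
```

Because xml_remove() changes the document itself, any later query against pg also sees the table without the spans.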
Solving the updated requirements:
library(rvest)
pg <- read_html("https://wybory2018.pkw.gov.pl/pl/geografia/020000#results_vote_council")
Let's target that particular table. Using the "View Source" version of the document, we can go for the header that precedes that table and then go to the table:
target_tbl <- html_node(pg, xpath=".//header[contains(., 'mandatów pomiędzy')]/following-sibling::table")
Still get rid of the hidden spans:
xml_remove(html_nodes(target_tbl, xpath=".//span[@class='hidden']"))
Now, we need to know how many real columns there are since it has one of those daft headers that are multi-line with <td>'s that span multiple columns:
length(
html_nodes(target_tbl, xpath=".//tbody/tr[1]") %>%
html_nodes("td")
) -> n_cols
Now we pull out each column, set good column names, turn it into a data frame and remove the junk column that is just feeding the filled in bars:
as.data.frame(
setNames(
lapply(1:n_cols, function(.idx) {
html_nodes(target_tbl, xpath=sprintf(".//tbody/tr/td[%s]", .idx)) %>%
html_text(trim=TRUE)
}),
c(
"nr_listy", "komitet_wyborczy", "głosów_na_kandydatów_komitetu",
"kandydatów", "mandatów", "pct_głosów_ważnych", "junk",
"udział_w_podziale_mandatów"
)
),
stringsAsFactors = FALSE
) -> xdf
xdf$junk <- NULL
str(xdf)
## 'data.frame': 12 obs. of 7 variables:
## $ nr_listy : chr "1" "2" "3" "4" ...
## $ komitet_wyborczy : chr "KOMITET WYBORCZY WYBORCÓW BEZPARTYJNI SAMORZĄDOWCY" "KOMITET WYBORCZY POLSKIE STRONNICTWO LUDOWE" "KOMITET WYBORCZY PARTIA ZIELONI" "KOALICYJNY KOMITET WYBORCZY PLATFORMA.NOWOCZESNA KOALICJA OBYWATELSKA" ...
## $ głosów_na_kandydatów_komitetu: chr "168 442" "58 820" "19 783" "289 831" ...
## $ kandydatów : chr "46" "46" "36" "46" ...
## $ mandatów : chr "6" "1" "0" "13" ...
## $ pct_głosów_ważnych : chr "14.98%" "5.23%" "1.76%" "25.77%" ...
## $ udział_w_podziale_mandatów : chr "Tak" "Tak" "Nie" "Tak" ...
I don't think piping makes the lapply() block more readable but just in case it's preferred:
lapply(1:n_cols, function(.idx) {
html_nodes(target_tbl, xpath=sprintf(".//tbody/tr/td[%s]", .idx)) %>%
html_text(trim=TRUE)
}) %>%
setNames(c(
"nr_listy", "komitet_wyborczy", "głosów_na_kandydatów_komitetu",
"kandydatów", "mandatów", "pct_głosów_ważnych", "junk",
"udział_w_podziale_mandatów"
)) %>%
as.data.frame(stringsAsFactors = FALSE) -> xdf
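All columns in xdf come back as character. If numbers are needed, one possible follow-up is to strip the thousands separators and the percent sign. This is a sketch on a two-row stand-in for xdf, using values from the output above; the column names votes and pct are shortened placeholders:

```r
# Two-row stand-in for xdf (values copied from the str() output above).
xdf_small <- data.frame(
  votes = c("168 442", "58 820"),
  pct   = c("14.98%", "5.23%"),
  stringsAsFactors = FALSE
)

# Drop everything that is not a digit (covers plain and non-breaking spaces).
xdf_small$votes_n <- as.integer(gsub("[^0-9]", "", xdf_small$votes))

# Remove the literal percent sign before converting.
xdf_small$pct_n <- as.numeric(sub("%", "", xdf_small$pct, fixed = TRUE))

str(xdf_small)
```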

Scraping wikipedia table r

Trying to scrape the first 8 tables (very high, high, medium, low) from the human development index in Wikipedia.
I started with the code below, but I'm getting a list of length zero. What am I doing wrong? New to R :(
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index#Complete_list_of_countries"
webpage <- read_html(url)
hdi_tables <- html_nodes(webpage, 'table')
head(hdi_tables, n = 10)
scrape <- url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div/div[5]/table/tbody/tr/td[1]/table') %>%
html_table()
head(scrape, n=10)
I think it would be easier to work with the original data source:
Select "Human Development Index (HDI)" in both the drop-down select lists, then click the "Download Data" link to get a CSV file named Human Development Index (HDI).csv.
Read it into R:
library(tidyverse)
Human_Development_Index_HDI_ <- read_csv("path/to/Human Development Index (HDI).csv",
skip = 1)
You can reshape the data, get the values for 2015 and classify countries as low, medium, high or very high:
hdi <- Human_Development_Index_HDI_ %>%
gather(Year, HDI, -`HDI Rank (2015)`, -Country) %>%
filter(Year == "2015") %>%
na.omit() %>%
mutate(Year = as.numeric(Year),
classification = cut(HDI,
breaks = c(0, 0.549, 0.699, 0.799, 1),
labels = c("low", "medium", "high", "very_high")))
hdi
# A tibble: 188 x 5
`HDI Rank (2015)` Country Year HDI classification
<int> <chr> <dbl> <dbl> <fctr>
1 169 Afghanistan 2015 0.479 low
2 75 Albania 2015 0.764 high
3 83 Algeria 2015 0.745 high
4 32 Andorra 2015 0.858 very_high
5 150 Angola 2015 0.533 low
6 62 Antigua and Barbuda 2015 0.786 high
7 45 Argentina 2015 0.827 very_high
8 84 Armenia 2015 0.743 high
9 2 Australia 2015 0.939 very_high
10 24 Austria 2015 0.893 very_high
# ... with 178 more rows
You could change the filter to get values for 2014 too, if you want to replicate the "change from previous year" values in the Wikipedia table.
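To see how the cut() call assigns the classes, here is a small base R check using four HDI values from the output above plus one invented value (0.600, labelled "Hypothetical") to exercise the "medium" band:

```r
# Four values from the output above, plus one made-up value for "medium".
hdi_values <- c(Afghanistan = 0.479, Albania = 0.764, Andorra = 0.858,
                Angola = 0.533, Hypothetical = 0.600)

# Intervals are right-closed by default: (0, 0.549], (0.549, 0.699], ...
classification <- cut(hdi_values,
                      breaks = c(0, 0.549, 0.699, 0.799, 1),
                      labels = c("low", "medium", "high", "very_high"))

data.frame(country = names(hdi_values), hdi = unname(hdi_values),
           classification = classification)
```

Since the intervals are closed on the right, a value of exactly 0.549 would fall in "low" and a value of exactly 0 would become NA.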
If you're okay with parsing the wikipedia markup language instead, you could try using WikipediR to grab the markup of the page (from skimming the documentation, try page_content with as_wikitext set to true). Then you'll get some lines that all look like this:
| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001
This should be parseable in R using strsplit or something.
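As a rough sketch of that parsing step, the sample line above can be split on the "||" cell separator in base R. This only handles rows shaped like that one line; other rows in the real wikitext may differ:

```r
line <- '| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001'

# Drop the leading "|" and split on the literal cell separator "||".
cells <- strsplit(sub("^\\|", "", line), "||", fixed = TRUE)[[1]]
cells <- trimws(cells)

rank    <- as.integer(cells[1])
# Pull the country name out of the {{flag|...}} template.
country <- sub(".*\\{\\{flag\\|([^}]+)\\}\\}.*", "\\1", cells[3], perl = TRUE)
hdi     <- as.numeric(cells[4])

data.frame(rank = rank, country = country, hdi = hdi)
```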

Scraping html text into table with delimiters that do not have a clear pattern using R (rvest)

I'm just learning how to use R to scrape data from webpages, and I'm running into a couple of issues.
For reference, the website that I am practicing on is here: http://www.rsssf.com/tables/34q.html
As far as I know, the website I am scraping data from is not a table so I can't directly scrape the information into a table, so here is the code I wrote to just have all of the text:
wcq_1934_html <- read_html("http://www.rsssf.com/tables/34q.html")
wcq_1934_node <- html_nodes(wcq_1934_html, "pre")
wcq_1934_text <- html_text(wcq_1934_node, trim = TRUE)
This results in a very long text file with all of the information that I need, just not formatted in an ideal way.
So I am next attempting to substring this text in order to get an output that looks something like this.
Country A - Country A Score - Country B - Country B Score
It doesn't have to be exactly like this, I just basically need for each game the country and how many goals they scored and ideally it should be comparable with the other country from the same game so I can know who won or lost! I do not need any of the other information like where the game was played, etc.
So I've tried three different ways to get this:
First test: split text by dashes:
test <- strsplit(wcq_1934_text, "-")
df_test <- data.frame(test)
This gives me the information I need in a table but the rows don't match the exact scores that I need (i.e. Lithuania 0, and Sweden 2 are in separate rows)
Second test: split text by spaces:
test2 <- strsplit(wcq_1934_text, " ")
df_test2 <- data.frame(test2)
This is helpful because it gives me the scores in one row (0-2 for the first game), but the countries are unevenly spaced out across rows.
Third test: split text by "tabs"
test3 <- strsplit(wcq_1934_text, " ")
df_test3 <- data.frame(test3)
This has a similar issue to the first test.
Any suggestions would be much appreciated. This is my first ever Stack Overflow post, although I've lurked around and this website has been helpful to me for a very long time. Thank you in advance!
Here's a solution that provides you most of what you need, though as MrFlick commented, it is a little fragile to this page. I'll stay with rvest, though as biomiha suggested, it isn't really buying you a lot here (though it does cleanly break out the <pre> block).
Starting with your wcq_1934_text, it's a single long string, let's break it up by newlines (CRLF in this case):
wcq_1934_text <- strsplit(wcq_1934_text, "[\r\n]+")[[1]]
str(wcq_1934_text)
# chr [1:51] "Hosts: Italy (not automatically qualified)" "Holders: Uruguay (did not enter)" "Group 1 [Sweden]" ...
I'll use the magrittr package merely because it helps break out each step of the process using the %>% pipe; you can convert it to non-magrittr code by changing (say) func1() %>% func2() %>% func3() to func3(func2(func1())) (yuck) or by intermediate assignment of return values, ret1 <- func1(); ret2 <- func2(ret1); ....
library(magrittr)
dat <- Filter(function(a) grepl("^[0-9][0-9]", a), wcq_1934_text) %>%
paste(., collapse = "\n") %>%
textConnection() %>%
read.fwf(file = ., widths = c(10, 16, 17, 4, 99), stringsAsFactors = FALSE) %>%
lapply(trimws) %>%
as.data.frame(stringsAsFactors = FALSE)
The widths are fragile and unique to this page. If other reporting pages have slightly different column layouts, you'll need to use a different function, perhaps one that can automatically determine the breaks.
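One possible alternative to fixed widths, sketched here in base R, is to split each line on runs of two or more spaces. It assumes every row has the same number of fields and that no field contains a double space. The two sample lines are rebuilt with sprintf() so their spacing is explicit:

```r
# Rebuild two lines with padded columns (same games as in the page).
games <- sprintf("%-10s%-16s%-17s%-5s%s",
                 c("11.06.33", "29.06.33"),
                 c("Stockholm", "Kaunas"),
                 c("Sweden", "Lithuania"),
                 c("6-2", "0-2"),
                 c("Estonia", "Sweden"))

# Split on two or more consecutive spaces instead of fixed positions.
parts <- strsplit(games, "[ ]{2,}")

# Stack the pieces row by row into a data frame.
dat2 <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(dat2) <- c("date", "city", "home", "score", "away")
dat2
```

Single spaces inside a field (as in "Cd. de Mexico") survive this split, but a row with a missing field would shift the columns, so checking lengths(parts) first is a sensible precaution.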
head(dat)
# V1 V2 V3 V4 V5
# 1 11.06.33 Stockholm Sweden 6-2 Estonia
# 2 29.06.33 Kaunas Lithuania 0-2 Sweden
# 3 11.03.34 Madrid Spain 9-0 Portugal
# 4 18.03.34 Lisboa Portugal 1-2 Spain
# 5 25.03.34 Milano Italy 4-0 Greece
# 6 25.03.34 Sofia Bulgaria 1-4 Hungary
From here, it's up to you which columns you want to use.
For instance, handling of the date, you might want:
dat$V1 <- as.POSIXct(gsub("([0-9]+)$", "19\\1", dat$V1), format = "%d.%m.%Y")
dat$V1
# [1] "1933-06-11 PST" "1933-06-29 PST" "1934-03-11 PST" "1934-03-18 PST" "1934-03-25 PST" "1934-03-25 PST" "1934-04-25 PST" "1934-04-29 PST"
# [9] "1933-10-15 PST" "1934-03-15 PST" "1933-09-24 PST" "1933-10-29 PST" "1934-04-29 PST" "1934-02-25 PST" "1934-04-08 PST" "1934-04-29 PST"
# [17] "1934-03-11 PST" "1934-04-15 PST" "1934-01-28 PST" "1934-02-01 PST" "1934-02-04 PST" "1934-03-04 PST" "1934-03-11 PST" "1934-03-18 PST"
# [25] "1934-05-24 PST" "1934-03-16 PST" "1934-04-06 PST"
The gsub stuff is there because parsing the 2-digit year directly with %y would put it in the wrong century: values 00-68 are read as 20xx and only 69-99 as 19xx.
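A quick way to see the difference, using as.Date() to keep time zones out of the picture:

```r
raw <- "11.06.33"

# Parsing the 2-digit year with %y lands in the 2000s.
two_digit <- as.Date(raw, format = "%d.%m.%y")

# Prefixing the year with "19" and parsing with %Y gives the intended date.
four_digit <- as.Date(gsub("([0-9]+)$", "19\\1", raw), format = "%d.%m.%Y")

format(two_digit)   # "2033-06-11"
format(four_digit)  # "1933-06-11"
```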
It's easy enough to use strsplit on the scores, but you could also do:
library(tidyr)
dat %>%
separate(V4, c("score1", "score2"), sep="-") %>%
head()
# Warning: Too few values at 1 locations: 10
# V1 V2 V3 score1 score2 V5
# 1 1933-06-11 Stockholm Sweden 6 2 Estonia
# 2 1933-06-29 Kaunas Lithuania 0 2 Sweden
# 3 1934-03-11 Madrid Spain 9 0 Portugal
# 4 1934-03-18 Lisboa Portugal 1 2 Spain
# 5 1934-03-25 Milano Italy 4 0 Greece
# 6 1934-03-25 Sofia Bulgaria 1 4 Hungary
(The warning is expected, since one game was not played so has "n/p" for a score. You might want to handle non-score values in V4 before trying the split, perhaps replacing anything not numeric-dash-numeric with NA.)
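One way to do that pre-cleaning in base R, shown on a three-element stand-in for dat$V4 (the names home_goals and away_goals are made up for this sketch):

```r
scores <- c("6-2", "0-2", "n/p")

# Anything that is not digits-dash-digits becomes NA.
scores[!grepl("^[0-9]+-[0-9]+$", scores)] <- NA

# Take the part before and after the dash; NA stays NA.
home_goals <- as.integer(sub("-.*$", "", scores))
away_goals <- as.integer(sub("^.*-", "", scores))

data.frame(score = scores, home_goals, away_goals)
```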
Equally specific to this particular site but may be easier to generalize:
library(rvest)
library(purrr)
library(dplyr)
library(tidyr)   # for separate()
library(stringi)
pg <- read_html("http://www.rsssf.com/tables/34q.html")
Target the <pre> and strip out some things that aren't part of "tables":
html_nodes(pg, "pre") %>%
html_text() %>%
stri_split_lines() %>%
flatten_chr() %>%
discard(stri_detect_regex, "^(NB| )") -> lines
Now, we get the start and end lines indexes of each "group":
starts <- which(grepl("^Group", lines))
ends <- c(starts[-1], length(lines))
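On a toy vector the index arithmetic looks like this. The lines are abbreviated stand-ins for the page content. Note that each end index (except the last) points at the next "Group" header, which is why non-match lines still have to be discarded afterwards:

```r
# Abbreviated stand-in for the cleaned-up lines of the <pre> block.
toy_lines <- c("Group 1 [Sweden]",
               "11.06.33  Stockholm  Sweden  6-2  Estonia",
               "29.06.33  Kaunas  Lithuania  0-2  Sweden",
               "Group 2 [Spain]",
               "11.03.34  Madrid  Spain  9-0  Portugal")

toy_starts <- which(grepl("^Group", toy_lines))
toy_ends   <- c(toy_starts[-1], length(toy_lines))

# Lines belonging to each group, keeping only those that start with a digit.
groups <- Map(function(s, e) {
  block <- toy_lines[(s + 1):e]
  block[grepl("^[0-9]", block)]
}, toy_starts, toy_ends)

lengths(groups)
```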
We iterate over those starts and ends and:
extract the group info
clean up the table
discard any "empty" tables
turn the tabular data into a data frame, doing some munging along the way
I can annotate the following more if needed:
map2_df(starts, ends, ~{
grp_info <- stri_match_all_regex(lines[.x], "Group ([[:digit:]]+) \\[(.*)]")[[1]][,2:3]
lines[(.x+1):.y] %>%
discard(stri_detect_regex, "(^[^[:digit:]]| round)") %>%
discard(`==`, "") -> grp
if (length(grp) == 0) return(NULL)
stri_split_regex(grp, "[ ]{2,}") %>%
map_df(~{
.x[1:4] %>%
as.list() %>%
set_names(c("date", "team_a", "team_b", "score_team")) %>%
flatten_df() %>%
separate(score_team, c("score", "team_c"), sep=" ") %>%
mutate(group_num = grp_info[1], group_info = grp_info[2]) %>%
separate(date, c("d", "m", "y")) %>%
mutate(date = as.Date(sprintf("19%s-%s-%s", y, m, d))) %>%
select(-d, -m, -y)
})
})
## # A tibble: 27 x 7
## team_a team_b score team_c group_num group_info date
## <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 Stockholm Sweden 6-2 Estonia 1 Sweden 1933-06-11
## 2 Kaunas Lithuania 0-2 Sweden 1 Sweden 1933-06-29
## 3 Madrid Spain 9-0 Portugal 2 Spain 1934-03-11
## 4 Lisboa Portugal 1-2 Spain 2 Spain 1934-03-18
## 5 Milano Italy 4-0 Greece 3 Italy 1934-03-25
## 6 Sofia Bulgaria 1-4 Hungary 4 Hungary, Austria 1934-03-25
## 7 Wien Austria 6-1 Bulgaria 4 Hungary, Austria 1934-04-25
## 8 Budapest Hungary 4-1 Bulgaria 4 Hungary, Austria 1934-04-29
## 9 Warszawa Poland 1-2 Czechoslovakia 5 Czechoslovakia 1933-10-15
## 10 Praha Czechoslovakia n/p Poland 5 Czechoslovakia 1934-03-15
## 11 Beograd Yugoslavia 2-2 Switzerland 6 Romania, Switzerland 1933-09-24
## 12 Bern Switzerland 2-2 Romania 6 Romania, Switzerland 1933-10-29
## 13 Bucuresti Romania 2-1 Yugoslavia 6 Romania, Switzerland 1934-04-29
## 14 Dublin Ireland 4-4 Belgium 7 Netherlands, Belgium 1934-02-25
## 15 Amsterdam Netherlands 5-2 Ireland 7 Netherlands, Belgium 1934-04-08
## 16 Antwerpen Belgium 2-4 Netherlands 7 Netherlands, Belgium 1934-04-29
## 17 Luxembourg Luxembourg 1-9 Germany 8 Germany, France 1934-03-11
## 18 Luxembourg Luxembourg 1-6 France 8 Germany, France 1934-04-15
## 19 Port-au-Prince Haiti 1-3 Cuba 11 USA 1934-01-28
## 20 Port-au-Prince Haiti 1-1 Cuba 11 USA 1934-02-01
## 21 Port-au-Prince Haiti 0-6 Cuba 11 USA 1934-02-04
## 22 Cd. de Mexico Mexico 3-2 Cuba 11 USA 1934-03-04
## 23 Cd. de Mexico Mexico 5-0 Cuba 11 USA 1934-03-11
## 24 Cd. de Mexico Mexico 4-1 Cuba 11 USA 1934-03-18
## 25 Roma USA 4-2 Mexico 11 USA 1934-05-24
## 26 Cairo Egypt 7-1 Palestina 12 Egypt 1934-03-16
## 27 Tel Aviv Palestina 1-4 Egypt 12 Egypt 1934-04-06

Subsetting in a function to calculate a row total

I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various methods, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this way did succeed in combining two vectors of different lengths, the ‘rowSums’ function returns NA for the rows where a value is missing. Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how can I …
Subset data from a data frame on the basis of ‘Type’,
Insert these individual subsets of different lengths into a new data frame,
Add a ‘Total’ column to this data frame which is the correct sum of the
individual subsets.
An added complication to this problem is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
tmpNumberOfTypes <- c(levels(x$Type))
for( i in tmpNumberOfTypes){
subSetofData <- subset(x, Type = i, select = Value)
if( i == 1) {
totalDf <- subSetOfData }
else{
totalDf <- cbind(totalDf, subSetofData)}
}
return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment of Joris (see below) I got a push in the right direction; however, when trying to translate his solution to my data frame, I ran into additional problems. His proposed answer works, and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…
Two thoughts :
1) you could use na.rm=T in rowSums
2) How do you know which one has to go with which? You might add some indexing.
eg :
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
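If the grouping column is a factor (as in the edit to the question, where rle() fails), a base R variant that avoids rle() altogether is to number the rows within each group with ave() and let tapply() build the wide layout. This is only a sketch on the example data from the question, not the method used above:

```r
# Example data from the question, with Type deliberately stored as a factor.
TV <- data.frame(
  Type  = factor(c(rep("A", 4), rep("B", 7))),
  Value = c(10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90)
)

# Running index within each Type: 1..4 for A, 1..7 for B.
TV$id <- ave(seq_along(TV$Type), TV$Type, FUN = seq_along)

# One row per id, one column per Type; missing combinations become NA.
wide <- tapply(TV$Value, list(TV$id, TV$Type), sum)

# na.rm = TRUE treats the missing A values as zero in the row total.
wide <- cbind(wide, Total = rowSums(wide, na.rm = TRUE))
wide
```

The same code works unchanged for any number of Types, since tapply() creates one column per level.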
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want zeros instead of <NA>:
# tvWide[is.na(tvWide)] <- 0
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))