Hide Java output in HTML file - html

I write this code in scala.html to show the MySQL data in an HTML table:
@for(a <- 1 to rowNum){
@rs.next()
<tr>
<td>@rs.getString("building_id")</td>
<td>@rs.getString("building_name")</td>
<td>@rs.getString("building_type")</td>
<td>@rs.getString("address")</td>
</tr>
}
It gives the following result:
true true true true true true true true true true
id name type address
1 The Floravale condo Westwood Avenue
2 building2 condo Jurong West Street 21
3 building3 hdb Jurong West Street 31
4 building4 hdb Jurong West Street 81
5 building5 hdb Jurong West Street 61
6 building6 hdb Jurong West Street 81
7 building7 hdb Kang Ching Road
8 building8 hdb Kang Ching Road
9 building9 hdb Boon Lay Drive
10 building10 hdb Boon Lay Place
How can I hide the true that rs.next() prints on each iteration?
Or are there other ways to display data? Thanks!

As commented, rs.next() is probably returning true, and the template renders that return value as output. A more typical approach is to collect your rows in a List that is passed to the view, often using a dedicated case class (say Building), and then map over it:
@buildings.map { building =>
<td>@building.id</td>
<td>@building.name</td>
...
}
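A minimal sketch of the same separation, written in Python rather than Play/Scala: the rows are read into plain objects first, and the rendering step only sees the finished list, so the return value of a cursor-advancing call never reaches the output. The `Building` class, `fetch_buildings`, `render_rows`, and the in-memory SQLite table are all hypothetical stand-ins for the real controller, view, and MySQL table.

```python
import sqlite3
from dataclasses import dataclass
from html import escape


@dataclass
class Building:
    id: int
    name: str
    type: str
    address: str


def fetch_buildings(conn):
    """Read every row up front; cursor handling stays out of the view."""
    cur = conn.execute(
        "SELECT building_id, building_name, building_type, address FROM building"
    )
    return [Building(*row) for row in cur.fetchall()]


def render_rows(buildings):
    """Build one <tr> per building from already-fetched data."""
    return "\n".join(
        f"<tr><td>{b.id}</td><td>{escape(b.name)}</td>"
        f"<td>{escape(b.type)}</td><td>{escape(b.address)}</td></tr>"
        for b in buildings
    )


# Hypothetical in-memory table standing in for the MySQL one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE building (building_id INTEGER, building_name TEXT,"
    " building_type TEXT, address TEXT)"
)
conn.execute(
    "INSERT INTO building VALUES (1, 'The Floravale', 'condo', 'Westwood Avenue')"
)
conn.execute(
    "INSERT INTO building VALUES (2, 'building2', 'condo', 'Jurong West Street 21')"
)

html_rows = render_rows(fetch_buildings(conn))
print(html_rows)
```

Because the view function receives only `Building` objects, there is no call whose boolean result could be rendered by accident.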

Related

Placing "NA" into an Empty Position?

I am trying to scrape name/address information from yellowpages (https://www.yellowpages.ca/). I have a function (from :(R) Webscraping Error : arguments imply differing number of rows: 1, 0) that is able to retrieve this information:
library(rvest)
library(dplyr)
scraper <- function(url) {
page <- url %>%
read_html()
tibble(
name = page %>%
html_elements(".jsListingName") %>%
html_text2(),
address = page %>%
html_elements(".listing__address--full") %>%
html_text2()
)
}
However, the address information is not always present. For example, there are several barbers listed on this page https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON and all but a few of them have addresses. As a result, when I run this function, I get the following error:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
Error:
! Tibble columns must have compatible sizes.
* Size 14: Existing data.
* Size 12: Column `address`.
i Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.
My Question: Is there some way that I can modify the definition of the "scraper" function so that, when no address is listed, an NA appears in that line? For example:
barber address
1 barber111 address111
2 barber222 address222
3 barber333 NA
Is there some way I could add a statement similar to CASE WHEN that would grab the address or place an NA when the address is not there?
In order to match the businesses with their addresses, it is best to find a root node for each listing and get the text from the relevant child node. If the child node is missing, you can return an NA instead:
library(rvest)
library(dplyr)
scraper <- function(url) {
nodes <- read_html(url) %>% html_elements(".listing_right_section")
tibble(name = nodes %>% sapply(function(x) {
x <- html_text2(html_elements(x, css = ".jsListingName"))
if(length(x)) x else NA}),
address = nodes %>% sapply(function(x) {
x <- html_text2(html_elements(x, css = ".listing__address--full"))
if(length(x)) x else NA}))
}
So now we can do:
scraper("https://www.yellowpages.ca/search/si/1/barber/Sudbury+ON")
#> # A tibble: 14 x 2
#> name address
#> <chr> <chr>
#> 1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A~
#> 2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
#> 3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3~
#> 4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
#> 5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
#> 6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
#> 7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P ~
#> 8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
#> 9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
#> 10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0~
#> 11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
#> 12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
#> 13 Faiella Classic Hair <NA>
#> 14 Ben's Barbershop & Hairstyling <NA>
Created on 2022-09-16 with reprex v2.0.2
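The root-node idea is not specific to rvest. As a rough Python sketch using only the standard library: iterate over one container element per listing and look up each child inside it, returning `None` when the child is absent. The markup in `SAMPLE`, the class names, and the helper `text_or_none` are hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing markup.
SAMPLE = """
<div>
  <div class="listing">
    <a class="name">barber111</a>
    <span class="address">address111</span>
  </div>
  <div class="listing">
    <a class="name">barber333</a>
  </div>
</div>
"""


def text_or_none(node, path):
    """Return the stripped text of the first match under node, or None."""
    child = node.find(path)
    return child.text.strip() if child is not None and child.text else None


def scrape(markup):
    root = ET.fromstring(markup)
    rows = []
    # One row per listing root, so name and address always line up.
    for listing in root.findall(".//div[@class='listing']"):
        rows.append({
            "name": text_or_none(listing, ".//a[@class='name']"),
            "address": text_or_none(listing, ".//span[@class='address']"),
        })
    return rows


rows = scrape(SAMPLE)
print(rows)
```

Every listing contributes exactly one row, so the two columns cannot end up with different lengths.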
Perhaps an even simpler solution:
library(tidyverse)
library(rvest)
scraper <- function(url) {
page <- url %>%
read_html() %>%
html_elements(".listing_right_top_section")
tibble(
name = page %>%
html_element(".jsListingName") %>%
html_text2(),
address = page %>%
html_element(".listing__address--full") %>%
html_text2()
)
}
# A tibble: 14 x 2
name address
<chr> <chr>
1 Lords'n Ladies Hair Design 1560 Lasalle Blvd, Sudbury, ON P3A 1Z7
2 Jo's The Lively Barber 611 Main St, Lively, ON P3Y 1M9
3 Hairapy Studio 517 & Barber Shop 517 Notre Dame Ave, Sudbury, ON P3C 5L1
4 Nickel Range Unisex Hairstyling 111 Larch St, Sudbury, ON P3E 4T5
5 Ugo Barber & Hairstyling 911 Lorne St, Sudbury, ON P3C 4R7
6 Gordon's Hairstyling 19 Durham St, Sudbury, ON P3C 5E2
7 Valley Plaza Barber Shop 5085 Highway 69 N, Hanmer, ON P3P 1J6
8 Rick's Hairstyling Shop 28 Young St, Capreol, ON P0M 1H0
9 President Men's Hairstyling & Barber Shop 117 Elm St, Sudbury, ON P3C 1T3
10 Pat's Hairstylists 33 Godfrey Dr, Copper Cliff, ON P0M 1N0
11 WildRootz Hair Studio 911 Lorne St, Sudbury, ON P3C 4R7
12 Sleek Barber Bar 324 Elm St, ON P3C 1V8
13 Faiella Classic Hair NA
14 Ben's Barbershop & Hairstyling NA

getting the relevant tag for scraping address from website

I am trying to scrape locations of Walmart in the State of Missouri using the link below:
https://www.walmart.com/store/finder?location=Missouri&distance=50
library(rvest)
library(xml2)
library(tidyverse)
url <- read_html("https://www.walmart.com/store/finder?location=Missouri&distance=50")
I used SelectorGadget to check what is in the NearbyStores and use it to extract store address.
I tried extracting the city first, but I get nothing:
url %>% html_elements(".city")
{xml_nodeset (0)}
Then I tried to extract address and store type but still get nothing.
url %>% html_elements(".result-element-address")
{xml_nodeset (0)}
url %>% html_elements(".result-element-store-type")
{xml_nodeset (0)}
I am trying to create a data frame with the name of the city and the address.
The tag you are looking for does not exist in the document you are requesting. It is built dynamically by javascript code after the page loads. Fortunately the actual data does exist on the page, in the form of a json string inside one of the script tags. This requires a bit of parsing, but contains all the info you need:
library(rvest)
library(xml2)
library(tidyverse)
url <- read_html("https://www.walmart.com/store/finder?location=Missouri&distance=50")
stores <- html_element(url, xpath = "//script[@id='storeFinder']") %>%
html_text() %>%
jsonlite::parse_json()
do.call(rbind, lapply(stores$storeFinder$storeFinderCarousel$stores,
function(x) as.data.frame(x$address)))
#> postalCode address city state country
#> 1 65401 500 S Bishop Ave Rolla MO US
#> 2 65584 185 Saint Robert Blvd Saint Robert MO US
#> 3 65453 100 Ozark Dr Cuba MO US
#> 4 65560 1101 W Highway 32 Salem MO US
#> 5 65066 1888 Highway 28 Owensville MO US
#> 6 63080 350 Park Ridge Rd Sullivan MO US
#> 7 65101 401 Supercenter Dr Jefferson City MO US
#> 8 65065 4252 Highway 54 Osage Beach MO US
#> 9 65483 1433 S Sam Houston Blvd Houston MO US
#> 10 65109 724 Stadium West Blvd Jefferson City MO US
#> 11 65026 1802 S Business 54 Eldon MO US
#> 12 65020 94 Cecil St Camdenton MO US
#> 13 65536 1800 S Jefferson Ave Lebanon MO US
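The same "read the embedded JSON instead of the rendered tags" step can be sketched in Python with the standard library. The `PAGE` string is a hypothetical, cut-down page. Only the script id and the `storeFinder` / `storeFinderCarousel` / `stores` nesting are taken from the answer above, and the real page may change at any time.

```python
import json
import re

# Hypothetical page source; only the shape mirrors the answer above.
PAGE = """
<html><body>
<script id="storeFinder" type="application/json">
{"storeFinder": {"storeFinderCarousel": {"stores": [
  {"address": {"postalCode": "65401", "address": "500 S Bishop Ave",
               "city": "Rolla", "state": "MO", "country": "US"}},
  {"address": {"postalCode": "65453", "address": "100 Ozark Dr",
               "city": "Cuba", "state": "MO", "country": "US"}}
]}}}
</script>
</body></html>
"""


def extract_store_addresses(html):
    """Pull the JSON out of the storeFinder script tag and list the addresses."""
    match = re.search(
        r'<script[^>]*\bid="storeFinder"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if match is None:
        return []  # tag missing: the page layout differs from the assumption
    data = json.loads(match.group(1))
    stores = data["storeFinder"]["storeFinderCarousel"]["stores"]
    return [store["address"] for store in stores]


addresses = extract_store_addresses(PAGE)
for a in addresses:
    print(a["city"], "-", a["address"])
```

A regular expression is enough for this fixed sample. For real pages an HTML parser is the safer way to locate the script tag.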

VBA parse JSON empty data

I'm trying to parse some data to a sheet with VBA. My code works fine when all the data in the JSON is provided, but when there is no (team1)(name) or (team1)(id), I get an error about incompatible data.
The code is below. It writes 3 or 4 lines of data before the error.
The JSON data is further below.
Is there any way to skip records that don't have all the data, or to just write an "empty" value when the data is null?
Dim jsonText As String
Dim jsonObject As Object, item As Object
Dim i As Long
Dim ws As Worksheet
Set ws = Worksheets("Matchs")
jsonText = ws.Cells(1, 1)
Dim http As Object, JSON As Object
Set http = CreateObject("MSXML2.XMLHTTP")
http.Open "GET", "getMatches.json", False
http.Send
Set JSON = ParseJson(http.responseText)
i = 2
For Each item In JSON
ws.Cells(i, 1) = item("id")
ws.Cells(i, 2) = item("date")
ws.Cells(i, 3) = item("title")
ws.Cells(i, 5) = item("team1")("name")
ws.Cells(i, 6) = item("team1")("id")
i = i + 1
Next
The JSON =>
[{"id":2342835,"date":1594731600000,
"team1":{"name":"FATE","id":9863},
"team2":{"name":"Budapest Five","id":9802},
"format":"bo1",
"event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},
"stars":0,"live":false},
{"id":2342836,"date":1594731600000,
"team1":{"name":"PACT","id":8248},
"team2":{"name":"Singularity","id":6978},
"format":"bo1",
"event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},
"stars":0,"live":false},
{"id":2342843,"date":1594735200000,
"title":"Malta Vibes 3 - Group A Winners' Match",
"stars":0,"live":false},
{"id":2342862,"date":1594735200000,
"team1":{"name":"Nexus","id":7187},
"team2":{"name":"BIG Academy","id":10254},"format":"bo3","event":{"name":"Betano Masters Europe 2020","id":5427},"stars":0,"live":false},{"id":2342834,"date":1594746000000,"team1":{"name":"sAw","id":10567},"team2":{"name":"Nexus","id":7187},"format":"bo3","event":{"name":"ESEA Advanced Season 34 Europe","id":5415},"stars":1,"live":false},{"id":2342844,"date":1594746000000,"title":"Malta Vibes 3 - Group A Elimination Match","stars":0,"live":false},{"id":2342863,"date":1594750500000,"team1":{"name":"Unicorns of Love","id":9812},"team2":{"name":"Giants","id":4949},"format":"bo3","event":{"name":"Betano Masters Europe 2020","id":5427},"stars":0,"live":false},{"id":2342801,"date":1594751400000,"team1":{"name":"Secret","id":10488},"team2":{"name":"Tricked","id":4602},"format":"bo3","event":{"name":"ESEA Advanced Season 34 Europe","id":5415},"stars":0,"live":false},{"id":2342845,"date":1594756800000,"title":"Malta Vibes 3 - Group A Decider Match","stars":0,"live":false},{"id":2342803,"date":1594774800000,"team1":{"name":"Thunder Logic","id":9615},"team2":{"name":"RBG","id":10258},"format":"bo3","event":{"name":"ESEA Advanced Season 34 North America","id":5416},"stars":0,"live":false},{"id":2342864,"date":1594776600000,"team1":{"name":"Third Impact","id":10469},"team2":{"name":"Lethal Divide","id":10770},"format":"bo3","event":{"name":"ESEA Advanced Season 34 North America","id":5416},"stars":0,"live":false},{"id":2342816,"date":1594796400000,"team1":{"name":"Hard Legion","id":10421},"team2":{"name":"AGF","id":8704},"format":"bo3","event":{"name":"Nine to Five 1","id":5409},"stars":0,"live":false},{"id":2342817,"date":1594796400000,"team1":{"name":"Gambit Youngsters","id":9976},"team2":{"name":"ALTERNATE aTTaX","id":4501},"format":"bo3","event":{"name":"Nine to Five 1","id":5409},"stars":0,"live":false},{"id":2342818,"date":1594807200000,"title":"Nine to Five 1 Grand 
Final","stars":0,"live":false},{"id":2342837,"date":1594818000000,"team1":{"name":"HellRaisers","id":5310},"team2":{"name":"HONORIS","id":10737},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342838,"date":1594818000000,"team1":{"name":"AGF","id":8704},"team2":{"name":"CR4ZY","id":10150},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342846,"date":1594821600000,"title":"Malta Vibes 3 - Group B Winners' Match","stars":0,"live":false},{"id":2342847,"date":1594832400000,"title":"Malta Vibes 3 - Group B Elimination Match","stars":0,"live":false},{"id":2342848,"date":1594843200000,"title":"Malta Vibes 3 - Group B Decider Match","stars":0,"live":false},{"id":2342532,"date":1594890000000,"team1":{"name":"Rooster","id":9881},"team2":{"name":"Paradox","id":7983},"format":"bo3","event":{"name":"LPL Pro League Season 5","id":5319},"stars":0,"live":false},{"id":2342839,"date":1594893600000,"team1":{"name":"Gambit Youngsters","id":9976},"team2":{"name":"Lyngby Vikings","id":8963},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342840,"date":1594897200000,"team1":{"name":"Illuminar","id":8813},"team2":{"name":"AVEZ","id":9797},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342800,"date":1594899000000,"team1":{"name":"Mako","id":10507},"team2":{"name":"TRUCKERS WITH ATTITUDE","id":10713},"format":"bo3","event":{"name":"LPL Pro League Season 5","id":5319},"stars":0,"live":false},{"id":2342849,"date":1594900800000,"title":"Malta Vibes 3 - Group C Winners' Match","stars":0,"live":false},{"id":2342850,"date":1594911600000,"title":"Malta Vibes 3 - Group C Elimination Match","stars":0,"live":false},{"id":2342851,"date":1594922400000,"title":"Malta Vibes 3 - Group C Decider 
Match","stars":0,"live":false},{"id":2342824,"date":1594962000000,"team1":{"name":"Invictus","id":7966},"team2":{"name":"D13","id":8607},"format":"bo3","event":{"name":"Perfect World Asia League Summer 2020","id":5376},"stars":0,"live":false},{"id":2342825,"date":1594971000000,"team1":{"name":"ViCi","id":7606},"team2":{"name":"Lucid Dream","id":8680},"format":"bo3","event":{"name":"Perfect World Asia League Summer 2020","id":5376},"stars":0,"live":false},{"id":2342826,"date":1594980000000,"team1":{"name":"TYLOO","id":4863},"team2":{"name":"Divine Vendetta","id":10396},"format":"bo3","event":{"name":"Perfect World Asia League Summer 2020","id":5376},"stars":0,"live":false},{"id":2342841,"date":1594980000000,"team1":{"name":"Hard Legion","id":10421},"team2":{"name":"SG.pro","id":10105},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342842,"date":1594983600000,"team1":{"name":"ALTERNATE aTTaX","id":4501},"team2":{"name":"Syman","id":8772},"format":"bo1","event":{"name":"Eden Arena Malta Vibes Cup 3","id":5426},"stars":0,"live":false},{"id":2342852,"date":1594987200000,"title":"Malta Vibes 3 - Group D Winners' Match","stars":0,"live":false},{"id":2342827,"date":1594989000000,"team1":{"name":"TIGER","id":10661},"team2":{"name":"Beyond","id":8262},"format":"bo3","event":{"name":"Perfect World Asia League Summer 2020","id":5376},"stars":0,"live":false},{"id":2342853,"date":1594998000000,"title":"Malta Vibes 3 - Group D Elimination Match","stars":0,"live":false},{"id":2342854,"date":1595008800000,"title":"Malta Vibes 3 - Group D Decider Match","stars":0,"live":false},{"id":2342855,"date":1595059200000,"title":"Malta Vibes 3 - Quarter-Final #1","stars":0,"live":false},{"id":2342828,"date":1595062800000,"title":"PAL Summer - Semi-final #1","stars":0,"live":false},{"id":2342856,"date":1595070000000,"title":"Malta Vibes 3 - Quarter-Final #2","stars":0,"live":false},{"id":2342829,"date":1595073600000,"title":"PAL Summer - 
Semi-final #2","stars":0,"live":false},{"id":2342857,"date":1595080800000,"title":"Malta Vibes 3 - Quarter-Final #3","stars":0,"live":false},{"id":2342858,"date":1595091600000,"title":"Malta Vibes 3 - Quarter-Final #4","stars":0,"live":false},{"id":2342830,"date":1595149200000,"title":"PAL Summer - 3rd Place Decider","stars":0,"live":false},{"id":2342859,"date":1595152800000,"title":"Malta Vibes 3 - Semi-Final #1","stars":0,"live":false},{"id":2342831,"date":1595160000000,"title":"PAL Summer - Grand Final","stars":0,"live":false},{"id":2342860,"date":1595163600000,"title":"Malta Vibes 3 - Semi-Final #2","stars":0,"live":false},{"id":2342861,"date":1595174400000,"title":"Malta Vibes 3 - Grand Final","stars":0,"live":false},{"id":2342643,"date":1597134600000,"team1":{"name":"Ground Zero","id":8536},"team2":{"name":"Paradox","id":7983},"format":"bo3","event":{"name":"ESL Australia & NZ Championship Season 11","id":5318},"stars":0,"live":false},{"id":2342520,"date":1597147200000,"team1":{"name":"Bantz","id":10712},"team2":{"name":"TRUCKERS WITH ATTITUDE","id":10713},"format":"bo3","event":{"name":"ESL Australia & NZ Championship Season 11","id":5318},"stars":0,"live":false}]
You can use the dictionary Exists method:
If item.Exists("team1") Then
If item("team1").Exists("name") Then
'record the name
End If
End If
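For comparison, here is the same guard sketched in Python rather than VBA: check for the key before reading it, and fall back to an empty value when it is absent. The two records are copied from the question's JSON, and the function name `to_rows` is only illustrative.

```python
import json

# Two records copied from the question's JSON: one with team1, one without.
JSON_TEXT = """[
  {"id": 2342835, "date": 1594731600000,
   "team1": {"name": "FATE", "id": 9863},
   "team2": {"name": "Budapest Five", "id": 9802}},
  {"id": 2342843, "date": 1594735200000,
   "title": "Malta Vibes 3 - Group A Winners' Match"}
]"""


def to_rows(matches):
    """Flatten each match; absent keys become an empty string."""
    rows = []
    for item in matches:
        team1 = item.get("team1") or {}  # missing or null team1 -> empty dict
        rows.append([
            item.get("id", ""),
            item.get("date", ""),
            item.get("title", ""),
            team1.get("name", ""),
            team1.get("id", ""),
        ])
    return rows


rows = to_rows(json.loads(JSON_TEXT))
for row in rows:
    print(row)
```

The nested lookup is the important part: the outer key is tested first, so the inner lookup never runs against a missing object.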

Scraping html text into table with delimiters that do not have a clear pattern using R (rvest)

I'm just learning how to use R to scrape data from webpages, and I'm running into a couple of issues.
For reference, the website that I am practicing on is here: http://www.rsssf.com/tables/34q.html
As far as I know, the website I am scraping data from is not a table so I can't directly scrape the information into a table, so here is the code I wrote to just have all of the text:
wcq_1934_html <- read_html("http://www.rsssf.com/tables/34q.html")
wcq_1934_node <- html_nodes(wcq_1934_html, "pre")
wcq_1934_text <- html_text(wcq_1934_node, trim = TRUE)
This results in a very long text file with all of the information that I need, just not formatted in an ideal way.
So I am next attempting to substring this text in order to get an output that looks something like this.
Country A - Country A Score - Country B - Country B Score
It doesn't have to be exactly like this, I just basically need for each game the country and how many goals they scored and ideally it should be comparable with the other country from the same game so I can know who won or lost! I do not need any of the other information like where the game was played, etc.
So I've tried three different ways to get this:
First test: split text by dashes:
test <- strsplit(wcq_1934_text, "-")
df_test <- data.frame(test)
This gives me the information I need in a table but the rows don't match the exact scores that I need (i.e. Lithuania 0, and Sweden 2 are in separate rows)
Second test: split text by spaces:
test2 <- strsplit(wcq_1934_text, " ")
df_test2 <- data.frame(test2)
This is helpful because it gives me the scores in one row (0-2 for the first game), but the countries are unevenly spaced out across rows.
Third test: split text by "tabs"
test3 <- strsplit(wcq_1934_text, " ")
df_test3 <- data.frame(test3)
This has a similar issue to the first test.
Any suggestions would be much appreciated. This is my first ever Stack Overflow post, although I've lurked around and this website has been helpful to me for a very long time. Thank you in advance!
Here's a solution that provides you most of what you need, though as MrFlick commented, it is a little fragile to this page. I'll stay with rvest, though as biomiha suggested, it isn't really buying you a lot here (though it does cleanly break out the <pre> block).
Starting with your wcq_1934_text, it's a single long string, let's break it up by newlines (CRLF in this case):
wcq_1934_text <- strsplit(wcq_1934_text, "[\r\n]+")[[1]]
str(wcq_1934_text)
# chr [1:51] "Hosts: Italy (not automatically qualified)" "Holders: Uruguay (did not enter)" "Group 1 [Sweden]" ...
I'll use the magrittr package merely because it helps break out each step of the process using the %>% pipe; you can convert it to non-magrittr code by changing (say) func1() %>% func2() %>% func3() to func3(func2(func1())) (yuck) or by intermediate assignment of return values, ret1 <- func1(); ret2 <- func2(ret1); ....
library(magrittr)
dat <- Filter(function(a) grepl("^[0-9][0-9]", a), wcq_1934_text) %>%
paste(., collapse = "\n") %>%
textConnection() %>%
read.fwf(file = ., widths = c(10, 16, 17, 4, 99), stringsAsFactors = FALSE) %>%
lapply(trimws) %>%
as.data.frame(stringsAsFactors = FALSE)
The widths are fragile and unique to this page. If other reporting pages have slightly different column layouts, you'll need to use a different function, perhaps one that can automatically determine the breaks.
head(dat)
# V1 V2 V3 V4 V5
# 1 11.06.33 Stockholm Sweden 6-2 Estonia
# 2 29.06.33 Kaunas Lithuania 0-2 Sweden
# 3 11.03.34 Madrid Spain 9-0 Portugal
# 4 18.03.34 Lisboa Portugal 1-2 Spain
# 5 25.03.34 Milano Italy 4-0 Greece
# 6 25.03.34 Sofia Bulgaria 1-4 Hungary
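One way to avoid hard-coded widths, sketched here in Python, is to split each line on runs of two or more spaces and then peel the score off the last field. The lines in `LINES` are illustrative: their spacing is an assumption reconstructed from the parsed output, not copied from the page, and `parse_line` is a hypothetical helper.

```python
import re

# Illustrative lines; the spacing is an assumption, not the page's exact layout.
LINES = [
    "11.06.33  Stockholm       Sweden           6-2 Estonia",
    "29.06.33  Kaunas          Lithuania        0-2 Sweden",
    "04.03.34  Cd. de Mexico   Mexico           3-2 Cuba",
]


def parse_line(line):
    """Split on runs of 2+ spaces, then separate the score from the away team."""
    date, city, home, rest = re.split(r"\s{2,}", line.strip(), maxsplit=3)
    score, away = rest.split(" ", 1)
    return {"date": date, "city": city, "home": home, "score": score, "away": away}


parsed = [parse_line(line) for line in LINES]
for row in parsed:
    print(row)
```

Single spaces inside a field, as in "Cd. de Mexico", survive the split. The approach still fails if a value fills its column so completely that only one space separates it from the next field.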
From here, it's up to you which columns you want to use.
For instance, handling of the date, you might want:
dat$V1 <- as.POSIXct(gsub("([0-9]+)$", "19\\1", dat$V1), format = "%d.%m.%Y")
dat$V1
# [1] "1933-06-11 PST" "1933-06-29 PST" "1934-03-11 PST" "1934-03-18 PST" "1934-03-25 PST" "1934-03-25 PST" "1934-04-25 PST" "1934-04-29 PST"
# [9] "1933-10-15 PST" "1934-03-15 PST" "1933-09-24 PST" "1933-10-29 PST" "1934-04-29 PST" "1934-02-25 PST" "1934-04-08 PST" "1934-04-29 PST"
# [17] "1934-03-11 PST" "1934-04-15 PST" "1934-01-28 PST" "1934-02-01 PST" "1934-02-04 PST" "1934-03-04 PST" "1934-03-11 PST" "1934-03-18 PST"
# [25] "1934-05-24 PST" "1934-03-16 PST" "1934-04-06 PST"
The gsub step is needed because, with a two-digit year (%y), as.POSIXct places years 00-68 in the 2000s and years 69-99 in the 1900s; prefixing "19" and parsing with %Y makes the century explicit.
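The same two-digit-year pivot exists outside R. A short Python sketch of the problem and of the "prefix the century" fix, using one date string from the table above:

```python
from datetime import datetime

raw = "11.06.33"

# %y follows the POSIX pivot: 00-68 -> 2000-2068, 69-99 -> 1969-1999.
naive = datetime.strptime(raw, "%d.%m.%y")

# Make the century explicit: "11.06.33" -> "11.06.1933", then parse with %Y.
day, month, year = raw.split(".")
fixed = datetime.strptime(f"{day}.{month}.19{year}", "%d.%m.%Y")

print(naive.year, fixed.year)  # prints "2033 1933"
```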
It's easy enough to use strsplit on the scores, but you could also do:
library(tidyr)
dat %>%
separate(V4, c("score1", "score2"), sep="-") %>%
head()
# Warning: Too few values at 1 locations: 10
# V1 V2 V3 score1 score2 V5
# 1 1933-06-11 Stockholm Sweden 6 2 Estonia
# 2 1933-06-29 Kaunas Lithuania 0 2 Sweden
# 3 1934-03-11 Madrid Spain 9 0 Portugal
# 4 1934-03-18 Lisboa Portugal 1 2 Spain
# 5 1934-03-25 Milano Italy 4 0 Greece
# 6 1934-03-25 Sofia Bulgaria 1 4 Hungary
(The warning is expected, since one game was not played so has "n/p" for a score. You might want to handle non-score values in V4 before trying the split, perhaps replacing anything not numeric-dash-numeric with NA.)
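A small sketch of that pre-check in Python: accept only values of the form digits-dash-digits and map everything else, such as "n/p", to missing values. The names `SCORE` and `split_score` are illustrative.

```python
import re

SCORE = re.compile(r"^(\d+)-(\d+)$")


def split_score(value):
    """Return (home, away) as ints, or (None, None) for non-scores like 'n/p'."""
    m = SCORE.match(value.strip())
    if m is None:
        return (None, None)
    return (int(m.group(1)), int(m.group(2)))


for s in ["6-2", "0-2", "n/p"]:
    print(s, split_score(s))
```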
Equally specific to this particular site but may be easier to generalize:
library(rvest)
library(purrr)
library(dplyr)
library(stringi)
library(tidyr) # provides separate(), used below
pg <- read_html("http://www.rsssf.com/tables/34q.html")
Target the <pre> and strip out some things that aren't part of "tables":
html_nodes(pg, "pre") %>%
html_text() %>%
stri_split_lines() %>%
flatten_chr() %>%
discard(stri_detect_regex, "^(NB| )") -> lines
Now, we get the start and end lines indexes of each "group":
starts <- which(grepl("^Group", lines))
ends <- c(starts[-1], length(lines))
We iterate over those starts and ends and:
extract the group info
clean up the table
discard any "empty" tables
turn the tabular data into a data frame, doing some munging along the way
I can annotate the following more if needed:
map2_df(starts, ends, ~{
grp_info <- stri_match_all_regex(lines[.x], "Group ([[:digit:]]+) \\[(.*)]")[[1]][,2:3]
lines[(.x+1):.y] %>%
discard(stri_detect_regex, "(^[^[:digit:]]| round)") %>%
discard(`==`, "") -> grp
if (length(grp) == 0) return(NULL)
stri_split_regex(grp, "\ \ +") %>%
map_df(~{
.x[1:4] %>%
as.list() %>%
set_names(c("date", "team_a", "team_b", "score_team")) %>%
flatten_df() %>%
separate(score_team, c("score", "team_c"), sep=" ") %>%
mutate(group_num = grp_info[1], group_info = grp_info[2]) %>%
separate(date, c("d", "m", "y")) %>%
mutate(date = as.Date(sprintf("19%s-%s-%s", y, m, d))) %>%
select(-d, -m, -y)
})
})
## # A tibble: 27 x 7
## team_a team_b score team_c group_num group_info date
## <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 Stockholm Sweden 6-2 Estonia 1 Sweden 1933-06-11
## 2 Kaunas Lithuania 0-2 Sweden 1 Sweden 1933-06-29
## 3 Madrid Spain 9-0 Portugal 2 Spain 1934-03-11
## 4 Lisboa Portugal 1-2 Spain 2 Spain 1934-03-18
## 5 Milano Italy 4-0 Greece 3 Italy 1934-03-25
## 6 Sofia Bulgaria 1-4 Hungary 4 Hungary, Austria 1934-03-25
## 7 Wien Austria 6-1 Bulgaria 4 Hungary, Austria 1934-04-25
## 8 Budapest Hungary 4-1 Bulgaria 4 Hungary, Austria 1934-04-29
## 9 Warszawa Poland 1-2 Czechoslovakia 5 Czechoslovakia 1933-10-15
## 10 Praha Czechoslovakia n/p Poland 5 Czechoslovakia 1934-03-15
## 11 Beograd Yugoslavia 2-2 Switzerland 6 Romania, Switzerland 1933-09-24
## 12 Bern Switzerland 2-2 Romania 6 Romania, Switzerland 1933-10-29
## 13 Bucuresti Romania 2-1 Yugoslavia 6 Romania, Switzerland 1934-04-29
## 14 Dublin Ireland 4-4 Belgium 7 Netherlands, Belgium 1934-02-25
## 15 Amsterdam Netherlands 5-2 Ireland 7 Netherlands, Belgium 1934-04-08
## 16 Antwerpen Belgium 2-4 Netherlands 7 Netherlands, Belgium 1934-04-29
## 17 Luxembourg Luxembourg 1-9 Germany 8 Germany, France 1934-03-11
## 18 Luxembourg Luxembourg 1-6 France 8 Germany, France 1934-04-15
## 19 Port-au-Prince Haiti 1-3 Cuba 11 USA 1934-01-28
## 20 Port-au-Prince Haiti 1-1 Cuba 11 USA 1934-02-01
## 21 Port-au-Prince Haiti 0-6 Cuba 11 USA 1934-02-04
## 22 Cd. de Mexico Mexico 3-2 Cuba 11 USA 1934-03-04
## 23 Cd. de Mexico Mexico 5-0 Cuba 11 USA 1934-03-11
## 24 Cd. de Mexico Mexico 4-1 Cuba 11 USA 1934-03-18
## 25 Roma USA 4-2 Mexico 11 USA 1934-05-24
## 26 Cairo Egypt 7-1 Palestina 12 Egypt 1934-03-16
## 27 Tel Aviv Palestina 1-4 Egypt 12 Egypt 1934-04-06
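The start/end bookkeeping used above to cut the text into groups can be sketched in Python as follows. The lines in `LINES` are illustrative (only the "Group 1 [Sweden]" header style comes from the page), and `split_groups` is a hypothetical helper.

```python
import re

# Illustrative lines; the match rows use an assumed spacing.
LINES = [
    "Group 1 [Sweden]",
    "11.06.33  Stockholm       Sweden           6-2 Estonia",
    "29.06.33  Kaunas          Lithuania        0-2 Sweden",
    "Group 2 [Spain]",
    "11.03.34  Madrid          Spain            9-0 Portugal",
]

HEADER = re.compile(r"^Group (\d+) \[(.*)\]")


def split_groups(lines):
    """Yield (group_num, group_info, match_lines) for each 'Group' block."""
    starts = [i for i, line in enumerate(lines) if line.startswith("Group")]
    ends = starts[1:] + [len(lines)]
    for start, end in zip(starts, ends):
        m = HEADER.match(lines[start])
        if m is None:
            continue  # header did not match the assumed pattern
        # Keep only lines that begin with a digit, i.e. dated match rows.
        body = [line for line in lines[start + 1:end] if line[:1].isdigit()]
        if body:
            yield m.group(1), m.group(2), body


groups = list(split_groups(LINES))
for num, info, body in groups:
    print(num, info, len(body))
```

Each group's end is simply the next group's start, with the total line count closing the last one, and groups without match rows are skipped.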

Processing JSON using rjson

I'm trying to process some data in JSON format. rjson::fromJSON imports the data successfully and places it into a quite unwieldy list.
library(rjson)
y <- fromJSON(file="http://api.lmiforall.org.uk/api/v1/wf/predict/breakdown/region?soc=6145&minYear=2014&maxYear=2020")
str(y)
List of 3
$ soc : num 6145
$ breakdown : chr "region"
$ predictedEmployment:List of 7
..$ :List of 2
.. ..$ year : num 2014
.. ..$ breakdown:List of 12
.. .. ..$ :List of 3
.. .. .. ..$ code : num 1
.. .. .. ..$ name : chr "London"
.. .. .. ..$ employment: num 74910
.. .. ..$ :List of 3
.. .. .. ..$ code : num 7
.. .. .. ..$ name : chr "Yorkshire and the Humber"
.. .. .. ..$ employment: num 61132
...
However, as this is essentially tabular data, I would like it in a succinct data.frame. After much trial and error I have the result:
y.p <- do.call(rbind,lapply(y[[3]], function(p) cbind(p$year,do.call(rbind,lapply(p$breakdown, function(q) data.frame(q$name,q$employment,stringsAsFactors=F))))))
head(y.p)
p$year q.name q.employment
1 2014 London 74909.59
2 2014 Yorkshire and the Humber 61131.62
3 2014 South West (England) 65833.57
4 2014 Wales 33002.64
5 2014 West Midlands (England) 68695.34
6 2014 South East (England) 98407.36
But the command seems overly fiddly and complex. Is there a simpler way of doing this?
Here I recover the geometry of the list
ni <- seq_along(y[[3]])
nj <- seq_along(y[[c(3, 1, 2)]])
nij <- as.matrix(expand.grid(3, ni=ni, 2, nj=nj))
then extract the relevant variable information using the rows of nij as an index into the nested list
data <- apply(nij, 1, function(ij) y[[ij]])
year <- apply(cbind(nij[,1:2], 1), 1, function(ij) y[[ij]])
and make it into a more friendly structure
> data.frame(year, do.call(rbind, data))
year code name employment
1 2014 1 London 74909.59
2 2015 5 West Midlands (England) 69132.34
3 2016 12 Northern Ireland 24313.94
4 2017 5 West Midlands (England) 71723.4
5 2018 9 North East (England) 27199.99
6 2019 4 South West (England) 71219.51
I am not sure it is simpler, but the result is more complete and I think it is easier to read. My idea, using Map, is to take each (year, breakdown) pair, combine the breakdown data into a single table, and then attach the year to it.
dat <- y[[3]]
res <- Map(function(x,y)data.frame(year=y,
do.call(rbind,lapply(x,as.data.frame))),
lapply(dat,'[[','breakdown'),
lapply(dat,'[[','year'))
## transform the list to a big data.frame
do.call(rbind,res)
year code name employment
1 2014 1 London 74909.59
2 2014 7 Yorkshire and the Humber 61131.62
3 2014 4 South West (England) 65833.57
4 2014 10 Wales 33002.64
5 2014 5 West Midlands (England) 68695.34
6 2014 2 South East (England) 98407.36
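The flattening itself, attaching the parent year to every entry of its breakdown, can be sketched in Python with a nested comprehension. `PAYLOAD` is a trimmed sample in the shape shown by str(y), with values taken from the outputs above. It is not the full API response, and `flatten` is an illustrative name.

```python
import json

# Trimmed sample in the shape shown by str(y); not the full API response.
PAYLOAD = json.loads("""
{"soc": 6145, "breakdown": "region",
 "predictedEmployment": [
   {"year": 2014, "breakdown": [
     {"code": 1, "name": "London", "employment": 74909.59},
     {"code": 7, "name": "Yorkshire and the Humber", "employment": 61131.62}]},
   {"year": 2015, "breakdown": [
     {"code": 5, "name": "West Midlands (England)", "employment": 69132.34}]}
 ]}
""")


def flatten(payload):
    """One row per (year, region): attach the parent year to each breakdown entry."""
    return [
        {"year": period["year"], **region}
        for period in payload["predictedEmployment"]
        for region in period["breakdown"]
    ]


rows = flatten(PAYLOAD)
for row in rows:
    print(row)
```

The outer loop walks the years and the inner loop walks the regions, which mirrors the nested lapply/Map calls in the R answers.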