readHTMLTable collapsing span elements - html

I am trying to fetch game data from this site, using the XML package to do so:
library(XML)
url <- 'http://scores.nbcsports.msnbc.com/cbk/teamstats.asp?team=1115&report=schedule'
raw.schedule <- readHTMLTable(url, which=2)
The problem is that all of the <span> elements in the HTML date column are collapsing together.
R> raw.schedule$Date[1]
[1] "11/142:30 PM PT3:30 PM MT4:30 PM CT5:30 PM ET10:30 PM GMT6:30 PM 北京时间3:30 PM MST5:30 PM EST"
Ideally I would like to have just the date element by itself such as:
R> raw.schedule$Date[1]
[1] "11/14"
I have tried the rvest package but am having the same issue. Is it possible to read this table and keep the <span> elements separated, or just select the first one?

Define a custom function to parse the cells of the table:
myFun <- function(x){
  if(length(y <- getNodeSet(x, "./span[@class=\"shsGameDate\"]")) > 0){
    # date column
    return(xmlValue(y[[1]]))
  }
  if(length(y <- getNodeSet(x, "./span[@class=\"shsTimezone shsETZone\"]")) > 0){
    # time column
    return(xmlValue(y[[1]]))
  }
  xmlValue(x, encoding = "UTF-8")
}
You can now call the readHTMLTable function using your custom function to parse the cells:
library(XML)
url <- 'http://scores.nbcsports.msnbc.com/cbk/teamstats.asp?team=1115&report=schedule'
raw.schedule <- readHTMLTable(url, which=2, elFun = myFun)
> head(raw.schedule)
Date Opponent Time TV Result
1 11/14 vs. Yale 5:30 PM ET W 88 - 85
2 11/18 vs. La Salle 8:00 PM ET L 58 - 60
3 11/22 at Albany 7:00 PM ET W 76 - 73
4 11/25 vs. Hartford 7:00 PM ET L 50 - 54
5 11/30 vs. Vermont 1:00 PM ET W 89 - 73
6 12/5 at Siena 7:00 PM ET Tickets
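If you would rather stay in rvest, the same idea works: select the target <span> inside each cell yourself instead of letting the table converter flatten the whole cell. A minimal sketch, assuming the page still serves the markup used above (the span class and table position come from the answer):
library(rvest)
page <- read_html(url)
dates <- page %>%
  html_nodes(xpath = '(//table)[2]//span[@class="shsGameDate"]') %>%
  html_text()
# dates[1] should then be just "11/14"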

Related

Extracting data from a single variable to multiple variables / multiple observations by symbols and text [{...}]

I have a CSV extracted from an HTML site; many columns hold a lot of information in one cell. For example, this text is from one cell. It holds the names of 3 companies:
[{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},
{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]
I would like to extract each company into a different row, with the user name, position, duration and industry in different columns.
I also have other data in other columns that I would like to stay the same.
Any ideas for a simple way to do this?
This tidyr approach with extract works for a start:
library(dplyr)
library(tidyr)
data.frame(dat) %>%
  # simplify:
  mutate(dat = gsub('["\\]\\[}{]', '', dat, perl = TRUE)) %>%
  # separate:
  separate_rows(dat, sep = '(?<!^)(?=company)') %>%
  # extract:
  extract(dat, "company", "company:([^,]+).*", remove = FALSE) %>%
  extract(dat, "user_name", ".*url:([^,]+).*", remove = FALSE) %>%
  extract(dat, "position", ".*\\btitle:([^,]+)", remove = FALSE) %>%
  # reconstructed to match the duration and industry columns in the output
  # below; the final extract also drops the raw dat column:
  extract(dat, "duration", ".*\\bduration:([^,]+)", remove = FALSE) %>%
  extract(dat, "industry", ".*\\bindustry:([^,]+)")
# A tibble: 3 × 5
industry duration position user_name company
<chr> <chr> <chr> <chr> <chr>
1 Non-profit Organizations "Dec 2021 - Present 7 months " CEO https://www.xyz Orange
2 NA "Feb 2019 Dec 2021 2 years 11 months" Business Development Manager https://www.xyz2 Fig
3 High Tech "Jan 2018 Oct 2018 10 months" Business Development Manager https://www.xyz3 Papaya
Data:
dat <- '{"company":"Orange","location":"","url":"https://www.xyz","positions":[{"title":"CEO","subtitle":"honelulu","description":"","duration":"Dec 2021 - Present 7 months"}] ,"industry":"Non-profit Organizations"},{"company":"Fig","location":"","url":"https://www.xyz2","positions":[{"title":"Business Development Manager","subtitle":"Fig","duration":"Feb 2019 Dec 2021 2 years 11 months",}],},{"company":"Papaya","location":"","url":"https://www.xyz3","positions":[{"title":"Business Development Manager","subtitle":"Pragaya","description":"","duration":"Jan 2018 Oct 2018 10 months",}],"industry":"High Tech"},}]'
See also Use tidyr's function `extract` with optional capture group for a more elegant solution
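Since each cell is almost valid JSON, another option is to repair it and parse it properly rather than regexing fields out. A fragile sketch, tailored to the sample above (it only fixes the trailing commas and the stray closing brace at the very end):
library(jsonlite)
library(tidyr)

txt <- gsub(",\\s*([}\\]])", "\\1", dat)          # drop trailing commas before } or ]
txt <- sub("\\}\\}\\]$", "}]", paste0("[", txt))  # wrap in [ ] and drop the stray brace
parsed <- fromJSON(txt)    # data.frame with a nested positions column
unnest(parsed, positions)  # one row per position, NA where a field is absent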

Append information in the th tags to td rows

I am an economist struggling with coding and data scraping.
I am scraping data from the main (and only) table on this webpage (https://www.oddsportal.com/basketball/europe/euroleague-2013-2014/results/). I can retrieve all the information in the td HTML tags with Python Selenium by referring to the class element, and the same goes for the th tag, which stores the date and stage of the competition. In my final dataset I would like the information stored in the th tag as two columns (date and stage of the competition) next to the other columns of the table. Basically, for each match I would like the date and the stage of the competition in its row, not just as the head of each group of matches.
The only solution I came up with is to index all the rows (with both th and td tags) and build a while loop that appends the information in the th tags to the td rows whose index is lower than the index of the next th tag. Hope I made myself clear (if not, I will try to give a more graphical explanation). However, I am not able to code such a logic construct due to my poor coding abilities. I do not know if I need two loops to iterate through the different tags (td and th) and, if so, how to write them. If you have an easier solution, it is more than welcome!
Thanks in advance for the precious help!
code below:
from selenium import webdriver
import time
import pandas as pd

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define empty data
data_keys = ["Season", "Match_Time", "Home_Team", "Away_Team", "Home_Odd", "Away_Odd", "Home_Score",
             "Away_Score", "OT", "N_Bookmakers"]
data = dict()
for key in data_keys:
    data[key] = list()
del data_keys

# Define 'driver' variable and launch browser
#path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
#path office pc
path = "C:/Users/aldi/Downloads/chromedriver.exe"
driver = webdriver.Chrome(path)

# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)
        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break
        try:
            # Teams
            for el in driver.find_elements_by_class_name('name.table-participant'):
                el = el.text.strip().split(" - ")
                data["Home_Team"].append(el[0])
                data["Away_Team"].append(el[1])
                data["Season"].append(season_filt)
            # Scores
            for el in driver.find_elements_by_class_name('center.bold.table-odds.table-score'):
                el = el.text.split(":")
                if el[1][-3:] == " OT":
                    data["OT"].append(True)
                    el[1] = el[1][:-3]
                else:
                    data["OT"].append(False)
                data["Home_Score"].append(el[0])
                data["Away_Score"].append(el[1])
            # Match times
            for el in driver.find_elements_by_class_name("table-time"):
                data["Match_Time"].append(el.text)
            # Odds
            i = 0
            for el in driver.find_elements_by_class_name("odds-nowrp"):
                i += 1
                if i % 2 == 0:
                    data["Away_Odd"].append(el.text)
                else:
                    data["Home_Odd"].append(el.text)
            # N_Bookmakers
            for el in driver.find_elements_by_class_name("center.info-value"):
                data["N_Bookmakers"].append(el.text)
            # TODO think of inserting the dates list in the dataframe even if it has a different size (19 rows and not 50)
        except:
            pass

driver.quit()
data = pd.DataFrame(data)
data.to_csv("data_odds.csv", index=False)
I would like to add this information to my dataset as two additional columns:
for el in driver.find_elements_by_class_name("first2.tl")[1:]:
    el = el.text.strip().split(" - ")
    data["date"].append(el[0])
    data["stage"].append(el[1])
A few things I would change here.
Don't overwrite variables. You store elements in your el variable, then you overwrite the elements with your strings. It may work for you here, but you may get yourself into trouble with that practice later on, especially since you are iterating through those elements. It also makes things hard to debug.
I know Selenium has ways to parse the html, but I personally find BeautifulSoup a tad easier to parse with and a little more intuitive if you are simply trying to pull data out of the html. So I went with BeautifulSoup's .find_previous() to get the tags that precede the games, which gets you your date and stage content.
Lastly, I like to construct a list of dictionaries to make up the data frame: each item in the list is a dictionary whose keys are the column names and whose values are the data. You do sort of the opposite by creating a dictionary of lists. There is nothing wrong with that, but if the lists don't all have the same length, you'll get an error when trying to create the dataframe. Whereas with a list of dictionaries, if for whatever reason a value is missing, the dataframe is still created and simply has a null or NaN for the missing data, as the toy example below shows.
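A minimal illustration of the difference:
import pandas as pd

# dict-of-lists: DataFrame() raises a ValueError if the lists differ in length
try:
    pd.DataFrame({'a': [1, 2], 'b': [3]})
except ValueError as e:
    print(e)

# list-of-dicts: a missing key simply becomes NaN in that row
print(pd.DataFrame([{'a': 1, 'b': 3}, {'a': 2}]))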
There may be more work you need to do with the code to go through the pages, but this gets you the data in the form you need.
Code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd
from bs4 import BeautifulSoup
import re

# Season to filter
seasons_filt = ['2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019']

# Define 'driver' variable and launch browser
path = "C:/Users/ALESSANDRO/Downloads/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(path)

rows = []

# Loop through pages based on page_num and season
for season_filt in seasons_filt:
    page_num = 0
    while True:
        page_num += 1
        # Get url and navigate it
        page_str = (1 - len(str(page_num))) * '0' + str(page_num)
        url = "https://www.oddsportal.com/basketball/europe/euroleague-" + str(season_filt) + "/results/#/page/" + page_str + "/"
        driver.get(url)
        time.sleep(3)
        # Check if page has no data
        if driver.find_elements_by_id("emptyMsg"):
            print("Season {} ended at page {}".format(season_filt, page_num))
            break
        try:
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            table = soup.find('table', {'id': 'tournamentTable'})
            trs = table.find_all('tr', {'class': re.compile('.*deactivate.*')})
            for each in trs:
                teams = each.find('td', {'class': 'name table-participant'}).text.split(' - ')
                scores = each.find('td', {'class': re.compile('.*table-score.*')}).text.split(':')
                ot = False
                for score in scores:
                    if 'OT' in score:
                        ot = True  # was `ot == True`, a comparison that silently did nothing
                scores = [x.replace('\xa0OT', '') for x in scores]
                matchTime = each.find('td', {'class': re.compile('.*table-time.*')}).text

                # Odds
                i = 0
                for each_odd in each.find_all('td', {'class': "odds-nowrp"}):
                    i += 1
                    if i % 2 == 0:
                        away_odd = each_odd.text
                    else:
                        home_odd = each_odd.text

                n_bookmakers = soup.find('td', {'class': 'center info-value'}).text

                # The preceding <th class="first2 tl"> holds "date - stage" for this group of matches
                date_stage = each.find_previous('th', {'class': 'first2 tl'}).text.split(' - ')
                date = date_stage[0]
                stage = date_stage[1]

                row = {'Season': season_filt,
                       'Home_Team': teams[0],
                       'Away_Team': teams[1],
                       'Home_Score': scores[0],
                       'Away_Score': scores[1],
                       'OT': ot,
                       'Match_Time': matchTime,
                       'Home_Odd': home_odd,
                       'Away_Odd': away_odd,
                       'N_Bookmakers': n_bookmakers,
                       'Date': date,
                       'Stage': stage}
                rows.append(row)
        except:
            pass

driver.quit()
data = pd.DataFrame(rows)
data.to_csv("data_odds.csv", index=False)
Output:
print(data.head(15).to_string())
Season Home_Team Away_Team Home_Score Away_Score OT Match_Time Home_Odd Away_Odd N_Bookmakers Date Stage
0 2013-2014 Real Madrid Maccabi Tel Aviv 86 98 False 18:00 -667 +493 7 18 May 2014 Final Four
1 2013-2014 Barcelona CSKA Moscow 93 78 False 15:00 -135 +112 7 18 May 2014 Final Four
2 2013-2014 Barcelona Real Madrid 62 100 False 19:00 +134 -161 7 16 May 2014 Final Four
3 2013-2014 CSKA Moscow Maccabi Tel Aviv 67 68 False 16:00 -278 +224 7 16 May 2014 Final Four
4 2013-2014 Real Madrid Olympiacos 83 69 False 18:45 -500 +374 7 25 Apr 2014 Play Offs
5 2013-2014 CSKA Moscow Panathinaikos 74 44 False 16:00 -370 +295 7 25 Apr 2014 Play Offs
6 2013-2014 Olympiacos Real Madrid 71 62 False 18:45 +127 -152 7 23 Apr 2014 Play Offs
7 2013-2014 Maccabi Tel Aviv Olimpia Milano 86 66 False 17:45 -217 +179 7 23 Apr 2014 Play Offs
8 2013-2014 Panathinaikos CSKA Moscow 73 72 False 16:30 -106 -112 7 23 Apr 2014 Play Offs
9 2013-2014 Panathinaikos CSKA Moscow 65 59 False 18:45 -125 +104 7 21 Apr 2014 Play Offs
10 2013-2014 Maccabi Tel Aviv Olimpia Milano 75 63 False 18:15 -189 +156 7 21 Apr 2014 Play Offs
11 2013-2014 Olympiacos Real Madrid 78 76 False 17:00 +104 -125 7 21 Apr 2014 Play Offs
12 2013-2014 Galatasaray Barcelona 75 78 False 17:00 +264 -333 7 20 Apr 2014 Play Offs
13 2013-2014 Olimpia Milano Maccabi Tel Aviv 91 77 False 18:45 -286 +227 7 18 Apr 2014 Play Offs
14 2013-2014 CSKA Moscow Panathinaikos 77 51 False 16:15 -303 +247 7 18 Apr 2014 Play Offs
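One caveat if you run this on a current setup: Selenium 4.x removed the find_elements_by_* helpers used above in favour of By locators, so the lookups would become, for example:
from selenium.webdriver.common.by import By

driver.find_elements(By.ID, "emptyMsg")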

Extract columns

I am trying to parse information from an HTML page which looks like this:
Column 1 | Column 2 | Column 3 ....
This is the code I have so far:
from bs4 import BeautifulSoup as BS
import urllib.request

html = urllib.request.urlopen(url)
soup = BS(html, "lxml")
But I can't seem to figure out how to extract, say, column 1 from that HTML page and put it into a dataframe in Python.
You can scrape the table data and then add it to a dataframe:
from bs4 import BeautifulSoup as soup
import urllib.request
import pandas as pd

page_data = str(urllib.request.urlopen('http://mlg.ucd.ie/modules/COMP30760/stocks/tlsa.html').read())
final_data = [i.text for i in soup(page_data, 'html.parser').find_all('td')]
# regroup the flat list of cell texts into rows of 7 columns each
last_data = [final_data[i:i+7] for i in range(0, len(final_data), 7)]
df = pd.DataFrame(last_data[1:], columns=last_data[0])
Output (sample)
Day Month Year Open High Low Close
0 02 01 2013 35 35.450001 34.709999 35.360001
1 03 01 2013 35.18 35.450001 34.75 34.77
2 04 01 2013 34.799999 34.799999 33.919998 34.400002
3 07 01 2013 34.799999 34.799999 33.900002 34.34
4 08 01 2013 34.5 34.5 33.110001 33.68
5 09 01 2013 34.009998 34.189999 33.400002 33.639999
6 10 01 2013 33.869999 33.990002 33.380001 33.529999
7 11 01 2013 34.040001 34.040001 32.110001 32.91
8 14 01 2013 33.080002 33.380001 32.849998 33.259998
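Since the question asked for a single column: once the dataframe exists you can take any column by its header name, e.g. df['Day'] for the first one.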
I would recommend looking into pandas. Once you have your HTML in memory you can try:
import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html(myHtml)[0]
It works pretty well.

Scraping wikipedia table r

Trying to scrape the first 8 tables (very high, high, medium, low) from the Human Development Index article on Wikipedia.
I started with the following, but I'm getting a list of length zero. What am I doing wrong? New to R :(
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index#Complete_list_of_countries"
webpage <- read_html(url)
hdi_tables <- html_nodes(webpage, 'table')
head(hdi_tables, n = 10)

scrape <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/div[5]/table/tbody/tr/td[1]/table') %>%
  html_table()
head(scrape, n = 10)
I think it would be easier to work with the original data source:
Select "Human Development Index (HDI)" in both the drop-down select lists, then click the "Download Data" link to get a CSV file named Human Development Index (HDI).csv.
Read it into R:
library(tidyverse)
Human_Development_Index_HDI_ <- read_csv("path/to/Human Development Index (HDI).csv",
                                         skip = 1)
You can reshape the data, get the values for 2015 and classify countries as low, medium, high or very high:
hdi <- Human_Development_Index_HDI_ %>%
  gather(Year, HDI, -`HDI Rank (2015)`, -Country) %>%
  filter(Year == "2015") %>%
  na.omit() %>%
  mutate(Year = as.numeric(Year),
         classification = cut(HDI,
                              breaks = c(0, 0.549, 0.699, 0.799, 1),
                              labels = c("low", "medium", "high", "very_high")))
hdi
# A tibble: 188 x 5
`HDI Rank (2015)` Country Year HDI classification
<int> <chr> <dbl> <dbl> <fctr>
1 169 Afghanistan 2015 0.479 low
2 75 Albania 2015 0.764 high
3 83 Algeria 2015 0.745 high
4 32 Andorra 2015 0.858 very_high
5 150 Angola 2015 0.533 low
6 62 Antigua and Barbuda 2015 0.786 high
7 45 Argentina 2015 0.827 very_high
8 84 Armenia 2015 0.743 high
9 2 Australia 2015 0.939 very_high
10 24 Austria 2015 0.893 very_high
# ... with 178 more rows
You could change the filter to get values for 2014 too, if you want to replicate the "change from previous year" values in the Wikipedia table.
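For instance, a sketch of that variant (same pipeline, keeping both years and differencing within each country):
hdi_change <- Human_Development_Index_HDI_ %>%
  gather(Year, HDI, -`HDI Rank (2015)`, -Country) %>%
  filter(Year %in% c("2014", "2015")) %>%
  na.omit() %>%
  group_by(Country) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(change = HDI - lag(HDI))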
If you're okay with parsing the Wikipedia markup language instead, you could try using WikipediR to grab the markup of the page (from skimming the documentation, try page_content with as_wikitext set to TRUE). Then you'll get some lines that all look like this:
| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001
This should be parseable in R using strsplit or something.
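For instance, a rough sketch for rows shaped like the sample line above (the cleanup regexes are guesses and would need hardening before running them over the whole page):
line <- '| 1 || {{steady}} ||style="text-align:left"| {{flag|Norway}} || 0.949 || {{increase}} 0.001'
cells <- strsplit(line, "\\|\\|")[[1]]
cells <- gsub("\\{\\{flag\\|([^}]+)\\}\\}", "\\1", cells)   # keep the country name
cells <- gsub("\\{\\{[^}]*\\}\\}", "", cells)               # drop the other templates
cells <- trimws(gsub('^\\| ?|style="[^"]*"\\|', "", cells)) # strip row/style markup
cells
# [1] "1"      ""       "Norway" "0.949"  "0.001"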

Extract Rows from JSON Google API request in R

I am pulling in data on banks based on a nearby-search request to Google's Maps API. In some instances it pulls more than one bank (as some banks may be close together). How do I extract each individual JSON object returned (I need to get the place id) so that I can do a second API pull (based on the place id) to fetch more detail about each bank? Here is my code:
require(jsonlite)
require(utils)
plcUrl <- "https://maps.googleapis.com/maps/api/place/nearbysearch/json?"
key <- "myKEY"
location <- paste0("41.0272, -81.51345")
address <- "XXXXXXXXXXXXXXXXX"
type <- "bank"
radius <- "500"
name = "XXXXX"
strurl <- as.character(paste(plcUrl,
                             "&location=", location,
                             "&address=", address,
                             #"&name=", name,
                             "&radius=", radius,
                             "&type=", type,
                             "&key=", key,
                             sep = ""))
setInternet2(TRUE)
rd <- fromJSON(URLencode(strurl))
rd$results$place_id
As of googleway v2.4 I've added methods that access specific elements of Google API queries.
library(googleway)

key <- "your_api_key"

## search places
res <- google_places(location = c(41.0272, -81.51345),
                     key = key,
                     place_type = "bank",
                     radius = 500)

## get the place_id values using the `place` method to extract the ids
place(res)
# [1] "ChIJDe7R2HQqMYgRvqoszlV6YTA" "ChIJDQwLUXMqMYgR-3Nb2KFhZZ0"

## query the details
details <- google_place_details(place_id = place(res)[1], key = key)

details$result$opening_hours
# $open_now
# [1] FALSE
#
# $periods
# close.day close.time open.day open.time
# 1 1 1600 1 0900
# 2 2 1600 2 0900
# 3 3 1600 3 0900
# 4 4 1600 4 0900
# 5 5 1800 5 0900
# 6 6 1300 6 0900
#
# $weekday_text
# [1] "Monday: 9:00 AM – 4:00 PM" "Tuesday: 9:00 AM – 4:00 PM" "Wednesday: 9:00 AM – 4:00 PM" "Thursday: 9:00 AM – 4:00 PM"
# [5] "Friday: 9:00 AM – 6:00 PM" "Saturday: 9:00 AM – 1:00 PM" "Sunday: Closed"