Is there a way to web scrape HTML table data that keeps showing up as "" when using rvest tools?

<td headers="apcl1" data-dyn="1" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl2" data-dyn="2" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl3" data-dyn="3" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl4" data-dyn="4" class="text-center">--<span class="hidden"> no authorized courses</span></td>
For the HTML above, I am trying to scrape the text inside each td tag, before the hidden span (i.e., 1, 1, 1, --).
I am using R and the rvest package and my code is below:
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
read_html(individual_temp_url) %>%
  html_nodes('td') %>%
  html_text()
However, when I do this, all I get is "" for each of the td tags. How can I extract the value from each td tag?

The td elements are blank in the HTML you download. In the browser they are populated by JavaScript after the page loads, from JSON embedded in one of the page's script tags. You can extract and parse that JSON to get a nice data frame:
library(rvest)
#> Loading required package: xml2
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
df <- read_html(individual_temp_url) %>%
  html_nodes('script') %>%             # all script tags on the page
  html_text() %>%
  `[`(4) %>%                           # the fourth script holds the data
  strsplit("dataSet = |\r\n|;") %>%    # split around the dataSet assignment
  unlist() %>%
  `[`(3) %>%                           # the piece containing the JSON
  jsonlite::fromJSON()
df
#> data data data data data data data data data
#> 1 2007-08 2008-09 2009-10 2010-11 2011-12 2012-13 2013-14 2014-15 2015-16
#> 2 0 0 0 0 0 1 1 1 1
#> 3 2 2 2 2 2 2 2 2 2
#> 4 3 3 3 3 3 2 2 4 3
#> 5 1 1 1 1 1 1 1 1 2
#> 6 2 3 2 2 2 2 2 2 2
#> 7 1 1 1 1 1 1 1 1 1
#> 8 0 0 0 0 0 0 0 0 0
#> 9 1 1 1 1 1 1 1 1 1
#> 10 1 1 1 1 1 1 1 1 1
#> 11 1 1 1 1 1 2 2 3 1
#> 12 0 0 2 2 2 2 2 2 1
#> 13 0 0 1 1 1 1 1 1 1
#> 14 0 0 0 0 0 1 1 1 0
#> 15 0 0 0 0 1 1 1 1 1
#> 16 0 0 0 0 0 0 0 2 2
#> 17 0 0 0 0 0 0 0 0 1
#> 18 0 0 0 0 0 2 2 0 0
#> 19 0 0 0 0 0 0 0 0 0
#> 20 1 1 1 1 1 1 2 2 2
#> 21 1 1 1 1 1 1 1 1 1
#> 22 1 1 1 1 1 1 1 1 1
#> 23 1 1 1 1 1 2 2 2 2
#> 24 1 2 2 1 1 1 1 1 1
#> 25 2 3 4 2 1 1 1 1 2
#> 26 2 3 3 2 1 2 1 1 2
#> data data data data
#> 1 2016-17 2017-18 2018-19 2019-20
#> 2 1 1 1 0
#> 3 2 2 2 1
#> 4 0 0 1 2
#> 5 0 0 0 2
#> 6 2 2 2 1
#> 7 1 1 1 1
#> 8 1 1 1 1
#> 9 1 1 1 1
#> 10 1 2 2 1
#> 11 1 1 1 1
#> 12 2 2 2 2
#> 13 1 1 1 1
#> 14 0 0 0 0
#> 15 1 1 1 1
#> 16 2 2 2 1
#> 17 0 1 1 0
#> 18 0 0 0 0
#> 19 0 0 1 1
#> 20 0 0 1 1
#> 21 1 1 1 1
#> 22 0 0 1 0
#> 23 2 2 2 2
#> 24 1 1 0 1
#> 25 2 2 3 3
#> 26 0 0 1 1
Created on 2020-03-07 by the reprex package (v0.3.0)
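Note that indexing the fourth script tag is brittle. A slightly more defensive variant (just a sketch, assuming the page still assigns the JSON to a variable named dataSet and the JSON contains no literal semicolons) finds the script by content and pulls the JSON out with a regex:
library(rvest)
library(stringr)

individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="

scripts <- read_html(individual_temp_url) %>%
  html_nodes("script") %>%
  html_text()

# keep the first script that assigns the dataSet variable
data_script <- scripts[str_detect(scripts, "dataSet")][1]

# capture everything between "dataSet = " and the next semicolon
json_txt <- str_match(data_script, regex("dataSet = (.*?);", dotall = TRUE))[, 2]

df <- jsonlite::fromJSON(json_txt)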

Related

Bar chart from many variables where varx = 1 in Stata

I have a bar chart question. For all the variables in the dataset, 1 = yes and 0 = no. I would like to plot a bar graph with the percentages (where var = 1) on the y-axis and the variables on the x-axis. Thanks in advance.
Dataset:
Water Ice Fire Vapor
1     1   0    1
1     0   0    1
0     1   1    1
1     1   1    1
1     1   0    1
1     1   1    0
0     1   1    1
0     1   0    1
0     1   1    1
1     0   1    1
0     1   0    0
0     1   1    0
1     0   1    0
1     0   1    0
1     1   1    1
0     1   0    1
1     0   1    1
1     0   1    0
1     1   0    1
1     0   0    1
0     1   1    1
1     1   0    1
1     0   0    1
0     1   1    1
The percent of 1s in a (0, 1) variable is just the mean multiplied by 100. As you probably want to see the percent as text on the graph, one method is to clone the variables and multiply each clone by 100.
You could then use graph bar directly, as it defaults to showing means. I don't like its default in this case, so the code below instead uses statplot, which must be installed before you can use it.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(water ice fire vapor)
1 1 0 1
1 0 0 1
0 1 1 1
1 1 1 1
1 1 0 1
1 1 1 0
0 1 1 1
0 1 0 1
0 1 1 1
1 0 1 1
0 1 0 0
0 1 1 0
1 0 1 0
1 0 1 0
1 1 1 1
0 1 0 1
1 0 1 1
1 0 1 0
1 1 0 1
1 0 0 1
0 1 1 1
1 1 0 1
1 0 0 1
0 1 1 1
end
quietly foreach v of var water-vapor {
    clonevar `v'2 = `v'
    label var `v'2 "`v'"
    replace `v'2 = 100 * `v'
}
* ssc install statplot
statplot *2 , recast(bar) ytitle(%) blabel(bar, format(%2.1f))
Try
. ssc install mylabels
checking mylabels consistency and verifying not already installed...
all files already exist and are up to date.
. sysuse nlsw88, clear
(NLSW, 1988 extract)
. mylabels 0(10)70, myscale(#/100) local(labels)
0 "0" .1 "10" .2 "20" .3 "30" .4 "40" .5 "50" .6 "60" .7 "70"
. graph bar (mean) married collgrad south union, showyvars legend(off) nolabel bargap(20) ylabel(`labels')
. table, statistic(mean married collgrad south union)
------------------------------
Married | .6420303
College graduate | .2368655
Lives in the south | .4194123
Union worker | .2454739
------------------------------
This relies on mylabels, and implements the bar gap (which I also like).

Scrapy - how to index and extract from html tables

This is the webpage I am scraping: http://laxreports.sportlogiq.com/nll/GS2200.html
Below is the code for the spider I created:
import scrapy


class MatchesSpider(scrapy.Spider):
    name = 'matches'
    allowed_domains = ['laxreports.sportlogiq.com']
    start_urls = ['http://laxreports.sportlogiq.com/nll/GS2200.html']

    def parse(self, response):
        tables = response.xpath('//table')
        print(tables)
        table = tables[0].xpath('//tbody')
I see 22 tables selected by this XPath expression, but I don't fully understand how to select each individual table and extract its contents.
I am a beginner with Scrapy, and after searching online all I can find is how to select tables by class or ID, which is not an option in this case.
You can do that using only pandas.
Code:
import pandas as pd
dfs = pd.read_html('https://laxreports.sportlogiq.com/nll/GS2200.html')
df = dfs[10]  # .to_csv('d.csv', index=False) to save it as CSV instead
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 # Name G A +/- PIM S SOFF LB T CT FO TOF
1 2 W.Malcom 0 0 0 0 1 1 1 4 0 - 11:28
2 3 T.Edwards 0 0 -2 2 0 0 8 1 2 7-18 20:28
3 4 J.Sullivan 0 0 -3 2 0 0 3 0 0 - 15:29
4 11 T.Stuart 0 0 -3 0 0 0 4 1 1 - 21:09
5 14 W.Jeffrey 0 1 -1 0 0 0 9 2 1 - 19:17
6 16 R.Lee 2 1 2 0 9 4 6 6 1 - 23:13
7 17 C.Wardle 2 0 1 2 5 3 4 2 2 - 20:55
8 18 R.Hope (A) 0 0 -2 2 0 0 11 0 0 - 22:02
9 20 J.Ruest 3 2 3 0 8 1 3 2 0 - 24:16
10 23 J.Gilles 0 0 -1 0 0 0 4 0 3 - 14:44
11 27 S.Carnegie 0 0 -1 0 0 0 3 0 0 - 12:19
12 37 D.Coates (C) 0 0 0 0 1 0 1 0 0 1-1 2:31
13 51 E.McLaughlin 0 5 2 0 7 3 5 7 0 - 21:41
14 55 D.Kinnear 0 1 2 0 2 0 2 1 0 0-2 10:14
15 67 K.Killen 1 1 0 0 6 1 4 2 0 - 16:42
16 82 J.Cupido (A) 0 1 -1 0 3 0 4 1 0 - 20:52
17 86 J.Lintz 0 1 -1 0 0 0 4 0 1 - 19:26
18 30 T.Carlson 0 0 NaN 0 0 0 0 0 0 - NaN
19 45 D.Ward 0 0 NaN 0 0 0 0 1 0 - NaN
20 NaN Totals: 8 13 NaN 8 42 13 76 30 11 8-21 NaN

R join 2 data frames

Hello, I would like to know how I can merge 2 data frames in R. There is a merge function, but I would like to do this:
data frame1
X Y Z
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
data frame 2
A B C
1 2 2 2
2 2 2 2
3 2 2 2
mergedataframe
X Y Z A B C
1 1 1 1
2 1 1 1
3 1 1 1 2 2 2
4 1 1 1 2 2 2
5 1 1 1 2 2 2
The thing is, I must synchronize 3 CSV files (data frames) and I have no idea how to do it with R. If somebody has any idea about it, thank you.
I edited my post; I would like my merged data frame to look like this:
data frame1
X Y Z
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
data frame 2
A B C
1 2 2 2
2 2 2 2
mergedataframe
X Y Z A B C
1 1 1 1
2 1 1 1
3 1 1 1 2 2 2
4 1 1 1 2 2 2
5 1 1 1
6 1 1 1
df1 <- data.frame(X = rep(1, 5), Y = 1, Z = 1)
df2 <- data.frame(A = rep(2, 3), B = 2, C = 2)
# align df2 with the last rows of df1 via row names
# (equivalently: rownames(df2) <- 3:5)
rownames(df2) <- tail(rownames(df1), nrow(df2))
# merge on row names (by = 0), keeping all rows of both
mergedataframe <- merge(df1, df2, by = 0, all = TRUE)
mergedataframe <- mergedataframe[, -1]  # drop the Row.names column
mergedataframe
X Y Z A B C
1 1 1 1 NA NA NA
2 1 1 1 NA NA NA
3 1 1 1 2 2 2
4 1 1 1 2 2 2
5 1 1 1 2 2 2
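For the edited example, where df2 should line up with rows 3 and 4 of a 6-row df1, the same row-name trick applies; you just set df2's row names explicitly. A minimal sketch, assuming that alignment:
df1 <- data.frame(X = rep(1, 6), Y = 1, Z = 1)
df2 <- data.frame(A = rep(2, 2), B = 2, C = 2)

# place df2 at rows 3 and 4 of df1
rownames(df2) <- 3:4

mergedataframe <- merge(df1, df2, by = 0, all = TRUE)
# row names come back as character, so restore numeric row order
mergedataframe <- mergedataframe[order(as.numeric(mergedataframe$Row.names)), -1]
mergedataframe
The missing cells appear as NA rather than blanks.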

retrieve integer name in shortest.path function of igraph

First, I have a shortest-path matrix generated with igraph (shortest.paths).
When I want to retrieve the node names with get.shortest.paths, it just gives me the number of each node and not its name:
[,a] [,b] [,c] [,d] [,e] [,f] [,g] [,h] [,i] [,j]
[a,] 0 1 2 3 4 5 4 3 2 1
[b,] 1 0 1 2 3 4 5 4 3 2
[c,] 2 1 0 1 2 3 4 5 4 3
[d,] 3 2 1 0 1 2 3 4 5 4
[e,] 4 3 2 1 0 1 2 3 4 5
[f,] 5 4 3 2 1 0 1 2 3 4
[g,] 4 5 4 3 2 1 0 1 2 3
[h,] 3 4 5 4 3 2 1 0 1 2
[i,] 2 3 4 5 4 3 2 1 0 1
[j,] 1 2 3 4 5 4 3 2 1 0
then:
get.shortest.paths(g, 5, 1)
the answer is:
[[1]]
[1] 5 4 3 2
I want the node names, not their numbers. Is there any solution? I checked vpath, too.
This does the trick for me:
paths <- get.shortest.paths(g, 5, 1)$vpath
names <- V(g)$name
lapply(paths, function(x) { names[x] })
There is a slightly simpler solution that does not use lapply:
paths <- get.shortest.paths(g, 5, 1)
V(g)$name[paths$vpath[[1]]]
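For a self-contained check, here is a minimal sketch; the 10-node ring with letter names is an assumption chosen to reproduce the distance matrix in the question:
library(igraph)

# a ring a-b-...-j, so shortest-path distances match the matrix above
g <- make_ring(10)
V(g)$name <- letters[1:10]

paths <- get.shortest.paths(g, 5, 1)
V(g)$name[paths$vpath[[1]]]
#> [1] "e" "d" "c" "b" "a"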

Query to sum duplicated fields

Here is my MySQL data:
id usr good quant delayed cart_ts
------------------------------------------------------
14 4 1 1 0 20100601235348
13 4 11 1 0 20100601235345
12 4 4 1 0 20100601235335
11 4 1 1 0 20100601235051
10 4 11 1 0 20100601235051
9 4 4 1 0 20100601235051
15 4 2 1 0 20100601235350
16 4 7 1 0 20100602000537
17 4 3 1 0 20100602000610
18 4 3 1 0 20100602000616
19 4 8 1 0 20100602000802
20 4 8 1 0 20100602000806
21 4 8 1 0 20100602000828
22 4 8 1 0 20100602000828
23 4 8 1 0 20100602000828
24 4 8 1 0 20100602000828
25 4 8 1 0 20100602000828
26 4 8 1 0 20100602000829
27 4 8 1 0 20100602000829
28 4 9 1 0 20100602001045
29 4 10 1 0 20100602001046
I need to group rows in which usr & good have duplicate values, summing the quant field, to get something like this:
id usr good quant delayed cart_ts
------------------------------------------------------
14 4 1 2 0 20100601235348
13 4 11 2 0 20100601235345
12 4 4 2 0 20100601235335
15 4 2 1 0 20100601235350
16 4 7 1 0 20100602000537
17 4 3 2 0 20100602000610
19 4 8 9 0 20100602000802
28 4 9 1 0 20100602001045
29 4 10 1 0 20100602001046
Which MySQL query do I need to achieve this?
SELECT id, usr, good, SUM(quant) AS quant, delayed, cart_ts
FROM `table`
GROUP BY usr, good;
Note that id, delayed and cart_ts are not aggregated, so this relies on MySQL's permissive GROUP BY extension, which returns an arbitrary row's values for each group; with ONLY_FULL_GROUP_BY enabled you would need aggregates such as MIN(id) and MIN(cart_ts) instead.