Scrapy - how to index and extract from html tables - html

This is the webpage I am scraping: http://laxreports.sportlogiq.com/nll/GS2200.html
Below is the code for the spider I created:
import scrapy

class MatchesSpider(scrapy.Spider):
    name = 'matches'
    allowed_domains = ['laxreports.sportlogiq.com']
    start_urls = ['http://laxreports.sportlogiq.com/nll/GS2200.html']

    def parse(self, response):
        tables = response.xpath('//table')
        print(tables)
        table = tables[0].xpath('//tbody')
The XPath expression selects 22 tables, but I don't understand how to select each table individually and extract its contents.
I am a beginner in Scrapy. The solutions I found online all select tables by class or ID, which is not an option here.

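You can do this with pandas alone. pd.read_html returns a list with one DataFrame per table on the page, so each table can be picked by its index.
Code:
import pandas as pd
# read_html parses every <table> on the page into its own DataFrame
dfs = pd.read_html('https://laxreports.sportlogiq.com/nll/GS2200.html')
df = dfs[10]  # the table at index 10; df.to_csv('d.csv', index=False) would save it
print(df)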
Output:
      0             1  2   3    4    5   6     7   8   9  10    11     12
 0    #          Name  G   A  +/-  PIM   S  SOFF  LB   T  CT    FO    TOF
 1    2      W.Malcom  0   0    0    0   1     1   1   4   0     -  11:28
 2    3     T.Edwards  0   0   -2    2   0     0   8   1   2  7-18  20:28
 3    4    J.Sullivan  0   0   -3    2   0     0   3   0   0     -  15:29
 4   11      T.Stuart  0   0   -3    0   0     0   4   1   1     -  21:09
 5   14     W.Jeffrey  0   1   -1    0   0     0   9   2   1     -  19:17
 6   16         R.Lee  2   1    2    0   9     4   6   6   1     -  23:13
 7   17      C.Wardle  2   0    1    2   5     3   4   2   2     -  20:55
 8   18    R.Hope (A)  0   0   -2    2   0     0  11   0   0     -  22:02
 9   20       J.Ruest  3   2    3    0   8     1   3   2   0     -  24:16
10   23      J.Gilles  0   0   -1    0   0     0   4   0   3     -  14:44
11   27    S.Carnegie  0   0   -1    0   0     0   3   0   0     -  12:19
12   37  D.Coates (C)  0   0    0    0   1     0   1   0   0   1-1   2:31
13   51  E.McLaughlin  0   5    2    0   7     3   5   7   0     -  21:41
14   55     D.Kinnear  0   1    2    0   2     0   2   1   0   0-2  10:14
15   67      K.Killen  1   1    0    0   6     1   4   2   0     -  16:42
16   82  J.Cupido (A)  0   1   -1    0   3     0   4   1   0     -  20:52
17   86       J.Lintz  0   1   -1    0   0     0   4   0   1     -  19:26
18   30     T.Carlson  0   0  NaN    0   0     0   0   0   0     -    NaN
19   45        D.Ward  0   0  NaN    0   0     0   0   1   0     -    NaN
20  NaN       Totals:  8  13  NaN    8  42    13  76  30  11  8-21    NaN
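If pulling in pandas is not wanted, the same idea (pick a table by its position rather than by class or ID) can be sketched with the standard library. This is a minimal sketch, not the Scrapy approach from the question: it swaps in Python's html.parser, assumes the tables are not nested, and uses a made-up two-table HTML snippet instead of the real page. The names TableCollector and extract_tables are hypothetical.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in document order
        self._row = None   # cells of the <tr> currently open
        self._cell = None  # text fragments of the <td>/<th> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Return all tables in the HTML string, indexable by position."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables


# Made-up sample standing in for the real page (assumption: no nested tables).
SAMPLE_HTML = """
<table><tr><th>#</th><th>Name</th></tr><tr><td>2</td><td>W.Malcom</td></tr></table>
<table><tr><td>16</td><td>R.Lee</td></tr></table>
"""

tables = extract_tables(SAMPLE_HTML)
print(len(tables))   # number of tables found
print(tables[0])     # every row of the first table
print(tables[1][0])  # first row of the second table
```

Each table is then just `tables[i]`, and each row is a plain list of strings, so no class or ID is needed to address it.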

Related

Bar chart from many variables where varx = in Stata

I have a bar chart question. For every variable in the dataset, 1 = yes and 0 = no. I would like to plot a bar graph with the percentage of observations where the variable equals 1 on the y-axis and the variables on the x-axis. Thanks in advance.
Dataset
Water  Ice  Fire  Vapor
1      1    0     1
1      0    0     1
0      1    1     1
1      1    1     1
1      1    0     1
1      1    1     0
0      1    1     1
0      1    0     1
0      1    1     1
1      0    1     1
0      1    0     0
0      1    1     0
1      0    1     0
1      0    1     0
1      1    1     1
0      1    0     1
1      0    1     1
1      0    1     0
1      1    0     1
1      0    0     1
0      1    1     1
1      1    0     1
1      0    0     1
0      1    1     1
The percent of 1s in a (0, 1) variable is just the mean multiplied by 100. As you probably want to see the percent as text on the graph, one method is to clone the variables and multiply each by 100.
You could then use graph bar directly as it defaults to showing means. I don't like its default in this case and the code instead uses statplot, which must be installed before you can use it.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(water ice fire vapor)
1 1 0 1
1 0 0 1
0 1 1 1
1 1 1 1
1 1 0 1
1 1 1 0
0 1 1 1
0 1 0 1
0 1 1 1
1 0 1 1
0 1 0 0
0 1 1 0
1 0 1 0
1 0 1 0
1 1 1 1
0 1 0 1
1 0 1 1
1 0 1 0
1 1 0 1
1 0 0 1
0 1 1 1
1 1 0 1
1 0 0 1
0 1 1 1
end
quietly foreach v of var water-vapor {
    clonevar `v'2 = `v'
    label var `v'2 "`v'"
    replace `v'2 = 100 * `v'
}
* ssc install statplot
statplot *2 , recast(bar) ytitle(%) blabel(bar, format(%2.1f))
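The point that the percent of 1s in a (0, 1) variable is just the mean times 100 can be checked outside Stata. Below is a small Python sketch using only the first four rows of the dataex listing; the helper name percent_ones is made up for illustration, and no plotting is attempted.

```python
# First four rows of the dataex listing above; the full data set has 24 rows.
rows = [
    (1, 1, 0, 1),
    (1, 0, 0, 1),
    (0, 1, 1, 1),
    (1, 1, 1, 1),
]
names = ["water", "ice", "fire", "vapor"]


def percent_ones(rows, names):
    """Return {variable: percent of rows in which the variable equals 1}."""
    columns = list(zip(*rows))  # transpose rows into one tuple per variable
    # For a 0/1 variable the sum counts the 1s, so sum / n is the mean.
    return {name: 100 * sum(col) / len(col) for name, col in zip(names, columns)}


percentages = percent_ones(rows, names)
for name, pct in percentages.items():
    print(f"{name}: {pct:.1f}%")
```

These per-variable percentages are the bar heights that the graph puts on the y-axis.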
Try
. ssc install mylabels
checking mylabels consistency and verifying not already installed...
all files already exist and are up to date.
. sysuse nlsw88, clear
(NLSW, 1988 extract)
. mylabels 0(10)70, myscale(#/100) local(labels)
0 "0" .1 "10" .2 "20" .3 "30" .4 "40" .5 "50" .6 "60" .7 "70"
. graph bar (mean) married collgrad south union, showyvars legend(off) nolabel bargap(20) ylabel(`labels')
. table, statistic(mean married collgrad south union)
------------------------------
Married | .6420303
College graduate | .2368655
Lives in the south | .4194123
Union worker | .2454739
------------------------------
This relies on mylabels, and implements the bar gap (which I also like).

Is there a way to web scrape HTML table data that keeps showing up as "" when using rvest tools?

<td headers="apcl1" data-dyn="1" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl2" data-dyn="2" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl3" data-dyn="3" class="text-center">1<span class="hidden"> authorized course</span></td>
<td headers="apcl4" data-dyn="4" class="text-center">--<span class="hidden"> no authorized courses</span></td>
For the above HTML code, I am trying to scrape the data in the td tag between > and < span (i.e., 1, 1, 1, --).
I am using R and the rvest package and my code is below:
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
read_html(individual_temp_url) %>%
  html_nodes('td') %>%
  html_text()
However, when I do this, all I get is "" for each of the td tags. How can I extract the numbers from each td tag?
The td elements are blank on the html you download. In the browser, they are populated by javascript after the page loads, from a JSON included in one of the page's script tags. You can extract this and parse the JSON to get a nice data frame:
library(rvest)
#> Loading required package: xml2
individual_temp_url <- "https://apcourseaudit.inflexion.org/ledger/school.php?a=MTQ4Mzk=&b=MA=="
df <- read_html(individual_temp_url) %>%
  html_nodes('script') %>%
  html_text() %>%
  `[`(4) %>%
  strsplit("dataSet = |\r\n|;") %>%
  unlist() %>%
  `[`(3) %>%
  jsonlite::fromJSON()
df
#> data data data data data data data data data
#> 1 2007-08 2008-09 2009-10 2010-11 2011-12 2012-13 2013-14 2014-15 2015-16
#> 2 0 0 0 0 0 1 1 1 1
#> 3 2 2 2 2 2 2 2 2 2
#> 4 3 3 3 3 3 2 2 4 3
#> 5 1 1 1 1 1 1 1 1 2
#> 6 2 3 2 2 2 2 2 2 2
#> 7 1 1 1 1 1 1 1 1 1
#> 8 0 0 0 0 0 0 0 0 0
#> 9 1 1 1 1 1 1 1 1 1
#> 10 1 1 1 1 1 1 1 1 1
#> 11 1 1 1 1 1 2 2 3 1
#> 12 0 0 2 2 2 2 2 2 1
#> 13 0 0 1 1 1 1 1 1 1
#> 14 0 0 0 0 0 1 1 1 0
#> 15 0 0 0 0 1 1 1 1 1
#> 16 0 0 0 0 0 0 0 2 2
#> 17 0 0 0 0 0 0 0 0 1
#> 18 0 0 0 0 0 2 2 0 0
#> 19 0 0 0 0 0 0 0 0 0
#> 20 1 1 1 1 1 1 2 2 2
#> 21 1 1 1 1 1 1 1 1 1
#> 22 1 1 1 1 1 1 1 1 1
#> 23 1 1 1 1 1 2 2 2 2
#> 24 1 2 2 1 1 1 1 1 1
#> 25 2 3 4 2 1 1 1 1 2
#> 26 2 3 3 2 1 2 1 1 2
#> data data data data
#> 1 2016-17 2017-18 2018-19 2019-20
#> 2 1 1 1 0
#> 3 2 2 2 1
#> 4 0 0 1 2
#> 5 0 0 0 2
#> 6 2 2 2 1
#> 7 1 1 1 1
#> 8 1 1 1 1
#> 9 1 1 1 1
#> 10 1 2 2 1
#> 11 1 1 1 1
#> 12 2 2 2 2
#> 13 1 1 1 1
#> 14 0 0 0 0
#> 15 1 1 1 1
#> 16 2 2 2 1
#> 17 0 1 1 0
#> 18 0 0 0 0
#> 19 0 0 1 1
#> 20 0 0 1 1
#> 21 1 1 1 1
#> 22 0 0 1 0
#> 23 2 2 2 2
#> 24 1 1 0 1
#> 25 2 2 3 3
#> 26 0 0 1 1
Created on 2020-03-07 by the reprex package (v0.3.0)
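The same approach (find the `dataSet = ...` assignment inside a script tag and parse what follows as JSON) can be sketched in Python. This is only an illustration under assumptions: the sample HTML below is made up and much smaller than the real page, the shape of the JSON is guessed, and extract_dataset is a hypothetical helper name. Only the variable name dataSet comes from the answer above.

```python
import json
import re


def extract_dataset(html, var_name="dataSet"):
    """Parse the JSON value assigned to `var_name` inside the page source."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Made-up page source; the real JSON structure is an assumption.
SAMPLE_HTML = (
    '<html><script>var dataSet = [{"data": ["2007-08", "0", "2"]}, '
    '{"data": ["2008-09", "0", "2"]}];\r\n</script></html>'
)

data = extract_dataset(SAMPLE_HTML)
print(data[0]["data"])
```

Letting the JSON decoder find the end of the value avoids splitting the script text on `;`, which could break if a string inside the JSON contained that character.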

MySql join query

I want to join two tables and produce a result like the one below.
Table : 1
-------------------------------
Text val1 val2 val3 val4
-------------------------------
Test 96 1 4 0
Test 96 3 4 0
Test 96 5 4 0
Test 96 7 4 0
Test 96 9 4 0
Test 96 11 4 0
Test 96 13 4 0
Test 96 15 4 0
Test 87 7 6 1
Test1 87 7 6 1
Test1 95 5 4 0
Test1 95 13 4 0
Test2 109 15 6 0
Test3 109 15 5 0
Test4 109 15 4 0
Test5 109 15 3 0
Test6 107 0 7 0
Test7 107 0 6 0
Test8 107 0 5 0
Test9 107 0 4 0
Table : 2
-------------------------------
ID val1 val2 val3 val4
-------------------------------
10 96 1 4 0
10 96 3 4 0
10 96 5 4 0
10 96 7 4 0
10 96 9 4 0
10 96 11 4 0
10 96 13 4 0
10 96 15 4 0
10 87 7 6 1
11 87 7 6 1
11 95 5 4 0
11 95 13 4 0
12 109 15 6 0
13 109 15 5 0
14 109 15 4 0
15 109 15 3 0
16 107 0 7 0
17 107 0 6 0
18 107 0 4 0
Output Table
-------------------------------
Text ID val1 val2 val3 val4
-------------------------------
Test 10 96 1 4 0
Test 10 96 3 4 0
Test 10 96 5 4 0
Test 10 96 7 4 0
Test 10 96 9 4 0
Test 10 96 11 4 0
Test 10 96 13 4 0
Test 10 96 15 4 0
Test 10 87 7 6 1
Test1 11 87 7 6 1
Test1 11 95 5 4 0
Test1 11 95 13 4 0
Test2 12 109 15 6 0
Test3 13 109 15 5 0
Test4 14 109 15 4 0
Test5 15 109 15 3 0
Test6 16 107 0 7 0
Test7 17 107 0 6 0
Test8 18 107 0 4 0
Kindly help me with a select query for this.
select table1.Text
     , table2.id
     , table1.val1
     , table1.val2
     , table1.val3
     , table1.val4
from table1
join table2
  on table1.val1 = table2.val1
 and table1.val2 = table2.val2
 and table1.val3 = table2.val3
 and table1.val4 = table2.val4
select table1.Text, table2.Id, table1.val1, table1.val2, table1.val3, table1.val4
from table1
inner join table2
  on table1.val1 = table2.val1
 and table1.val2 = table2.val2
 and table1.val3 = table2.val3
 and table1.val4 = table2.val4
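To try the join without a MySQL server, here is a small self-contained sketch in Python. It swaps in SQLite (via the standard sqlite3 module) for MySQL and loads only a few rows from the tables in the question, so it illustrates the query shape rather than reproducing the full expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 ("Text" TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER);
    CREATE TABLE table2 (id INTEGER, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER);
""")

# A few rows copied from the question; "Test8" has no partner in table2.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [("Test", 96, 1, 4, 0), ("Test1", 95, 5, 4, 0), ("Test8", 107, 0, 5, 0)],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?, ?)",
    [(10, 96, 1, 4, 0), (11, 95, 5, 4, 0)],
)

# Inner join on all four value columns; unmatched rows are dropped.
joined = conn.execute("""
    SELECT table1."Text", table2.id, table1.val1, table1.val2, table1.val3, table1.val4
    FROM table1
    JOIN table2
      ON table1.val1 = table2.val1
     AND table1.val2 = table2.val2
     AND table1.val3 = table2.val3
     AND table1.val4 = table2.val4
    ORDER BY table2.id
""").fetchall()

for row in joined:
    print(row)
```

Note that if the same (val1, val2, val3, val4) combination appears more than once in both tables, an inner join returns every pairing of those rows.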

Mysql and High CPU IO Wait

I have the following problem. I'm running MySQL server 5.1.37 on Ubuntu 9.10 x86 on Amazon. For the data store I use an EBS volume formatted as ext3.
From time to time, MySQL starts processing about 10 to 20 queries, and processing them takes more than 300 seconds (these queries use filesort). During that time no other transaction can be executed.
I've checked the CPU wait, and here is what it shows:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
24 9 66 0 0 0| 0 76k| 258k 989k| 0 0 |5970 3014
23 1 75 0 0 1|4096B 28k| 229k 1536k| 0 0 |3249 2308
19 6 74 0 0 0|4096B 316k| 209k 609k| 0 0 |4943 2542
19 17 62 0 0 2|4096B 36k| 230k 718k| 0 0 |5482 2520
21 19 57 2 0 2| 16k 800k| 271k 860k| 0 0 |6549 2923
23 27 44 5 0 1| 480k 40k| 288k 979k| 0 0 |4140 2682
12 0 86 1 0 0| 256k 48k| 237k 771k| 0 0 |3404 2627
22 1 75 0 0 1|8192B 60k| 285k 908k| 0 0 |4009 2786
54 21 19 3 0 2|4096B 3384k| 287k 1556k| 0 0 |3962 2284
49 24 24 1 0 2|4096B 928k| 285k 2795k| 0 0 |3257 2005
61 19 17 2 0 2|8192B 36k| 215k 577k| 0 0 |3246 1922
40 49 8 0 0 3| 0 40k| 312k 905k| 0 0 |3282 1732
56 23 20 1 0 1|4096B 188k| 247k 897k| 0 0 |3102 2238
39 19 27 16 0 0|4096B 77M| 265k 819k| 0 0 |5147 3075
35 35 12 16 0 1|4096B 56M| 259k 1052k| 0 0 |4656 2739
36 27 8 28 0 1|4096B 59M| 259k 1139k| 0 0 |5549 2821
27 13 36 23 0 1|4096B 64M| 251k 1218k| 0 0 |4207 2540
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
26 4 13 57 0 1|4096B 66M| 275k 681k| 0 0 |5205 3291
22 6 27 43 0 1|4096B 52M| 237k 684k| 0 0 |4906 2602
14 3 24 58 0 0|4096B 46M| 278k 1058k| 0 0 |6448 3687
19 3 34 43 0 2| 32k 51M| 233k 685k| 0 0 |5006 2652
27 3 9 61 0 1|4096B 51M| 294k 800k| 0 0 |4428 2384
17 3 30 50 0 1|4096B 42M| 243k 699k| 0 0 |5334 2830
40 18 0 42 0 0| 0 89M| 247k 840k| 0 0 |4698 2977
31 18 11 39 0 2|4096B 42M| 238k 1269k| 0 0 |4270 2474
17 3 13 66 0 0|4096B 49M| 260k 773k| 0 0 |5153 3100
21 2 14 62 0 1|8192B 46M| 269k 948k| 0 0 |6762 3581
24 2 35 39 0 0|4096B 39M| 256k 777k| 0 0 |5313 2761
15 2 10 72 0 1|4096B 49M| 237k 797k| 0 0 |5312 3018
19 4 22 55 0 0|8192B 47M| 307k 1034k| 0 0 |5508 3278
41 3 15 40 0 1|8192B 47M| 293k 727k| 0 0 |5630 3303
16 2 26 54 0 1|4096B 56M| 282k 1750k| 0 0 |5016 2781
17 3 12 67 0 2|8192B 43M| 238k 824k| 0 0 |5751 3147
14 11 50 24 0 1|4096B 39M| 247k 1105k| 0 0 |4454 2389
41 3 20 35 0 1| 0 58M| 152k 481k| 0 0 |4009 2958
52 2 4 41 0 1|4096B 59M| 211k 621k| 0 0 |5449 2846
31 2 0 66 0 1| 0 52M| 255k 1476k| 0 0 |5167 2693
36 2 24 36 0 2| 12k 49M| 311k 888k| 0 0 |4537 2563
47 7 2 43 0 2|4096B 50M| 231k 750k| 0 0 |4083 2165
40 4 6 50 0 0|4096B 86M| 211k 819k| 0 0 |4768 2875
29 5 2 65 0 0| 0 79M| 180k 580k| 0 0 |4271 4461
40 3 0 57 0 0|4096B 58M| 238k 1489k| 0 0 |4366 4480
27 8 26 38 0 1|4096B 33M| 301k 984k| 0 0 |4439 2838
11 2 9 78 0 1|4096B 24M| 230k 646k| 0 0 |4894 4504
10 3 14 72 0 0|4096B 21M| 183k 549k| 0 0 |4066 3952
14 3 27 57 0 0| 0 64M| 147k 339k| 0 0 |3479 2860
10 2 19 69 0 0|4096B 51M| 112k 452k| 0 0 |2847 2300
9 4 18 69 0 0|4096B 37M| 131k 443k| 0 0 |2923 2004
4 2 49 45 0 0|4096B 31M| 97k 230k| 0 0 |2163 1545
1 2 73 24 0 0| 0 33M| 49k 130k| 0 0 |1425 824
1 0 71 28 0 0| 0 26M| 36k 86k| 0 0 |1426 910
0 0 55 45 0 0| 0 32M| 32k 148k| 0 0 |1334 695
4 0 64 32 0 0| 0 39M| 14k 39k| 0 0 |1262 406
0 2 38 60 0 0| 0 44M| 13k 44k| 0 0 |1136 382
1 1 82 16 0 0| 0 47M| 25k 70k| 0 0 |1228 584
1 3 69 27 0 0|4096B 46M| 23k 60k| 0 0 |1576 599
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
3 1 70 27 0 0|4096B 43M| 22k 54k| 0 0 |1065 574
1 1 33 65 0 0| 0 46M|6124B 17k| 0 0 |1190 345
1 1 49 50 0 0| 0 47M| 11k 22k| 0 0 |1258 444
2 11 23 64 0 0| 56k 58M|9749B 47k| 0 0 |1143 379
1 1 64 34 0 0| 0 51M| 198B 5914B| 0 0 |1048 234
0 1 63 36 0 0| 0 58M| 662B 1278B| 0 0 | 976 454
1 0 81 18 0 0| 0 50M| 426B 6022B| 0 0 |1304 600
0 1 70 29 0 0| 0 43M| 132B 1868B| 0 0 |1150 210
1 1 79 19 0 0| 0 51M| 198B 5914B| 0 0 | 986 246
1 2 30 66 0 0| 0 54M| 246B 420B| 0 0 |1150 288
1 0 49 50 0 0| 0 55M| 659B 6752B| 0 0 |1038 280
1 2 37 60 0 0| 0 47M| 66B 354B| 0 0 |1191 227
0 0 80 19 0 0| 0 43M| 561B 6044B| 0 0 |1129 256
5 13 44 38 0 0| 0 49M|1558B 19k| 0 0 |1225 243
3 6 48 42 0 0| 0 52M| 705B 6022B| 0 0 | 948 327
What could cause such a behavior? Are there any techniques to avoid this?
You're showing a high amount of IO_WAIT status on the CPU (65%). It's possible that you're just pulling too much out of the disks. Try running iostat and seeing what the disk activity is like (namely transactions per second).
However, you mention 10 to 20 queries. Are these queries doing any writing at all? Are they using a transaction? If the answer to either is yes, then you're locking because of the transaction lock in MySQL. If that's the case, your problem is that you need to either figure a way to remove the transaction, or make the queries much more efficient.
A good test would be to create another database on the server. Then run your queries and try to query against the different database. If it works, it's due to transaction locks. If it doesn't, it's likely the disk or some other leak from the VM...
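The I/O wait figure discussed above comes from the wai column of the dstat output in the question. As a rough sketch, that column can be pulled out of pasted dstat text in Python; the function name iowait_values and the 30% threshold are illustrative choices, not part of dstat, and only three sample lines from the question are used.

```python
def iowait_values(dstat_text):
    """Return the `wai` percentage from every data line of dstat output."""
    values = []
    for line in dstat_text.splitlines():
        # The CPU block is everything before the first '|':
        # usr sys idl wai hiq siq
        cpu_part = line.split("|", 1)[0].split()
        if len(cpu_part) == 6 and all(tok.isdigit() for tok in cpu_part):
            values.append(int(cpu_part[3]))
    return values


# Header plus three data lines copied from the question.
SAMPLE = """\
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
24 9 66 0 0 0| 0 76k| 258k 989k| 0 0 |5970 3014
39 19 27 16 0 0|4096B 77M| 265k 819k| 0 0 |5147 3075
26 4 13 57 0 1|4096B 66M| 275k 681k| 0 0 |5205 3291
"""

waits = iowait_values(SAMPLE)
high = [w for w in waits if w > 30]  # arbitrary threshold for "high" wait
print(waits, high)
```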
The biggest suspect here is the performance of the EBS volumes: the CPUs may be waiting for I/Os to complete. The next question is what is causing the I/O requests.
This question might be better answered on ServerFault.
http://www.mysqlperformanceblog.com/2011/02/21/death-match-ebs-versus-ssd-price-performance-and-qos/
The IO performance of EBS is poor (I recently benchmarked EBS on a Small instance as being half as fast as my laptop's hard drive). However, you can improve it significantly by striping multiple EBS volumes into a software RAID configuration.
http://alestic.com/2009/06/ec2-ebs-raid
EBS comes with a lot of its own limitations. If you are running your instance in the US region, it's better to switch to optimized EBS to make the I/O faster. I was also running a self-managed MySQL but later switched to RDS, which gives much better performance than EBS.

Query to sum duplicated fields

Here is the MySQL data:
id usr good quant delayed cart_ts
------------------------------------------------------
14 4 1 1 0 20100601235348
13 4 11 1 0 20100601235345
12 4 4 1 0 20100601235335
11 4 1 1 0 20100601235051
10 4 11 1 0 20100601235051
9 4 4 1 0 20100601235051
15 4 2 1 0 20100601235350
16 4 7 1 0 20100602000537
17 4 3 1 0 20100602000610
18 4 3 1 0 20100602000616
19 4 8 1 0 20100602000802
20 4 8 1 0 20100602000806
21 4 8 1 0 20100602000828
22 4 8 1 0 20100602000828
23 4 8 1 0 20100602000828
24 4 8 1 0 20100602000828
25 4 8 1 0 20100602000828
26 4 8 1 0 20100602000829
27 4 8 1 0 20100602000829
28 4 9 1 0 20100602001045
29 4 10 1 0 20100602001046
I need to group the rows in which usr and good have duplicate values, summing the quant field,
to get something like this:
id usr good quant delayed cart_ts
------------------------------------------------------
14 4 1 2 0 20100601235348
13 4 11 2 0 20100601235345
12 4 4 2 0 20100601235335
15 4 2 1 0 20100601235350
16 4 7 1 0 20100602000537
17 4 3 2 0 20100602000610
19 4 8 9 0 20100602000802
28 4 9 1 0 20100602001045
29 4 10 1 0 20100602001046
Which MySQL query do I need to get this result?
SELECT id,usr,good,SUM(quant),delayed,cart_ts FROM table GROUP BY usr,good
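The query above selects id, delayed and cart_ts without aggregating them, so MySQL either picks a value from an arbitrary row of each group or, when the ONLY_FULL_GROUP_BY SQL mode is enabled, rejects the query. Below is a hedged sketch of the grouping with an explicit choice for id. It runs on SQLite through Python's sqlite3 module rather than MySQL, uses a hypothetical table name cart, loads only a few rows from the question, and assumes MIN(id) is an acceptable representative; the expected output in the question does not follow a single rule for id, so that choice is a guess.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart (id INTEGER, usr INTEGER, good INTEGER, quant INTEGER)")

# A few rows copied from the question (delayed and cart_ts left out for brevity).
conn.executemany(
    "INSERT INTO cart VALUES (?, ?, ?, ?)",
    [(14, 4, 1, 1), (11, 4, 1, 1), (15, 4, 2, 1), (17, 4, 3, 1), (18, 4, 3, 1)],
)

# One row per (usr, good) pair; quant is summed, id is the smallest in the group.
grouped = conn.execute("""
    SELECT MIN(id) AS id, usr, good, SUM(quant) AS quant
    FROM cart
    GROUP BY usr, good
    ORDER BY good
""").fetchall()

for row in grouped:
    print(row)
```

Swapping MIN(id) for MAX(id), or aggregating cart_ts the same way, is a matter of which row should represent each group.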