I am trying to parse the first table located here using BeautifulSoup in Python. It parsed my first column, but for some reason it didn't parse the entire table. Any help is appreciated!
Note: I am trying to parse the entire table and convert it into a pandas DataFrame.
My Code:
import requests
from bs4 import BeautifulSoup
WIKI_URL = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(WIKI_URL, features="lxml")
print(soup.prettify())
my_table = soup.find('table',{'class':'wikitable sortable'})
links=my_table.findAll('a')
print(links)
It only parsed one column because you did a findAll for only the links, which live in the first column. To parse the entire table you'd have to do a findAll for the table rows <tr> and then a findAll within each row for the table data cells <td>. Right now you are just doing a findAll for the links and then printing them.
my_table = soup.find('table', {'class': 'wikitable sortable'})
for row in my_table.findAll('tr'):
    print(','.join([td.get_text(strip=True) for td in row.findAll('td')]))
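Since your note says the end goal is a pandas DataFrame, you can collect those rows into a list and hand it to pandas. Here is a minimal sketch along those lines; it assumes the first <tr> of this wikitable holds the <th> header cells, so treat it as a starting point rather than a drop-in solution:
import requests
import pandas as pd
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records").text
soup = BeautifulSoup(html, features="lxml")
my_table = soup.find('table', {'class': 'wikitable sortable'})

# Header cells are <th>; data cells are <td>
headers = [th.get_text(strip=True) for th in my_table.find('tr').findAll('th')]
rows = [[td.get_text(strip=True) for td in tr.findAll('td')]
        for tr in my_table.findAll('tr')[1:]]
rows = [r for r in rows if r]  # drop rows with no <td> cells

# Wikitable column counts don't always line up with the header row,
# so only attach the names when they match
df = pd.DataFrame(rows, columns=headers if rows and len(rows[0]) == len(headers) else None)
print(df.head())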
NOTE: Accept B.Adler's solution as it is good work and sound advice. This solution is simply so you can see some alternatives as you are learning.
Whenever I see <table> tags, I'll usually check out pandas first to see if I can find what I need from the tables that way. pd.read_html() will return a list of dataframes, and you can work/manipulate those to extract what you need.
import pandas as pd
WIKI_URL = "https://en.wikipedia.org/wiki/NCAA_Division_I_FBS_football_win-loss_records"
tables = pd.read_html(WIKI_URL)
You can also look through the dataframes to see which one has the data you want. I used the dataframe at index position 2 here, which is the first table you were looking for:
table = tables[2]
Output:
print (table)
0 1 ... 6 7
0 Team Won ... Total Games Conference
1 Michigan 953 ... 1331 Big Ten
2 Ohio State 1 911 ... 1289 Big Ten
3 Notre Dame 2 897 ... 1263 Independent
4 Boise State 448 ... 618 Mountain West
5 Alabama 3 905 ... 1277 SEC
6 Oklahoma 896 ... 1274 Big 12
7 Texas 908 ... 1311 Big 12
8 USC 4 839 ... 1239 Pac-12
9 Nebraska 897 ... 1325 Big Ten
10 Penn State 887 ... 1319 Big Ten
11 Tennessee 838 ... 1281 SEC
12 Florida State 5 544 ... 818 ACC
13 Georgia 819 ... 1296 SEC
14 LSU 797 ... 1259 SEC
15 Appalachian State 617 ... 981 Sun Belt
16 Georgia Southern 387 ... 616 Sun Belt
17 Miami (FL) 630 ... 1009 ACC
18 Auburn 759 ... 1242 SEC
19 Florida 724 ... 1182 SEC
20 Old Dominion 76 ... 121 C-USA
21 Coastal Carolina 112 ... 180 Sun Belt
22 Washington 735 ... 1234 Pac-12
23 Clemson 744 ... 1248 ACC
24 Virginia Tech 743 ... 1262 ACC
25 Arizona State 614 ... 1032 Pac-12
26 Texas A&M 741 ... 1270 SEC
27 Michigan State 701 ... 1204 Big Ten
28 West Virginia 750 ... 1292 Big 12
29 Miami (OH) 690 ... 1195 MAC
.. ... ... ... ... ...
101 Memphis 482 ... 1026 The American
102 Kansas 582 ... 1271 Big 12
103 Wyoming 526 ... 1122 Mountain West
104 Louisiana 510 ... 1098 Sun Belt
105 Colorado State 520 ... 1124 Mountain West
106 Connecticut 508 ... 1107 The American
107 SMU 489 ... 1083 The American
108 Oregon State 530 ... 1173 Pac-12
109 UTSA 38 ... 82 C-USA
110 Kansas State 526 ... 1207 Big 12
111 New Mexico 483 ... 1103 Mountain West
112 Temple 468 ... 1094 The American
113 Iowa State 524 ... 1214 Big 12
114 Tulane 520 ... 1197 The American
115 Northwestern 535 ... 1240 Big Ten
116 UAB 126 ... 284 C-USA
117 Rice 470 ... 1108 C-USA
118 Eastern Michigan 453 ... 1089 MAC
119 Louisiana-Monroe 304 ... 727 Sun Belt
120 Florida Atlantic 87 ... 205 C-USA
121 Indiana 479 ... 1195 Big Ten
122 Buffalo 370 ... 922 MAC
123 Wake Forest 450 ... 1136 ACC
124 New Mexico State 430 ... 1090 Independent
125 UTEP 390 ... 1005 C-USA
126 UNLV11 228 ... 574 Mountain West
127 Kent State 341 ... 922 MAC
128 FIU 64 ... 191 C-USA
129 Charlotte 20 ... 65 C-USA
130 Georgia State 27 ... 94 Sun Belt
[131 rows x 8 columns]
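One cleanup you will likely want: in the output above, row 0 holds the real column names. Assuming your pandas version parses the table the same way, you can promote that row to the header:
table = tables[2]
# Row 0 contains the column names, so use it as the header and drop it
table.columns = table.iloc[0]
table = table.drop(0).reset_index(drop=True)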
I have a text file such as:
1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995
1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338
848 202 1116 791 1114 236 183 186 150 1016 1258 84 952 1202 988 866
946 155 210 980 896 875 925 613 209 746 147 170 577 942 475 850
1500 322 43 95 74 210 1817 1631 1762 128 181 716 171 1740 145 1123
3074 827 117 2509 161 206 2739 253 2884 248 3307 2760 2239 1676 1137 3055
183 85 143 197 243 72 291 279 99 189 30 101 211 209 77 198
175 149 259 372 140 250 168 142 146 284 273 74 162 112 78 29
169 578 97 589 473 317 123 102 445 217 144 398 510 464 247 109
3291 216 185 1214 167 495 1859 194 1030 3456 2021 1622 3511 222 3534 1580
2066 2418 2324 93 1073 82 102 538 1552 962 91 836 1628 2154 2144 1378
149 963 1242 849 726 1158 164 1134 658 161 1148 336 826 1303 811 178
3421 1404 2360 2643 3186 3352 1112 171 168 177 146 1945 319 185 2927 2289
543 462 111 459 107 353 2006 116 2528 56 2436 1539 1770 125 2697 2432
1356 208 5013 4231 193 169 3152 2543 4430 4070 4031 145 4433 4187 4394 1754
5278 113 4427 569 5167 175 192 3903 155 1051 4121 5140 2328 203 5653 3233
How can I read it into a list of lists of Int in Haskell?
I have tried a few options but could not manage to do it. I am very new to Haskell, so please be patient.
First break your input into lines using lines:
let test = "1 2 3 4\n 5 6 7 \n 4 2 5"
let rows = lines test --literally "lines test"! Beautiful, eh?
Result:
["1 2 3 4"," 5 6 7 "," 4 2 5"] :: [[Char]]
Then, extract individual numbers as strings using words:
let nums_as_strings = map words rows
Result:
[["1","2","3","4"],["5","6","7"],["4","2","5"]] :: :: [[[Char]]]
The last thing to do is convert these strings to integers with read:
let numbers = map (map read) nums_as_strings :: [[Int]]
Result:
[[1,2,3,4],[5,6,7],[4,2,5]] :: [[Int]]
Or, squashed into one line:
let numbers = map (map read) (map words $ lines test) :: [[Int]]
Example with your data:
Prelude> let test = "1224 926 1380 688 845 109 118 88 1275 1306 91 796 102 1361 27 995\n1928 2097 138 1824 198 117 1532 2000 1478 539 1982 125 1856 139 475 1338"
Prelude> map (map read) (map words $ lines test) :: [[Int]]
[[1224,926,1380,688,845,109,118,88,1275,1306,91,796,102,1361,27,995],[1928,2097,138,1824,198,117,1532,2000,1478,539,1982,125,1856,139,475,1338]]
You may need to take care of empty lines, but that's really simple.
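For example, you could filter out the rows that produced no words before the read; a small sketch:
let numbers = map (map read) (filter (not . null) (map words (lines test))) :: [[Int]]
-- filter (not . null) drops entries coming from blank lines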
import System.IO
readListOfLists :: Handle -> IO [[Int]]
readListOfLists handle = do
    contents <- hGetContents handle
    let ls :: [String]
        ls = lines contents
        ws :: [[String]]
        ws = map words ls
        res :: [[Int]]
        res = map (map read) ws
    return res
or you can write the same code in one line:
readListOfLists :: Handle -> IO [[Int]]
readListOfLists = fmap (map (map read . words) . lines) . hGetContents
To use it:
do
    handle <- openFile fileName ReadMode
    table <- readListOfLists handle
    print table  -- force the lazily read contents before closing the handle
    hClose handle
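If you would rather not manage the handle yourself, withFile is a common alternative; a sketch, with "numbers.txt" standing in for whatever your file is called:
import System.IO

main :: IO ()
main = withFile "numbers.txt" ReadMode $ \handle -> do
    table <- readListOfLists handle
    print table  -- consume the lazy contents before withFile closes the handle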
table_1
ST_ID NAME MATHS GEOGRAPHY ENGLISH
1001 Alan Wegman 80 85 70
1002 Robert Franko 79 65 60
1003 Francis John 90 75 67
1004 Finn Harry 88 87 93
table_2
ST_ID NAME MATHS GEOGRAPHY ENGLISH
2001 Alan Wegman 69 75 80
2002 Robert Franko 99 85 70
2003 Francis John 80 65 77
2004 Finn Harry 78 97 83
table_3
ST_ID NAME MATHS GEOGRAPHY ENGLISH
3001 Alan Wegman 90 81 72
3002 Robert Franko 97 65 61
3003 Francis John 74 75 67
3004 Finn Harry 77 88 73
From the above three tables, I want to do the following: take the MATHS value of student Alan Wegman from TABLE 1, which is 80, and divide it by 100; take the GEOGRAPHY value of the same student from TABLE 3, which is 85, and divide it by 100; then from the last table take the ENGLISH value of the same student, which is 70, and divide it by 100. These should then be added to get one value, like this example: (80/100 + 85/100 + 70/100). The output should display NAME and the total value after the addition, for example:
Alan Wegman 2.27
Robert Franko 2.11
Finn Harry 3.29
Is this really possible? I want this to be performed by a single query for all records, or if there is an alternative way of doing this, please share it with me. The query I am using to try to achieve this result is below, but it does not return anything, and I don't know where I am wrong.
select
table_1.NAME MATHS/100+table_2.NAME GEOGRAPHY/100+table_3.NAME ENGLISH/100
WHERE table_1.NAME = table_2.NAME = table_3.NAME
I am not competent with MySQL; I need help here, guys.
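As a sketch of one way to do it: join the three tables on NAME (the ST_IDs differ across tables, so NAME looks like the only shared key, which assumes every student appears in all three tables under exactly the same name) and do the arithmetic in the SELECT list. The mapping below follows your attempted query, taking MATHS from table_1, GEOGRAPHY from table_2, and ENGLISH from table_3; swap the aliases if a different table should supply a given subject:
select t1.NAME,
       t1.MATHS / 100 + t2.GEOGRAPHY / 100 + t3.ENGLISH / 100 as total
from table_1 t1
join table_2 t2 on t2.NAME = t1.NAME
join table_3 t3 on t3.NAME = t1.NAME;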
Aim: I am trying to scrape the historical daily stock prices for all companies from the webpage http://www.nepalstock.com/datanepse/previous.php. The following code works; however, it always generates the daily stock prices for the most recent date (Feb 5, 2015) only. In other words, the output is the same irrespective of the date that I enter. I would appreciate it if you could help in this regard.
library(RHTMLForms)
library(RCurl)
library(XML)
url <- "http://www.nepalstock.com/datanepse/previous.php"
forms <- getHTMLFormDescription(url)
# we are interested in the second list with date forms
# forms[[2]]
# HTML Form: http://www.nepalstock.com/datanepse/
# Date: [ ]
get_stock<-createFunction(forms[[2]])
#create sequence of dates from start to end and store it as a list
date_daily<-as.list(seq(as.Date("2011-08-24"), as.Date("2011-08-30"), "days"))
# determine the number of elements in the list
num<-length(date_daily)
daily_1 <- lapply(date_daily, function(x) {
  show(x)  # displays the particular date
  readHTMLTable(htmlParse(get_stock(Date = x)), which = 7)
})
# 18 tables, out of which table 7 is the one we want
# change the colnames
col_name<-c("SN","Traded_Companies","No_of_Transactions","Max_Price","Min_Price","Closing_Price","Total_Share","Amount","Previous_Closing","Difference_Rs.")
daily_2<-lapply(daily_1,setNames,nm=col_name)
Output:
> head(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount
1 1 Agricultural Development Bank Ltd 24 489 471 473 2,868 1,359,038
2 2 Arun Valley Hydropower Development Company Limited 40 365 360 362 8,844 3,199,605
3 3 Alpine Development Bank Limited 11 297 295 295 150 44,350
4 4 Asian Life Insurance Co. Limited 10 1,230 1,215 1,225 898 1,098,452
5 5 Apex Development Bank Ltd. 23 131 125 131 6,033 769,893
Previous_Closing Difference_Rs.
1 480 -7
2 363 -1
3 303 -8
4 1,242 -17
5 132 -1
> tail(daily_2[[1]],5)
SN Traded_Companies No_of_Transactions Max_Price Min_Price Closing_Price Total_Share Amount Previous_Closing
140 140 United Finance Ltd 4 255 242 242 464 115,128 255
141 141 United Insurance Co.(Nepal)Ltd. 3 905 905 905 234 211,770 915
142 142 Vibor Bikas Bank Limited 7 158 152 156 710 109,510 161
143 143 Western Development Bank Limited 35 320 311 313 7,631 2,402,497 318
144 144 Yeti Development Bank Limited 22 139 132 139 14,355 1,921,511 134
Difference_Rs.
140 -13
141 -10
142 -5
143 -5
144 5
Here's one quick approach. Note that the site uses a POST request to send the date to the server.
library(rvest)
library(httr)
page <- "http://www.nepalstock.com/datanepse/previous.php" %>%
POST(body = list(Date = "2015-02-01")) %>%
html()
page %>%
html_node(".dataTable") %>%
html_table(header = TRUE)
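To tie this back to the original aim, the POST can be wrapped in the same lapply-over-dates pattern from the question. A sketch, assuming the form still accepts dates formatted as "YYYY-MM-DD":
library(rvest)
library(httr)

dates <- format(seq(as.Date("2011-08-24"), as.Date("2011-08-30"), "days"))
daily <- lapply(dates, function(d) {
  "http://www.nepalstock.com/datanepse/previous.php" %>%
    POST(body = list(Date = d)) %>%
    html() %>%
    html_node(".dataTable") %>%
    html_table(header = TRUE)
})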
I have the following csv structure:
'Country' '1960' '1961' '1962'
AUS 450 567 723
NZ 125 320
IND 350 375 395
SL
PAK 100 115 218
Using Python pandas, how do I convert (transpose) the above structure into the following?
'Country' 'Year' 'Value'
AUS 1960 450
AUS 1961 567
AUS 1962 723
NZ 1960
NZ 1961 125
...
My attempts at using pivot have been futile.
In [19]: df
Out[19]:
year 1960 1961 1962
Country
AUS 450 567 723
NZ NaN 125 320
IND 350 375 395
SL NaN NaN NaN
PAK 100 115 218
In [20]: df.stack().reset_index()
Out[20]:
Country year 0
0 AUS 1960 450
1 AUS 1961 567
2 AUS 1962 723
3 NZ 1961 125
4 NZ 1962 320
5 IND 1960 350
6 IND 1961 375
7 IND 1962 395
8 PAK 1960 100
9 PAK 1961 115
10 PAK 1962 218
Apparently NaN values are dropped along the way.
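If you want those rows kept, as in your desired output where NZ 1960 appears with an empty value, stack takes a dropna flag; a small sketch:
out = df.stack(dropna=False).reset_index()
out.columns = ['Country', 'Year', 'Value']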
I have a table like this.
id day1 day2 day3
1 411 523 223
2 413 554 245
3 417 511 209
4 420 515 232
5 422 522 212
6 483 567 212
7 456 512 256
8 433 578 209
9 438 532 234
10 418 555 223
11 460 510 263
12 453 509 245
13 441 524 233
14 430 543 261
15 456 582 222
16 444 524 241
17 478 511 211
18 421 583 222
I want to select all the IDs that have duplicate values in day2.
I'm doing
select day2, count(*) from resultater group by day2 having count(*) > 1;
Is it possible to list all the IDs within the groups?
select day2, count(*), group_concat(id)
from resultater
group by day2
having count(*) > 1;
should do the trick.
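Against the sample table above, day2 contains two repeated values (511 for ids 3 and 17, 524 for ids 13 and 16), so you should get something like:
day2  count(*)  group_concat(id)
511   2         3,17
524   2         13,16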