how to extract text from html using beautifulsoup?

I want to extract some values from HTML like this:
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
This is the section of my code that produces the HTML above:
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
from datetime import datetime, timedelta
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            # batch is the html above
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl"+str(item)+"_lblAwardBasicNumber"
            print("uid id ", uid)
            awardid = batch.find_all("span", text = re.compile("_lblAwardBasicNumber"))
            print("award id is")
            print(awardid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.."+ str(e) )
            # print(container1)
            continue
        except Exception as e:
            raise e
print(batch) is what produces the HTML above. I want to obtain the number SP450017D0007 from this span:
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
but awardid comes back with nothing. How can I extract SP450017D0007?

Solution:
To get the text SP450017D0007, look the span up by its id and read its text: batch.find('span', id=re.compile(r'_lblAwardBasicNumber$')).get_text(strip=True). The text argument matches a tag's text content, not its attributes, so searching the text for "_lblAwardBasicNumber" finds nothing. The HTML as posted contains no <a> tag, so a lookup such as find('a', text=True) would return None here.
Note:
You have the following extra lines in your code above that should be taken out:
except Exception as e:
    raise e
Code:
import re

import colorama
from bs4 import BeautifulSoup

data = '''
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
'''
pagesoup = BeautifulSoup(data, 'html.parser')
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl" + str(item) + "_lblAwardBasicNumber"
            print("uid id ", uid)
            # Match the span by the end of its id, then read its text.
            awardid = batch.find('span', id=re.compile(r'_lblAwardBasicNumber$')).get_text(strip=True)
            print("award id is")
            print(awardid)
            dateid = batch.find('span', id=re.compile(r'_lblLastModPostingDate$')).get_text(strip=True)
            print("date id is")
            print(dateid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.." + str(e))
            # print(container1)
            continue
Output (first pass of the inner loop):
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
uid id ctl00_cph1_grdAwardSearch_ctl2_lblAwardBasicNumber
award id is
SP450017D0007
date id is
04-12-2018
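A more general variant can be sketched as follows. It assumes every value of interest sits in a span whose id ends in "_lbl" plus a field name, as in the row above. The helper name row_fields and the dict layout are hypothetical, chosen only for illustration, and the HTML is a trimmed copy of the row from the question.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the row from the question (styles and most images omitted).
html = '''
<tr class="BgSilver">
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum">124</span></td>
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber"><img alt="PDF Document"/>SP450017D0007</span></td>
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate">04-12-2018</span></td>
</tr>
'''


def row_fields(row):
    """Hypothetical helper: map each id suffix after '_lbl' to the span's text."""
    fields = {}
    for span in row.find_all("span", id=re.compile(r"_lbl\w+$")):
        label = span["id"].rsplit("_lbl", 1)[1]
        fields[label] = span.get_text(strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
rows = [row_fields(tr) for tr in soup.find_all("tr", class_=["BgWhite", "BgSilver"])]
print(rows)
```

Because the lookup keys on the end of the id, it does not depend on the row counter (ctl26 here), so no loop over candidate ids is needed.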

Related

Excel VBA Web Scraping Table Elements from a <frameset> and a <frame>

I am trying to scrape some table-looking items from a website into Excel.
I'm no stranger to coding in general, though I'm pretty new to VBA in an Excel sense :)
I have tried using Excel's Data > From Web interface, but it's not recognizing the table. I'm guessing that's because the page is built using frames (or at least that's what my Google-Fu has led me to understand).
Snippet of what the second table looks like:
<html>
<frame title="links" ...>...</frame>
<frame title="queue">
#document
<head>...</head>
<body>
<div id="container">
<script>...</script>
<div>
<table id="oTable">
<colgroup>...</colgroup>
<thead>...</thead>
<tbody>
<tr onclick="changeHighlight( 'eid0' )" id="eid0" class="queryshaded">
<td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.5599976.5599976');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a> <a onclick="javascript:window.open('URL','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
</td><td scope="row" nowrap="">12345</td>
<td nowrap="">28/08/2018 17:00:49</td>
<td nowrap="">11/09/2018 16:28:39</td>
<td nowrap="">5,599,976</td>
<td nowrap="">dijm</td></tr>
<tr onclick="changeHighlight( 'eid1' )" id="eid1" class="queryunshaded">
<td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443276.6443276');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a> <a onclick="javascript:window.open('URL;id=3.6443276.6443276','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
</td><td scope="row" nowrap="">67890</td>
<td nowrap="">25/06/2019 11:01:01</td>
<td nowrap="">09/07/2019 10:32:32</td>
<td nowrap="">6,443,276</td>
<td nowrap=""></td></tr>
<tr onclick="changeHighlight( 'eid2' )" id="eid2" class="queryshaded">
<td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443287.6443287');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a> <a onclick="javascript:window.open('URL;id=3.6443287.6443287','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
</td><td scope="row" nowrap="">23456</td>
<td nowrap="">25/06/2019 11:01:24</td>
<td nowrap="">09/07/2019 10:35:30</td>
<td nowrap="">6,443,287</td>
<td nowrap=""></td></tr>
<tr onclick="changeHighlight( 'eid3' )" id="eid3" class="queryunshaded">
<td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443339.6443339');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a> <a onclick="javascript:window.open('URL;id=3.6443339.6443339','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
</td><td scope="row" nowrap="">78901</td>
<td nowrap="">25/06/2019 11:06:02</td>
<td nowrap="">09/07/2019 10:40:39</td>
<td nowrap="">6,443,339</td>
<td nowrap=""></td></tr>
<tr onclick="changeHighlight( 'eid4' )" id="eid4" class="queryshaded">
<td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443344.6443344');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a> <a onclick="javascript:window.open('URL;id=3.6443344.6443344','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
</td><td scope="row" nowrap="">34567</td>
<td nowrap="">25/06/2019 11:06:17</td>
<td nowrap="">09/07/2019 10:40:43</td>
<td nowrap="">6,443,344</td>
<td nowrap=""></td></tr>
I have tried various solutions that look somewhat like this:
https://www.ozgrid.com/forum/forum/other-software-applications/excel-and-web-browsers-help/131683-extracting-data-from-a-grid-on-webpage
and
Scraping data from website using vba
and trying to define the frames themselves to try and get the info from there?
(again: new to Excel VBA)
'set myHTMLDoc to the main pages IE document
Dim myHTMLDoc As HTMLDocument
Set myHTMLDoc = ie.Document
'set myHTMLFrame2 as the 2nd frame of the main page (index starts at 0)
Dim myHTMLFrame2 As HTMLDocument
Set myHTMLFrame2 = myHTMLDoc.Frames(1).Document
With the above block of code I get "Run-time error '438'".
Without the above block I get "Run-time error '1004'".
The info I eventually want is in each row:
</td><td scope="row" nowrap="">67890</td>
<td nowrap="">25/06/2019 11:01:01</td>
<td nowrap="">09/07/2019 10:32:32</td>
<td nowrap="">6,443,276</td>
Ideally I'd like to dump each element into a cell
67890 | 25/06/2019 11:01:01 | 09/07/2019 10:32:32 | 6,443,276
There are 20 of these rows on each page (there's a button to press to get to the next page, which I'll figure out later... hopefully, haha).
Massive pre-emptive thank you to anyone who can help :)
-EDIT-
This is the code that I'm currently working with (not precious about it :P )
Private Sub CommandButton1_Click()
    Dim ie As Object
    Dim html As Object
    Dim objElementTR As Object
    Dim objTR As Object
    Dim objElementsTD As Object
    Dim objTD As Object
    Dim result As String
    Dim intRow As Long
    Dim intCol As Long
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Navigate "URL"
    ie.Visible = True
    ' loop until page is loaded
    Do Until (ie.ReadyState = 4 And Not ie.Busy)
        DoEvents
    Loop
    'set myHTMLDoc to the main pages IE document
    Dim myHTMLDoc As HTMLDocument
    Set myHTMLDoc = ie.Document
    'set myHTMLFrame2 as the 2nd frame of the main page (index starts at 0)
    Dim myHTMLFrame2 As HTMLDocument
    Set myHTMLFrame2 = ie.Document.querySelector("[title=queue]").contentDocument.getElementById("oTable")
    result = myHTMLFrame2
    Set html = CreateObject("htmlfile")
    myHTMLFrame2 = result
    Set objElementTR = html.getElementsByTagName("tr")
    ReDim myarray(0 To objElementTR.Length, 0 To 10)
    For Each objTR In objElementTR
        intRow = intRow + 1
        Set objElementsTD = objTR.getElementsByTagName("td")
        For Each objTD In objElementsTD
            myarray(intRow, intCol) = objTD.innerText
            intCol = intCol + 1
        Next objTD
        intCol = 0
    Next objTR
    With Sheets(1).Cells(1, 1).Cells(Rows.Count, "A").End(xlUp).Offset(1, 0)
        .Resize(UBound(myarray), UBound(myarray, 2)).Value = myarray
    End With
End Sub

You could try isolating the frame by its title attribute, then go via contentDocument and get the table by id:
ie.document.querySelector("[title=queue]").contentDocument.querySelector("#oTable")
The final .querySelector("#oTable") can be interchanged with .getElementById("oTable").
I would then copy the table's .outerHTML to the clipboard and paste the table directly into the sheet.

parsing/escape in Swift

I currently have an HTML string in Swift (part of it is shown here) from which I want to extract one specific part:
<tr style="color:White;background-color:#32B4FA;border-width:1px;border-style:solid;font-weight:normal;">
<th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;"> </th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:20px;">Park-<br>stätte</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Parkmöglichkeit</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Anzahl Stellplätze</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Freie Stellplätze</th>
</tr>
<tr style="color:#000066;">
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:25px;">
<span id="GridView1__Id_0" title="Kennzeichen" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">P1</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;">
<img src="Images/Symbol_Tiefgarage.jpg" style="width:20px;" />
</td>
<td align="left" style="border-width:1px;border-style:solid;font-size:Small;">
<a id="GridView1_HyperLink1_0" href="http://www.paderborn.de/microsite/asp/parken_in_der_city/TG_Koenigsplatz.php" target="_top" style="display:inline-block;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:150px;">Königsplatz</a>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:Smaller;width:40px;">
<span id="GridView1__AnzahlFreiePlaetze_0" title="Freie Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">0</span>
</td>
</tr>
The part that interests me is the "810" (it could be a number from 0 to 1000 or a text string) from:
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
I tried using regex, but that did not work out for me.

I suggest you use an XML/HTML parser that supports CSS selectors to retrieve that string. The span that contains it has the id "GridView1__AnzahlPlaetze_0", so you can use the query "#GridView1__AnzahlPlaetze_0" to retrieve it.
For example, with a Swift library called Fuzi, which wraps libxml2:
import Fuzi
let doc = try? HTMLDocument(string: htmlString)
if let result = doc?.firstChild(css: "#GridView1__AnzahlPlaetze_0") {
    print(result.stringValue)
}
The above code is tested.

How to fill a cell with database tables

I was wondering how you can fill a whole column of an HTML table from a database table, using HTML and SQL. I am using Toad for Oracle and using MySQL as a compiler. I just want to fill a column using a loop, but I don't know how to do it.
If I had a table like:
Table1
------
Y
N
Y
Here is the procedure code:
procedure proc1
is
begin
HTP.P('
<HTML>
<BODY>
<b> <font size="4" color=black>Status Log</font> </b>
<table bgcolor="black" width=1020 align="center" border="0" cellspacing="1" class="sortable"><THEAD><tr bgcolor="#CCCCCC">
');
htp.p('
<th width=30 align=left><font size="2">Phase</font></th>
<td> </td>
<th width=50 align=left><font size="2">State</font></th>
<td> </td>
<th width=414 align=left><font size="2">CHG</font></th>
<td> </td>
<th width=30 align=left><font size="2">Changes</font></th>
<td> </td>
<th width=180 align=left><font size="2">Completed</font></th>
<td> </td>
</table>
</BODY>
</HTML>
');
end proc1;

how to add a # in cfoutput?

I have to render the following HTML tags from an Ajax query. The issue is that CFML treats any string prefixed with # as an identifier, so I'm getting an error.
<cfoutput>
<table style="display:none;" width="100%" border="1" cellpadding="2" cellspacing="0">
<tr>
<td width="43%" bgcolor="#649DCA"><strong>Class</strong></td>
<td width="20%" bgcolor="#649DCA"><strong>Site</strong></td>
<td width="47%" bgcolor="#649DCA"><strong>Date/Time</strong></td>
</tr>
</cfoutput>

You just need to double up your #'s.
<td width="43%" bgcolor="##649DCA"><strong>Class</strong></td>
Personally, I would probably just use CSS and style the table separately.

Using Ruby and Nokogiri to parse HTML using HTML comments as markers

How could I use Ruby to extract information from a table consisting of rows like these? Is it possible to detect the comments using Nokogiri?
<!-- Begin Topic Entry 4134 -->
<tr>
<td align="center" class="row2"><image src='style_images/ip.boardpr/f_norm.gif' border='0' alt='New Posts' /></td>
<td align="center" width="3%" class="row1"> </td>
<td class="row2">
<table class='ipbtable' cellspacing="0">
<tr>
<td valign="middle"><a href='http://www.xxx.com/index.php?showtopic=4134&view=getnewpost'><image src='style_images/ip.boardpr/newpost.gif' border='0' alt='Goto last unread' title='Goto last unread' hspace=2></a></td>
<td width="100%">
<div style='float:right'></div>
<div> <a href="http://www.xxx.com/index.php?showtopic=4134&hl=">EXTRACT LINK 1</a> </div>
</td>
</tr>
</table>
<span class="desc">EXTRACT DESCRIPTION</span>
</td>
<td class="row2" width="15%"><span class="forumdesc"><a href="http://www.xxx.com/index.php?showforum=19" title="Living">EXTRACT LINK 2</a></span></td>
<td align="center" class="row1" width='10%'><a href='http://www.xxx.com/index.php?showuser=1642'>Mr P</a></td>
<td align="center" class="row2"><a href="javascript:who_posted(4134);">1</a></td>
<td align="center" class="row1">46</td>
<td class="row1"><span class="desc">Today, 12:04 AM<br /><a href="http://www.xxx.com/index.php?showtopic=4134&view=getlastpost">Last post by:</a> <b><a href='http://www.xxx.com/index.php?showuser=1649'>underft</a></b></span></td>
</tr>
<!-- End Topic Entry 4134 -->

Try to use xpath instead:
html_doc = Nokogiri::HTML("<html><body><!-- Begin Topic Entry 4134 --></body></html>")
html_doc.xpath('//comment()')

You could implement a Nokogiri SAX parser. This is quicker to do than it might seem at first sight. You get events for elements, attributes and comments.
Within your parser, you should remember the state, e.g. @currently_interested = true, to know which parts to remember and which not.
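For comparison, here is a minimal sketch of the same comment-as-marker idea in Python with BeautifulSoup rather than Nokogiri. It assumes each "Begin Topic Entry" comment is followed by the row it describes. The sample HTML is a trimmed stand-in for the row in the question, and the names is_begin_marker and entries are hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Trimmed stand-in for one topic row from the question.
html = """
<table>
<!-- Begin Topic Entry 4134 -->
<tr><td><a href="http://www.xxx.com/index.php?showtopic=4134&amp;hl=">EXTRACT LINK 1</a></td>
<td><span class="desc">EXTRACT DESCRIPTION</span></td></tr>
<!-- End Topic Entry 4134 -->
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def is_begin_marker(node):
    """True for comment nodes whose text starts with 'Begin Topic Entry'."""
    return isinstance(node, Comment) and node.strip().startswith("Begin Topic Entry")


entries = []
for marker in soup.find_all(string=is_begin_marker):
    row = marker.find_next("tr")  # first <tr> after the marker comment
    entries.append({
        "topic": marker.strip().split()[-1],
        "link_text": row.find("a").get_text(strip=True),
        "description": row.find("span", class_="desc").get_text(strip=True),
    })
print(entries)
```

The comment text itself carries the topic number, so the marker doubles as the record key.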
