how to parse a link from a text in python?

I am trying to parse the href in the anchor tag from a text. I tried the following code:
from flask import Flask, render_template
import requests
import re
app = Flask(__name__)
@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)
    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ == '__main__':
    app.run(debug=True)
and htmlcodetxt contains the following text:
<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong>Specification Sheet<br></strong><br></p><p><br></p><p style="text-align: center;"><strong>Photometric Data<br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>

One way would be to use the HTMLParser module like this to parse the href link from the htmlcodetxt string.
from html.parser import HTMLParser  # Python 3; on Python 2 the module was named HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only look at anchor (<a>) tags.
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print(name, "=", value)
# Create the parser and feed it the HTML content that you want scanned for href attributes.
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
I'm not sure how your app handler works, but perhaps you could try something like this?
from flask import Flask, render_template
import requests
import re
from html.parser import HTMLParser  # Python 3; on Python 2 the module was named HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print(name, "=", value)
app = Flask(__name__)
@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)
    # Print every href found in the product description.
    parser = MyHTMLParser()
    parser.feed(htmlcodetxt)
    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ == '__main__':
    app.run(debug=True)
For example, without Flask and using the HTML sample that you posted, the following works and prints the expected output.
#!/usr/bin/python
content = '<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong>Specification Sheet<br></strong><br></p><p><br></p><p style="text-align: center;"><strong>Photometric Data<br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>'
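# The import below works whichever Python version the shebang resolves to.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print("%s = %s" % (name, value))
parser = MyHTMLParser()
parser.feed(content)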
Example output:
$ ./html_parse.py
href = https://dl.dropbox.com/s/saa.pdf?dl=1
href = https://dl.dropbox.com/s/ds.png?dl=1
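In a Flask view you usually want the links as data rather than printed to the console. The following is a minimal Python 3 sketch of that variation, using only the standard library. The names LinkCollector and extract_links are hypothetical, and the sample string is made up for illustration; only its two URLs are taken from the example output above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag instead of printing it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # Skip anchors whose href is missing or empty.
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup; the URLs come from the example output.
sample = (
    '<p><a href="https://dl.dropbox.com/s/saa.pdf?dl=1">Specification Sheet</a></p>'
    '<p><a href="https://dl.dropbox.com/s/ds.png?dl=1">Photometric Data</a></p>'
)
links = extract_links(sample)
print(links)
```

The returned list could then be passed to render_template as one more keyword argument, next to prod, prd and htmlcode.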

Related

How to extract different types of bold text and the text in between them using BeautifulSoup?

I have to parse HTML documents that use bold text as section identifiers, but the bold text appears in different forms; some examples are shown below.
Using Beautiful Soup I am able to parse them, but I have to write a lot of if/else branches to handle the different types of bold. Is there a better way to find such bold text, and the text in between, without so many if/else branches?
<div style="line-height:120%;padding-bottom:12px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font>
</div>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1.</font>
</div>
</td>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">Work.</font>
</div>
</td>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="de42901_List_1._Work"> </a>
<a name="toc_de42901_2"> </a>
</font>
<font size="2"><b> List 1. Work <br> </b></font>
</p>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="da18101_List_1._Work"> </a>
<a name="toc_da18101_3"> </a>
</font>
<font size="2"><b> List 1. </b></font>
<font size="2"><b><i>Work <br> </i></b></font>
</p>

Use split and join to collapse unwanted whitespace such as \n, \t and &nbsp;:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
bold_tags = soup.find_all('b')
for tag in bold_tags:
    # split() with no arguments splits on any run of whitespace
    final = ' '.join(tag.text.split())
    print(final)
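This normalizes the whitespace so that every match prints in the same format. Note that find_all('b') only matches <b> tags; text made bold through a style attribute needs a separate check.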
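The samples in the question mark bold text in two ways: with a <b> tag and with font-weight:bold inside a style attribute. One way to avoid a chain of if/else branches is to reduce both to a single "is this element bold?" test. The sketch below does that with the standard library's html.parser rather than BeautifulSoup, so it is an alternative technique, not the answer's method. The class name BoldTextCollector and the VOID_TAGS set are hypothetical, and the test assumes boldness is only ever written as <b>, <strong> or font-weight:bold.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BoldTextCollector(HTMLParser):
    """Collect the text of every bold region, however the bold is written."""

    def __init__(self):
        super().__init__()
        self._open = []        # (tag, is_bold) for each currently open element
        self._bold_depth = 0   # number of open elements that are bold
        self._buffer = []      # text pieces of the bold region being read
        self.sections = []     # finished, whitespace-normalized bold strings

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        is_bold = tag in ("b", "strong") or "font-weight:bold" in style
        self._open.append((tag, is_bold))
        if is_bold:
            self._bold_depth += 1

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self._open]:
            return  # stray closing tag, ignore it
        # Pop back to the matching opening tag, closing bold regions on the way.
        while self._open:
            name, is_bold = self._open.pop()
            if is_bold:
                self._bold_depth -= 1
                if self._bold_depth == 0:
                    self._flush()
            if name == tag:
                break

    def handle_data(self, data):
        if self._bold_depth:
            self._buffer.append(data)

    def _flush(self):
        # Same split/join trick as above to collapse whitespace.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self.sections.append(text)


# Two of the variants from the question, shortened.
html_text = (
    '<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font>'
    '<p><font size="2"><b> List 1. </b></font>'
    '<font size="2"><b><i>Work <br> </i></b></font></p>'
)
collector = BoldTextCollector()
collector.feed(html_text)
print(collector.sections)
```

Adjacent bold fragments such as "List 1." and "Work" are reported separately; joining neighbours into one section title would be a further step on top of this.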

BeautifulSoup4 and HTML

I want to extract the following information from the HTML below, using Python and bs4:
h2 class placename value,
span class value,
div class="aithousaspec" value
<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
The code that I'm using doesn't seem efficient:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
mydivs = soup.select('div.results-list')
for info in mydivs:
    time = info.select('div.aithousaspec')
    print(time)
    listCinemas = info.select('a[href*="hall.aspx"]')
    print(listCinemas)
    print(len(listCinemas))
    for times in time:
        proj = times.find('div.aithousaspec')
        print(proj)
    for names in listCinemas:
        theater = names.find('h2', class_='placename')
        print(names.find('h2').find(text=True).strip())
        print(names.find('h2').contents[1].text.strip())
Is there any better way to get the mentioned info?

data = '''<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('h2.placename')[0].contents[0].strip())
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
print(soup.select('div.aithousaspec')[0].text.strip())
This will print:
ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00

how to extract text from html using beautifulsoup?

I want to extract some values from HTML like this:
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
This is the section of my code that produces the HTML above:
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
from datetime import datetime, timedelta
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            # batch is the html above
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl"+str(item)+"_lblAwardBasicNumber"
            print("uid id ", uid)
            awardid = batch.find_all("span", text = re.compile("_lblAwardBasicNumber"))
            print("award id is")
            print(awardid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.."+ str(e) )
            # print(container1)
            continue
        except Exception as e:
            raise e
print(batch) is what produces the HTML above. I want to obtain the number SP450017D0007 from this span:
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
but awardid comes back empty. How can I extract SP450017D0007?

Solution:
To get this text SP450017D0007, I used pagesoup.find('a', text=True).text.
Note:
Your code above has the following extra lines, which should be removed:
        except Exception as e:
            raise e
Code:
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
from datetime import datetime, timedelta
import colorama  # needed by the except block below
data = '''
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
'''
pagesoup = BeautifulSoup(data, 'html.parser')
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl" + str(item) + "_lblAwardBasicNumber"
            print("uid id ", uid)
            awardid = pagesoup.find('a', text=True).text
            print("award id is")
            print(awardid)
            dateid = pagesoup.find('span', id='ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate').text
            print("date id is")
            print(dateid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.." + str(e))
            # print(container1)
            continue
Output:
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/>SP450017D0007</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» Delivery Order Package View</span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
uid id ctl00_cph1_grdAwardSearch_ctl2_lblAwardBasicNumber
award id is
SP450017D0007
date id is
04-12-2018
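Two details are worth spelling out. The text= argument in the question's find_all call is matched against a tag's text content, not its attributes, so a pattern built from the id ("_lblAwardBasicNumber") never matches anything. And in the HTML as reproduced here, the award number sits directly inside the span, with no <a> element around it. Looking the span up by the end of its id avoids both issues; with BeautifulSoup that would be something like batch.find("span", id=re.compile("_lblAwardBasicNumber$")) followed by .get_text(strip=True). Below is a minimal sketch of the same idea using only the standard library; the class name SpanTextById and its id_suffix parameter are hypothetical, and the row is a shortened copy of the markup above.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collect the text of every <span> whose id ends with id_suffix."""

    def __init__(self, id_suffix):
        super().__init__()
        self.id_suffix = id_suffix
        self._depth = 0      # > 0 while inside a matching span (counts nesting)
        self._buffer = []    # text pieces of the span being read
        self.values = []     # stripped text of each matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside a match
        elif (dict(attrs).get("id") or "").endswith(self.id_suffix):
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.values.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Shortened copy of the table row from the question.
row = (
    '<tr class="BgSilver"><td>'
    '<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber">'
    '<img alt="PDF Document" src="IconPdf.gif" width="16"/>SP450017D0007</span>'
    '</td><td>'
    '<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate">04-12-2018</span>'
    '</td></tr>'
)
finder = SpanTextById("_lblAwardBasicNumber")
finder.feed(row)
print(finder.values)
```

Matching on the id suffix also removes the need for the inner range(53) loop, because the row counter in the middle of the id (ctl26 here) no longer has to be guessed.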

parsing/escape in Swift

Currently I have an HTML string in Swift (part of it is shown here) from which I want to extract one specific part:
<tr style="color:White;background-color:#32B4FA;border-width:1px;border-style:solid;font-weight:normal;">
<th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;"> </th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:20px;">Park-<br>stätte</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Parkmöglichkeit</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Anzahl Stellplätze</th><th scope="col" style="border-color:Black;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">Freie Stellplätze</th>
</tr>
<tr style="color:#000066;">
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:25px;">
<span id="GridView1__Id_0" title="Kennzeichen" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:25px;">P1</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;">
<img src="Images/Symbol_Tiefgarage.jpg" style="width:20px;" />
</td>
<td align="left" style="border-width:1px;border-style:solid;font-size:Small;">
<a id="GridView1_HyperLink1_0" href="http://www.paderborn.de/microsite/asp/parken_in_der_city/TG_Koenigsplatz.php" target="_top" style="display:inline-block;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;width:150px;">Königsplatz</a>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
<td align="center" style="border-width:1px;border-style:solid;font-size:Smaller;width:40px;">
<span id="GridView1__AnzahlFreiePlaetze_0" title="Freie Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">0</span>
</td>
</tr>
The part that interests me is the "810" (it could be 0-1000 or a text string) from
<td align="center" style="border-width:1px;border-style:solid;font-size:X-Small;width:40px;">
<span id="GridView1__AnzahlPlaetze_0" title="Anzahl Plätze" ReadOnly="true" style="display:inline-block;border-width:0px;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:X-Small;">810</span>
</td>
I tried to use a regex, but that did not work out for me.

I suggest you use an XML/HTML parser that supports CSS selectors to retrieve that string. The span that contains it has the id "GridView1__AnzahlPlaetze_0", so you can use the query "#GridView1__AnzahlPlaetze_0" to retrieve it.
For example, with a Swift library called Fuzi that wraps libxml2
import Fuzi
let doc = try? HTMLDocument(string: htmlString)
if let result = doc?.firstChild(css: "#GridView1__AnzahlPlaetze_0") {
    print(result.stringValue)
}
The above code is tested.

HTML to readable text in Swift

I need to make a reader function in my Swift app. I will receive text (from a JSON request) as HTML like this:
<h5 align="LEFT" class="western" style="font-weight: normal;"> </h5>
<font size="3"><h5 align="LEFT" class="western" style="font-weight: normal;">
Albert Einstein publie en 1905, une nouvelle théorie connue sous le nom de relativité restreinte. </h5>
<h5 align="LEFT" class="western" style="font-weight: normal;"></h5>
<h5 align="LEFT" style="font-weight: normal;"><font color="#339933">►</font>
<font size="3">Postulat 1 :</font>
</h5><h5 align="LEFT" class="western" style="font-weight: normal;">
</h5>
I want to show it as readable text with all the attributes; I don't want to lose the style. For that I can use HTMLReader.
Do you know a way to do it in Swift?

These styles are defined in the HTML itself, so you have to use a web view:
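// objweb is the web view instance, created elsewhere.
objweb.delegate = self  // UIWebView-style delegate; WKWebView uses navigationDelegate instead
let bundle = Bundle.main
let baseUrl = URL(fileURLWithPath: bundle.bundlePath)
// Load the bundled Armory.html (assumed to be UTF-8) and hand its markup to the web view.
if let pathhtml = bundle.path(forResource: "Armory", ofType: "html"),
   let content = try? String(contentsOfFile: pathhtml, encoding: .utf8) {
    objweb.loadHTMLString(content, baseURL: baseUrl)
}
self.view.addSubview(objweb)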
