editing tag.string property with BeautifulSoup (bs4) to string including markup - html

I have an HTML document that I wish to edit so that any word(s) within it can be highlighted/made bold.
I have the HTML in memory and have passed it to BeautifulSoup. I iterate through all tags and take their string elements. If any string contains a matching word, I edit the string and replace it into the HTML with markup wrapping around the desired word.
from flask import Flask, Markup
from bs4 import BeautifulSoup
import logging

logger = logging.getLogger(__name__)

def match(documentText: str, searchQuery: str) -> Markup:
    words = documentText.split(' ')
    if len(words) >= 3:
        words[2] = f'<strong>{words[2]}</strong>'
    logger.info(f'{words=}')
    return Markup(' '.join(words))

for link in html.find_all(True):  # `html` is the parsed BeautifulSoup document
    if link.string:
        link.string = match(link.string, searchQuery)

app = Flask(__name__)

@app.route('/')
def home():
    logger.info('trying markup and testing logging')
    return str(html), 200

app.run(debug=True)
Now, instead of rendering a page with bold words where I would like them, I visually see the HTML tags; if I view source, the tags are actually represented by escaped entities such as &lt;strong&gt;. This would appear to be coming from the line "link.string = match(link.string, searchQuery)" - which I guess could well make sense, in that BeautifulSoup is doing type checking and ensuring that the only thing that goes into the tag.string field is indeed a string, escaping any markup it contains. The ideal end state, I guess, would then be to make a branch off the tag to include a child tag.
Is this a problem anybody else has previously solved? My solution to this whole thing seems chunky and inelegant so I wouldn't mind a better route if anybody has one.

For a quick fix, just replace those HTML escape sequences back with str.replace():
from flask import Flask, Markup
from bs4 import BeautifulSoup
# ...

@app.route('/')
def home():
    logger.info('trying markup and testing logging')
    return str(html).replace("&gt;", ">").replace("&lt;", "<"), 200

app.run(debug=True)
Be careful, since &lt; and &gt; are not the only HTML escape sequences.
Html special characters reference: https://www.html.am/reference/html-special-characters.cfm
Better Approach:
This approach changes all HTML escape sequences back to their unescaped form. Since Python 3.4, the supported way is the module-level html.unescape() function (the old HTMLParser.unescape() method was deprecated and removed in Python 3.9):

from html import unescape  # imported this way to avoid shadowing the `html` soup variable

html_decoded_string = unescape(str(html))
return html_decoded_string, 200

Do note that on Python 2 this lives elsewhere: HTMLParser.HTMLParser().unescape().
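For the "branch off the tag" end state the question describes, BeautifulSoup can build the child tag directly, so nothing ever needs escaping or unescaping. A minimal sketch under assumed inputs - the `highlight` helper and the sample markup are made up for illustration, not taken from the question:

```python
from bs4 import BeautifulSoup

# Instead of assigning raw markup to tag.string (which BeautifulSoup
# escapes), build a real <strong> element and splice it into the tree.
def highlight(tag, word, soup):
    text = tag.string
    if not text or word not in text:
        return
    before, _, after = text.partition(word)
    strong = soup.new_tag('strong')
    strong.string = word
    tag.clear()         # drop the old text node
    tag.append(before)  # plain strings become NavigableStrings
    tag.append(strong)
    tag.append(after)

soup = BeautifulSoup('<p>hello brave new world</p>', 'html.parser')
highlight(soup.p, 'brave', soup)
print(soup)  # <p>hello <strong>brave</strong> new world</p>
```

Because the `<strong>` element is a real child tag, `str(html)` renders it as markup rather than as escaped text.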

Related

How to loop through all items in xpath

I'm new to both XPath and HTML, so I'm probably missing something fundamental here. I have an HTML page from which I want to extract all the items displayed below. (I'm using Scrapy to do my requests; I just need the proper XPath to get the data.)
Here I just want to loop through all these items and get some data from inside each item.
for item in response.xpath("//ul[@class='feedArticleList XSText']/li[@class='item']"):
    yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
The problem is that this .get() only gives me the first item on every loop iteration. If I instead use .getall(), I then get all the items on every iteration (which in my view shouldn't work, since I thought I only selected one item at a time in each iteration). Thanks in advance!
It seems you're missing a . in your XPath expression (to "indicate" that you're working from a context node).
Replace:
yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
with:
yield {'name': item.xpath(".//div[@class='intro lhNormal']").get()}
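The difference is easy to see in isolation. A self-contained illustration with made-up markup (not the asker's page), using lxml since its xpath() behaves the same way as Scrapy's selectors here:

```python
from lxml import etree

# '//div' always searches from the document root, even when called on an
# element; './/div' searches only inside that element.
doc = etree.fromstring(
    "<ul class='feedArticleList'>"
    "<li class='item'><div class='intro'>first</div></li>"
    "<li class='item'><div class='intro'>second</div></li>"
    "</ul>")

for li in doc.xpath("//li[@class='item']"):
    print(li.xpath("//div[@class='intro']/text()"),   # from the root
          li.xpath(".//div[@class='intro']/text()"))  # from this <li>
# ['first', 'second'] ['first']
# ['first', 'second'] ['second']
```

Without the dot, every iteration re-runs the same document-wide query, which is why .get() always returned the first item and .getall() returned all of them.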
You do miss something fundamental. Python by itself does not have an xpath() function.
You'd better use the bs4 or lxml libraries.
See an example with lxml:
from itertools import islice
import lxml.html

doc = lxml.html.parse('http://www.websters-online-dictionary.org')
table = []
trs = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
for tr in islice(trs, 3):
    for td in tr.xpath('td'):
        # relative paths here - an absolute /b/text() would search from the root
        table += td.xpath("b/text() | text()")
buffer = ''
for i in range(len(table)):
    buffer += table[i]
the full explanation is here.

How to extract html links with a matching word

I am trying to make a crawler that reads a text file list of URLs, assigns each one to a variable, and then parses each page to search for URLs containing the word "wp-". Unfortunately, I am getting stuck at the part where I need to scrape the page to see if any URLs contain "wp-". I've tried a number of ways but nothing is working. I've tried various semblances of
//a[contains(@href, 'wp-')]
but it does not work. Any suggestions on how to get the parsing for "wp-" working?
Here is my code so far
#!/usr/bin/python
from urllib.request import urlopen

# import urls into readable python file
f = open("url-list.txt", "r")
text = f.read()
# turn urls in file into a list by splitting it into lines
text_list = text.splitlines()
f.close()
#print(text_list)  # don't need to show the links as a list

# make the list into variables
count = 0
for breakaway in text_list:  # iterate over the list to use each value
    count = count + 1
    print(count, " Sending url-list to scraper...")
    for url in //a[contains(@href, 'wp-')].extract():  # not valid Python - this is the part that fails
        print(url)
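A bare XPath expression is not Python syntax, so the filtering has to go through a parser. A hedged sketch of one way to do it with BeautifulSoup (used elsewhere on this page); `page_html` here is a made-up stand-in for the page source fetched from each URL:

```python
from bs4 import BeautifulSoup

# Select anchors whose href attribute contains 'wp-'.
page_html = ('<a href="/wp-content/theme.css">css</a>'
             '<a href="/about">about</a>'
             '<a href="/wp-admin/">admin</a>')
soup = BeautifulSoup(page_html, 'html.parser')
wp_links = [a['href'] for a in soup.find_all('a', href=True)
            if 'wp-' in a['href']]
print(wp_links)  # ['/wp-content/theme.css', '/wp-admin/']
```

The same filter could be written with lxml as `doc.xpath("//a[contains(@href, 'wp-')]/@href")`; either way, the expression must be passed as a string to a parser's xpath/find method, never written inline as code.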

render Displacy Entity Recognition Visualization into Plotly Dash

I want to render a piece of Entity Recognition Visualization by Spacy into a Plotly Dash app.
The html of ER Visualization for rendering is as follows:
<div class="entities" style="line-height: 2.5">
    <mark class="entities" style="background: ...>
        <span>...</span>
    </mark>
    <mark class="entities" style="background: ...>
        <span>...</span>
    </mark>
</div>
I have tried parsing the HTML using BeautifulSoup and converting the HTML to Dash with the following code. But when I run convert_html_to_dash(html_parsed), it throws KeyError: 'style'.
html_parsed = bs.BeautifulSoup(html, 'html.parser')

def convert_html_to_dash(el, style=None):
    if type(el) == bs.element.NavigableString:
        return str(el)
    else:
        name = el.name
        style = extract_style(el) if style is None else style
        contents = [convert_html_to_dash(x) for x in el.contents]
        # `html` here is Dash's html component module
        return getattr(html, name.title())(contents, style=style)

def extract_style(el):
    return {k.strip(): v.strip() for k, v in
            [x.split(": ") for x in el.attrs["style"].split(";")]}
Not every tag has a style attribute. For tags that don't, you are attempting to access a non-existent key in the attrs dictionary. Python's response is a KeyError.
If you use get() instead, it will return a default value instead of raising a KeyError. You can specify a default value as the second argument to get():
return {k.strip(): v.strip() for k, v in
        [x.split(': ') for x in el.attrs.get('style', '').split(';')]}
Here I have chosen the empty string as the default value.
With only this change, your code still remains somewhat brittle. What if the input does not exactly match what you expect?
For one thing, there might not be a space after the colon. Changing split(': ') to split(':') will make it work even if there is no space – if there is one it will be removed anyway since you are calling strip() after splitting.
And what if after splitting on ';' you receive something other than a key-value pair in the list? It is best to check if it is a valid pair (contains exactly one colon), and skip it otherwise.
Your code becomes:
return {k.strip(): v.strip() for k, v in
        [x.split(':') for x in el.attrs.get('style', '').split(';')
         if x.count(':') == 1]}
Note that I have opted for single-quotation marks. Your code uses both, but it is best to pick one and stick with it.
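The final dictionary comprehension can be exercised on its own. A quick sketch with a deliberately messy style string; the `parse_style` wrapper is just for the demonstration, so it can run without a bs4 tag:

```python
# Standalone check of the hardened comprehension from the answer.
def parse_style(style):
    return {k.strip(): v.strip() for k, v in
            [x.split(':') for x in style.split(';')
             if x.count(':') == 1]}

# Handles a missing space after the colon, a stray empty segment and a
# trailing semicolon without raising.
print(parse_style('background: #ddd; line-height:2.5; ;'))
# {'background': '#ddd', 'line-height': '2.5'}
```

An empty or absent style string simply produces an empty dict, which is exactly what the KeyError case needed.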

How to control quoting on non-numerical entries in a csv file?

I am using Python 3's csv module and am wondering why I cannot control quoting correctly. I am using the option quoting=csv.QUOTE_NONNUMERIC but am still seeing all entries quoted. Any idea as to why that is?
Here's my code. Essentially, I am reading in a csv file and want to remove all duplicate lines that have the same text string:
import sys
import csv

class Row:
    def __init__(self, row):
        self.text, self.a, self.b = row
        self.elements = row

with open(sys.argv[2], 'w', newline='') as output:
    writer = csv.writer(output, delimiter=';', quotechar='"',
                        quoting=csv.QUOTE_NONNUMERIC)
    with open(sys.argv[1]) as input:
        reader = csv.reader(input, delimiter=';')
        header = next(reader)
        Row.labels = header
        assert Row.labels[1] == 'Label1'
        writer.writerow(header)
        texts = set()
        for row in reader:
            row_object = Row(row)
            if row_object.text not in texts:
                writer.writerow(row_object.elements)
                texts.add(row_object.text)
When I look at the generated file, the content looks like this:
"Label1";"Label2";"Label3"
"AAA";"123";"456"
...
But I want this:
"Label1";"Label2";"Label3"
"AAA";123;456
...
OK ... I figured it out myself. The answer, I am afraid, was rather simple - and obvious in retrospect. Since the content of each line is obtained from a csv.reader(), its elements are strings by default. As a result, they get quoted by the subsequently employed csv.writer().
To be treated as an int, they first need to be cast to one:
row_object.elements[1] = int(row_object.a)
This explanation can be proven by inserting a type check before and after this cast:
print('Type: {}'.format(type(row_object.elements[1])))
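The self-answer can be demonstrated without any input files. A minimal sketch writing to an in-memory buffer; the sample values mirror the question's output:

```python
import csv
import io

# QUOTE_NONNUMERIC leaves a field unquoted only when it is an actual
# number; numeric-looking strings coming out of csv.reader are quoted.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=';', quotechar='"',
                    quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(['AAA', '123', '456'])  # strings straight from a reader
writer.writerow(['AAA', 123, 456])      # after casting with int()
print(buf.getvalue())
# "AAA";"123";"456"
# "AAA";123;456
```

Note the symmetric gotcha on the reading side: csv.reader with QUOTE_NONNUMERIC will try to convert every unquoted field to float, so casting explicitly in your own code is usually the safer route.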

Issue in scraping data from a html page using beautiful soup

I am scraping some data from a website, and I am able to do so using the code below:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice

page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()

with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date", "Month", "Day of Week", "OEM", "Device Name", "Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"}, text=True)
    items = soup.findAll('div', {"class": "title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%B"),
                                 time.strftime("%A"),
                                 unicode(oem.string).encode('utf8').strip(),
                                 textcontent,
                                 unicode(price.string).encode('utf8').strip()])
Now, the issue is that 2 of all the price values I am scraping have a different html structure than the rest. My output csv shows a "None" value for those because of this. The normal html structure for a price on the webpage is
<span class="handset">
FREE to £79.99</span>
For those 2 values structure is
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
The output I am getting right now displays None for the second html structure instead of FREE on all tariffs. Also, the price value FREE on all tariffs is wrapped in double quotes in the second structure, while it is outside any quotes in the first structure.
Please help me solve this issue. Pardon my ignorance, as I am new to programming.
Just detect those 2 items with an additional if statement:
if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.
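Why .string comes back None can be seen in isolation: BeautifulSoup only defines .string when a tag has a single child. A small sketch (Python 3 syntax, markup copied from the question):

```python
from bs4 import BeautifulSoup

# .string is None whenever a tag has more than one child, which is
# exactly what happens for the two nested price spans; stripped_strings
# collects every text fragment instead.
simple = BeautifulSoup('<span class="handset">FREE to £79.99</span>',
                       'html.parser').span
nested = BeautifulSoup('<span class="handset">'
                       '<span class="delivery_amber">Up to 7 days delivery</span>'
                       '<br/>"FREE on all tariffs"</span>', 'html.parser').span
print(simple.string)   # FREE to £79.99
print(nested.string)   # None
print(' '.join(nested.stripped_strings).replace('"', ''))
# Up to 7 days delivery FREE on all tariffs
```

This is why the is None branch in the answer switches to stripped_strings: it walks every text node under the tag, regardless of how deeply nested the markup is.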