Extracting tags escaped as &lt; and &gt; from HTML using Python

I have an HTML document in UTF-8 encoding like the one below. I want to extract the OWNER, NVCODE, and CKHEWAT tags from it using Python and bs4. However, the < and > characters are converted to &lt; and &gt;, so I am not able to extract the text from these tags.
Kindly guide me on how to extract the text from these tags.
<?xml version="1.0" encoding="utf-8"?><html><body><string xmlns="http://tempuri.org/"><root><OWNER>अराजी मतरुका वासीदेह </OWNER><NVCODE>00108</NVCODE><CKHEWAT>811</CKHEWAT></root></string></body></html>
My code
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
soup.find('string').text

Check this, from the Beautiful Soup documentation:
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes six possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
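For the question itself, one approach that may work is to parse twice: the first pass decodes the entities inside the string tag, and the second pass treats the decoded text as markup. The sketch below assumes the response escapes the inner XML as in the `raw` string, which is reconstructed for illustration and is not the real response. It uses `html.parser` instead of lxml, and that parser lowercases tag names, so the lookups use `owner` rather than `OWNER`.

```python
from bs4 import BeautifulSoup

# Assumed shape of the response body: the inner XML is entity-escaped.
raw = (
    '<string xmlns="http://tempuri.org/">'
    '&lt;root&gt;&lt;OWNER&gt;अराजी मतरुका वासीदेह &lt;/OWNER&gt;'
    '&lt;NVCODE&gt;00108&lt;/NVCODE&gt;'
    '&lt;CKHEWAT&gt;811&lt;/CKHEWAT&gt;&lt;/root&gt;'
    '</string>'
)

# First pass: the parser turns &lt; and &gt; back into < and >.
outer = BeautifulSoup(raw, "html.parser")
inner_markup = outer.find("string").get_text()

# Second pass: parse the decoded text as markup in its own right.
inner = BeautifulSoup(inner_markup, "html.parser")
fields = {
    name: inner.find(name).get_text(strip=True)
    for name in ("owner", "nvcode", "ckhewat")
}
print(fields)
```

With the real response, `raw` would be `response.text`.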

Related

How to parse HTML tbody data into Python in tabular format

I am new to Python and I am trying to parse this data into a tabular format. I have looked at examples but have been unable to get the desired result.
Can someone please help me with this?
<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>
I need the data in this form:
Kupon in % 36,520
Erstes Kupondatum 03.07.2017
Letztes Kupondatum 03.04.2022
You can do that in two ways: (1) using a list comprehension, or (2) using a for loop.
Both produce the same result, so it is up to you which to choose.
from bs4 import BeautifulSoup
html = """<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>"""
# 1: list comprehension
soup = BeautifulSoup(html, 'lxml')
print(' '.join([td.text for td in soup.find_all('td')]))
# 2: for loop
tags = []
cells = soup.find_all('td')
for td in cells:
    tags.append(td.text)
print(' '.join(tags))
Output: Kupon in % 36,520 Erstes Kupondatum 03.07.2017 Letztes Kupondatum 03.04.2022 Zahlweise Kupon Zinszahlung normal Spezialkupon Typ Zinssatz variabel
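The answer above joins every cell into a single line, while the question asks for one label/value pair per line. A minimal sketch of a row-by-row variant follows, assuming each `tr` holds exactly two `td` cells as in the sample. It uses `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

html = """<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
</tbody>"""

soup = BeautifulSoup(html, "html.parser")

# Collect one [label, value] list per table row.
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(cells)

# Print one row per line.
for label, value in rows:
    print(label, value)
```

The `rows` list can also be handed to other tools, for example the `csv` module, if a file is needed rather than printed text.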

How to extract value from html via BeautifulSoup

I have parsed my string via BeautifulSoup.
from bs4 import BeautifulSoup
import requests
import re
def otoMoto(link):
    URL = link
    page = requests.get(URL).content
    bs = BeautifulSoup(page, 'html.parser')
    for offer in bs.find_all('div', class_= "offer-item__content ds-details-container"):
        # print(offer)
        # print("znacznik")
        linkOtoMoto = offer.find('a', class_="offer-title__link").get('href')
        # title = offer.find("a")
        titleOtoMoto = offer.find('a', class_="offer-title__link").get('title')
        rokProdukcji = offer.find('li', class_="ds-param").get_text().strip()
        rokPrzebPojemPali = offer.find_all('li',class_="ds-param")
        print(linkOtoMoto+" "+titleOtoMoto+" "+rokProdukcji)
        print(rokPrzebPojemPali)
        break
URL = "https://www.otomoto.pl/osobowe/bmw/seria-3/od-2016/?search%5Bfilter_float_price%3Afrom%5D=50000&search%5Bfilter_float_price%3Ato%5D=65000&search%5Bfilter_float_year%3Ato%5D=2016&search%5Bfilter_float_mileage%3Ato%5D=100000&search%5Bfilter_enum_financial_option%5D=1&search%5Border%5D=filter_float_price%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
otoMoto(URL)
Result:
https://www.otomoto.pl/oferta/bmw-seria-3-x-drive-nowe-opony-ID6Dr4JE.html#d51bf88c70 BMW Seria 3 2016
[<li class="ds-param" data-code="year">
<span>2016 </span>
</li>, <li class="ds-param" data-code="mileage">
<span>50 000 km</span>
</li>, <li class="ds-param" data-code="engine_capacity">
<span>1 998 cm3</span>
</li>, <li class="ds-param" data-code="fuel_type">
<span>Benzyna</span>
</li>]
So I can extract single strings, but since every item shares the same class
class="ds-param"
I can't assign, for example, the production year to its own variable. Please let me know if you have any ideas :).
Have a nice day!
From the docs:
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
So, inside your loop, you could do something like:
offer.find_all(attrs={"data-code": "year"})[0].get_text()
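To get every parameter into its own variable, one option is to key the values by their `data-code` attribute. The sketch below works on a trimmed copy of the markup printed in the question. The `<ul>` wrapper and the `snippet` and `params` names are assumptions for illustration; in the real loop, `offer` would be the `div` found by `find_all`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the <li> elements shown in the question's output.
snippet = """
<ul>
<li class="ds-param" data-code="year"><span>2016 </span></li>
<li class="ds-param" data-code="mileage"><span>50 000 km</span></li>
<li class="ds-param" data-code="engine_capacity"><span>1 998 cm3</span></li>
<li class="ds-param" data-code="fuel_type"><span>Benzyna</span></li>
</ul>
"""
offer = BeautifulSoup(snippet, "html.parser")

# Map each data-code value to the stripped text of its <li>.
params = {
    li["data-code"]: li.get_text(strip=True)
    for li in offer.find_all("li", class_="ds-param")
}
year = params["year"]
print(year, params["mileage"])
```

A dictionary avoids relying on the order of the `li` elements, which may differ between offers.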

How to parse all the text content from the HTML using Beautiful Soup

I want to extract the content of an email message. The body is HTML, and I used BeautifulSoup to fetch the From, To, and Subject fields. When fetching the body content, it fetches only the first line and leaves out the remaining lines and paragraphs.
I am missing something here. How do I read all the lines/paragraphs?
CODE:
email_message = mail.getEmail(unreadId)
print (email_message['From'])
print (email_message['Subject'])
if email_message.is_multipart():
    for payload in email_message.get_payload():
        bodytext = email_message.get_payload()[0].get_payload()
        if type(bodytext) is list:
            bodytext = ','.join(str(v) for v in bodytext)
else:
    bodytext = email_message.get_payload()[0].get_payload()
    if type(bodytext) is list:
        bodytext = ','.join(str(v) for v in bodytext)
print (bodytext)
parsedContent = BeautifulSoup(bodytext)
body = parsedContent.findAll('p').getText()
print body
Console:
body = parsedContent.findAll('p').getText()
AttributeError: 'list' object has no attribute 'getText'
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the remaining lines.
Added
After getting all the lines from the HTML tags, I get an = symbol at the end of each line, and &nbsp; and &lt; are also displayed. How can I overcome those?
Extracted text:
Dear first,All of us at GenWatt are glad to have xyz as a
customer. I would like to introduce myself as your Account
Manager. Should you = have any questions, please feel free to
call me at or email me at ash= wis#xyz.com. You
can also contact GenWatt on the following numbers: Main:
810-543-1100Sales: 810-545-1222Customer Service & Support:
810-542-1233Fax: 810-545-1001I am confident GenWatt will serve you
well and hope to see our relationship=
Let's inspect the result of parsedContent.findAll('p'):
python -i test.py
----------
import requests
from bs4 import BeautifulSoup
bodytext = requests.get("https://en.wikipedia.org/wiki/Earth").text
parsedContent = BeautifulSoup(bodytext, 'html.parser')
paragraphs = parsedContent.findAll('p')
----------
>> type(paragraphs)
<class 'bs4.element.ResultSet'>
>> issubclass(type(paragraphs), list)
True # It's a list
As you can see, it's a list of all paragraphs. To access their content you need to iterate over the list or access an element by index, as with a normal list.
>> # You can print all content with a for-loop
>> for p in paragraphs:
>>     print(p.getText())
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
...
>> # Or you can join all content
>> content = []
>> for p in paragraphs:
>>     content.append(p.getText())
>>
>> all_content = "\n".join(content)
>>
>> print(all_content)
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
Using a list comprehension, your code will look like:
parsedContent = BeautifulSoup(bodytext)
body = '\n'.join([p.getText() for p in parsedContent.findAll('p')])
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the
remaining lines.
Calling parsedContent.find('p') is exactly the same as calling parsedContent.findAll('p')[0]:
>> parsedContent.findAll('p')[0].getText() == parsedContent.find('p').getText()
True
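As for the added question: a trailing = at the end of each line is typical of a body that is still quoted-printable encoded, so decoding the transfer encoding before parsing may remove it. The sketch below assumes that is the cause. `raw_message` is a made-up minimal message, not the asker's email. Entities such as &nbsp; and &amp; are decoded by BeautifulSoup once the HTML is parsed.

```python
import email
from bs4 import BeautifulSoup

# Hypothetical minimal message; the real one comes from the mailbox.
raw_message = (
    "From: sender@example.com\n"
    "Subject: Welcome\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Should you =\n"
    "have any questions, call&nbsp;me.</p>\n"
    "<p>Customer Service &amp; Support</p>\n"
)

message = email.message_from_string(raw_message)

# decode=True undoes quoted-printable (or base64) and returns bytes.
payload = message.get_payload(decode=True)
charset = message.get_content_charset() or "utf-8"
html_body = payload.decode(charset, errors="replace")

# Parse the decoded HTML and collect the text of every paragraph.
soup = BeautifulSoup(html_body, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print("\n".join(paragraphs))
```

For a multipart message, the same `get_payload(decode=True)` call would be applied to the HTML part, for example while iterating over `message.walk()`.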

Json Files parsing

So I am trying to open some JSON files to look for a publication year and sort them accordingly. But before doing this, I decided to experiment on a single file. I am having trouble, though: although I can get the files and the strings, when I try to print one word, it starts printing characters instead.
For example:
print data2[1] #prints
THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine. #results
but now
print data2[1][0] #should print THE
T #prints T
This is my code right now:
json_data =open(path)
data = json.load(json_data)
i=0
data2 = []
for x in range(0,len(data)):
    data2.append(data[x]['section'])
    if len(data[x]['content']) > 0:
        for i in range(0,len(data[x]['content'])):
            data2.append(data[x]['content'][i])
I probably need to look at your json file to be absolutely sure, but it seems to me that the data2 list is a list of strings. Thus, data2[1] is a string. When you do data2[1][0], the expected result is what you are getting - the character at the 0th index in the string.
>>> data2[1]
'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
>>> data2[1][0]
'T'
To get the first word, naively, you can split the string by spaces
>>> data2[1].split()
['THE', 'BRIDES', 'ORNAMENTS,', 'Viz.', 'Fiue', 'MEDITATIONS,', 'Morall', 'and', 'Diuine.']
>>> data2[1].split()[0]
'THE'
However, this will cause issues with punctuation, so you probably need to tokenize the text. This link should help - http://www.nltk.org/_modules/nltk/tokenize.html
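If a full tokenizer is more than you need, a regular expression may be enough to drop the punctuation. This is a rough sketch that is not equivalent to NLTK's tokenizers: it treats apostrophes and hyphens as separators, which may or may not suit your texts.

```python
import re

title = "THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine."

# \w+ matches runs of letters, digits and underscores, so punctuation is skipped.
words = re.findall(r"\w+", title)
print(words[0])
print(words)
```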

perl HTML::HTMLDoc how to include a base64 img

I am trying to include either a base64 image or a src="getImage.pl?image.jpg" when creating a PDF with HTML::HTMLDoc. No Luck.
Does anybody have experience with this module and have some wisdom to share?
Thank You,
~D
+-------------------------------------------------+
#!/usr/bin/perl
use HTML::HTMLDoc;
$html = new HTML::HTMLDoc('mode'=>'file', 'tmpdir'=>'/tmp'); # Start instance
$html->set_page_size('letter'); # set page size
$html->set_bodyfont('Arial'); # set font
$html->set_fontsize(8.0); # set fontsize
$html->set_permissions('no-copy');
$html->set_permissions('no-modify');
$html->set_permissions('no-annotate');
$html->set_html_content(
qq{
<html><body>Hello World...
<br />
<img src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIAAD/4QAWRXhpZgAATU0AKgAAAAgAAAAAAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFhwXExQaFRERGCEYGh0dHx8fExciJCIeJBweHx7/2wBDAQUFBQcGBw4ICA4eFBEUHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh7/wAARCABeAF4DASIAAhEBAxEB/8QAHAAAAQUBAQEAAAAAAAAAAAAABQIDBAYHCAEA/8QAQRAAAQMDAwEFAwkFBgcAAAAAAQIDBAAFEQYSITEHE0FRYRQycRUiQlKBkaHB0QgjYnOxFhckM1NyRGOSk6Lh8f/EABkBAQEAAwEAAAAAAAAAAAAAAAABAgMEBf/EACYRAAIBAwIGAgMAAAAAAAAAAAABAgMEEQUhEiIxQVFhE3EjkfD/2gAMAwEAAhEDEQA/ANkQeTnkmnkbcYOKjklIJ5GB5UPVdUISjeptKlJCsbFeNUBoAAcYNRZrqWWVOKSSB4DjNDvllH+ojH8tVMybo082W1vJCTjOGznGelQDz9xQ1IbilAU86rahKFhRzUzaoKOcZHWqolxPy41cEOd2GlqKQUE8H0oubq0rq6jP8oj86AJqPBwBkDJqA9c2mGm1OgEu7dqUrBUc+lMquTZSR36Rkf6R/WhE4svzIy2llKI6kEFSeu3zAoC3HcAEEAHxpK0lNDDd2iSe+Rz/AMo/rSvldrHLif8AtH9aAIAAnPWloOzlAI8Dg0LZuyXblFiI2qU+oj3CMAD40WxnxP2VQOBQKeckHqKFu6ftDii4th8qPlJWPzqeFccjilAj6oAqMAZVksaVFJYkEjrh9w/0qPLsVoLC/Z48sOnAThx08/bXt5fUw3IWhSk4dJ4Vj6IoPcJFxhafXf5kyDGghoukuSV7sZwBgJPJPAHmRQHzsRCr1GhuF/2ZKyl0oJ3YA8wKXNl6CgOFqdfY0VwfQeuexX3E8Uc7NNAzNUMJu2rt7Ec4U3akOKQlIPI79SSFLVjkoyEjxzzWm/Jdm02y2zHstrYjnhKo8RKQD68fjQGN25OkLolXybcBNIH/AA84un7k5qLOhFh2JE3SCla0d9gkqAPUZ61sU6wad1Kyr2jTMOWkHAeQyG1pP8LicKB9QayXXNpnaOmpfamKkWUuJbddnLIegFRwnvFAZW3kgbzyMjORyADBsNg3cJnY/nOc0trT2nnXC2kzC4BuKPaVggefNDJMHUEdtLi3YjjalAYakqUrnyykVI0SpS9Q3LvVKURFSOTnHzk0AVg2K1QJjcxlqQXm87Cp9SgPvogonI6Hilu9OOnwprOOgAqoCEcJ5NLBCsAjNMBRCSocjypkzQEpXwgKTuSStAyD8TUYA+oUIWh9pStoU8Qef4RWPax1c3ce1XTWiO/7y2Wc+1yUkABx9KStCTjqE5B+JrYr9JsyAlcm3sS5LmdiS7yT5nargeuK5K1xZNQWHXD2so7PtcJ6Sp7fHBIbSeCgjkgAcZ58KA7s0HfGpen2nY7uTk78fWz/AO6oHbPF7VLtqKK7pGREVZ092p07nC6ypCiVDaOFpV0PBIpz9mfQt8n2ZrVF3uMqFbpiN0aEjhTqSOHFE+6PIda0W8RbxpNavYJUyVbnVbhlW5bKuv8A0muHUbmtbUXVpQ48dUuuPSysm2jCM5cMngh9mdx1ND0oxF1U9HNyStZ2Nc7WicoBO1OTjqQkVX+2C722TH9glrbcVKZWy8g85bIxz95q62KzMX4yHdQSSu5OtDDDbxS5GbPCTgHgnrzXNH7Sulb/ANm7nyozKlXe2TVd3GfVlTiHD0bX+R4zXRbznOmpVFhvt4++u/kwmoqWIvI72A66Vd9LyLFcpCpEuwPezoVgZcZGUtkknnABH2CtF0Qof2guZKg
B7Mnkq/jFc8/s9abulhvzl5v0NtTM5ko9jeJC1n3txwRtPXAPXJrpiyyLI0wZFrhsxw8MLIdAPB6HKjW4xDSvUZHpTK8/VqN8psmW1H3fPeVhAStKv6GpKyrgZqoDB90/7T6VVLwXAuI20oILvdt7sZxkeXjVs2goUPTrQGTZZ78yKtb8FDLK0HgrKiE+m3FRgxft5ut60HLjOvTGZRmNuojqQ2UkbFDAUCSPpA8YzzUXSd4VJ0vbnJCwtbsZJcJHCiRzxWmdu3Z292i2uBFiy2ozsOZ3u9YJBbUMKAx49KwdUhqzOvWllzLUN1bKP9qVkD8KA6K7I9e6jtlnSy28JkJl0obZd52JBwAk9QPtq2XLt3tNrvkZnUkd22QHFNshZaLgU4o8rUrohCcDn1OelZz2IsWmdpBt+TPfiuqeXlXcFaPe9Oav8zSWmrvb3IUzUVsdYcSUqQ80oZH211v4ZwSzhm/8co+yr9oms9RaJ1w3q3TEeROsD5U/cYIDau/QsDDza+ckADHOMcUC7TdVztQTbZcje03ayyWlvwHW0bW14VjlPGFp5BB5BohbNKq0RMctCNRW+/6QLankwnGnnnYRKujBRyAVL5Tgj4VnPavY7Lo6fFNhDohXDe+FF3e24o7cqTg7R64A5Brnyl0/vaMJSilygHtI1XJskO3zYasqErCxj3k7VZ/+1qPZJGm37Qzepl3MtNSVOrQ0GNxVtVtzknj3T0H51iirONbrNqEpDDjbK3mnFcgKGOoHhjNdMdmNifsnZrabHIwhTUfZkJxwTncfiSawNZ5AYcjaqtjbikqWlStxSMAnbVsUScYz8DQiNaHhembnJnNuBokhCGCCcjHXJouBjr19RVQPDkq2hJ4pLyu6bW6rcQkFRAPJp07gcZPNebDUYBbtzjd13negpIzsbOSfiojA+6qt/dxp3Wc911VgjMk8qdYX3GfUkH5x9asEnT1uQtbzntD6MlQj79iM+uOSPTNIteohbF91Itcju08JLXOB8DXia5dahbUozsqXHvuu+PWTts4UJt/M8eAtpXs4TpWzC3W5bbjKFKUEuSgtWSc4zgV84Gkl1tY2utHCkKxkfrTczXtlajn/AAtw3fVLeM/jVKvWspLjy3mLSpCSCEqdXgkH0xzXk6Tqus3dzGNW2cId3LC/WyN9ajaRptqXN2wXlDLEh5uMkBx50cJyBgepPQfpVP1R2CN3a6sXSNcoTbbRJXbX5ThjqBznG33DznI4yOc0NhdoSosxBeszjq1YSUNqys/AYq9Rtf20x0d5abi27j/LKUkp+OFYrHU9U1y2uZRpW3HDs4tdPecvJaVKzlTjzb98mbztIzOzu8reRZo9xs6nA+tplQLqTxkg4ytPHuEjzGCcHRtP6gtuoYYm2+S24PpI+k2R9Ep6gjyIFD7hdzfSqMLY6iO7lK95GSD4cUxZ9DWi1y25kIPR3kud6dr6xk4xg88j06V7ei19Qr0XK+pqDzss74942OW8jQjJKi8+fstJUn1HrXynE+FI65zyK8AHlivZRyEjJxzj7KZfUvuld2RvAOM+dPqTsO0+FIP4+FAAU3ZBbSpQklWOQXEgA+P0aaeusYJy4hsDzceV+RohJs9oedU67C3KUcq/erAJ+ANOxrbamCFM2uElQ8VMhZ/8s0wAC3OMte23wC+vplhn+qj0++nG9NSpSt9xdbiIPVDP7x37Ve6PxqzLUrATuOB0A4A+A6U2QMZxz51MAgQbRa4DSkwYaGs+8tXznF/FR/LinUxGc7g0geQxUtKd5606EYGcA1QNIQEY2ox8BT6QfMY8hSynKQelJ6HgCmANqGVeH3V4EDzpxXHBp1tvvOBxgUB//9k=" border="0" alt="Hello Image">
</body></html>});
$html->title();
$html->set_header('.', 't', '.');
$html->set_footer('D', '.', '/');
$pdf = $html->generate_pdf(); # generate document
$http_headers_out{'Content-Type'} = 'application/pdf';
print $pdf->to_string();
It looks like HTML::HTMLDoc will NOT handle img src from base64 data NOR a cgi script.
There was a great response to this question here:
http://www.perlmonks.org/?node_id=1081554