How to extract a value from HTML via BeautifulSoup

I have parsed my string via BeautifulSoup.
from bs4 import BeautifulSoup
import requests
import re

def otoMoto(link):
    URL = link
    page = requests.get(URL).content
    bs = BeautifulSoup(page, 'html.parser')
    for offer in bs.find_all('div', class_="offer-item__content ds-details-container"):
        # print(offer)
        # print("marker")
        linkOtoMoto = offer.find('a', class_="offer-title__link").get('href')
        # title = offer.find("a")
        titleOtoMoto = offer.find('a', class_="offer-title__link").get('title')
        rokProdukcji = offer.find('li', class_="ds-param").get_text().strip()
        rokPrzebPojemPali = offer.find_all('li', class_="ds-param")
        print(linkOtoMoto + " " + titleOtoMoto + " " + rokProdukcji)
        print(rokPrzebPojemPali)
        break
URL = "https://www.otomoto.pl/osobowe/bmw/seria-3/od-2016/?search%5Bfilter_float_price%3Afrom%5D=50000&search%5Bfilter_float_price%3Ato%5D=65000&search%5Bfilter_float_year%3Ato%5D=2016&search%5Bfilter_float_mileage%3Ato%5D=100000&search%5Bfilter_enum_financial_option%5D=1&search%5Border%5D=filter_float_price%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
otoMoto(URL)
Result:
https://www.otomoto.pl/oferta/bmw-seria-3-x-drive-nowe-opony-ID6Dr4JE.html#d51bf88c70 BMW Seria 3 2016
[<li class="ds-param" data-code="year">
<span>2016 </span>
</li>, <li class="ds-param" data-code="mileage">
<span>50 000 km</span>
</li>, <li class="ds-param" data-code="engine_capacity">
<span>1 998 cm3</span>
</li>, <li class="ds-param" data-code="fuel_type">
<span>Benzyna</span>
</li>]
So I can extract single strings, but when several elements share the same class (class="ds-param") I can't assign, for example, the production date to its own variable. Please let me know if you have any ideas :).
Have a nice day!

from the docs:
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
So you could do something like:
data_soup.find_all(attrs={"data-code": "year"})[0].get_text()
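Applied to the markup from your question, a minimal sketch (using the li elements shown in your output) could pull each value into its own variable via its data-code attribute:

from bs4 import BeautifulSoup

html = '''<li class="ds-param" data-code="year"><span>2016 </span></li>
<li class="ds-param" data-code="mileage"><span>50 000 km</span></li>
<li class="ds-param" data-code="engine_capacity"><span>1 998 cm3</span></li>
<li class="ds-param" data-code="fuel_type"><span>Benzyna</span></li>'''

soup = BeautifulSoup(html, 'html.parser')
# The class is shared, but each parameter carries a distinguishing
# data-code attribute, so target that via the attrs dictionary.
year = soup.find('li', attrs={"data-code": "year"}).get_text().strip()
mileage = soup.find('li', attrs={"data-code": "mileage"}).get_text().strip()
print(year, mileage)  # 2016 50 000 km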

Related

How to parse HTML tbody data into Python in tabular format

I am new to Python and I am trying to parse this data into tabular format. I have looked at examples but am unable to get the desired result.
Can someone please help me with this?
<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>
Need this data in this way :
Kupon in % 36,520
Erstes Kupondatum 03.07.2017
Letztes Kupondatum 03.04.2022
You can do that in two ways: 1. using a list comprehension, or 2. using a for loop. Both produce the same result; which one to choose is up to you.
from bs4 import BeautifulSoup
html = """<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>"""
#1
soup = BeautifulSoup(html,'lxml')
print(' '.join([td.text for td in soup.find_all('td')]))
# 2
tags = []
tr = soup.find_all('td')
for td in tr:
    tags.append(td.text)
print(' '.join(tags))
Output: Kupon in % 36,520 Erstes Kupondatum 03.07.2017 Letztes Kupondatum 03.04.2022 Zahlweise Kupon Zinszahlung normal Spezialkupon Typ Zinssatz variabel
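If you want one pair per line, as in the desired output, you can iterate over the tr elements and join the cells of each row instead; a minimal sketch reusing the same html string:

soup = BeautifulSoup(html, 'lxml')
# One output line per table row: join the two td cells of each tr
for tr in soup.find_all('tr'):
    print(' '.join(td.text for td in tr.find_all('td')))
# Kupon in % 36,520
# Erstes Kupondatum 03.07.2017
# Letztes Kupondatum 03.04.2022
# ...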

Webscrape using BeautifulSoup to Dataframe

This is the HTML code:
<div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li><strong>American Samoa Department of Health Travel Advisory</strong></li><li>March 2, 2020—Governor Moliga <a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a> a government taskforce to provide a plan for preparation and response to the covid-19 coronavirus. </li></ul>
<ul><li>March 25, 2020 – The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
<ul>
<li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
</ul>
</li></ul>
</div></details></div>
I want to extract State, Date and Text and add them to a dataframe with these three columns:
State: American Samoa
Date: 2020-03-25
Text: The Governor Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health
My code so far:
soup = bs4.BeautifulSoup(data)
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
for tag1 in soup.find_all("li"):
    #print(type(tag1))
    ln = tag1.text
    dt = (ln.split(' – ')[0])
    dt = (dt.split('—')[0])
    #txt = ln.split(' – ')[1]
    print(dt)
Need help with:
How to get the text up to a "." only; I don't need the entire text.
How to add a new row to the dataframe on each pass through the loop (I have only attached a part of the source code of the webpage).
Appreciate your help!
As a start I have added the code below. Unfortunately the web page is not uniform in its use of HTML lists: some ul elements contain nested uls, others don't. This code is not perfect but a starting point; for example American Samoa has an absolute mess of nested ul elements, so it only appears once in the df.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User-Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'lxml')
rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    ul = detail.find('ul')
    for li in ul.find_all('li', recursive=False):
        # Three types of hyphen are used on this webpage
        split = re.split('(?:-|–|—)', li.text, 1)
        if len(split) == 2:
            rows_list.append([state.text, split[0], split[1]])
        else:
            print("Error", li.text)
df = pd.DataFrame(rows_list)
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', -1):
    print(df)
It creates and prints a data frame with 547 rows and prints some error messages for text it cannot split. You will have to work out exactly which data you need and how to tweak the code to suit your purpose.
You can use 'html.parser' if you don't have 'lxml' installed.
UPDATED
Another approach is to use regex to match any string beginning with a date:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User-Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'html.parser')
rows_list = []
# Matches a leading date such as "March 25, 2020" (compiled once, outside the loop)
p = re.compile(r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s*(\d{1,2}),*\s*(\d{4}))', re.IGNORECASE)
for detail in soup.find_all("details"):
    state = detail.find('summary')
    for li in detail.find_all('li'):
        m = p.match(li.text)
        if m:
            rows_list.append([state.text, m.group(0), m.string.replace(m.group(0), '')])
        else:
            print("Error", li.text)
df = pd.DataFrame(rows_list)
df.to_csv('out.csv')
This gives far more records: 4,785. Again, it is a starting point and some data still gets missed, but far less. It writes the data to a CSV file, out.csv.
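Since the goal is a State/Date/Text layout, you can also label the columns when building the frame, e.g.:

# Optional: name the columns to match the desired State/Date/Text layout
df = pd.DataFrame(rows_list, columns=['State', 'Date', 'Text'])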

Extracting &lt; and &gt; from HTML using Python

I have HTML in UTF-8 encoding like below. I want to extract the OWNER, NVCODE and CKHEWAT tags from it using Python and bs4, but < and > are converted to &lt; and &gt;, so I am not able to extract the text from the OWNER, NVCODE and CKHEWAT tags.
Kindly guide me on extracting text from these tags.
<?xml version="1.0" encoding="utf-8"?><html><body><string xmlns="http://tempuri.org/"><root><OWNER>अराजी मतरुका वासीदेह </OWNER><NVCODE>00108</NVCODE><CKHEWAT>811</CKHEWAT></root></string></body></html>
My code
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
soup.find('string').text
Check this
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;", "&lt;", and "&gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes six possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:
french = "<p>Il a dit &lt;&lt;Sacré bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
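Applied to the document in the question: if the payload inside the string element comes back escaped, .text unescapes the &lt;/&gt; entities, and the resulting markup can be parsed a second time. A minimal sketch, assuming url is the endpoint from your code:

from bs4 import BeautifulSoup
import requests

response = requests.get(url)  # url is assumed from your code
soup = BeautifulSoup(response.text, 'lxml')
# .text unescapes the payload inside <string>; parse that inner markup again
inner = BeautifulSoup(soup.find('string').text, 'lxml')
owner = inner.find('owner').get_text(strip=True)      # lxml lowercases tag names
nvcode = inner.find('nvcode').get_text(strip=True)
ckhewat = inner.find('ckhewat').get_text(strip=True)
print(owner, nvcode, ckhewat)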

parsing nested structures in R

I have a JSON-like string that represents a nested structure. It is not real JSON in that the names and values are not quoted. I want to parse it into a nested structure, e.g. a list of lists.
#example:
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
and the result should be like this:
x_list = list(a=1,b=2,c=c(1,2,3),d=list(e="something"))
Is there any convenient function that I don't know of that does this kind of parsing?
Thanks.
If all of your data is consistent, there is a simple solution involving regex and jsonlite package. The code is:
if(!require(jsonlite, quiet=TRUE)){
  # if the library is not installed: install it and load it into the R session for use.
  install.packages("jsonlite", repos="https://ftp.heanet.ie/mirrors/cran.r-project.org")
  library(jsonlite)
}
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
json_x_string = "{\"a\":1, \"b\":2, \"c\":[1,2,3], \"d\":{\"e\":\"something\"}}"
fromJSON(json_x_string)
s <- gsub( "([A-Za-z]+)", "\"\\1\"", gsub( "([A-Za-z]*)=", "\\1:", x_string ) )
fromJSON( s )
The first section checks whether the package is installed. If it is, it loads it; otherwise it installs it and then loads it. I usually include this in any R code I'm writing to make it simpler to transfer between PCs/people.
Your string is x_string, we want it to look like json_x_string which gives the desired output when we call fromJSON().
The regex is split into two parts because it's been a while - I'm pretty sure this could be made more elegant. Then again, this depends on whether your data is consistent, so I'll leave it like this for now. First it changes "=" to ":", then it adds quotation marks around all groups of letters. Calling fromJSON(s) gives the output:
fromJSON(s)
$a
[1] 1
$b
[1] 2
$c
[1] 1 2 3
$d
$d$e
[1] "something"
I would rather avoid JSON parsing here, for its lack of extensibility and flexibility, and stick to a solution of regex + recursion.
Here is extendable base code that parses your input string as desired.
The main recursion function:
# Parse string
parse.string = function(.string){
  regex = "^((.*)=)??\\{(.*)\\}"
  # Recursion termination: element parsing
  if(iselement(.string)){
    return(parse.element(.string))
  }
  # Extract components
  elements.str = gsub(regex, "\\3", .string)
  elements.vector = get.subelements(elements.str)
  # Recursively parse each element
  parsed.elements = list(sapply(elements.vector, parse.string, USE.NAMES = F))
  # Extract list's name and return
  name = gsub(regex, "\\2", .string)
  names(parsed.elements) = name
  return(parsed.elements)
}
Helping functions:
library(stringr)
# Test if the string is a base element
# Test if the string is a base element
iselement = function(.string){
  grepl("^[^[:punct:]]+=[^\\{\\}]+$", .string)
}

# Parse element
parse.element = function(element.string){
  splits = strsplit(element.string, "=")[[1]]
  element = splits[2]
  # Parse numeric elements
  if(!is.na(as.numeric(element))){
    element = as.numeric(element)
  }
  # TODO: Extend here to include vectors
  # Reformat and return
  element = list(element)
  names(element) = splits[1]
  return(element)
}

# Get subelements from a string
get.subelements = function(.string){
  # Regex of allowed elements - Extend here to include more types
  elements.regex = c("[^, ]+?=\\{.+?\\}",  # Sublist
                     "[^, ]+?=\\[.+?\\]",  # Vector
                     "[^, ]+?=[^=,]+")     # Base element
  str_extract_all(.string, pattern = paste(elements.regex, collapse = "|"))[[1]]
}
Parsing results:
string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
string_2 = "{a=1, b=2, c=[1,2,3], d=somthing}"
named_string = "xyz={a=1, b=2, c=[1,2,3], d={e=something, f=22}}"
named_string_2 = "xyz={d={e=something, f=22}}"
parse.string(string)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 2
#
# [[1]]$c
# [1] "[1,2,3]"
#
# [[1]]$d
# [[1]]$d$e
# [1] "something"

How to parse all the text content from the HTML using Beautiful Soup

I wanted to extract an email message's content. It is HTML content; I used BeautifulSoup to fetch the From, To and Subject. When fetching the body content, it fetches the first line alone and leaves out the remaining lines and paragraphs.
I am missing something here; how do I read all the lines/paragraphs?
CODE:
email_message = mail.getEmail(unreadId)
print(email_message['From'])
print(email_message['Subject'])
if email_message.is_multipart():
    for payload in email_message.get_payload():
        bodytext = email_message.get_payload()[0].get_payload()
        if type(bodytext) is list:
            bodytext = ','.join(str(v) for v in bodytext)
else:
    bodytext = email_message.get_payload()[0].get_payload()
    if type(bodytext) is list:
        bodytext = ','.join(str(v) for v in bodytext)
print(bodytext)

parsedContent = BeautifulSoup(bodytext)
body = parsedContent.findAll('p').getText()
print(body)
Console:
body = parsedContent.findAll('p').getText()
AttributeError: 'list' object has no attribute 'getText'
When I use
body = parsedContent.find('p').getText()
it fetches only the first paragraph and does not print the remaining lines.
Added
After getting all the lines from the HTML tags, I get an = symbol at the end of each line, and &nbsp;, &lt; are also displayed. How do I overcome those?
Extracted text:
Dear first,All of us at GenWatt are glad to have xyz as a
customer. I would like to introduce myself as your Account
Manager. Should you = have any questions, please feel free to
call me at or email me at ash= wis#xyz.com. You
can also contact GenWatt on the following numbers: Main:
810-543-1100Sales: 810-545-1222Customer Service & Support:
810-542-1233Fax: 810-545-1001I am confident GenWatt will serve you
well and hope to see our relationship=
Let's inspect the result of parsedContent.findAll('p')
python -i test.py
----------
import requests
from bs4 import BeautifulSoup
bodytext = requests.get("https://en.wikipedia.org/wiki/Earth").text
parsedContent = BeautifulSoup(bodytext, 'html.parser')
paragraphs = parsedContent.findAll('p')
----------
>> type(paragraphs)
<class 'bs4.element.ResultSet'>
>> issubclass(type(paragraphs), list)
True # It's a list
Can you see? It's a list of all the paragraphs. If you want to access their content you will need to iterate over the list or access an element by index, like a normal list.
>> # You can print all content with a for-loop
>> for p in paragraphs:
>>     print(p.getText())
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
...
>> # Or you can join all content
>> content = []
>> for p in paragraphs:
>>     content.append(p.getText())
>>
>> all_content = "\n".join(content)
>>
>> print(all_content)
Earth (otherwise known as the world (...) According to radiometric dating and other sources of evidence (...)
Using a list comprehension, your code will look like:
parsedContent = BeautifulSoup(bodytext)
body = '\n'.join([p.getText() for p in parsedContent.findAll('p')])
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the
remaining lines.
parsedContent.find('p') is exactly the same as doing parsedContent.findAll('p')[0]:
>> parsedContent.findAll('p')[0].getText() == parsedContent.find('p').getText()
True
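And if you want all the text content of the message rather than only the p tags, you can call get_text() on the whole parsed document; a minimal sketch:

parsedContent = BeautifulSoup(bodytext, 'html.parser')
# separator='\n' keeps text from adjacent tags on separate lines
body = parsedContent.get_text(separator='\n')
print(body)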