Add line break to HTML after text addition - html

I am adding text to an existing string in HTML.
added = soup.find(text=re.compile('Summary|Experience'))
added.insert(0, NavigableString(code))
I would like to also add a line break after the text inserted so each string is on a different line.
I tried:
added.insert(0, NavigableString(code)+'<br/>')
And some other variations too...
Thanks,

You need to use .new_tag method to create your <br> tag
Demo
In [22]: from bs4 import BeautifulSoup
In [23]: soup = BeautifulSoup("""<p>Experience</p><strong>Summary</strong>""")
In [24]: newtg = soup.new_tag('br')
In [25]: soup.insert(0, newtg)
In [26]: soup
Out[26]: <br/><html><body><p>Experience</p><strong>Summary</strong></body></html>

Related

Extract tex from href (BeautifulSoup)

I want to extract the TEXT from this HTML element:
mail#1st-architects.com
all_profiles.find("a", {"???":"???"}).get_text(strip=True)
Consider that I have a list of 1000 companies and each company has a href="mailto:mail#1st-architects.com" different.
You could combine attribute = value css selector using starts with ^ and ends with $ operators to match on hrefs with specified substrings
emails = [i.text for i in all_profiles.select("[href^=mailto][href$='#1st-architects.com']")]
You could try something like this.
This code will print the text of all <a> with href as an email.
import re
from bs4 import BeautifulSoup
s = '''
mail#1st-architects.com
second_mail#2nd-architects.com
Some Link
mail#example.com
'''
soup = BeautifulSoup(s, 'lxml')
a = soup.find_all('a', attrs= {'href': re.compile(r'^mailto:')})
for i in a:
print(i.text.strip())
mail#1st-architects.com
second_mail#2nd-architects.com
mail#example.com

Extract text across HTML tags as a single string

I'm trying to extract the text content from the HTML code below as a complete sentence however I am not able to. I tried using both Beautifulsoup.prettify() and Beautifulsoup.get_text() but those gave me 3 sentences. I would like to read the HTML below as a single proper sentence like
Recognized by Microsoft & Google, Inc., offices.
<li>Recognized by
<em>Microsoft</em> &
<em>Google, Inc.</em>, offices.</li>
I really don't understand what you need , but It'll help you to extract Content from the url of the website
import requests
import xlsxwriter
from bs4 import BeautifulSoup
#Text File where the content will be written
file = open("test.txt","w")
#Url from where the data will be extracted
urls ="https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(urls)
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('p'): #extracting all content of <P> tag from the url
#You can put the desired tag according to your need
file.write(link.get_text())
file.close()
You can use an HTML parser like BeautifulSoup to extract the text without tags (soup.text), then strip the text of duplicate whitespaces/newlines etc:
input_str = '''
<li>Recognized by
<em>Microsoft</em> &
<em>Google, Inc.</em>, offices.</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)
Output:
Recognized by Microsoft & Google, Inc., offices.
Edit: based on your comments, in order to get a list of strings as an output (one for each li tag, you can do:
input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(input_str,"html.parser")
result = []
for li in soup.find_all('li'):
text = " ".join(li.text.split())
result.append(text)
print(result)
Output:
['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']

Can't seem to extract text from element using BS4

I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
the element i am trying to grab it from is
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color:
rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using selenium and retrieve the text that way but when i duplicate this with bs4 I can't seem to find the text.
Ive tried searching class "item_desc_description" but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML page, I guess it might be generated aside with JavaScript (just guessing).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for tag with better formatting or just clean up the on generated by my code.

Having trouble finding Span tag (Python 3)

I'm trying to strip out the Span tags from a html file.
I am using a page which has a lot of Span tags in it. I need to extract some numbers and add them together. However, I can't even get the lines I need out, so I am hoping someone can offer some advice.
My code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
# url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
# print(soup)
spans = soup.findAll('span')
for span in spans:
print span
Thanks

Python3.5 BeautifulSoup4 get text from 'p' in div

I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the HTML without the text from the web page. What am I missing to get the text?
The following link is in the 'finteredcasesdoc.text' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup
with open('filteredcasesdoc.txt', 'r') as openfile1:
for line in openfile1:
rulingpage = requests.get(line).text
soup = BeautifulSoup(rulingpage, 'html.parser')
doctext = soup.find('div', class_='caselawcontent searchable-content')
print (doctext)
from bs4 import BeautifulSoup
import requests
url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
I've added a much more reliable .find method ( key : value)
whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
the_title = whole_section.center.h2
#e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
#e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
#authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
#Because we don't want the paragraph h3.p.
# we could aslso do findAll('p',recursive=False) doesnt pickup children
Basically, I've dissected this whole tree
as for the Paragraphs (e.g. Main text, the var para), you'll have to loop
print(authors)
# and you can add .text (e.g. print(authors.text) to get the text without the tag.
# or a simple function that returns only the text
def rettext(something):
return something.text
#Usage: print(rettext(authorts))
Try printing doctext.text. This will get rid of all the HTML tags for you.
from bs4 import BeautifulSoup
cases = []
with open('filteredcasesdoc.txt', 'r') as openfile1:
for url in openfile1:
# GET the HTML page as a string, with HTML tags
rulingpage = requests.get(url).text
soup = BeautifulSoup(rulingpage, 'html.parser')
# find the part of the HTML page we want, as an HTML element
doctext = soup.find('div', class_='caselawcontent searchable-content')
print(doctext.text) # now we have the inner HTML as a string
cases.append(doctext.text) # do something useful with this !