Extract text from href (BeautifulSoup) - html

I want to extract the TEXT from this HTML element:
<a href="mailto:mail#1st-architects.com">mail#1st-architects.com</a>
all_profiles.find("a", {"???":"???"}).get_text(strip=True)
Note that I have a list of 1000 companies, and each company has a different href value, e.g. href="mailto:mail#1st-architects.com".

You could combine CSS attribute selectors, using the starts-with (^=) and ends-with ($=) operators, to match hrefs that begin and end with the specified substrings:
emails = [i.text for i in all_profiles.select("[href^=mailto][href$='#1st-architects.com']")]
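Since each company has a different address, the ends-with part of that selector would restrict the match to a single domain. A minimal sketch that keeps only the starts-with test is shown here; the markup, the `profile` class and the second address are made up for illustration, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="profile"><a href="mailto:mail#1st-architects.com">mail#1st-architects.com</a></div>
<div class="profile"><a href="mailto:info#2nd-architects.com"> info#2nd-architects.com </a></div>
<div class="profile"><a href="/about">About us</a></div>
"""
all_profiles = BeautifulSoup(html, "html.parser")

# a[href^="mailto:"] matches every <a> whose href starts with "mailto:",
# whatever the address after it is.
emails = [a.get_text(strip=True) for a in all_profiles.select('a[href^="mailto:"]')]
print(emails)
```

The link to "/about" is skipped because its href does not start with "mailto:".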

You could try something like this.
This code will print the text of all <a> with href as an email.
import re
from bs4 import BeautifulSoup
s = '''
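<a href="mailto:mail#1st-architects.com">mail#1st-architects.com</a>
<a href="mailto:second_mail#2nd-architects.com">second_mail#2nd-architects.com</a>
<a href="#">Some Link</a>
<a href="mailto:mail#example.com">mail#example.com</a>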
'''
soup = BeautifulSoup(s, 'lxml')
a = soup.find_all('a', attrs={'href': re.compile(r'^mailto:')})
for i in a:
    print(i.text.strip())
Output:
mail#1st-architects.com
second_mail#2nd-architects.com
mail#example.com

Related

How to write a css/xpath for dynamically changing element?

I am using beautiful soup and below is my selector to scrape href.
html = '''<a data-testid="Link" class="sc-pciXn eUevWj JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP"
   href="https://join.com/companies/talpasolutions/4978529-project-customer-success-manager-heavy-industries-d-f-m">'''
soup = BeautifulSoup(html, "lxml")
jobs = soup.find_all("a", class_="sc-pciXn eUevWj JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP JobTile___StyledJobLink-sc-1nulpkp-0 gkKKqP")
for job in jobs:
    job_url = job.get("href")
I am using find_all because there are 3 such elements with hrefs in total.
The method above works, but the website changes the class names on a daily basis. I need a CSS selector or XPath that does not depend on them.
Try:
import requests
from bs4 import BeautifulSoup
url = "https://join.com/companies/talpasolutions"
soup = BeautifulSoup(requests.get(url).content, "lxml")
for a in soup.select("a:has(h3)"):
    print(a.get("href"))
Prints:
https://join.com/companies/talpasolutions/4978529-project-customer-success-manager-heavy-industries-d-f-m
https://join.com/companies/talpasolutions/4925936-senior-data-engineer-d-f-m
https://join.com/companies/talpasolutions/4926107-senior-data-scientist-d-f-m
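The selector a:has(h3) works because it relies on the page structure (a link that contains an h3) instead of the generated class names. The sketch below runs it against a simplified, hypothetical snippet; the class names, the h3 text and the second link are invented. It also tries the data-testid attribute from the question as an alternative, on the assumption that this attribute is stable and not shared by unrelated links, which would need checking on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the class values are arbitrary on purpose.
html = """
<a class="abc123" data-testid="Link" href="https://join.com/companies/talpasolutions/4978529-project-customer-success-manager-heavy-industries-d-f-m"><h3>Job title</h3></a>
<a class="zzz999" href="https://join.com/companies/talpasolutions"><span>Company page</span></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Structural match: any <a> that has an <h3> somewhere inside it.
by_structure = [a["href"] for a in soup.select("a:has(h3)")]

# Attribute match: any <a> carrying data-testid="Link".
by_attribute = [a["href"] for a in soup.select('a[data-testid="Link"]')]

print(by_structure)
print(by_attribute)
```

Neither selector mentions a class, so a daily change of class names does not affect them.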

beautiful soup unable to find elements from website

It's my first time working with web scraping, so cut me some slack. I'm trying to pull the "card_tag" from a website. I triple-checked that the card tag is inside its respective parent tags, as seen in the code.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for div_tag in soup.find_all('div id="siteContainer"'):
    ul_tag = div_tag.find("ul class")
    li_tag = ul_tag.find("li")
    card_tag = li_tag.find("h3")
    urls.append(card_tag)
print(urls)
When I go to print the url list it outputs nothing. You can see the thing I'm looking for by visiting the link as seen in the code and inspecting element on "Blood-C". As you can see it's listed in the tag I'm trying to find, yet my code can't seem to find it.
Any help would be much appreciated.
You just need a minor syntax change: pass the tag name and its attributes to find_all/find as separate arguments.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
containers = soup.find_all('div', {'id':'siteContainer'})
for div_tag in containers:
    ul_tag = div_tag.find("ul", {'data-type':'anime'})
    li_tag = ul_tag.find_all("li")
    for each in li_tag:
        card_tag = each.find("h3")
        urls.append(card_tag)
        print(card_tag)
Also, you could just skip all that and go straight to those <h3> tags with the class attribute cardName:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for card_tag in soup.find_all('h3', {'class':'cardName'}):
    print(card_tag)
    urls.append(card_tag)
Output:
<h3 class="cardName">Black Butler</h3>
<h3 class="cardName">Blood-C</h3>
<h3 class="cardName">Place to Place</h3>
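To see why the original call found nothing, it may help to compare the two forms on a small static snippet. The markup below is a reduced, assumed version of the page built from the tag names and attributes mentioned in this thread; the real page is larger.

```python
from bs4 import BeautifulSoup

# Assumed, reduced version of the page structure.
html = """
<div id="siteContainer">
  <ul data-type="anime">
    <li><h3 class="cardName">Black Butler</h3></li>
    <li><h3 class="cardName">Blood-C</h3></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The whole string is treated as a tag name, so nothing can match.
wrong = soup.find_all('div id="siteContainer"')

# Tag name and attributes passed separately.
right = soup.find_all("div", {"id": "siteContainer"})

# Equivalent CSS selector, returning just the titles as text.
names = [h3.get_text(strip=True) for h3 in soup.select("#siteContainer h3.cardName")]

print(len(wrong), len(right), names)
```

Using get_text here stores the titles as plain strings; appending the tag itself, as in the answers above, keeps the surrounding <h3> markup.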

Can't seem to extract text from element using BS4

I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
The element I am trying to grab it from is
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color: rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using Selenium and retrieve the text that way, but when I duplicate this with bs4 I can't seem to find the text.
I've tried searching for the class "item_desc_description", but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML that requests downloads; it is probably filled in afterwards by JavaScript (this is only a guess).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following:
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the page source for a tag with better formatting, or just clean up the output generated by my code.
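One way to do that cleanup is to iterate over stripped_strings and keep the last piece. This is only a sketch: the markup below is an assumed shape for the market_listing_nav div, reconstructed from the printed output, and the real page may nest things differently.

```python
from bs4 import BeautifulSoup

# Assumed shape of the breadcrumb div; the real markup may differ.
html = """
<div class="market_listing_nav">
  <a href="#">Counter-Strike: Global Offensive</a>
  &gt;
  <a href="#">AK-47 | Redline (Field-Tested)</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
elem = soup.find("div", {"class": "market_listing_nav"})

# stripped_strings yields each text fragment with surrounding whitespace removed
# and skips fragments that are whitespace only.
parts = list(elem.stripped_strings)
item_name = parts[-1]
print(item_name)
```

Taking the last fragment assumes the item name is always the final breadcrumb entry.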

Python3.5 BeautifulSoup4 get text from 'p' in div

I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the HTML without the text from the web page. What am I missing to get the text?
The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)
from bs4 import BeautifulSoup
import requests
url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
I've used a more reliable form of .find, passing the attributes as a {key: value} dict:
whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
the_title = whole_section.center.h2
#e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
#e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
#authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
# [1:] because we don't want the first paragraph, h3.p.
# We could also use findAll('p', recursive=False), which doesn't pick up nested children.
Basically, I've dissected the whole tree. As for the paragraphs (the main text, in the variable para), you'll have to loop over them.
print(authors)
# You can add .text (e.g. print(authors.text)) to get the text without the tags,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
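The loop over the paragraphs could look roughly like this. The snippet uses a tiny, invented stand-in for the ruling page (only the div class comes from the question), so the structure of the real page should be checked before relying on the [1:] slice.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the ruling page; only the div class is from the question.
html = """
<div class="caselawcontent searchable-content">
  <center><h3><p>Case title</p></h3></center>
  <p>First paragraph of the ruling.</p>
  <p>Second paragraph of the ruling.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
whole_section = soup.find("div", {"class": "caselawcontent searchable-content"})

# Skip the first <p>, which is the title nested inside <h3>.
para = whole_section.find_all("p")[1:]
paragraph_texts = [p.get_text(strip=True) for p in para]

# Alternative: only direct children of the div, which excludes the nested title.
direct_texts = [p.get_text(strip=True) for p in whole_section.find_all("p", recursive=False)]

for text in paragraph_texts:
    print(text)
```

In this sample both approaches give the same two paragraphs; on the real page they agree only if the main text paragraphs are direct children of the div.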
Try printing doctext.text. This will get rid of all the HTML tags for you.
import requests
from bs4 import BeautifulSoup
cases = []
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for url in openfile1:
        # GET the HTML page as a string, with HTML tags
        # (strip the trailing newline that comes with each line of the file)
        rulingpage = requests.get(url.strip()).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        # find the part of the HTML page we want, as an HTML element
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext.text)  # the element's text, with the HTML tags removed
        cases.append(doctext.text)  # do something useful with this!
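The difference between printing the element and printing its text can be seen on a one-line example. The markup here is made up; get_text with a separator and strip=True is one possible way to get tidier output than .text, and select_one with chained classes is an alternative lookup that does not depend on the order of the class names.

```python
from bs4 import BeautifulSoup

# Made-up markup using the div class from the question.
html = '<div class="caselawcontent searchable-content"><h2>Court</h2><p>Some <b>ruling</b> text.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Matches a div that has both classes, in any order.
doctext = soup.select_one("div.caselawcontent.searchable-content")

as_html = str(doctext)                        # what print(doctext) shows: tags included
as_text = doctext.get_text(" ", strip=True)   # text only, fragments joined by a space

print(as_html)
print(as_text)
```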

Add line break to HTML after text addition

I am adding text to an existing string in HTML.
added = soup.find(text=re.compile('Summary|Experience'))
added.insert(0, NavigableString(code))
I would like to also add a line break after the text inserted so each string is on a different line.
I tried:
added.insert(0, NavigableString(code)+'<br/>')
And some other variations too...
Thanks,
You need to use the .new_tag method to create your <br> tag; concatenating the string '<br/>' only adds literal text, not a tag.
Demo
In [22]: from bs4 import BeautifulSoup
In [23]: soup = BeautifulSoup("""<p>Experience</p><strong>Summary</strong>""")
In [24]: newtg = soup.new_tag('br')
In [25]: soup.insert(0, newtg)
In [26]: soup
Out[26]: <br/><html><body><p>Experience</p><strong>Summary</strong></body></html>
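The demo puts the <br> at the very start of the document. To place it right after each matched string instead, insert_before and insert_after on the matched node may be closer to what the question asks for. This is a sketch under assumptions: the value of code is a placeholder, and html.parser is used so no <html>/<body> wrapper is added.

```python
import re
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<p>Experience</p><strong>Summary</strong>", "html.parser")
code = "ABC-123 "  # placeholder for the text being added

# find_all returns a complete list first, so editing the tree inside the loop is safe.
for node in soup.find_all(string=re.compile("Summary|Experience")):
    node.insert_before(NavigableString(code))  # new text in front of the existing string
    node.insert_after(soup.new_tag("br"))      # a real <br> tag right after it

print(soup)
```

A fresh tag is created for every match because a tag object can only live in one place in the tree.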