Can't seem to extract text from element using BS4 - html

I am trying to extract the item name from this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
The element I am trying to grab it from is
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color: rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using Selenium and retrieve the text that way, but when I try the same thing with bs4 I can't seem to find the text.
I've tried searching for the class "item_desc_description", but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.

The data you are trying to extract is not present in the HTML that requests downloads. My guess is that it is filled in afterwards by JavaScript (just guessing).
However, I managed to find the info in the div with the class "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for a tag with better formatting, or just clean up the one generated by my code.
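A minimal sketch of that clean-up, run against a made-up stand-in for the static HTML rather than the live page. The markup in STATIC_HTML is an assumption for illustration; only the id and class names come from the posts above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests receives (not the real page).
STATIC_HTML = """
<h1 class="hover_item_name" id="largeiteminfo_item_name"></h1>
<div class="market_listing_nav">
  <a href="#">Counter-Strike: Global Offensive</a>
  &gt;
  <a href="#">AK-47 | Redline (Field-Tested)</a>
</div>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The <h1> is found, but it has no text in the static HTML.
heading = soup.find("h1", id="largeiteminfo_item_name")
heading_text = heading.get_text()
print(repr(heading_text))

# stripped_strings yields each text fragment without surrounding whitespace,
# so the last fragment of the breadcrumb is the item name.
nav = soup.find("div", class_="market_listing_nav")
parts = list(nav.stripped_strings)
item_name = parts[-1]
print(item_name)
```

Taking the last fragment only works as long as the item name stays the final entry in that breadcrumb, so it is worth checking len(parts) before relying on it.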

Related

How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example - I want to omit "1." then the first "5" (as in the example)
Additionally, I would like to write a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate over the ResultSet and decompose() the unwanted tags:
e.select_one('.wld--tbd').decompose()
Extract texts with stripped_strings and join() them to your expected string:
data.append(';'.join(e.stripped_strings))
Example
This also makes some replacements based on a dict, just to demonstrate how that would work. The R and P entries are placeholders, since I don't know what the other result letters are.
...
soup = BS2(page,'html.parser')
data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To only get result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
Output
Based on the actual data, so it may differ from the example in the question.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
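The same idea can be tried offline. Below is a rough sketch against simplified, made-up markup (SAMPLE is an assumption; only the class names are taken from the answer above). The helper row_to_line is a hypothetical name, and unlike the snippets above it skips selectors that match nothing instead of raising AttributeError on None.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has more cells and classes.
SAMPLE = """
<div class="ui-table__row">
  <div class="tableCellRank">1.</div>
  <div class="tableCellParticipant">Ajax</div>
  <span class="table__cell--value">5</span>
  <span class="table__cell--value">5</span>
  <span class="table__cell--value">0</span>
  <span class="table__cell--score">16:3</span>
  <span class="table__cell--points">15</span>
  <div class="table__cell--form">
    <span class="wld--tbd">?</span><span>W</span><span>W</span>
  </div>
</div>
"""

def row_to_line(row, drop=(".tableCellRank", ".table__cell--points", ".wld--tbd")):
    """Remove unwanted cells, split the score, and join the remaining texts."""
    for selector in drop:
        # select() returns a list, so a missing element is simply skipped.
        for tag in row.select(selector):
            tag.decompose()
    score = row.select_one(".table__cell--score")
    if score is not None:
        score.string = ";".join(score.get_text(strip=True).split(":"))
    labels = {"W": "WIN"}  # extend with other result letters as needed
    return ";".join(labels.get(text, text) for text in row.stripped_strings)

soup = BeautifulSoup(SAMPLE, "html.parser")
lines = [row_to_line(row) for row in soup.select(".ui-table__row")]
print(lines)
```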
You had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text. As you can see, that's not going to work. Here is a simple hack to get what you want: I keep grabbing divs from the table div you found, and then grab the text of the next eight divs after finding Ajax. Then I do some dirty string manipulation, because the WWWWW is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page,'html.parser')
content = soup.find('div',{'class':'ui-table__body'})
content_list = content.find_all('span',{"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] +res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [ i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content, so you should add a lot of error checking. I also replaced the sleep with a wait call and added ChromeDriverManager (from webdriver_manager), which allows me to use Selenium with Chrome.
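As a rough idea of what that error checking could look like, here is a sketch of the string manipulation step as a function that validates its input first. The function name format_row and the sample field list are assumptions for illustration; the eight-field layout is inferred from the slicing in the code above.

```python
def format_row(fields):
    """Turn the eight raw cell texts of one table row into a ';'-joined line."""
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    score = fields[5]
    if score.count(":") != 1:
        raise ValueError(f"unexpected score format: {score!r}")
    # Drop the leading '?' placeholder and split the form into single letters.
    form = fields[7].lstrip("?")
    return ";".join(fields[:4] + score.split(":") + list(form))

# Hypothetical raw fields, shaped like the ones collected in the loop above.
raw = ["Ajax", "5", "5", "0", "0", "16:3", "15", "?WWWWW"]
print(format_row(raw))  # Ajax;5;5;0;16;3;W;W;W;W;W
```

If the site adds or removes a column, this fails with a clear message instead of silently producing a shifted line.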

beautiful soup unable to find elements from website

It's my first time working with web scraping, so cut me some slack. I'm trying to pull the "card_tag" from a website. I triple-checked that the card tag is inside the respective tags, as seen in the code.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for div_tag in soup.find_all('div id="siteContainer"'):
    ul_tag = div_tag.find("ul class")
    li_tag = ul_tag.find("li")
    card_tag = li_tag.find("h3")
    urls.append(card_tag)
print(urls)
When I go to print the url list it outputs nothing. You can see the thing I'm looking for by visiting the link as seen in the code and inspecting element on "Blood-C". As you can see it's listed in the tag I'm trying to find, yet my code can't seem to find it.
Any help would be much appreciated.
You just need to make some minor syntax changes to how you pass the tags and attributes.
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
containers = soup.find_all('div', {'id':'siteContainer'})
for div_tag in containers:
    ul_tag = div_tag.find("ul", {'data-type':'anime'})
    li_tag = ul_tag.find_all("li")
    for each in li_tag:
        card_tag = each.find("h3")
        urls.append(card_tag)
        print(card_tag)
Also, you could just skip all that and go straight to those <h3> tags with the class attribute cardName:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.anime-planet.com/users/mistersenpai/anime/dropped")
src = result.content
soup = BeautifulSoup(src, features="html.parser")
urls = []
for card_tag in soup.find_all('h3', {'class':'cardName'}):
    print(card_tag)
    urls.append(card_tag)
Output:
<h3 class="cardName">Black Butler</h3>
<h3 class="cardName">Blood-C</h3>
<h3 class="cardName">Place to Place</h3>
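To see why the original loop printed nothing, here is a small offline sketch. SAMPLE is made-up markup (the cardDeck class in particular is an assumption); it compares the question's call, the corrected call, and a CSS selector that does the whole lookup in one step.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup for illustration only.
SAMPLE = """
<div id="siteContainer">
  <ul class="cardDeck" data-type="anime">
    <li><h3 class="cardName">Black Butler</h3></li>
    <li><h3 class="cardName">Blood-C</h3></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# The whole string is treated as a tag name, so nothing matches.
wrong = soup.find_all('div id="siteContainer"')

# Tag name and attribute are passed separately.
right = soup.find_all("div", id="siteContainer")

# A CSS selector can express the container, list and heading in one go.
names = [
    h3.get_text(strip=True)
    for h3 in soup.select('#siteContainer ul[data-type="anime"] h3.cardName')
]
print(len(wrong), len(right), names)
```

Using get_text() here gives the titles as plain strings rather than the <h3> tags themselves.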

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser
URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()
class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)
parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])  # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n',''))
        print(parser.gold[target+9].replace('\\n',''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
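A slightly less position-dependent variant, still standard library only, is to group the text by table row and cell and then look the row up by its label. This is a sketch against made-up sample HTML: the table layout is an assumption, the first row reuses the values from the output above, and TableParser and find_row are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical table; the real page's layout may differ.
SAMPLE = """
<table>
  <tr><td>Gold Price per Ounce</td><td>$1,566.70</td><td>8.65</td><td>0.55%</td></tr>
  <tr><td>Gold Price per Gram</td><td>n/a</td><td>n/a</td><td>n/a</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def find_row(rows, label):
    """Return the stripped cells of the first row whose first cell equals label."""
    for row in rows:
        if row and row[0].strip() == label:
            return [cell.strip() for cell in row]
    return None

parser = TableParser()
parser.feed(SAMPLE)
row = find_row(parser.rows, "Gold Price per Ounce")
print(row[1], row[3])  # second and fourth <td> of that row
```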

Can't seem to scrape the website "Forbes" properly

I'm trying to scrape the links and titles of the articles on the frontpage of the website https://www.forbes.com/ .
I'm not proficient in HTML, but I've been following some Beautiful Soup tutorials and have been getting by with the knowledge I'm picking up along the way.
Here is what I have so far:
source = urllib.request.urlopen('https://www.forbes.com').read()
soup = bs.BeautifulSoup(source,'lxml') # Tried 'html.parser' as well
##print(soup.findAll('div',{'class':"c-entry-box--compact c-entry-box--compact--article"}))
for url in soup.findAll('a',{'class':"exit_trigger_set"}):
    print (url.get('href'))
Inspecting the site's html, I seem to have the class and 'a' (not sure what you call 'a' in this case) correct.
However, instead of getting all the links of the articles on the frontpage, I'm only getting one.
https://www.amazon.com/Intelligent-REIT-Investor-Wealth-Investment/dp/1119252717
Not sure what I'm doing wrong.
Thank you.
EDIT:
This seems to find some of the top stories but I don't know how to pull out the links only
for i in soup.findAll('h4', {'class': "editable editable-hed"}):
    print (i)
Here's how I would do it:
import urllib.request  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
import pandas as pd
source = urllib.request.urlopen('https://www.forbes.com')
soup = BeautifulSoup(source,'lxml')
lst = []
for i in soup.findAll('h4', {'class': "editable editable-hed"}):
    title = i.text
    link = i.find('a')['href'][2:]  # drop the leading '//' of the protocol-relative URL
    title = title.replace('\t','')
    title = title.replace('\n','')
    title = title.strip()
    lst.append({'title':title, 'link':link})
df = pd.DataFrame.from_dict(lst)
And you get 15 articles and their links.
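If you want absolute links rather than ones with the leading "//" cut off, urljoin from the standard library can resolve them against the site's base URL. Here is a sketch on made-up headings: SAMPLE and its URLs are assumptions, extract_articles is a hypothetical helper, and headings without a link are skipped rather than raising an error.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical headings; the paths are invented for illustration.
SAMPLE = """
<h4 class="editable editable-hed">
  <a href="//www.forbes.com/sites/example/first-story/">
    First story
  </a>
</h4>
<h4 class="editable editable-hed"><a href="/sites/example/second-story/">Second story</a></h4>
<h4 class="editable editable-hed">No link here</h4>
"""

BASE_URL = "https://www.forbes.com"

def extract_articles(html, base_url=BASE_URL):
    """Return a list of {'title', 'link'} dicts for headings that contain a link."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for heading in soup.select("h4.editable-hed"):
        anchor = heading.find("a", href=True)
        if anchor is None:
            continue  # heading without a link
        articles.append({
            "title": heading.get_text(strip=True),
            # urljoin resolves both '//host/path' and '/path' against the base.
            "link": urljoin(base_url, anchor["href"]),
        })
    return articles

articles = extract_articles(SAMPLE)
for article in articles:
    print(article["title"], article["link"])
```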

Using BeautifulSoup for html scraping

So I'm trying to make a program that tells the user how far away Voyager 1 is from the Earth. NASA has this info on their website here: http://voyager.jpl.nasa.gov/where/index.html...
I can't seem to manage to get the information within the div. Here's the div: <div id="voy1_km">Distance goes here</div>
My current program is as follows:
import requests
from BeautifulSoup import BeautifulSoup
url = "http://voyager.jpl.nasa.gov/where/index.html"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
test = soup.find('div', {'id' : 'voy1_km'})
print test
So, long story short: how do I get the div contents?
As you can see from the webpage itself, the distance keeps changing, which is actually driven by JavaScript. You could maybe just read the JavaScript code, so you don't even need to scrape to get the distance... (I hate websites using JavaScript as much as you do :) )
If you really want to get the number off their website, you can use Selenium.
# pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Firefox()
driver.get("http://voyager.jpl.nasa.gov/where/index.html")
time.sleep(5)  # crude wait for the page's JavaScript to fill in the values
elem = driver.find_element(By.CLASS_NAME, "tr_dark")
print(elem.text)
driver.close()
Here is the output:
Distance from Earth
19,964,147,071 KM
133.45208042 AU
Of course, please refer to the terms and conditions of their website regarding how much you are allowed to scrape and redistribute the data.
The bigger question is why even bother scraping it. If you dive a bit deeper into the Javascript file, you can repeat its calculation in a very simple manner:
import time
epoch_0 = 1445270400
epoch_1 = 1445356800
dist_0_v1 = 19963672758.0152
dist_1_v1 = 19966727483.2612
current_time = time.time()
current_dist_km_v1 = ( ( ( current_time - epoch_0 ) / ( epoch_1 - epoch_0 ) ) * ( dist_1_v1 - dist_0_v1 ) ) + dist_0_v1
print("{:,.0f} KM".format(current_dist_km_v1))
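The calculation above is a plain linear interpolation between two known (time, distance) points. Wrapping it in a function makes it easy to check against the endpoints; interpolate_distance is a hypothetical name, and the constants are the ones quoted above, which will be out of date by now.

```python
import time

# Reference points quoted above (seconds since the Unix epoch, kilometres).
EPOCH_0 = 1445270400
EPOCH_1 = 1445356800
DIST_0_V1 = 19963672758.0152
DIST_1_V1 = 19966727483.2612

def interpolate_distance(t, t0=EPOCH_0, t1=EPOCH_1, d0=DIST_0_V1, d1=DIST_1_V1):
    """Linearly interpolate (or extrapolate) the distance at time t."""
    fraction = (t - t0) / (t1 - t0)
    return d0 + fraction * (d1 - d0)

# At the first reference time the result is the first reference distance.
print("{:,.0f} KM".format(interpolate_distance(EPOCH_0)))

# For the current time this extrapolates far beyond the reference interval,
# so treat the result as a rough estimate only.
print("{:,.0f} KM".format(interpolate_distance(time.time())))
```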