What is the most efficient way to get divs with BeautifulSoup4 if they have multiple classes?
I have an HTML structure like this:
<div class='class1 class2 class3 class4'>
<div class='class5 class6 class7'>
<div class='comment class14 class15'>
<div class='date class20 showdate'> 1/10/2017</div>
<p>comment2</p>
</div>
<div class='comment class25 class9'>
<div class='date class20 showdate'> 7/10/2017</div>
<p>comment1</p>
</div>
</div>
</div>
I want to get the divs with the comment class. Usually nested classes are no problem, but I don't know why this:
html = BeautifulSoup(content, "html.parser")
comments = html.find_all("div", {"class": "comment"})
doesn't work. It returns an empty list.
My guess is that this happens because there are several classes, so it looks for a div whose only class is comment, which doesn't exist. How can I find all the comments?
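For what it's worth, the selector itself is fine: BeautifulSoup matches the class filter against each class token individually, so on static HTML like your snippet, find_all does locate divs that carry several classes. A quick self-contained check, using the markup from the question:

from bs4 import BeautifulSoup

snippet = """
<div class='comment class14 class15'><p>comment2</p></div>
<div class='comment class25 class9'><p>comment1</p></div>
"""
soup = BeautifulSoup(snippet, "html.parser")
# the class filter matches per class token, so multi-class divs are found
print(len(soup.find_all("div", {"class": "comment"})))  # prints 2

So an empty list means the comment divs simply aren't in the HTML you downloaded, which is exactly what's happening here.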
Apparently, the URL that fetches the comments section is different from the original URL that retrieves the main content.
This is the original URL you gave:
http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best
Behind the scenes, if you record the network log in the Network tab of Chrome's developer tools, you'll see a list of all the requests the browser sends. Most of them fetch images and scripts; a few go to other sites such as Facebook or Google (for analytics, etc.). The browser also sends a request to this particular site (sparknotes), and that one returns the comments section. This is the URL:
http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548
The value for post_id can be found in the web page returned when we request the first URL. It is contained in a hidden input tag.
<input type="hidden" id="postid" name="postid" value="1375724">
You can extract this info from the first web page using a simple soup.find('input', {'id': 'postid'})['value']. Of course, since this identifies the post uniquely, you need not worry about its changing dynamically on each request.
I couldn't find the '1507467541548' value passed to the '_' parameter (the last parameter of the URL) anywhere in the main page, nor in the cookies set by the response headers of any of the pages.
However, I went out on a limb and tried fetching the URL without the '_' parameter, and it worked. (The '_' value is most likely a cache-busting timestamp appended by the site's JavaScript; servers typically just ignore it.)
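As an aside, if you'd rather not build the query string by hand, requests can assemble it from a dict of parameters; a small sketch of the same request (post_id being the value read from the hidden input):

import requests

params = {'post_id': post_id, 'page': 1, 'comment_type': ''}  # post_id from the hidden input
r = requests.get('http://community.sparknotes.com/commentlist', params=params)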
So, here's the entire script that worked for me:
from bs4 import BeautifulSoup
import requests

req_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'community.sparknotes.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best'
    r = s.get(url, headers=req_headers)
    soup = BeautifulSoup(r.content, 'lxml')
    post_id = soup.find('input', {'id': 'postid'})['value']

    # url = 'http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548'  # the original URL found in the network tab
    url = 'http://community.sparknotes.com/commentlist?post_id={}&page=1&comment_type='.format(post_id)  # modified by removing the '_' parameter
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    comments = soup.findAll('div', {'class': 'commentCite'})
    for comment in comments:
        c_name = comment.div.a.text.strip()
        c_date_text = comment.find('div', {'class': 'commentBodyInner'}).text.strip()
        print(c_name, c_date_text)
As you can see, I haven't passed headers to the second s.get, so I'm not sure they're required at all; you can experiment with omitting them from the first request as well. But make sure you use requests rather than urllib (which I haven't tried), since the session's cookies might play a vital role here.
I am trying to extract the estimated monthly cost of "$1,773" from this URL:
https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/
Upon inspecting that part of the page, I see this data:
<div class="sc-qWfCM cdZDcW">
<span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>
To extract $1,773, I have tried this:
from bs4 import BeautifulSoup
import requests

url = 'https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/'
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'}))
This returns a list of three elements, with no mention of $1,773.
[<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$463,300</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,438</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$2,300<!-- -->/mo</span>]
Can someone please explain how to return $1,773?
I think you have to find the parent element first.
For example:
parent_div = soup.find('div', {'class': 'sc-qWfCM cdZDcW'})
result = parent_div.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'})
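Since Zillow's hashed class names (sc-qWfCM, jLucLe, ...) tend to change between site builds, it may be more robust to anchor on the label text and take its sibling instead; a sketch, assuming the markup shown in the question:

label = soup.find('span', string='Estimated monthly cost')
if label:
    # the price sits in the span right after the label
    print(label.find_next_sibling('span').text)  # $1,773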
When parsing a web page, we need to separate the components of the page by the way they are rendered: some components are rendered statically, others dynamically. Dynamic content also takes some time to load, since the page calls a backend API of some sort.
I tried parsing your page using Selenium ChromeDriver
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/")
time.sleep(3)  # give the dynamic content time to load

# note: @class (not #class) in the XPath
els = driver.find_elements(By.XPATH, "//span[@class='Text-c11n-8-48-0__sc-aiai24-0 jLucLe']")
for e in els:
    print(e.text)

driver.quit()
#OUTPUT
$463,300
$1,773
$2,300/mo
My code below:
import requests
from bs4 import BeautifulSoup

def investopedia():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    ip_price = soup.find_all('div', {'class': 'tv-symbol-price-quote__value js-symbol-last'})[0].find('span').text
    print(ip_price)

investopedia()
The element I found while inspecting the page (in the HTML):
<div class="tv-symbol-price-quote__value js-symbol-last"><span>736.27</span></div>
The 736.27 inside the span is the number I need.
Please help out a web scraping beginner here. Thanks in advance!
You get an index-out-of-range error because your code doesn't find any of the HTML elements you are looking for.
The information you want is kept inside an iframe. To retrieve the data, you have to switch to that iframe first. One way to do that is with Selenium.
import time
from selenium import webdriver

def investopedia():
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)  # it takes time to download the webpage
    # the quote widget lives inside an iframe; switch into it first
    iframe = driver.find_elements_by_css_selector('.tradingview-widget-container > iframe')[0]
    driver.switch_to.frame(iframe)
    time.sleep(1)
    ip_price = driver.find_elements_by_xpath('.//div[@class="tv-symbol-price-quote__value js-symbol-last"]')[0].get_attribute('innerText').strip()
    print(ip_price)

investopedia()
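If the fixed sleeps turn out to be flaky, explicit waits are a more reliable alternative; a sketch using the same selectors as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.investopedia.com/markets/quote?tvwidgetsymbol=tsla')
wait = WebDriverWait(driver, 15)
# blocks until the iframe exists, then switches into it
wait.until(EC.frame_to_be_available_and_switch_to_it(
    (By.CSS_SELECTOR, '.tradingview-widget-container > iframe')))
price = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.tv-symbol-price-quote__value.js-symbol-last')))
print(price.get_attribute('innerText').strip())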
I know this question has already been asked many times, and I've tried some of the existing solutions; they worked for my other projects.
But this site is different, I think.
I tried this first:
import requests
from bs4 import BeautifulSoup

html = requests.get(url="http://loawa.com")
soup = BeautifulSoup(html.content.decode('utf-8', 'replace'), 'html.parser')
print(soup)
It fetches the head and only a sliver of the body:
<body class="p-0 bg-theme-6" style="overflow-x:hidden"><script>window.location.reload(true);</script></body>
So I used prerender, like this:
html = requests.get(url = "http://service.prerender.io/http://loawa.com")
soup = BeautifulSoup(html.content.decode('utf-8','replace'), 'html.parser')
print(soup)
It gives me the same result.
So I tried it with headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36', 'Content-Type': 'text/html'}
response = requests.get("http://loawa.com", headers=headers)
soup = BeautifulSoup(response.content.decode('utf-8', 'replace'), 'html.parser')
print(soup)
The HTML comes out empty, and I'm not sure I set up the headers correctly.
What else can I try? I don't want to use Selenium for this.
Hope someone can enlighten me. Thanks!
I would love some help logging into the Fidelity website and navigating within it. My attempts so far have not led anywhere significant. So here is the code I have written, after much consultation with answers around the web. The steps are:
Login to Fidelity
Check that the response code is not 200 but is 302 or 303; my code passes this test (with a code of 302).
Then check the number of cookies returned (there were 5) and, for each cookie, try to navigate to a different web page within Fidelity (I do this five times, once per cookie, simply because I do not know which subscript "j" of the variable "cookie" will work).
function loginToFidelity() {
  var url = "https://www.fidelity.com";
  var payload = {
    "username" : "*********",
    "password" : "*********"
  };
  var opt = {
    "payload" : payload,
    "method" : "post",
    "followRedirects" : false
  };
  var response = UrlFetchApp.fetch(encodeURI(url), opt);
  if (response.getResponseCode() == 200) {
    Logger.log("Couldn't login.");
    return;
  }
  else if (response.getResponseCode() == 303 || response.getResponseCode() == 302) {
    Logger.log("Logged in successfully. " + response.getResponseCode());
    var cookie = response.getAllHeaders()['Set-Cookie'];
    for (var j = 0; j < cookie.length; j++) {
      var downloadPage = UrlFetchApp.fetch("https://oltx.fidelity.com/ftgw/fbc/oftop/portfolio#activity",
        {"Cookie" : cookie[j], "method" : "post", "followRedirects" : false, "payload" : payload});
      Logger.log(downloadPage.getResponseCode());
      Logger.log(downloadPage.getContentText());
    }
  }
}
For each choice of the subscript "j", I get the same ResponseCode (always 302) and the same ContentText. The ContentText is obviously incorrect, as it is not what it is supposed to be. It is shown below:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved here.</p>
</body></html>
Based on this, I have two questions:
Have I logged into the Fidelity site correctly? If not, why do I get a response code of 302 in the login process? What do I need to do differently to log in correctly?
Why am I getting such a strange and obviously incorrect answer for my ContentText while getting a perfectly reasonable ResponseCode of 302? What do I need to do differently to get the password-protected page within Fidelity, whose URL is "https://oltx.fidelity.com/ftgw/fbc/oftop/portfolio#activity"?
NOTE: Some other tests have been done in addition to the one stated above. Results from these tests are provided in the discussion below.
Here is something that worked for me. You may have found a solution already, not sure. Remember to fill in your login ID where the XXXXX is and your PIN where the YYYYY is.
I realize this is Python code, not Google Apps Script, but it should give you the idea of the code flow.
import requests, sys, lxml.html

s = requests.Session()
r = s.get('https://login.fidelity.com')
payload = {
    'DEVICE_PRINT' : 'version%3D3.5.2_2%26pm_fpua%3Dmozilla%2F5.0+(x11%3B+linux+x86_64%3B+rv%3A41.0)+gecko%2F20100101+firefox%2F41.0%7C5.0+(X11)%7CLinux+x86_64',
    'SavedIdInd' : 'N',
    'SSN' : 'XXXXX',
    'PIN' : 'YYYYY'
}
# note: login_url is not defined here; see the follow-up below for its value
r = s.post(login_url, data=payload, headers=dict(referer='https://login.fidelity.com'))
response = s.get('https://oltx.fidelity.com/ftgw/fbc/oftop/portfolio')
print(response.content)
mwahal, you left out the critical form action URL (your login_url is undefined).
This works (if added to your Python code):
login_url = 'https://login.fidelity.com/ftgw/Fas/Fidelity/RtlCust/Login/Response/dj.chf.ra'
By the way, here's the result of the print after the post, showing a successful login:
{"status":
{
"result": "success",
"nextStep": "Finish",
"context": "RtlCust"
}
}
or adding some code:
if r.status_code == requests.codes.ok:
    status = r.json().get('status')
    print(status["result"])
gets you "success"
Unfortunately, the answer from @mwahal doesn't work anymore; I've been trying to figure out why and will update if I do. One issue is that the login page now requires a cookie from the cfa.fidelity.com domain, which only gets set when one of the linked JavaScript files is loaded.
One alternative is to use Selenium, if you just want to navigate the site, or seleniumrequests if you want to tap into Fidelity's internal APIs.
There is a hitch with seleniumrequests for the transactions API: the API requires Content-Type: application/json, and seleniumrequests doesn't seem to support custom headers in requests. So I use Selenium to log in, call one of the APIs that doesn't need that header, copy and edit that response's request headers, and use regular requests to get the transactions:
from seleniumrequests import Chrome
import requests

# Log into Fidelity (username and password must be defined beforehand)
driver = Chrome()
driver.get("https://www.fidelity.com")
driver.find_element_by_id("userId-input").send_keys(username)
driver.find_element_by_name("PIN").send_keys(password)
driver.find_element_by_id("fs-login-button").click()

# Call an API that doesn't need the json content type, then reuse and edit its request headers
r = driver.request('GET', 'https://digital.fidelity.com/ftgw/digital/rsc/api/profile-data')
headers = r.request.headers
headers['accept'] = "application/json, text/plain, */*"
headers['content-type'] = "application/json"

payload = '{"acctDetails":[{"acctNum":"<AcctId>"}],"searchCriteriaDetail":{"txnFromDate":1583639342,"txnToDate":1591411742}}'
api = "https://digital.fidelity.com/ftgw/digital/dc-history/api"
r = requests.post(api, headers=headers, data=payload)
transactions = r.json()
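The txnFromDate and txnToDate values look like Unix epoch seconds (the two above span roughly March to June 2020). Assuming that's what the API expects, which is a guess on my part rather than anything documented, a rolling 90-day window could be built like this:

import time

# assumption: the API takes Unix epoch seconds for the date range
txn_to = int(time.time())              # now
txn_from = txn_to - 90 * 24 * 60 * 60  # 90 days back
payload = ('{"acctDetails":[{"acctNum":"<AcctId>"}],'
           '"searchCriteriaDetail":{"txnFromDate":%d,"txnToDate":%d}}' % (txn_from, txn_to))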
In the question How can I get a url from Chrome by Python?, it was brought up that you can grab the URL from Python with pywinauto 0.6. How is it done?
Using inspect.exe (mentioned in the Getting Started guide), you can find Chrome's address bar element and see that its "Value" property contains the current URL.
I found two ways to get this URL:
from __future__ import print_function
from pywinauto import Desktop
chrome_window = Desktop(backend="uia").window(class_name_re='Chrome')
address_bar_wrapper = chrome_window['Google Chrome'].main.Edit.wrapper_object()
Here's the first way:
url_1 = address_bar_wrapper.legacy_properties()['Value']
Here's the second:
url_2 = address_bar_wrapper.iface_value.CurrentValue
print(url_1)
print(url_2)
Also, if the protocol is "http", Chrome strips the "http://" prefix. You can add something like:
def format_url(url):
    if url and not url.startswith("https://"):
        return "http://" + url
    return url
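For example, with a couple of hypothetical address-bar values:

print(format_url("stackoverflow.com/questions"))  # -> http://stackoverflow.com/questions
print(format_url("https://example.com"))          # returned unchanged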