Different HTTP response from inspector HTML

I am trying to get data from the following website using requests and a Scrapy Selector.
import requests
from scrapy import Selector
url="https://seekingalpha.com/article/4312816-exxon-mobil-dividend-problems"
headers = {'user-agent': 'AppleWebKit/537.36'}
req = requests.get(url, headers=headers)
sel = Selector(text=req.text)
I could extract the text body, but when I tried to get the XPath for the comments I noticed that the HTML returned by requests is different from what the inspector shows, so selecting elements with class='b-b' like
sel.xpath("//div[@class='b-b']")
returns an empty list in Python. It seems that I'm missing something, or the HTML is partially hidden from bots.
After view(response) I found that the page rendered from the raw response differs from what I see in the browser.
My questions:
Why can't the same HTML be seen in the HTTP response?
How can I get the comments data for this page using XPath expressions?

Run your URL in the scrapy shell and view the page with this command:
view(response)
Your URL opens in the browser, and there you can see the source code that Scrapy actually received. If the item is available there, you can get it with XPath: simply inspect the element and copy its XPath. I don't have my system at hand, so I can't send you exact code, but try the above and your problem should be solved.
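As a quick sanity check in the scrapy shell, something like the lines below (the class name b-b is copied from the question; if the comments are injected by JavaScript after page load, they will not appear in the raw response at all, and no XPath will find them there):
# in a terminal:
scrapy shell "https://seekingalpha.com/article/4312816-exxon-mobil-dividend-problems"
# then, inside the shell:
view(response)                                    # opens the HTML Scrapy actually received in your browser
response.xpath("//div[@class='b-b']")             # an empty list means the element is not in the raw HTML
response.xpath("//div[contains(@class, 'b-b')]")  # looser match in case 'b-b' is only part of the class attribute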

Related

Selenium returns "[]" when trying to print text found by xpath

I have an extension that checks that the XPath is correct, so as far as XPath validity goes, it should be fine. When I try to find the element with Selenium and then print the variable that holds the result, I just end up with "[]" in the console. Removing /text() and adding .text to the print gives me an AttributeError instead. I don't understand why nothing is returned.
from selenium import webdriver
from csv import DictReader
import time
url = 'https://my.te.eg/offering/usage'
browser = webdriver.Chrome(executable_path=r'C:\Users\User\Desktop\we_internet\chromedriver.exe')
browser.get(url)
print("Number: ")
input("Enter done when done writing password and web loaded")
remain_no = browser.find_elements_by_xpath("/html/body/app-root/div/div[1]/app-offering/app-usage/div/div/p-card/div/div/div/div[1]/div[1]/app-gauge/div/span[1]/text()")
#[link.get_attribute('outerHTML') for link in browser.find_elements_by_xpath("/html/body/app-root/div/div[1]/app-offering/app-usage/div/div/p-card/div/div/div/div[1]/div[1]/app-gauge/div/span[1]/text()")]
print(remain_no)
Here is an image of the HTML page I'm trying to scrape; in the inspector, that XPath does point at the element containing the number in the result area. In the code you may have noticed a different attempt at capturing outerHTML, but I still end up with the same result, an empty "[]".
//foo/bar/text() returns a text node, whereas Selenium expects an element back.
Hence you encounter an empty list.
Instead, drop the trailing /text() and print the text within the element, for example using a list comprehension as follows:
print([link.text for link in browser.find_elements_by_xpath("/html/body/app-root/div/div[1]/app-offering/app-usage/div/div/p-card/div/div/div/div[1]/div[1]/app-gauge/div/span[1]")])
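If the value is rendered asynchronously by the page's JavaScript, an explicit wait before reading .text is usually more reliable than pausing with input(); a rough sketch under that assumption (the XPath is copied from the question, the 20-second timeout is arbitrary):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

gauge_xpath = ("/html/body/app-root/div/div[1]/app-offering/app-usage/div/div"
               "/p-card/div/div/div/div[1]/div[1]/app-gauge/div/span[1]")
# wait up to 20 seconds for the gauge value to become visible, then read its text
remain_no = WebDriverWait(browser, 20).until(
    EC.visibility_of_element_located((By.XPATH, gauge_xpath))
)
print(remain_no.text)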

Get full HTML for page with dynamic expanded containers with python

I am trying to pull the full HTML from ratemyprofessors.com however at the bottom of the page, there is a "Load More Ratings" button that allows you to see more comments.
I am using requests.get(url) and beautifulsoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing that gives the top 20 comments, but not all of them.
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
comments = []
for j in soup.findAll('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)
BeautifulSoup is an HTML parser for static pages, not a renderer for dynamic web apps.
You could achieve what you want with a headless browser via Selenium: render the full page and repeatedly click the "Load More Ratings" button until there is nothing more to load.
Example: Clicking on a link via selenium
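A rough sketch of that loop; the button is located by its visible text, which is an assumption based on the question, and the comment class name is copied from the question's code:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
from bs4 import BeautifulSoup
import time

browser = webdriver.Chrome()
browser.get(url)  # url as used in the question

while True:
    try:
        # the button text 'Load More Ratings' is taken from the question's description
        more_button = browser.find_element_by_xpath("//button[contains(., 'Load More Ratings')]")
        more_button.click()
        time.sleep(2)  # give the next batch of comments time to load
    except (NoSuchElementException, ElementClickInterceptedException):
        break  # button gone or unclickable, assume everything is loaded

soup = BeautifulSoup(browser.page_source, "html.parser")
comments = [j.text for j in soup.findAll('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'})]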
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
Reference: Clicking link using beautifulsoup in python
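A minimal sketch of the Requests-HTML route; note that render() downloads a headless Chromium on first use, and on its own it only executes the page's initial JavaScript, so clicking "Load More Ratings" would still need the script argument or the Selenium approach above:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)      # url as used in the question
r.html.render(sleep=2)    # execute the page's JavaScript in headless Chromium
# the class name below is copied from the question's code
comments = [c.text for c in r.html.find('div.Comments__StyledComments-dzzyvm-0.dEfjGB')]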

LXML xpath does not find an element in a "dirty" HTML file, but succeeds after the file is indented and cleaned

Every sort of help will be extremely appreciated. I am building a parser for a website. I am trying to detect an element using the lxml package; the element has a pretty simple relative XPath: '//div[@id="productDescription"]'. When I manually go to the web page, open 'view page source', and copy the HTML string into a local HTML file, everything works perfectly. However, if I download the file automatically:
headers = {"user-Agent": "MY SCRAPER USER-AGENT", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1","Connection": "close", "Upgrade-Insecure-Requests": "1"}
product_HTML_bytes = requests.get(product_link, headers=headers, proxies={'http': "***:***"}).content
product_HTML_str = product_HTML_bytes.decode()
main_data = html.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[#id="productDescription"]')
...
I get nothing (and the data does exist in the file). I had also tried scraping a sample of pages using the same requests.get with the same headers and so on, saving the files locally, and then cleaning the extra spaces and indenting the documents manually with this HTML formatter: https://www.freeformatter.com/html-formatter.html, and then, boom, it works again. I couldn't put my finger on what exactly changes in the files, but I was pretty sure extra spaces and indented tabs should not make a difference.
What am I missing here?
Thanks in Advance
Edit:
URL: https://www.amazon.com/Samsung-MicroSDXC-Adapter-MB-ME128GA-AM/dp/B06XWZWYVP
Because the files exceed the length limit for pasting here, I uploaded them to the web.
The not working HTML: https://easyupload.io/231pdd
The indented, clean, and formatted HTML page: https://easyupload.io/a9oiyh
For some strange reason, it seems that the lxml library mangles the text output of requests.get() when that output is passed through the lxml.html.fromstring() method. I have no idea why.
The target data is still there, no doubt:
from bs4 import BeautifulSoup as bs
soup = bs(product_HTML_str, 'lxml')  # note that the lxml parser is used here!
for elem in soup.select_one('#productDescription p'):
    print(elem.strip())
Output:
Simply the right card. With stunning speed and reliability, the...
etc.
I personally much prefer using XPath in lxml over the find() and CSS selector methods used by BeautifulSoup, but this time BeautifulSoup wins...
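If you would rather keep XPath, one workaround (untested against this exact page) is lxml's BeautifulSoup-backed parser, which tolerates the broken markup but still returns an lxml tree; it requires beautifulsoup4 to be installed:
from lxml.html import soupparser

# parse the same string through BeautifulSoup, but get an lxml tree back so XPath still works
main_data = soupparser.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[@id="productDescription"]')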

Scrapy can't find form on page

I'm trying to write a spider that will automatically log in to this website. However, when I try using scrapy.FormRequest.from_response in the shell I get the error:
No <form> element found in <200 https://www.athletic.net/account/login/?ReturnUrl=%2Fdefault.aspx>
I can definitely see the form when I inspect element on the site, but it just did not show up in Scrapy when I tried finding it using response.xpath() either. Is it possible for the form content to be hidden from my spider somehow? If so, how do I fix it?
The form is created using JavaScript; it is not part of the static HTML source code. Scrapy does not execute JavaScript, so the form cannot be found.
The relevant part of the static HTML (where they inject the form using Javascript) is:
<div ng-controller="AppCtrl as appC" class="m-auto pt-3 pb-5 container" style="max-width: 425px;">
<section ui-view></section>
</div>
To find issues like this, I would either:
compare the source code from "View Source Code" and "Inspect" to each other
browse the web page with a browser without Javascript (when I develop scrapers I usually have one browser with Javascript for research and documentations and another one for checking web pages without Javascript)
In this case, you have to manually create your FormRequest for this web page. I was not able to spot any form of CSRF protection on their form, so it might be as simple as:
FormRequest(url='https://www.athletic.net/account/auth.ashx',
            formdata={"e": "foo@example.com", "pw": "secret"})
However, I think you cannot use formdata here; they appear to expect JSON instead. I am not sure FormRequest can handle that, so you may want to use a standard Request.
Since they heavily use JavaScript on their front end, you cannot use the source code of the page to find these parameters either. Instead, I used the developer console of my browser and checked the request/response that happened when I tried to log in with invalid credentials.
This gave me:
General:
Request URL: https://www.athletic.net/account/auth.ashx
[...]
Request Payload:
{e: "foo#example.com", pw: "secret"}
Scrapy has a JsonRequest class to help with posting JSON; see https://docs.scrapy.org/en/latest/topics/request-response.html
So something like the below should work
data = {"password": "pword", "username": "user"}
# JSON POST to API login URL
return JsonRequest(
url=url,
callback=self.after_login,
data=data,
)
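Putting the pieces together, a sketch of a minimal login spider using the payload captured above; the field names e and pw come from the developer-console capture earlier in this thread, and whether extra headers or cookies are required is untested:
import scrapy
from scrapy.http import JsonRequest

class AthleticLoginSpider(scrapy.Spider):
    name = "athletic_login"
    start_urls = ["https://www.athletic.net/account/login/"]

    def parse(self, response):
        # POST the credentials as JSON, mirroring the payload seen in the developer console
        return JsonRequest(
            url="https://www.athletic.net/account/auth.ashx",
            data={"e": "you@example.com", "pw": "your-password"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # inspect the response here to confirm whether the login succeeded
        self.logger.info("Login response status: %s", response.status)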

Extracting tags from an HTML page with hidden data using Python

I'm trying to learn scraping from different webpages. I tried to scrape data from a page containing tabs as follows:
url = "https://www.bc.edu/bc-web/schools/mcas/departments/art/people/#par-bc_tabbed_content-tab-0"
page = requests.get(url)
content = page.content
tree = html.fromstring(page.content)
soup = BeautifulSoup(content,"html.parser")
p = soup.find_all('div',{"id":'e6bde0e9_358d_4966_8fde_be96e9dcad0b'})
print p
This returns an empty result.
Inspecting the element shows the content, but the page source does not contain this data. Any pointers on how to extract the content?
This is because of JavaScript rendering, which means that the data you want does not come with the original request, but with requests generated by the JavaScript of that response.
To check all the requests that were generated by the original request, you'll have to use something like the developer tools in Chrome (Network tab).
For this particular case, the actual request you need is to this site, which will give you the information you need.
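The general pattern then looks like the sketch below; the endpoint URL is a placeholder, the real one has to be copied from the matching XHR entry in the Network tab of the developer tools:
import requests
from bs4 import BeautifulSoup

# hypothetical placeholder URL: replace it with the request URL copied from the Network tab
tab_content_url = "https://www.bc.edu/REPLACE-WITH-THE-XHR-URL"
resp = requests.get(tab_content_url)
# the returned fragment can be parsed like any other HTML
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.get_text(strip=True))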