I have an HTML file on my hard drive that I want to run an XPath search on, just like when scraping a website.
I have used the following code to scrape from websites:
from lxml import html
import requests

response = requests.get('http://www.website.com/')
if response.status_code == 200:
    pagehtml = html.fromstring(response.text)
    for elt in pagehtml.xpath('//div[@class="content"]/ul/li/a'):
        print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
Now this works well when getting something from a website, but how do I go about it when the HTML file is on my HD? I have tried about ten things, and at the moment my code looks like this:
with open(r'website.html', 'rb') as infile:
    data = infile.read()

for elt in data.xpath('//h3/a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
I keep getting different errors, sometimes mentioning '_io.BufferedReader', but I just can't get the code right.
Any suggestions? Regards
You could use the following code:
from lxml import html
pagehtml = html.parse('index.html')
for elt in pagehtml.xpath('//a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])
This makes sure that the decoding of the file data is handled automatically.
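If you would rather read the file yourself, the key point is that .xpath() is a method of the parsed tree, not of the raw bytes returned by infile.read(). A minimal sketch of that variant (assuming the file sits next to the script and your original '//h3/a' structure):

from lxml import html

with open(r'website.html', 'rb') as infile:
    # Parse the raw bytes into an element tree first; lxml handles the decoding.
    pagehtml = html.fromstring(infile.read())

for elt in pagehtml.xpath('//h3/a'):
    print("**", '"', elt.text_content(), '"', "****", elt.attrib['href'])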
Related
Every sort of help will be extremely appreciated. I am building a parser for a website and am trying to detect an element with the lxml package; the element has a pretty simple relative XPath: '//div[@id="productDescription"]'. When I manually go to the web page, do 'view page source' and copy the HTML string to a local HTML file, everything works perfectly. However, if I download the file automatically:
headers = {"user-Agent": "MY SCRAPER USER-AGENT", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1","Connection": "close", "Upgrade-Insecure-Requests": "1"}
product_HTML_bytes = requests.get(product_link, headers=headers, proxies={'http': "***:***"}).content
product_HTML_str = product_HTML_bytes.decode()
main_data = html.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[#id="productDescription"]')
...
I get nothing (and the data does exist in the file). I had also tried to first scrape a sample of pages using the same requests.get with the same headers and so on, saving the files locally, then cleaning the extra spaces and indenting the documents manually with this HTML formatter: https://www.freeformatter.com/html-formatter.html, and then, boom, it works again. However, I couldn't put my finger on what exactly changes in the files, and I was pretty sure extra spaces and indented tabs should not make a difference.
What am I missing here?
Thanks in Advance
Edit:
URL: https://www.amazon.com/Samsung-MicroSDXC-Adapter-MB-ME128GA-AM/dp/B06XWZWYVP
Because pasting them here is impossible (the files exceed the length limit), I uploaded them to the web.
The not working HTML: https://easyupload.io/231pdd
The indented, clean, and formatted HTML page: https://easyupload.io/a9oiyh
For some strange reason, it seems the lxml library mangles the text output of requests.get() when the output is filtered through the lxml.html.fromstring() method. I have no idea why.
The target data is still there, no doubt:
from bs4 import BeautifulSoup as bs
soup = bs(product_HTML_str, 'lxml')  # note that the lxml parser is used here!
for elem in soup.select_one('#productDescription p'):
    print(elem.strip())
Output:
Simply the right card. With stunning speed and reliability, the...
etc.
I personally much prefer lxml's XPath to the find() and CSS-selector methods used by BeautifulSoup, but this time BeautifulSoup wins...
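If you want to keep writing XPath, one possible middle ground (just a sketch, not verified against this particular page) is lxml's soupparser module, which parses through BeautifulSoup but hands back ordinary lxml elements:

from lxml.html import soupparser

# Parse via BeautifulSoup (requires bs4 installed), but get an lxml tree back
# so the original XPath expression still works.
main_data = soupparser.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[@id="productDescription"]')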
I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests. I am using this bit of code. Additionally, I am using Spyder 3.7. Also, I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect I found the HTML section that contains the wave height (screenshot of the surfline website & HTML). I run the code and it prints out the HTML of the website. When I do Ctrl+F to look for the section I want to scrape, it is not there. My question is why it is not being printed out and how do I fix it. I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.
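One quick way to check that suspicion (just a sketch; the placeholder stands in for whatever text you actually saw in the inspector) is to test whether that text appears in the raw response at all:

# Reusing r from the snippet above; replace the placeholder with the exact
# wave-height text you saw in the browser inspector.
print('SURF_HEIGHT_TEXT' in r.text)  # False suggests the data is loaded later by JavaScript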
I am trying to get data for the following website using requests and Scrapy Selector.
import requests
from scrapy import Selector
url="https://seekingalpha.com/article/4312816-exxon-mobil-dividend-problems"
headers = {'user-agent': 'AppleWebKit/537.36'}
req = requests.get(url, headers=headers)
sel = Selector(text=req.text)
I could extract the text body, but when I tried to get the XPath for comments, I noticed that the HTML returned from requests is different from what the inspector shows, so selecting the class='b-b' like
sel.xpath("//div[@class='b-b']")
returns an empty list in Python. It seems that I'm missing something, or the HTML is partially hidden from bots.
After view(response) I found out the following is rendered,
My Questions
Why can the same HTML not be seen in the HTTP response?
How can I get the comments data using XPath expressions for this page?
Run your URL in the Scrapy shell and view the page with this command:
view(response)
Your URL will open in a browser, and there you can see the source code that Scrapy actually received. If the item is available there, you can get it by XPath: simply inspect that element and copy its XPath. I do not have my system with me, so I cannot send you exact code; try the above and your problem should be solved.
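For reference, a rough sketch of the shell session described above (the XPath is the one from the question; whether it matches depends on what the response actually contains):

# In a terminal:
#   scrapy shell "https://seekingalpha.com/article/4312816-exxon-mobil-dividend-problems"
# Then, inside the Scrapy shell:
view(response)                                  # opens the downloaded page in your browser
response.xpath("//div[@class='b-b']").getall()  # returns [] if the class is not in the raw HTML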
I have a script that is supposed to make a query on the following website. https://searchwww.sec.gov/EDGARFSClient/.
import requests
keyword = "something"
body = {"query":keyword}
page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
print(page3.content)
This just returns the webpage's HTML code as it is, without the keyword search. Any ideas on what I am doing wrong? Also, is there a way to filter out only the links that the search returns?
The way I wanted to do it was to go through the HTML code and isolate all strings that look like this:
https://example-link.com
I think my main issue is that I need to pull up "advanced search" before I search for my keyword. That seems to be messing things up for me. I am not entirely sure, as I've never done this before. Any help would be much appreciated.
I'm not sure how you got the "query" field, but the correct form field for searches on this website is "search_text".
from bs4 import BeautifulSoup
import requests
keyword = "something"
body = {"search_text":keyword}
page = requests.post('https://searchwww.sec.gov/EDGARFSClient/', data=body)
soup = BeautifulSoup(page.content, features='lxml')
for a in soup.find_all('a', href=True):
if a.has_attr('class') and 'filing' in a['class']:
print(a['href'])
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
javascript:opennew('http://www.sec.gov/Archives/edgar/data/<...>/<...>.htm','<...>','');
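If you only want the underlying document URLs rather than the javascript:opennew(...) wrappers, a small sketch (using a made-up href of the same shape, since the real ones are elided above) could pull them out with a regular expression:

import re

# Hypothetical href in the same javascript:opennew('...') shape as the output above.
href = "javascript:opennew('http://www.sec.gov/Archives/edgar/data/123/doc.htm','doc','');"
match = re.search(r"opennew\('([^']+)'", href)
if match:
    print(match.group(1))  # -> http://www.sec.gov/Archives/edgar/data/123/doc.htm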
I'm having problems getting to the RSS link that tells the browser where the RSS feed is for the site. The link is found in the <head> tag of the HTML. Here is an example of what the link looks like:
<link rel="alternate" type="application/rss+xml" title="CNN - Top Stories [RSS]" href="http://rss.cnn.com/rss/cnn_topstories.rss" />
My original approach was to treat the site like an XML file and look through the tags, but most sites have an arbitrary number of <meta> tags that forget to have a closing />, so the <link> tag I'm looking for becomes a child of a random <meta> tag.
Now I'm thinking of just treating the site like a string and looking for the <link> tag in it, but this causes problems, since the <link> tag can have its attributes in any possible order. Of course I can work around this, but I would prefer something a bit neater than looking for type="application/rss+xml" and then scanning to the left and right of it for the first href.
HTML parsing is hard! Even if you find a solution that works for one site, it will likely break on another. If you can find a library to help you, your life will be a lot easier.
If you cannot find an HTML parser for ActionScript 2, maybe you could set up a server-side script to do it for you? Like:
myXML.load("http://yourserver.com/cgi-bin/findrss?url=foo.com");
and then have it return the URL as XML.
If you try this approach, I recommend the Python library Beautiful Soup. I've used it before and, in my opinion, it's amazing. It will work on any website you give it, no matter how horrible the markup is.
It would look something like this:
#!/usr/bin/python
import cgi
import cgitb; cgitb.enable()  # Optional; for debugging only
import urllib2
from BeautifulSoup import BeautifulSoup

def getRssFromUrl(url):
    try:
        Response = urllib2.urlopen(url)
    except Exception:
        print "<error>error getting url</error>"
        return []
    html = Response.read()
    soup = BeautifulSoup(html)
    rssFeeds = soup.findAll('link', attrs={"type": "application/rss+xml"})
    return rssFeeds

print "Content-type: text/xml\n\n"

form = cgi.FieldStorage()
if form.has_key("url") is True:
    url = form["url"].value
else:
    url = ""

print "<xml>"
rssFeeds = getRssFromUrl(url)
for feed in rssFeeds:
    print ("<url>%s</url>" % feed["href"])
print "</xml>"