Isolating an HTML element with Python

Hi, I'm using Beautiful Soup to parse HTML on Python 3.4, and I can't seem to find the right code to display the information inside these HTML tags. I've successfully parsed and extracted info from other sites, but for some reason, when I run the loop below to display the content, I just get empty brackets [] as if there were no info.
import requests
from bs4 import BeautifulSoup

web = requests.get('https://www.scutify.com/company.aspx?ticker=AAPL')
Info = web.content
Scutify = BeautifulSoup(Info, 'html.parser')
price = Scutify.find_all('span', {"id": "latest-price"})
print(price)
for item in price:
    print(item.content)

It's because there isn't any content there. The prices are dynamically generated by JavaScript on the page. Requests and BeautifulSoup can't get that data because they don't execute JavaScript; they just read the page source as a string.
That said, you're in luck. Reading the javascript reveals a predictable URL you can use to get all the ticker information in JSON: /service/get-quote.ashx?ticker=
So to get AAPL's info all you do is GET https://www.scutify.com/service/get-quote.ashx?ticker=AAPL
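A minimal sketch of that request with the requests library (the field names in the returned JSON aren't documented here, so print it first and pick out what you need):
import requests

resp = requests.get('https://www.scutify.com/service/get-quote.ashx',
                    params={'ticker': 'AAPL'})
quote = resp.json()  # the service returns the latest ticker info as JSON
print(quote)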

Related

Trying to determine why my xpath is failing in Scrapy

I'm trying to run a Scrapy spider on pages like this:
https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
And I'd like the spider to retrieve the bullet points with qualifications and responsibilities. I can write an xpath expression that gets exactly that, and it works in my browsers:
//*/section/div/ul/li
But when I try to use the Scrapy shell:
response.xpath("//*/section/div/ul/li")
It returns an empty list. Based on copying the response.text and loading it in a browser, it seems like the text is accessible, but I still can't access those bullets.
Any help would be much appreciated!
Looking at the page you linked, the list items you are targeting are not actually in the document response itself; they are loaded into the DOM later by JavaScript.
To access them, I'd recommend looking at Scrapy's documentation on Selecting dynamically-loaded content. The section that applies here in particular is Parsing JavaScript code.
Following the second example there, we can use chompjs (you'll need to install it with pip first) to extract the JavaScript data, unescape the HTML string, and then load it into Scrapy for parsing, e.g.:
scrapy shell https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
Then:
import html     # used to unescape the HTML stored in the JS
import chompjs  # used to parse the JS
import scrapy   # for scrapy.Selector

javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)
description_html = html.unescape(data['description'])
description = scrapy.Selector(text=description_html, type="html")
description.xpath("//*/ul/li")
This should output your desired list items:
[<Selector xpath='//*/ul/li' data='<li>Ensure the strength ...

Cannot collect all nodes of Google search result with goquery: some nodes are missing

I am trying to collect the results of a Google search page in Go using the goquery library. To do this, I am collecting all the nodes of a goquery selection. The problem is that the selection returned by Find("*") does not seem to contain all the nodes of the HTML document. Question: does the method collect ALL nodes of the whole tree structure or not? If not, is there a method to collect them all?
I tried applying the goquery Find("*") method to the whole document selection, but nodes with certain attributes are not returned even though they are in the HTML document. For instance, div nodes with class="srg" are not recognized:
alltags := doc.Find("*") //doc is the HTML doc with the Google search
The selection does not contain the div tags with class="srg". The same applies to other class values such as "bkWMgd" and "rc".
This has happened to me before. I was trying to scrape a page with Python's Beautiful Soup package and the same thing was happening.
It turned out that the HTML being returned was actually the markup the server sends back when it detects a bot. I solved this by setting the User-Agent header to Mozilla/5.0.
Hope this helps in your quest to solve this.
You can start by updating the fetch request you are performing so that it sends a browser-like User-Agent.
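For what it's worth, on the Python side that fix was just a matter of sending a browser-like User-Agent header; a minimal sketch with the requests library (the header string and query are only examples):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # look like a browser rather than a script
response = requests.get('https://www.google.com/search',
                        params={'q': 'web scraping'},
                        headers=headers)
print(response.status_code)
The same idea carries over to Go: build the request with http.NewRequest and call req.Header.Set("User-Agent", "...") before sending it with the client.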

Why is lxml html parser not parsing the complete file?

I am trying to parse a 16 MB HTML file using lxml. My actual task is to get all the doc tags, and for each doc tag, if the value of its docno tag matches my doc list, extract the content of that doc tag.
self.doc_file_list is a list containing the paths of such 16 MB files that I need to parse.
file is the absolute path of the file.
This is the code I am using currently:
for file in file(self.doc_file_list, 'r'):
    tree = etree.parse(file.strip(), parser)
    doc = tree.findall('.//doc')
    for elem in doc:
        docno = elem.find('.//docno').text
        if docno in self.doc_set:
            print >> out, etree.tostring(elem)
I checked the content of the tree using etree.tostring(tree) and it does not parse the complete file; it only parses a few KB of the actual file.
Note: I am not getting any error message, but the parsed content of the tree is incomplete, so I am not able to get the whole list.
I was finally able to solve this problem. I checked the tree that was generated and it was not parsing the whole document because the document was heavily broken. There is more information about this at http://lxml.de/parsing.html.
This issue of a broken HTML document can be resolved with one of the following two approaches:
1. Instead of using the html parser, you can use ElementSoup, provided by lxml. It uses the BeautifulSoup parser to handle broken HTML docs. Link: http://lxml.de/lxmlhtml.html
Note: This approach did not work out for me.
2. Another approach is to use BeautifulSoup directly, along with one of the parsers it provides. There are several parser options and you need to find out which one suits you best; for me, html.parser worked (a sketch follows below).
Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
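A rough sketch of the second approach, assuming the doc/docno structure from the question; doc_set and the file path are placeholders standing in for self.doc_set and one entry of self.doc_file_list:
from bs4 import BeautifulSoup

doc_set = {'DOC-001', 'DOC-002'}  # placeholder for self.doc_set

with open('path/to/one_16mb_file.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for doc in soup.find_all('doc'):
    docno = doc.find('docno')
    # keep only the docs whose docno is in the doc set
    if docno is not None and docno.get_text().strip() in doc_set:
        print(doc)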
Thanks all for the help.

Unable to locate html tag for scraping

I'm not great with HTML, so I'm a bit stumped by this.
I'm trying to scrape the datetime of Instagram posts using Python, and realised that the datetime information isn't within the HTML document of the post. However, I am able to query it using Inspect Element in the browser.
Where is this datetime information located exactly, and how can I obtain it?
The example I took is this random post: https://www.instagram.com/p/BEtMWWbjoPh/. The element is the "12h" displayed on the page.
[Update] I am using urllib to grab the URL and bs4 in Python to scrape. The output did not return anything with datetime. The code is below. I also printed out the entire HTML and was surprised that it does not contain datetime at all.
import urllib
from bs4 import BeautifulSoup

url = 'https://www.instagram.com/p/BEtMWWbjoPh/'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select('time')
for tag in tags:
    dateT = tag.get('datetime')
    print dateT
In your developer console, type this:
document.getElementsByTagName('time')[0].getAttribute('datetime');
This will return the data you are looking for. The above code is simply looking through the HTML for the tag name time, of which there is only one, then grabbing the datetime property from it.
As for Python, check out BeautifulSoup if you haven't already; this library lets you do the same thing:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.time['datetime']
Where html_doc is your raw HTML. To obtain the raw HTML, use the requests library.
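For example, fetching html_doc with requests might look like this (note that it only returns the server-rendered markup):
import requests
from bs4 import BeautifulSoup

html_doc = requests.get('https://www.instagram.com/p/BEtMWWbjoPh/').text
soup = BeautifulSoup(html_doc, 'html.parser')
# soup.time is None if the raw HTML contains no <time> tag at all
print(soup.time['datetime'] if soup.time else 'no <time> tag in the raw HTML')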
I think the problem you are experiencing is that urllib.urlopen(url).read() does not execute any JavaScript on the page.
Because Instagram is a client-side JavaScript app that relies on your browser to render the site, you'll need some sort of browser client to evaluate the JavaScript and then find the element on the page. For this I usually use PhantomJS (with the Ruby driver Capybara, but I would assume there is a Python package that works similarly).
HOWEVER, if you execute urllib.urlopen(url).read(), you should see a block of JSON in a script tag that begins with <script type="text/javascript">window._sharedData = {...
That block of JSON includes the data you are looking for. If you evaluate and parse that JSON, you should be able to access the time data.
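A rough sketch of that approach; the regex and the structure of the parsed JSON are assumptions, since Instagram changes its markup regularly, so inspect the resulting dict to locate the timestamp field:
import json
import re
import requests

html = requests.get('https://www.instagram.com/p/BEtMWWbjoPh/').text

# Grab the JSON assigned to window._sharedData in the page source
match = re.search(r'window\._sharedData\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    shared_data = json.loads(match.group(1))
    # The exact key path to the timestamp is an assumption and can change
    # whenever Instagram reworks its markup; explore the dict to find it.
    print(list(shared_data.keys()))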
That being said, the better way to do this is to use Instagram's API for the crawling. They make all of this data available to developers, so you don't have to crawl an ever-changing webpage.
(Apparently Instagram's API will only return public data for users who have explicitly given your app permission)

Python Web Scrape Index

I am VERY new to web scraping in any shape or form. I've been trying to get into Python, and I heard that web scraping was a good way to expose myself to it. After many Google searches I finally settled on two highly recommended modules: Requests and BeautifulSoup. I've read up a fair amount on both and have a basic understanding of how to use them.
I found a very basic website (basic in that there isn't much content or javascript and the like, making parsing the HTML a lot easier) and I have the following code:
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('http://www.basicwebs.co.uk/contact.htm').text)
for row in soup('div', {'id': 'Layer1'})[0].h2('font'):
    tds = row.text
    print tds
This code works. It produces the following result:
BASIC
WEBS
Contact details
Contact details
Which, if you spend a few minutes inspecting the code on the page, is the correct result (I assume). Now, while this code works, what if I wanted to get a different part of the page? For example, the little paragraph that states "If you are interested in having a website designed and hosted by us, please contact us either by e-mail or telephone." My understanding was to simply change the index number to the one corresponding to the header this text is found under, but when I change it I get a message that the list index is out of range.
Can anybody help? (as simple as you can make it, if possible)
I'm using Python 2.7.8
The text you require is surrounded by a font tag with the attribute size=3, so one way to get it is by selecting the first occurrence, like this:
font_elements = soup('font', {'size': '3'})
if font_elements:
    print font_elements[0].text
RESULT:
If you are interested in having a website designed
and hosted by us, please contact us either by e-mail or telephone.
You can do this directly:
soup('font',{'size': '3'})[0].text
However, I want to draw your attention to the mistake you made earlier.
soup('div', {'id': 'Layer1'})
This returns a list of all the div tags with id='Layer1', and there can be more than one. Unfortunately, the HTML you were trying to parse has only one such element, so indexing past it went out of bounds.
You can use an interactive Python interpreter like bpython or ipython to check what you are actually getting in an object. Happy hacking!
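For instance, a quick check in the interpreter (using the same soup object as above) shows why the index went out of range:
layers = soup('div', {'id': 'Layer1'})
print(len(layers))     # 1 -- so layers[1] and beyond raise IndexError
print(layers[0].text)  # the only Layer1 div on the page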
from urllib.request import urlopen
from bs4 import BeautifulSoup

web_address = 'http://www.basicwebs.co.uk/contact.htm'
html = urlopen(web_address)
bs = BeautifulSoup(html.read(), 'html.parser')
contact_info = bs.findAll('h2', {'align': 'left'})[0]
for info in contact_info:
    print(info.get_text())