Why is lxml html parser not parsing the complete file?

I am trying to parse a 16 MB HTML file using lxml. My actual task is to get all the doc tags, and for each doc tag, if the value of its docno tag matches my doc list, I extract the content of that doc tag.
self.doc_file_list is a list containing the paths of such 16 MB files that I need to parse, and file is the absolute path of one such file.
This is the code I am using currently:
from lxml import etree

parser = etree.HTMLParser()                       # the lxml HTML parser mentioned in the title
for file in file(self.doc_file_list, 'r'):        # each line is the path of one 16 MB file
    tree = etree.parse(file.strip(), parser)
    doc = tree.findall('.//doc')                  # every <doc> element in the parsed tree
    for elem in doc:
        docno = elem.find('.//docno').text        # <docno> identifies the document
        if docno in self.doc_set:
            print >> out, etree.tostring(elem)    # write matching <doc> elements to the output file
I checked the content of tree using etree.tostring(tree) and it does not contain the complete file; only a few KB of the actual file get parsed.
Note: I am not getting any error message, but the parsed content of the tree is incomplete, so I am not able to get the whole list.

I was finally able to solve this problem. I checked the tree that was generated and it did not cover the whole document. This is because the document was heavily broken. You can check this information at http://lxml.de/parsing.html.
This issue with broken HTML documents can be resolved using one of the following two approaches:
1. Instead of the html parser, you can use ElementSoup provided by lxml. It uses the BeautifulSoup parser to handle broken HTML docs. Link: http://lxml.de/lxmlhtml.html
Note: This approach did not work out for me.
2. Another approach is to use BeautifulSoup directly with one of the parsers it provides. There are many parser options, and you need to find out which one suits you best. For me, html.parser worked; see the sketch after this list.
Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
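A minimal sketch of that second approach, assuming the doc/docno tag names from the question; doc_file and doc_set stand in for the question's file path and self.doc_set:

from bs4 import BeautifulSoup

with open(doc_file) as f:                   # doc_file: path to one 16 MB file
    soup = BeautifulSoup(f, 'html.parser')  # the parser that worked in my case

for doc in soup.find_all('doc'):            # finds <doc> elements even in broken markup
    docno = doc.find('docno').get_text().strip()
    if docno in doc_set:
        print(doc)                          # or write it out, as in the original snippet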
Thanks all for the help.

Related

Trying to determine why my xpath is failing in Scrapy

I'm trying to run a Scrapy spider on pages like this:
https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
And I'd like the spider to retrieve the bullet points with qualifications and responsibilities. I can write an xpath expression that gets exactly that, and it works in my browsers:
//*/section/div/ul/li
But when I try to use the Scrapy shell:
response.xpath("//*/section/div/ul/li")
It returns an empty list. Based on copying the response.text and loading it in a browser, it seems like the text is accessible, but I still can't access those bullets.
Any help would be much appreciated!
Looking at the page you have linked, the list items you are targeting are not actually in the document response itself but are loaded into the DOM later by JavaScript.
To access these I'd recommend looking at Scrapy's documentation on Selecting dynamically-loaded content. The section that applies here in particular is the Parsing JavaScript code section.
Following the second example there, we can use chompjs (you'll need to install it with pip first) to extract the JavaScript data, unescape the HTML string, and then load it into Scrapy for parsing, e.g.:
scrapy shell https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
Then:
import html     # used to unescape the HTML stored in the JS
import chompjs  # used to parse the JS

javascript = response.css('script::text').get()                     # first <script> block in the response
data = chompjs.parse_js_object(javascript)                           # JS object literal -> Python dict
description_html = html.unescape(data['description'])                # the job description as HTML
description = scrapy.Selector(text=description_html, type="html")    # re-parse that fragment
description.xpath("//*/ul/li")
This should output your desired list items:
[<Selector xpath='//*/ul/li' data='<li>Ensure the strength ...
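The same flow drops into a spider's parse callback. Here is a minimal sketch; the spider name and the yielded field name are my own assumptions, only the chompjs/html.unescape steps come from the answer above:

import html
import chompjs
import scrapy

class JobSpider(scrapy.Spider):
    name = "job"  # hypothetical name
    start_urls = [
        "https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department"
    ]

    def parse(self, response):
        javascript = response.css("script::text").get()
        data = chompjs.parse_js_object(javascript)
        description = scrapy.Selector(
            text=html.unescape(data["description"]), type="html"
        )
        # yield the text of each bullet point
        for li in description.xpath("//*/ul/li"):
            yield {"bullet": li.xpath("string(.)").get()}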

Unable to locate html tag for scraping

I'm not great with HTML, so I am a bit stumped by this.
I'm trying to scrape the datetime of Instagram posts using Python, and realised that the datetime information isn't within the HTML document of the post. However, I am able to query it using inspect element.
Where is this datetime information located exactly, and how can I obtain it?
The example I took is from this random post: https://www.instagram.com/p/BEtMWWbjoPh/. The element is at the "12h" displayed in the page.
[Update] I am using urllib to grab the URL, and bs4 in Python to scrape. The output did not return anything with datetime. The code is below. I also printed out the entire HTML, and I was surprised that it does not contain datetime in it.
import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup.select('time')          # every <time> element in the fetched HTML
for tag in tags:
    dateT = tag.get('datetime')     # the attribute value is already a string, no getText() needed
    print dateT
In your developer console, type this:
document.getElementsByTagName('time')[0].getAttribute('datetime');
This will return the data you are looking for. The above code is simply looking through the HTML for the tag name time, of which there is only one, then grabbing the datetime property from it.
As for python, check out BeautifulSoup if you haven't already. This library will allow you to do a similar thing in python:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.time['datetime']
Where html_doc is your raw HTML. To obtain the raw HTML, use the requests library.
I think the problem you are experiencing is that urllib.urlopen(url).read() does not execute any JavaScript that is on the page.
Because Instagram is a client-side JavaScript app that uses your browser to render their site, you'll need some sort of browser client to evaluate the JavaScript and then find the element on the page. For this I usually use phantomjs (with the Ruby driver Capybara), but I would assume there is a Python package that works similarly.
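As one Python take on that browser-client idea, here is a minimal sketch using Selenium with a headless Chrome driver; Selenium is my substitution, the answer itself only names phantomjs/Capybara:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless")                  # no visible browser window
driver = webdriver.Chrome(options=opts)
driver.get("https://www.instagram.com/p/BEtMWWbjoPh/")

# Same lookup as the JS console snippet above, driven from Python
time_el = driver.find_element(By.TAG_NAME, "time")
print(time_el.get_attribute("datetime"))
driver.quit()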
HOWEVER, if you execute urllib.urlopen(url).read(), you should see a block of JSON in a script tag that begins with <script type="text/javascript">window._sharedData = {...
That block of JSON includes the data you are looking for. If you evaluate that JSON and parse it, you should be able to access the time data you need.
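A minimal sketch of that idea, reusing the html string fetched above with urllib.urlopen; it assumes the window._sharedData assignment ends with a semicolon, and since the exact key path to the timestamp varies, it only parses the block and prints the top-level keys:

import json
import re

# html is the page source fetched above with urllib.urlopen(url).read()
match = re.search(r'window\._sharedData\s*=\s*(\{.*?\});', html, re.DOTALL)
if match:
    shared_data = json.loads(match.group(1))
    print(shared_data.keys())   # drill into this dict to find the post's timestamp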
That being said, the better way to do this is to use Instagram's API for the crawling. They make all of this data available to developers, so you don't have to crawl an ever-changing webpage.
(Apparently Instagram's API will only return public data for users who have explicitly given your app permission)

Extracting JSON data from html source for use with jsonlite in R

I have a background in data and have just been getting into scraping, so forgive me if my knowledge of web standards and languages is not up to scratch.
I am trying to scrape some data from a JavaScript component of a website I use. Viewing the page source, I can actually see the data I need already there within JavaScript function calls, in JSON format. For example, it looks a little like this:
<script type="text/javascript">
    $(document).ready(function () {
        gameState = 4;
        atView.init("/Data/FieldView/20152220150142207", {"a":[{"co":true,"col":"Red"}],"b":false,...)
        meLine.init([{"c":100,"b":true,...)
</script>
Now, I only need the JSON data in meLine.init. If I physically copy/paste only the JSON data into a file I can then convert that with jsonlite in R and have exactly what I need.
However I don't want to have to copy/paste multiple pages so I need a way of extracting only this data and leaving everything else behind. I originally thought to save the html source code to R, convert to text and try and regex match "meLine.init(", but I'm not really getting anywhere with that. Could anyone offer some help?
Normally I'd use XML and XPath to parse an HTML page, but in this case (since you know the exact structure you're looking for) you might be able to do it directly with a bit of regular expression matching (this is generally not a good idea, as emphasized here). Not sure if this gets you exactly to your goal, but
sub("[ ]+meLine.init\\((.+)\\)", "\\1",
    grep("meLine.init", readLines("file://test.html"), value = TRUE),
    perl = TRUE)
will return the line you're looking for, and then you can work your magic with jsonlite. The idea is to read the page line by line, grep the (hopefully) single line that contains the string meLine.init, and then extract the JSON string from that. Replace file://test.html with the URL you want to use.

Isolating an html element with python

Hi, I'm using Beautiful Soup to parse HTML on Python 3.4, and I can't seem to find the right code to properly display the information inside these HTML tags. I've successfully parsed and extracted info from other sites, but for some reason, when I finish the loop to display content with this code, empty brackets [] appear, as if there were no info.
import requests
from bs4 import BeautifulSoup

web = requests.get('https://www.scutify.com/company.aspx?ticker=AAPL')
Info = web.content
Scutify = BeautifulSoup(Info, 'html.parser')
price = Scutify.find_all('span', {"id": "latest-price"})
print(price)
for item in price:
    print(item.content)
It's because there isn't any content. The prices are dynamically generated by javascript on the page. Requests and BeautifulSoup can't get that data because they don't execute javascript, they just read the code as strings.
That said, you're in luck. Reading the javascript reveals a predictable URL you can use to get all the ticker information in JSON: /service/get-quote.ashx?ticker=
So to get AAPL's info all you do is GET https://www.scutify.com/service/get-quote.ashx?ticker=AAPL
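A minimal sketch of that request, assuming the endpoint still responds with JSON; the answer doesn't show the field names in the response, so the sketch just prints whatever comes back:

import requests

resp = requests.get('https://www.scutify.com/service/get-quote.ashx',
                    params={'ticker': 'AAPL'})
resp.raise_for_status()
quote = resp.json()     # the full ticker information as a Python dict
print(quote)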

How to convert HTML into Prolog

How can I convert HTML into Prolog?
I need to extract the tags from an HTML page and describe them in Prolog.
For example, if my file contains this HTML code:
<title>Prove</title>
<select id="data_nastere_zi" name="data_nastere_zi">
I should get:
title(Prove),
select(id(data_nastere_zi)).
I tried looking at various libraries but couldn't find a way.
Thanks.
You can parse well-formed HTML using SWI-Prolog's library(sgml), in particular load_html/2.
My experience scraping 'real world' websites hasn't been really pleasant, because of insufficient error handling.
Anyway, once you have loaded the page structure, you have library(xpath) available to inspect such complex data.
Edit: getting a table inside a div:
xpath(Page, //div, Div),
xpath(Div, //table, Table)...
SWI-Prolog has a package for SGML/XML parsing based on the SWI-Prolog interface to SP by Anjo Anjewierden: "SWI-Prolog SGML/XML parser".