HTML parsing individual tables/not all data being parsed?

I'm rather new when it comes to Windows Phone 8 development and I've been toying around with a few things as part of the application I'm developing.
Right now I'm trying to parse information from a website such as the RuneScape 07 High Scores - http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima
I'm using HTML Agility Pack and I'm able to parse some data (down to Woodcutting), but nothing past that appears. (Is that down to the size of my ListBox?)
Ideally, I'd like to be able to parse the table information individually rather than in one block like so:
public MainPage()
{
    InitializeComponent();
    HtmlWeb.LoadAsync("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima", DownLoadCompleted);
}
void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
    if (e.Error == null)
    {
        HtmlDocument doc = e.Document;
        if (doc != null)
        {
            var result = doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']");
            foreach (var htmlNode in result)
            {
                lBox.Items.Add(htmlNode.InnerText);
            }
        }
    }
}
But if I try and access an individual table such as this one using
var result = doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']/table/tbody/tr[5]/td[2]");
I get a NullReferenceException.
Is this possible or am I doing something exceptionally wrong?

You probably relied on a developer tool such as Firebug or Chrome's dev tools to determine the XPath for the nodes you're after.
You can't really do this, because the XPath given by such tools corresponds to the in-memory HTML DOM, while the Html Agility Pack only knows about the raw HTML sent back by the server.
What you need to do is look at what's actually sent back (or just do a view source). You'll see there is no TBODY element, for example, so SelectNodes finds nothing and returns null. Instead, find something that discriminates the nodes you want, and use XPath axes, for example.
Here is some code that seems to work:
// get all TD nodes with ALIGN attribute set to 'left'
// (SelectNodes returns null when nothing matches, so check before looping in real code)
foreach (var node in doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']//td[@align='left']"))
{
    // assumes a list control whose Items.Add returns an item with SubItems (e.g. a WinForms ListView)
    var item = lBox.Items.Add(node.InnerText.Trim());
    // use an XPath axis: get all following sibling TD nodes with ALIGN attribute set to 'right'
    foreach (var sibling in node.SelectNodes("following-sibling::td[@align='right']"))
    {
        item.SubItems.Add(sibling.InnerText.Trim());
    }
}
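To make the TBODY point concrete, here is a minimal sketch in JavaScript (not part of the answer above). The helper name stripTbodySteps is hypothetical; it only removes the tbody steps that browsers add to the in-memory DOM. It assumes the rest of the copied path matches the raw HTML, which you should still confirm with view source.

```javascript
// Hypothetical helper: remove "tbody" steps from an XPath copied out of browser dev tools.
// Browsers insert <tbody> into the in-memory DOM even when the served HTML has none.
function stripTbodySteps(xpath) {
  // Drop any "/tbody" or "/tbody[n]" location step; all other steps are left untouched.
  return xpath.replace(/\/tbody(\[\d+\])?(?=\/|$)/g, "");
}

const fromDevTools = "//div[@id='contentHiscores']/table/tbody/tr[5]/td[2]";
const forRawHtml = stripTbodySteps(fromDevTools);
console.log(forRawHtml);
```

The resulting string is what you would try passing to SelectNodes, still checking the return value for null.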

Related

how to position elements?

How can I make the following image and paragraph tag sit next to each other (image on the left, p tag on the right), like inline-block elements?
I used the span tag because it's inline, but I still couldn't figure it out.

The response object is an object that contains all the methods for manipulating the outgoing response and for queuing up data that will be part of that response (when it is finally sent). The specific method you asked about:
res.setHeader(name, value)
is one such method for preparing the outgoing response and is documented here. It allows you to configure a header on that response. It will store that header inside the res object and then when the response is finally sent out over the network, this header item will be streamed as part of the outgoing http headers.
The Express library adds a different variation of this method with:
res.set(field, value)
or
res.header(field, value)
These are both identical in the code.
Internally, both of these just add a little bit of extra processing before eventually calling the underlying res.setHeader() from the regular http library. You can use any one of them. The Express version allows you to call res.set(obj) where obj is a set of key/value pairs that are turned into headers.
You can see the code for Express' res.set() here and see how it eventually calls the underlying res.setHeader().
res.set =
res.header = function header(field, val) {
  if (arguments.length === 2) {
    var value = Array.isArray(val)
      ? val.map(String)
      : String(val);
    // add charset to content-type
    if (field.toLowerCase() === 'content-type') {
      if (Array.isArray(value)) {
        throw new TypeError('Content-Type cannot be set to an Array');
      }
      if (!charsetRegExp.test(value)) {
        var charset = mime.charsets.lookup(value.split(';')[0]);
        if (charset) value += '; charset=' + charset.toLowerCase();
      }
    }
    this.setHeader(field, value);
  } else {
    for (var key in field) {
      this.set(key, field[key]);
    }
  }
  return this;
};
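As a rough illustration of the two call forms described above, here is a small self-contained sketch. It is not Express itself: createFakeResponse and setHeaders are made-up names, the fake response models only setHeader, and the charset handling from the real code is left out.

```javascript
// Minimal stand-in for a response object; only setHeader is modeled (an assumption for the demo).
function createFakeResponse() {
  const headers = {};
  return {
    headers,
    setHeader(name, value) {
      headers[name.toLowerCase()] = value;
    },
  };
}

// Sketch of the two call forms: (field, value) and (object of key/value pairs).
function setHeaders(res, field, val) {
  if (typeof field === "string") {
    // Coerce the value (or each array entry) to a string before delegating.
    const value = Array.isArray(val) ? val.map(String) : String(val);
    res.setHeader(field, value);
  } else {
    // Object form: recurse once per key so every header takes the same path.
    for (const key of Object.keys(field)) {
      setHeaders(res, key, field[key]);
    }
  }
  return res;
}

const res = createFakeResponse();
setHeaders(res, "X-Request-Id", 42);
setHeaders(res, { "Cache-Control": "no-store", "Set-Cookie": ["a=1", "b=2"] });
console.log(res.headers);
```

Either way, every header ends up going through the single underlying setHeader call, which is the point the answer makes about res.set() and res.header().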

How to access html element text in chrome devtools console

I have a page with a whole bunch of blockquote tags.
In the dev console I type document.getElementsByTagName("blockquote"), which gives me an array.
But if I do
document.getElementsByTagName("blockquote").innerText
document.getElementsByTagName("blockquote").innerHTML
document.getElementsByTagName("blockquote").textContent
document.getElementsByTagName("blockquote").outerText
document.getElementsByTagName("blockquote").outerHTML
they all return undefined.
However, if I inspect the elements of the array document.getElementsByTagName("blockquote"), I can see all of the above properties in place.
How do I access at least one of them (innerText, innerHTML, textContent, outerText, outerHTML)?

Or, if you want to access a specific element, you can index into the array:
for (var i = 0; i < document.getElementsByTagName("blockquote").length; i++) {
    var singleElement = document.getElementsByTagName("blockquote")[i];
    console.log(singleElement.innerHTML);
}

You need to iterate the array in order to access those properties. Something like this will work for them:
var elements = document.getElementsByTagName("blockquote");
for (var prop in elements)
{
    if (elements.hasOwnProperty(prop)) {
        console.log(elements[prop].innerHTML);
    }
}

You can also try the following:
var elements = document.getElementsByTagName("blockquote");
This returns the list of elements. To access the text of the element at index i, use
elements[i].textContent
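Here is a small sketch of why the properties come back undefined on the collection but not on its items. So that it runs outside a browser, a plain array-like object stands in for the collection that getElementsByTagName returns; that stand-in and the collectText name are assumptions made for the example.

```javascript
// Array-like stand-in for the collection returned by getElementsByTagName (assumption for the demo).
const blockquotes = {
  0: { textContent: "First quote" },
  1: { textContent: "Second quote" },
  length: 2,
};

// The collection itself has no textContent; only its items do.
console.log(blockquotes.textContent); // undefined

// Array.from accepts any array-like object and maps each item in one pass.
function collectText(collection) {
  return Array.from(collection, (el) => el.textContent);
}

const texts = collectText(blockquotes);
console.log(texts);
```

In the dev console, the same function should work when called as collectText(document.getElementsByTagName("blockquote")).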

Parsing html page content without using selector

I am going to parse some web pages using a Java program. For this purpose I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that you need an operator to find the right XPath for you (for example, using the FirePath Firefox add-on).
Suppose you don't know in advance which pages you will have to parse, or the number of sites gets too big for an operator to find the right XPath for each one. In that case you need a way to parse pages without using any selector (the same scenario exists for CSS selectors), or there should be a way to find the XPath automatically. What is the method for parsing web pages in this way?
Here is the small code which I wrote for this purpose, please feel free to extend that in presenting your solutions.
public void downloadHTML(String url) throws IOException {
    CleanerProperties props = new CleanerProperties();
    // set some properties to non-default values
    props.setTranslateSpecialEntities(true);
    props.setTransResCharsToNCR(true);
    props.setOmitComments(true);
    // do parsing
    TagNode tagNode = new HtmlCleaner(props).clean(
        new URL(url)
    );
    // serialize to xml file
    new PrettyXmlSerializer(props).writeToFile(
        tagNode, "c:\\TEMP\\clean.xml", "utf-8"
    );
}
public static void testJavaxXpath(String pattern)
        throws ParserConfigurationException, SAXException, IOException,
        FileNotFoundException, XPathExpressionException {
    DocumentBuilder b = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder();
    org.w3c.dom.Document doc = b.parse(new FileInputStream(
            "c:\\TEMP\\clean.xml"));
    // Evaluate XPath against Document itself
    javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(pattern,
            doc.getDocumentElement(), XPathConstants.NODESET);
    for (int i = 0; i < nodes.getLength(); ++i) {
        Element e = (Element) nodes.item(i);
        System.out.println(e.getFirstChild().getTextContent());
    }
}

Accessing the composed DOM for a polymer element

I'm trying to access the browser-rendered DOM for a polymer element, without caring in the slightest which bits come from the "light" or "shadow" DOM as described in http://www.polymer-project.org/platform/shadow-dom.html, but I cannot find any API documentation on access functions that would let me do this.
I can see the .shadowRoots list that is on polymer elements, and I can see the regular .children property, but neither of these is particularly useful to get the content that is actually being shown on the page.
What is the correct way to get the DOM fragment that is currently visible in the browser? (or if someone knows where this is documented on the polymer site, I'd love to know that too. Google couldn't find it for me)

The general idea is to only consider your virtual DOM (and even better, only consider local DOM [divide and conquer ftw]).
Therefore, today, there is no (easy) way to examine the composed DOM.
There will be such a way eventually for certain specialized uses, but I encourage you to try to adopt a mental model that allows you to avoid asking this question.

Here is something I wrote for the same purpose:
function getComposedDOMChildren (root) {
  if (root.shadowRoot) {
    root = root.shadowRoot;
  }
  var children = [];
  for (var i = 0; i < root.children.length; i++) {
    if (root.children[i].tagName === 'CONTENT') {
      children.push.apply(children, root.children[i].getDistributedNodes());
    } else if (root.children[i].tagName === 'SHADOW') {
      var shadowRoot = root;
      while (!shadowRoot.host) {
        shadowRoot = shadowRoot.parentNode;
      }
      children.push.apply(children, getComposedDOMChildren(shadowRoot.olderShadowRoot));
    } else {
      children.push(root.children[i]);
    }
  }
  return children;
}
It returns an array with the children of the root as per the composed DOM.
It works by going into the shadow root and replacing all <content> elements with what is actually rendered in them and recursing with the earlier shadow root when <shadow> elements are found.
Of course, to get the entire DOM tree, one has to iterate through the returned array and invoke getComposedDOMChildren on each element.
Something like:
function traverse (root) {
  var children = getComposedDOMChildren(root);
  for (var i = 0; i < children.length; i++) {
    traverse(children[i]);
  }
}
Please correct me if this isn't right.
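The traverse function above visits every node but does nothing with it. A possible variant, sketched here, takes the child lookup and a visitor callback as parameters. The names walk, getChildren and visit are made up for the example, and a plain object tree stands in for real elements so the snippet runs without a browser.

```javascript
// Depth-first walk: getChildren supplies each node's children, visit is called once per node.
function walk(node, getChildren, visit, depth = 0) {
  visit(node, depth);
  for (const child of getChildren(node)) {
    walk(child, getChildren, visit, depth + 1);
  }
}

// Plain-object tree standing in for a composed DOM (assumption for the demo).
const tree = {
  name: "host",
  kids: [
    { name: "header", kids: [] },
    { name: "section", kids: [{ name: "p", kids: [] }] },
  ],
};

const seen = [];
walk(tree, (n) => n.kids, (n, depth) => seen.push("  ".repeat(depth) + n.name));
console.log(seen.join("\n"));
```

In the browser you would presumably pass getComposedDOMChildren as the getChildren argument and do whatever you need with each element in visit.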

Remove attributes using HtmlAgilityPack

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack.
Here's my code:
var elements = htmlDoc.DocumentNode.SelectNodes("//*");
if (elements != null)
{
    foreach (var element in elements)
    {
        element.Attributes.Remove("style");
    }
}
However, I'm not getting it to stick? If I look at the element object immediately after Remove("style"). I can see that the style attribute has been removed, but it still appears in the DocumentNode object. :/
I'm feeling a bit stupid, but it seems off to me? Anyone done this using HtmlAgilityPack? Thanks!
Update
I changed my code to the following, and it works properly:
public static void RemoveStyleAttributes(this HtmlDocument html)
{
    var elementsWithStyleAttribute = html.DocumentNode.SelectNodes("//@style");
    if (elementsWithStyleAttribute != null)
    {
        foreach (var element in elementsWithStyleAttribute)
        {
            element.Attributes["style"].Remove();
        }
    }
}

Your code snippet seems to be correct: it removes the attributes. The thing is, DocumentNode.InnerHtml (I assume this is the property you monitored) is a complex property; it may only get updated under certain circumstances, and you shouldn't rely on it to get the document as a string. Use the HtmlDocument.Save method for this instead:
string result = null;
using (StringWriter writer = new StringWriter())
{
    htmlDoc.Save(writer);
    result = writer.ToString();
}
Now the result variable holds the string representation of your document.
One more thing: your code may be improved by changing your expression to "//*[@style]", which gets you only the elements with a style attribute.

Here is a very simple solution:
VB.NET
element.Attributes.Remove(element.Attributes("style"))
C#
element.Attributes.Remove(element.Attributes["style"])
