Remove attributes using HtmlAgilityPack

Remove attributes using HtmlAgilityPack - html

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack.
Here's my code:
var elements = htmlDoc.DocumentNode.SelectNodes("//*");
if (elements!=null)
{
foreach (var element in elements)
{
element.Attributes.Remove("style");
}
}
However, I'm not getting it to stick? If I look at the element object immediately after Remove("style"). I can see that the style attribute has been removed, but it still appears in the DocumentNode object. :/
I'm feeling a bit stupid, but it seems off to me? Anyone done this using HtmlAgilityPack? Thanks!
Update
I changed my code to the following, and it works properly:
public static void RemoveStyleAttributes(this HtmlDocument html)
{
var elementsWithStyleAttribute = html.DocumentNode.SelectNodes("//#style");
if (elementsWithStyleAttribute!=null)
{
foreach (var element in elementsWithStyleAttribute)
{
element.Attributes["style"].Remove();
}
}
}

Your code snippet seems to be correct - it removes the attributes. The thing is, DocumentNode .InnerHtml(I assume you monitored this property) is a complex property, maybe it get updated after some unknown circumstances and you actually shouldn't use this property to get the document as a string. Instead of it HtmlDocument.Save method for this:
string result = null;
using (StringWriter writer = new StringWriter())
{
htmlDoc.Save(writer);
result = writer.ToString();
}
now result variable holds the string representation of your document.
One more thing: your code may be improved by changing your expression to "//*[#style]" which gets you only elements with style attribute.

Here is a very simple solution
VB.net
element.Attributes.Remove(element.Attributes("style"))
c#
element.Attributes.Remove(element.Attributes["style"])

Related

How can I change the html img element source witch dart?

I have this img element in my HTML project:
<img id="themeToggle" src="./images/moon.svg">
and i want to change the source of this element to be "./images/sun.svg".
I tried with:
void main() {
var themeToggleButton = querySelector('#themeToggle');
themeToggleButton?.onClick.listen((event) {
themeToggleButton.dataset['src'] = './images/sun.svg';
});
}
since the .dataset attribute is the only one that lets you to access the selected element's attributes, but it does not work. Any suggestion, please?

You should be able to use the .attributes getter to get the attributes Map and set there the value:
themeToggleButton?.attributes['src'] = './images/sun.svg'
Note: .dataset (as specified in the docs) is used only for the element properties that start with data-

Parsing html page content without using selector

I am going to parse some web pages using Java program. For this purpose I wrote a small code for parsing page content by using xpath as selector. For parsing different sites you need to find the appropriate xpath per each site. The problem is for doing that you need an operator to find the write xpath for you. (for example using firepath firefox addon) Suppose you dont know what page you should parse or the number of sites get really big for operator to find right xpath. In this case you need a way for parsing pages without using any selector. (same scenario exist for CSS selector) Or there should be a way to find xpath automatically! I was wondering what is the method of parsing web pages in this way?
Here is the small code which I wrote for this purpose, please feel free to extend that in presenting your solutions.
public downloadHTML(String url) throws IOException{
CleanerProperties props = new CleanerProperties();
// set some properties to non-default values
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);
// do parsing
TagNode tagNode = new HtmlCleaner(props).clean(
new URL(url)
);
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
tagNode, "c:\\TEMP\\clean.xml", "utf-8"
);
}
public static void testJavaxXpath(String pattern)
throws ParserConfigurationException, SAXException, IOException,
FileNotFoundException, XPathExpressionException {
DocumentBuilder b = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream(
"c:\\TEMP\\clean.xml"));
// Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate(pattern,
doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getFirstChild().getTextContent());
}
}

Forced to use template?

This code from Dart worries me:
bool get isTemplate => tagName == 'TEMPLATE' || _isAttributeTemplate;
void _ensureTemplate() {
if (!isTemplate) {
throw new UnsupportedError('$this is not a template.');
}
...
Does this mean that the only way I can modify my document is to make it html5?
What if I want to modify an html4 document and set innerHtml in a div, how do I achieve this?

I am assuming you are asking about the code in dart:html Element?
The method you are referring to is only called by the library itself, and only in methods where isTemplate has to be true, for example this one. If you follow this link, you can also read what other fields/methods work like this.
innerHtml is a field in every subclass of Element which supports it, for example DivElement
Example:
DivElement myDiv1 = new DivElement();
myDiv1.innerHtml = "<p>I am a DIV!</p>";
query("#some_div_id").innerHtml = "<p>Hey, me too!</p>";

code tag and pre css in html not functioning properly

in html i am using the code tag as below and also i am using the css as shown below :-
<style type="text/css">
code { white-space: pre; }
</style>
<code>
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>
When i use the code tag as shown above some part of the code is missing in the html page when viewed. when view the above html page the output is as shown below:-
public static ArrayList getFiles(File[] files){
ArrayList _files = new ArrayList();
for (int i=0; i fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
In the first method some part is missing and the second method is not appearing at all. what is the problem and how to fix it?

You have these <File> inside your <code> tag, you need to convert them to < and > html entities
Demo
<code>
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</code>

As already identified by Mr. Alien, you have characters being interpreted as markup inside your <code> block.
As an alternative to escaping lots of characters, providing your code does not include the string </script, you can exploit the parsing and (non)execution behaviour of the <script> element like this:
<code>
<script type="text/x-code">
public static ArrayList<File> getFiles(File[] files){
ArrayList<File> _files = new ArrayList<File>();
for (int i=0; i<files.length; i++)
if (files[i].isDirectory())
_files.addAll(getFiles(new File(files[i].toString()).listFiles()));
else
_files.add(files[i]);
return _files;
}
public static File[] getAllFiles(File[] files) {
ArrayList<File> fs = getFiles(files);
return (File[]) fs.toArray(new File[fs.size()]);
}
</script>
</code>
with this CSS:
script[type=text\/x-code] {
display: block;
white-space: pre;
line-height: 20px;
margin-top: -20px;
}
See JSfiddle: http://jsfiddle.net/fZuPm/3/
Update: In the comments, RoToRa raises some interesting points about the "correctness" of this approach, and I thank RoToRa for them.
Using a type attribute to stop the contents of a script tag from being executed as JavaScript is a well understood technique, and although the list of type names that cause script to be executed varies from browser to browser, finding one that won't cause execution is not hard.
More interesting is the question of the semantics. It is my view that the semantics of the script element are essentially inert, like a div or span element, while RoToRa's view is that it affects the semantics of the content. Looking at the specs, it is not easy to resolve. HTML 4.01 says very little about the semantics of the script element, concentrating solely on its functionality.
The HTML5 spec is not much better, but it does say "The element does not represent content for the user.". I don't know what to make of that. Saying what an element doesn't do is not very helpful. If it implies that its contents are semantically "hidden" from the user, such that the its contents are not semantically part of contents of the containing code element, then this technique should not be used.
If, however, it means that no new semantics are introduced by the script element, then there doesn't appear to be any problem.
I can't find any evidence of a script element being semantically required to contain script, as RoToRa suggests, and while it might be considered common-sense to infer that, that's not how HTML semantics works.
In many ways, this approach is really about trying to find a way to do validly what the XMP element does in browsers anyway, but is not valid. XMP was very nearly made valid in HTML5 but just missed out. The editor described it as a tough call. Using the script element like this meets that requirement, but it seems nevertheless to be controversial. If you are uncomfortable with whatever semantics you feel are being applied is this approach, I would suggest that you don't use it.

HTML parsing individual tables/not all data being parsed?

I'm rather new when it comes to Windows Phone 8 development and I've been toying around with a few things as part of the application I'm developing.
Right now I'm trying to parse information from a website such as the RuneScape 07 High Scores - http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima
I'm using HTML Agility Pack and I'm able to parse some data (down to Woodcutting), but anything passed that doesn't appear? (Is that down to the size of my ListBox?)
Ideally, I'd like to be able to parse the table information individually rather than in one block like so:
public MainPage()
{
InitializeComponent();
HtmlWeb.LoadAsync("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima", DownLoadCompleted);
}
void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
if(e.Error == null)
{
HtmlDocument doc = e.Document;
if (doc != null)
{
var result = doc.DocumentNode.SelectNodes("//div[#id='contentHiscores']");
foreach (var htmlNode in result)
{
lBox.Items.Add(htmlNode.InnerText);
}
}
}
But if I try and access an individual table such as this one using
var result = doc.DocumentNode.SelectNodes("//div[#id='contentHiscores']/table/tbody/tr[5]/td[2]");
I get a NullReferenceException.
Is this possible or am I doing something exceptionally wrong?

You probably relied on a developper tools such as FireBug or Chrome, etc... to determine the XPATH for the nodes you're after.
You can' really do this as the XPATH given by such tools correspond to the in memory HTML DOM while the Html Agility Pack only knows about the raw HTML sent back by the server.
What you need to do is look at what's sent back (or just do a view source). You'll see there is no TBODY element for example. So you want to find anything discriminant, and use XPATH axes for example.
Here is a code that seems to work:
// get all TD nodes with ALIGN attribute set to left
foreach (var node in doc.DocumentNode.SelectNodes("//div[#id='contentHiscores']//td[#align='left']"))
{
var item = lBox.Items.Add(node.InnerText.Trim());
// use an 'XPATH axe': get all sibling TD nodes with ALIGN attribute set to 'right'
foreach (var sibling in node.SelectNodes("following-sibling::td[#align='right']"))
{
item.SubItems.Add(sibling.InnerText.Trim());
}
}

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Remove attributes using HtmlAgilityPack - html

Here is a very simple solution VB.net element.Attributes.Remove(element.Attributes("style")) c# element.Attributes.Remove(element.Attributes["style"])

Related

How can I change the html img element source witch dart?

Parsing html page content without using selector

Forced to use template?

code tag and pre css in html not functioning properly

HTML parsing individual tables/not all data being parsed?

Categories

Resources