I have an ANT configuration file which is becoming complicated, and now I'm stuck with an issue. One of the tasks retrieves a page from a website and saves it to a file. I need to load such file and extract from it the href attribute of a specific element. HTML is reasonably well formed, but I can't guarantee it.
I was thinking of a RegEx, but the element's attributes are not guaranteed to always appear in the same order (e.g. its class name, or id). Besides, I haven't found out how to just return the value of the href attribute, without the attribute itself.
I'm trying to limit the amount of addons to be added to ANT, therefore a "self-contained" solution would be welcome. Thanks.
I'm not sure how you're going to find the specific HTML element that has the href you're looking for (I'd assume by checking an id attribute, but you did not say so). I put together this chain of regex's to filter the HTML down to candidate anchor tags and then ultimately strip out just the href's. I used the source of this page as my sample input and since I couldn't find any id attributes associated to anchors (that also had hrefs), I filtered down to anchors with the class="question-hyperlink" -- I'm hopeful this could be a good starting point for you (and note: as you stipulated, it does not contain any dependencies on additional modules, etc, regardless of how easy they are to install):
<?xml version="1.0" encoding="UTF-8"?>
<project name="Test Html attribute" default="test" basedir=".">
<target name="test">
<loadfile srcFile="ant.htm" property="html">
<filterchain>
<linecontainsregexp>
<regexp pattern="<a.*href[^>]*>"/>
<regexp pattern="<a.*class=["']question-hyperlink["'][^>]*>"/>
</linecontainsregexp>
<tokenfilter>
<replaceregex pattern=".*<a.*href=["']?([^>"']*).*>[^<]*" replace="\1" flags="gi"/>
</tokenfilter>
</filterchain>
</loadfile>
<echo>${html}</echo>
</target>
</project>
Related
In my AEM project, we have client-side dynamic variable functionality which checks for any strings that are formed inside of a ${ } wrapper. The dynamic variable values are coming from our cookies. Replacing this with a more friendly format that does not conflict with Sightly is not an option at the moment, so please don't tell me to do that :)
When creating an anchor tag in the source editor of the Text core component, I am setting the href as the following: href="/content/en/opt-in.html?hash=${/profile/hash}". The anti-Samy configuration is blocking the href attribute from being rendered on this element, but I have tried to add the following to the overlayed file /apps/cq/xssprotection/config.xml:
<regexp name="expressionURLWithSpecialCharacters" value="(\$\{(\w|\/|:)+\})"/>
<regexp-list>
<regexp name="onsiteURL"/>
<regexp name="offsiteURL"/>
<regexp name="expressionURL"/>
<regexp name="expressionURLWithSpecialCharacters"/>
</regexp-list>
^ inside of the <attribute name="href"> block of common-attributes. Is there something else I need to do in order to make this not be filtered out so that it can be correctly parsed by the global variable replacement? Thanks!
There are two issues here:
The RTE will encode your URL and turn hash=${/profile/hash} into hash=$%7B/profile/hash%7D when storing into JCR
Even if you pass 1, the expression you are trying to use will only match EXACTLY the URL of ${/profile/hash}. You would need to expand the expression to include everything else (scheme, domain/host, path, query etc.). Think onsiteURL and offsiteURL but allowing your expression as well in query parameters. Have a look at https://github.com/apache/sling-org-apache-sling-xss/blob/master/src/main/java/org/apache/sling/xss/impl/XSSFilterImpl.java#L115 to get a starting point.
Have you tried adding disableXSSFiltering="{Boolean}true”?
Vlad, your second point was helpful in that I hadn't considered that one of the regular expressions in the XSS Protection configuration href attribute block needed to match the ${/profile/hash} in addition to the rest of the URL preceding and following it. Although to your first point, the RTE actually did save the special characters as-is into the JCR and did not encode them, probably since I was using the source editor mode and not the inline text editor.
What I ended up doing was creating a new regular expression as follows:
<regexp name="onsiteURLWithVariableExpression"
value="(?!\s*javascript(?::|:))(?:(?://(?:(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~])|(?:%\p{XDigit}\p{XDigit})|(?:[!$&'()*+,;=]))*#)?(?:\[(?:(?:(?:\p{XDigit}{1,4}:){6}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:::(?:\p{XDigit}{1,4}:){5}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:\p{XDigit}{1,4}){0,1}::(?:\p{XDigit}{1,4}:){4}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,1}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){3}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,2}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){2}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,3}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){1}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,4}\p{XDigit}{1,4})?::(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,5}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}))|(?:(?:(?:\p{XDigit}{1,4}:){0,6}\p{XDigit}{1,4})?::))]|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])|(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~])*|(?:%\p{XDigit}\p{XDigit})*|(?:[!$&'()*+,;=])*))(?::\p{Digit}+)?(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#)+/?)*))|(?:/(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#)+(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#)+/?)*))?)|(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#)+(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#)+)*)))?(?:\?(?:(?:\p{L}\p{M}*)|(\$\{(\w|\/|:)+\})|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#|/|\?)*)?(?:#(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&'()*+,;=]|:|#|/|\?)*)?"/>
which is just the onsiteURL with my original expressionURLWithSpecialCharacters: (\$\{(\w|\/|:)+\}) value added as a group in the query string parameter section. This enabled AEM to accept this as an href value in my anchor tag.
I appreciate everyone's help!
I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttributes('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.
I have this XSLT document that has a file name. However for archiving purposes we want the file name to be displayed somewhere else within the code.
Now I used to do this like this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
sample-xml=".\DOCUMENTNAME.xml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
However we want to get rid of this solution due to new working methods. Now I was wondering if I could place this (literally just the word DOCUMENTNAME) somewhere within the XSL or within the HTML wrapped within it, in a way that it is not visible.
We add this code to a database, through a validator that looks for the documentname and checks for a match. And as only the contents of the code is placed on the database its easier to check back from the database what documentname was uploaded. However this documentname should not be visible in an HTML output.
Maybe add a processing instruction or comment somewhere in the XSLT.
Like:
<?DOCUMENTNAME?>
or:
<!--DOCUMENTNAME-->
They're not invisible, but they definitely won't be included in the output.
Not visible to whom?
Your current solution (with sample-xml) is not valid XSLT: unknown attributes on an XSLT element should be rejected by the processor unless they are in a namespace of their own.
The cleanest way to do this is probably a top-level element (a child of xsl:stylesheet) in a private namespace:
<my:document-name xmlns:my="http://my.company.com/ns/document-id">document.xml</my:document-name>
I don't know if that meets your criteria of being "not visible". It certainly wouldn't be visible in the output of the stylesheet.
So I've been working on some XSLT to modify YouTube's RSS XML, and of course as soon as I got it working, they've changed their formatting. Before, each video's unique ID was stored between <videoid> tags, which you could then use to create a URL. But now the only way to get a video's URL is from a tag like this
<media:player url='https://www.youtube.com/watch?v=XXXXXXXXXXX&feature=youtube_gdata_player'/>`
which is contained within <media:group> tags.
The way I've been trying to get at it is
<xsl:value-of select="media:group/media:player#url" />
but doing that gives me a compilation error that says
xsl:value-of : could not compile select expression 'media:group/media:player#url'
Does anyone see anything wrong with that?
Also, as a side note, I want to do something similar with
<xsl:value-of select="media:group/media:thumbnail#url" />
however there are several <media:thumbnail> tags for each entry; would this just grab the first one, or would this potentially cause errors?
you are missing a / in your XPath. Try:
<xsl:value-of select="media:group/media:player/#url" />
As far as DOM API is concerned, attributes belong to an element. The XPath goes about it in a slightly different way. Child nodes live on the child:: axes (which is the default so you rarely see it used explicitly) and attributes live on the atribute:: axes (you can get there with the abbreviated #). When constructing an XPath expression you are basically building a sequence of location steps separated by /. An attribute is "one location step" away from the element owning it.
To the second part of your question. The selector will create a sequence (think node list) of all nodes that match the expression and will do what the xsl: instruction prescribes to do on that sequence. In your case (xsl:value-of), all #url of all matching media:thumbnail will be put together in a node-set that will be further converted to a string:
A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order
So you will get the value of the "first" one. That said, I would argue that running xsl:value-of on a sequence of more than one node is not really intuitive (even though the spec clearly says what it will do) so you would do a favor to someone reading the code after you if you be more specific with your selectors. Something like: media:group/media:thumbnail[1]/#url
I am currently studying ways to present transformed xml files in browsers. My experience with this is minimal, so a number of questions pop up.
I have a transformation test.xslt which transforms input xml to html, and an input file test.xml containing
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<root>...</root>
which, when opened in IE9, neatly displays the transformed xml contained above in the root element.
Question 1
Is there a processing instruction or similar available to include the source xml into the xml to be opened, somewhat like the following:
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<... instruction to include source file data.xml>
Question 2
The file opened has extension xml. Is there a way to change file contents so it is valid html, allowing the file to be saved with extension html, so that when opened, the default browser will be selected (simply changing extension to html obviously does not have the desired effect so some structural change is necessary) ?
Question 3
My goal is to query a db to get the data to be parsed by the xslt code. What is the best way to do this (no problem if this includes javascript)?
Question 4
Standard db utilities may export query results in attribute-centered fashion (column names and values being represented as attribute names and values). This may involve pre-parsing the xml from db in order to convert it to parent-child fashion (columns as children instead of attributes). What is the best way to do this pre-parsing (note: I already have the xslt for this; I wonder about the data flow and when/how to run two xslt's in sequence) and then apply test.xslt (preferably without saving intermediate xml result files on the server)?
Question 5
When I open above xml in IE9, this works fine as said. But opening it in Firefox errors (RTF issue, apparently I need to use Firefox's node-set function but I still have to discover which namespace that has), and Opera/Chrome/Safari do not show any content. What exactly are the prerequisites for the various browsers where can I find more information on this?
Q1 If you start by serving an html file which then accesses the xml and xslt via javascript it naturally has access to both the input and the output of the xslt. If you are serving the xml and initiating the transformation using xml-stylesheet pi, then perhaps the best thing to do (depending on what you want to do) is to stuff the original source into the output, then javascript in the generated page can access it if needed, eg
<xsl:template matcj="whatever">
<html>
<head>
<script id="source" type="x-xml-spurce">
<xsl:copy-of select="/"/>
</script>
.... whatever you were going to do
then if you need to access the source in response to a user action on the page, a script can retrieve the script with id source and do whatever is needed. (If there is a possibility of the the source including the string you have to code it a bit more defensively).
Q2 If you want to use the xml-stylesheet API then you have to serve it as xml. However you can instead just serve html and then access the xml and xslt from within a script in the html page using the browsers javascip xslt api. as noted above that is more flexible than the xml-stylesheet mechanism.
Q3 pass
Q4 If you are accessing the xslt from javascript then it is easy to chain the output of one to the input of another without writing back to the server as you just have access to the result as a DOM node (or string, depending)
Answer to question 5: Firefox/Mozilla, Opera, Safari, Chrome all support the EXSLT node-set extension function in the namespace http://exslt.org/common, for IE and MSXML you can use script (imported) inside the XSLT stylesheet to allow it to support that namespace too, see http://dpcarlisle.blogspot.de/2007/05/exslt-node-set-function.html. That way inside the main stylesheet where you need to use the node-set function you don't need to write different code to cater for the different namespaces.