XQuery on doc collection with multiple prefixes that map to the same namespace

I am using XQuery (Berkeley dbXML 6.0) to query documents from a collection. All documents use the same namespace, but the prefixes differ. To abstract the problem:
doc1: <a xmlns:ns0="http://my.url"> here i am </a>
doc2: <a xmlns:ns1="http://my.url"> me too! </a>
Both ns0 and ns1 map to the same namespace. I would like to avoid returning different namespace prefixes in the XQuery result. A simple XQuery such as:
<Result xmlns:ns2="http://my.url"> {
for $doc in collection("my_collection")/ns2:a
return <Return> {$doc} </Return>
} </Result>
shows the ns0 and ns1 prefixes for my documents 1 and 2. As they all map to the same namespace, I would have thought that the only namespace I should have seen was in the enclosing result document. The namespace prefixes are creating problems for downstream processing. I can remove these manually, but it would be nice if there was a way to construct this correctly in XQuery.

XQuery can't automatically change the namespace prefix because it can't be sure that it's unused. For example, if there's an attribute xsi:type='my:part-number', then it doesn't know that my is a namespace prefix, because it appears in an attribute value rather than in an element or attribute name. You're going to have to do a much more thorough rebuilding of the document to achieve this. (Personally, I would use XSLT for this.)

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script I'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttribute('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
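For comparison, the same "walk the DOM, then read the attribute" idea can be sketched without the Windows-only .ParsedHTML property. The following is an illustration in Python using only the standard library, not PowerShell; the class name ZoomCollector is hypothetical, and the markup is the sample data from the question.

```python
from html.parser import HTMLParser


class ZoomCollector(HTMLParser):
    """Collect data-imagezoom values of <img> elements inside <a class="aaImg">."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True while inside an <a class="aaImg"> element
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "aaImg" in (attrs.get("class") or "").split():
            self.in_link = True
        elif tag == "img" and self.in_link and attrs.get("data-imagezoom"):
            self.urls.append(attrs["data-imagezoom"])

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False


sample = """
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
"""

collector = ZoomCollector()
collector.feed(sample)
print(collector.urls)
```

Because the parser handles attribute quoting itself, this keeps working if the page switches between double and single quotes around attribute values.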
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
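To see how the pattern behaves, here is a small sketch that runs it through Python's re module (an assumption for illustration; the original was used from PowerShell). It checks the pattern against a shortened version of the sample markup and against a variant that uses single quotes around attribute values.

```python
import re

# The pattern from above: lookbehind for the attribute name and opening quote,
# then a lazy match up to a closing quote that is followed by whitespace.
PATTERN = re.compile(r'(?<=data-imagezoom=").*?(?="\s)')

double_quoted = (
    '<img class="aaTmb" '
    'data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" '
    'data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">'
)
# Same markup, but with single quotes around every attribute value.
single_quoted = double_quoted.replace('"', "'")

match = PATTERN.search(double_quoted)
zoom_url = match.group(0) if match else None
missed = PATTERN.search(single_quoted)

print(zoom_url)
print(missed)
```

The pattern finds the URL in the double-quoted markup but returns no match at all for the single-quoted variant, which is the kind of brittleness that makes a real HTML parser the safer choice.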
Thanks all.

How best open xml, parse with xslt and show result in browser

I am currently studying ways to present transformed xml files in browsers. My experience with this is minimal, so a number of questions pop up.
I have a transformation test.xslt which transforms input xml to html, and an input file test.xml containing
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<root>...</root>
which, when opened in IE9, neatly displays the transformed xml contained above in the root element.
Question 1
Is there a processing instruction or similar available to include the source xml into the xml to be opened, somewhat like the following:
<?xml version="1.0" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="test.xslt" ?>
<... instruction to include source file data.xml>
Question 2
The file opened has extension xml. Is there a way to change file contents so it is valid html, allowing the file to be saved with extension html, so that when opened, the default browser will be selected (simply changing extension to html obviously does not have the desired effect so some structural change is necessary) ?
Question 3
My goal is to query a db to get the data to be parsed by the xslt code. What is the best way to do this (no problem if this includes javascript)?
Question 4
Standard db utilities may export query results in attribute-centered fashion (column names and values being represented as attribute names and values). This may involve pre-parsing the xml from db in order to convert it to parent-child fashion (columns as children instead of attributes). What is the best way to do this pre-parsing (note: I already have the xslt for this; I wonder about the data flow and when/how to run two xslt's in sequence) and then apply test.xslt (preferably without saving intermediate xml result files on the server)?
Question 5
When I open the above xml in IE9, this works fine as said. But opening it in Firefox errors (RTF issue, apparently I need to use Firefox's node-set function but I still have to discover which namespace that has), and Opera/Chrome/Safari do not show any content. What exactly are the prerequisites for the various browsers, and where can I find more information on this?
Q1 If you start by serving an html file which then accesses the xml and xslt via javascript it naturally has access to both the input and the output of the xslt. If you are serving the xml and initiating the transformation using xml-stylesheet pi, then perhaps the best thing to do (depending on what you want to do) is to stuff the original source into the output, then javascript in the generated page can access it if needed, eg
<xsl:template match="whatever">
<html>
<head>
<script id="source" type="x-xml-source">
<xsl:copy-of select="/"/>
</script>
.... whatever you were going to do
then if you need to access the source in response to a user action on the page, a script can retrieve the script element with id source and do whatever is needed. (If there is a possibility of the source including the string </script>, you have to code it a bit more defensively.)
Q2 If you want to use the xml-stylesheet PI then you have to serve it as xml. However, you can instead just serve html and then access the xml and xslt from within a script in the html page using the browser's javascript xslt api. As noted above, that is more flexible than the xml-stylesheet mechanism.
Q3 pass
Q4 If you are accessing the xslt from javascript then it is easy to chain the output of one to the input of another without writing back to the server as you just have access to the result as a DOM node (or string, depending)
Answer to question 5: Firefox/Mozilla, Opera, Safari, Chrome all support the EXSLT node-set extension function in the namespace http://exslt.org/common, for IE and MSXML you can use script (imported) inside the XSLT stylesheet to allow it to support that namespace too, see http://dpcarlisle.blogspot.de/2007/05/exslt-node-set-function.html. That way inside the main stylesheet where you need to use the node-set function you don't need to write different code to cater for the different namespaces.

How to pass arguments to xslt?

Is there any way to pass any argument to xslt?
For example I need to filter some elements, and I want to be able to change filtering condition.
Preferably without js.
Sure, define global parameters with top-level <xsl:param name="param-name"/> elements in your stylesheet, then check the documentation of your favorite XSLT processor API on how to set such parameters before you run a transformation.

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash
Since it is HTML I think this could work for you?
https://metacpan.org/pod/XML::XPath
XPath is the way.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
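use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);   # parse the string held in $content
$tree->eof;               # signal the end of the document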
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
my $data = $node->findvalue('.'); # the content of the node
print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
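The question also asks for a DATA_2 => DATA_1 mapping. As a language-neutral illustration of that pairing step, here is a minimal sketch in Python using only the standard library (not Perl, and not the HTML::TreeBuilder::XPath API). The class name CellCollector and the sample cell values are hypothetical, and it assumes each "ud" cell directly follows its "jumlah" cell, as in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Pair each <td class="jumlah"> with the <td class="ud"> that follows it."""

    def __init__(self):
        super().__init__()
        self.current = None   # class of the <td> currently being read, if any
        self.buffer = []      # text fragments of the current cell
        self.pending = None   # text of the most recent "jumlah" cell
        self.mapping = {}     # "ud" text => "jumlah" text

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("jumlah", "ud"):
                self.current = cls
                self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.current:
            text = "".join(self.buffer).strip()
            if self.current == "jumlah":
                self.pending = text
            elif self.pending is not None:
                self.mapping[text] = self.pending
                self.pending = None
            self.current = None


# Made-up sample rows in the shape described in the question.
content = """
<table>
<tr><td class="jumlah">10</td>
<td class="ud">apples</td></tr>
<tr><td class="jumlah">25</td>
<td class="ud">pears</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(content)
print(collector.mapping)
```

The same pairing logic carries over to the Perl version: iterate over the td nodes in document order, remember the last "jumlah" value, and store it in the hash when the matching "ud" cell is reached.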
Use HTML parsing modules as described in answers to this Q - HTML::TreeBuilder or HTML::Parser.
Purely theoretically you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters: too easy to get wrong, too hard to do well, and impossible to get 100% right since HTML is not a regular language.
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

How to show XSL-converted XML as a part of an HTML page?

Can I embed an XML file in HTML without using iFrames?
I want to show XSL-transformed XML(which, is HTML as a result of transformation) as a part of my HTML document. Hope this makes it clearer.
If my description of problem is unclear, please tell me and I will try to explain it more.
You can easily use browser based XSL transformation routines to convert an XML string into an XMLDocument or HTML output that can then be applied into any page element.
The steps could be briefly summarized as:
Load an XML string from a resource (or as the result of an AJAX hit).
Load the XML document into an Xml document object (code differs for Browsers - IE uses the ActiveXObject MSXML - DOMDocument, while Mozilla uses the built-in implementation to create a Document. Chrome on the other hand uses the built-in XmlHttpRequest object as the only available XML document object.)
Load the XSL document similarly and set its arguments.
Transform the XML and obtain output as a string.
Apply the string output to any page element.
Note that the code differs for each browser so it may be simpler to use a public JS framework such as JQuery or Prototype.
You will need to use html entities. For example, this is how you would write a name tag:
&lt;name&gt;
More reading here
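If the XML is produced by a program rather than typed by hand, the escaping can be done in code. A minimal sketch, assuming Python and its standard html module (the variable names are illustrative):

```python
import html

raw_xml = "<name>"
# Replace the markup characters with HTML entities so the browser
# displays the tag as text instead of interpreting it.
escaped = html.escape(raw_xml)
print(escaped)
```

The escaped string can then be placed inside any page element, where it is shown literally.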