How to get element's raw value without trimming it using dom4j? - dom4j

Imagine that I have the following XML element with the multi-line value \nline2:
<address>
line2</address>
I'm using dom4j Element's getText(), but it trimmed the empty line (the \n) and returned line2.
Is there a way to retrieve the raw value \nline2 without trimming?

I doubt that. You probably used getTextTrim().

Related

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes the difference between . (dot) and text(), but in short the former returns a single element, whereas the latter returns a list of nodes. This is where it gets a bit weird to me: while text() can be used to match either of the nodes in the list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?
For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
1. contains(x,y) expects x to be a string, but text() matches two text nodes.
2. In XPath 1.0, the rule for converting multiple nodes to a string is this:
   "A node-set is converted to a string by returning the string-value of
   the node in the node-set that is first in document order. If the
   node-set is empty, an empty string is returned." [Emphasis added]
3. In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0. Without knowing (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
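The XPath 1.0 conversion rule can be mimicked by hand. The following is only a minimal Python sketch, not an XPath engine: it uses the standard library's ElementTree to collect the text-node children of a, and the helper names text_node_children and contains_xpath1 are made up for illustration.

```python
import xml.etree.ElementTree as ET

def text_node_children(elem):
    """Immediate text-node children of elem in document order (what text() selects)."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        # ElementTree stores the text that follows a child element as its "tail".
        if child.tail:
            nodes.append(child.tail)
    return nodes

def contains_xpath1(nodes, needle):
    """Emulate XPath 1.0: a node-set argument becomes the string-value of its first node."""
    first = nodes[0] if nodes else ""
    return needle in first

a = ET.fromstring("<a>Ask Question<other/>more text</a>")
nodes = text_node_children(a)

print(nodes)                                   # ['Ask Question', 'more text']
print(contains_xpath1(nodes, "Ask Question"))  # True
print(contains_xpath1(nodes, "more text"))     # False: only the first node is consulted
```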
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you wish to exclude strings from descendant elements of a, for example if you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath
The reason for this is that the contains function doesn't accept a nodeset as input - it only accepts a string. (Well, it may be engine dependent, because it works for Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains function will only act on a single text node at a time.
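To see how testing each text node separately differs from testing the string value of the whole element, here is a rough Python emulation with ElementTree. It is a sketch rather than real XPath evaluation, and the helper names are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

def text_node_children(elem):
    """Immediate text-node children of elem in document order."""
    nodes = [elem.text] if elem.text else []
    nodes += [child.tail for child in elem if child.tail]
    return nodes

def any_text_node_contains(elem, needle):
    """Roughly //a[text()[contains(., needle)]]: each text node is tested on its own."""
    return any(needle in node for node in text_node_children(elem))

def string_value_contains(elem, needle):
    """Roughly //a[contains(., needle)]: the test runs on all descendant text joined."""
    return needle in "".join(elem.itertext())

first = ET.fromstring("<a>Ask Question<other/>more text</a>")
second = ET.fromstring("<a>more text before <not>Ask Question</not> more text after</a>")

print(any_text_node_contains(first, "more text"))      # True: the second text node matches
print(any_text_node_contains(second, "Ask Question"))  # False: that text sits inside <not>
print(string_value_contains(second, "Ask Question"))   # True: descendants are included
```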

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
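As a minimal sketch of the [^>] approach, the following Python snippet builds the pattern for a given id and uses re.DOTALL as the "s modifier". The function name div_content_by_id and the sample string are made up for illustration.

```python
import re

def div_content_by_id(html, value):
    """Return the content of the first <div> whose id equals value, or None."""
    # [^>]* cannot cross the '>' that ends the opening tag, so the id must
    # belong to the same tag. re.escape guards against metacharacters in value.
    pattern = r'<div\b[^>]*\bid="' + re.escape(value) + r'"[^>]*>(.*?)</div>'
    # re.DOTALL lets '.' match newlines, replacing the (.|\n) workaround.
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1) if match else None

sample = '<div id="1">test one</div><div id="2">test\ntwo</div>'

print(repr(div_content_by_id(sample, "1")))  # 'test one'
print(repr(div_content_by_id(sample, "2")))  # 'test\ntwo'
print(div_content_by_id(sample, "3"))        # None
```

Because .*? stops at the first </div>, a div that contains nested divs would be cut short, which is one reason a DOM parser is the more robust choice.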

Create XSD to ignore HTML within tags

Is it possible to build an XSD that will treat any tag's contents just as text? I am trying to extract a tag's contents, which sometimes contain HTML tags. There is no fixed pattern to the HTML, and it is not always present. I just want to extract all the text from within the tags, e.g. <content>this is a new piece of content by <b>Person A</b></content>. I want to extract just "this is a new piece of content by <b>Person A</b>", but the schema generated by SSIS naturally includes these tags. When I just add a simple entry
<xs:element minOccurs="0" name="content" type="xs:string"></xs:element>
I get the following error which is not unexpected.
[XML Source [5]] Error: The XML Source was unable to process the XML
data. The element "content" cannot contain a child element. Content
model is text only.
You're not distinguishing very clearly between the schema you are writing to describe and constrain your data (and, I assume, guide SSIS in various ways) and the executable code you will at some point want to write in order to extract the data you want at a particular moment. There are several things you seem to want or need:
To allow unconstrained XML within an element, you'll want a wildcard; read up on the xsd:any element.
To extract just the text within an element, you'll want the XPath string() function (but note that your example "this is a new piece of content by <b>Person A</b>" is not just the text of content but contains a child element).
To extract a serialized XML representation of the content of the content element (which is what you apparently want, in contrast to what you say you want), you'll want to serialize the contents; there are a variety of ways to do that.
Think of the XSD primarily as describing allowed markup in a valid XML document rather than as a method to define extraction. If you change the type of content to xs:string, you're declaring that markup is not permitted within content, only text, and the validation error you're getting reflects that.
What you want is to select the string value of the content element. If the context for an XPath doesn't automatically convert its results to a string value, you can do so explicitly via the string() XPath function:
string(/path/to/particular/content)
This will return the concatenation of the string values of all of the children of content, omitting the tags as requested.
Update: Re-reading your question, I see that you actually want to retrieve
"this is a new piece of content by <b>Person A</b>"
(including the b element, not its string value). Here, the wrapping content element clearly has to be described in the XSD as having mixed content (mixed="true"). Extracting this data from an XML document in this form would typically involve selecting a collection of text and elements nodes, and serializing these back to a single string. I am not familiar enough with SSIS to provide details, but perhaps the reference I mentioned in the comments could help.
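Outside SSIS, the difference between the string value and the serialized content can be sketched in a few lines of Python with the standard library's ElementTree. This only illustrates the two results and says nothing about how SSIS would do it.

```python
import xml.etree.ElementTree as ET

doc = "<content>this is a new piece of content by <b>Person A</b></content>"
content = ET.fromstring(doc)

# String value: all descendant text concatenated, tags dropped (like XPath string()).
string_value = "".join(content.itertext())

# Serialized inner content: the leading text plus each child written back as markup.
# ET.tostring() on a child also emits the text that follows it (its tail).
inner_xml = (content.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in content
)

print(string_value)  # this is a new piece of content by Person A
print(inner_xml)     # this is a new piece of content by <b>Person A</b>
```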

How can you view the output of XPath functions like normalize-space()?

Say I have the following HTML:
<div class="instruction" id="scan-prompt">
<span class="long instruction">Scan </span>
<span id="slot-to-scan">A-2</span>
<span class="long instruction"> to prep</span>
</div>
And I'm trying to write an XPATH selector like this
//div[@id='scan-prompt' and normalize-space()='Scan A-2 to prep']
Is there a way to see what the normalize-space output actually is?
I know you can do $x("//div[@id='scan-prompt']") in the Chrome debugger, but I don't know how to go from that to seeing the output of normalize-space.
Why can you not simply use the path expression
normalize-space(//div[@id='scan-prompt'])
to see what the normalized string value would look like? Other than that, what normalize-space() does exactly is:
Removing any leading or trailing whitespace from the string argument
Collapsing any sequence of whitespace characters into a single space
If handed an element node as an argument (as is the case with your original expression), the function evaluates the string value of that element node. The string value of an element node is the concatenation of all its descendant text nodes.
The result of normalize-space(//div[@id='scan-prompt']) is, given the input you show (whitespace marked with "+"):
Scan+A-2+to+prep
Without invoking normalize-space(), for example string(//div[@id='scan-prompt']):
+
Scan+
A-2+
to+prep+
+
So, simply use path expressions that do nothing other than return a string value or a normalized string value. In Google Chrome, you can do this by evaluating the XPath expression inside $x().
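If no XPath console is at hand, the two results above can be approximated in Python. This is a hand-written emulation under the assumption that the markup parses as XML; the helpers string_value and normalize_space are illustrative names, not library functions.

```python
import re
import xml.etree.ElementTree as ET

html = """<div class="instruction" id="scan-prompt">
<span class="long instruction">Scan </span>
<span id="slot-to-scan">A-2</span>
<span class="long instruction"> to prep</span>
</div>"""

def string_value(elem):
    """String value of an element: all descendant text nodes concatenated."""
    return "".join(elem.itertext())

def normalize_space(s):
    """Collapse runs of XPath whitespace (space, tab, CR, LF) and trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

div = ET.fromstring(html)
raw = string_value(div)

print(repr(raw))                   # '\nScan \nA-2\n to prep\n'
print(repr(normalize_space(raw)))  # 'Scan A-2 to prep'
```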

Passing style parameters in query string

I have a simple html page with a div element in it.
The innerHTML property of the div is set through the query string.
In the query string I pass HTML strings, e.g.
<p style='font-size:20px;color:green;'> Sun rises in the east </p> etc...
I get the appropriate output.
However, if I pass a color code in the style attribute, say #00990a, no content is displayed.
Can someone help me through this?
If there's a color code that contains a #, everything after it will be treated as the fragment identifier. To avoid this, you have to URL-encode your parameter value (replacing # with %23, and doing the same with other characters that have a special meaning, such as &, %, = and ?).
Finally your url should look like this:
PageUrl?Content=%3Cp+style%3D%27color%3A%23009900%27%3EContent%3C%2Fp%3E
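A short Python sketch shows both the problem and the fix using the standard library's urllib.parse. The host example.com and the page path are placeholders; only the parameter name Content comes from the URL above.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

fragment_html = "<p style='color:#009900'>Content</p>"

# Unencoded: the '#' ends the query string and starts the fragment identifier.
unencoded = "http://example.com/page?Content=" + fragment_html
parts = urlsplit(unencoded)
print(parts.query)     # Content=<p style='color:
print(parts.fragment)  # 009900'>Content</p>

# Encoded: '#' becomes %23, so the whole value stays inside the query string.
encoded = quote_plus(fragment_html)
print(encoded)  # %3Cp+style%3D%27color%3A%23009900%27%3EContent%3C%2Fp%3E

url = "http://example.com/page?Content=" + encoded
decoded = parse_qs(urlsplit(url).query)["Content"][0]
print(decoded == fragment_html)  # True
```

In a browser, encodeURIComponent plays the same role as quote_plus here, although it encodes a space as %20 rather than +.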
Since you haven't shown us any code, I shall guess…
In a URI, # indicates the start of the fragment identifier (as ? indicates the start of the query string). Your colour is terminating the query string and starting the fragment identifier. You need to URL-encode any character that has special meaning in URLs (# is %23).
Do make sure that you sanitise the passed HTML and CSS on the server though. It is very easy to expose yourself to XSS attacks otherwise.