SAS: reading XML (XFDF) with the XML engine, map for multiple <span> tags - HTML

These days I'm really struggling to get SAS to read an XFDF file, an export of the comments (annotations) in a PDF made with Adobe Professional.
If you've never worked with an .xfdf file, don't worry: it's basically an XML-based format from Adobe.
I can't use SAS XML Mapper, for two reasons: first, I can't use it at my workplace (where I also develop personal projects like this one); second, I'd like to write a procedure that can be repeated every time (without remapping each time).
Usually comments are collected in the XFDF in this format:
><freetext rect="300.165985,66.879105,380.165985,86.879105" creationdate="D:-001-1-1-1-1-1-00'30'" name="a7311cdb-77b3-4a48-8eff-62364f94213d" color="#FFBF00" flags="print" date="D:20150730153125+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</p
></body
></contents-richtext
></freetext
And I gather that data with this portion of the XML map:
<COLUMN name='var1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
Sometimes comments are collected in another way:
><freetext rect="331.041992,230.949005,553.198975,250.949005" creationdate="D:-001-1-1-1-1-1-00'30'" name="4f112387-dec6-42f1-ad8c-a1fecf9d8e04" color="#66CCFF" flags="print" date="D:20150730153213+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</span
></p
></body
></contents-richtext
></freetext
No problem here either: I can gather this comment with this XML map portion:
<COLUMN name='var2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
But here comes the problem: sometimes the data is collected in this strange format, with a double <span> tag:
><freetext rect="9.623672,760.177979,210.281006,783.448975" creationdate="D:00000000000000Z" name="4f037e18-9143-4ec1-a6ae-249fa2215528" width="2" color="#66CCFF" flags="print" date="D:20150731152640+01'00'" page="53"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:14.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THIS_IS_THE_FIRST_PART </span
><span style="font-family:Arial"
>THIS_IS_THE_SECOND_PART</span
></p
></body
></contents-richtext
></freetext
The second map hits only the second string (here: THIS_IS_THE_SECOND_PART). Can someone please help? How can I write an appropriate map to gather both pieces of information with SAS?
PS: I'm pretty sure SAS XML Mapper can't solve this issue either; I found someone on the web with the same problem who was using a map created by that tool.
PS2: The path type is XPath 1.0. I gave string-join a try and got this error:
ERROR: invalid character in Xpath expression
ERROR: Xpath construct string-join(/xfdf/annots/freetext/contents-richtext/body/p/span, '')
for column var2 is an invalid, unrecognized, or unsupported form
EDIT: Added the html tag, since <p> and <span> are tags from that language.

I'm answering my own question: I found a fairly good solution, but if anyone has an optimized version of it, please post it.
I found out that in SAS XML maps you can't use XPath 2.0, only XPath 1.0. In XPath 1.0 this step could be performed automatically within a single block, provided you know the number of repeated nodes in advance, using concat(/xxx/xxx[1], ' ', /xxx/xxx[2]).
Sadly this function does not work in a SAS XML map either, and trying it you will encounter the error ERROR: invalid character in Xpath expression.
But I'm not interested in a perfect format, since I can post-process the data I retrieve; so in the map I reproduced, in separate variables, all the possible occurrences of the repeated <PATH>, in this way:
<COLUMN name='vars1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[1]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
<COLUMN name='vars2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[2]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
<COLUMN name='vars3'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[3]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
I wrote six of these blocks, even though I only ever encountered two occurrences of the repeated <PATH>, to make the code as general as possible.
Then I concatenated those string variables within a DATA step.
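For anyone who wants to sanity-check the joined result outside SAS, here is a minimal sketch in Python with the standard library (purely illustrative, not part of the SAS workflow; the file name comments.xfdf is an assumption) that collects each comment's <p> text regardless of how many <span> fragments Acrobat split it into:
import xml.etree.ElementTree as ET

# XHTML namespace used by the <body>/<p>/<span> elements in the samples above
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

tree = ET.parse("comments.xfdf")  # hypothetical file name for the exported comments
for p in tree.findall(".//xhtml:p", namespaces=NS):
    # itertext() walks the text of <p> and every nested <span> in document order,
    # so single-span and double-span comments both come back as one string
    comment = "".join(p.itertext()).strip()
    print(comment)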

Related

Is there a way to have an XQuery in an XSLT stylesheet which will be executed upon transformation?

I have an XML file which I've been trying to transform with both XQuery and XSLT at the same time.
The document basically encodes two different types of text according to TEI standards. The first part is a philological study which I have written about an epic poem, and the second part is a scholarly edition of said poem.
<text>
<front><!-- chapters of the study --></front>
<body>
<lg n="1">
<l n="1.a">first line of the poem</l>
<l n="1.a">second line with <distinct>interesting stuff</distinct></l></lg>
<!-- rest of the poem-->
</body></text>
My main goal is to transform this with XSLT into a nicely formatted html document, and for the most part it works.
Now, the study discusses data from the edition ("This interesting stuff occurs quite often in our poem, as is shown in the following table"). Since all the "interesting stuff" is marked up (see example above), I can easily create those tables using a combination of HTML and XQuery:
<table>
<tr>
<td>Verse Number</td>
<td>Interesting Stuff</td>
</tr>
{
for $case in doc("mydocument.xml")//distinct
return
<tr>
<td>{data($case/ancestor::l/@n)}</td>
<td>{$case}</td>
</tr>
}
</table>
The easy way at the moment would be to change the XQuery so it creates a TEI-conformant XML table and copy that manually into the document. Then the XSLT will work smoothly, just as it does for the few static tables that I have. But most of my tables should be dynamic: I want the numbers to change if I change something in the edition. This should happen every time a new reader opens the formatted text in the browser (i.e., each time the XSLT transformation is executed).
I tried combining the code as follows:
<xsl:template match="table[type='query']">
{ (: the xQuery-html instructions from above go here :) }
</xsl:template>
It creates a table at the right place, but before it and in the cells it just repeats the XQuery instructions. I've been looking for similar questions, but I found only the reverse process, i.e. how to use XQuery to create XSLT (for example this: calling XQuery from XSLT, building XSLT dynamically in XQuery?), which does not help with my problem.
Is there a way to combine the two codes?
Thanks in advance for your help!
There are various ways you can combine XSLT and XQuery. You can have XSLT tasks and XQuery tasks in the same pipeline, or you can invoke XQuery functions from XSLT (for example using load-xquery-module() in XSLT 3.0). But for the case you're describing, it's simplest to just replace the FLWOR expression with an equivalent xsl:for-each:
<xsl:for-each select='doc("mydocument.xml")//distinct'>
<xsl:variable name="case" select="."/>
<tr>
<td>{$case/ancestor::l/@n}</td>
<td>{$case}</td>
</tr>
</xsl:for-each>
Note: XSLT 3.0 allows the curly-brace syntax (you need to specify expand-text="yes") but the semantics are slightly different from XQuery - it means "value-of" rather than "copy-of".

Should webservices expose nested or flat lists?

When designing a webservice, no matter if it's SOAP, XML or JSON: would you prefer flat or nested lists?
Example:
Nested:
<carRequest>
<cars>
<car>
<manufature />
<price />
<description />
</car>
<car>
<manufature />
<price />
<description />
</car>
</cars>
</carRequest>
Flat:
<carRequest>
<car>
<manufature />
<price />
<description />
</car>
<car>
<manufature />
<price />
<description />
</car>
</carRequest>
What's the advantage of one over the other?
There are advantages and disadvantages, combined with personal style, tools (their default configurations, limitations, or ease of use), the need to support multiple MIME types from a single object representation, etc. I'm not going to go into all of that - since what works for some might not be a good solution for others - but I just want to point out a few things...
Which one seems more natural, the flat elements or the wrapped elements? How do people usually think about repeated elements? For example, <manufature>, <price> and <description> are wrapped in a <car> element. Why? Because they are related and together form a structure. Multiple <car>s are also related and form a structure too: a list of <car>s. It's more expressive in your representation and XML schema, and more readable. But of course now we get into personal preferences and holy wars...
There is another advantage of the wrapped element. How do you express a list of cars that is empty versus a list of cars that is null?
If the elements are flat and you have no cars then what does this represent when you unmarshall it into an object?
<carRequest>
</carRequest>
Does your request have cars = null or cars = []? You don't know.
If you go with nested elements then cars = null is this:
<carRequest>
</carRequest>
while cars = [] is this:
<carRequest>
<cars>
</cars>
</carRequest>
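To make that concrete, here is a minimal sketch of the difference at parse time (Python purely for illustration; the element names come from the example above, and nothing here is tied to any particular framework):
# Sketch: with a <cars> wrapper you can tell "no list at all" (null) apart
# from "an empty list". Element names follow the question's example.
import xml.etree.ElementTree as ET

def parse_cars(xml_text):
    root = ET.fromstring(xml_text)
    wrapper = root.find("cars")
    if wrapper is None:
        return None                  # wrapper missing entirely -> cars = null
    return wrapper.findall("car")    # wrapper present -> cars = [] or [car, ...]

print(parse_cars("<carRequest></carRequest>"))               # None
print(parse_cars("<carRequest><cars></cars></carRequest>"))  # []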
And since you mentioned SOAP, you might at some point need to consider interoperability across technologies and tools (see Why is it important to be WS-I Basic Profile compliant?), which has rules on what the XML inside the SOAP message should look like. The style called the document/literal wrapped pattern is preferred.
This is a broad subject and as a TL;DR I can only think of "choose your poison". I hope my answer is of help to you.

How can I make part of a value in an h:outputText bold?

How can I make part of a value in an h:outputText bold?
I want the name in bold:
<h:outputText value="Normal Text: #{Controller.Object.name}" />
I tried: <h:outputText value="Normal Text: <b>#{Controller.Object.name}</b>" />
got this error: "The value of attribute "value" associated with an element type "h:outputText" must not contain the '<' character."
After some searching here and on other pages, I found that the attribute escape="false" could fix this... but it doesn't make a difference for me:
<h:outputText escape="false" value="Normal Text: <b>#{Controller.Object.name}</b>" />
I still get the same error.
Has anyone had this problem?
Do you really need <h:outputText>?
In Facelets you can just use EL in template text:
Normal Text: <b>#{Controller.Object.name}</b>
If you really insist on using <h:outputText>, then you should indeed manually escape the XML entities and display the value with escape="false":
<h:outputText value="Normal Text: &lt;b&gt;#{Controller.Object.name}&lt;/b&gt;" escape="false" />
This not only reads uglier, but also opens an XSS attack hole in case #{Controller.Object.name} is a client-controlled value.
See also:
Is it suggested to use h:outputText for everything?
CSRF, XSS and SQL Injection attack prevention in JSF
To me, it makes much more sense to put the <p></p> inside the text in the .properties file, or wherever you are defining the Controller.Object.name value. Much cleaner, and you don't have to mess with encoding symbols.

Extracting data from HTML files using regular expressions

I am trying to extract specific data using regular expressions, but I haven't been able to achieve what I want. For example, on this page:
http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A
I have to keep only the data which is between,
<div class="row-fluid">
and
<br /> <br /><i class="icon-user"></i>
So I copied the HTML code into Notepad++, enabled Regular expression mode in the Replace dialog, and tried replacing everything that matches
.*<div class="row-fluid">
to delete everything before <div class="row-fluid">
but it is not working at all.
Does anyone know why?
P.S.: I am not using any programming language; I just need to perform this on the HTML code using Notepad++, not on an actual HTML file.
I would achieve this in several steps.
Step 1.
Transform the document into one line. Find
\r\n
and replace with nothing (make sure to select the "Extended (\n, \r, ...)" option in the Replace dialog).
Step 2.
find
<div class="row-fluid">
and replace with
\r\n~<div class="row-fluid">
Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.
Step 3.
find
<br /> <br /><i class="icon-user"></i>
and replace with
<br /> <br /><i class="icon-user"></i>\r\n
Step 4.
Delete unnecessary lines. Check "Regular expression".
find
^[^~].+$\r\n
and replace with nothing
Step 5.
Now you have only lines that start with
~<div class="row-fluid">
and ends with
<br /> <br /><i class="icon-user"></i>
Everything you need to do now is just delete these tags.
PS: You can record a macro if you need to do the same task several times.
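If you want to test the same idea outside the editor, here is a rough one-pass equivalent of the five steps above, sketched in Python (it assumes you saved the page locally as page.html; re.DOTALL plays the role of Step 1 by letting "." run across line breaks):
# One-pass equivalent of the manual steps: grab every block that starts with
# the row-fluid div and ends with the icon-user marker.
import re

with open("page.html", encoding="utf-8") as f:  # hypothetical local copy of the page
    html = f.read()

pattern = re.compile(
    r'<div class="row-fluid">.*?<br /> <br /><i class="icon-user"></i>',
    re.DOTALL,  # let "." match newlines, like joining the file into one line
)
for block in pattern.findall(html):
    print(block)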
You should consider retrieving the data using XPath. Most languages support it.
There's a great Firefox plugin called XPather that infers the XPath expression when you select a page item.
There's a hacked version that works with newer Firefox versions here:
http://jassage.com/xpather-1.4.5b.xpi
To use XPath with Python, consider using http://xmlsoft.org/python.html
Notice that XPath may have problems with malformed HTML, so you may also find Tidy an interesting option to "clean up" the HTML and get parseable XML.
http://tidy.sourceforge.net/
IMHO doing it with Notepad++ is difficult. According to this, you need to:
remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
Either you want to learn regexps, or you want to parse the HTML. Depending on which, the solution differs.
If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.
If you want to solve the problem (keep the data between the <div> and the <i>), then have a look at how to parse HTML/XML. In Python you have some great libraries like BeautifulSoup (which can deal with broken HTML). You can do it with DOM parsing, or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could use a simple stack to collect all the content between the two events...
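Here is a rough sketch of that parser-based route with BeautifulSoup (it assumes the page is saved locally as page.html and that the <i class="icon-user"> marker sits inside each row-fluid div, as the question's snippets suggest; adjust to the real page structure):
# Sketch of the parser route: keep the text of each row-fluid block,
# stopping at the <i class="icon-user"> marker.
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:  # hypothetical local copy of the page
    soup = BeautifulSoup(f, "html.parser")

for div in soup.find_all("div", class_="row-fluid"):
    parts = []
    for node in div.descendants:
        if getattr(node, "name", None) == "i" and "icon-user" in (node.get("class") or []):
            break  # reached the marker tag, stop collecting
        if isinstance(node, str):
            parts.append(node)
    print("".join(parts).strip())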

SharePoint 2010 FieldDefinition FullHtml field, need to prohibit script tags

I have a rather tricky problem. I have a SharePoint 2010 list definition. In this definition I have a field defined, like so:
<Field ID="{AAAAAAAA-AAAA-AAAA-AAAA-AAAAAAAA}"
Name="MyComponentHTMLTemplate"
DisplayName="HTML template"
Type ="HTML"
Description="HTML template for (my component)."
StaticName="MyComponentHTMLTemplate"
RichText ="FALSE"
RichTextMode ="FullHtml"
Required="FALSE">
<Default>
<![CDATA[<a href='{URL}' title='{DESCRIPTION}'>{KEYWORD}</a>]]>
</Default>
</Field>
As you can see, the field accepts HTML tags. This is desired behavior. However, there's one thing the field mustn't accept: <script> tags.
<a href='{URL}' title='{DESCRIPTION}'>{KEYWORD}<script>alert('this is not allowed')</script></a>
This is to prevent cross-site scripting and similar shenanigans. How can I set the field to accept HTML, but not script tags? So far my Google searches have yielded no answer. Any advice would be appreciated.