Should webservices expose nested or flat lists? - json

When designing a webservice, no matter whether it's SOAP, XML or JSON: would you prefer flat or nested lists?
Example:
Nested:
<carRequest>
<cars>
<car>
<manufature />
<price />
<description />
</car>
<car>
<manufature />
<price />
<description />
</car>
</cars>
</carRequest>
Flat:
<carRequest>
<car>
<manufature />
<price />
<description />
</car>
<car>
<manufature />
<price />
<description />
</car>
</carRequest>
What's the advantage of one over the other?

There are advantages and disadvantages, combined with personal style, tools (their default configurations, limitations or ease of use), the need to support multiple MIME types from a single object representation, etc. I'm not going to go into all of that, since what works for some might not be a good solution for others, but I just want to point out a few things...
Which one seems more natural, the flat elements or the wrapped elements? How do people usually think about repeated elements? For example, <manufature>, <price> and <description> are wrapped in a <car> element. Why? Because they are related and together form a structure. Multiple <car>s are also related and form a structure too: a list of <car>s. The wrapped form is more expressive in your representation and XML schema, and more readable. But of course now we go into personal preferences and holy wars...
There is another advantage of the wrapped element. How do you express a list of cars that is empty versus a list of cars that is null?
If the elements are flat and you have no cars, then what does this represent when you unmarshal it into an object?
<carRequest>
</carRequest>
Does your request have cars = null or cars = []? You don't know.
If you go with nested elements then cars = null is this:
<carRequest>
</carRequest>
while cars = [] is this:
<carRequest>
<cars>
</cars>
</carRequest>
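The null-versus-empty distinction above can be sketched on the consuming side. This is only an illustration using Python's standard `xml.etree.ElementTree`; the helper name `read_cars` is hypothetical, and real unmarshalling frameworks will have their own conventions.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional


def read_cars(xml_text: str) -> Optional[List[ET.Element]]:
    """Return None when <cars> is absent, else the (possibly empty) list of <car>."""
    root = ET.fromstring(xml_text)
    wrapper = root.find("cars")
    if wrapper is None:
        # No wrapper element at all: treat the list as null.
        return None
    # Wrapper present: the list exists, even if it has no <car> children.
    return wrapper.findall("car")


absent = read_cars("<carRequest></carRequest>")
empty = read_cars("<carRequest><cars></cars></carRequest>")
one = read_cars("<carRequest><cars><car><price/></car></cars></carRequest>")
```

With the flat layout there is no wrapper to test for, so the first two inputs would be indistinguishable.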
And since you mentioned SOAP, you might at some point need to consider interoperability across technologies and tools (see Why is it important to be WS-I Basic Profile compliant?), which has rules on what the XML should look like inside the SOAP message. The style called the document/literal wrapped pattern is preferred.
This is a broad subject and as a TL;DR I can only think of "choose your poison". I hope my answer is of help to you.


Is there a way to have an Xquery in an XSLT stylesheet which will be executed upon transformation?

I have an XML file which I've been trying to transform both with xQuery and XSLT at the same moment.
The document basically encodes two different types of text according to TEI standards. The first part is a philological study which I have written about an epic poem, and the second part is a scholarly edition of said poem.
<text>
<front><!-- chapters of the study --></front>
<body>
<lg n="1">
<l n="1.a">first line of the poem</l>
<l n="1.a">second line with <distinct>interesting stuff</distinct></l></lg>
<!-- rest of the poem-->
</body></text>
My main goal is to transform this with XSLT into a nicely formatted html document, and for the most part it works.
Now, the study discusses data from the edition ("This interesting stuff occurs quite often in our poem, as is shown in the following table"). Since all the "interesting stuff" is marked up (see example above), I can easily create those tables using a combination of HTML and xQuery:
<table>
<tr>
<td>Verse Number</td>
<td>Interesting Stuff</td>
</tr>
{
for $case in doc("mydocument.xml")//distinct
return
<tr>
<td>{data($case/ancestor::l/@n)}</td>
<td>{data($case)}</td></tr>
}</table>
The easy way at the moment would be to change the xQuery so that it creates a TEI-conformant XML table and to copy that manually into the document. Then the XSLT will work smoothly, just as it does for the few static tables that I have. But most of my tables should be dynamic: I want the numbers to change if I change something in the edition. This should happen every time a new reader opens the formatted text in the browser (i.e., each time the XSLT transformation is executed).
I tried combining the code as follows:
<xsl:template match="table[type='query']">
{ (: the xQuery-html instructions from above go here :) }
</xsl:template>
It creates a table at the right place, but before it and in the cells it just repeats the xQuery instructions. I've been looking for similar questions, but I found only the reverse process, i.e. how to use xQuery to create XSLT (for example this: calling XQuery from XSLT, building XSLT dynamically in XQuery?), which does not help with my problem.
Is there a way to combine the two codes?
Thanks in advance for your help!
There are various ways you can combine XSLT and XQuery. You can have XSLT tasks and XQuery tasks in the same pipeline, or you can invoke XQuery functions from XSLT (for example using load-xquery-module() in XSLT 3.0). But for the case you're describing, it's simplest to just replace the FLWOR expression with an equivalent xsl:for-each:
<xsl:for-each select='doc("mydocument.xml")//distinct'>
<xsl:variable name="case" select="."/>
<tr>
<td>{$case/ancestor::l/@n}</td>
<td>{$case}</td>
</tr>
</xsl:for-each>
Note: XSLT 3.0 allows the curly-brace syntax (you need to specify expand-text="yes") but the semantics are slightly different from XQuery - it means "value-of" rather than "copy-of".
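To sanity-check which rows the loop above should produce, the same selection can be sketched outside XSLT. This is not the XSLT solution itself, only a rough Python equivalent using the standard library; the sample document is a cut-down, namespace-free stand-in for the TEI file, and the verse number `1.b` is an assumed value for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down, namespace-free stand-in for the edition (assumption).
SAMPLE = """<text><body>
<lg n="1">
<l n="1.a">first line of the poem</l>
<l n="1.b">second line with <distinct>interesting stuff</distinct></l>
</lg>
</body></text>"""


def distinct_rows(xml_text):
    """Return (verse number, text) pairs, one per <distinct> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for line in root.iter("l"):
        # ElementTree has no ancestor axis, so walk down from each <l> instead.
        for case in line.iter("distinct"):
            rows.append((line.get("n"), "".join(case.itertext())))
    return rows


rows = distinct_rows(SAMPLE)
```

Each pair corresponds to one `<tr>` in the generated table: the `n` attribute of the enclosing line, then the string value of the marked-up passage.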

Any conventional standards for storing OCR data/metadata in JPEG images?

I want to organize a collection of scanned documents (receipts, bank statements, etc.) by adding their metadata and text content (OCR'ed) into the same jpeg files. Is there any more or less commonly accepted way of storing such data? Any commonly used schemas?
For metadata, for example, I found the Dublin Core scheme, but most of the fields I want are not there, and I'm not sure what a good way to add custom fields is. Can I just use them as if they existed in the DC or XMP scheme (i.e. <dc:myfield>myvalue</dc:myfield> or <xmp:myfield>myvalue</xmp:myfield>), or do I have to define my own scheme by adding xmlns:myScheme="http://myScheme.uri" and then use it as <myScheme:myfield>myvalue</myScheme:myfield>?
Also, in all the examples I found, this data is stored inside <rdf:Description> which is inside <rdf:RDF> which is inside <x:xmpmeta> - is it a standard requirement? I don't see it in the XMP specification for storage in files...
For now, based on the examples, I plan to embed something like this:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='MyTool v 0.0.1'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:myDoc='http://some.custom.uri/'>
<dc:format>image/jpeg</dc:format>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:originalfilename>20190519121225_003.jpg</myDoc:originalfilename>
<myDoc:originalimagewidth>1684</myDoc:originalimagewidth>
<myDoc:originalimageheight>2788</myDoc:originalimageheight>
<myDoc:langOCR>EN-US</myDoc:langOCR>
<myDoc:acquisitiondatetime>2019-05-19T12:12:25Z</myDoc:acquisitiondatetime>
<myDoc:documentdate>2019-01-02</myDoc:documentdate>
<myDoc:pagesindocument>6</myDoc:pagesindocument>
<myDoc:page>2</myDoc:page>
<myDoc:textcontent>
Bank
statement
02/01/2019
Page 2 of 6
( Here goes raw OCR content
as multiline text )
</myDoc:textcontent>
<dc:subject>
<rdf:Bag>
<rdf:li>bank</rdf:li>
<rdf:li>statement</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Does it make sense at all? I'm sure many people already worked on similar tasks, I don't want to reinvent the wheel...
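As a quick check that a packet shaped like the one above is well-formed and that the custom-namespace fields can be read back, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It works on a shortened copy of the packet (the `xpacket` wrapper lines and most fields are left out), and it says nothing about how other XMP tools treat the hypothetical `myDoc` namespace.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the myDoc URI is the placeholder used in the question.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "myDoc": "http://some.custom.uri/",
}

# Shortened copy of the packet, without the <?xpacket ...?> wrapper.
PACKET = """<x:xmpmeta xmlns:x='adobe:ns:meta/'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
 xmlns:dc='http://purl.org/dc/elements/1.1/'
 xmlns:myDoc='http://some.custom.uri/'>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:page>2</myDoc:page>
<dc:subject><rdf:Bag><rdf:li>bank</rdf:li><rdf:li>statement</rdf:li></rdf:Bag></dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>"""

root = ET.fromstring(PACKET)
desc = root.find("rdf:RDF/rdf:Description", NS)

# Simple properties are plain child elements in their own namespace.
doctype = desc.findtext("myDoc:doctype", namespaces=NS)
page = int(desc.findtext("myDoc:page", namespaces=NS))

# dc:subject is an unordered array, stored as rdf:li items inside an rdf:Bag.
subjects = [li.text for li in desc.findall("dc:subject/rdf:Bag/rdf:li", NS)]
```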

Custom Microdata and Custom Schema.org

My site gives satellite frequency info, like this:
Frequency: 11881
Polarization: V
Symbol Rate: 27500
Fec: 3/4
I want to use microdata for this data.
I used PageMap like this:
<PageMap>
<DataObject type="action">
<Attribute name="channel_name" value="Watan TV"/>
<Attribute name="frequency" value="11012"/>
<Attribute name="polarization" value="H"/>
<Attribute name="symbol_rate" value="27500"/>
<Attribute name="fec" value="5/6"/>
</DataObject>
</PageMap>
And I want to use microdata, but I can't find any suitable type in Schema.org. So I used it like this:
<div itemscope>
<span itemprop="channel_name">Watan TV</span>
<span itemprop="frequency">11012</span>
<span itemprop="polarization">H</span>
<span itemprop="symbol_rate">27500</span>
<span itemprop="fec">5/6</span>
</div>
Is this wrong? Or which schema type must I choose? Thanks...
For Microdata, you can either
find a suitable vocabulary, or
create your own vocabulary, or
use only proprietary properties.
The last case is what you use in your example. Because you don’t specify the itemtype attribute, you are not using a vocabulary. You can come up with any name (as long as it doesn’t contain . or :), but don’t expect consumers to re-use your data: because you are not using a vocabulary, you are the only one who knows what all the properties really mean.
If your goal is providing the data for search engines, you’ll probably want to use Schema.org, as this is currently the only vocabulary which the big search engine services support (they are its sponsors). But if Schema.org doesn’t provide a suitable type, you can’t use it (you could use a broad type that applies, e.g., everything is a Thing, but it’ll miss all the properties you need for your case). Your only option here is to suggest an extension for Schema.org (but even if they implement it, it of course doesn’t mean that search engines will start doing something with this data).

What's the difference between `<seg>` and `<span>`

What's the difference between a <seg> in XML and a <span> in HTML? Here are two passages from Bibles, one from the English Bible in Christodoulopoulos and Steedman's massively parallel Bible corpus,
<?xml version="1.0" ?>
<cesDoc version="4">
…
<text>
<body id="Bible" lang="en">
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
</seg>
…
and the other from the NIV English Bible at Bible Gateway, which is where they got most of their texts from:
<p class="chapter-1">
<span id="en-NIV-27932" class="text Rom-1-1">
<span class="chapternum">1 </span>
Paul, a servant of Christ Jesus, called to be an apostle and set apart for the gospel of God—
</span>
<span id="en-NIV-27933" class="text Rom-1-2">
<sup class="versenum">2 </sup>the gospel he promised beforehand through his prophets in the Holy Scriptures
</span>
…
In the HTML, it seems a <span> can replace a <seg>, except that the HTML has added verse numbers in <span>. Oh, and the chapters are in <div>. So it's not one-to-one.
Of course, I realize that HTML and XML are different, and this is only one juxtaposition; I'm sure there are others out there. But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
Update: @jim-garrison says I'm going to need to read the schema to understand the XML, but I'm a neophyte at that, too. In particular, I did find some official-looking documentation for <seg> by TEI that makes me think its use is a little more than arbitrary, but I have no idea how to interpret this documentation. Should it give us a more specific answer than what Jim has already written?
The difference between XML and HTML generally is that the list of tags that can be present in XML is defined by a DTD or XML Schema, and tags represent document semantics and not presentation. So tags can be named anything. In HTML the set of tags is generally predefined, as if there was a pre-existing HTML DTD or schema, but HTML is not XML and doesn't follow all the rules of XML. While HTML was in some sense derived from the same parent as XML (SGML), and the two are superficially very similar, they are most definitely NOT the same thing.
The answer to your specific question is that the writers of the XML chose to use a tag named <seg> ("segment"?) to represent generalized strings of text, with attributes providing additional semantic information. For more details you'll need to find the DTD or XML schema that governs the content of the XML and read the documentation that goes with it.
But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
This is where you will use XSLT to transform the input XML into valid HTML. To figure out how to do that transformation you will need to know the full semantics of all the tags that can appear (again, go to the documentation for the DTD/Schema) and decide on a visual representation for the data. There's no one answer to "how should a <seg> be transformed?" That's up to your requirements regarding presentation. One possible transformation converts <seg> tags to <span>, but that may depend on the value of certain attributes (type="verse" vs some other type). It might even differ depending on the output medium (desktop vs tablet vs phone vs watch vs ...?)
Once you convert from XML to HTML you have left the realm of the Doctype gods and they have no interest in what you do :-) There's a whole different set of deities such as CSS-Cthulhu, Javascript-Janai'ngo (look it up), et al who will take great pleasure making your life miserable.
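The "one possible transformation" mentioned above (<seg> to <span>, with the type attribute deciding the presentation) can be sketched as follows. This uses Python's standard `xml.etree.ElementTree` rather than XSLT, purely to show the shape of the mapping; the choice to turn `type` into an HTML `class` is an assumption for illustration, and the enclosing <div> is left untouched here.

```python
import xml.etree.ElementTree as ET

# Cut-down fragment of the corpus markup from the question.
SAMPLE = """<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
</div>"""


def segs_to_spans(xml_text):
    """Rename every <seg> to <span> and move its type attribute into class."""
    root = ET.fromstring(xml_text)
    # Materialise the list first so renaming does not interfere with iteration.
    for seg in list(root.iter("seg")):
        seg.tag = "span"
        kind = seg.attrib.pop("type", None)
        if kind is not None:
            # Assumed mapping: the semantic type becomes a CSS hook.
            seg.set("class", kind)
    return ET.tostring(root, encoding="unicode")


html = segs_to_spans(SAMPLE)
```

A different requirement (say, rendering verses as list items) would change only the mapping inside the loop, which is why the decision belongs to the presentation requirements rather than to the XML vocabulary.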

SAS reading XML (XFDF) with XML engine, map for multiple <span>

These days I'm struggling to get SAS to read an XFDF file, an export of the comments (annotations) in a PDF made with Adobe Acrobat Professional.
If you have never worked with an .xfdf file, don't worry: it is basically an XML-based format from Adobe.
I can't use SAS XML Mapper, for two reasons: first, I can't use it at my workplace (where I also develop personal projects like this one); second, I'd like to write a procedure that can always be repeated (without redoing the mapping every time).
Usually comments are stored in XFDF in this format:
><freetext rect="300.165985,66.879105,380.165985,86.879105" creationdate="D:-001-1-1-1-1-1-00'30'" name="a7311cdb-77b3-4a48-8eff-62364f94213d" color="#FFBF00" flags="print" date="D:20150730153125+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</p
></body
></contents-richtext
></freetext
And I gather that data with this portion of the XML map:
<COLUMN name='var1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
Sometimes comments are stored in another way:
><freetext rect="331.041992,230.949005,553.198975,250.949005" creationdate="D:-001-1-1-1-1-1-00'30'" name="4f112387-dec6-42f1-ad8c-a1fecf9d8e04" color="#66CCFF" flags="print" date="D:20150730153213+01'00'" page="0"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THE_COMMENT_TO_EXPORT_IS_THIS_STRING</span
></p
></body
></contents-richtext
></freetext
No problem here either; I can gather this comment with this XML map portion:
<COLUMN name='var2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
But here comes the problem: sometimes the data is stored in this strange format, with two span tags:
><freetext rect="9.623672,760.177979,210.281006,783.448975" creationdate="D:00000000000000Z" name="4f037e18-9143-4ec1-a6ae-249fa2215528" width="2" color="#66CCFF" flags="print" date="D:20150731152640+01'00'" page="53"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:8.0.0" xfa:spec="2.0.2" style="font-size:14.0pt;text-align:left;color:#000000;font-weight:normal;font-style:normal;font-family:Arial,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Arial"
>THIS_IS_THE_FIRST_PART </span
><span style="font-family:Arial"
>THIS_IS_THE_SECOND_PART</span
></p
></body
></contents-richtext
></freetext
The second map code picks up only the second string (here: THIS_IS_THE_SECOND_PART). Can someone please help? How do I write an appropriate map for gathering both pieces of information with SAS?
PS: I'm pretty sure that SAS XML Mapper can't solve this issue either; I found someone on the web with the same problem who was using a map created by that tool.
PS2: The path type is XPath 1.0. I gave string-join a try and got this error:
ERROR: invalid character in Xpath expression
ERROR: Xpath construct string-join(/xfdf/annots/freetext/contents-richtext/body/p/span, '')
for column var2 is an invalid, unrecognized, or unsupported form
EDIT: Added HTML tag, <P> and <SPAN> are tags related to this language.
I'm answering my own question: I found a fairly good solution, but if anyone has an optimized version of this, please kindly post it.
I found out that SAS XML maps support only XPath 1.0, not XPath 2.0. In XPath 1.0 the concatenation can be done in a single expression only if you know the number of paths to concatenate in advance, using concat(/xxx/xxx[1], ' ', /xxx/xxx[2]).
Sadly, this function does not work in a SAS XML map; trying it produces the error ERROR: invalid character in Xpath expression.
But I'm not interested in a perfect format, since I can post-process the data I retrieve. Hence, in the map I created one column per possible repetition, each with its own <PATH>, in this way:
<COLUMN name='vars1'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[1]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
<COLUMN name='vars2'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[2]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
<COLUMN name='vars3'>
<PATH syntax='XPath'>/xfdf/annots/freetext/contents-richtext/body/p/span[3]</PATH>
<TYPE>character</TYPE>
<DATATYPE>string</DATATYPE>
<LENGTH>60</LENGTH>
</COLUMN>
I wrote 6 of these blocks, even though I have only encountered 2 repetitions so far, to make the code as general as possible.
Then I concatenated those string variables in a data step.
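For comparison, the end result of the map-plus-data-step approach (one joined string per annotation, however many <span> elements it contains) can be sketched outside SAS. This is not SAS code and not a replacement for the XML map; it is a small Python illustration using the standard library, run on a simplified sample in which the outer xfdf elements are assumed to carry no namespace.

```python
import xml.etree.ElementTree as ET

# The rich-text body lives in the XHTML namespace, as in the question's samples.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Simplified sample: attributes trimmed, outer elements assumed un-namespaced.
SAMPLE = """<xfdf><annots><freetext page="53"><contents-richtext>
<body xmlns="http://www.w3.org/1999/xhtml"><p dir="ltr"><span>THIS_IS_THE_FIRST_PART </span><span>THIS_IS_THE_SECOND_PART</span></p></body>
</contents-richtext></freetext></annots></xfdf>"""


def comments(xml_text):
    """Return one string per <freetext>, joining all text inside its body."""
    root = ET.fromstring(xml_text)
    result = []
    for note in root.iter("freetext"):
        body = note.find(f"contents-richtext/{XHTML}body")
        if body is None:
            continue
        # itertext() walks <p> and any number of nested <span> elements,
        # so the p-only, single-span and multi-span layouts all work.
        result.append("".join(body.itertext()).strip())
    return result


texts = comments(SAMPLE)
```

Because the join is done over all descendant text, there is no need to guess an upper bound on the number of spans, which is the limitation the six-column map works around.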