How to view xsd:documentation that is in HTML markup? - html

I am generating WSDL/XSD for SOAP services from a UML model using IBM Rational Software Architect (RSA). RSA allows you to document the classes and attributes in the model using rich-formatting.
For example, I have the following documentation on a Trailer class:
A wheeled Vehicle that is designed for towing by another
Vehicle. Known subtypes include:
Caravan
BoxTrailer
BoatTrailer
When the UML model is transformed to WSDL/XSD (using the out-of-the-box UML to WSDL transform), the formatting is preserved as HTML markup inside the xsd:documentation element:
<xsd:complexType name="Trailer">
<xsd:annotation>
<xsd:documentation><p>
A&nbsp;wheeled <strong>Vehicle</strong> that is designed for&nbsp;towing by another <strong>Vehicle.</strong> Known
subtypes include:&nbsp;
</p>
<ul>
<li>
<strong>Caravan</strong>
</li>
<li>
<strong>BoxTrailer</strong>
</li>
<li>
<strong>BoatTrailer</strong>
</li>
</ul></xsd:documentation>
</xsd:annotation>
</xsd:complexType>
Unfortunately, this is really hard to read and I've been searching (with no luck) for a program that can view WSDL/XSD with documentation in HTML markup.
XmlSpy 2008 can't do it, RSA can't do it (which is a bit surprising, as it generated the XSD in the first place), neither can any web browser I've tried.
I did write a JET template that extracted the documentation from the model and outputted to HTML, and I could probably write some XSLT to do something similar from the XSD, but I was hoping there's a program out there (ideally free) that could view the documentation as HTML.
Essentially, I'd like to be able to tell the consumers of our web service that they can view the WSDL in X program if they want to read the documentation - does anybody know the best solution to this?
Edit:
Thanks for the suggestions, but I think I have a solution! I didn't realise that RSA can export a WSDL to HTML (right-click on WSDL, export, HTML). The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation is richly-formatted again! One small caveat is that the ;nbsp's appear in the HTML output. This seems to be because the ampersand is escaped in the HTML:
&nbsp;
Instead it should be
I will update my model-to-model transform to ensure that the ;nbsp's are replaced with real spaces (I don't believe I'll need non-breaking spaces in the documentation), so the generated WSDL/XSD won't ever have them.

I highly doubt if the standard xml/xsd editors can interpret the html tags and generate appropriate documentation. Oxygen XML Editor does a decent job of understanding and converting the XML entities (liket < etc) but HTML tags and entities are left as is. Below is the screen shot in design view.

The type of <xs:documentation> is <xs:any> so you should actually be able to include your documentation without escaping the markup, provided that it is a well formed XHTML fragment instead of HTML. I guess some XML Schema tools would be capable to interpret the embedded XHTML and show it as formatted text.
Do note that if the markup is not escaped it absolutely must be a well formatted XML fragment or the documentation element will cause your schema to be malformed. This applies also to HTML entities! If the documentation contains an (unescaped) entity reference (other than the 5 pre-defined XML entities), then your schema either must contain an external DTD reference or have an embedded DTD that defines what is the replacement text of that entity. In your case the documentation contains an entity reference. Probably easiest will be to replace such entities with the corresponding Unicode character/text or with character references (use   for )
If you have a chance, try to include the documentation without escaping the markup and make sure that it will be well formed. Otherwise you probably need to process the documentation twice: 1) parse the schema and extract documentation 2) parse the documentation text again (possibly as HTML, not XML).

I've tried this with the latest build of QTAssistant and it shows like this in the Schema Help Panel only; I've put a feature request for the grid view, as well as the documentation generator to work the same. Is this what you're expecting?
The help panel shows the annotation of the schema object that is selected in the Graph/Diagram view. To display the help panel press F1.

This issue is fixed in RSA 8.0.4 - which now supports exporting to WSDL/XSD with plain text (as well as an option to sort the schema by type, then name alphabetically!).
To view the the documentation in a WSDL/XSD generated from a UML model in prior versions of RSA, the easiest solution is to export the WSDL/XSD as HTML using RSA. You can do this by right-clicking on the WSDL/XSD, selecting export, then selecting HTML.
The generated HTML has a graphical view of each schema element, the documentation for each element, as well as the original source, and everything is hyperlinked together.
Most importantly, the documentation (that's virtually unreadable in the WSDL/XSD) is richly-formatted again! One small caveat is that the ;nbsp's that RSA's documentation editor inserts also appear in the HTML output. This seems to be because the ampersand is not only escaped in the WSDL/XSD (which is good), but also in the HTML (bad!):
&nbsp;
Instead it should be
A simple workaround to this is to replace all &nbsp;'s in the WSDL/XSD with real spaces before generating the HTML.

Related

Extracting string from html web scrape

I'm looking for some guidance on a web scraping script i'm working on.
All is going well but I'm stuck on stripping out the image file data.
I'm currently doing a WebRequest, getting elements by class, selecting outerHTML, but need to strip out just the contents of attribute data-imagezoom as per this example.
Sample data:
<a class="aaImg" href="https://imagehost.ssl.server123.com/Product-800x800/image.jpg">
<img class="aaTmb" alt="Matrix 900 x 900 test" src="https://imagehost.ssl.server123.com/Product-190x190/image.jpg" item="image"
data-imagezoom="https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg" data-thumbnail="https://imagehost.ssl.server123.com/Product-190x190/image.jpg">
</img>
</a>
Current code to get that data:
$ProductInfo = Invoke-WebRequest -Uri $ProductURL
$ProductImageRaw = $ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg") |
Select outerHTML
I can obviously get the first image by selecting the href attribute easily.
I was 'dirty coding' by replacing 800x800 with 1600x1600 as the filenames are the same, just a different path, but that came unstuck pretty quick when there were inconsistencies in path names.
You need to access the outer <a> element's <img> child element and call its .getAttribute() method to get the attribute value of interest:
$ProductInfo.ParsedHTML.body.getElementsByClassName("aaImg").
childnodes[0].getAttribute('data-imagezoom')
.childnodes[0] returns the first child node (element)
.getAttributes('data-imagezoom') returns the value of the data-imagezoom attribute.[1]
This should return string https://imagehost.ssl.server123.com/Product-1600x1600/image.jpg.
As for your own answer:
Using regexes (or substring search) to parse structured data such as HTML and XML is brittle and best avoided.
For instance, if the source HTML changes to use '...' instead of "..." around attribute values, your solution breaks (this particular case is not hard to account for in a regex, but there are many more ways in which such parsing can go wrong).
Cross-platform perspective:
Regrettably, the .ParsedHTML property with its HTML DOM is only available in Windows PowerShell (and its COM implementation is cumbersome and slow to work with in PowerShell).
PowerShell Core, even on Windows, doesn't support it, and there's no in-box HTML parser available (as of PowerShell Core 6.2.0).
The HtmlAgilityPack NuGet package is a popular open-source HTML parser, but it is aimed at C# and therefore nontrivial to install and use in PowerShell.
That said, this answer by TheIncorrigible1 has a working example that downloads the required assembly on demand.
[1] Note that .getAttribute() is necessary to access custom attributes, whereas standard attributes such as id and, in the case of <a> elements, href, are represented directly as object properties (e.g., .id; note that .getAttribute() works with standard attributes too.)
So, after a quick crash course in some Regex, this is what I've come up with.
(?<=data-imagezoom=").*?(?="\s)
A positive lookbehind, select all until the closing quotes and whitespace.
Thanks all.

convert docx with (ordered) list to html

I'm trying to convert a large docx document with several layers' ordered list to an html. (see an example of the document here: http://docdro.id/X1oyfBv You should download it)
I tried the following things, including:
online converters such as html-cleaner and index.html (which only recognize one layer of the list)
save as html - which creates an horrendous file but still doesn't recognize the ol structure.
saved the file as zip and then opened the xml file, but I dont see an easy way to get the ol structure out of the w:... tags
saving it to google docs and running Omar Alzabir's script
http://omaralzabir.com/wp-content/uploads/2014/05/GoogleDocsEmail.jpg
btw. If I create a word file with an ordered list with multiple layers and i convert it, it does recognize it as ol's. But the existing file is not recognized as ol's even if I 'un-list' and list it again. So possibly there is something wrong with how the original document was created (?)
Any suggestions much appreciated:) Or indications as to why this problem occurs
Are you asking how to save a Word-doc in HTML format, with multi-level ordered-lists?
Word-HTML has bugs in its multi-level ordered lists. For the list-items, the indentation tends to be incorrect and inconsistent. There's an example here.
Word-HTML has similar bugs in its multi-level unordered lists. An example is here.
I recently wrote a Python program that fixes these bugs, in Word's HTML. The program is part of WordWebNav (WWN), which is free and open-source.
WWN is an app that converts a Microsoft-Word document to a usable web-page. It adds some missing features in the Word-HTML web-page (e.g., a navigation pane), and it fixes bugs in the Word-HTML.
You can use pandoc : https://github.com/jgm/pandoc
This is an open source universal command line tool to convert markup source based document files.
You can use it as something like that:
pandoc -o output.html input.docx

Adding metadata to markdown text

I'm working on software creating annotations and would like my main data structure to be based around markdown.
I was thinking of working with an existing markdown editor, but hacking it so that certain tags, i.e. [annotation-id-001]Sample text.[/annotation-id-001] did not show up as rendered HTML; the above would output Sample text. in an HTML preview and link to a separate annotation with the ID 001.
My question is, is this the most efficient way to represent this kind of metadata inside of a markdown document? Also, if a user wants to legitimately use something like "[annotation-id-001]" as text inside of their document, I assume that I would have to make that string syntax illegal, correct?
I don't know what Markdown parser you use but you can abord your problem with different points of view:
first you can "hack" an existing parser to exclude your annotation tags from "classic" parsing and include them only in a certain mode
you can also use the internal "meta-data" information proposed by certain parsers (like MultiMarkdown or MarkdownExtended) and only write your annotations like meta-data with a reference to their final place in content
or, as mentionned by mb21, you can use simple links notation like [Sample text.](#annotation-id-001) or use footnotes like [Sample text.](^annotation-id-001) and put your annotations as footnotes.

How convert Html into Prolog

How convert Html into Prolog?
I need to extract from an html page its tag and i describe it into Prolog.
Example, if my file contains this html code
<title>Prove<title>
<select id="data_nastere_zi" name="data_nastere_zi">
i should get
title(Prove),
select(id(data_nastere_zi)).
I tried to see various library but i couldn't.
Thanks.
You can parse well formed HTML using SWI-Prolog library(sgml), in particular load_html/2.
My experience, scraping 'real world' websites, isn't really pleasant, because of insufficient error handling.
Anyway, when you will have loaded the page structure, you will have available library(xpath) to inspect such complex data.
edit getting a table inside a div:
xpath(Page, //div, Div),
xpath(Div, //table, Table)...
SWI-Prolog has a package for SGML/XML parsing based on the SWI-Prolog interface to SP by Anjo Anjewierden: "SWI-Prolog SGML/XML parser".

Eclipse - how to extend HTML editor to add custom tags?

I write an application and inside of HTML code I have custom tags (of course these tags are parsed on server side and end user gets them as valid HTML code). Example of custom tag usage:
<html>
<body>
...
<Gallery type="grid" title="My Gallery" />
...
</body>
</html>
1.) How can I have eclipse recognize my custom tags inside of HTML code and add syntax highlighting to them?
2.) How can I add auto-suggestions to my custom tags? For example if I type "<Gallery " press "Ctrl+Space" - in the list of available attributes it shows me "type" and "title" and if I type "<Gallery type=" press "Ctrl+Space" I would see list of available values only for tag "Gallery" and its attribute "type".
Thanks in advance!
Not really what you want, but maybe it helps you:
You can try the Aptana Plug-in for Eclipse. It allows to write your own regular expression for HTML validation, so a custom tag would be ignored by the validator.
E.g.:
.gallery.
Eclipse allows you to add simple auto-suggestions via Templates. On
Eclipse 3.7.1 (Indigo) + PHP Dev Tools (PDT) 3.0.0: Window > Preferences > Web > HTML Files > Editor > Templates
Sadly, there is no easy way: you have to roll your own parser for this, and then add both your extra elements and the base grammar (HTML) to it.
If you have your parser, you could use it to do syntax highlighting (strictly speaking, for that simple lexing is enough); and a good parser can support content assist (auto-suggestions in your terminology).
Caveats:
Creating a parser for HTML is not an easy task. Maybe by aiming at a more often used subset is feasible.
If a parser exists, the editor parts are still hard to get well.
Some help on the other hand: you could use some text editor generators to ease your work:
Eclipse IMP http://www.eclipse.org/imp/ can in theory handle any type of parser, but currently it is most optimized for LPG. The documentation is scarce, but the developers are helpful in the forums.
Xtext http://www.eclipse.org/Xtext/ got quite a hype for creating text editors for DSLs. The generated editors are quite nice out of the box, but is not the best solution for large files. Has a really helpful developer community.
EMFText http://www.emftext.org/index.php/EMFText is a lesser known entity - I don't know it in details, but I guess, it is similar to Xtext.
I know its been a long time since this Q was asked,
but I hope this might help others like myself that reach this in search of a solution.
So, When using Eclipse (Mars.1 Release (4.5.1) - and possibly earlier - I did not check).
Go to Window - Prefrences
Then in the dialog that opens go to Web - HTML Files - Editor - Validation.
On the right side:
under Ignore specified element names in validation and enter the list of custom elements you use. (e.g. Gallery,tab,tabset,my-element-directives-*)
you might also like to go under Ignore specified attribute names in validation do the same for your custom attributes.(e.g. ng-*,my-attr-directives-*)
Two things to note:
After letting eclipse do a full validation you must also close the file and reopen it to have the warnings removed from the source code.
Using this method would ignore those attributes under any element. I don't think there is a simple way to tell it to ignore some-attribute only if its a child of some-element.
I find templates are an ok alternative but let's see if we can encourage a more robust solution; please take a moment and vote for this: https://bugs.eclipse.org/bugs/show_bug.cgi?id=422584
You need to add a new HTML template.To add a new template, complete the following steps:
1) From the Window menu, select Preferences.
2) In the Preferences page, select Web and XML > HTML Files > HTML Templates.
3) Click New.
4) Enter the new template name and a brief description of the template.
5) Using the Context drop-down list, specify the context in which the template is available.
6) In the Pattern field, enter the appropriate tags, attributes, or attribute values (the content of the template) to be inserted by content assist.
7) If you want to insert a variable, click the Variable button and select the variable to be inserted. For example, the word_selection variable indicates the word that is selected at the beginning of template insertion, and the cursor variable determines where the cursor will be after the template is inserted in the HTML document.
8) Click OK to save the new template.
You can edit, remove, import, or export a template by using the same Preferences page.
Reference : http://help.eclipse.org/kepler/index.jsp?topic=%2Forg.eclipse.wst.sse.doc.user%2Ftopics%2Ftsrcedt024.html