How can I take an xml string and display it on my page similar to how StackOverflow does it with 'insert code'? - html

I'm using the DataContractSerializer to convert an object returned from a WCF call to xml. The client would like to see that xml string in a webpage. If I output the string directly to a label, the browser strips out the angle brackets, obviously. My question is how can I do something similar to StackOverflow? Are they doing a find & replace to replace angle brackets with their html entities? I see they are doing a code tag inside a pre tag and then making spans with the appropriate class. Is there an existing utility out there I can use to do this instead of writing some kind of parsing routine? I'm sure something free must be out there. If anyone can direct me to the right place or to some code that can easily accomplish this, I would greatly appreciate it. I apologize if this is more of a meta.stackoverflow question. Thanks for any tips.

The basic answer is that to get HTML displayed as typed, special characters and all, you need to replace the special characters (<, > etc.) with their escaped equivalents (&lt;, &gt; etc.). Beyond that, if you want syntax colouring you'll have to parse the input to identify the keywords etc.
A full list of the special characters and their escape codes can be found here, but this is just one site of many.
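As a minimal sketch of the escaping step described above (written in Python purely for illustration, since the question itself is about .NET; the function name `xml_to_html_block` is hypothetical), the standard library's `html.escape` does the entity replacement, and the result can be wrapped in `pre` and `code` tags:

```python
from html import escape


def xml_to_html_block(xml_text: str) -> str:
    """Return HTML that makes a browser show xml_text literally.

    html.escape replaces &, < and > with entities; quote=True also
    escapes double and single quotes.
    """
    return "<pre><code>" + escape(xml_text, quote=True) + "</code></pre>"


# Hypothetical sample input, not taken from the question.
sample = '<city place="park place"><name>A & B</name></city>'
print(xml_to_html_block(sample))
```

This only handles literal display. Syntax colouring would still need a highlighter or a parser on top of it.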

You're talking about "pretty print". If you want to display source code you could use this link: 16 Free Javascript Code Syntax Highlighters For Better Programming
But if you want to display only xml, there are some functions on the web that could help you with that, like this one: XML PHP Pretty Printer
And don't forget the special characters =)
good luck

Related

Store arbitrary characters in Semantic MediaWiki

I'm trying to store some text containing html tags into properties, which doesn't work. I created a form for a property with the data type 'text' and a template. Saving the form writes the text into the template, but it can't get displayed, as it contains illegal characters, as I guess.
What I'm trying to do:
I need a form to enter data containing html tags and special characters.
I'd like to be able to use a query to find all those pages and show that text using a template I provide to the ask query.
I also tried to use the free text option, but then I can't retrieve it using the ask query.
What would be the best, or at least a working solution to this?
Thanks a lot
storing text with html tags is a bit tricky in SemanticMediaWiki
The reason is the invention of the StripMarkers UNIQ/QINU by the MediaWiki developers.
When parsing the content of a page with html tags in it, the parsing is sort of "postponed". This technical detail unfortunately makes it hard for extension developers like the SMW developers to solve the issue of handling such content. It also makes it hard for lay people to follow the discussion on how to solve the problem.
Here are two examples of SMW Issues that are marked as "closed". This state of affairs means that by following the configuration hints in the issue your problem should be solved. If not please ask a question on the SMW issue list or even initiate the reopening of the issues.
https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/794
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3707
On my wiki we ran into this and resolved it by replacing special characters (we had issues with [ ] =, but the same problem happens with < > tags too) with alternate unicode characters using the regex extension and a template before setting the property with {{#set:}}. If you want to display the formatted text on the wiki directly then call that parameter separately without replacing the unicode characters.
When you want to display the property, you can then run the reverse replacement with regex before displaying your now intact code (using the template result format to allow you to perform the operation on the output of the query).
To switch to special characters you can create this template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/=/|꞊}}|/\[/|[}}|/\]/|]}}|/>/|≽}}|/</|≼}}
And to switch back you can use this as a template
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/꞊/|=}}|/[/|[}}|/]/|]}}|/≽/|>}}|/≼/|<}}

Regular expressions for HTML

I am trying to find the following regular expressions to implement to a program of mine to parse a given html file. Could you help me with any of those?
<div>
<div class=”menuItem”>
<span>
class=”emph”
Any string beginning with < and ending with >, i.e. all tags.
The contents of the body tag.
The contents of all divs
All divs that make menus
I have managed to figure out that the single div tag is simply " < div >"
and the "all tags" expression is <(\"[^\"]*\"|'[^']*'|[^'\">])*>
Do you think you could help me with any of the rest?
Thank you in advance guys...
I know that HTML parsing is an already solved problem and that regex is not efficient, however it is requested that I do this like this, in order to demonstrate how regular expressions can work by making them (sometimes) long and detailed. That's why I'm simply handling the HTML file I have as a simple text file and I need to apply those regular expressions on it.
Please, for your own sanity, consider using an HTML parser library for the language you are using. Regexps are not a suitable tool for this application - they cannot reliably or cleanly handle structured data like HTML.
https://stackoverflow.com/a/1732454/457201
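To illustrate the parser approach recommended above, here is a small sketch using Python's standard-library `html.parser` (the language is an assumption, since the question names none, and `MenuItemCollector` and `menu_items` are hypothetical names). It collects the text of every div whose class list includes `menuItem`, keeping track of nested divs, which is the part a single regular expression struggles with:

```python
from html.parser import HTMLParser


class MenuItemCollector(HTMLParser):
    """Collect the text inside every <div class="menuItem">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside a menuItem div (0 = outside)
        self.items = []  # collected text, one entry per menuItem div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested div inside a menu item: track it so the matching
            # </div> does not end the item too early.
            self.depth += 1
        elif "menuItem" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.items[-1] += data


def menu_items(html_text):
    parser = MenuItemCollector()
    parser.feed(html_text)
    parser.close()
    return [item.strip() for item in parser.items]


# Hypothetical sample document.
page = ('<body><div class="menuItem"><span>Home</span></div>'
        '<div>x</div>'
        '<div class="menuItem emph">About <div>us</div></div></body>')
print(menu_items(page))
```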

Escape an apostrophe from a data binding's XML

I have a string from xml with an apostrophe that should be escaped to &apos; and it is not.
<city place="park's place"/>
In html I am grabbing the value.
<span datafld="place"></span>
I need the value in place to be "park&apos;s place" and not park's place. Currently it shows "park's place".
I have spent a good amount of time trying to find an answer and can't seem to find one.
This code example is badly hacked together since I am not allowed to show any original code.
Thanks.
Edit: This is on a xhtml page using javascript.
In the XML "data model" all values are unescaped. So whether your attribute was specified as:
place="park's place"
or:
place="park&apos;s place"
or:
place="park&#39;s place"
when you use an XML parser (or the DOM) you'll get "park's place". (Things like "innerHTML" are an exception to this general rule.)
If you need to compare that to some other string that has a different level of escaping then you either need to escape the string you get from the DOM, or you need to unescape the other string. It's a lot like if you were going to compare a measurement in meters to one in feet: you need to convert to a common unit-of-measurement/level-of-escaping.
I'd go with the unescaping approach if you can. If that isn't possible then you'll need to make sure that you escape in a consistent way everywhere, which can be difficult. Note that I've shown you three different ways of legally escaping that particular string -- and there are many many more.
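A quick way to see this is to parse the same attribute written several ways and compare the results. The sketch below uses Python's standard-library `xml.etree.ElementTree` for illustration (the question is about JavaScript on an XHTML page, so the language is an assumption, and the numeric reference `&#39;` is one assumed example of an alternative escape). It also shows re-escaping the parsed value when the `&apos;` form is really needed:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The same attribute value written with three different escapings.
forms = [
    '<city place="park\'s place"/>',       # literal apostrophe
    '<city place="park&apos;s place"/>',   # named entity
    '<city place="park&#39;s place"/>',    # numeric reference (assumed example)
]

# The parser hands back the unescaped value in every case.
values = [ET.fromstring(doc).get("place") for doc in forms]
print(values)

# To get the &apos; form back, escape explicitly after parsing.
# escape() handles &, < and > itself; the dict adds the apostrophe.
escaped = escape(values[0], {"'": "&apos;"})
print(escaped)
```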
In case it helps someone searching for this: you can use the acute accent symbol (´) in place of the apostrophe.

How to extract meaningful text from HTML

I would like to parse an html page and extract the meaningful text from it. Does anyone know some good algorithms to do this?
I develop my applications on Rails, but I think ruby is a bit slow at this, so if a good library in C exists for this it would be appropriate.
Thanks!!
PS: Please do not recommend anything with java
UPDATE:
I found this link.
Sadly, it is in python
Use Nokogiri, which is fast and written in C, for Ruby.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.
Solutions integrating with Ruby
use Nokogiri as recommended by Amigable Clark kant
Use Hpricot
External Solutions
If your HTML is well-formed, you could use the Expat XML Parser for this.
For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).
Lynx is able to do this. This is open source if you want to take a look at it.
You should strip all angle-bracketed parts from the text and then collapse whitespace.
In theory, < and > should not appear anywhere else: pages contain &lt; and &gt; in their place.
Collapsing whitespaces: Convert all TAB, newline, etc to spaces, then replace every sequence of spaces to a single space.
UPDATE: And you should start after finding the <body> tag.
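The steps in this answer can be sketched in a few lines. This is a naive illustration in Python (an assumption, since the question asks for Ruby or C; `naive_text` is a hypothetical name), and it inherits the limits of the approach: the contents of script and style elements are kept as text, and comments containing `>` will confuse it.

```python
import re
from html import unescape


def naive_text(html_text: str) -> str:
    """Strip tags and collapse whitespace, starting after <body>."""
    # Start after the opening <body> tag if there is one.
    match = re.search(r"<body\b[^>]*>", html_text, flags=re.IGNORECASE)
    if match:
        html_text = html_text[match.end():]
    # Replace every angle-bracketed part with a space.
    text = re.sub(r"<[^>]*>", " ", html_text)
    # Turn entities such as &lt; and &amp; back into characters.
    text = unescape(text)
    # Convert tabs, newlines etc. to spaces and squeeze runs into one space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample page.
page = ("<html><head><title>T</title></head><body>\n"
        "<p>Hello,\t<b>world</b> &lt;3</p>\n</body></html>")
print(naive_text(page))
```

Tags are replaced with a space rather than removed so that text from adjacent block elements does not run together. The trade-off is that a word split by inline markup (for example `wor<b>ld</b>`) is also split in the output.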

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (HTML is too crappy to be parsed) and I would like to remove <a href> tags, but keep text in the tags.
I've been looking at regex but still haven't found a good way to do that.
All advice is welcome!
You could remove anything from a string that looks like a HTML link via regex. Results heavily depend on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.
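For the throw-away case, the regex replacement mentioned in this answer looks like this in Python (the language and the name `strip_links` are assumptions, and the case-insensitive flag is an addition so that `<A HREF=...>` is caught too):

```python
import re

# The pattern from the answer: an opening or closing <a> tag with any attributes.
# \b keeps it from matching other tags that start with "a", such as <abbr>.
A_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_links(fragment: str) -> str:
    """Remove <a ...> and </a> tags but keep the text between them."""
    return A_TAG.sub("", fragment)


# Hypothetical sample fragment.
sample = 'See <a href="http://example.com">this page</a> and <abbr>x</abbr>'
print(strip_links(sample))
```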