I have some HTML with JavaScript in it. The JavaScript creates an MSXml2 object, loads some XML from a file, and populates a span. However, the HTML inside the XML is being stripped. Is there a way to stop this from happening?
(pseudocode)
I've tried various combinations of mySpan = blah.GetNode("mynode").text, .value, .innerxml, etc., but nothing has worked so far.
Typically, as soon as I post it on here, my 2 hours of googling pay off, and I discover it's simply (pseudocode) getNode("mynode").xml!
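The difference between a text-style property and a serializing one can be shown outside MSXML. Below is a minimal Python sketch that uses the standard library's xml.etree.ElementTree as a stand-in (it is not the MSXML API); the sample document and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; "mynode" mirrors the pseudocode above.
SAMPLE = "<root><mynode>Hello <b>world</b></mynode></root>"


def node_text(node):
    """Concatenate only the text content, dropping all child tags."""
    return "".join(node.itertext())


def node_markup(node):
    """Serialize the node itself, keeping the nested markup."""
    return ET.tostring(node, encoding="unicode")


node = ET.fromstring(SAMPLE).find("mynode")
print(node_text(node))    # text only, the <b> tag is gone
print(node_markup(node))  # full markup, the <b> tag survives
```

The text-only accessor is the analogue of .text, which is why the embedded HTML disappeared; the serializing one is the analogue of .xml.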
Related
I'm developing ASP code that reads external websites and parses them through the HTMLDocument interface (the "HTMLFILE" object) so that I can navigate the content via the DOM structure. But some pages throw an error:
'htmlfile error 80070057 Invalid Argument.'
After a lot of research, I've discovered that some HTML tags are, for reasons I don't understand, not rendered or handled correctly by the HTMLFILE object, which produces that error.
Because ASP is so old that there is little material left to research, I've concluded that I have to pre-process the page before sending it to the HTMLFILE object, and the best way I have found to do that is with a regex.
But I'm running into problems, partly because I don't have much practice with regular expressions.
I need to reliably locate the HTML tag blocks that HTMLFILE does not accept so that I can remove them.
For Example:
<head>
<script> ....... </script>
<style> ....... </style>
</head>
<body>
<iframe> ........ </iframe>
<div> ..... </div>
<table>.....</table>
I have to match the full script, style and iframe blocks, leaving the rest of the document intact.
Over the last few days I've done some research and have almost got it working:
<(?:script|embed|object|frameset|frame|iframe|meta|style).+(.|\s)*?>$
I've also tried to match single-line tags (for example '<BR>'), but I'm totally confused now and the results are inconsistent; for example, some lines that close other tags are wrongly selected.
I know that the best approach would be to find out why HTMLFILE is throwing the error, but the error gives no further information to debug with.
Thanks for all the time and patience.
Here is the regex candidate:
<(script|meta|style|embed|object|frameset|frame|iframe)[\s\S]*?<\/(script|meta|style|embed|object|frameset|frame|iframe)>
DEMO with explanation
EDIT
Updated to use a lazy match: [\s\S]*?
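For illustration, here is a minimal Python sketch of the same idea. It is a variant, not the exact regex above: it uses a backreference so that the closing tag must match the opening one, and it handles tags that normally have no closing tag (such as meta) with a second pattern. The tag lists, the sample HTML and the function name are assumptions made for the example, and the sketch will misbehave on attributes that contain a literal ">".

```python
import re

# Assumed tag lists for the example; adjust to whatever HTMLFILE rejects.
PAIRED_TAGS = "script|style|iframe|object|embed|frameset|frame"
VOID_TAGS = "meta|embed|frame"

# <tag ...> ... </tag>, lazily, with the closing tag tied to the opening one.
PAIRED = re.compile(r"<(%s)\b[^>]*>[\s\S]*?</\1\s*>" % PAIRED_TAGS, re.IGNORECASE)
# Stand-alone tags that have no closing counterpart.
VOID = re.compile(r"<(?:%s)\b[^>]*>" % VOID_TAGS, re.IGNORECASE)


def strip_blocks(html):
    """Remove the unwanted blocks and leave the rest of the document intact."""
    html = PAIRED.sub("", html)
    return VOID.sub("", html)


SAMPLE = (
    "<head>\n<meta charset=\"utf-8\">\n<script> var a = 1; </script>\n"
    "<style> p {} </style>\n</head>\n<body>\n<iframe src='x'> inner </iframe>\n"
    "<div> keep </div>\n<table><tr><td>1</td></tr></table>\n</body>"
)

cleaned = strip_blocks(SAMPLE)
print(cleaned)
```

Running the paired pattern first means that an embed or frame with a proper closing tag is removed as a whole block, while a stand-alone one is caught by the second pass.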
Regex is not the best tool for this (take a look here), but if you really want to use one, I think that in simple cases a single regex can cover all tags, including nested ones:
(?=(<([^>]+)>([\s\S]*?)<\/\2>))
DEMO
The 1st group holds the whole captured element, the 2nd group captures just the tag, and the 3rd group captures the content of the tag. Because the whole pattern is a lookahead, it doesn't actually match any text; it only captures fragments. However, you can probably get the start/end index of each capture and use it as you want.
I still think you should reconsider using regex, but the syntax used above is quite useful, so it is worth knowing how to use it.
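As a rough illustration of how the three groups and their positions can be read, here is a small Python sketch using the same lookahead pattern. The sample string and the function name are made up for the example. Note that the pattern only pairs tags without attributes, because group 2 captures everything between "<" and ">".

```python
import re

# Same pattern as above: a zero-width lookahead with three capture groups.
ELEMENT = re.compile(r"(?=(<([^>]+)>([\s\S]*?)</\2>))")


def find_elements(html):
    """Return (tag, content, (start, end)) for every element, nested ones too."""
    return [(m.group(2), m.group(3), m.span(1)) for m in ELEMENT.finditer(html)]


found = find_elements("<div><b>x</b></div>")
for tag, content, span in found:
    print(tag, repr(content), span)
```

Because the lookahead consumes nothing, the search resumes one character later each time, which is what lets it report the inner element as well as the outer one.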
The website uses AJAX to load some data. When DocumentCompleted fires, I only get the HTML code without the AJAX data.
How can I get the AJAX data through WebKit.NET?
Thanks.
I've just recently fought with this myself and have what should be a working solution. I've not tried it with ajax, but I have used it after creating and appending DOM elements from C# and it produces the full code where DocumentText only produces the original unmodified HTML.
var fullHTML = webKitBrowser1.StringByEvaluatingJavaScriptFromString("document.getElementsByTagName('html')[0].outerHTML");
The only limitation to this method that I've seen is that it does not include the doctype tag if there is one, but everything else is there.
Variations on my problem have been discussed elsewhere, so I hope I'm not duplicating!
I'm implementing a simple Private Messaging System as part of a web app. I've got an annoying problem though when dynamically inserting text into a textarea box in order to make a reply. Getting the content and displaying it is fine, but I can't work out how to format it correctly.
Obviously, I can't use html tags, but plain text formatting like line breaks and carriage returns seem to be ignored too.
This happens when an existing message is being displayed either as part of a reply or as a thread in a new message.
How do I check what formatting is being saved in my db? Or indeed what formatting is being sent back from my db?!
What about using some form of HTML editor for the replies? Save the HTML in the database and show it again in the editor on your web site.
Check this wiki page for a list of possible editors
UPDATE:
Thanks for your replies, but I've worked it out. I was playing around and realised the problem was at the stage of sending the data to the db. I passed the text through the nl2br() function before sending it to the db and this seems(!) to have done the trick!
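PHP's nl2br() inserts an HTML line break before each newline in a string. To make that concrete, and to show one way of seeing which line-break characters a stored message really contains, here is a rough Python sketch. The function is an illustrative approximation, not PHP's implementation, and the sample message is made up.

```python
import re


def nl2br(text):
    """Insert '<br />' before each line break, keeping the original break."""
    return re.sub(r"(\r\n|\n\r|\n|\r)", r"<br />\1", text)


# Hypothetical message as it might come back from the database.
message = "Hi,\r\nthanks for your reply.\nBye"

# repr() makes the otherwise invisible \r and \n characters visible.
print(repr(message))
print(repr(nl2br(message)))
```

Printing the escaped form of the string before and after conversion is a quick way to check what formatting is actually being stored and returned.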
I'm just curious if anyone has any tricks on how to keep source code looking good when you "View Source." I'm militant about keeping my code well formatted and spaced while I'm developing and I tend to "View Source" a lot to double check the output (when firebug is overkill). When I start using RenderPartials and RenderActions and anything in the tag it gets pretty messy.
I don't want to send too many extra characters to the browser to keep file size efficient but is there a way to force the xhtml/html to do a newline or tab? I tried a couple of things that didn't work. Thanks!
Get over it.
Don't worry about how it looks in 'view source'; worry about how it looks in csharp :) If you get worried about the efficiency of the HTML you can gzip it, and other such things.
I use Firefox's ViewSourceWith extension to view the source in a code editor (in my case SciTE), where I have a macro set up so that pressing Ctrl-1 reformats the HTML using a script I've written.
If validation is the goal, then consider using an HTML validator rather than your eyeballs. Total Validator looks good.
Just send a \n and it should come out as a newline in the "view-source" section of the browser.
Example:
public static String Etc(...)
{
    TagBuilder myTag = new TagBuilder("span");
    myTag.SetInnerText("I'm mr. tag-content!");
    return myTag.ToString(TagRenderMode.Normal) + Environment.NewLine;
}
I've used WindowsHost to host a WebBrowser control, which let me access the WebBrowser's Document/DOM directly, read HTML content via mouse clicks on HTML document elements, and invoke submits on forms. When I was searching at the time, I never found a way to do this, even in .NET 3.5. I've now found this post, http://rhizohm.net/irhetoric/blog/72/default.aspx, and it looks like some magic casting can expose the DOM. But my question is: has anyone done this, and once you have the DOM, is it possible to invoke submits that send content to HTML forms and also get HTML elements via mouse-click events?
Has anyone tried this and been able to do both?
Thanks
I'm using WPF.
Add a reference to:
Microsoft.mshtml
Then:
var doc = ( mshtml.HTMLDocument )_wbOne.Document;
This gives you the raw string:
doc.documentElement.innerHTML
In return, if you know how to get information out of the HTML document, I'd appreciate it.
for example get all the s and and the metas and whatever else might be gettable so i can get the information from them? i don't want to dink around with the html, just get the info from them...:-)