How to detect HTML in clipboard data using Qt - html

I have a rich text editor I'm working on where I need to parse and clean data from the clipboard when appropriate. Whenever the text being pasted contains HTML, I will clean it up and update the text field with the correct html.
However, when there is no html in the clipboard, there is no need for me to run the html cleaning tool.
My first thought was to use Regex and check for any html tag in there, but I'm not sure this is the best solution for this problem as it can cause more headaches in the long run with false positives, etc.
My question is, how can I detect some HTML in the clipboard?
Is there a an elegant way to solve this problem without having to resort to Regex?

may be one of these functions:
bool QDomDocument::setContent(...)
This function reads the XML document from the string text, returning true if the content was successfully parsed; otherwise returns false. Since text is already a Unicode string, no encoding detection is done
Addition for a clipboard's mixed data:
// get a html data from a junk
QString htmlText = cliboardString.section("</html>", -2, 0,QString::SectionIncludeTrailingSep)
.section("<html", 1,-1,String::SectionIncludeLeadingSep);
// check for a validness, correctness etc.
if( !htmlText.isEmpty() ) {
QDomDocument::setContent(htmlText,...
}

Related

write_html() method in fpdf not using font/encoding specified

I'm creating a PDF with a large collection of quotes that I've imported into python with docx2python, using html=True so that they have some tags. I've done some processing to them so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF to UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method at the very bottom of the left hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote) #[![a section of quote giving trouble with quotations][2]][2]
The list of quotes that I have going into the pdf all appear with their special characters and the html tags (<u> or <i>) in the debugger, but after the .write_html() step they then show up in the pdf file with mojibake, even before being saved, as seen through debugger. An example being "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italicize tags into the PDF as text. The .write() method does preserve the special characters. I'm not trying to change the font really, but make sure that what's written onto the PDF preserves the special characters instead of mojibake them.
I've also attempted some .encode/.decode action before going into the .write_html(), as well as attempted some methods from the ftfy library. And tried adding '' to the start of each quote to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or if I'm at a dead-end and should just split each quote on '<', use if statements to detect underlines, italicize, etc., and use the .write() method after all.
Extract docx to html works really bad with docx2python. I do this few month ago. I recommend PyDocX. docx2python are good for docx file content extracting, not converting it into a html.

How to Html Encode but preserve some elements like <span> tag in ASP.NET?

I am writing an application in which we are consuming a service which returns the results.
The result is having some HTML elements which are to be printed as it is.
Sample Result : {values ="Lorem impsum loren ipsum <span class=\"boldc\">bold value </span> lorenipsum .....}
Now this has to be displayed in an ASP.NET page. If I HTML encode, the span gets encoded I cannot make the items bold as desired.
#Html.Raw(Message) - this works but opens all the vulnerabilities and is dangerous.
What is the best way to handle this scenario ? Is there any way in which I can print these HTMl characters ; yet have the safety ?
Unfortunately, no. There's no way you can achieve displaying decoded-HTML without being exposed to any other vulnerabilities with the default behaviour. You either decode or encode this html string. You could however do your own parsing after the HTML has been decoded in order to disable or more likely remove any dangerous mark-up such as script, iframe, form tags, etc
I know this question is 7 years old, but I just found a way to do it and wanted to share.
The solution to this problem can be found in this answer and it uses regex. There's a more detailed version here
So basically you have to use a Regex that captures what you want to discard (in this case the HTML tags) and what you want to encode.
I came up with this regex and it worked great:
var regex = new Regex(#"(?:<(?:\/?)(?:strong|b|em|i|u|font|sub|sup|a href|ol|ul|li|div|span|p|br|center)(?:\s(?:\w|\s)*)?>)|(.)");
return regex.Replace(value, delegate (Match m) {
if (m.Groups[1].Value == "") return m.Value;
else return HttpUtility.HtmlEncode(m.Value);
});
It has the downside that you have to keep a list of all supported tags, but it's a regex that is easy to understand and maintain.
It can be simpler, if you don't check the html tags.
var regex = new Regex(#"<\/?[A-Za-z]+[^<>]*>|(.)");

How can I remove or escape new line-carriage returns within an XML string in XSL?

I've got an ASP multiline textbox that saves user defined text to a database. When I retrieve the data, I serialize it into xml and then run it through an XSL transform to output my HTML.
Within my transform, I am passing the textbox defined data into a javascript function via an onclick event of an element.
The problem I'm running into...when a user enters a carriage return into the textbox and saves it to the database, a javascript error is generated on page load.
I'm using .NET's XslCompiledTransform to do the transform. There is a property on XmlDocument called PreserveWhiteSpace, default is false, that can be set to strip out white space in the XML. This solves the problem of not allowing a user to enter breaking text, however, the client wants to preserve the formatting of the text that they enter if at all possible.
From what I know, .NET XslCompiledTransform transforms carriage returns-new line into
. I believe these are the characters that are breaking the javascript.
My first thought was to strip out the carriage returns within the xsl prior to passing the string into the javascript function, but I've not been able to figure out what characters to "search" the string for.
I guess another question is what characters get stored in SQL for carriage returns from within an ASP.NET textbox control?
Looking directly at the data in the database, SQL seems to display the character as "non-displayable" characters or (2 empty boxes).
Has anyone had any experience with this type of thing?
I was able to do this in the code behind to get my desired results:
using (StringWriter sWriter = new StringWriter())
{
xTrans.Transform(xDoc, xslArgs, sWriter);
return sWriter.ToString().Replace("
", "\\r\\n");
}
One other thing that I've stumbled across...
Initially, I wanted to find a solution to this problem that did not require a "compiled" code change, ie. a way to do this within xsl aka "a short term fix".
I tried this first and was not successful...
<xsl:variable name="comment" select="normalize-space(.\Comment)" />
This essentially did nothing and I was still receiving the javascript error.
Eventually, I tried this...
<div onclick="Show('{normalize-space($comment)}'"></div>
The second actually worked in stripping out the white space, thus, the javascript error was avoided. This wasn't a complete solution for my requirements because it only solved the issue of the javascript error, however, it would effectively prevent the user from "breaking" the page.
For that reason, it could suffice as a short term solution.

mail-merge HTML from a database into MS Word

project: Using VB.NET to build a winforms database interface and work-automation app.
I am using this editor for the users to enter their text in the database interface environment that will both load/save/show them what they are working on in the form and also mail-merge into a Word document waiting for the content. I can do the first step and it works well, but how do I get MS Word to recognize HTML as formatting instead of just merging in tags and text all as text?
The tool has two relevant properties: one to get just the text (no markup, i.e. no HTML) and one to get the full markup with HTML. Both of these are in text format (which I use for easy storage in the Database).
ideas/directions I can think of:
1) use the clipboard. I can copy/paste the content straight from the editor window to Word and it works great! But loading from a database is significantly different, even when using the clipboard programatically. (maybe I don't understand how to use the clipboard tools)
2) maybe there is a library or class/function in Word that can understand the HTML as "mergable" content?
thanks!
:-Dan
You may use our (SautinSoft) .Net library to transform each of your HTML data to Word document.
Next you may merge all produced Word documents into single Word document. The component also have function to merge Word documents.
This is link download the component: http://www.sautinsoft.com/products/html-to-rtf/download.php
This is a sample code to transform HTML to Word document in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfString As String = ""
rtfString = h.ConvertString(htmlString)
This is a sample code to merge two documents in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfSingle As String = ""
rtfSingle = h.MergeRtfString(rtf1, rtf2)
I ended up using the clipboard to set the text. Here is a code sample that I needed to answer this question.
Clipboard.SetText(Me._Object.Property, TextDataFormat.Rtf)
I just didn't know how to tell the computer that the content was HTML or RTF etc. It turned out to be simple.
:-Dan

Source text contains simple HTML. How can I simply format the text in MS Word?

I've inherited a project that stores basic HTML formatting (i.e. - <b>, <i> tags) in a database and writes it out to a Word document. This is my first Word automation assignment, so be gentle!
Currently, there is a complicated function that runs after the document is complete that searches and replaces these tags. However, as this is run after the document is complete, any logic that is determined at run time (i.e. - insert page break here) can lead to disastrous results. For example, if I have a large chunk of bolded text, this bold text takes up more space and pushes the line break down to the next page, resulting in a mostly blank page.
I believe the fix for this is to format the text as it comes from the database so the positioning logic will be correct. I don't want to call the complicated procedure multiple times as it is time consuming and our end users need this document as quickly as possible.
Is there an easy way to write HTML formatted text to a Word document without needing to find and replace every supported tag? I would think that there would be something within Word that could handle this automatically. Thanks in advance if you can point me in the right direction.
Try this:
First, save the HTML you are about to insert as an ordinary ".htm" file.
Then use the Range object and it's InsertFile method to insert the ".htm" file at any given position:
Dim r As Range
Set r = ActiveDocument.Range
r.InsertFile FileName:=TempFilePath, Link:=False, ConfirmConversions:=False
Word should be smart enough to handle the HTML and do all of the format conversion on it's own. Use CSS to control the finer parts of the formatting.
Delete the ".htm" file when done.
maybe you can invoke an embedded IE (IWebBrowser2) to layout the text, then copy to clipboard as richtext, and finally paste to Word as RichText (formatted).