Can Excel functions recognize bold text? - html

For convenience in a work-related task, I need to convert text styling into HTML. For example, if I have the sentence "the sky is Blue" in an MS Word .doc document, I want to be able to copy it into Excel and have the bold portion wrapped in HTML tags.
The question is: can Excel functions detect text styles, and if so, which function would be the right one? I was thinking of SUBSTITUTE, but I'm not so sure anymore.
Any help would be appreciated!

I think this is better done in Word before you copy the text to Excel. I found this article about it (https://word.tips.net/T001904_Adding_Tags_to_Text.html). Basically, you use Find and Replace: set the format you are searching for (such as italic) and replace each match with tags like this:
<i>^&</i>
The ^& part tells Word to include the string it found, so you do not lose the content; the tags are added before and after each string that has the given format.
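The same wrap-each-formatted-run idea can be sketched outside Word. The following is only an illustration: it assumes the document has already been reduced to a list of (text, bold, italic) runs by some other tool, and the function name runs_to_html is hypothetical.

```python
from html import escape

def runs_to_html(runs):
    """Wrap formatted runs in HTML tags.

    `runs` is an assumed input shape: (text, bold, italic) tuples in
    document order, produced by whatever tool reads the .doc file.
    """
    out = []
    for text, bold, italic in runs:
        piece = escape(text)  # keep literal <, > and & from breaking the markup
        if italic:
            piece = f"<i>{piece}</i>"
        if bold:
            piece = f"<b>{piece}</b>"
        out.append(piece)
    return "".join(out)

print(runs_to_html([("the sky is ", False, False), ("Blue", True, False)]))
```

Like ^& in Find and Replace, the original text is kept and only the surrounding tags are added.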

Related

write_html() method in fpdf not using font/encoding specified

I'm creating a PDF from a large collection of quotes that I've imported into Python with docx2python, using html=True so that they carry some tags. I've processed them so they only really have the bold, italic, underline, or break tags. I've sorted them and am trying to write them to a PDF with the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes from several special characters, so I am hoping to encode the PDF as UTF-8. To use .write_html(), I had to create a new class, as shown in their readthedocs under the .write_html() method at the very bottom of the left-hand side:
from fpdf import FPDF, HTMLMixin

class htmlFPDF(FPDF, HTMLMixin):
    pass

pdf = htmlFPDF()
pdf.add_page()
# set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote)
The quotes going into the PDF all appear with their special characters and HTML tags (<u> or <i>) intact in the debugger, but after the .write_html() step they show up in the PDF with mojibake, even before the file is saved, as seen in the debugger. An example is "dayâ€ÂTMs demands", when it should be "day's demands" (the apostrophe is a curly one in the quote, but this text box doesn't support it).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() call, but this doesn't change the current font (or fix the mojibake) in the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italic tags into the PDF as literal text. The .write() method does preserve the special characters. I'm not really trying to change the font; I just want what's written to the PDF to keep the special characters instead of turning them into mojibake.
I've also attempted some .encode/.decode calls before .write_html(), tried some methods from the ftfy library, and tried adding '' to the start of each quote, all to no effect.
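For context, here is a minimal sketch of how this kind of mojibake typically arises and how an encode/decode round trip can reverse it. It assumes a single UTF-8 to Windows-1252 mix-up; the garbled sample above looks as if it passed through more than one such step, so a single round trip may not be enough there. This is the kind of repair ftfy tries to automate.

```python
original = "day\u2019s demands"  # curly apostrophe, U+2019

# UTF-8 bytes misread as Windows-1252 turn one character into three.
garbled = original.encode("utf-8").decode("cp1252")

# Undoing the same two steps restores the text, provided no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```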
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or perhaps I'm at a dead end and should just split each quote on '<', use if statements to detect underline, italics, etc., and use the .write() method after all.
Converting docx to HTML works really badly with docx2python. I did this a few months ago, and I recommend PyDocX instead. docx2python is good for extracting the content of a docx file, not for converting it to HTML.
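The fallback mentioned in the question (splitting each quote on its tags and calling .write() per piece) could look roughly like the sketch below. It assumes only <b>, <i>, <u> and <br> remain in the quotes; the helper name split_styled is hypothetical, and the style strings follow the "B"/"I"/"U" convention that fpdf's set_font uses.

```python
import re

# Matches <b>, </b>, <i>, </i>, <u>, </u>, and <br> / <br/> (assumed to be the only tags left).
TAG_RE = re.compile(r"<(/?)([biu])>|<br\s*/?>", re.IGNORECASE)

def split_styled(quote):
    """Return (text, style) pairs, where style is e.g. "", "B", "I", or "BU"."""
    active = set()   # styles currently switched on
    segments = []
    pos = 0
    for match in TAG_RE.finditer(quote):
        if match.start() > pos:
            segments.append((quote[pos:match.start()], "".join(sorted(active))))
        if match.group(2) is None:        # a <br> tag: emit a line break
            segments.append(("\n", "".join(sorted(active))))
        elif match.group(1):              # closing tag: switch the style off
            active.discard(match.group(2).upper())
        else:                             # opening tag: switch the style on
            active.add(match.group(2).upper())
        pos = match.end()
    if pos < len(quote):
        segments.append((quote[pos:], "".join(sorted(active))))
    return segments

for text, style in split_styled("day\u2019s <i>demands</i><br>are <b><u>many</u></b>"):
    print(repr(text), repr(style))
```

Each pair could then be passed to pdf.set_font(..., style, ...) followed by pdf.write(...). That step is left out here because it assumes a bold and an italic variant of the Unicode font have also been registered with add_font.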

HTML/RTF string in to RTF file

Does anybody know how to insert a formatted text string into an RTF file?
I am able to insert plain text into an RTF file (at any place in the document I want), but not formatted strings.
I know that when such a string is added to an RTF file, the RTF header also has to be updated, and that is the problem: I need to find out what should be placed in the RTF header and exactly where. Maybe there is a ready-made solution, but so far I cannot find one anywhere.
Normally I work with Java, but the problem is not necessarily tied to any language.
The .rtf specification describes what a valid header must contain. I hope this helps you get validly formatted output.
Plain text in an .rtf file, without any valid formatting syntax, will render as nothing more than that plain text.
Another way to get a neatly formatted rich text document is to use an .rtf editor or an .rtf library for the programming language you are using.
Instead of dealing with the rtf property, you can use the Text property. Set the cursor to where you want to insert text. Then paste formatted text from another richtext box, or paste normal text and change its formatting.
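On the header question, a minimal sketch may help. It assumes the formatted string uses only <b>, <i> and <u>, which map to the RTF control words \b, \i and \ul and need nothing in the header beyond the \rtf1 prolog and a font table. Fonts or colors not already declared would need entries added to the font table or a color table in the header. The function names and the choice of Arial are illustrative, and non-ASCII text would need additional escaping that is not shown.

```python
import re

# A deliberately small, assumed subset of HTML tags and their RTF control words.
TAG_TO_RTF = {
    "<b>": r"\b ", "</b>": r"\b0 ",
    "<i>": r"\i ", "</i>": r"\i0 ",
    "<u>": r"\ul ", "</u>": r"\ulnone ",
}

def escape_rtf(text):
    # Backslash and braces are RTF syntax, so they must be escaped in body text.
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")

def html_to_rtf(html):
    # Split into tags and text; tags become control words, text is escaped.
    parts = re.split(r"(</?[biu]>)", html, flags=re.IGNORECASE)
    body = "".join(TAG_TO_RTF.get(p.lower(), escape_rtf(p)) for p in parts)
    header = r"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}"
    return header + r"\f0 " + body + "}"

print(html_to_rtf("the sky is <b>Blue</b>"))
```

To insert into an existing document rather than create a new one, only the body part would be spliced in at the chosen position, since the target file already has its own header.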

Comparison of HTML and plain text from SQL

There are two columns. One of them contains HTML and another contains plain text. How can I compare them as 2 plain texts? Converting HTML -> plain text should be done the same way as a browser does when copying selected HTML into clipboard and pasting it into notepad.
The answer to this SO question links to a user-defined function for stripping HTML tags from text. After doing this you can then compare with the plain text field, e.g.
SELECT * FROM YourTable
WHERE plainText = udf_stripHTML(htmlText)
The SQL doesn't know that one is HTML and one is not.
If you just want to compare the precise content, use = or LIKE.
If you want to remove the tags, do precisely that... remove the tags from the HTML column, and then compare the result of that to the SQL column.
When you pull the values from the database, they have whatever datatype your field contains. You can manipulate the strings any way you want in your programming language of choice (they should already be text if that is what they were stored as).
SQL 2008 (and earlier) does not contain any function or code that can "natively" convert HTML into, err, non-HTML. You either need to write such a function yourself, or find a third-party utility that can do this. (Is there application code that does this? Perhaps read the data and run it through that app?)
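If the conversion is done in application code instead, it could look roughly like this sketch, which uses Python's standard html.parser. It only approximates what a browser copies to the clipboard: block tags become whitespace, entities are decoded, and runs of whitespace are collapsed. The names TextExtractor, html_to_text and same_text are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Treat a few block-level tags as whitespace so words do not run together.
        if tag in ("br", "p", "div"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace to single spaces.
    return " ".join("".join(parser.chunks).split())

def same_text(html, plain):
    return html_to_text(html) == " ".join(plain.split())

print(same_text("<p>Hello <b>world</b> &amp; all</p>", "Hello world & all"))
```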

HTML to EXCEL -> simple question

I have an "export to Excel" function. I have some tables and it works fine, but I have one single problem.
For moving to the next line I use <br />, but what if I want to switch to the next column? What tag can I use to switch to the next column?
Thanks
Simple HTML tags are supported on a limited basis by Excel. There used to be a list of supported HTML tags as well as some HTML extensions supported by Excel (from Excel 97 onwards), but I can't find it on MSDN anymore. Here's an alternate link:
http://www.code4lifesoftware.com/articles/msexcelreadme.htm
The new XML/HTML format supported from Excel 2000 onwards is a lot more complex, and requires more work:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnoffxml/html/ofxml2k.asp
Take a look at these links, hopefully you'll find the syntax you're looking for!
In all the Excel versions where I have used this approach, there has been no way to move to another column other than using a table. You can mark up your HTML file with a layout table (although W3C does not recommend this) and place all of the nested data tables inside the main layout table. Unfortunately, there is no other way.
P.S.: Look at Excel html format: Saving and Opening HTML Files.
The BR tag has a mso-data-placement style attribute specifying where the data is stored. The attribute can have one of the following string constants: new-cell means to start a new cell in the next row after the break and same-cell means that the break is in a cell.
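A minimal sketch of the table approach: each value goes in its own <td>, so it lands in its own column when Excel opens the file, and each <tr> starts a new row. The function name is hypothetical, and the markup is kept to the bare minimum.

```python
from html import escape

def rows_to_html_table(rows):
    """Build an HTML table with one <td> per column value."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)

print(rows_to_html_table([["Name", "Qty"], ["Widget", 3]]))
```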
If you use commas and make your file a .csv, that would be one way. If you use tabs, then have it read as a tab delimited file. Basically, you need to tell Excel what your delimiter (separator character) is, and it will handle it from there.
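The delimiter approach can be sketched with Python's csv module, which also quotes values that happen to contain the delimiter. The function name is hypothetical, and the result would be saved with a .csv extension (or a tab-delimited .txt) for Excel to open.

```python
import csv
import io

def rows_to_delimited(rows, delimiter=","):
    """Serialize rows as delimited text; each value becomes one column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

print(rows_to_delimited([["Name", "Note"], ["Widget", "red, large"]]))
print(rows_to_delimited([["Name", "Note"], ["Widget", "red, large"]], delimiter="\t"))
```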

Source text contains simple HTML. How can I simply format the text in MS Word?

I've inherited a project that stores basic HTML formatting (i.e. - <b>, <i> tags) in a database and writes it out to a Word document. This is my first Word automation assignment, so be gentle!
Currently, there is a complicated function that runs after the document is complete that searches and replaces these tags. However, as this is run after the document is complete, any logic that is determined at run time (i.e. - insert page break here) can lead to disastrous results. For example, if I have a large chunk of bolded text, this bold text takes up more space and pushes the line break down to the next page, resulting in a mostly blank page.
I believe the fix for this is to format the text as it comes from the database so the positioning logic will be correct. I don't want to call the complicated procedure multiple times as it is time consuming and our end users need this document as quickly as possible.
Is there an easy way to write HTML formatted text to a Word document without needing to find and replace every supported tag? I would think that there would be something within Word that could handle this automatically. Thanks in advance if you can point me in the right direction.
Try this:
First, save the HTML you are about to insert as an ordinary ".htm" file.
Then use the Range object and its InsertFile method to insert the ".htm" file at any given position:
Dim r As Range
Set r = ActiveDocument.Range
r.InsertFile FileName:=TempFilePath, Link:=False, ConfirmConversions:=False
Word should be smart enough to handle the HTML and do all of the format conversion on its own. Use CSS to control the finer parts of the formatting.
Delete the ".htm" file when done.
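The save-then-delete steps around the insertion can be sketched as below, for the case where the calling code is not VBA. It wraps the fragment in a minimal HTML document with a UTF-8 charset declaration (an assumption about what helps Word read special characters correctly) and removes the file afterwards. The name temp_htm_file is hypothetical, and the Word call itself is not shown.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_htm_file(fragment):
    """Write an HTML fragment to a temporary .htm file and delete it on exit."""
    document = (
        '<html><head><meta charset="utf-8"></head>'
        f"<body>{fragment}</body></html>"
    )
    fd, path = tempfile.mkstemp(suffix=".htm")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(document)
        yield path          # pass this path to InsertFile (or its equivalent)
    finally:
        os.remove(path)     # runs even if the insertion step raises

with temp_htm_file("Some <b>bold</b> text") as path:
    print(path)  # the file exists only inside this block
```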
Maybe you can invoke an embedded IE control (IWebBrowser2) to lay out the text, copy it to the clipboard as rich text, and finally paste it into Word as formatted rich text.