All paragraphs are empty in an opened document in python-docx

I do the following:
from docx import Document
document = Document('text.docx')
document.paragraphs[42].text
This returns '' for whatever index I enter, and my for loop to find and replace a word does not work. But if I save the document with document.save('text2.docx'), the saved document is not empty.
The document is relatively big and contains many kinds of formatting, images, tables, and styles.
My task is to find and replace a word in a docx document and also correct the word that follows it, so I would be glad if you could suggest another tool.

I ran into this problem and was able to read the document using docx2txt: https://pypi.org/project/docx2txt/
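If installing another package is not an option, here is a minimal stdlib-only sketch of the same "just get me the text" idea. It assumes the text lives in word/document.xml (headers, footers, and footnotes are stored in other parts and are not read here), and docx_paragraph_texts is a made-up helper name, not part of python-docx or docx2txt. It walks every w:p element at any depth, so paragraphs inside tables are included too.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraph_texts(path):
    """Return the text of every w:p element in word/document.xml, at any depth.

    Hypothetical helper for illustration. A paragraph that contains a nested
    paragraph (for example a text box) will also include the nested text.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    texts = []
    for p in root.iter(f"{{{W}}}p"):
        # Concatenate all text runs (w:t) found anywhere inside this paragraph
        texts.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return texts
```

This only reads text. Writing a replacement back while keeping the formatting is a separate problem, because one word can be split across several runs.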

Related

How to encode a hyperlink in CSV formatted file?

When I try to encode an HTML anchor link in a CSV file cell, it becomes corrupted and is not readable by Excel.
Is there some sort of non-HTML solution or format to encode a hyperlink in a CSV file cell?
For when automagicalism doesn't work, and you're definitely using Excel, use this as the field content.
=HYPERLINK("http://stackoverflow.com")
This worked for me:
Use the =HYPERLINK function. The first parameter is the web link; the second is the text displayed in the cell.
Put " quotes around the entire function.
Escape the quotes inside the function by doubling them, i.e., ""
Here's a four-column comma-delimited example.csv:
5,6,"=HYPERLINK(""http://www.yahoo.com"";""See Yahoo"")",8
When a spreadsheet program (LibreOffice, etc.) opens this .csv, it creates an active link for you.
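If the CSV is generated from a script, the quoting rules above do not have to be applied by hand. Here is a small sketch using Python's csv module, which doubles the embedded quotes automatically. The helper name hyperlink_cell and its arg_sep parameter are made up for illustration; the example line above uses a semicolon as the argument separator, and which separator a given spreadsheet accepts is an assumption you would need to check.

```python
import csv
import io

def hyperlink_cell(url, label, arg_sep=","):
    """Build the formula text for one cell (hypothetical helper).

    arg_sep is the separator between the two HYPERLINK arguments;
    the thread shows ';' working where ',' did not.
    """
    return f'=HYPERLINK("{url}"{arg_sep}"{label}")'

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
# The writer wraps the formula in quotes and doubles the inner quotes for us
writer.writerow([5, 6, hyperlink_cell("http://www.yahoo.com", "See Yahoo", arg_sep=";"), 8])
csv_text = buf.getvalue()
print(csv_text, end="")
```

The printed line matches the hand-written example.csv row above.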
What worked for me in Excel 2003 was writing this statement to the CSV:
CELLVALUE="=HYPERLINK("+QM+URLCONTENTS+QM+";"+QM+"URLDISPLAYNAME"+QM+")"
Note the semicolon (;) used in the hyperlink. I found that the comma did not work for me in Excel 2003.
Depending on the script or language you use, quotation marks may be handled differently. The cell value you put into the CSV before you import it into Excel should look exactly like this: "=HYPERLINK("URLCONTENTS";"URLDISPLAYNAME")"
where:
CELLVALUE is the output written to the CSV
QM is the double-quote character ["] (ASCII 34)
URLCONTENTS is the full URL of the page you want to link to
URLDISPLAYNAME is the text you see in the Excel cell
You can also use relative paths and set a base location in Excel.
File/Properties > Tab Summary > Field Hyperlink Base.
As the field value, use something like http://www.SITENAME.com/SUB_LOCATION/../SUB_LOCATION, which sets your starting point so you can click the link in Excel. Of course, you don't have to use SUB_LOCATIONs if the site name alone already resolves your relative path.
What I couldn't find is how to make the links automatically underlined in Excel. A tip found elsewhere:
Manually format all link cells as underlined and dark blue (for example); the standard behavior then appears, with already visited links turning another color.
A CSV file is simply text - it's up to the loading program how it chooses to interpret the text.
If Excel is complaining when you feed it "Link", "another cell" then try just having the raw URL and you might find Excel will automagically turn it into a link.
But in general Excel doesn't process HTML, so expecting it to render HTML from a CSV file is asking too much.

mail-merge HTML from a database into MS Word

project: Using VB.NET to build a winforms database interface and work-automation app.
I am using this editor for the users to enter their text in the database interface environment that will both load/save/show them what they are working on in the form and also mail-merge into a Word document waiting for the content. I can do the first step and it works well, but how do I get MS Word to recognize HTML as formatting instead of just merging in tags and text all as text?
The tool has two relevant properties: one to get just the text (no markup, i.e. no HTML) and one to get the full markup with HTML. Both of these are in text format (which I use for easy storage in the Database).
ideas/directions I can think of:
1) use the clipboard. I can copy/paste the content straight from the editor window to Word and it works great! But loading from a database is significantly different, even when using the clipboard programmatically. (maybe I don't understand how to use the clipboard tools)
2) maybe there is a library or class/function in Word that can understand the HTML as "mergeable" content?
thanks!
:-Dan
You may use our (SautinSoft) .NET library to convert each piece of your HTML data to a Word document.
Next, you can merge all the resulting Word documents into a single Word document. The component also has a function to merge Word documents.
Download link for the component: http://www.sautinsoft.com/products/html-to-rtf/download.php
Sample code to convert HTML to a Word document in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfString As String = ""
rtfString = h.ConvertString(htmlString)
Sample code to merge two documents in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfSingle As String = ""
rtfSingle = h.MergeRtfString(rtf1, rtf2)
I ended up using the clipboard to set the text. Here is the code sample that answered my question.
Clipboard.SetText(Me._Object.Property, TextDataFormat.Rtf)
I just didn't know how to tell the computer that the content was HTML or RTF etc. It turned out to be simple.
:-Dan

How can I convert an OpenOffice Writer document (.odt) to multiple HTML files with navigation?

I have an OpenOffice Writer document (.odt) with a table of contents, sections, subsections, etc.
Is there a quick way to convert (export) this into multiple HTML files with a navigation sidebar, converting the sections into links?
You can:
Unzip the odt, parse the XML and make the HTML file yourself.
Use OpenOffice to export the document to HTML.
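For the first option (unzip and parse the XML yourself), here is a rough sketch of what that could look like. It assumes the content sits in content.xml as top-level text:h and text:p elements and that a level-1 heading starts a new page; lists, tables, images, and nested sections are simply ignored. The function names and the section1.html naming scheme are made up for illustration.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml
OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def split_odt_sections(odt_path):
    """Return [(title, [paragraph, ...]), ...], split at level-1 headings."""
    with zipfile.ZipFile(odt_path) as z:
        root = ET.fromstring(z.read("content.xml"))
    body = root.find(f"{{{OFFICE}}}body/{{{OFFICE}}}text")
    sections = []
    for el in body:
        content = "".join(el.itertext()).strip()
        is_heading = el.tag == f"{{{TEXT}}}h"
        if is_heading and el.get(f"{{{TEXT}}}outline-level", "1") == "1":
            sections.append((content, []))  # a level-1 heading opens a new page
        elif (is_heading or el.tag == f"{{{TEXT}}}p") and content:
            if not sections:  # text that appears before the first heading
                sections.append(("Introduction", []))
            sections[-1][1].append(content)
    return sections

def render_pages(sections):
    """Return {filename: html_text}, each page carrying the same navigation list."""
    names = [f"section{i + 1}.html" for i in range(len(sections))]
    nav = "".join(
        f'<li><a href="{name}">{html.escape(title)}</a></li>'
        for name, (title, _) in zip(names, sections)
    )
    pages = {}
    for name, (title, paras) in zip(names, sections):
        body = "".join(f"<p>{html.escape(p)}</p>" for p in paras)
        pages[name] = (
            f"<html><body><ul>{nav}</ul>"
            f"<h1>{html.escape(title)}</h1>{body}</body></html>"
        )
    return pages
```

Writing each entry of the returned dict to disk gives one HTML file per top-level section, with the same link list on every page. Subheadings are flattened into plain paragraphs in this sketch.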
There are several ways to export HTML from OpenOffice or LibreOffice:
Use File > Export, then select file type XHTML. However, this creates one big HTML file, not multiple files.
Use File > Save as, then select file type HTML document. This creates one big HTML file which is similar but not identical to the one above.
Use File > Send > Create HTML document. In the following dialog, you can select a style used in the document based on which the document is split into multiple HTML files. However, I did not get this to work properly. My document is always split on level 1, no matter what I selected here.
Use File > Wizards > Web page. You will get multiple settings to choose from. However, this does not work at all for me. It either fails completely or it does not produce the expected output.
The last two solutions were found on the OpenOffice Wiki at https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/Saving_Writer_documents_as_web_pages
In conclusion, I cannot provide a complete solution. I am still looking for a good way to solve this problem.

HTML to EXCEL -> simple question

I have an "export to Excel" function. I have some tables and it works fine, but I have one problem.
For moving to the next line I use <br />, but what if I want to switch to the next column? What tag can I use to switch to the next column?
Thanks
Simple HTML tags are supported on a limited basis by Excel. There used to be a list of supported HTML tags as well as some HTML extensions supported by Excel (from Excel 97 onwards), but I can't find it on MSDN anymore. Here's an alternate link:
http://www.code4lifesoftware.com/articles/msexcelreadme.htm
The new XML/HTML format supported from Excel 2000 onwards is a lot more complex, and requires more work:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnoffxml/html/ofxml2k.asp
Take a look at these links, hopefully you'll find the syntax you're looking for!
In all the Excel versions where I used this approach, there was no way to move to another column other than using a table. You can mark up your HTML file with a table layout (although this is not recommended by W3C) and place all of the nested data tables inside the main layout table. Unfortunately, there is no other way.
P.S.: Look at Excel html format: Saving and Opening HTML Files.
The BR tag has a mso-data-placement style attribute specifying where the data is stored. The attribute can have one of the following string constants: new-cell means to start a new cell in the next row after the break and same-cell means that the break is in a cell.
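To make the two points above concrete (a td per column, and a BR styled with mso-data-placement:same-cell for a line break inside a cell), here is a small sketch that builds such a table from rows of values. The helper name excel_html_table is made up, and treating "\n" in a value as an in-cell line break is an assumption of this example.

```python
import html

def excel_html_table(rows):
    """Render rows as an HTML table; '\n' inside a value becomes an in-cell break.

    Hypothetical helper: each value goes into its own <td>, so moving to the
    next column is just moving to the next cell.
    """
    br = '<br style="mso-data-placement:same-cell;" />'
    out = ["<table>"]
    for row in rows:
        cells = "".join(
            "<td>" + html.escape(str(value)).replace("\n", br) + "</td>"
            for value in row
        )
        out.append("<tr>" + cells + "</tr>")
    out.append("</table>")
    return "\n".join(out)

table_html = excel_html_table([["line one\nline two", "next column"]])
```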
If you use commas and make your file a .csv, that would be one way. If you use tabs, then have it read as a tab delimited file. Basically, you need to tell Excel what your delimiter (separator character) is, and it will handle it from there.
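A quick sketch of the delimiter idea, assuming the export can be produced from a script: the same rows written once comma-delimited and once tab-delimited with Python's csv module. The helper name rows_to_delimited is made up for illustration.

```python
import csv
import io

def rows_to_delimited(rows, delimiter=","):
    """Serialize rows as delimited text; each value in a row lands in its own column."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()

rows = [["Name", "Qty"], ["Widget, large", 3]]
csv_text = rows_to_delimited(rows)                  # comma-delimited (.csv)
tsv_text = rows_to_delimited(rows, delimiter="\t")  # tab-delimited
```

Note that a value containing the delimiter gets quoted automatically, which is why the comma version wraps "Widget, large" in quotes and the tab version does not.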

Source text contains simple HTML. How can I simply format the text in MS Word?

I've inherited a project that stores basic HTML formatting (i.e. - <b>, <i> tags) in a database and writes it out to a Word document. This is my first Word automation assignment, so be gentle!
Currently, there is a complicated function that runs after the document is complete that searches and replaces these tags. However, as this is run after the document is complete, any logic that is determined at run time (i.e. - insert page break here) can lead to disastrous results. For example, if I have a large chunk of bolded text, this bold text takes up more space and pushes the line break down to the next page, resulting in a mostly blank page.
I believe the fix for this is to format the text as it comes from the database so the positioning logic will be correct. I don't want to call the complicated procedure multiple times as it is time consuming and our end users need this document as quickly as possible.
Is there an easy way to write HTML formatted text to a Word document without needing to find and replace every supported tag? I would think that there would be something within Word that could handle this automatically. Thanks in advance if you can point me in the right direction.
Try this:
First, save the HTML you are about to insert as an ordinary ".htm" file.
Then use the Range object and its InsertFile method to insert the ".htm" file at any given position:
Dim r As Range
Set r = ActiveDocument.Range
r.InsertFile FileName:=TempFilePath, Link:=False, ConfirmConversions:=False
Word should be smart enough to handle the HTML and do all of the format conversion on its own. Use CSS to control the finer parts of the formatting.
Delete the ".htm" file when done.
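The "save as .htm" step can be done from whatever language produces the text. Here is a minimal sketch that wraps a database fragment in a full HTML page with a style block and writes it to a temporary file. The helper name write_temp_htm and the default CSS are made up for illustration, and which CSS rules Word actually honors is not something this sketch guarantees. The returned path is what you would pass as FileName to InsertFile.

```python
import os
import tempfile

def write_temp_htm(fragment, css="body { font-family: Arial; font-size: 11pt; }"):
    """Wrap an HTML fragment in a complete page and save it as a temporary .htm file.

    Hypothetical helper: returns the file path; the caller deletes the file when done.
    """
    page = (
        '<html><head><meta charset="utf-8">'
        f"<style>{css}</style></head><body>{fragment}</body></html>"
    )
    fd, path = tempfile.mkstemp(suffix=".htm")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(page)
    return path
```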
Maybe you can invoke an embedded IE (IWebBrowser2) to lay out the text, then copy it to the clipboard as rich text, and finally paste it into Word as formatted rich text.