I am working on a project which has a requirement to convert specific paragraphs in a Word document to HTML. I will have the Range object of the para or paras; from that range I can get the WordOpenXML, and I want to convert that to HTML. (It should not have the html, head, or body tags, as it is not a full document but just a small HTML chunk.)
I have seen Eric White's Open XML articles; he has written great articles on this topic, and PowerTools for Open XML has an HTML converter which converts an entire document to HTML. My requirement is to convert a specific para or range to HTML. Can anyone guide me in the right direction?
For example, if a Word document has:
This is para1.
This is para2.
This is para3.
My requirement is to convert para2, which is available to me as a para object. So basically I am looking to write a function like:
public string WordOpenXMLToHtml(string sWordOpenXML) {
    // do the transformation
    return sHtml;
}
You could try the HtmlConverter class. More info here: Transforming Open XML WordprocessingML to XHTML Using the Open XML SDK 2.0.
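A rough sketch of what that function could look like with the PowerTools HtmlConverter is below. It assumes a version of the Open XML SDK that exposes WordprocessingDocument.FromFlatOpcString (Range.WordOpenXML returns a Flat OPC package string) and that OpenXmlPowerTools is referenced; treat it as a starting point rather than a drop-in implementation.

// Sketch only: convert the Flat OPC string from Range.WordOpenXML into an
// HTML fragment (no html/head/body wrapper). Names and settings here are
// illustrative; adjust to the PowerTools version you are using.
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class WordHtml
{
    public static string WordOpenXMLToHtml(string sWordOpenXML)
    {
        // Range.WordOpenXML is a "Flat OPC" package containing only the
        // selected paragraphs plus their supporting parts, so opening it in
        // memory gives a small document with just the content to convert.
        using (WordprocessingDocument wordDoc =
                   WordprocessingDocument.FromFlatOpcString(sWordOpenXML))
        {
            var settings = new HtmlConverterSettings { PageTitle = string.Empty };
            XElement html = HtmlConverter.ConvertToHtml(wordDoc, settings);

            // ConvertToHtml returns a full XHTML tree; keep only the children
            // of <body> so the caller gets a fragment.
            XElement body = html.Descendants()
                                .First(e => e.Name.LocalName == "body");
            return string.Concat(body.Nodes().Select(n => n.ToString()));
        }
    }
}

Note that the converter may emit CSS into the head of the XHTML it returns, so any formatting carried by those styles is lost when only the body is kept; whether that matters depends on the content being converted.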
Related: this is like the question "extract text from xml tags in an XML file using apache tika parser".
I want to extract all text from text-based files, including tagged content, the tags themselves, and other text in XML/HTML elements.
I've tried XML (application/xml) and HTML (text/html), and seen that AutoDetectParser returns less than the full text content.
I've also tried YAML (text/plain) and JSON (text/plain), which do return the full text content.
I understand that I can't do XML or HTML using the AutoDetectParser. What I can't find documented is a list of what types of files would need special handling.
To get full text content (even if that means a complete 'raw' copy of the file):
1. What MIME types should be parsed using a TXTParser?
2. What MIME types should be parsed using other parsers?
Basically, I'm asking: for which MIME types does the AutoDetectParser return less than the full text content?
Thanks
EDIT
My use case is to be able to extract text and metadata from a wide variety of input file formats including txt, xml, html, doc(x), ppt(x), pdf, ...
Essentially, I want to be able to handle any file type Tika can handle.
I am using code like this
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
try (InputStream stream = new FileInputStream(fileToExtract)) {
    parser.parse(stream, handler, metadata, context);
} catch (IOException | SAXException | TikaException e) {
    // handle/log
}
I see the same results for XML files as the question referenced above.
What I am trying to find out is where it is documented when the combination of AutoDetectParser and BodyContentHandler will return less than the full text of the input file.
When, or for which MIME types, do I need to switch the Parser and/or ContentHandler?
I don't see this information clearly documented, and I am hoping to avoid a trial-and-error approach.
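For what it's worth, a sketch of the same parse call with Tika's ToXMLContentHandler (in org.apache.tika.sax) is below; it keeps the XHTML structure that the chosen parser emits instead of just the body text, though whether that amounts to the full content of the original file still depends on which parser AutoDetectParser picks for the MIME type. Class and method names in the sketch are illustrative.

// Sketch: collect the full XHTML output of the parse (tags included) rather
// than the plain body text.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.ToXMLContentHandler;

public class FullContentExtractor {
    public static String extractAsXhtml(File fileToExtract) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ToXMLContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        try (InputStream stream = new FileInputStream(fileToExtract)) {
            parser.parse(stream, handler, metadata, context);
        }
        // toString() returns the XHTML that Tika generated for the document.
        return handler.toString();
    }
}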
My goal is to use a WCF service to accept some parameters, then generate and return an HTML string that a user can embed in webpages of their choice.
Is XDocument appropriate for generating a string of HTML?
I do not really need a full HTML document, just a simple snippet that has some image elements, a p-tag element, and a table element.
It's suitable for generating XHTML, which is valid XML. It wouldn't be suitable for parsing HTML, which doesn't have to be a valid XML document.
There may be more HTML-specific APIs available, but for just a simple snippet of XHTML, using XDocument should be fine.
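For instance, a minimal sketch with System.Xml.Linq might look like the following; the element contents, attribute values, and the helper name BuildSnippet are made up for illustration, and XElement is used so the result is emitted without an XML declaration.

// Sketch only: builds the kind of snippet described above (an image, a
// paragraph, and a table) and returns it as a string.
using System.Xml.Linq;

public static class SnippetBuilder
{
    public static string BuildSnippet(string imageUrl, string caption)
    {
        var snippet = new XElement("div",
            new XElement("img",
                new XAttribute("src", imageUrl),
                new XAttribute("alt", caption)),
            new XElement("p", caption),
            new XElement("table",
                new XElement("tr",
                    new XElement("td", "Cell 1"),
                    new XElement("td", "Cell 2"))));

        // XElement.ToString() emits the markup without an XML declaration,
        // which is what you want for an embeddable fragment.
        return snippet.ToString(SaveOptions.DisableFormatting);
    }
}

One thing to watch: empty elements come out self-closed (e.g. an img tag ending in "/>"), which is fine for XHTML but worth checking against whatever consumes the snippet.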
Project: using VB.NET to build a WinForms database interface and work-automation app.
I am using this editor for users to enter their text in the database interface environment; it will load/save/show what they are working on in the form and also mail-merge it into a Word document waiting for the content. I can do the first step and it works well, but how do I get MS Word to recognize the HTML as formatting instead of just merging in the tags and text all as plain text?
The tool has two relevant properties: one to get just the text (no markup, i.e. no HTML) and one to get the full markup with HTML. Both of these are in text format (which I use for easy storage in the Database).
Ideas/directions I can think of:
1) Use the clipboard. I can copy/paste the content straight from the editor window to Word and it works great! But loading from a database is significantly different, even when using the clipboard programmatically. (Maybe I don't understand how to use the clipboard tools.)
2) Maybe there is a library or class/function in Word that can understand the HTML as "mergeable" content?
thanks!
:-Dan
You may use our (SautinSoft) .NET library to transform each piece of your HTML data into a Word document.
Next you may merge all of the produced Word documents into a single Word document. The component also has a function to merge Word documents.
This is the link to download the component: http://www.sautinsoft.com/products/html-to-rtf/download.php
This is sample code to transform HTML to a Word document in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfString As String = ""
rtfString = h.ConvertString(htmlString)
This is sample code to merge two documents in memory:
Dim h As New SautinSoft.HtmlToRtf
Dim rtfSingle As String = ""
rtfSingle = h.MergeRtfString(rtf1, rtf2)
I ended up using the clipboard to set the text. Here is the code sample that I needed to answer this question:
Clipboard.SetText(Me._Object.Property, TextDataFormat.Rtf)
I just didn't know how to tell the computer that the content was HTML or RTF etc. It turned out to be simple.
:-Dan
I am using POI to create a spreadsheet report. I have HTML content with <p>, <b/>, etc. How do I parse these HTML tags in POI? Is there any function in POI which can parse HTML content?
This is a sample of my POI code:
HSSFCell cell = getHSSFCell(mysheet, 5, 1);
cell.setCellValue(new HSSFRichTextString(htmlContent));
Thank you in advance.
POI is not for HTML, it's for MS Office. What you want to use is XPath for your HTML parsing portion. XPath is a rabbit hole of its own, so I won't go into a lot of detail about it, but here are some resources for Java XPath:
roseindia tutorial
javadocs
IBM Xpath API
One simple solution would be to use an HTML parser to parse the HTML content and then set the text using POI. I use the Jericho HTML Parser: http://jericho.htmlparser.net/docs/index.html
A simple example of HTML parsing using Jericho:
Source source = new Source("The HTML Text");
String parsedHTMLText = source.getTextExtractor().toString();
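Putting the two together, a rough sketch (the class and method names are mine, and it assumes you only need the plain text, not the bold formatting) could look like this:

// Sketch: strip the markup with Jericho, then write the remaining plain text
// into a cell with POI. Row/column handling is simplified for illustration.
import net.htmlparser.jericho.Source;
import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRichTextString;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;

public class HtmlCellWriter {
    public static void setHtmlAsText(HSSFSheet sheet, int rowIndex, int colIndex, String htmlContent) {
        // Jericho drops the tags and keeps the readable text.
        Source source = new Source(htmlContent);
        String parsedHtmlText = source.getTextExtractor().toString();

        HSSFRow row = sheet.getRow(rowIndex);
        if (row == null) {
            row = sheet.createRow(rowIndex);
        }
        HSSFCell cell = row.createCell(colIndex);
        cell.setCellValue(new HSSFRichTextString(parsedHtmlText));
    }
}

Note that this throws away the formatting implied by tags like <b/>; reproducing it in the cell would mean building HSSFRichTextString formatting runs and fonts by hand.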
Can I embed an XML file in HTML without using iFrames?
I want to show XSL-transformed XML (which is HTML as a result of the transformation) as part of my HTML document. Hope this makes it clearer.
If my description of the problem is unclear, please tell me and I will try to explain it further.
You can easily use browser-based XSL transformation routines to convert an XML string into an XML document or HTML output that can then be inserted into any page element.
The steps could be briefly summarized as:
Load an XML string from a resource (or as the result of an AJAX hit).
Load the XML into an XML document object (code differs between browsers: IE uses the MSXML ActiveXObject (DOMDocument), while Mozilla uses its built-in implementation to create a Document; Chrome, on the other hand, uses the built-in XmlHttpRequest object as the only available XML document object).
Load the XSL document similarly and set its arguments.
Transform the XML and obtain output as a string.
Apply the string output to any page element.
Note that the code differs for each browser, so it may be simpler to use a public JS framework such as jQuery or Prototype.
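As a rough sketch of those steps for modern, standards-based browsers (using DOMParser and XSLTProcessor rather than the per-browser objects described above; the "target" element id is made up for illustration):

// Sketch: transform an XML string with an XSL stylesheet string and inject
// the resulting markup into an existing page element.
function renderXml(xmlString, xslString) {
    var domParser = new DOMParser();
    var xmlDoc = domParser.parseFromString(xmlString, "application/xml");
    var xslDoc = domParser.parseFromString(xslString, "application/xml");

    var processor = new XSLTProcessor();
    processor.importStylesheet(xslDoc);

    // transformToFragment returns a DocumentFragment that can be appended
    // directly to any element on the page.
    var fragment = processor.transformToFragment(xmlDoc, document);

    var target = document.getElementById("target");
    target.innerHTML = "";
    target.appendChild(fragment);
}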
You will need to use HTML entities. For example, to show a <name> tag you would write:
&lt;name&gt;
More reading here