Parse HTML Content in POI - html

I am using POI to create a spreadsheet report. I have HTML content with <p>, <b/>, etc. How do I parse these HTML tags in POI? Is there any function in POI that can parse HTML content?
This is a sample of my POI code:
HSSFCell cell = getHSSFCell(mysheet, 5, 1);
cell.setCellValue(new HSSFRichTextString(htmlContent));
Thank you in advance.

POI is not for HTML; it's for MS Office. What you want to use is XPath for the HTML parsing portion. XPath is a rabbit hole of its own, so I won't go into a lot of detail about it, but here are some resources for Java XPath:
roseindia tutorial
javadocs
IBM Xpath API

One simple solution would be to use an HTML parser to parse the HTML content and then set the text using POI. I use the Jericho HTML Parser: http://jericho.htmlparser.net/docs/index.html
Simple HTML parsing using Jericho:
Source source = new Source("The HTML Text");
String parsedHTMLText = source.getTextExtractor().toString();

Related

Why is lxml html parser not parsing the complete file?

I am trying to parse a 16 MB HTML file using lxml. My actual task is to get all the doc tags and, for each doc tag, extract its content if the value of its docno tag matches my doc list.
self.doc_file_list is a list containing the paths of the 16 MB files that I need to parse.
file is the absolute path of the file.
This is the code I am currently using:
for file in file(self.doc_file_list,'r'):
    tree = etree.parse(file.strip(), parser)
    doc = tree.findall('.//doc')
    for elem in doc:
        docno = elem.find('.//docno').text
        if docno in self.doc_set:
            print >> out, etree.tostring(elem)
I checked the content of tree using etree.tostring(tree), and it does not contain the complete file; only a few KB of the actual file are parsed.
Note: I am not getting any error message, but the parsed content of tree is incomplete, so I am not able to get the whole list.
I was finally able to solve this problem. I checked the generated tree and it did not contain the whole document, because the document was heavily broken. You can read more about this at lxml.de/parsing.html (removed http as stackoverflow did not let me add more than 2 links).
The problem of broken HTML documents can be resolved using one of the following two approaches:
1. Instead of using the HTML parser, you can use ElementSoup, which is provided by lxml. It uses the BeautifulSoup parser to handle broken HTML documents. Link: http://lxml.de/lxmlhtml.html
Note: This approach did not work out for me.
2. Another approach is to use BeautifulSoup directly with one of the parsers it supports. There are several parser options, and you need to find out which one suits you best. For me, html.parser worked.
Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
Thanks all for the help.
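As a rough illustration of the second approach, here is a minimal sketch that uses only Python's standard-library html.parser module (the parser that BeautifulSoup's "html.parser" option builds on) instead of BeautifulSoup itself. The class name DocCollector, the helper extract_docs, and the decision to collect only the text of each doc are assumptions made for this example, not part of the original solution. It tolerates missing closing tags, which is the kind of breakage described above:

```python
from html.parser import HTMLParser


class DocCollector(HTMLParser):
    """Collects the text of every <doc> whose <docno> is in a wanted set."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # docno values to keep (hypothetical input)
        self.matches = {}      # docno -> text of the matching <doc>
        self._in_doc = False
        self._in_docno = False
        self._docno = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "doc":
            # A new <doc> ends the previous one, even if it was never closed.
            self._flush()
            self._in_doc = True
        elif tag == "docno" and self._in_doc:
            self._in_docno = True

    def handle_endtag(self, tag):
        if tag == "docno":
            self._in_docno = False
        elif tag == "doc":
            self._flush()

    def handle_data(self, data):
        if self._in_docno:
            self._docno += data
        elif self._in_doc:
            self._text.append(data)

    def _flush(self):
        # Store the current record if its docno is wanted, then reset state.
        docno = self._docno.strip()
        if self._in_doc and docno in self.wanted:
            self.matches[docno] = "".join(self._text).strip()
        self._in_doc = False
        self._in_docno = False
        self._docno = ""
        self._text = []

    def close(self):
        super().close()
        self._flush()  # keep a trailing <doc> that was never closed


def extract_docs(html_text, wanted):
    """Return {docno: text} for the wanted docs found in html_text."""
    collector = DocCollector(wanted)
    collector.feed(html_text)
    collector.close()
    return collector.matches
```

Because the parser is event-driven and never builds a tree, an unclosed tag cannot silently truncate the result; each doc is emitted as soon as the next one starts or the input ends.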

Objective-C event-driven HTML parsing

I need to be able to parse HTML snippets in an event-driven way. For example, if the parser finds an HTML tag, it should notify me and pass the HTML tag, value, attributes, etc. to a delegate. I cannot use NSXMLParser because I have messy HTML. Is there a useful library for that?
What I want to do is parse the HTML, create an NSAttributedString, and display it in a UITextView.
Yes, you can parse the HTML content of a file.
If you want to get a specific value from HTML content, you need to parse the content using Hpple. There is also documentation with an example of parsing HTML. Another way is regex, but that is more complicated, so Hpple is the best way in your case.

How to convert microsoft word para/range to html?

I am working on a project that requires converting specific paragraphs in a Word document to HTML. I will have the range object of the paragraph or paragraphs; from that range I can get WordOpenXML, and I want to convert that to HTML. (It should not have the html, head, and body tags, as it is not a full document but just a small HTML chunk.)
I saw Eric White's Open XML articles. He wrote great articles on this topic, and the Power Tools for Open XML include an HTML converter that converts an entire document to HTML, but my requirement is to convert a specific paragraph or range to HTML. Can anyone point me in the right direction?
For example, if a Word document has
This is para1.
This is para2.
This is para3.
My requirement is to convert para2, which I have as a paragraph object. So, basically, I am looking to write a function like
public string WordOpenXMLToHtml( string sWordOpenXML) {
// do the transformation
return sHtml;
}
You could try the HtmlConverter object. More info here: Transforming Open XML WordprocessingML to XHTML Using the Open XML SDK 2.0

Generate a xml from a html

I'm trying to generate XML from an HTML page (URL). The website has a form that I want to get into an XML file, but it's too long, and I'm looking for an easier way to do it.
Is there a method to generate XML with all the fields, etc., from HTML?
You can also use an HTML parser and print out the objects/array as XML.
Try this: http://sourceforge.net/projects/html2xml/
You can try the free .NET class library SgmlReader, which can load HTML into an XmlDocument. This in turn can be saved as XML.
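The "use an HTML parser and print the result as XML" idea can be sketched roughly as follows, here in Python with only the standard library. This is an illustration under assumptions, not any of the tools named above: the names FormFieldCollector and form_fields_to_xml, and the <form>/<field> output layout, are made up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags treated as form fields in this sketch.
FIELD_TAGS = {"input", "select", "textarea"}


class FormFieldCollector(HTMLParser):
    """Records the tag name and attributes of every form field it sees."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_TAGS:
            # attrs is a list of (name, value) pairs; value is None
            # for attributes written without a value.
            self.fields.append((tag, {k: (v or "") for k, v in attrs}))


def form_fields_to_xml(html_text):
    """Return an XML string with one <field> element per form field."""
    collector = FormFieldCollector()
    collector.feed(html_text)
    collector.close()

    root = ET.Element("form")
    for tag, attrs in collector.fields:
        field = ET.SubElement(root, "field", {"tag": tag})
        for key in ("name", "type", "value"):
            if key in attrs:
                field.set(key, attrs[key])
    # ElementTree escapes attribute values, so the output is well formed
    # even when the input HTML is not.
    return ET.tostring(root, encoding="unicode")
```

Calling form_fields_to_xml on the page's HTML returns a small, well-formed XML document that can be written to a file or processed further.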

parse html in adobe air

I am trying to load and parse HTML in Adobe AIR. The main purpose is to extract the title, meta tags, and links. I have been trying HTMLLoader, but I get all sorts of errors, mainly uncaught JavaScript exceptions.
I also tried to load the HTML content directly (using URLLoader) and push the text into HTMLLoader (using loadString(...)), but got the same error. My last resort was to load the text into XML and then use E4X queries or XPath, but no luck there because the HTML is not well formed.
My questions are:
1. Is there a simple and reliable (AIR/ActionScript) DOM component out there (I do not need to display the page, and headless mode will do)?
2. Is there any library to convert (crappy) HTML into well-formed XML so I can use XPath/E4X?
3. Any other suggestions on how to do this?
thx
ActionScript is supposed to be a superset of JavaScript, and thankfully, there's...
Pure JavaScript/ActionScript HTML Parser
created by JavaScript guru and jQuery creator John Resig :-)
One approach is to run the HTML through HTMLtoXML() then use E4X as you please :)
AFAIK:
1. No :-(
2. No :-(
3. I think the easiest way to grab the title and meta tags is to write some regular expressions. You can load the page's HTML code into a string and then read out whatever you need like this:
var str:String = ""; // put HTML code in here
var pattern:RegExp = /<title>(.+)<\/title>/i;
trace(pattern.exec(str));
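For comparison, the same regular-expression idea can be sketched in Python. This is only an illustrative variant, not ActionScript: the function name extract_title is made up for the example, and the pattern is made non-greedy and allowed to span line breaks so that a title split across lines is still found.

```python
import re

# Non-greedy match between <title ...> and </title>; DOTALL lets "."
# cross line breaks, IGNORECASE accepts <TITLE> as well.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_text):
    """Return the page title with surrounding whitespace removed, or None."""
    match = TITLE_RE.search(html_text)
    return match.group(1).strip() if match else None
```

Returning None when nothing matches avoids an error on pages without a title. As with any regex-based approach, this is fine for pulling out one or two known tags but is not a substitute for a real parser.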