Extract text from text-based files (XML, HTML, etc.) using Tika

Like this question: "extract text from xml tags in an XML file using Apache Tika parser"
I want to extract all text from text based files, including tagged content, the tags themselves, and other text in XML/HTML elements.
I've tried XML (application/xml) and HTML (text/html) and seen that AutoDetectParser returns less than the full text content.
I've also tried YAML (text/plain) and JSON (text/plain), which do return the full text content.
I understand that I can't get the full raw content of XML or HTML using the AutoDetectParser. What I can't find documented is a list of which file types need special handling.
To get full text content (even if that means a complete 'raw' copy of the file):
1. Which MIME types should be parsed using a TXTParser?
2. Which MIME types should be parsed using other parsers?
Basically, I'm asking: for which MIME types does the AutoDetectParser return less than the full text content?
Thanks
EDIT
My use case is to be able to extract text and metadata from a wide variety of input file formats including txt, xml, html, doc(x), ppt(x), pdf, ...
Essentially, I want to be able to handle any file type Tika can handle.
I am using code like this
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser); // lets Tika recurse into embedded documents
try (InputStream stream = new FileInputStream(fileToExtract)) {
    parser.parse(stream, handler, metadata, context);
} catch (IOException | SAXException | TikaException e) {
    // handle/log the exception
}
I see the same results for XML files as the question referenced above.
What I am trying to find out is: where is it documented when the combination of AutoDetectParser and BodyContentHandler will return less than the full text of the input file.
When, or for which MIME types, do I need to switch the Parser and/or ContentHandler?
I don't see this information clearly documented, and I am hoping to avoid a trial-and-error approach.
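For illustration, here is a minimal sketch of the kind of switching I mean. This is my own workaround, not a documented Tika recipe: detect the type first, then route the markup types whose tags should survive to TXTParser, which simply decodes the stream as text. The RAW_TYPES set is my own guess, not a documented list (assumes Tika 2.x and Java 9+):

import java.io.IOException;
import java.nio.file.Path;
import java.util.Set;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class RawAwareExtractor {
    // Assumption: the markup types whose tags I want to keep verbatim.
    private static final Set<MediaType> RAW_TYPES = Set.of(
            MediaType.parse("application/xml"),
            MediaType.parse("text/html"));

    public static String extract(Path file) throws IOException, SAXException, TikaException {
        AutoDetectParser auto = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(file)) {
            // Detectors mark/reset the stream, so it can still be parsed afterwards.
            MediaType type = auto.getDetector().detect(stream, metadata);
            // TXTParser just decodes the bytes as text, so tags come through untouched.
            Parser parser = RAW_TYPES.contains(type) ? new TXTParser() : auto;
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
            parser.parse(stream, handler, metadata, new ParseContext());
            return handler.toString();
        }
    }
}

This would match what I observed with YAML and JSON: they detect as text/plain and go through TXTParser anyway, so the full content comes back.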

Related

Cannot convert DOCX to HTML with Python

I've tried it by using mammoth:
import mammoth
result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)
I don't get an HTML, but this strange code:
kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]
I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:
SyntaxError: Missing parentheses in call to 'print'
Mammoth .docx to HTML converter
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to an h1 element, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
Installation
pip install mammoth
Basic conversion
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion
You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.
with open("document.docx", "rb") as docx_file:
result = mammoth.extract_raw_text(docx_file)
text = result.value # The raw text
messages = result.messages # Any messages
You can use the pypandoc module for that purpose; see the code below. (Note the second argument is the target format, 'html', not 'docx'.)
import pypandoc

output = pypandoc.convert_file('file.docx', 'html', outputfile="file_converted.html")
The issue you're having is probably that mammoth doesn't create complete HTML files, just HTML snippets, meaning the output is missing the <html> and <body> tags.
Some browsers can still render the content from the file since they're advanced enough to do so, but I ran into a similar problem when trying to use the raw output.
A nifty workaround for this is to add this to your code to convert it to proper HTML files:
import mammoth

with open("test.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages

full_html = (
    '<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
    + html
    + "</body></html>"
)

with open("test.html", "w", encoding="utf-8") as f:
    f.write(full_html)
where test.html is whatever title you gave your document.
I'm not taking credit for this; I found it here as well but can't find the source post.
As stated in the documentation:
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion

Why is lxml html parser not parsing the complete file?

I am trying to parse a 16 MB HTML file using lxml. My actual task is to get all the doc tags, and for each doc tag whose docno value matches my doc list, extract the content of the doc tag.
self.doc_file_list is a file containing the paths of the 16 MB files I need to parse.
Each line of that file is the absolute path of one file to parse.
This is the code I am using currently
from lxml import etree

parser = etree.HTMLParser()  # the HTML parser the question is about
with open(self.doc_file_list, 'r') as path_list:
    for path in path_list:
        tree = etree.parse(path.strip(), parser)
        for elem in tree.findall('.//doc'):
            docno = elem.find('.//docno').text
            if docno in self.doc_set:
                print >> out, etree.tostring(elem)
I checked the content of the tree using etree.tostring(tree), and it does not contain the complete file, only a few KB of it.
Note: I am not getting any error message, but the parsed content of the tree is incomplete, so I am not able to get the whole list.
I was finally able to solve this problem. I checked the generated tree, and it was not parsing the whole document because the document was heavily broken. You can find this information at http://lxml.de/parsing.html.
A broken HTML document can be handled using one of the following two approaches:
1. Instead of the HTML parser, you can use ElementSoup, provided by lxml, which uses the BeautifulSoup parser to handle broken HTML documents. Link: http://lxml.de/lxmlhtml.html
Note: This approach did not work out for me.
2. Another approach is to use BeautifulSoup directly with one of the parsers it provides. There are many parser options, and you need to find out which one suits you best. For me, html.parser worked.
Link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
Thanks all for the help.

Extracting JSON data from html source for use with jsonlite in R

I have a background in data and have just been getting into scraping, so forgive me if my knowledge of web standards and languages is not up to scratch.
I am trying to scrape some data from a javascript component of a website I use. Viewing the page source I can actually see the data I need already there within javascript function calls in JSON format. For example it looks a little like this.
<script type="text/javascript">
$(document).ready(function () {
    gameState = 4;
    atView.init("/Data/FieldView/20152220150142207", {"a":[{"co":true,"col":"Red"}],"b":false,...)
    meLine.init([{"c":100,"b":true,...)
</script>
Now, I only need the JSON data in meLine.init. If I physically copy/paste only the JSON data into a file I can then convert that with jsonlite in R and have exactly what I need.
However I don't want to have to copy/paste multiple pages so I need a way of extracting only this data and leaving everything else behind. I originally thought to save the html source code to R, convert to text and try and regex match "meLine.init(", but I'm not really getting anywhere with that. Could anyone offer some help?
Normally I'd use XML and XPath to parse an HTML page, but in this case (since you know the exact structure you're looking for) you might be able to do it directly with a regular expression (generally not a good idea, as emphasized here). Not sure if this gets you exactly to your goal, but
sub("[ ]+meLine.init\\((.+)\\)" , "\\1",
grep("meLine.init", readLines("file://test.html"), value=TRUE),
perl=TRUE)
will return the line you're looking for, and then you can work your magic with jsonlite. The idea is to read the page line by line, grep the (hopefully) single line that contains the string meLine.init, and then extract the JSON string from that. Replace file://test.html with the URL you want to use.

How to detect HTML in clipboard data using Qt

I have a rich text editor I'm working on where I need to parse and clean data from the clipboard when appropriate. Whenever the text being pasted contains HTML, I will clean it up and update the text field with the correct html.
However, when there is no html in the clipboard, there is no need for me to run the html cleaning tool.
My first thought was to use Regex and check for any html tag in there, but I'm not sure this is the best solution for this problem as it can cause more headaches in the long run with false positives, etc.
My question is, how can I detect some HTML in the clipboard?
Is there an elegant way to solve this problem without having to resort to regex?
Maybe one of these functions:
bool QDomDocument::setContent(...)
This function reads the XML document from the string text, returning true if the content was successfully parsed; otherwise it returns false. Since text is already a Unicode string, no encoding detection is done.
An addition for mixed clipboard data:
// get the html data out of the surrounding junk
QString htmlText = clipboardString.section("</html>", -2, 0, QString::SectionIncludeTrailingSep)
                                  .section("<html", 1, -1, QString::SectionIncludeLeadingSep);
// check it for validity, correctness, etc.
if (!htmlText.isEmpty()) {
    QDomDocument doc;
    doc.setContent(htmlText /*, ... */);
}

Print xml source in html page

So I have a servlet which prints the content of various files. But when I want to print an .xml file, my page doesn't print anything, because the browser treats the XML tags as HTML and parses them instead of displaying them. And I want to display those tags. I am reading the file line by line, and the lines are stored in the variable line.
If you want to print XML content in your HTML page, you can use the StringEscapeUtils.escapeHtml() function from the Apache Commons Lang library to write the XML file contents to your HTML page:
PrintWriter writer = response.getWriter();
writer.write("<html><head></head><body>");
writer.write(StringEscapeUtils.escapeHtml(xmlContent)); // escape <, >, & so the tags display as text
writer.write("</body></html>");
If you are attempting to display XML as content in an HTML document:
Browsers can't tell the difference between a < that the author intends to mean "start of tag" and one that the author intends to mean "render this".
You need to represent it as &lt; if you want it to appear as data.
The answer to htmlentities equivalent in JSP? explains how to convert a string of text into a string of HTML.
If you are attempting to output an XML document instead of an HTML document:
You need to specify an XML content type (such as application/xml) instead of an HTML content-type.
See How to set the content type on the servlet for an explanation.
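For that second case, a minimal sketch (assuming xmlContent holds the file contents, as in the snippet above):
// Serve the XML itself instead of embedding it in an HTML page.
response.setContentType("application/xml");
response.setCharacterEncoding("UTF-8");
PrintWriter writer = response.getWriter();
writer.write(xmlContent); // no escaping needed: the response is XML, not HTML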