How do I parse an html/txt file using only strtok and/or strsep?
I'm trying to write a program that saves the text parts of a Wikipedia article to a .txt file. The first part of my code downloads the article as HTML. The next step is to parse that HTML file and save the result as a .txt file.
Related
Like this question: extract text from XML tags in an XML file using the Apache Tika parser
I want to extract all text from text based files, including tagged content, the tags themselves, and other text in XML/HTML elements.
I've tried XML (application/xml), and HTML (text/html) and seen that AutoDetectParser returns less than the full text content.
I've also tried YAML (text/plain), and JSON (text/plain) which do return the full text content.
I understand that I can't do XML or HTML using the AutoDetectParser. What I can't find documented is a list of what types of files would need special handling.
To get full text content (even if that means a complete 'raw' copy of the file):
1. What Mimetypes should be parsed using a TXTParser?
2. What Mimetypes should be parsed using other parsers?
Basically, I'm asking: for which Mimetypes does the AutoDetectParser return less than the full text content?
Thanks
EDIT
My use case is to be able to extract text and metadata from a wide variety of input file formats including txt, xml, html, doc(x), ppt(x), pdf, ...
Essentially, I want to be able to handle any file type Tika can handle.
I am using code like this
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
try (InputStream stream = new FileInputStream(fileToExtract)) {
    parser.parse(stream, handler, metadata, context);
} catch (IOException | SAXException | TikaException e) {
    // parse() declares these three checked exceptions; handle or rethrow here
}
I see the same results for XML files as the question referenced above.
What I am trying to find out is: where is it documented when the combination of AutoDetectParser and BodyContentHandler will return less than the full text of the input file.
When, or for what Mimetypes, do I need to switch the Parser and/or ContentHandler?
I don't see this information clearly documented, and I am hoping to avoid a trial and error approach.
I've tried it by using mammoth:
import mammoth
result = mammoth.convert_to_html("MyDocument.docx")
print (result.value)
I don't get an HTML, but this strange code:
kbW7yqZoo4h9pYM6yBxX1QFx2pCoPYflXfieIPbtqpT913Vk7OzcZdEk3eO7TbWjvZNTGilsfmRrPwDvB[...]
I've also tried to use docx2html, but I can't install it. When I run pip install docx2html I get this error:
SyntaxError: Missing parentheses in call to 'print'
Mammoth .docx to HTML converter
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
The following features are currently supported:
Headings.
Lists.
Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.
Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.
Footnotes and endnotes.
Images.
Bold, italics, underlines, strikethrough, superscript and subscript.
Links.
Line breaks.
Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.
Comments.
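The custom style mapping mentioned in the list above could look roughly like this. It is a minimal sketch, not taken from the original text: build_style_map and convert_with_styles are hypothetical helper names, and it assumes every mapped style is a paragraph style and that mammoth is installed. The style_map argument and the "=>" rule syntax are Mammoth's own.

```python
def build_style_map(mapping):
    """Build a Mammoth style map from {docx paragraph style name: HTML target}."""
    # One rule per line; ":fresh" asks Mammoth to start a new element each time.
    return "\n".join(
        f"p[style-name='{style}'] => {target}:fresh"
        for style, target in mapping.items()
    )


def convert_with_styles(docx_path, mapping):
    """Convert a .docx file to an HTML fragment using the given style mapping."""
    import mammoth  # third-party; imported lazily so build_style_map works without it

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file, style_map=build_style_map(mapping)
        )
    return result.value, result.messages


print(build_style_map({"WarningHeading": "h1.warning"}))
```

Calling convert_with_styles("document.docx", {"WarningHeading": "h1.warning"}) would then render paragraphs styled WarningHeading as h1 elements with the class warning, assuming the document actually uses that style name.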
Installation
pip install mammoth
Basic conversion
To convert an existing .docx file to HTML, pass a file-like object to mammoth.convert_to_html. The file should be opened in binary mode. For instance:
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion
You can also extract the raw text of the document by using mammoth.extract_raw_text. This will ignore all formatting in the document. Each paragraph is followed by two newlines.
with open("document.docx", "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)
    text = result.value  # The raw text
    messages = result.messages  # Any messages
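You can use the pypandoc module for that purpose. See the code below:
import pypandoc
# The second argument is the target format; format= names the input format.
output = pypandoc.convert_file('file.docx', 'html', format='docx', outputfile="file_converted.html")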
The issue you're having is probably that mammoth doesn't create complete HTML files, just HTML snippets, meaning the output is missing the <html> and <body> tags.
Some browsers can still render the content from the file since they're advanced enough to do so, but I ran into a similar problem when trying to use the raw output.
A nifty workaround for this is to add this to your code to convert it to proper HTML files:
import mammoth
with open("test.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings
full_html = (
    '<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>'
    + html
    + "</body></html>"
)
with open("test.html", "w", encoding="utf-8") as f:
    f.write(full_html)
Here test.html stands for whatever name you want to give the output file.
I'm not taking credit for this, I found it here as well, but can't find the source post.
As stated in the documentation:
To convert an existing .docx file to HTML, pass a file-like object to
mammoth.convert_to_html. The file should be opened in binary mode. For
instance:
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value  # The generated HTML
    messages = result.messages  # Any messages, such as warnings during conversion
I have a process that converts "text" content to PDF. All "text" is stored in a database.
We have recently added HTML content to the UI and database. So, I need to modify the pdf generation process to, "on the fly," convert HTML to PDF. I would like to use the XMLParser, but all of the examples show opening a new document - the process I need is to convert HTML to a PDF paragraph to push into an open document. Any ideas would be appreciated.
I have to load a CSV file as a resource file and display its contents. The issue is how to display a message like "Thanks for answering" in red through the CSV file. I need to use HTML tags in the CSV file, like
<font color='red'></font>
but the file that loads this CSV file displays the content along with the HTML tags.
CSV stands for "Comma/Character Separated Values" – plain text, no formatting, except for the optional header line and the character that separates the values.
If you need to define formatting like font colors in the source file, you would have to use another file format like (X)HTML, XLS or RTF.
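One way to live with that limitation is to keep the CSV as plain data and apply the formatting in the code that displays it. The sketch below only illustrates that idea and is not from the original answer: the column names message and colour and the function render_messages are hypothetical, and it assumes the consumer can emit HTML.

```python
import csv
import html
import io


def render_messages(csv_text):
    """Return one HTML paragraph per CSV row, coloured by the row's colour column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    parts = []
    for row in rows:
        # Escape both fields so the CSV can never inject markup of its own.
        colour = html.escape(row["colour"], quote=True)
        message = html.escape(row["message"])
        parts.append(f'<p style="color: {colour}">{message}</p>')
    return "\n".join(parts)


# Hypothetical resource file content: plain text plus a colour name, no tags.
sample = "message,colour\nThanks for answering,red\n"
print(render_messages(sample))
```

The CSV stays free of markup, and the presentation decision lives in one place in the consumer.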
I have an OpenOffice Writer document (.odt) with a table of contents, sections, subsections, etc.
Is there a quick way to convert (export) this into multiple HTML files with a navigation sidebar, converting the sections into links?
You can:
Unzip the odt, parse the XML and make the HTML file yourself.
Use OpenOffice to export the document to HTML.
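The first option (unzip, parse the XML, build the HTML yourself) could be sketched as follows. This is a rough illustration under stated assumptions, not a complete converter: the function names split_odt and render_pages and the sectionN.html file names are made up for the example. It only looks at headings and paragraphs that are direct children of the document body, so content nested in text:section elements, lists, tables and images is ignored.

```python
import zipfile
import xml.etree.ElementTree as ET
from html import escape

# Namespaces defined by the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def split_odt(odt_file):
    """Return [(title, [paragraph texts])], one entry per level-1 heading."""
    with zipfile.ZipFile(odt_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    body = root.find(f"{{{OFFICE_NS}}}body/{{{OFFICE_NS}}}text")
    sections = []
    for element in body:
        content = "".join(element.itertext()).strip()
        is_heading = element.tag == f"{{{TEXT_NS}}}h"
        is_paragraph = element.tag == f"{{{TEXT_NS}}}p"
        if is_heading and element.get(f"{{{TEXT_NS}}}outline-level") == "1":
            sections.append((content, []))  # a new page starts here
        elif (is_heading or is_paragraph) and sections and content:
            sections[-1][1].append(content)  # belongs to the current page
    return sections


def render_pages(sections):
    """Return {filename: html}, each page carrying a sidebar linking all pages."""
    names = [f"section{i}.html" for i in range(1, len(sections) + 1)]
    nav = "".join(
        f'<li><a href="{name}">{escape(title)}</a></li>'
        for name, (title, _) in zip(names, sections)
    )
    pages = {}
    for name, (title, paragraphs) in zip(names, sections):
        body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        pages[name] = (
            '<!DOCTYPE html><html><head><meta charset="utf-8"/>'
            f"<title>{escape(title)}</title></head><body>"
            f"<nav><ul>{nav}</ul></nav><h1>{escape(title)}</h1>{body}</body></html>"
        )
    return pages
```

Writing each value of render_pages(split_odt("document.odt")) to its file name would give one HTML file per top-level section, with the same navigation list on every page.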
There are several ways to export HTML from OpenOffice or LibreOffice:
Use File > Export, then select file type XHTML. However, this creates one big HTML file, not multiple files.
Use File > Save as, then select file type HTML document. This creates one big HTML file which is similar but not fully equal to the one above.
Use File > Send > Create HTML document. In the following dialog, you can select a style used in the document based on which the document is split into multiple HTML files. However, I did not get this to work properly. My document is always split on level 1, no matter what I selected here.
Use File > Wizards > Web page. You will get multiple settings to choose from. However, this does not work at all for me. It either fails completely or it does not produce the expected output.
The last two solutions were found on the OpenOffice Wiki at https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/Saving_Writer_documents_as_web_pages
As a conclusion, I cannot provide a complete solution. I am still looking for a good way to solve this problem.