Assuming I am in control of the parsing environment and I'm certain it is only to be converted to HTML (and not any of the many other formats possible); is it ok to embed some HTML within one's Markdown, in order to side-step around a bug?
Could there be any basic sideffects I (as a newbie) couldn't predict but should be aware of?
Non-conventional Markdown example:
_"<strong>This</strong> is an example sentence."_ -**OP**
Which outputs valid HTML:
<em>"<strong>This</strong> is an example sentence."</em> -<strong>OP</strong>
Resulting in successful content:
"This is an example sentence." -OP
Background (don't have to read):
I noticed that if I include HTML in my Markdown, it appears to get skipped during the conversion, resulting in it being seamlessly incorporated in the output HTML.
This appears to be a good thing, at least in my case (Using Hugo to build a website with a template theme) where the Markdown wasn't producing the correct result (leaving a pair of unwanted *s in the HTML: should have been *italic* but asterisks showing).
For those wondering - yes, I confirmed my Markdown was correct using other parsers that handled it fine.
Note: the examples here are simplifications of my specific case
Not only is it okay to do, but it is encouraged. As the rules state:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
And later:
If you want, you can even use HTML tags instead of Markdown formatting; e.g. if you’d prefer to use HTML <a> or <img> tags instead of Markdown’s link or image syntax, go right ahead.
Of course, there are a few things to take into consideration. For example block level tags must be at the document root level (cannot be nested inside blockquotes, lists, etc) and content inside them does not get parsed as Markdown. However, inline tags can be placed anywhere and do not restrict Markdown parsing.
For people using Markdown in highly modular or user-flexible environments (probably slightly more advanced readers):
One should note that although Markdown is most commonly converted to HTML, it can also be used with other formats[1].
For this reason I think it's important to confirm that if you (as a publisher of content) are not the one who determines what the Markdown will be parsed with, or how it is converted it may be 'safer' to not embed HTML in it.
[1] as stated in the Markdown Wikipedia page.
Related
I'm trying to understand Markdown's relationship to HTML. If I understand correctly both are markup languages (an umbrella term describing languages that add formatting elements to plain-text documents). Markdown converts plain text to HTML.
My understanding is that Markdown is a superset of HTML:
Markdown is a popular markup language that is a superset of HTML.
I'm assuming that it's a strict or proper superset. Drawing a parallel from What does it mean when one language is a parallel superset of another?, I interpret that to mean that every valid HTML program is also a valid Markdown program (e.g. HTML is understood in a Jupyter Notebook Markdown cell), but that the converse is not true.
What seems conflicting to me is that if Markdown is a superset of HTML, then why is it that Markdown can't do everything HTML can (I would think the opposite to be true since a superset extends the language without removing or changing any of the existing features. Also, I would expect HTML to be a superset of Markdown since HTML is more expressive and more difficult to read by most humans.
Below is a diagram trying to mimic that in What does “Objective-C is a superset of C more strictly than C++” mean exactly?
That documentation is misleading. Markdown itself is not a superset of HTML. The documentation for the original Markdown project is pretty clear:
Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose. HTML is a publishing format; Markdown is a writing format. Thus, Markdown’s formatting syntax only addresses issues that can be conveyed in plain text.
Today there are several flavours of Markdown, many of which add features that were not present in the original version like tables and syntax-highlighted code blocks. This doesn't change the fundamental fact that Markdown covers a subset of HTML.
(Technically speaking, Markdown isn't a subset of HTML either. *, for example, has no special meaning in HTML. Unconverted Markdown documents might be well-formed HTML but the semantics are very different. But Markdown syntax maps to a subset of HTML tags.)
However, the very next paragraph in the original documentation says:
For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.
Since you can directly use HTML in Markdown it could be considered a superset of HTML. For example, this is valid Markdown:
# My awesome title
I <em>really</em> like coffee
If you pass an HTML document through a conforming Markdown processor it should come out the other side untouched. Being able to directly use HTML in Markdown is very similar to how one can directly use C in C++. This may be what the Jupyter documentation means.
Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the correspondent page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind as that this is for a school project, I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to be showing up the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfatory most of the times.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly—certainly way outside the bounds of a course exercise. Any naïve approach you come up involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
i'd consider writing regex to remove all html tags and you should be left with your desired text. This can be done in Javascript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated then you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
I would like to convert HTML to plain text but retain the minimum structure.
All sections which contain stuff only the browser needs to see such as <script> and <style> to be stripped completely.
Convert all block tags to <div> and all inline ones to <span> or remove inlines completely without leaving whitespace and turning anything delineatd by block levels into paragraphs with two linebreaks.
The idea is to turn random web pages into something suitable for natural language text processing without artefacts left from naively removing markup artifically break words up or making unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source preferably machine-readable with a full list of elements defining which are block, which inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying do to.
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you also may want to save div tag information if they use the id tag. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.
I have some html (in this case created via TinyMCE) that I would like to add to a page. However, for security reason, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify list of tags to be remove:
{{ data|removetags:"script" }}
I have an FAQ in HTML (example) in which the questions refer to each other a lot. That means whenever we insert/delete/rearrange the questions, the numbering changes. LaTeX solves this very elegantly with \label and \ref -- you give items simple tags and LaTeX worries about converting to numbers in the final document.
How do people deal with that in HTML?
ADDED: Note that this is no problem if you don't have to actually refer to items by number, in which case you can set a tag with
<a name="foo">
and then link to it with
some non-numerical way to refer to foo.
But I'm assuming "foo" has some auto-generated number, say from an <ol> list, and I want to use that number to refer to and link to it.
There is nothing like this in HTML.
The way you would normally solve this, is by having the HTML for the links generated, by either parsing the HTML itself and inserting the TOC (you can do that on the server, before you send the HTML out to the browser, or on the client, by traversing the DOM with a little piece of ECMAScript and simply collecting and inspecting all <a> elements) or generating the entire HTML document from a higher level source like a database, an XML document, markdown or – why not? – even LaΤΕΧ.
I know it's not widely supported by browsers, but you can do this using CSS counter.
Also, consider using ids instead of names for your anchors.
Instead of \label{key} use <a name="key" />. Then link using Link.
PrinceXML can do that, but that's about it. I suppose it'd be best to use server-side scripting.
Here's how I ended up solving this with a php script:
http://yootles.com/genfaq
It's roughly as convenient as \label and \ref in LaTeX and even auto-generates the index of questions.
And I put it on an etherpad instance which is handy when multiple people are contributing questions to the FAQ.