I have an FAQ in HTML (example) in which the questions refer to each other a lot. That means whenever we insert/delete/rearrange the questions, the numbering changes. LaTeX solves this very elegantly with \label and \ref -- you give items simple tags and LaTeX worries about converting to numbers in the final document.
How do people deal with that in HTML?
ADDED: Note that this is no problem if you don't have to actually refer to items by number, in which case you can set a tag with
<a name="foo">
and then link to it with
<a href="#foo">some non-numerical way to refer to foo</a>.
But I'm assuming "foo" has some auto-generated number, say from an <ol> list, and I want to use that number to refer to and link to it.
There is nothing like this in HTML.
The usual way to solve this is to have the HTML for the links generated. You can either parse the HTML itself and insert the TOC (on the server, before you send the HTML out to the browser, or on the client, by traversing the DOM with a little piece of ECMAScript and collecting and inspecting all <a> elements), or generate the entire HTML document from a higher-level source such as a database, an XML document, Markdown or (why not?) even LaΤΕΧ.
I know it's not widely supported by browsers, but you can do this using CSS counters.
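As a rough illustration of the "generate the links" idea, here is a minimal Python sketch of a two-pass preprocessor. It assumes a made-up source convention that is not part of HTML: every question is written exactly as <li id="key"> inside a single flat <ol>, and references are written as \ref{key} placeholders. The function name resolve_refs is hypothetical.

```python
import re

def resolve_refs(html: str) -> str:
    """Replace \\ref{key} placeholders with numbered links to <li id="key"> items."""
    # Pass 1: number the labelled items in document order (assumes one flat <ol>).
    numbers = {}
    for n, key in enumerate(re.findall(r'<li id="([^"]+)"', html), start=1):
        numbers[key] = n

    # Pass 2: turn each placeholder into a link whose text is the item's number.
    def link(match):
        key = match.group(1)
        if key not in numbers:
            return match.group(0)  # leave unknown keys visible so they get noticed
        return f'<a href="#{key}">{numbers[key]}</a>'

    return re.sub(r'\\ref\{([^}]+)\}', link, html)

faq = (
    '<ol>'
    '<li id="install">How do I install it? See also question \\ref{upgrade}.</li>'
    '<li id="upgrade">How do I upgrade? Read question \\ref{install} first.</li>'
    '</ol>'
)
print(resolve_refs(faq))
```

Reordering the <li> items and rerunning the script renumbers every reference, which is roughly what \label and \ref give you in LaTeX.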
Also, consider using ids instead of names for your anchors.
Instead of \label{key} use <a name="key" />. Then link using <a href="#key">Link</a>.
PrinceXML can do that, but that's about it. I suppose it'd be best to use server-side scripting.
Here's how I ended up solving this with a PHP script:
http://yootles.com/genfaq
It's roughly as convenient as \label and \ref in LaTeX and even auto-generates the index of questions.
And I put it on an etherpad instance which is handy when multiple people are contributing questions to the FAQ.
Related
I really dislike the non-semantic usage of <big> on our wiki, and would like to prevent it. Flat-out commands didn't work so far, so I'm switching to doing it by code...
AFAIK, there's no configuration switch to control the blacklist/whitelist of HTML tags. Looking at the source code, it seems like the data is coming from Sanitizer::getRecognizedTagData(), while the work itself is done in Sanitizer::removeHTMLtags(). However, I do not see a way to add to the list myself, except using one of the hooks before or after (InternalParseBeforeSanitize, InternalParseBeforeLinks) and either:
Call Sanitizer::removeHTMLtags() again myself, with the additional tag to blacklist as a parameter
Do a search myself on the text to remove all the <big> tags.
The first one is a duplication of work, the second one is a duplication of code. Is there a better way? What would you recommend?
No coding is needed: just install AbuseFilter and create a rule that warns or disallows on save of pages containing these tags.
I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separated xml attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<nobr><span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my @texts_arr = $page_node->findnodes('text');
my $text_node = $texts_arr[1];
print "node's bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also Marpa::HTML, which works with it. It is part of the larger Marpa parser project, which parses any language that can be described in BNF and is documented on the author's very interesting blog, but it is much newer and experimental.
I also see that the wildly successful WWW::Mechanize uses HTML::TokeParser, which in turn uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars. Again, this means reinventing the wheel somewhat, so I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you that seems like it might do what you need: HTML::Tree. Look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
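To make the goal concrete (reading each node's styling as if it were a set of attributes), here is a small sketch in Python rather than Perl, using only the standard library's html.parser. The class name StyledTextParser and the helper parse_declarations are invented for this example. It only understands simple ".class{...}" rules and inline style attributes, and it lets children inherit every property from their parent, which is a simplification and not real CSS cascading.

```python
import re
from html.parser import HTMLParser

def parse_declarations(css: str) -> dict:
    """Turn 'a:b;c:d' into {'a': 'b', 'c': 'd'}."""
    props = {}
    for decl in css.split(';'):
        name, sep, value = decl.partition(':')
        if sep and name.strip():
            props[name.strip()] = value.strip()
    return props

class StyledTextParser(HTMLParser):
    """Collect (text, properties) pairs, merging class rules with inline styles."""

    def __init__(self):
        super().__init__()
        self.class_rules = {}  # class name -> property dict from <style> blocks
        self.stack = []        # (tag, properties) for each currently open element
        self.in_style = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'style':
            self.in_style = True
            return
        attrs = dict(attrs)
        # Simplification: start from the parent's properties.
        props = dict(self.stack[-1][1]) if self.stack else {}
        for cls in (attrs.get('class') or '').split():
            props.update(self.class_rules.get(cls, {}))
        props.update(parse_declarations(attrs.get('style') or ''))
        self.stack.append((tag, props))

    def handle_endtag(self, tag):
        if tag == 'style':
            self.in_style = False
            return
        # Close the nearest matching open element; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.in_style:
            for cls, body in re.findall(r'\.([\w-]+)\s*\{([^}]*)\}', data):
                self.class_rules[cls] = parse_declarations(body)
        elif data.strip() and self.stack:
            self.texts.append((data.strip(), self.stack[-1][1]))

page = '''<style>.ft19{ vertical-align:top;font-size:14px;font-style:italic; }</style>
<div style="position:absolute;top:145;left:89;"><span class="ft19">Some text</span></div>'''
parser = StyledTextParser()
parser.feed(page)
text, props = parser.texts[0]
print(text, props['font-style'], props['top'])
```

The same shape (a tree or event parser plus a small CSS declaration splitter) should carry over to HTML::TreeBuilder or HTML::Parser in Perl, but that port is not shown here.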
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?
Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind that this is for a school project, and I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
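For comparison, here is a sketch of the "let an existing parser do the work" approach in Python, using the standard library's html.parser instead of the HTML Agility Pack. The set of invisible elements is hand-picked for this example, and CSS-based hiding is not considered at all. Because the parser accepts input in pieces through feed(), it also fits the download-as-you-go constraint, with the caveat that a word straddling two chunks may come out split in two.

```python
from html.parser import HTMLParser

# Hand-picked for this sketch; a real page can hide text in many other ways.
INVISIBLE = {'script', 'style', 'title', 'template'}

class VisibleText(HTMLParser):
    """Collect words from text nodes that are not inside an invisible element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE and self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.words.extend(data.split())

# Chunks as they might arrive from the network; one tag is cut in half on purpose.
chunks = [
    '<html><head><title>Demo</title><style>p { color: red }</style></head>',
    '<body><p>Hello <b',
    '>visible</b> world</p><script>var hidden = 1;</script>',
    '</body></html>',
]
parser = VisibleText()
shown = 0
for chunk in chunks:
    parser.feed(chunk)
    print(parser.words[shown:])  # words that became available with this chunk
    shown = len(parser.words)
parser.close()
```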
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
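The two passes could look roughly like this in Python. This is a near-enough sketch, not a parser: it assumes the whole document is already in memory, knows nothing about comments, CSS or malformed markup, and replaces each tag with a space, so a word that contains inline markup gets split. The function name strip_markup is made up for the example.

```python
import re

# Pass 1: drop elements whose entire content is never rendered.
INVISIBLE_ELEMENTS = re.compile(
    r'<(script|style)\b[^>]*>.*?</\1\s*>', re.IGNORECASE | re.DOTALL)
# Pass 2: drop any remaining tag, allowing '>' inside quoted attribute values.
ANY_TAG = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^'">])*>''')

def strip_markup(html: str) -> str:
    text = INVISIBLE_ELEMENTS.sub(' ', html)
    text = ANY_TAG.sub(' ', text)
    return ' '.join(text.split())  # collapse runs of whitespace

sample = ('<p>Hello <a href="x>y">there</a></p>'
          '<script>if (a < b) {}</script><style>p{}</style>')
print(strip_markup(sample))
```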
I'd consider writing a regex to remove all HTML tags, and you should be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags. Note, however, that you first need to remove script, link, style, ... tags together with their content.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
I would like to convert HTML to plain text but retain the minimum structure.
All sections which contain stuff only the browser needs to see, such as <script> and <style>, should be stripped completely.
Convert all block tags to <div> and all inline ones to <span>, or remove inline tags completely without leaving whitespace, and turn anything delineated by block-level elements into paragraphs separated by two line breaks.
The idea is to turn random web pages into something suitable for natural-language text processing, without the artefacts left by naively removing markup, which artificially breaks words up or makes unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source preferably machine-readable with a full list of elements defining which are block, which inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser which is a SAX-style parser and HTML::TreeBuilder which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not based on what you're trying to do.
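Since the question accepts any language, here is one possible sketch in Python with the standard library's html.parser. The BLOCK set below is a hand-picked subset loosely based on the HTML 4 block-level list and is an assumption of this example, not a standard source. Block boundaries become paragraph breaks, inline tags vanish without adding whitespace, and <script> and <style> content is dropped.

```python
from html.parser import HTMLParser

SKIP = {'script', 'style'}
# Hand-picked subset; extend it for the pages you actually process.
BLOCK = {'address', 'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'fieldset',
         'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'li', 'ol', 'p',
         'pre', 'table', 'td', 'th', 'tr', 'ul'}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.paragraphs = []
        self.current = []  # text pieces of the paragraph being built

    def flush(self):
        text = ' '.join(''.join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCK:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK:
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return '\n\n'.join(parser.paragraphs)

print(html_to_text('<div><h1>Title</h1><p>un<b>break</b>able word</p>'
                   '<script>x()</script><p>Second   para</p></div>'))
```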
You may want to do some research yourself. Then, when you run into a problem, ask a question related to the problem. This sounds more like specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you may also want to save div tag information if they use the id attribute. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag by tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.
I have some HTML (in this case created via TinyMCE) that I would like to add to a page. However, for security reasons, I don't want to just print everything the user has entered.
Does anyone know of a templatetag (a filter, preferably) that will allow only a safe subset of html to be rendered?
I realize that markdown and others do this. However, they also add additional markup syntax which could be confusing for my users, since they are using a rich text editor that doesn't know about markdown.
There's removetags, but it's a blacklisting approach which fails to remove tags when they don't look exactly like the well-formed tags Django expects, and of course since it doesn't attempt to remove attributes it is totally vulnerable to the 1,000 other ways of script-injection that don't involve the <script> tag. It's a trap, offering the illusion of safety whilst actually providing no real security at all.
HTML-sanitisation approaches based on regex hacking are almost inevitably a total fail. Using a real HTML parser to get an object model for the submitted content, then filtering and re-serialising in a known-good format, is generally the most reliable approach.
If your rich text editor outputs XHTML it's easy, just use minidom or etree to parse the document then walk over it removing all but known-good elements and attributes and finally convert back to safe XML. If, on the other hand, it spits out HTML, or allows the user to input raw HTML, you may need to use something like BeautifulSoup on it. See this question for some discussion.
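For the XHTML case, the walk-and-whitelist idea might be sketched like this with etree. Treat it as an illustration only: the ALLOWED table, the SAFE_SCHEMES set and the function names are assumptions made for this example, the input must be a well-formed fragment without XML namespaces, and disallowed elements are dropped together with their content. Python's own documentation warns that its XML modules are not secure against maliciously constructed data, so for real user input a purpose-built sanitizer is the safer choice.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Element name -> attribute names allowed on it (assumed whitelist for this sketch).
ALLOWED = {'p': set(), 'b': set(), 'i': set(), 'em': set(), 'strong': set(),
           'ul': set(), 'ol': set(), 'li': set(), 'br': set(), 'a': {'href'}}
SAFE_SCHEMES = {'', 'http', 'https', 'mailto'}

def safe_url(url: str) -> bool:
    # Drop whitespace and control characters before looking at the scheme.
    cleaned = ''.join(ch for ch in url if ch > ' ')
    try:
        return urlsplit(cleaned).scheme.lower() in SAFE_SCHEMES
    except ValueError:
        return False

def clean(elem) -> None:
    """Recursively remove non-whitelisted children and attributes in place."""
    previous = None
    for child in list(elem):
        if child.tag not in ALLOWED:
            # Drop the element and its content, but keep the text that followed it.
            tail = child.tail or ''
            if previous is None:
                elem.text = (elem.text or '') + tail
            else:
                previous.tail = (previous.tail or '') + tail
            elem.remove(child)
            continue
        for name in list(child.attrib):
            bad_href = name == 'href' and not safe_url(child.attrib[name])
            if name not in ALLOWED[child.tag] or bad_href:
                del child.attrib[name]
        clean(child)
        previous = child

def sanitize(fragment: str) -> str:
    # Wrap in a <div> so the fragment has one root; raises ParseError if malformed.
    root = ET.fromstring(f'<div>{fragment}</div>')
    clean(root)
    return ET.tostring(root, encoding='unicode')

dirty = ('<p onclick="x()">Hi <script>alert(1)</script>there '
         '<a href="javascript:alert(1)" title="t">link</a> '
         '<a href="https://example.com/">ok</a></p>')
print(sanitize(dirty))
```

Re-serialising from the filtered tree, rather than editing the submitted string, is what gives the "known-good format" mentioned above.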
Filtering HTML is a large and complicated topic, which is why many people prefer the text-with-restrictive-markup languages.
Use HTML Purifier, html5lib, or another library that is built to do HTML sanitization.
You can use removetags to specify a list of tags to be removed:
{{ data|removetags:"script" }}