I have legacy XML files with HTML entities such as &mdash; etc. How can I convert these entities to their hex equivalents, such as &#x2014;? Is there an easy way to do this using a batch command or something else? I am not a high-level programmer, so any detailed help will be appreciated.
Just a simple find and replace may be all you need. Most, if not all text/code editors have a find/replace function.
Chances are that there are only a few character strings that make up the majority of what you need to replace and fortunately, they're all pretty unique, so it's unlikely that you'll have any accidental replacements.
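If there are more entities than a hand-done find/replace can comfortably cover, a short script can do the same substitution mechanically. Here is a minimal Python sketch using the standard library's entity table; the file name is just a placeholder for your own files.

import re
from html.entities import name2codepoint  # maps 'mdash' -> 8212 (0x2014), etc.

def named_to_hex(text):
    """Replace named entities like &mdash; with hex references like &#x2014;."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return "&#x{:X};".format(name2codepoint[name])
        return match.group(0)        # unknown name: leave it untouched
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

# Placeholder file name; adjust (or loop over a directory) as needed.
with open("legacy.xml", encoding="utf-8") as f:
    converted = named_to_hex(f.read())
with open("legacy.xml", "w", encoding="utf-8") as f:
    f.write(converted)

Note that the XML-predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;) would also be converted to their hex form, which is still valid XML; skip them inside repl if you prefer to keep them named.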
Related
In the article Better web typography in a few simple steps, it says
Talking about apostrophes, the correct sign for them is the right single quotation mark. A dead give-away for amateur typography is the presence of straight quotation marks, also called 'dumb quotes' by type-savvy designers.
I've been using these "dumb quotes" all along!
Now, when one is writing regular HTML (and not Markdown, which automatically produces apostrophes), how is one supposed to sanely write correct apostrophes? Am I just supposed to inject ’ wherever a ' would go before? Is there a program that automatically does this?
How do professional web designers take care of this problem?
You have a couple of options here:
As was pointed out before, either use numerical or named HTML entities.
Write your HTML with single apostrophes and then do a search and replace before publishing (a sketch follows this list). This is workable, but could lead to unexpected replacements if you aren’t careful.
Insert the actual single quote using the appropriate keyboard sequence for your operating system: option-shift-] on a Mac or alt-0146 on a PC, and make sure to save and serve your HTML as UTF-8. That way you don't have to screw around with entity names, but it assumes a UTF-8-clean workflow.
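To make option 2 concrete, here is a minimal Python sketch of such a pre-publish search and replace. The pattern is an assumption: it only touches apostrophes with a letter on each side (contractions and possessives), which keeps it away from most markup, but it does not understand HTML context, so run it on prose content rather than attribute-heavy templates.

import re

# Replace straight apostrophes in contractions/possessives (a letter on each
# side, as in "don't" or "John's") with the &rsquo; entity.
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")

def smarten_apostrophes(text):
    return APOSTROPHE.sub("&rsquo;", text)

print(smarten_apostrophes("It's the cat's pajamas"))
# It&rsquo;s the cat&rsquo;s pajamas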
I would like to program a parser for a markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation, and image embedding, while still allowing HTML for more complex formatting.
Has anyone done similar things or has worked closely with those systems and could describe generally how this could be done efficiently?
Although efficiency is not really a concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.
For your general discussion…
You should start with the following blueprint (a short sketch in code follows the list):
you need to iterate character by character over the entire input
you need to identify every character by its context, since it may be a tag opener ('<', '[', etc.) or just an ordinary character. This can be done with an escape flag, triggered by an escape character (like the backslash in some languages). If you use that approach, you also need to check for an escaped escape character.
you may also need a flag telling you whether you are inside a comment or a special data section, which may have different escaping rules.
you need to build a tree-like structure, or at least a stack, for nested tags. This is why regexes are a bad idea: not only do they carry much too much overhead, they are also of no use if you want to find the correct closing tag for the second x (x = any tag) in the following snippet: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>
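A bare-bones sketch of that blueprint in Python, assuming a BBCode-style syntax with backslash escapes; comments/special sections and error recovery are left out.

def parse(text):
    stack = []          # currently open tags, e.g. ['b', 'i']
    output = []
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if escaped:                       # previous char was the escape char
            output.append(ch)
            escaped = False
        elif ch == "\\":                  # escape char: next char is literal
            escaped = True
        elif ch == "[":                   # possible tag opening
            end = text.find("]", i)
            if end == -1:
                output.append(ch)         # no closing bracket: treat as text
            else:
                tag = text[i + 1:end]
                if tag.startswith("/"):   # closing tag: must match top of stack
                    if stack and stack[-1] == tag[1:]:
                        output.append("</{}>".format(stack.pop()))
                    else:
                        raise ValueError("mismatched closing tag: " + tag)
                else:                     # opening tag
                    stack.append(tag)
                    output.append("<{}>".format(tag))
                i = end
        else:
            output.append(ch)
        i += 1
    if stack:
        raise ValueError("unclosed tags: " + ", ".join(stack))
    return "".join(output)

print(parse(r"[b]bold \[not a tag\] [i]both[/i][/b]"))
# <b>bold [not a tag] <i>both</i></b>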
Reading this amusing rant (RegEx match open tags except XHTML self-contained tags), I wondered: how could regexes be changed to successfully parse HTML?
I'm looking here for suggestions that:
make the minimal addition to regexes as we know and love them (i.e. not "make them look like XSLT!" type answers)
are robust enough to work properly.
suggest syntax (not just list the general requirements)
Has anyone actually made something like this?
Add a new escape sequence:
\H -- match HTML document
DOM/XML parsers internally use regex to parse HTML. The difference between them and using only regex is that they make up for the shortcomings of regex. One of the major shortcomings of regex is handling nested tags and malformed markup (like missing tags). So around the basic regex, all sorts of algorithms and conditions are written to try to handle those things. And then, of course, there are the parts that actually build an object out of the result.
So you asked what it would take to make regex do what a DOM/XML parser does? You would have to somehow cram all those algorithms and conditions into the regex engine, both internally and in the pattern syntax.
I personally do not wish for this to happen. In my opinion, regex should be straight pattern matching, and it already contains some features that are questionable (some regex flavors do in fact have a way to use conditions, for instance). Taking the regex engine and building a larger tool around it (like a DOM/XML parser) is, in my opinion, the best way to go.
It's interesting that real world tools can be and often are modified to perform tasks they might not otherwise be suited for. For example, if someone were to attempt to eat broth with a fork, they would be largely unsuccessful. Enter the spork.
I don't think programmers necessarily work that way all the time. It's not uncommon for tools to expand their scope, but it's also been a long tradition that programmers try to use specific tools for specific purposes.
Now, it just so happens that in order for regex to be able to parse HTML, it would have to be a pattern matcher/recognizer that also remembered state. This is, to a T, exactly what a parser does. It uses pattern matching (indeed, it often uses regex!) in order to match tokens. It then remembers combinations of tokens.
So in fact regex is used very frequently to parse HTML, along with other functions that remember larger patterns that cannot be described or processed using regex alone.
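As a small illustration of that division of labour (the tag pattern and input below are made-up toys, not real-world HTML handling): the regex only recognizes individual tokens, while the surrounding code keeps the stack that remembers which elements are currently open.

import re

# The pattern matches either a tag (capturing whether it closes and its name)
# or a run of plain text; the stack is the "memory" the regex itself lacks.
TOKEN = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")

def outline(html):
    stack = []
    for closing, name, text in TOKEN.findall(html):
        if text:
            continue                       # plain text: nothing to track
        if closing:
            if stack and stack[-1] == name:
                stack.pop()                # well-formed close
            # a real parser would recover from mismatches here
        else:
            stack.append(name)
            print("  " * (len(stack) - 1) + name)

outline("<ul><li>One</li><li>Two<b>!</b></li></ul>")
# ul
#   li
#   li
#     b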
Hope that answers the question.
Perl 6 has a regex extension that is designed to do that: http://en.wikipedia.org/wiki/Perl_6_rules.
Depends what you mean by "parse". Typically this involves transforming a character stream into an object tree. To do this with regular expressions you would need to completely change capturing groups to be a runtime-variable, multi-node tree rather than the compile-time-fixed array that they currently are. Once you've done that, you've just re-written lex/yacc.
Has anyone implemented a good system for ensuring that output is properly HTML-encoded where it makes sense? Maybe even something that recognizes when output should be URL-encoded or JSON-encoded instead?
The lazy approach — just encoding all inputs — causes problems when you want to send those inputs to a database, or to a block of JavaScript code. So something a little smarter is needed.
The tedious approach — putting the proper encoding function around each piece of data on the template — works, but it's easy for developers to forget to do it.
Is there a good approach that makes it easy for developers, and ensures that the right encoding is done? I was listening to one of the SO podcasts, and Joel tossed out an idea about using typed data to enforce a difference between HTML-encoded strings and non-encoded strings. Maybe that could be a starting point.
I'm looking more for a strategy than for an implementation in a particular language (although I'd be happy to hear about implementations that already exist and work).
EDIT: Here are some links I've found so far:
A type-based solution to the "strings problem"
String::Smart
Reducing XSS by way of Automatic Context-Aware Escaping in Template Systems
Secure String Interpolation in JS
Data that goes into your database probably should not have any escaping for HTML, JavaScript, or what have you. If you do include markup, you'll just have to strip it out if you decide to inject this data into a CSV file or PDF, etc...
Instead, whenever you query 'raw' data like this out of the database, escape it at that time, as appropriate to wherever you're injecting it: HTML, a JavaScript string, server-side scripting, etc.
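For example, in Python the same raw value would pass through a different standard-library encoder depending on the output context; the wrapper names here are just illustrative.

import html
import json
from urllib.parse import quote

raw = 'He said "5 > 3" & left'

def for_html(value):
    return html.escape(value)              # HTML body / attribute context

def for_js_string(value):
    return json.dumps(value)               # produces a quoted, escaped JS string literal

def for_url(value):
    return quote(value)                    # query-string component

print(for_html(raw))       # He said &quot;5 &gt; 3&quot; &amp; left
print(for_js_string(raw))  # "He said \"5 > 3\" & left"
print(for_url(raw))        # He%20said%20%225%20%3E%203%22%20%26%20left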
I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing to me; I always thought that, in general, the best way to parse any complicated string is with a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?
One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert an HTML string to DOM nodes.
Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.
So how does a HTML parser work? Doesn't it use regular expressions to parse?
Well, no.
If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.
The simplest type of language and computation (for these purposes) is a regular language. These can be generated with regular expressions and recognized with finite automata. Basically, that means that "parsing" strings in these languages uses state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.
A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.
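For illustration, the memory needed is tiny; a single counter (the degenerate case of a stack) is enough, but it is still more than a finite automaton has:

def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # closed more than we opened
                return False
    return depth == 0

print(balanced("((()())())"))   # True
print(balanced("(()"))          # False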
A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.
Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.
The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.
To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.
HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.
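As a rough sketch of that split, Python's standard-library HTMLParser plays the role of the tokenizer, emitting start-tag, end-tag and text events, and the handler below is a deliberately naive parsing step that assembles them into a nested structure:

from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]          # the parser's memory

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()              # naive: assumes well-nested input

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data.strip())

builder = TreeBuilder()
builder.feed("<ul><li>One</li><li><b>Two</b></li></ul>")
print(builder.root)
# {'tag': '#document', 'children': [{'tag': 'ul', 'children': [
#   {'tag': 'li', 'children': ['One']},
#   {'tag': 'li', 'children': [{'tag': 'b', 'children': ['Two']}]}]}]}
# (output shown wrapped across lines for readability)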
Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".
Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.
The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).
Parsing HTML is the transformation of a linear text into a tree structure. Regular expressions cannot generally handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.
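A toy sketch of that idea: the regex used to grab the next token depends on the current state (inside text vs. inside a tag). The states and patterns here are simplified assumptions, not a real HTML grammar.

import re

PATTERNS = {
    "data": re.compile(r"[^<]+|<"),       # text, or the start of a tag
    "tag":  re.compile(r"[^>]*>"),        # everything up to the tag's end
}

def tokens(html):
    state, pos = "data", 0
    while pos < len(html):
        match = PATTERNS[state].match(html, pos)
        if not match:
            break
        chunk = match.group(0)
        if state == "data" and chunk == "<":
            state = "tag"                 # switch regex for the next step
        elif state == "tag":
            yield ("tag", "<" + chunk)
            state = "data"
        else:
            yield ("text", chunk)
        pos = match.end()

print(list(tokens("<p>Hi <b>there</b></p>")))
# [('tag', '<p>'), ('text', 'Hi '), ('tag', '<b>'), ('text', 'there'),
#  ('tag', '</b>'), ('tag', '</p>')]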
If you want a 100% solution: you need to write your own custom code that iterates through the HTML character by character, with a tremendous amount of logic to determine whether to close the current node and start the next.
The reason is that this is valid HTML:
<ul>
<li>One
<li>Two
<li>Three
</ul>
But so is this:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
If you are OK with a "90% solution": then using an XML parser to load a document is fine. Or use regex (though the XML parser is easier if you are master of the content).
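To see that trade-off concretely, here is a small Python sketch: a strict XML parser accepts the explicitly closed list but rejects the equally valid HTML with implied closing tags, which is why it is only a "90% solution" unless you control the markup.

import xml.etree.ElementTree as ET

implied = "<ul><li>One<li>Two<li>Three</ul>"
explicit = "<ul><li>One</li><li>Two</li><li>Three</li></ul>"

for name, markup in [("implied", implied), ("explicit", explicit)]:
    try:
        items = [li.text for li in ET.fromstring(markup)]
        print(name, "parsed:", items)
    except ET.ParseError as err:
        print(name, "rejected:", err)

# Typical output:
#   implied rejected: mismatched tag: ...
#   explicit parsed: ['One', 'Two', 'Three']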