How to program Emacs to syntax highlight HTML character references specified numerically

In HTML major mode, Emacs is programmed to syntax highlight HTML character entity references (i.e., character references specified by name, e.g., &nbsp;) but not, for some reason, numeric character references (e.g., &#160; or &#xa0;). I guess this is a special case of the more general problem of customizing syntax highlighting in a given mode. I imagine it involves some use of regexes. Can someone give me some guidance on how to get started with this?

The following code snippet should help:
(add-to-list 'sgml-font-lock-keywords-2
'("\\&#x?[0-9a-fA-F][0-9a-fA-F]*;?" . font-lock-variable-name-face))
However, it must be evaluated after sgml-mode (which provides html-mode) has been loaded. You can force loading with the following command:
(require 'sgml-mode)
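To see which strings the pattern above accepts, here is a minimal Python sketch. It is only a rough translation (Emacs regexps and Python's re module are different dialects), and the names LOOSE and STRICT as well as the stricter variant are my own assumptions, not part of the answer.

```python
import re

# Rough Python translation of the Emacs regexp from the answer.
LOOSE = re.compile(r"&#x?[0-9a-fA-F][0-9a-fA-F]*;?")

# Stricter variant (an assumption, not from the answer): hex digits are
# only allowed after an x prefix, and the closing semicolon is required.
STRICT = re.compile(r"&#(?:[xX][0-9a-fA-F]+|[0-9]+);")

text = "a&#160;b &#xa0; &nbsp; &#zz; &#12ab"

loose_hits = LOOSE.findall(text)
strict_hits = STRICT.findall(text)

print(loose_hits)   # ['&#160;', '&#xa0;', '&#12ab']
print(strict_hits)  # ['&#160;', '&#xa0;']
```

The loose form also accepts a decimal reference containing hex digits and no semicolon (&#12ab). Whether that matters depends on how forgiving you want the highlighting to be.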


Trying to get `sed` to fix a phpBB board - but I cannot get my regexp to work

I have an ancient phpBB3 board which has gone through several updates over its 15+ years of existence. Sometimes, in the distant past, such updates would partially fail, leaving all sorts of 'garbage' in the BBCode. I'm now trying to do a 'simple' regexp to match a particular issue and fix it.
What happened was the following... during a database update, long long ago, BBCode tags were, for some reason, 'tagged' with a pseudo-attribute, allegedly so that the database update script could figure out each token that required updating, I guess. This attribute was always an 8-character alphanumeric string, 'appended' to the actual BBCode with a colon, like this:
[I]something in italic[/I]
...
[I:i9o7y3ew]something in italic[/I:i9o7y3ew]
Naturally, phpBB doesn't recognise this as valid BBCode, and just prints the whole text out.
The replacement regexp is actually very basic:
s/\[(\/?)(.+):[[:alnum:]]{0,8}\]/[\1\2]/gim
You can see a working example on regex110.com (where capture groups use $1 instead of \1). The example given there includes a few examples from the actual database itself. [i] is actually the simplest case; there are plenty of others which are perfectly valid but a bit more complex, thus requiring a (.+) matcher, such as [quote=\"Gwyneth Llewelyn\":2m80kuso].
As you can see from the example on regex110.com, this works :-)
Why doesn't it work under (GNU) sed? I'm using version 4.8 under Linux:
$ sed -i.bak -E "s/\[(\/?)(.+):[[:alnum:]]+\]/[\1\2]/gim" table.sql
Just for the sake of the argument, I tried using [A-Za-z0-9]+ instead of [[:alnum:]]+; I've even tried (.+) (to capture the group and then just discard it).
None produced an error; none did any replacements whatsoever.
I understand that there are many different regexp engines out there (PCRE, PCRE2, Boost, etc. and so forth) so perhaps sed is using a syntax that is inconsistent with what I'm expecting...?
Rationale: well, I could have done this differently; after all, MySQL has built-in regexp replacements, too. However, since this particular table is so big, it takes an eternity. I thought I'd be far better off dumping everything to a text file, doing the replacements there, and importing the table again. There is a catch, though: the file is 95 MBytes in size, which means that most tools I've got (e.g. editors with built-in regexp search & replace) will fail on such a huge file. One notable exception is good old emacs, which has no trouble with such large files. Alas, emacs cannot match anything, so I thought I'd give sed a try (it should be faster, too). sed also takes close to a minute or so to process the whole file (about the same as emacs, in fact) and has the same result, i.e. no replacements are made. It seems to me that, although the underlying technology is so different (pure C vs. Emacs-LISP), both these tools somehow rely on similar algorithms... both of which fail.
My understanding is that some libraries use different conventions to signal literal vs. metacharacters and quantifiers. Here is an example from an instruction manual for vim: http://www.vimregex.com/#compare
Indeed, contemporary versions of sed seem to be able to handle two different kinds of conventions (thus the -E flag). The issue I have with my regexp is that I find it very difficult to figure out which convention to apply. Let's start with what I'm used to from PHP, Go, JavaScript and a plethora of other regexp implementations, which use the convention that metacharacters & quantifiers do not get backslashed (while literals do).
Thus, \[(\/?)(.+):[[:alnum:]]+\] presumes that there are a few literal matches for [, ], /, and only these few cases require backslashes.
Using the reverse convention (i.e. literals do not get backslashed, while metacharacters and some quantifiers do) this would be written as:
[\(/\?\)\(\.\+\):\[\[:alnum:\]\]\+]
Or so I would think.
Sadly, sed also rejects this with an error — and so do vim and emacs, BTW (they seem to use a similar regexp library, or perhaps even the same one).
So what is the correct way to write my regexp so that sed accepts it (and does what I intend it to do)?
UPDATE
I have since learned that, in the database, phpBB, unlike I assumed, does not store BBCode (!) but rather a variant of HTML (some tags are the same, some are invented on the spot). What happens is that BBCode gets translated into that pseudo-HTML, and back again when displaying; that, at least, explains why phpBB extensions such as Markdown for phpBB — but also BBCode add-ons! — can so easily replace, partially or even totally, whatever is in the database, which will continue to work (to a degree!) even if those extensions get deactivated: the parsed BBCode/Markdown is just converted to this 'special' styling in the database, and, as such, will always be rendered correctly by phpBB3, no matter what.
In other words, fixing those 'broken' phpBB tags requires a bit more processing, and not merely search & replace with a single regexp.
Nevertheless, my question is still pertinent to me. I'm not really an expert with regexps but I know the basics — enough to make my life so much easier! — and it's always good to understand the different 'dialects' used by different platforms.
Notably, instead of using egrep and/or grep -E, I'm fond of using ugrep. It uses PCRE2 expressions (with the Boost library), and maybe that's the issue I'm having with the sed engine(s): the different engines speak different regular expression dialects, and converting from one grep variant to a different one might not be useful at all (because some options will not 'translate' well enough)...
Using sed
(\[[^:]*) - Capture everything from an opening bracket up to, but not including, the next colon; the parentheses make it available later as back-reference \1
[^]]* - Match (and thereby discard) everything else up to, but not including, the next closing bracket
$ sed -E 's/(\[[^:]*)[^]]*/\1/g' table.sql
[I]something in italic[/I]
...
[I]something in italic[/I]
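As a cross-check outside sed, here is a minimal Python sketch of how the patterns behave on the sample line. GREEDY is the question's pattern and ANSWER is the sed pattern above, both transcribed for Python's re module. STRICT is my own assumption (no brackets inside the tag, suffix of exactly eight alphanumerics) and has not been tested against a real phpBB dump.

```python
import re

# Pattern from the question: the greedy (.+) runs to the LAST colon on the line.
GREEDY = re.compile(r"\[(/?)(.+):[A-Za-z0-9]+\]")

# Pattern from the sed answer above, transcribed for Python.
ANSWER = re.compile(r"(\[[^:]*)[^\]]*")

# Stricter sketch (an assumption): the tag body may not contain brackets,
# and the suffix is a colon plus exactly eight alphanumeric characters.
STRICT = re.compile(r"\[(/?[^\[\]]*?):[A-Za-z0-9]{8}\]")

line = "[I:i9o7y3ew]something in italic[/I:i9o7y3ew]"

greedy_result = GREEDY.sub(r"[\1\2]", line)
answer_result = ANSWER.sub(r"\1", line)
strict_result = STRICT.sub(r"[\1]", line)

print(greedy_result)  # [I:i9o7y3ew]something in italic[/I]
print(answer_result)  # [I]something in italic[/I]
print(strict_result)  # [I]something in italic[/I]

# Ordinary text containing a colon: ANSWER eats it, STRICT leaves it alone.
plain = "[b]Note: hello[/b]"
print(ANSWER.sub(r"\1", plain))    # [b]Note]
print(STRICT.sub(r"[\1]", plain))  # [b]Note: hello[/b]
```

So, at least in Python, the greedy pattern only repairs the closing tag of each line, and the short sed pattern looks risky on posts that contain colons in normal text. It may be worth running any variant against a copy of the dump first.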

Alternative to entering entity references in source code

Google's HTML/CSS Style Guide advises against using entity references:
Do not use entity references.
There is no need to use entity references like &mdash;, &rdquo;, or &#x263a;, assuming the same encoding (UTF-8) is used for files and editors as well as among teams.
<!-- Not recommended -->
The currency symbol for the Euro is &ldquo;&eur;&rdquo;.
<!-- Recommended -->
The currency symbol for the Euro is “€”.
I'm not sure I understand what it is that they are proposing. The only thing I can think of is that they are saying that you should be using your text editor's insert character command (e.g., in Atom, Ctrl-Shift-U, or in Emacs, C-x 8) to enter Unicode characters rather than typing in the literal entity references. Is that it?
The only thing I can think of is that they are saying that you should be using your text editor's insert character command […] rather than typing in the literal entity references. Is that it?
Yes, that's precisely what they're saying.
You don't write &#65; to insert the letter A, after all! There's no more reason to write &auml; for ä, or &hearts; for ♥, when those characters can be represented directly in the HTML file.
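To illustrate that a reference and the literal character are interchangeable once the page is decoded, here is a small Python sketch using the standard library's html.unescape. The particular references in the table are just examples I picked, not taken from the style guide.

```python
import html

# Character references paired with the literal character each one stands for.
pairs = {
    "&auml;": "ä",
    "&hearts;": "♥",
    "&#65;": "A",
    "&#x263a;": "☺",
}

decoded = {ref: html.unescape(ref) for ref in pairs}

for ref, char in pairs.items():
    # html.unescape resolves both named and numeric character references.
    print(f"{ref} decodes to {decoded[ref]!r}, literal is {char!r}")
```

In a file saved as UTF-8 the literal form is shorter and easier to read, which seems to be the whole point of the guideline.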

How to embed HTML string syntax in CoffeeScript using VIM?

I have looked at how to embed HTML syntax in JavaScript strings, from "HTML syntax highlighting in javascript strings in vim".
However, when I use CoffeeScript I cannot get the same thing working by editing the coffee.vim syntax file in a similar way. I got recursion errors saying that including html.vim makes the nesting too deep.
I have some HTML templates in CoffeeScript like the following:
angular.module('m', [])
  .directive(
    'myDirective'
    [
      ->
        template: """
          <div>
            <div>This is <b>bold</b> text</div>
            <div><i>This should be italic.</i></div>
          </div>
        """
    ]
  )
How do I get the template HTML syntax in CoffeeScript string properly highlighted in VIM?
I would proceed as follows:
Find out the syntax groups that should be highlighted as pure html would be. Add html syntax highlighting to these groups.
To find the valid syntax group under the cursor you can follow the instructions here.
In your example the syntax group of interest is coffeeHereDoc.
To add HTML highlighting to this group, execute the following commands:
unlet b:current_syntax
syntax include @HTML syntax/html.vim
syn region HtmlEmbeddedInCoffeeScript start="" end=""
    \ contains=@HTML containedin=coffeeHereDoc
Since Vim complains about recursion if you add these lines to coffee.vim, I would go with an autocommand:
function! Coffee_syntax()
  " b:current_syntax may not be set yet, so test for its existence first
  if exists('b:current_syntax')
    unlet b:current_syntax
  endif
  syn include @HTML syntax/html.vim
  syn region HtmlEmbeddedInCoffeeScript start="" end="" contains=@HTML
        \ containedin=coffeeHereDoc
endfunction
autocmd BufEnter *.coffee call Coffee_syntax()
I was also running into various issues while trying to get this to work. After some experimentation, here's what I came up with. Just create .vim/after/syntax/coffee.vim with the following contents:
unlet b:current_syntax
syntax include @HTML $VIMRUNTIME/syntax/html.vim
syntax region coffeeHtmlString matchgroup=coffeeHeredoc
\ start=+'''\(\_\s*<\w\)\@=+ end=+\(\w>\_\s*\)\@<='''+
\ contains=@HTML
syn sync minlines=300
The unlet b:current_syntax line disables the current syntax matching and lets the HTML syntax definition take over for matching regions.
Using an absolute path for the html.vim inclusion avoids the recursion problem (described more below).
The region definition matches heredoc strings that look like they contain HTML. Specifically, the start pattern looks for three single quotes followed by something that looks like the beginning of an HTML tag (there can be whitespace between the two), and the end pattern looks for the end of an HTML tag followed by three single quotes. Heredoc strings that don't look like they contain HTML are still matched using the coffeeHeredoc pattern. This works because this syntax file is being loaded after the syntax definitions from the coffeescript plugin, so we get a chance to make the more specific match (a heredoc containing HTML) before the more general match (the coffeeHeredoc region) happens.
The syn sync minlines=300 widens the matching region. My embedded HTML strings sometimes stretched over 50 lines, and Vim's syntax highlighter would get confused about how the string should be highlighted. For complete surety you could use syn sync fromstart, but for large files this could theoretically be slow (I didn't try it).
The recursion problem originally experienced by @heartbreaker was caused by the html.vim script that comes with the vim-coffeescript plugin (I'm assuming that was being used). That plugin's html.vim file includes its coffee.vim syntax file to add CoffeeScript highlighting to HTML files. Using a relative syntax include, a la
syntax include @HTML syntax/html.vim
you get all the syntax/html.vim files in Vim's runtime path, including the one from the coffeescript plugin (which includes coffee.vim, hence the recursion). Using an absolute path restricts you to the particular syntax file you specify, but this seems like a reasonable tradeoff, since the HTML one would embed in a CoffeeScript string is likely fairly simple.

How can I disable Vim's HTML error highlighting?

I use Vim to edit HTML with embedded macros, where the macros are bracketed with double angle brackets, e.g., "<< ... >>". Vim's standard HTML highlighting sees the second "<" and ">" as errors, and highlights them as such. How can I prevent this? I'd be happy to either teach $VIMHOME/syntax/html.vim that double angle brackets are OK, or to simply disable the error highlighting, but I'm not sure how to do either one. ("highlight clear htmlTagError" has no effect. In fact, "highlight clear" has no effect in an HTML buffer.)
If you want to introduce full syntax highlighting in your macros, it'll be easiest to start with a syntax file like htmldjango ($VIMRUNTIME/syntax/htmldjango.vim, which then uses html.vim and django.vim from the same directory); in it, there is special meaning in {{ ... }}, among other things. You want it just the same, but with << and >> being your delimiters.
To just highlight << ... >> specially, you'd need a syntax line like this:
syntax region mylangMacro start="<<" end=">>" containedin=ALLBUT,mylangMacro
And then you could highlight it with:
highlight default link mylangMacro Macro
This could either go in ~/.vim/after/syntax/html.vim or could be done in the style of htmldjango as a new syntax highlighter (this would be my preferred approach; you can then make HTML files use this new syntax file with an autocmd).
(You can also remove the error highlighting with syntax clear htmlTagError which would go in the same sort of position. But hopefully you'll think getting separate highlighting is better than just removing the error.)
Here are instructions to edit existing syntax highlighting in Vim:
http://vimdoc.sourceforge.net/htmldoc/syntax.html#mysyntaxfile-add
vim runtime paths for Unix/Linux:
$HOME/.vim,
$VIM/vimfiles,
$VIMRUNTIME,
$VIM/vimfiles/after,
$HOME/.vim/after
Create a directory in your vim runtime path called "after/syntax".
Commands for Unix/Linux:
mkdir ~/.vim/after
mkdir ~/.vim/after/syntax
Write a Vim script that contains the commands you want to use. For example, to change the colors for the C syntax:
highlight cComment ctermfg=Green guifg=Green
Write that file in the "after/syntax" directory. Use the name of the syntax, with ".vim" added. For our C syntax:
:w ~/.vim/after/syntax/c.vim
That's it. The next time you edit a C file the Comment color will be different. You don't even have to restart Vim.
If you have multiple files, you can use the filetype as the directory name. All the "*.vim" files in this directory will be used, for example:
~/.vim/after/syntax/c/one.vim
~/.vim/after/syntax/c/two.vim
Alternatively, you could take a much easier route and use syntax highlighting in the nano command-line editor, where you can define your own syntax very easily with regular expressions:
http://how-to.wikia.com/wiki/How_to_use_syntax_highlighting_with_the_GNU_nano_text_editor

Find and Replace with Notepad++

I have a document that was converted from PDF to HTML for use on a company website, to be referenced and indexed for search. I'm trying to format the converted document to meet my needs, and in doing so I am cleaning up some of the junk that was carried over from the PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of four lines. Unfortunately, they are not exactly the same, so they cannot be removed with a simple literal replace: the lines contain numbers that increment with the pages. How can I remove blocks like the following example from my HTML file?
Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>
I've tried many different regular expressions, but as my skill in that area is limited I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it would seem all I need is a wildcard replace for the two numbers in the code, and the rest is literal.
Any help is appreciated.
The search & replace in Notepad++ is quite odd. I can't find newline characters with a regular expression, although the documentation says:
As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.
I updated to the latest version, but it just doesn't work. Using the extended mode allows me to find newlines, but then I can't specify wildcards.
However, you can use macros to overcome these problems.
prepare a search that will find a unique passage (like Title<br>\r\n, here you can use the extended mode)
start recording a macro
press F3 to use your search
mark the four lines and delete them
stop recording the macro ... done!
Just replay it and it deletes what you wanted to delete.
If I have understood your request correctly, this pattern matches your string:
Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>
I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
edit
As I do not use Notepad++, I cannot be sure that this pattern will work for you. Apologies if it turns out not to. (I'm a TextPad man myself, and it does work with that tool.)
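If the editor's regex mode keeps getting in the way, the same pattern can be tried outside Notepad++. Below is a minimal Python sketch. The \r? parts are my own assumption that the converted file may use Windows line endings, and BLOCK and strip_page_blocks are made-up names for illustration.

```python
import re

# The pattern from the answer above, split per line and made tolerant of
# \r\n line endings (an assumption about the converted file).
BLOCK = re.compile(
    r"Title<br> ?\r?\n"
    r"[0-9]+<br> ?\r?\n"
    r"<hr> ?\r?\n"
    r"<A name=[0-9]+></a>Footer<br> ?\r?\n?"
)

def strip_page_blocks(html_text: str) -> str:
    """Remove every four-line page header/footer block from html_text."""
    return BLOCK.sub("", html_text)

sample = (
    "First paragraph<br>\n"
    "Title<br>\n"
    "10<br>\n"
    "<hr>\n"
    "<A name=11></a>Footer<br>\n"
    "Second paragraph<br>\n"
)

cleaned = strip_page_blocks(sample)
print(cleaned)
```

On the sample this leaves only the two paragraph lines. The literal words Title and Footer are taken from the example in the question and would need to match the real header and footer text.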