Creating a Sphinx code-block with inline text parsing - html

I'm trying to create a directive that will allow me to parse links inside a Sphinx CodeBlock directive. I looked at the ParsedLiteral directive from docutils, which does something like that, only it doesn't do syntax highlighting like CodeBlock. I tried replacing the part of CodeBlock (in sphinx/directives/code.py) that generates the literal_block:
literal: Element = nodes.literal_block(code, code)
with
text_nodes, messages = self.state.inline_text(code, self.lineno)
literal: Element = nodes.literal_block(code, "", *text_nodes)
which is what the docutils ParsedLiteral directive does, while keeping the rest of the Sphinx CodeBlock unchanged. This parses the code correctly, but it does not apply syntax highlighting, so I'm wondering where the syntax highlighting takes place, and why it isn't happening in my modified CodeBlock directive.
I'm very confused as to why this is the case and I'm looking for some input from smarter people than me.

Syntax highlighting is applied at the translation phase; see sphinx.writers.html.HTMLTranslator.visit_literal_block:
def visit_literal_block(self, node: Element) -> None:
    if node.rawsource != node.astext():  # <<< LOOK AT HERE
        # most probably a parsed-literal block -- don't highlight
        return super().visit_literal_block(node)
    lang = node.get('language', 'default')
    linenos = node.get('linenos', False)
    # do highlight...
If the node's rawsource is not equal to its text, highlighting is not applied.
In your example, code is obviously not equal to the text of text_nodes.
Setting literal.rawsource to literal.astext() fixes the syntax highlighting.
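For example, here is a minimal, untested sketch of how that could look as a small extension instead of patching sphinx/directives/code.py directly. The directive name parsed-code-block is made up for illustration, and the sketch assumes the plain case where CodeBlock.run() returns a single literal_block (i.e. no :caption: option):
from sphinx.directives.code import CodeBlock

class ParsedCodeBlock(CodeBlock):
    def run(self):
        # assumes run() returned exactly one literal_block node
        [literal] = super().run()
        # re-parse the block's text as inline markup, as ParsedLiteral does
        text_nodes, messages = self.state.inline_text(literal.astext(), self.lineno)
        literal.clear()
        literal += text_nodes
        # keep rawsource equal to the node's text so that
        # HTMLTranslator.visit_literal_block still applies highlighting
        literal.rawsource = literal.astext()
        return [literal] + messages

def setup(app):
    app.add_directive('parsed-code-block', ParsedCodeBlock)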

Related

How to highlight nested structure in SourceGraph structured search?

I have the following SourceGraph structured search: repo:… file:… "tls_certs" {...default = {...}...} which correctly matches:
variable "tls_certs" {
description = "…"
type = map(string)
default = {
…
}
}
It's currently highlighting the entire "tls_certs" block. I would like it to highlight only the default = block. Assuming that's possible, how would that be done?
(I'm assuming you want to scope your search to Terraform files based on the example match provided)
Try this and see if it works for you: :[~[\s\n]]default = {...} lang:Terraform
It'll match a block of the form default = {...} that's preceded by whitespace or a newline. It's not strictly guaranteed to only match nested structures, but it seems to work well with the lang:Terraform filter.
It uses both the ... and the :[~regexp] syntax of structural search. (Syntax reference docs: https://docs.sourcegraph.com/code_search/reference/structural#syntax-reference)
Example: https://sourcegraph.com/search?q=context:global+:%5B~%5B%5Cs%5Cn%5D%5Ddefault+%3D+%7B...%7D+lang:Terraform+-repo:%5Egithub%5C.com/Wilfred/difftastic$&patternType=structural

How to mark a position in the stream in pyparsing, to come back to it later?

Background:
I am trying to implement a simple (?) markup language to be used to write novels.
This is quite different from usual markups because the semantics are centered on different primitives; in particular, direct speech is far from being similar to a list.
The basic structure is well-known: #part{title}, #chapter{title} and #scene[{title}] have the usual meanings and double-\n indicates a paragraph break.
Specific features include:
#speech[speaker]{utterance, possibly complex}
#stress{something that should be visually enhanced}
#standout{some part that should have a different visual enhancement}
#quotation[original author]{possibly long block quotation}
This should be parsed and translated to different output formats (e.g.: html and LaTeX).
I have a pyparsing grammar able to parse a non-trivial input.
Problem is generation of paragraphs for HTML:
As said, a paragraph ends with a double newline, but it essentially starts at the end of the previous paragraph, unless some top-level construct (e.g. #chapter) intervenes to break the sequence.
A first naive attempt was to accumulate text fragments in a global buffer and to emit them at selected points; this would logically work, but pyparsing seems to call its parse actions multiple times, so my global buffer ends up holding the same fragment duplicated.
I have not found a way to either avoid such duplication or to mark the "start of paragraph" in such a way that I can come back to it later to generate the well-known <p>Long line, maybe containing #speech{possibly nested with #standout{!} and other constructs}</p> (of course #standout should map to <b>!</b> and #speech to some specific <div class="speech"></div>).
What is the "best practice" to handle this kind of problems?
Note: LaTeX code generation is much less problematic because paragraphs are simply terminated (like in the markup) either with a blank line or with \par.
Is it possible for you to recast this not as a "come back to the beginning later" problem but as a "read ahead as far as I need to get the whole thing" problem?
I think nestedExpr might be a way for you to read ahead to the next full markup, and then have a parse action re-parse the contents in order to process any nested markup directives. nestedExpr returns its parsed input as a nested list, but to get everything as a flattened string, wrap it in originalTextFor.
Here is a rework of the simpleWiki.py example from the pyparsing examples:
import pyparsing as pp

wiki_markup = pp.Forward()

# a method that will construct and return a parse action that will
# do the proper wrapping in opening and closing HTML, and recursively call
# wiki_markup.transformString on the markup body text
def convert_markup_to_html(opening, closing):
    def conversionParseAction(s, l, t):
        return opening + wiki_markup.transformString(t[1][1:-1]) + closing
    return conversionParseAction

# use a nestedExpr with originalTextFor to parse nested braces, but return the
# parsed text as a single string containing the outermost nested braces instead
# of a nested list of parsed tokens
markup_body = pp.originalTextFor(pp.nestedExpr('{', '}'))

italicized = ('ital' + markup_body).setParseAction(convert_markup_to_html("<I>", "</I>"))
bolded = ('bold' + markup_body).setParseAction(convert_markup_to_html("<B>", "</B>"))

# another markup and parse action to parse links - again using transformString
# to recursively parse any markup in the link text
def convert_link_to_html(s, l, t):
    t['link_text'] = wiki_markup.transformString(t['link_text'])
    return '<A href="{url}">{link_text}</A>'.format_map(t)

urlRef = ('link'
          + '{' + pp.SkipTo('->')('link_text') + '->' + pp.SkipTo('}')('url') + '}'
          ).setParseAction(convert_link_to_html)

# now inject all the markup bits as possible markup expressions
wiki_markup <<= urlRef | italicized | bolded
Try it out!
wiki_input = """
Here is a simple Wiki input:
ital{This is in italics}.
bold{This is in bold}!
bold{This is in ital{bold italics}! But this is just bold.}
Here's a URL to link{Pyparsing's bold{Wiki Page}!->https://github.com/pyparsing/pyparsing/wiki}
"""
print(wiki_markup.transformString(wiki_input))
Prints:
Here is a simple Wiki input:
<I>This is in italics</I>.
<B>This is in bold</B>!
<B>This is in <I>bold italics</I>! But this is just bold.</B>
Here's a URL to <A href="https://github.com/pyparsing/pyparsing/wiki">Pyparsing's <B>Wiki Page</B>!</A>
Given your markup examples, I think this approach may get you further along.
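For instance, here is a minimal, untested sketch of the same idea applied to a couple of your constructs (the #stress and #standout names come from your examples; the HTML they map to here is only illustrative):
import pyparsing as pp

novel_markup = pp.Forward()

def wrap_in(opening, closing):
    # strip the outer braces and recursively transform the body
    def action(s, l, t):
        return opening + novel_markup.transformString(t[-1][1:-1]) + closing
    return action

# originalTextFor + nestedExpr returns the braced body as one flat string
body = pp.originalTextFor(pp.nestedExpr('{', '}'))

stress = ('#stress' + body).setParseAction(wrap_in('<i>', '</i>'))
standout = ('#standout' + body).setParseAction(wrap_in('<b>', '</b>'))

novel_markup <<= stress | standout

print(novel_markup.transformString(
    "A line with #stress{some #standout{nested} emphasis} in it."))
# prints: A line with <i>some <b>nested</b> emphasis</i> in it.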

Proper html attribute highlighting in Vim?

While I was looking for proper HTML tag highlighting in Vim, I found this post. But after looking at romainl's answer (and his screenshot) and the html syntax file, I wonder how I can change the color of the = (equal sign) after an attribute to match the color of the attribute, without changing the HTML tag's color?
Exploration
Here is a very useful function I've found somewhere (a long time ago, probably on the Vim Wiki) that gives you the syntax group of the word/symbol under your cursor:
function! SynStack()
  if !exists("*synstack")
    return
  endif
  echo map(synstack(line('.'), col('.')), 'synIDattr(v:val, "name")')
endfunc
Just place your cursor on the item you want to inspect and type :call SynStack() to echo the syntax group in the command-line.
If I place my cursor on the = in <div id="example"></div>, the output of SynStack() is ['htmlTag'].
With the cursor on <> I get ['htmlTag'] as well.
With the cursor on div I get ['htmlTag', 'htmlTagN', 'htmlTagName'] which means that the color of div (h1, p…) is defined via a special syntax group called htmlTagName that inherits from htmlTag.
Some alternative/custom syntax files may define other syntax groups with slightly different names, so my example is only valid for me. You'll have to play with SynStack() to get the correct syntax groups.
Reflection
With the info we have gathered so far, it's obvious that the tag name (htmlTagName) can be styled independently from the rest of the tag, but it doesn't seem doable to highlight the = differently. Because it is part of the same syntax group as <>, the = will necessarily be highlighted the same.
We have 2 possibilities:
a. <, = and > are the same colour while div is different.
b. <, div, = and > are all the same colour.
The original theme followed path a which I didn't like, so I had to customize it a little (path b) with the few lines in my answer to the previous question:
hi htmlTag guifg=#90b0d1 gui=NONE
hi htmlSpecialTagName guifg=#90b0d1 gui=NONE
hi htmlTagName guifg=#90b0d1 gui=NONE
hi htmlEndTag guifg=#90b0d1 gui=NONE
As it is, having the = coloured differently than <> is not possible. If we want to colorize the =, we are going to have to edit the HTML syntax file and your colorscheme, cowboy style.
Action
The first step is to make a local copy of the default HTML syntax file:
$ cp /usr/share/vim/vim73/syntax/html.vim ~/.vim/syntax/html.vim
The next step is to edit this file. We are going to perform two changes:
add the definition of the htmlEqualSign syntax group
Line 44 should be (Attention! Not thoroughly tested.):
syn match htmlEqualSign contained "="
add htmlEqualSign to the htmlTag group
Line 40 of ~/.vim/syntax/html.vim should be changed from:
syn region htmlTag start=+<[^/]+ end=+>+ contains=htmlTagN,htmlString,htmlArg,htmlValue,htmlTagError,htmlEvent,htmlCssDefinition,#htmlPreproc,#htmlArgCluster
to:
syn region htmlTag start=+<[^/]+ end=+>+ contains=htmlTagN,htmlString,htmlArg,htmlValue,htmlTagError,htmlEvent,htmlCssDefinition,#htmlPreproc,#htmlArgCluster,htmlEqualSign
The last step is to edit your colorscheme so that it colorizes = the way you want. You do that by adding this line somewhere in your colorscheme:
hi htmlEqualSign guifg=#00ff00
With the color of your choice, of course.
But I think that you want = to be the same color as id (that's not very clear from your question). For that, we are going to "link" the htmlEqualSign group to the one being used for attributes. Again, :call SynStack() is of great help: the syntax group for attributes is htmlArg so the line to add to your colorscheme would be:
hi link htmlEqualSign htmlArg

Find and replace entire HTML nodes with Nokogiri

I have some HTML that should be transformed, with some tags replaced by other tags.
I don't know these tags in advance, because they will come from the database, so Nokogiri's set_attribute or name methods are not suitable for me.
I need to do it in a way like this pseudo-code:
def preprocess_content
  doc = Nokogiri::HTML( self.content )
  doc.css("div.to-replace").each do |div|
    # "get_html_text" will obtain HTML from the db. It can be anything, even other tags, tag groups etc.
    div.replace self.get_html_text
  end
  self.content = doc.css("body").first.inner_html
end
I found the Nokogiri::XML::Node::replace method, which I think is the right direction.
This method expects a node_or_tags parameter.
Which method should I use to create a new Node from text and replace the current one with it?
Like this:
doc.css("div.to-replace").each do |div|
new_node = doc.create_element "span"
new_node.inner_html = self.get_html_text
div.replace new_node
end

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
for each y_match in x_match.group(1).value.match_all(y_re)
print y_match.group(1).value
next y_match
next x_match
Pseudo-output:
a
b
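For reference, here is a minimal runnable sketch of the same two-level approach in Python (the tag names x and y are taken from the question; re.DOTALL is added in case the elements span multiple lines):
import re

text = "<x>...<y>a</y>...<y>b</y>...</x>"

x_re = re.compile(r"<x[^>]*>(.*?)</x>", re.DOTALL)
y_re = re.compile(r"<y[^>]*>(.*?)</y>", re.DOTALL)

# first isolate each <x> element, then collect the <y> contents inside it
for x_match in x_re.finditer(text):
    print(y_re.findall(x_match.group(1)))  # prints: ['a', 'b']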
Further clarification in the comments revealed that there is an arbitrary number of Y elements within any X element. This means there can be no single regex that matches them all and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
    System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.