Create an NCX file with Notepad++ and Regular expression - html

I have an HTML Table of Contents page containing a list of book chapters with hyperlinks:
Multimedia Implementation<br/>
Table of Contents<br/>
About the Author<br/>
About the Technical Reviewers<br/>
Acknowledgments<br/>
Part I: Introduction and Overview<br/>
Chapter 1. Technical Overview<br/>
...
I want to create an NCX file for a Kindle book, which must contain entries like the following:
<navPoint id="n1" playOrder="1">
<navLabel>
<text>Multimedia Implementation</text>
</navLabel>
<content src="final/main.html"/>
</navPoint>
<navPoint id="n2" playOrder="2">
<navLabel>
<text>Table of Contents</text>
</navLabel>
<content src="final/toc.html"/>
</navPoint>
<navPoint id="n3" playOrder="3">
<navLabel>
<text>About the Author</text>
</navLabel>
<content src="final/pref01.html"/>
</navPoint>
...
I'm using Notepad++: is it possible to automate this process with a regular expression?

You cannot do everything with regex alone. You can split the problem into two parts:
generate the opening tags such as <navPoint id="n1" playOrder="1"> using program logic (an incrementing variable)
do the rest with regex
Use the following regex to match:
<a\shref="([^"]*)">([^<]*)<\/a><br\/>
And replace with:
(generated string)\n<navLabel>\n<text>\2</text>\n</navLabel>\n<content src="\1"/>\n</navPoint>
See DEMO
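The two parts above (a counter from program logic, the rest from regex) can also be combined in one short script. This is only a sketch: the function name build_nav_points is made up for illustration, and the sample input assumes each TOC line looks like <a href="...">Title</a><br/>, with the href values taken from the expected output in the question.

```python
import re

# Matches one TOC link; group 1 is the href, group 2 is the chapter title.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)</a>')

def build_nav_points(toc_html):
    """Return NCX navPoint elements, numbered in document order."""
    nav_points = []
    # enumerate() supplies the incrementing id/playOrder that regex alone cannot.
    for order, match in enumerate(LINK_RE.finditer(toc_html), start=1):
        src, title = match.groups()
        nav_points.append(
            f'<navPoint id="n{order}" playOrder="{order}">\n'
            f'<navLabel>\n'
            f'<text>{title.strip()}</text>\n'
            f'</navLabel>\n'
            f'<content src="{src}"/>\n'
            f'</navPoint>'
        )
    return "\n".join(nav_points)

# Assumed input format, reconstructed from the question's expected output.
sample = (
    '<a href="final/main.html">Multimedia Implementation</a><br/>\n'
    '<a href="final/toc.html">Table of Contents</a><br/>\n'
)
print(build_nav_points(sample))
```

The titles are copied as they appear in the HTML, so entities that exist only in HTML (such as &nbsp;) would still need converting before the NCX is valid XML.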

Yes, it is possible to replace the links with <navPoint> tags. The only thing I found no solution for is the incremental numbering of the <navPoint> attributes id and playOrder...
The following regex will do most of the work:
/^<a[^>]*href="([^"]+)"[^>]*>([^<]+).*$/gm
substitute with:
<navPoint id="n" playOrder="">\n<navLabel><text>$2</text></navLabel>\n<content src="$1" />\n</navPoint>\n
Regex details
/^<a .. only parse lines that start with an `<a` tag
[^>]*href=" .. find the first occurrence of `href="` inside the tag
([^"]+) .. capture the text and stop when a " is found
"[^>]*> .. find the end of the <a> tag
([^<]+) .. capture the text and stop when a < is found (i.e. the </a> tag)
.*$/ .. continue to end of the line
gm .. search the whole string and parse each line individually
A more detailed (but also more confusing) explanation is here:
https://regex101.com/r/gA0yJ2/1
This link also demonstrates how the regex works. You can test changes there if you like.

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no < or > characters elsewhere in the text (or that they are entity-encoded), one idea is to use a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
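Since the question mentions a Python-flavored engine, here is a small sketch of the same pattern with Python's re module. The helper name strip_tags is hypothetical, and joining the pieces with a space is an assumption on my part: it keeps "link" and "to" apart across the </p><p> boundary, but it would also split a word that has an inline tag in the middle of it.

```python
import re

# The lookbehind pattern from the answer: text that follows a ">" or starts the string.
TEXT_RE = re.compile(r'(?:(?<=>)|^)[^<]+')

def strip_tags(html):
    """Collect the text runs between tags and normalize whitespace."""
    pieces = TEXT_RE.findall(html)
    # Join with a space, then collapse runs of whitespace (including newlines).
    return " ".join(" ".join(pieces).split())

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'
print(strip_tags(sample))  # just a small example link to a webpage
```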

Selecting all same html attributes with (different) values included in VSCode

I'm using VSCode for HTML editing. In VSCode it's very easy to select identical occurrences of a piece of code. What I need is to select all occurrences of an HTML attribute (like class, aria-label, etc.) with different values. Here's an example:
I want to select all "aria-label" occurrences with the values included. So these will be selected:
aria-label="Apple"
aria-label="Oranges"
aria-label="Multiple Fruit Names"
aria-label=""
...
Is there a way to do that in VSCode?
I understood that regex knowledge is essential, so for the last couple of days I studied with Regex101. This is what worked for me on this question:
aria-[a-zA-Z]*="[A-Za-z\s]*"
You could use a regex for that:
^aria-label="[^"]*"
Explanation:
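'^' ... matches the start of a line (drop it if the attribute appears in the middle of a line, as it does inside a tag)
'aria-label=' ... that's your "search word"
'[^"]' ... any character except a double quote
'*' ... zero or more occurrences of the preceding character class
Don't forget to enable regex search in the search dialog.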
This is a good starting point to grasp the regex magic: https://en.wikipedia.org/wiki/Regular_expression#Basic_concepts .
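To see how the patterns from both answers behave, they can be tried outside the editor. The sketch below uses Python's re module rather than VSCode's own search engine, and the <button> markup is invented for illustration; only the attribute values come from the question.

```python
import re

# Hypothetical markup; the aria-label values are the ones listed in the question.
html = (
    '<button aria-label="Apple">A</button>\n'
    '<button aria-label="Multiple Fruit Names">M</button>\n'
    '<button aria-label="">?</button>\n'
)

# Without the anchor, every attribute is found wherever it sits in the line.
plain = re.findall(r'aria-label="[^"]*"', html)

# With "^" the attribute must begin the line, so nothing inside a tag matches.
anchored = re.findall(r'^aria-label="[^"]*"', html, flags=re.MULTILINE)

# The letters-and-whitespace class misses values containing digits or punctuation.
letters_only = re.findall(r'aria-[a-zA-Z]*="[A-Za-z\s]*"', '<li aria-label="Item 2">')

print(plain)         # ['aria-label="Apple"', 'aria-label="Multiple Fruit Names"', 'aria-label=""']
print(anchored)      # []
print(letters_only)  # []
```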

Search and replace in Xcode

I have the following string in several places throughout a massive HTML file: <div class="author">Some Author's Name</div>. I want to replace that with <author>Some Author's Name</author> and obviously adjust the corresponding CSS classes.
Is it possible to do that in a single Find and Replace in Xcode 6.1 without having to go through line by line?
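I recommend this article on replacing with regex in Xcode.
Use the regex <div class="author">(.*?)</div> to find your entities and replace with <author>$1</author>. The lazy (.*?) stops at the first </div>, so two author blocks on the same line are not merged into one match.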
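The difference between a lazy and a greedy capture is easy to check outside Xcode. This is a sketch with Python's re module, where the replacement group is written \1 instead of $1; the second author name is made up to put two matches on one line.

```python
import re

# Two author blocks on one line; "Another Name" is an invented second value.
html = ('<div class="author">Some Author\'s Name</div>'
        '<div class="author">Another Name</div>')

# Lazy capture: each block is matched and rewritten on its own.
lazy = re.sub(r'<div class="author">(.*?)</div>', r'<author>\1</author>', html)

# Greedy capture: one match runs from the first <div ...> to the last </div>.
greedy = re.sub(r'<div class="author">(.*)</div>', r'<author>\1</author>', html)

print(lazy)
print(greedy)
```

The lazy version prints two separate <author> elements, while the greedy one leaves the inner </div><div class="author"> pair inside a single <author> element.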

REGEX in mysql table containing html data

I have a table that stores html templates in a mysql database. Now I have to perform some text replacement on them. However my target text is also present in some of the anchor tags and I don't want that to be replaced.
EX :
<body> ... (has huge html crap)... .........(Some more html crap) ... (a bit more of html crap) ... </body>
Task is to replace the occurrences of the "KEYWORD" with "NEW KEYWORD" in the body but not the urls.
It would also be helpful if I can first find such cases where the KEYWORD is a part of a link in a given template.
MySQL is not capable of such advanced string manipulation.
However, if you were to have a one-time-use PHP script do the editing (ie. select from the table, for each row process and update), you can do this:
// foreach row as $row
$newtext = preg_replace("(<a\b.*?>(*SKIP)(*FAIL)|KEYWORD)","NEW KEYWORD",$row['data']);
What this does is look for links (very approximate Regex but should suffice in almost all cases here), then skip over them. Then, it looks for KEYWORD and replaces it with NEW KEYWORD.
You can use this to quickly and easily handle the replacement.
If that "almost all cases" thing above turns out to not be enough, you can use DOMDocument to load the HTML into a parser and process text nodes only from there.
Maybe you could find the cases where the KEYWORD is a part of a link with something like this:
SELECT * FROM tbl WHERE html REGEXP '<a[^>]*KEYWORD';
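If the one-off script is written in Python instead of PHP, note that Python's built-in re module does not support (*SKIP)(*FAIL). A common substitute, sketched below, is to match either an opening <a ...> tag or the keyword and let a replacement function decide what to do. The function name and the sample HTML are hypothetical.

```python
import re

# First alternative consumes a whole opening <a ...> tag, so a KEYWORD inside
# its href can never be matched by the second alternative.
TOKEN_RE = re.compile(r'<a\b[^>]*>|KEYWORD')

def replace_outside_link_tags(html, new="NEW KEYWORD"):
    """Replace KEYWORD everywhere except inside opening <a ...> tags."""
    def swap(match):
        text = match.group(0)
        # Opening tags are returned unchanged; bare keywords are replaced.
        return text if text.startswith("<") else new
    return TOKEN_RE.sub(swap, html)

sample = '<body>KEYWORD <a href="/KEYWORD/page">about KEYWORD</a></body>'
print(replace_outside_link_tags(sample))
# <body>NEW KEYWORD <a href="/KEYWORD/page">about NEW KEYWORD</a></body>
```

Like the PHP pattern, this protects only the tag itself (and so the URL), not the visible link text between <a> and </a>.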

Replace Placeholder HTML comment with HTML element

I'm developing an iOS app that needs to display HTML content inside a HTML powered textview (DTCoreText).
For whatever reason the client has decided to provide videos inside special HTML comments that I'm supposed to turn into a <video> tag.
The comment format is as such
<!-- placeholder_video:url_of_video.mp4 -->
I was hoping I could write a regex to match the entire comment, extract the content and replace it with a <video> element pointing to the correct URL through NSRegularExpression's stringByReplacingMatchesInString:options:range:withTemplate: but I can't for the life of me figure out Regular Expressions.
The best I could come up with is
(?<=<!-- placeholder_video:)(.*)(?=-->)
Which matches the comment's content (the mp4 URL), but I need it to match the entire comment instead and extract the content as a sub pattern (that I would later access through \1 if I understand correctly) so I can use a replace pattern to quickly replace the comment with the proper <video src="url_of_video.mp4"> string
Can it be done? Or am I better off trying to do it in two passes instead (match the entire comment, then run another regex on that comment to extract the URL and replace the former)?
Based on the way your question looks right now (Having forgotten to paste the example of how the comment looks) it's hard to give a good answer.
But since you mention that this:
(?<=<!-- placeholder_video:)(.*)(?=-->)
will manage to fetch the content of the comment, and since you say all you want is to capture the entire comment,
then if I understand this correctly, all you really need to do is add a capturing group around your entire expression and drop the lookbehind and lookahead.
(Maybe also avoid grabbing the leading and trailing spaces; making the inner group lazy with .*? keeps the trailing space out of it.)
(<!-- placeholder_video:\s*(.*?)\s*-->)
When testing with the following:
<!-- placeholder_video: url_of_video.mp4 -->
I will get 2 groups:
1: <!-- placeholder_video: url_of_video.mp4 -->
2: url_of_video.mp4
You can also give your groups names if you like, to make it easier to reference them:
(?<comment><!-- placeholder_video:\s*(?<url>.*?)\s*-->)
It is also true that you can use \n to reference group n inside the regular expression.
If you plan to replace the first capturing group with the second one in a single regex, then how you do it would depend on the language. Some languages like C# will allow you to provide your own replacing method, which is one option. But I'm assuming you're not in C# here.
In Javascript you can simply use $n to reference the n'th matched group as the replaced value. (You can also provide a function, but you don't need to)
A full working example in JS (Using jQuery but not needed):
<div id="example">
<!-- placeholder_video: url_of_video.mp4 -->
</div>
<script>
var str = $("#example").html();
var str2 = str.replace(/(<!-- placeholder_video:\s*(.*?)\s*-->)/g, "<video src=\"$2\">");
alert(str2);
</script>
You can see the working jsfiddle example here: http://jsfiddle.net/72WeZ/
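The same single-pass replacement can be sketched in Python for quick testing. The function name expand_video_placeholders and the second file name are made up; here the outer group is dropped, so the URL is group 1. As far as I know, an NSRegularExpression replacement template refers to that group as $1 rather than \1.

```python
import re

# Whole comment is matched; the lazy group keeps the URL without surrounding spaces.
PLACEHOLDER_RE = re.compile(r'<!--\s*placeholder_video:\s*(.*?)\s*-->')

def expand_video_placeholders(html):
    """Replace each placeholder comment with a <video> tag in one pass."""
    return PLACEHOLDER_RE.sub(r'<video src="\1">', html)

sample = ('<div><!-- placeholder_video:url_of_video.mp4 --></div>'
          '<div><!-- placeholder_video: other_video.mp4 --></div>')
print(expand_video_placeholders(sample))
# <div><video src="url_of_video.mp4"></div><div><video src="other_video.mp4"></div>
```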