Find Replace text FOO with Style "Heading 1" with <h1>Foo</h1> - html

I am trying to find an easy way to convert my Word documents to HTML without the awful save-as that is built in. These are structured documents (designed for our screen-reader (JAWS) users), and so they use Heading 1, 2, 3, 4 & the Table of Contents.
We plan to convert these to DAISY audiobooks (https://en.wikipedia.org/wiki/DAISY_Digital_Talking_Book), so we need pretty clean, but structured, HTML to convert.
I tried the find-replace, using Styles, but it would just replace anything in the text part of the search. I could convert it from any one style to another, but adding text in the box messed it up.
(I think I see that CSS for DAISY means that instead of just <h2> it will have to be <level2 class="section"><h2> plus the closing tags, but that's step 2 after I handle this part.)
I just want to be able to find any text using Style 2 and add text to the start of that line saying "yep, here's some style 2" so that I can do the HTML/CSS stuff.
Thanks!

You can do that with a simple Find/Replace. For example, specify the Heading 1 Style for the Find parameter and use:
Replace = <h1>^&</h1>
For a macro you could incorporate that into, see: Convert a Word Range to a String with HTML tags in VBA

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no other < or > characters in the text (any literal ones are entity-encoded), here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
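For anyone who wants to check the pattern outside of Pipelines, here is a minimal sketch in plain Python using the standard `re` module. The sample string is the one from the question (with the line break written as `\n`); the variable names are just for illustration, and this is not how QuickBase itself runs the pattern.

```python
import re

# Pattern from the answer: text that starts either at the beginning of the
# string or right after a ">", and runs until the next "<".
TEXT_BETWEEN_TAGS = re.compile(r"(?:(?<=>)|^)[^<]+")

# Sample input from the question, with the line break written explicitly.
sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

chunks = TEXT_BETWEEN_TAGS.findall(sample)
print(chunks)
# ['just a small ', '\nexample', ' link', 'to a webpage']
```

Note that the matches come back as separate pieces, and the line break inside the link text is kept. How the pieces get joined back together (for example with a space between paragraphs) depends on the tool that consumes them.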

Word html format: insert a custom TOC via field code

I am generating Word docs from html. Basically, I build a file with html and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my html doc as docx and apply a template, I can make a TOC based on the styles in the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level followed by H1, H2 then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from #slightlysnarky below, I thought I would summarise the outcome here because the information I needed was in a Microsoft chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.
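Since the document is being generated from HTML anyway, the three-span construct above can be produced by a small helper. This is only a sketch in Python: the function name `toc_field_html`, its parameters, and the placeholder text are made up for illustration, and the only thing taken from the answer is the begin / separator / end span layout and the `TOC \t "..."` field code.

```python
def toc_field_html(style_levels, placeholder="Update this field to build the TOC"):
    """Build the complex-field HTML for a TOC with custom style levels.

    style_levels is a list of (style name, TOC level) pairs, e.g.
    [("Heading 1", 2), ("Title", 1)]. Hypothetical helper, not a Word API.
    """
    # The \t switch takes "Style,Level,Style,Level,..." as one quoted string.
    pairs = ",".join(f"{style},{level}" for style, level in style_levels)
    field_code = f'TOC \\t "{pairs}"'
    lines = [
        f"<span style='mso-element:field-begin'></span>{field_code}",
        f"<span style='mso-element:field-separator'></span>{placeholder}",
        "<span style='mso-element:field-end'></span>",
    ]
    return "\n".join(lines)


toc_html = toc_field_html(
    [("Heading 1", 2), ("Heading 2", 3), ("Heading 3", 4), ("Title", 1)]
)
print(toc_html)
```

The placeholder text is what shows in the document until the user updates the field, as described in the answer.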

Using Code <> As Actual Text

Really having trouble with this and can't find any results on it.
I want my html text to use angle brackets <> as part of the visible text.
Specifically for a navbar menu item. But I can't seem to build it without the text being treated as an actual tag.
I want it to say "<Dev>" without using quotes or spaces, but when I take the quotes/spaces away it gets treated as a tag. How do I keep the entire message "<Dev>" without turning it into an element?
E.g:
<p> Welcome to my <Dev> portfolio</p>
Also what is the term used to override reserved code functions as text? Will help me research answers for other issues too. Like when using & as text and not as code.
Thanks for the assistance!
You'll want to use <p> Welcome to my &lt;Dev&gt; portfolio</p>
You can find a list of HTML character codes Here
Try using the html unicode values for those characters instead.
Welcome to my &60Dev&62 portfolio
Sorry, it looks like this forum reads those character codes and prints them as the actual characters. Add # signs after the & characters (and a semicolon after each number) to get the HTML code, i.e. &#60;Dev&#62;.
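If the text is being produced by a script rather than typed by hand, the escaping can be done automatically. A minimal sketch in Python using the standard `html` module (the variable names are just for illustration); it shows both the named entities and the numeric form mentioned above.

```python
import html

raw = "Welcome to my <Dev> portfolio"

# Named entities: < becomes &lt; and > becomes &gt; (& would become &amp;).
escaped = html.escape(raw)
print(escaped)  # Welcome to my &lt;Dev&gt; portfolio

# Numeric character references for the same two characters.
numeric = raw.replace("<", "&#60;").replace(">", "&#62;")
print(numeric)  # Welcome to my &#60;Dev&#62; portfolio

# Both forms decode back to the original text, which is what a browser shows.
print(html.unescape(escaped) == raw, html.unescape(numeric) == raw)
```

The general term to search for is "HTML entities" or "character references"; escaping & itself as &amp; works the same way.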

In html, how can I hyperlink a portion of a text that is outputted by a token?

I'm working on a template editor using HTML, and have a token that generates a series of text. For example, if I put in [TOKEN1], that will generate "Text 1, Text 2, Text 3".
I was wondering how I can apply a hyperlink (a href) to each text individually. I'm not really familiar with coding and wasn't sure if there was a way to apply an argument using something like if you see 1 comma, end hyperlink, or like, start hyperlink if you see first comma, and then till second comma.
Thank you!

How to remove a div from the entire project?

I've got a project consisting of over 200 html files. There's a div repeated throughout most of these, looking like this:
<div class='foobar' id="abcdef123"></div>
I have found all uses of the class using the Find in Files function in Sublime Text 2 - now I want to remove them, i.e. completely delete any line containing that div (and its closing tag).
Is there an easy way to do it in Sublime Text 2?
EDIT: I have forgotten to mention that sometimes the div has additional classes and the ID is always different. How would I write a regexp to deal with that?
In Notepad++, open all 200 files and search for the following regular expression:
<div class='foobar[^']*' id="[^"]*"></div>
and replace it with nothing. I don't know Sublime Text 2.
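The same idea can be tried out in a script before running it over the whole project. Below is a rough sketch in Python, assuming the div always sits on its own line, the class attribute is single-quoted and starts with foobar, and the id is double-quoted (so the id part is matched with `[^"]*`). The function name `strip_foobar_divs` and the sample page are made up for illustration.

```python
import re

# Match a whole line that holds only the empty foobar div, including any
# indentation and the trailing line break, so the line disappears entirely.
DIV_LINE = re.compile(
    r"""^[ \t]*<div class='foobar[^']*' id="[^"]*"></div>[ \t]*\r?\n?""",
    re.MULTILINE,
)


def strip_foobar_divs(text):
    """Return text with every matching div line removed (hypothetical helper)."""
    return DIV_LINE.sub("", text)


page = (
    "<body>\n"
    "  <div class='foobar extra' id=\"abc123\"></div>\n"
    "  <p>keep me</p>\n"
    "<div class='foobar' id=\"zzz\"></div>\n"
    "</body>\n"
)
cleaned = strip_foobar_divs(page)
print(cleaned)
```

To apply it to real files you would read each file, pass its contents through the function, and write the result back, ideally on a copy of the project first. A div that has content inside it or spans several lines will not match this pattern.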