Deleting same, but different html code across all html documents - html

I'm wondering how to delete HTML code that is the same, yet different, within an HTML page, and ideally across multiple pages at once.
By that I mean HTML code that has the same beginning and the same end, but whose middle content differs from one .html page to the next.
The middle may look similar but actually differ in every document, such as a slight change in a link from page to page.
Or the middle can be entirely different from one document to another.
Is there a tool that lets you specify "delete from <span STYLE= to </span>", where the middle content is something like "font-size: x-small; color: #90c040">example link and varies between HTML pages?
Ideally you could specify the beginning and end code, delete them along with everything in between at the push of one button, and have the tool save those parameters so you don't have to enter them every time.
It would also be great if it could process multiple HTML pages at once, either by selecting a whole bunch manually or by specifying a folder, and delete the code from every HTML page in that folder where it exists (if it doesn't exist, it moves on to the next file).
Thanks! I'm just wondering. Any help is much appreciated! ^_^~
~Update! I've found a program that works!~
I found a link with the programs, Notepad++, TextCrawler, Search & Replace Master, Ecobyte Replace Text, and InfoRapid Search & Replace. I also found multiple file search and replace.
− Notepad++ didn't allow wildcard * or start/end functions.
− TextCrawler as well as InfoRapid Search & Replace didn't work.
− Search & Replace Master was finicky. It didn't work at first, then it did after re-opening the program.
− Ecobyte Replace Text worked the best. This deleted everything beginning to end that I didn't want across many different .html files. I could specify what I wanted with the 'range function'.
− Multiple file search and replace worked too, but functioned differently. If you're looking to keep the beginning and end code, but not what's in the middle, then this one would work for you.
Examples:
Ecobyte deletes <span STYLE=, the middle content in between, and </span>,
leaving none of that code behind.
Multiple File Search does not delete <span STYLE= and </span>, but it deletes all of the middle content in between.
This leaves you with <span STYLE= and </span>, with nothing left between the beginning and end code you specified.
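For anyone who would rather script this than install a tool, the two behaviours above can be sketched in Python. This is only a minimal sketch, not how either program works internally; the function name delete_between and the sample page are made up for illustration.

```python
import re

def delete_between(text, start, end, keep_delimiters=False):
    """Remove every span that begins with `start` and ends with `end`.

    With keep_delimiters=True only the middle is removed and the
    start/end markers stay in place.
    """
    # re.escape treats the markers as literal text; .*? is non-greedy, so each
    # match stops at the first `end` after a `start`; DOTALL lets it span lines.
    pattern = re.compile(
        "(" + re.escape(start) + ")(.*?)(" + re.escape(end) + ")", re.DOTALL
    )
    replacement = r"\1\3" if keep_delimiters else ""
    return pattern.sub(replacement, text)

# Hypothetical sample page, modelled on the span described in the question.
page = (
    '<p>Intro</p>'
    '<span STYLE="font-size: x-small; color: #90c040">example link</span>'
    '<p>Body</p>'
)

# Ecobyte-style: start, middle and end are all removed.
print(delete_between(page, "<span STYLE=", "</span>"))
# Multiple File Search-style: only the middle is removed.
print(delete_between(page, "<span STYLE=", "</span>", keep_delimiters=True))
```

If the start marker never appears, the text comes back unchanged, which matches the "move on to the next file" behaviour asked for.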
I hope this helps anyone else looking to delete code with the same beginning and end, but different middle code. Cheers! ^_^~

Related

Word html format: insert a custom TOC via field code

I am generating Word docs from html. Basically, I build a file with html and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my HTML doc as docx and apply a template, I can make a TOC based on the styles in the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level, followed by H1, H2, then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from @slightlysnarky below, I thought I would summarise the outcome here, because the information I needed was in a Microsoft .chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below, you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.
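Since the original document is being built from HTML anyway, the field markup can be produced by a small helper. The following is a rough Python sketch; the function name word_field and the placeholder text are assumptions for illustration, and only the three span elements and the TOC field code come from the answer above.

```python
from html import escape

def word_field(field_code, placeholder):
    """Build the complex-field HTML for a Word field code.

    `placeholder` is the text shown until the user updates the field.
    """
    # quote=False keeps the double quotes in the field code as written,
    # while still escaping &, < and > so the surrounding HTML stays valid.
    return (
        "<span style='mso-element:field-begin'></span>"
        + escape(field_code, quote=False) + "\n"
        + "<span style='mso-element:field-separator'></span>"
        + escape(placeholder, quote=False) + "\n"
        + "<span style='mso-element:field-end'></span>"
    )

toc = word_field(
    r'TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"',
    "Right-click and choose Update Field to build the table of contents",
)
print(toc)
```

The returned string can be written into the HTML at the point where the TOC should appear. How Word treats other field codes built this way has not been checked here.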

Replacing stuff of HTML using regex

I am editing a couple of hundred HTML files and I have to replace all the stuff manually, so I was wondering whether it could be done using regex. I don't think it is possible, but it might be, so please help me out.
Okay, so for example, I have many <p> tags in a file, each with a different class. eg:
<p class="class1">stuff here</p>
<p class="class2">more stuff here</p>
I wanted to replace the "stuff here" and "more stuff here" with something, for example
<p class="class1">[content]</p>
<p class="class2">[content]</p>
I wanted to know if that is possible.
I'm using notepad++.
P.S. I'm new to regex.
I think notepad++ is great for stuff like this. Open up Find/Replace, and check the regular expressions box in the dialog's Search Mode section.
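In the "Find what" field, try this (no backslashes before <, > or /; in Notepad++ \< and \> mean start and end of a word rather than literal angle brackets):
<p class=(.*)>(.*)</p>
and in "Replace with":
<p class=\1>[content]</p>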
The \1 here takes whatever the first (.*) found between class= and the angle bracket > that ends the tag, and puts it back unchanged, so the class name is preserved without you having to specify it. The second (.*) catches the current content inside the paragraph tag, which is what you want to replace. So where I wrote [content] in the "Replace with" field, that's where you'd put your new content. This does limit you to content that you can paste into the Notepad++ find/replace dialog, but I think it has a pretty huge limit.
If I'm remembering that text field's limitations incorrectly, another thing you could do is adjust the "Replace with" text so it just replaces the old text with some newlines:
<p class=\1>\n\n</p>
This will delete the old text and leave a clear line where it once was, making it easy to paste whatever you want in the normal editor pane.
The first way is probably better if your new content fits in the "Replace with" field, because this regex works once per line. You can click "Replace" a couple of times, and if it's working, clicking "Replace All" will go through every <p> element in the file.
Note: this solution assumes that your <p> tags open and close within one line, as you typed them in your question. If they break across lines, you're going to want to enable ". matches newline" in the Replace dialog, and you need more precise syntax than (.*) to catch your class name and content-to-be-replaced. Let me know if this is the case, and I'll fiddle with it and see if I can help more. The (.*) needs to change to (.*?) or something similar; the search needs to get less greedy, because if . matches newline, then .* matches any character any number of times, i.e., potentially everything from the first <p> to the last </p> in the document.
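To see the greedy versus lazy difference outside of Notepad++, here is a small Python sketch. Python's re module is not the engine Notepad++ uses, so treat it as an illustration of the idea only; the sample HTML and variable names are made up, and re.DOTALL plays the role of ". matches newline".

```python
import re

# Two paragraphs; the second one breaks across lines.
html = '<p class="class1">stuff here</p>\n<p class="class2">more\nstuff here</p>'

# Greedy: with DOTALL, (.*) runs as far as it can, so both paragraphs end up
# inside a single match and only the last paragraph's content is replaced.
greedy_result, greedy_count = re.subn(
    r'<p class=(.*)>(.*)</p>', r'<p class=\1>[content]</p>', html, flags=re.DOTALL
)

# Lazy and more precise: ("[^"]*") captures just the quoted class value, and
# .*? stops at the first </p>, so every paragraph is matched separately.
lazy_result, lazy_count = re.subn(
    r'<p class=("[^"]*")>.*?</p>', r'<p class=\1>[content]</p>', html, flags=re.DOTALL
)

print(greedy_count, lazy_count)
print(lazy_result)
```

The greedy pattern makes one replacement and leaves "stuff here" in the first paragraph, while the lazy pattern makes two.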

Notepad++ RegEx: remove everything between 2 HTML tags (with line breaks in between)

I want to remove in a html document with notepad++
everything between the marked area
So the start point to remove is (inclusive) "<imgCRLF", then everything in between, including the CRLFs,
and then "DetailsCRLF</aCRLF" (inclusive) as the end point.
I started simple with <img.*<a/> and ticked
and I tried to improve on this starting point, but either nothing was deleted or too much :)
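Use <img.*?</a>[\r\n]*. The .* is too greedy; .*? stops at the first </a>. Because the block spans several lines, ". matches newline" must be ticked so that .*? can cross the line breaks. [\r\n]* consumes the line breaks after </a>.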
Also, if you are only interested in matching <img with subsequent line breaks, you can use another regex:
<img[\r\n].*?</a>[\r\n]*
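The same pattern can be tried out in Python, where re.DOTALL stands in for ". matches newline". This is just a sketch on a made-up sample with CRLF line endings; the real document in the question is not shown, so the sample markup is an assumption.

```python
import re

# Hypothetical sample with Windows (CRLF) line endings.
html = (
    '<p>before</p>\r\n'
    '<img\r\nsrc="a.png">\r\n'
    '<a href="x.html">Details\r\n</a>\r\n'
    '<p>after</p>'
)

# .*? is lazy, so the match ends at the first </a>; [\r\n]* also removes the
# line break that would otherwise be left behind as an empty line.
pattern = re.compile(r'<img.*?</a>[\r\n]*', re.DOTALL)
cleaned = pattern.sub('', html)
print(repr(cleaned))
```

Without re.DOTALL the pattern finds nothing in this sample, because . then refuses to cross the line breaks inside the block.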

How to remove a div from the entire project?

I've got a project consisting of over 200 html files. There's a div repeated throughout most of these, looking like this:
<div class='foobar' id="abcdef123"></div>
I have found all uses of the class using the Find in Files function in Sublime Text 2 - now I want to remove them, i.e. completely delete any line containing that div (and its closing tag).
Is there an easy way to do it in Sublime Text 2?
EDIT: I have forgotten to mention that sometimes the div has additional classes and the ID is always different. How would I write a regexp to deal with that?
In Notepad++, open all 200 files and search with the following regular expression:
<div class='foobar[^']*' id="[^"]*"></div>
and replace it with nothing. I don't know Sublime Text 2.
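A somewhat more tolerant version of that pattern, written as a Python sketch: it accepts either quote style, extra classes before or after foobar, and any id, and it removes the whole line the div sits on. The pattern and the sample are my own assumptions based on the question's description, not something tested on the real project.

```python
import re

DIV_PATTERN = re.compile(
    r"""[ \t]*<div\s+class=(['"])(?:[^'"]*\s)?foobar(?:\s[^'"]*)?\1"""
    r"""\s+id=(['"])[^'"]*\2>\s*</div>[ \t]*\r?\n?"""
)

# Hypothetical sample: the first div should go, the second one has a
# different class (foobarbaz) and should stay.
sample = (
    "<h1>Title</h1>\n"
    "  <div class='foobar wide' id=\"abcdef123\"></div>\n"
    "<div class=\"foobarbaz\" id=\"keep\"></div>\n"
    "<p>text</p>\n"
)

cleaned = DIV_PATTERN.sub("", sample)
print(cleaned)
```

The (?:[^'"]*\s)? and (?:\s[^'"]*)? parts require whitespace around foobar, which is what keeps a class such as foobarbaz from matching.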

How to delete a similar fragment on several HTML files?

I'm converting a website to a PDF, but it contains images, and next to each of them there is a text link that takes you to the image itself when clicked.
I think this is the code responsible for showing that text, since after I deleted it in one of the files the text and link were no longer shown.
<div class="v1"><a target="_self" href="images/graphics/1.jpg">[View full size image]</a></div>
The problem is that there are about 200 more HTML documents containing this similar text, only changing href.
Would there be any easy way to get rid of all this without having to go one by one? Maybe a regular expression for sed?
If the expression is always on one line and the only difference is in href, sed is a possible solution:
sed -e 's,<div class="v1"><a target="_self" href="[^"]*">\[View full size image\]</a></div>,,'
I used an alternative separator , so / does not have to be escaped in closing tags. The brackets in the link's text need to be escaped, though.
Yes, regular expressions are likely the easiest solution here. If it's simply a question of removing this line from all your files then I'd just open them up in an editor (Sublime Text 2 does this well) and perform a regex search and replace. The following search pattern will likely work:
<div class=\"v1\"><a target=\"_self\" href=\"[^"]+\">\[View full size image\]</a></div>
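If neither sed nor an editor is handy, the same search and replace can be run over a whole folder with a short Python script. This is a sketch under assumptions: the function name strip_from_folder is made up, the files are assumed to be UTF-8, and the demo runs on a throwaway directory rather than real pages. Work on a backup copy, since matching files are rewritten in place.

```python
import re
import tempfile
from pathlib import Path

LINK_DIV = re.compile(
    r'<div class="v1"><a target="_self" href="[^"]*">'
    r'\[View full size image\]</a></div>'
)

def strip_from_folder(folder, pattern=LINK_DIV):
    """Remove `pattern` from every .html file under `folder`.

    Files without a match are left untouched. Returns {file name: matches removed}.
    """
    changed = {}
    for path in sorted(Path(folder).rglob("*.html")):
        text = path.read_text(encoding="utf-8")
        new_text, count = pattern.subn("", text)
        if count:  # only rewrite files that actually contained the fragment
            path.write_text(new_text, encoding="utf-8")
            changed[path.name] = count
    return changed

# Demo on a temporary directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.html").write_text(
        '<p>x</p><div class="v1"><a target="_self" '
        'href="images/graphics/1.jpg">[View full size image]</a></div>',
        encoding="utf-8",
    )
    Path(tmp, "b.html").write_text("<p>nothing to remove</p>", encoding="utf-8")
    report = strip_from_folder(tmp)
    remaining = Path(tmp, "a.html").read_text(encoding="utf-8")

print(report)
```

Because [^"]* accepts any href value, the one pattern covers all the pages where only the image path changes.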