Selective search & replace of a text string in many HTML docs

I've got many HTML docs (400+) that each need selective replacement of the <br /> tag in two specific areas.
I wonder how to achieve this and need assistance.
In each HTML document, the <br /> tag needs to be replaced only inside this tag:
<span property="dc:description" content="xyz1,<br /> xyz2,<br /> xyz3"/>
and also in all occurrences of <br /> inside the alt attribute, as in this tag:
<img src="xyz.jpg" alt="uvw1,<br />uvw2" />
In all other areas of the HTML docs the <br /> tag must remain unchanged.
...I gave this some more thought and think the problem described above may be resolved with a script or function equipped with start and stop signals. That way the script knows at which positions to start looking for the <br /> tag and replace it with a given text string, and also knows where to stop. It would then move on to the next instance, in documents that are open in an editor or reside in a given folder.
I am afraid I am not capable of writing such a script myself.
Hope someone can provide feedback on how to best accomplish this,
thanks.
OS: Win7-64, Editor: Notepad++

Provided that your HTML files aren't really big, I don't think you need a script for this.
You could just:
1. Join the files together.
2. Use a regex replace in Notepad++. Replace <span([^/]*)<br />(.*)"/> with <span\1NEWTAG\2"/>, where NEWTAG is whatever you want to replace the <br /> with. Note that this replaces only the first <br /> in each match, so you will need to repeat it until it finds no more. If your replacement text itself contains <br /> (which I doubt, by the sound of it), you'll need to modify this a little.
3. Split the file back up into the originals.
Personally I'd just write a Python script, as Python is pretty ace at string manipulation. But I don't know if this is within your scope.
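For anyone who does want to go the Python route, here is a minimal sketch of the start/stop idea from the question. It assumes the attribute values are wrapped in double quotes and contain no double quotes themselves, that the span looks exactly like the example (property first, then content), and that the files are UTF-8. The function names and the folder handling are my own illustration, not something from the question.

```python
import re
from pathlib import Path

# Matches <br>, <br/> and <br />.
BR = re.compile(r"<br\s*/?>")

# "Start" and "stop" signals: the quoted attribute values we are allowed to touch.
SPAN_CONTENT = re.compile(r'(<span\s+property="dc:description"\s+content=")([^"]*)(")')
ALT = re.compile(r'(\balt=")([^"]*)(")')


def replace_br_in_attributes(html, replacement=" "):
    """Replace <br /> only inside the span's content value and inside alt values."""
    def fix(match):
        # Keep the opening and closing parts, rewrite only the attribute value.
        return match.group(1) + BR.sub(replacement, match.group(2)) + match.group(3)

    html = SPAN_CONTENT.sub(fix, html)
    return ALT.sub(fix, html)


def process_folder(folder, replacement=" "):
    """Rewrite every *.html file in `folder` in place (assumes UTF-8 files)."""
    for path in Path(folder).glob("*.html"):
        text = path.read_text(encoding="utf-8")
        new_text = replace_br_in_attributes(text, replacement)
        if new_text != text:
            path.write_text(new_text, encoding="utf-8")
```

Try it on copies of the files first; process_folder overwrites the originals.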

Related

Deleting same, but different html code across all html documents

I'm just wondering how to delete the same, yet different html code in an html page. Or to be able to do this for multiple pages at once too if that's possible.
What I mean by this is html code that has the same beginning, and same end, but not the same middle content between them compared to other .html pages.
The middle may seem similar, but is actually different across all html documents, such as a slight change in a link from page to page.
Or, the middle's code can be entirely different compared to other documents, diverging.
Is there any tool out there where you can specify: delete from <span STYLE= to </span>, where the middle content is something like "font-size: x-small; color: #90c040">example link, with the middle content varying between different HTML pages?
If you could specify the beginning and end code to be deleted and delete everything in between, with one button push and with saved parameters so you don't have to enter them every time, that would be great.
Or if it could process multiple HTML pages at once, either selected manually or by specifying a folder and checking every HTML page in it, deleting the code where it exists and moving on to the next file where it doesn't.
Thanks! I'm just wondering. Any help is much appreciated! ^_^~
~Update! I've found a program that works!~
I found a link with the programs, Notepad++, TextCrawler, Search & Replace Master, Ecobyte Replace Text, and InfoRapid Search & Replace. I also found multiple file search and replace.
− Notepad++ didn't allow wildcard * or start/end functions.
− TextCrawler as well as InfoRapid Search & Replace didn't work.
− Search & Replace Master was finicky. It didn't work at first, then it did after re-opening the program.
− Ecobyte Replace Text worked the best. This deleted everything beginning to end that I didn't want across many different .html files. I could specify what I wanted with the 'range function'.
− Multiple file search and replace worked too, but functioned differently. If you're looking to keep the beginning and end code, but not what's in the middle, then this one would work for you.
Examples:
Ecobyte will delete everything from <span STYLE= through the middle content to </span>,
leaving you with none of that code remaining.
Multiple File Search will not delete <span STYLE= and </span>, but it will delete all of the middle content in between.
This leaves you with <span STYLE= and </span>, but none of the code that was in between the beginning and end code you specified.
I hope this helps anyone else looking to delete code with the same beginning and end, but different middle code. Cheers! ^_^~
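For readers who would rather script this than install a tool, the two behaviours described above can be sketched in a few lines of Python. This is only an illustration: delete_range and its keep_markers flag are names I made up, and it assumes the start and end markers are fixed strings that do not nest.

```python
import re


def delete_range(text, start, end, keep_markers=False):
    """Delete everything from `start` to the nearest following `end`.

    keep_markers=False removes the markers too (the Ecobyte behaviour above).
    keep_markers=True keeps the markers and drops only the middle
    (the Multiple File Search behaviour above).
    """
    # re.escape makes the markers literal; .*? stops at the first end marker,
    # and re.DOTALL lets the middle content span several lines.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    replacement = start + end if keep_markers else ""
    # A function replacement avoids any backslash interpretation in the markers.
    return pattern.sub(lambda match: replacement, text)
```

To run it over a folder, read each .html file, call delete_range, and write the result back only if it changed.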

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace opening and closing tag, leaving the content of tags and its attribute intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends on your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation (both, as well as I imagine whatever editor you're using, do regex replacement; you just have to know the menu item or keystroke to bring up the dialog).
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
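If you end up running this from a language rather than an editor, the same find/replace pair looks like this in Python, which uses \1 and \2 for backreferences. This is a sketch on a made-up input string; it assumes the matched div elements do not contain nested div elements, since the non-greedy match would stop at the first closing tag.

```python
import re

html = '<div class="QText">Text to be kept</div><div class="Other">untouched</div>'

# Group 1 captures the class name, group 2 the content to keep.
pattern = re.compile(r'<div class="(QText|RText|SText)">(.*?)</div>', re.DOTALL)

# \1 and \2 put the captured class and content back inside the new tag.
result = pattern.sub(r'<span class="\1">\2</span>', html)
print(result)
```

Only the div with one of the listed classes is rewritten; the other one is left alone.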
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>

Extracting data from HTML files using regular expressions

I am trying to extract specific data using a regular expression, but I haven't been able to achieve what I want. For example,
in this page
http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A
I have to keep only the data which is between,
<div class="row-fluid">
and
<br /> <br /><i class="icon-user"></i>
So I copied the HTML code into Notepad++, enabled regular expressions in Replace, and tried replacing everything that matches
.*<div class="row-fluid">
to delete everything before <div class="row-fluid">
but it is not working at all.
Does anyone know why?
P.S.: I am not using any programming language. I just need to do this on HTML code in Notepad++, not on an actual HTML file.
I would achieve this in several steps.
Step 1.
Transform the document into one line. Find
\r\n
and replace with nothing. (make sure to select "Extended (\n, \r,..)" option in Replace dialog)
Step 2.
find
<div class="row-fluid">
and replace with
\r\n~<div class="row-fluid">
Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.
Step 3.
find
<br /> <br /><i class="icon-user"></i>
and replace with
<br /> <br /><i class="icon-user"></i>\r\n
Step 4.
Delete unnecessary lines. Check "Regular expression".
find
^[^~].+$\r\n
and replace with nothing
Step 5.
Now you have only lines that start with
~<div class="row-fluid">
and end with
<br /> <br /><i class="icon-user"></i>
All that's left is to delete these tags.
PS. You can try to record a macro, if you need to do the same task several times.
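The question rules out a programming language, but for anyone who can use one, the five steps above collapse into a single non-greedy match between the two markers. A minimal Python sketch, assuming the markers appear exactly as quoted in the question (extract_blocks is a name I made up):

```python
import re

START = '<div class="row-fluid">'
END = '<br /> <br /><i class="icon-user"></i>'


def extract_blocks(html):
    """Return the text found between each START marker and the next END marker."""
    # re.DOTALL lets "." match newlines, so there is no need to join the lines first.
    pattern = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)
    return [block.strip() for block in pattern.findall(html)]
```

Everything outside the markers is simply never captured, which replaces the "~" trick used for deleting the unwanted lines.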
You should consider retrieving the data using XPath. Most languages support it.
There's a great Firefox plugin called XPather that infers the XPath expression when you select a page item.
There's a hacked version that works for newer Firefox versions here
http://jassage.com/xpather-1.4.5b.xpi
To use XPath with Python, consider using http://xmlsoft.org/python.html
Note that XPath may have problems with malformed HTML, so you may also find tidy an interesting option to "clean up" the HTML and get parseable XML.
http://tidy.sourceforge.net/
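To give an idea of what the XPath route looks like, here is a small sketch using Python's standard xml.etree.ElementTree instead of the libxml2 bindings linked above. ElementTree supports only a subset of XPath and needs well-formed markup (for example, the output of tidy). The sample markup and the words in it are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup standing in for the tidied page.
xhtml = """<html><body>
<div class="row-fluid"><p>abase</p></div>
<div class="sidebar"><p>ads</p></div>
<div class="row-fluid"><p>abate</p></div>
</body></html>"""

root = ET.fromstring(xhtml)

# XPath-style query: every div, at any depth, whose class attribute is "row-fluid".
rows = root.findall(".//div[@class='row-fluid']")

# Join all the text inside each matching div.
words = ["".join(div.itertext()).strip() for div in rows]
print(words)
```

The predicate does an exact match on the attribute value, so a div with several classes would need a different test.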
IMHO doing it with Notepad++ is difficult. According to this, you need to:
remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
Either you want to learn regexps, or you want to parse the HTML. Depending on which, the solution differs.
If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.
If you want to resolve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In Python you have some great libraries like BeautifulSoup (which can deal with broken HTML). You can do it with DOM parsing, or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could use a simple stack to push all the content between the two events...
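Here is a rough sketch of the per-event idea. It uses html.parser.HTMLParser from the Python standard library rather than a real SAX parser or BeautifulSoup, because it tolerates ordinary HTML and needs no install. The class name RowCollector is made up, and it assumes the class attributes are exactly "row-fluid" and "icon-user".

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text between <div class="row-fluid"> and <i class="icon-user">."""

    def __init__(self):
        super().__init__()
        self.collecting = False
        self.current = []   # text pieces of the block being read
        self.blocks = []    # finished blocks

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div" and attributes.get("class") == "row-fluid":
            # Start event: begin a fresh block.
            self.collecting = True
            self.current = []
        elif tag == "i" and attributes.get("class") == "icon-user" and self.collecting:
            # Stop event: store what was gathered since the start event.
            self.blocks.append(" ".join(self.current))
            self.collecting = False

    def handle_data(self, data):
        # Keep only non-empty text seen between the two events.
        if self.collecting and data.strip():
            self.current.append(data.strip())
```

Feed it the page source with parser.feed(text) and read the results from parser.blocks.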

Regex extract html source with multiple elements

Before you tell me not to use regex to parse HTML: I'm aware of this, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with using regex to achieve my goal.
What I need is to take the following example html and extract each line
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
What I need is to break down each item into an expression that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>. /<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags too, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.
Can anyone offer some advice on how to set up the expression? I already know that the general consensus is "no" to regex, so no need to go down that route again :)
This is all quite new to me, so I hope I've explained what I'm trying to do.
Thanks in advance
I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?", which means a non-greedy match of any character. Very useful.
What language are you using? You may have to set options to allow parsing of newlines, otherwise it could be treating each line as a separate input.
In Python, add the re.DOTALL option. In PHP, add the s modifier after the closing delimiter (for example /pattern/s).
<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
For the purposes of using this with the data extractor, I've done some research on getting data between two keywords and (Item 1:.*?<br>)/gi works brilliantly.
Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
Thanks so much for responding and trying to help
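For reference, here is how the same patterns behave in Python, including the follow-up about stripping tags. This is only a sketch on the sample HTML from the question. The Data Extractor itself needs JavaScript regex, and the tag-stripping pattern is a blunt assumption that no attribute value contains a ">" character.

```python
import re

source = """<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>"""

# Each line from <b>Item N</b> up to the nearest <br>, tags included.
item_lines = re.findall(r"<b>Item \d+</b>.*?<br>", source, re.IGNORECASE | re.DOTALL)

# The same lines with every tag stripped off.
plain_items = [re.sub(r"<[^>]+>", "", line).strip() for line in item_lines]

# Remove only the opening and closing font tags, keeping their content.
without_font = re.sub(r"</?font[^>]*>", "", source, flags=re.IGNORECASE)

print(item_lines)
print(plain_items)
```

The non-greedy .*? has no trouble with the closing </b> in the middle; it simply runs on until the first <br>.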

Multiline Edit Box value into HTML to be sent in email in xpages

I found some great JavaScript code (the xpHTMLMail file) that creates an HTML e-mail on the fly from an XPage document in which users write a review of a salesperson. However, there are some Multiline Edit Boxes on the page, and their values contain carriage returns, spaces, etc. These do not come over when the values are added to the HTML. Is there anything I can do to keep the formatting in the e-mail that is created? Thanks in advance.
Here's the code that deals with this part of my question (inputClosing is a Multiline Edit Box):
mail.addHTML("<br /><br /><b>Closing</b><br />"+getComponent('inputClosing').getValue())
If inputClosing has...
"Dear Joe,
Great work. Keep it up!
Thanks,
Bill"
It comes into the email as...
Dear Joe, Great Work. Keep it up! Thanks, Bill
I wrote that library so thanks!
Since you're creating an HTML mail, you need to replace the line breaks in the value of the Multiline Edit Box with <br /> tags. Since you're dealing with Java in XPages, the line breaks are stored in the value as the \r\n sequence.
You can replace them using the replaceAll() or (SSJS) @ReplaceSubstring() function.
Your code might then look like this:
var content:string = getComponent("inputClosing").getValue();
mail.addHTML("<b>Closing</b><br />" + content.replaceAll("\r\n", "<br />") );
Mark's suggestion definitely works, but it might be easier to just wrap the text fields in <pre></pre>. The line breaks and spacing will then be shown as typed, no matter what kind of formatting character is in the value.
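The line-break replacement itself is language-neutral, so here is a small Python sketch of it for illustration (not XPages code). It additionally assumes that the value may contain bare \n or \r line breaks as well as \r\n, and that the user's text should be escaped before going into the HTML. text_to_html is a made-up name.

```python
import html


def text_to_html(value):
    """Turn plain multi-line text into HTML that keeps its line breaks."""
    # Escape <, > and & first so user text cannot break the mail's markup.
    escaped = html.escape(value)
    # Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    normalized = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "<br />")
```

Escaping has to happen before the <br /> tags are inserted, otherwise the tags themselves would be escaped too.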