How to delete a similar fragment on several HTML files? - html

I'm converting a website to a PDF. The pages contain images, and next to each image there is a text link that, when clicked, takes you to the image itself.
I think the following code is responsible for showing that text, since I deleted it in one of the files and the text and link are no longer shown.
<div class="v1"><a target="_self" href="images/graphics/1.jpg">[View full size image]</a></div>
The problem is that there are about 200 more HTML documents containing this same fragment, with only the href changing.
Would there be any easy way to get rid of all this without having to go one by one? Maybe a regular expression for sed?

If the expression is always on one line and the only difference is in href, sed is a possible solution:
sed -e 's,<div class="v1"><a target="_self" href="[^"]*">\[View full size image\]</a></div>,,'
I used an alternative separator , so / does not have to be escaped in closing tags. The brackets in the link's text need to be escaped, though.
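For roughly 200 files, the same substitution can also be scripted. The following is a minimal Python sketch, not part of the original answer. It assumes the fragment always sits on one line, the files are UTF-8, and they live under a single directory. The names `strip_fragment` and `strip_fragment_in_tree` are made up for illustration.

```python
import re
from pathlib import Path

# Same pattern as the sed expression: only the href value varies.
FRAGMENT = re.compile(
    r'<div class="v1"><a target="_self" href="[^"]*">'
    r'\[View full size image\]</a></div>'
)

def strip_fragment(html: str) -> str:
    """Return html with every 'View full size image' link block removed."""
    return FRAGMENT.sub("", html)

def strip_fragment_in_tree(root: str) -> int:
    """Rewrite every *.html file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        original = path.read_text(encoding="utf-8")
        cleaned = strip_fragment(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling `strip_fragment_in_tree("site")` (with `site` standing in for the real directory) would rewrite the files in place, so it is worth running on a backup copy first.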

Yes, regular expressions are likely the easiest solution here. If it's simply a question of removing this line from all your files then I'd just open them up in an editor (Sublime Text 2 does this well) and perform a regex search and replace. The following search pattern will likely work:
<div class=\"v1\"><a target=\"_self\" href=\"[^"]+\">\[View full size image\]</a></div>

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no other < or > characters in the text (or that they are entity-encoded), here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
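To see what the pattern returns on the sample from the question, here is a small Python sketch. The pattern is the one from the answer. The helper `visible_text`, which joins the matches with spaces and collapses whitespace, is an addition for illustration and assumes that no tag splits a word in two.

```python
import re

# The answer's pattern: a run of non-'<' characters that starts
# either right after a '>' or at the start of the string.
TEXT_RUN = re.compile(r"(?:(?<=>)|^)[^<]+")

def visible_text(html: str) -> str:
    """Join the matched text runs with spaces and collapse whitespace."""
    runs = TEXT_RUN.findall(html)
    return " ".join(" ".join(runs).split())

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

print(TEXT_RUN.findall(sample))
# ['just a small ', '\nexample', ' link', 'to a webpage']
print(visible_text(sample))
# just a small example link to a webpage
```

Each match is a separate text run, so whatever consumes the matches still has to join them. Joining without a separator would produce "linkto" at the paragraph boundary.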

Display text as html markup

I have a problem which is probably trivially easy but I can't seem to get it working. Using this post, I do a search using Regex in a text string to convert any links into html markup, but when it comes to display on the page it just displays like this:
this is link
<a href='http://www.google.com'>http://www.google.com</a>
In the view I have:
<p>@news.Body</p>
edit: great my question is now displaying how I want. So now to the actual question, how do I get the page displaying an actual link instead of the code when displayed to the user.
Use `` around your variable (e.g.)
Use "{}" icon in toolbar to insert code
Indent your code by 4 spaces, with an empty line before it
E.g.:
Like this
You can edit this answer to see raw output

How to remove a div from the entire project?

I've got a project consisting of over 200 html files. There's a div repeated throughout most of these, looking like this:
<div class='foobar' id="abcdef123"></div>
I have found all uses of the class using the Find in Files function in Sublime Text 2 - now I want to remove them, i.e. completely delete any line containing that div (and its closing tag).
Is there an easy way to do it in Sublime Text 2?
EDIT: I have forgotten to mention that sometimes the div has additional classes and the ID is always different. How would I write a regexp to deal with that?
In Notepad++, open all 200 files and search for the following regular expression:
<div class='foobar[^']*' id="[^"]*"></div>
Replace it with nothing. I don't know Sublime Text 2.
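If an editor is not convenient, the same idea can be sketched in Python. This is an illustration, not part of the original answer. It uses a variant of the pattern in this answer with `[^"]*` for the id value, assumes single quotes around class and double quotes around id, and anchors the pattern so the whole line is deleted, as the question asks. The name `drop_foobar_divs` is hypothetical.

```python
import re

# Variant of the answer's pattern, anchored to whole lines.
# [^']* after "foobar" allows additional classes; [^"]* allows any id.
FOOBAR_LINE = re.compile(
    r"""^[ \t]*<div class='foobar[^']*' id="[^"]*"></div>[ \t]*\r?\n?""",
    re.MULTILINE,
)

def drop_foobar_divs(html: str) -> str:
    """Remove every line that consists only of the empty foobar div."""
    return FOOBAR_LINE.sub("", html)

page = "<p>a</p>\n  <div class='foobar extra' id=\"x1\"></div>\n<p>b</p>\n"
print(drop_foobar_divs(page))
```

A div that shares its line with other markup is left alone by this sketch. Dropping the `^[ \t]*` prefix and the trailing part would remove just the tag instead of the line.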

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace the opening and closing tags, leaving the content of the tags and their attributes intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements that have a specific attribute or set of attributes, you will want a regex replacement with backreferences. How you specify and refer to a backreference unfortunately depends on your editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation. Both support regex replacement, as I imagine whatever editor you're using does; you just have to know the menu item or keystroke that brings up the dialog.
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
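As one concrete flavor, the find/replace pairs described in this answer can be tried in Python, where backreferences in the replacement are written \1, \2. This is only a sketch on short literal strings. The second input string is invented for illustration.

```python
import re

html = '<div class="QText">Text to be kept</div>'

# One group: the tag content is captured and re-emitted with \1.
one = re.sub(
    r'<div class="QText">(.*?)</div>',
    r'<span class="QText">\1</span>',
    html,
)
print(one)
# <span class="QText">Text to be kept</span>

# Two groups: \1 is the class name, \2 is the content.
mixed = '<div class="RText">A</div><div class="Other">B</div>'
two = re.sub(
    r'<div class="(QText|RText|SText)">(.*?)</div>',
    r'<span class="\1">\2</span>',
    mixed,
)
print(two)
# <span class="RText">A</span><div class="Other">B</div>
```

Only the div whose class is in the alternation is rewritten; the other one is left untouched.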
Just replace div.
// Note: this replaces every occurrence of "div", including any inside the text content.
var s = "<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g, "span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>
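A quick way to check this find/replace pair is a Python sketch, where the replacement uses \1 instead of $1. The nested input below is an invented example. It shows that the lazy .*? stops at the first closing div tag, so the approach assumes the target div contains no nested div.

```python
import re

FIND = r'<div( class="QText">.*?</)div>'
REPLACE = r'<span\1span>'

simple = '<div class="QText">Text to be kept</div>'
print(re.sub(FIND, REPLACE, simple))
# <span class="QText">Text to be kept</span>

# Hypothetical nested case: the match ends at the inner closing tag,
# so the result is no longer well-formed.
nested = '<div class="QText"><div>inner</div> tail</div>'
print(re.sub(FIND, REPLACE, nested))
# <span class="QText"><div>inner</span> tail</div>
```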

Find and replace a lot of <a> tags, but the url are different? Is there some way to do it?

I have a page that I need to fix..
There are thousands of <a> tags like <a href="kl1j23l123l12j3"> that I need to get rid of, but the problem is that each <a> tag has a different URL in its href attribute. So I am wondering if there is some advanced way to get rid of the whole anchor/link but keep the link text, as that would save me a whole lot of time.
Example
Input : StackOverflow.com
Output: StackOverflow.com
Thanks.
Maybe this is a solution using JavaScript and jQuery. It also can be tweaked to only get the values of links that do not start with http. I wasn't quite sure whether this is relevant according to the links in the question.
// get all links within the document
var links = $('a');
// simply get all link texts
var x = links.text();
// or just get the links like 'kl1j23l123l12j3', i.e. those whose href doesn't start with 'http'
x = links.not('[href^="http"]').text();
Here's a demo: http://jsfiddle.net/rg3ET/
Instead of collecting them all into one variable ("x"), you could of course loop through them and output them individually.
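Outside the browser, the same "collect each link text individually" idea can be sketched with Python's standard html.parser module. This is an illustration rather than a translation of the jQuery code. The class name `LinkTextCollector` is made up, and the rule "href does not start with http" is taken from the description in this answer.

```python
from html.parser import HTMLParser

class LinkTextCollector(HTMLParser):
    """Collect the text of each <a> whose href does not start with 'http'."""

    def __init__(self):
        super().__init__()
        self.texts = []       # one entry per matching link
        self._inside = False  # True while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self._inside = not href.startswith("http")
            if self._inside:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data

collector = LinkTextCollector()
collector.feed(
    '<a href="kl1j23l123l12j3">StackOverflow.com</a> '
    '<a href="http://example.com">skipped</a>'
)
print(collector.texts)
# ['StackOverflow.com']
```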
The following would work under the assumption that each anchor tag is on its own line.
Example:
asdf
<div>
</div>
asdf2
Notepad++ has a regex find and replace feature, which may work for your need.
1. Replace all </a> tags with nothing.
2. Use a regex to find all <a href="anything"> and replace with nothing.
For step 2, I used a regex of <a .*>. For this to work properly, there should only be one > character per line. Otherwise, the regex will make the longest possible match, possibly including a bunch of other tags. This is why I said the procedure would only work for anchor tags that are on their own lines.
The steps in Notepad++ (again, this only works when each anchor tag is on its own line):
From Notepad++ menu: Search > Replace
Select Regular expression
In Find what box, put <a .*>
Click Replace All
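The two replacements can also be combined into a single pass in a script. The Python sketch below is a variation on the steps above, not the answer's exact regex. It uses [^>]* in place of the greedy .* so that a match stops at the first >, which lifts the one-tag-per-line restriction as long as no attribute value contains a > character. The name `unwrap_links` is hypothetical.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag.
# \b keeps tags such as <abbr> from matching.
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)

def unwrap_links(html: str) -> str:
    """Remove anchor tags but keep the link text."""
    return ANCHOR_TAG.sub("", html)

line = '<p><a href="kl1j23l123l12j3">StackOverflow.com</a> and <a href="x">more</a></p>'
print(unwrap_links(line))
# <p>StackOverflow.com and more</p>
```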