Globally search and replace the path of images referenced?


I have various images that are referenced as such:
src="http://myblog.files.example.com/2011/08/image-1.jpg"
src="http://myblog.files.example.com/2010/05/image-2.jpg"
src="http://myblog.files.example.com/2012/01/image-3.jpg"
As you can see, the only thing that changes in the image paths is the numbers (dates) at the end. What I'd like to do is simply change the path for all images to something like:
/sites/default/files/blog-images/
... so they would all be like:
src="/sites/default/files/blog-images/image-1.jpg"
src="/sites/default/files/blog-images/image-2.jpg"
src="/sites/default/files/blog-images/image-3.jpg"
I am wondering if there is a way to do this using regular expressions or some other method? There are hundreds of images all with different numbers in the path so doing this manually is not ideal.
Complete sample line of code:
<a href="http://myblog.files.example.com/2011/07/myimage-1.jpg">
<img class="alignright size-medium wp-image-423" title="the title" src="http://myblog.files.example.com/2011/07/myimage-1.jpg" alt="the alt" width="300" height="199" />
</a>

src="http://myblog\.files\.example\.com/\d{4}/\d{2}/([^"]+)"
This matches each src attribute and captures the image file name in $1. Because the match includes src=" and the closing quote, replace it with src="/sites/default/files/blog-images/$1"
If your editor doesn't support counted repetition such as \d{4}, repeat \d instead:
src="http://myblog\.files\.example\.com/\d\d\d\d/\d\d/([^"]+)"
http://regexr.com?30o8e
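
If a script is easier than an editor, the same search and replace can be sketched in Python. This is a minimal sketch, not a definitive implementation: the host name and target directory come from the question, while the function and constant names are made up for illustration. It replaces only the dated prefix, so href and src attributes are both rewritten and the file name is left untouched.

```python
import re

# Dated upload prefix from the question; the dots are escaped so they match literally.
OLD_PREFIX = re.compile(r'http://myblog\.files\.example\.com/\d{4}/\d{2}/')
# Target directory from the question.
NEW_PREFIX = '/sites/default/files/blog-images/'


def rewrite_image_paths(html: str) -> str:
    """Swap the dated prefix for the new directory, keeping the file name."""
    return OLD_PREFIX.sub(NEW_PREFIX, html)


sample = '<img src="http://myblog.files.example.com/2011/07/myimage-1.jpg" alt="the alt" />'
print(rewrite_image_paths(sample))
# <img src="/sites/default/files/blog-images/myimage-1.jpg" alt="the alt" />
```

URLs on other hosts, or without the four-digit/two-digit date folders, do not match and are left as they are.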

A simple sed one-liner can do the trick.
This one prints the result so you can test it:
sed -r 's#http://myblog\.files\.example\.com/[0-9]{4}/[0-9]{2}/#/sites/default/files/blog-images/#g' FILE
This one changes the file in place:
sed -ri 's#http://myblog\.files\.example\.com/[0-9]{4}/[0-9]{2}/#/sites/default/files/blog-images/#g' FILE

There are many ways to do what you are trying to do. Since you tagged "grep" in this post and did not specify any particular programming language, I am going to assume that you want to use UNIX.
First, test out this command:
find . -type f | sed 's/\.\/\([0-9]\{4\}\/[0-9]\{2\}\/\)\(.*\)/mv & sites\/default\/files\/blog-images\/\2/'
It will print out a list of "mv" commands that will be executed, and if it works how you want it to, you can pipe this into sh like so:
find . -type f | sed 's/\.\/\([0-9]\{4\}\/[0-9]\{2\}\/\)\(.*\)/mv & sites\/default\/files\/blog-images\/\2/' | sh

Related

How do I do a regex only the specific selection between two tags?

There have been dozens of similar questions asked, but my question is about a specific selection between the tags. I don't want the entire selection from <a href to </a>, I only need to target the "> between those tags itself.
I am trying to convert a href links into wikilinks. For example, if the sample text has:
Light is light.
<div class="reasons">
I wanted to edit the file itself and change from <a href="link.html">Link</a> into [[link.html|Link]]. The basic idea that I have right now uses 3 sed edits as follows:
<a href="link.html">Link</a> -> <a href="link.html|Link</a>
<a href="link.html|Link</a> -> [[link.html|Link</a>
[[link.html|Link</a> -> [[link.html|Link]]
My problem lies with the first step; I can't find the regex that only targets "> between <a href and </a>.
I understand that the basic idea would be to put the search target between a lookbehind and a lookahead. But trying it on regexr failed. I also tried using a conditional regex. I can't find the syntax I used, but it either returned an error or it worked but also captured the div class.
Edit: I'm on Ubuntu and using a bash script using sed to do the text manipulation.

The basic idea that I have right now uses 3 sed edits
Assuming you've also read the answers underneath those dozens of similar questions, you could've known that it's a bad idea to parse HTML with sed (regex).
With an HTML-parser like xidel this would be as simple as:
$ xidel -s '<a href="link.html">Link</a>' -e 'concat("[[",//a/@href,"|",//a,"]]")'
$ xidel -s '<a href="link.html">Link</a>' -e '"[["||//a/@href||"|"||//a||"]]"'
$ xidel -s '<a href="link.html">Link</a>' -e 'x"[[{//a/@href}|{//a}]]"'
[[link.html|Link]]
Three different queries to concatenate strings. The 1st query uses the XPath concat() function, the 2nd query uses the XPath || operator and the 3rd uses xidel's extended string syntax.
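
If xidel isn't available, the same parser-based idea can be sketched with Python's standard html.parser module. This is only a sketch under assumptions: the class name, function name and sample markup are made up for illustration, and it collects the links rather than editing the file in place.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def to_wikilinks(html):
    """Return one [[href|text]] string per link found in the markup."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [f"[[{href}|{text}]]" for href, text in parser.links]


# Assumed sample markup; the div mirrors the one mentioned in the question.
sample = '<div class="reasons"><a href="link.html">Link</a></div>'
print(to_wikilinks(sample))
# ['[[link.html|Link]]']
```

Because the parser only reacts to a elements, the surrounding div and its class attribute are never touched, which is the failure mode described in the question.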

Grep / Regex pattern for HTML tag containing certain keyword

I'm having trouble with selecting everything between and including <p and /p> that contains www.test.com using grep find in BBEdit.
Sample HTML Code
<p>....</p><p align="center"><img src="http://www.test.com/test.jpg"></p>
Desired result
<p align="center"><img src="http://www.test.com/test.jpg"></p>
I've tried the below grep pattern....
<p.+?www\.test\.com.+?/p>
The trailing part of the pattern, .+?/p>, works well. The problem is the leading part, <p.+?: the match starts too early, at a <p several paragraphs before the one containing the keyword. I'm only after the <p that directly encloses the keyword.
Essentially I want to remove this code from a large HTML file (50 MB). The ideal solution would be a BBEdit grep find pattern, as this is what I'm familiar with; however, if there is a better way to do the same then I'm all ears.
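
One way to stop the match from starting at an earlier <p is to forbid the lazy part from crossing another <p or </p tag, using a negative lookahead at every step. Below is a minimal sketch of that idea in Python; the function and constant names are made up for illustration. A similar pattern may carry over to editors whose grep supports lookahead, but that has not been tested here.

```python
import re

P_WITH_KEYWORD = re.compile(
    r'<p\b[^>]*>'            # an opening <p ...> tag
    r'(?:(?!</?p\b).)*?'     # text that never starts another <p or </p
    r'www\.test\.com'        # the keyword
    r'.*?</p>',              # up to the nearest closing tag
    re.DOTALL,               # let . cross line breaks
)


def remove_keyword_paragraphs(html: str) -> str:
    """Delete every <p>...</p> block that contains www.test.com."""
    return P_WITH_KEYWORD.sub('', html)


sample = '<p>....</p><p align="center"><img src="http://www.test.com/test.jpg"></p>'
print(P_WITH_KEYWORD.findall(sample))
# ['<p align="center"><img src="http://www.test.com/test.jpg"></p>']
print(remove_keyword_paragraphs(sample))
# <p>....</p>
```

The first <p> in the sample is rejected because reaching the keyword from there would mean crossing </p>, so the match can only begin at the paragraph that actually contains the keyword.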

grep to extract out regular expression href and rel from html

The HTML I'm dealing with looks a little like this:
<a class="title may-blank" data-event-action="title" href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" tabindex="1" data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" data-inbound-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/?utm_content=title&utm_medium=hot&utm_source=reddit&utm_name=frontpage" rel="">We can play singleplayer games OFF THE INTERNET? Are they seriously that out of touch to advertise this?</a>
There are multiple lines like that.
I only want the stuff that's between the quotes in href="http://xxxxxxxx" and rel="">yyyyyyyyyy, the rest is unnecessary.
I'd like them to output like this, a new line for every block above:
yyyyyyyyyy
Any idea how I would get around doing this?

So here is a 10s solution. It may be a little brittle but should work assuming the string is in a file called html.txt
cat html.txt | sed 's/class.*href/href/' | sed 's/data-in.*rel=/rel=/'

Your html example leads me to the following pattern to get the required values:
<a class=\"(.*) href=\"/(.*)\" tabindex=(.*) rel=\"\">(.*)</a>
Replace the matches by using the following pattern:
$4
You can try it out at regexe; for me it works as expected.
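
For comparison, here is a minimal Python sketch that pulls out the href value and the link text in one pass. It assumes the attributes are double-quoted and that each link sits inside a single a element, as in the question. The sample line is an abbreviated version of the one above, and the function and constant names are made up for illustration.

```python
import re

# \s before href keeps data-href-url and similar attributes from matching.
ANCHOR = re.compile(r'<a\b[^>]*?\shref="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)


def extract_links(html):
    """Return a list of (href, link text) tuples, one per <a> element."""
    return ANCHOR.findall(html)


# Abbreviated sample; the real lines carry longer URLs and more attributes.
sample = (
    '<a class="title may-blank" data-event-action="title" '
    'href="/r/gaming/comments/6t8dj0/" tabindex="1" '
    'data-href-url="/r/gaming/comments/6t8dj0/" rel="">'
    'We can play singleplayer games</a>'
)

for href, text in extract_links(sample):
    print(href, text)
# /r/gaming/comments/6t8dj0/ We can play singleplayer games
```

Like the sed version, this is brittle against markup that differs from the sample, for example single-quoted attributes.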

Find duplicated id keys in html using a regex

Given an html file, how could I find if there's some repeated id value using a regular expression? I need it for searching it on SublimeText.
For example: using the id=("[^"]*").*id=\1 I can find duplicated id keys in the same line
<img id="key"><img id="key">
But what I need is to perform the same in multiple lines and with different pairs of keys. In this case for example key and key2 are repeated ids.
<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">
Note: I'm using the img tag only as an example, the HTML file is more complex.

For whatever reason, Sublime's . matcher doesn't include line breaks, so you'll need to do something like this: id=("[^"]+")(.|\n)*id=\1
Honestly though, I'd rather use Unix utilities:
grep -Eo 'id="[^"]+"' filename | sort | uniq -c
3 id="key"
2 id="key2"
1 id="key3"
If these are complete HTML documents, you could use the W3C's HTML validator to catch dups along with other errors.

If all you're trying to do is find duplicated IDs, then here's a little Perl program I threw together that will do it:
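use strict;
use warnings;

my %ids;

# Count every double-quoted id value across all input lines.
while ( <> ) {
    while ( /id="([^"]+)"/g ) {
        ++$ids{$1};
    }
}

# Report only the ids that occur more than once.
while ( my ($id,$count) = each %ids ) {
    print "$id shows up $count times\n" if $count > 1;
}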
Call it "dupes.pl". Then invoke it like this:
perl dupes.pl file.html
If I run it on your sample, it tells me:
key shows up 3 times
key2 shows up 2 times
It has some restrictions, like it won't find id=foo or id='foo', but probably will help you down the road.

By default, the . in Sublime Text's regex search doesn't match line breaks. You can use a mode modifier to switch on single-line mode, which makes . match newlines as well:
(?s)id=("[^"]+").*id=\1
The (?s) is the single line mode modifier.
However, this regex does a poor job of finding all duplicate keys since it will only match from key to key in your sample HTML. You probably need a multi-step process to find all keys, which could be programmed. As others have shown, you'll need to (1) pull all the ids out first, then (2) group them and count them to determine which are dupes.
Alternately, the manual approach would be to change the regex pattern to look-ahead for duplicate ids, then you can find the next match in Sublime Text:
(?s)id=("[^"]+")(?=.*id=\1)
With the above pattern, and your sample HTML, you'll see the following matches highlighted:
<img id="key"> <-- highlighted (dupe found on 3rd line)
<img id="key2"> <-- highlighted (dupe found on 5th line)
<img id="key"> <-- highlighted (next dupe found on last line)
<img id="key3">
<img id="key2">
<img id="key">
Notice that the look-ahead doesn't highlight the later duplicates themselves. Each match stops at an earlier occurrence and only indicates that a duplicate exists further down the file.
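
The two-step process described above (pull out all the ids, then group and count them) can be sketched in Python as follows. This is a minimal sketch with made-up names. It accepts single- and double-quoted values but not unquoted ones.

```python
import re
from collections import Counter

# id="..." or id='...'; the lookbehind skips attributes such as data-id.
ID_ATTR = re.compile(r'''(?<![\w-])id\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def duplicate_ids(html):
    """Return {id: count} for every id value that appears more than once."""
    ids = [dq or sq for dq, sq in ID_ATTR.findall(html)]   # step 1: extract
    counts = Counter(ids)                                  # step 2: count
    return {value: n for value, n in counts.items() if n > 1}


sample = """<img id="key">
<img id="key2">
<img id="key">
<img id="key3">
<img id="key2">
<img id="key">"""

print(duplicate_ids(sample))
# {'key': 3, 'key2': 2}
```

Unlike the look-ahead pattern, this reports each duplicated id once, with its total count, regardless of where the repeats sit in the file.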

Here is an AWK script to look up duplicated img id values:
awk '{
    # Reduce the second field, e.g. id="key">, to the bare id value.
    $2 = tolower($2);
    gsub(/(id|["=>])/, "", $2);
    if (NF == 2)
        imgs[$2]++;
}
END {
    for (img in imgs)
        printf "Img ID: %s\t appears %d times\n", img, imgs[img]
}' file.txt

Regular expression to add a word between html-tags (newbie)

I can't seem to create a regular expression that would work in this situation:
I have hundreds of lines that look like this:
<a title="Match" href="http://mywebsite.com/category/Match"></a>
I would need to have the title word inserted between the html tags, like so:
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
Here's my feeble attempt at it (using Notepad++):
Find:
title="([A-Za-z][A-Za-z0-9]*?)"([A-Za-z][A-Za-z0-9]*?)><
Replace:
title="\1"\2>\1<
As you can see, I really suck at regular expressions :D
Any help would be appreciated!
EDIT:
I should clarify that this is a one-time operation carried out in Notepad++ with the find and replace panel.
I should also clarify that the word "Match" is going to be different on each line.

This works in Notepad++ 6.3.2
Find what :
(title\=")([^"]+)("[^>]+>)(<)
Replace with :
\1\2\3\2\4

Use Capture Groups and Back-References
You can capture parts of your match using capture groups, and then replace them with back-references. The specific syntax may vary by language and implementation. Here are two examples.
Ruby Example
str = %q{<a title="Match" href="http://mywebsite.com/category/Match"></a>}
str.sub(/(Match)(">)</, '\1\2\1<')
# => "<a title=\"Match\" href=\"http://mywebsite.com/category/Match\">Match</a>"
GNU sed Example
$ echo '<a title="Match" href="http://mywebsite.com/category/Match"></a>' |
sed -r 's/(Match)(">)</\1\2\1</'
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
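
Both examples above hard-code the word "Match". Since the title differs on each line, the title value itself can be captured and reused as the back-reference. Here is a minimal Python sketch of that variant. The names are made up for illustration, and it assumes double-quoted title attributes and empty link bodies, as in the question.

```python
import re

EMPTY_LINK = re.compile(
    r'(<a\b[^>]*\btitle="([^"]+)"[^>]*>)'   # group 1: whole opening tag, group 2: title value
    r'(</a>)'                                # group 3: closing tag directly after it
)


def fill_link_text(html: str) -> str:
    """Insert each link's title between its opening and closing tags."""
    return EMPTY_LINK.sub(r'\1\2\3', html)


sample = (
    '<a title="Match" href="http://mywebsite.com/category/Match"></a>\n'
    '<a title="Other" href="http://mywebsite.com/category/Other"></a>'
)
print(fill_link_text(sample))
# <a title="Match" href="http://mywebsite.com/category/Match">Match</a>
# <a title="Other" href="http://mywebsite.com/category/Other">Other</a>
```

Links that already contain text are not matched, because the pattern requires </a> to follow the opening tag immediately.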
