Using grep to extract href and rel values from HTML

The HTML I'm dealing with looks a little like this:
<a class="title may-blank" data-event-action="title" href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" tabindex="1" data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" data-inbound-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/?utm_content=title&utm_medium=hot&utm_source=reddit&utm_name=frontpage" rel="">We can play singleplayer games OFF THE INTERNET? Are they seriously that out of touch to advertise this?</a>
There are multiple lines like that.
I only want the value between the quotes in href="http://xxxxxxxx" and the text that follows rel="">, shown here as yyyyyyyyyy; the rest is unnecessary.
I'd like the output to look like this, with a new line for every block above:
yyyyyyyyyy
Any idea how I would go about doing this?

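Here is a ten-second solution. It is a little brittle, but it should work assuming the HTML is in a file called html.txt:
sed -e 's/class.* href="/href="/' -e 's/" tabindex.*rel=/" rel=/' html.txt
The patterns are anchored on ' href="' and '" tabindex' so that the greedy .* stops at the real href attribute instead of running on into data-href-url.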

Based on your HTML example, the following pattern captures the required values:
<a class=\"(.*) href=\"/(.*)\" tabindex=(.*) rel=\"\">(.*)</a>
Replace each match with the following pattern ($4 is the link text; $2 holds the href without its leading slash):
$4
You can try it out at regexe; for me it works as expected.
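If a script is an option, the same extraction can be sketched with an HTML parser instead of a regex. The following is a minimal sketch, assuming Python 3 and its standard html.parser module; the class name LinkExtractor, the helper extract_links, and the shortened sample line are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> that has href and rel attributes."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs and "rel" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Shortened, hypothetical version of the sample line from the question.
sample = ('<a class="title may-blank" href="/r/gaming/comments/6t8dj0/x/" '
          'tabindex="1" rel="">We can play singleplayer games</a>')
for href, text in extract_links(sample):
    print(href)
    print(text)
```

Because the parser looks at attributes by name, the order of the attributes and any extra data-* attributes in between do not matter, which is where the regex and sed versions tend to break.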

Related

How do I do a regex only the specific selection between two tags?

Dozens of similar questions have been asked, but my question is about a specific selection between the tags. I don't want the entire selection from <a href to </a>; I only need to target the "> between those tags.
I am trying to convert a href links into wikilinks. For example, if the sample text has:
Light is light.
<div class="reasons">
I want to edit the file itself and change <a href="link.html">Link</a> into [[link.html|Link]]. The basic idea that I have right now uses 3 sed edits as follows:
<a href="link.html">Link</a> -> <a href="link.html|Link</a>
<a href="link.html|Link</a> -> [[link.html|Link</a>
[[link.html|Link</a> -> [[link.html|Link]]
My problem lies with the first step; I can't find the regex that only targets "> between <a href and </a>.
I understand that the basic idea would be to put the search target between a lookahead and a lookbehind, but trying that on regexr failed. I also tried using a conditional regex. I can't find the syntax I used, but it either returned an error or it worked but also captured the div class.
Edit: I'm on Ubuntu and using a bash script using sed to do the text manipulation.
The basic idea that I have right now uses 3 sed edits
If you have read the answers under those dozens of similar questions, you will know that parsing HTML with sed (regex) is a bad idea.
With an HTML-parser like xidel this would be as simple as:
$ xidel -s '<a href="link.html">Link</a>' -e 'concat("[[",//a/@href,"|",//a,"]]")'
$ xidel -s '<a href="link.html">Link</a>' -e '"[["||//a/@href||"|"||//a||"]]"'
$ xidel -s '<a href="link.html">Link</a>' -e 'x"[[{//a/@href}|{//a}]]"'
[[link.html|Link]]
Three different queries to concatenate strings. The 1st query uses the XPath concat() function, the 2nd query uses the XPath || operator and the 3rd uses xidel's extended string syntax.
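Where xidel is not available, the same parser-based idea can be sketched in Python with the standard html.parser module. This is only a rough sketch under stated assumptions: the class WikiLinker and the helper to_wikilinks are hypothetical names, everything that is not an <a href> element is copied through as-is, and end tags are re-emitted in lowercase.

```python
from html.parser import HTMLParser


class WikiLinker(HTMLParser):
    """Re-emit HTML unchanged, except that <a href="X">Y</a> becomes [[X|Y]]."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be copied through.
        super().__init__(convert_charrefs=False)
        self.out = []       # output chunks
        self._href = None   # href of the link being converted, if any
        self._label = []    # chunks of the link text collected so far

    def _emit(self, text):
        # Inside a link, text belongs to the label; otherwise to the output.
        (self._label if self._href is not None else self.out).append(text)

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None and self._href is None:
            self._href, self._label = href, []
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            href, label = self._href, "".join(self._label)
            self._href = None
            self.out.append(f"[[{href}|{label}]]")
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def to_wikilinks(html):
    """Return the HTML with every <a href> element rewritten as a wikilink."""
    parser = WikiLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(to_wikilinks('<div class="reasons"><a href="link.html">Link</a></div>'))
```

Unlike the three-step sed approach, this never touches the "> of other elements such as the div, because only the closing of an <a href> element triggers the rewrite.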

Grep / Regex pattern for HTML tag containing certain keyword

I'm having trouble with selecting everything between and including <p and /p> that contains www.test.com using grep find in BBEdit.
Sample HTML Code
<p>....</p><p align="center"><img src="http://www.test.com/test.jpg"></p>
Desired result
<p align="center"><img src="http://www.test.com/test.jpg"></p>
I've tried the below grep pattern....
<p.+?www\.test\.com.+?/p>
The trailing portion of the pattern, .+?/p>, works well. The problem is the leading portion, <p.+?: the result is often too greedy, selecting many earlier <p elements up to and including the one with the keyword. I only want the <p closest to the keyword.
Essentially I want to remove this code from a large HTML file (50 MB). The ideal solution would be a BBEdit grep find pattern, as this is what I'm familiar with; however, if there is a better way to do the same, then I'm all ears.
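One way to stop the match from starting at an earlier <p is to forbid the text before the keyword from crossing another <p or </p>. The following is a minimal Python sketch of that idea, assuming paragraphs are not nested; the names P_WITH_KEYWORD and remove_paragraphs are made up for illustration, and whether the same pattern carries over unchanged to BBEdit's grep dialect is untested here.

```python
import re

# Match one <p ...>...</p> element that contains the keyword. The negative
# lookahead (?!<p\b|</p>) keeps the scan before the keyword from crossing into
# another paragraph, which is what made the original pattern over-select.
P_WITH_KEYWORD = re.compile(
    r'<p\b[^>]*>(?:(?!<p\b|</p>).)*?www\.test\.com.*?</p>',
    re.DOTALL,
)


def remove_paragraphs(html):
    """Return html with every <p> element that mentions the keyword removed."""
    return P_WITH_KEYWORD.sub('', html)


sample = '<p>....</p><p align="center"><img src="http://www.test.com/test.jpg"></p>'
print(P_WITH_KEYWORD.findall(sample))
print(remove_paragraphs(sample))
```

A lazy .+? does not help here because laziness only affects where the match ends; the regex engine still begins at the leftmost <p that can lead to a match.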

Regular expression to add a word between html-tags (newbie)

I can't seem to create a regular expression that would work in this situation:
I have hundreds of lines that look like this:
<a title="Match" href="http://mywebsite.com/category/Match"></a>
I would need to have the title word inserted between the html tags, like so:
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
Here's my feeble attempt at it (using Notepad++):
Find:
title="([A-Za-z][A-Za-z0-9]*?)"([A-Za-z][A-Za-z0-9]*?)><
Replace:
title="\1"\2>\1<
As you can see, I really suck at regular expressions :D
Any help would be appreciated!
EDIT:
I should clarify that this is a one-time operation carried out in Notepad++ with the find and replace panel.
I should also clarify that the word "Match" is going to be different on each line.
This works in Notepad++ 6.3.2
Find what :
(title\=")([^"]+)("[^>]+>)(<)
Replace with :
\1\2\3\2\4
Use Capture Groups and Back-References
You can capture parts of your match using capture groups, and then replace them with back-references. The specific syntax may vary by language and implementation. Here are two examples.
Ruby Example
str = %q{<a title="Match" href="http://mywebsite.com/category/Match"></a>}
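# Use '\1' style back-references in a single-quoted replacement; "#{$1}" would be
# interpolated before sub runs, when $1 is not yet set by this match.
str.sub(/(Match)(">)</, '\1\2\1<')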
# => "<a title=\"Match\" href=\"http://mywebsite.com/category/Match\">Match</a>"
GNU sed Example
$ echo '<a title="Match" href="http://mywebsite.com/category/Match"></a>' |
sed -r 's/(Match)(">)</\1\2\1</'
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
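Since the question says the word differs on each line, here is a sketch of the same capture-group technique that picks up whatever is in the title attribute rather than the literal word Match. It assumes Python 3; the names TITLE_INTO_LINK and fill_link_text are hypothetical.

```python
import re

# Group 1: the whole opening <a ...> tag; group 2: the title value inside it;
# group 3: the closing tag. Only empty links (<a ...></a>) are matched.
TITLE_INTO_LINK = re.compile(r'(<a\b[^>]*\btitle="([^"]+)"[^>]*>)(</a>)')


def fill_link_text(line):
    """Insert the title attribute's value as the link text of empty <a> elements."""
    return TITLE_INTO_LINK.sub(r'\1\2\3', line)


line = '<a title="Match" href="http://mywebsite.com/category/Match"></a>'
print(fill_link_text(line))
```

Group 2 is nested inside group 1, so the title value is available on its own while the opening tag is reproduced unchanged.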

Globally search and replace the path of images referenced?

I have various images that are referenced as such:
src="http://myblog.files.example.com/2011/08/image-1.jpg"
src="http://myblog.files.example.com/2010/05/image-2.jpg"
src="http://myblog.files.example.com/2012/01/image-3.jpg"
As you can see the only thing that changes in the image paths are the numbers (dates) at the end. What I'd like to do is simply change the path for all images to something like:
/sites/default/files/blog-images/
... so they would all be like:
src="/sites/default/files/blog-images/image-1.jpg"
src="/sites/default/files/blog-images/image-2.jpg"
src="/sites/default/files/blog-images/image-3.jpg"
I am wondering if there is a way to do this using regular expressions or some other method? There are hundreds of images all with different numbers in the path so doing this manually is not ideal.
complete sample line of code:
<a href="http://myblog.files.example.com/2011/07/myimage-1.jpg">
<img class="alignright size-medium wp-image-423" title="the title" src="http://myblog.files.example.com/2011/07/myimage-1.jpg" alt="the alt" width="300" height="199" />
</a>
src="http:\/\/myblog.files.example.com/\d{4}/\d{2}/([^\s]+)"
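This matches each src attribute and captures the image file name in $1. Now you can replace each match with src="/sites/default/files/blog-images/$1" (the src=" prefix and the closing quote are part of the match, so the replacement has to restore them).
If your editor doesn't support counted repetition such as {4}, you'll need to repeat \d: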
src="http:\/\/myblog.files.example.com/\d\d\d\d/\d\d/([^\s]+)"
http://regexr.com?30o8e
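For a scripted version of the same search and replace, a minimal Python sketch follows. It assumes Python 3; the names SRC_URL and rewrite_image_paths are made up, and only src attributes are rewritten, so the href on the surrounding link is left alone.

```python
import re

# Group 1 keeps the src=" prefix; group 2 captures the file name after the
# four-digit year and two-digit month directories.
SRC_URL = re.compile(
    r'(src=")http://myblog\.files\.example\.com/\d{4}/\d{2}/([^\s"]+)"'
)


def rewrite_image_paths(html):
    """Point every matching src attribute at /sites/default/files/blog-images/."""
    return SRC_URL.sub(r'\1/sites/default/files/blog-images/\2"', html)


sample = ('<img class="alignright" title="the title" '
          'src="http://myblog.files.example.com/2011/07/myimage-1.jpg" alt="the alt" />')
print(rewrite_image_paths(sample))
```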
A simple sed one-liner can do the trick.
This one is for testing (it prints the result without touching the file):
sed -r 's#http://myblog.files.example.com/[0-9]{4}/[0-9]{2}/#/sites/default/files/blog-images/#g' FILE
This one changes the file in place:
sed -ri 's#http://myblog.files.example.com/[0-9]{4}/[0-9]{2}/#/sites/default/files/blog-images/#g' FILE
There are many ways to do what you are trying to do. Since you tagged "grep" in this post and did not specify any particular programming language, I am going to assume that you want to use UNIX.
First, test out this command:
find . -type f | sed 's/\.\/\([0-9]\{4\}\/[0-9]\{2\}\/\)\(.*\)/mv & sites\/default\/files\/blog-images\/\2/'
It will print out a list of "mv" commands that would be executed, and if it works the way you want it to, you can pipe this into sh like so:
find . -type f | sed 's/\.\/\([0-9]\{4\}\/[0-9]\{2\}\/\)\(.*\)/mv & sites\/default\/files\/blog-images\/\2/' | sh

Regex match spaces in html attribute

I have a bunch of html with lines like this:
<a href="#" rel="this is a test">
I need to replace the spaces in the rel-attribute with underscores, but I'm sort of a regex-noob!
I'm using Textmate.
Can anyone help me?
/Jakob
Find: (rel="[^\s"]*)\s([^"]*")
Replace: \1_\2
This replaces only the first whitespace character in each rel value per pass, so click "Replace All" repeatedly until nothing more is replaced. It's not pretty, but it's easy to understand and works with every editor.
Change rel in the find pattern if you need to clean other attributes.
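If running a script is acceptable, the repeated "Replace All" passes can be collapsed into one by letting a callback rewrite each matched attribute value. This is a minimal sketch, assuming Python 3; REL_ATTR and underscore_rel are hypothetical names.

```python
import re

# Group 1: the rel=" prefix; group 2: the attribute value; group 3: closing quote.
REL_ATTR = re.compile(r'(rel=")([^"]*)(")')


def underscore_rel(html):
    """Replace every whitespace character inside rel="..." with an underscore."""
    def fix(match):
        value = re.sub(r'\s', '_', match.group(2))
        return match.group(1) + value + match.group(3)
    return REL_ATTR.sub(fix, html)


print(underscore_rel('<a href="#" rel="this is a test">'))
```

Only the text captured between the quotes of rel is changed, so spaces elsewhere in the tag are left alone. As with the find pattern above, swap rel for another attribute name to clean other attributes.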
I don't think you can do this properly. Though I wonder why you need to do it in one go?
I can think of a really poor way of doing it, and although I don't recommend it, here goes:
You could sort of do it with the regex below. However, you would have to repeat the capture group (and the matching output followed by a _) once for every potential space in the rel. I bet that requirement rules this solution out.
Search:
{\<a *href\=\"[^\"]*" *rel\=\"}{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*
Replace:
\1\2_\3_\4_\5_\6_\7_\8_
This approach has two downsides: first, there may be a limit on the number of captures you can have in Textmate; second, you'll end up with a large number of _'s at the end of each line.
With your current test, with the regex above, you would end up with:
<a href="#" rel="this_is_a_test">____
PS: This regex uses the syntax of the Visual Studio search/replace box. You'll probably need to change some characters to make it work in TextMate.
{} => capturing group
() => grouping
[^A] => anything but A
( |\")* => space or "
\1 => is the first capture
Suppose you already received the value of rel:
var value = document.getElementById(id).getAttribute( "rel");
var rel = (new String( value)).replace( /\s/g,"_");
document.getElementById(id).setAttribute( "rel", rel);
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
I have to get on-board the "you're using the wrong tool for the job" train here. You have Textmate, so that means OSX, which means you have sed, awk, ruby and perl that can all do this much much better and easier.
Learning how to use one of these tools to do text manipulation will give you uncountable benefits in the future. Here is a URL that will ease you into sed: http://www.grymoire.com/Unix/Sed.html
If you're using TextMate, then you're on a Mac, and therefore have Python.
Try this:
#!/usr/bin/env python3
import re

# Match only the rel attribute (not the whole line), so that spaces
# elsewhere in the tag are left untouched.
p_rel = re.compile(r'rel="[^"]+"')

with open('test.html') as infile:
    for line in infile:
        for match in p_rel.findall(line):
            new_rel = match.replace(' ', '_')
            line = line.replace(match, new_rel)
        print(line, end='')
Sample output:
$ cat test.html
testing, testing, 1, 2, 3
<a href="#" rel="this is a test">
<unrelated line>
Stuff
<a href="#" rel="this is not a test">
<a href="#" rel="this is not a test" rel="this is invalid syntax (two rels)">
aoseuaoeua
$ ./test.py
testing, testing, 1, 2, 3
<a href="#" rel="this_is_a_test">
<unrelated line>
Stuff
<a href="#" rel="this_is_not_a_test">
<a href="#" rel="this_is_not_a_test" rel="this_is_invalid_syntax_(two_rels)">
aoseuaoeua