search and replace in a sample file - html

Hi, I have the following file:
<strong>Ramandand Sagar Krishna part 34</strong> Vasudev comes back
and girl disappears from Kansa's hand and the first temple she instructs Devs to make at Vindhyachal <a href="http://www.dailymotion.com/embed/video/x3p3gu?
width=320&theme=none&wmode=transparent">http://www.dailymotion.com/embed/video/x3p3gu?width=320&theme=none&wmode=transparent</a> <a
href="http://www.dailymotion.com/video/x3p3gu_krishna-part-34_shortfilms"
target="_blank">Krishna Part 34</a> <strong>Ramandand Sagar Krishna part 35</strong> Celebrations at Yashoda's house and Vasudev Devki freed from jail <a href="http://www.dailymotion.com/embed/video/x3p3sg?width=320&theme=none&wmode=transparent">
http://www.dailymotion.com/embed/video/x3p3sg?width=320&theme=none&wmode=transparent</a> Krishna Part 35 Krishna 143</em></div>
In the file above, I want to replace any HTML of the following kind:
http://www.dailymotion.com/embed/video/x5ftx3?width=320
The rule is that any HTML tag containing wmode=transparent or width=320 should be replaced with a blank space. Is there an easy way to do so? There are many other HTML tags that do not have wmode=transparent in their lines.
The file posted above is very big, approximately 30K lines of HTML, so I have posted only the relevant lines.
I am on an Ubuntu system.

As Sorpigal has pointed out, there is no simple answer to this. If you're willing to destroy your line endings, you could try my ugly concoction. It might help you:
awk '{ for (i=1; i<=NF; i++) if ($i !~ /wmode=transparent|width=320/) printf "%s ", $i} END {print ""}' file.txt | sed -e "s%<a <a%<a%g"
Output:
<strong>Ramandand Sagar Krishna part 34</strong> Vasudev comes back and girl disappears from Kansa's hand and the first temple she instructs Devs to make at Vindhyachal Krishna Part 34 <strong>Ramandand Sagar Krishna part 35</strong> Celebrations at Yashoda's house and Vasudev Devki freed from jail Krishna Part 35 Krishna 143</em></div>
I'm sure this one-liner could be improved in some way. If you do find this useful, you may then want to split the output on a boundary to tidy things up. Sed can be good for this.
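If a scripting language is an option, here is a minimal Python sketch of the same idea. It assumes that "replace the tag" means dropping the whole <a ...>...</a> element whose opening tag mentions wmode=transparent or width=320, and putting a single space in its place. The names EMBED_LINK and strip_embed_links are made up for this example, and a regex like this is only a rough tool for HTML, so check the result on a copy of the file first.

```python
import re

# Hypothetical helper names; not from the original thread.
# Matches a whole <a ...>...</a> element whose opening tag contains either marker.
# [^>]* also matches newlines, and re.DOTALL lets the link text span lines.
EMBED_LINK = re.compile(
    r'<a\b[^>]*(?:wmode=transparent|width=320)[^>]*>.*?</a>',
    re.DOTALL | re.IGNORECASE,
)


def strip_embed_links(html):
    """Replace every matching <a> element with a single space."""
    return EMBED_LINK.sub(" ", html)


# Excerpt of the sample from the question, including its line breaks.
sample = (
    'temple <a href="http://www.dailymotion.com/embed/video/x3p3gu?\n'
    'width=320&theme=none&wmode=transparent">'
    'http://www.dailymotion.com/embed/video/x3p3gu?'
    'width=320&theme=none&wmode=transparent</a> <a\n'
    'href="http://www.dailymotion.com/video/x3p3gu_krishna-part-34_shortfilms"\n'
    'target="_blank">Krishna Part 34</a>'
)

cleaned = strip_embed_links(sample)
print(cleaned)
```

The second link survives because its opening tag contains neither marker. For the real file you would read it into a string, call strip_embed_links, and write the result to a new file.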

Here is a link where you can find an answer to your question.
In your case, you have to create a script file for sed like this:
s/wmode=transparent//g
s/width=320//g
and then run something like:
sed -f replace_file in.txt > out.txt
I hope it's helpful for you.
Have a nice day.

Related

Bash: Content between two complex Patterns - html

I have tried multiple times to get digits between two html patterns.
Neither sed nor awk worked for me, since the examples on the internet were too simple to fit my task.
Here is the code I want to filter:
....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....
So I need a command that outputs I WANT THIS TEXT, i.e. whatever sits between ...review-text"> and </span>.
Do you have a clue? Thanks for the effort and greetings from Germany.
Here is the plain code
Try:
tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f 1
It should work if there are no tags inside "I WANT THIS TEXT".
I can't see the problem here, supposing the text you want to extract contains neither < nor >.
For instance, with a POSIX regular expression:
$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/gp" "$HTML_FILE"
This prints the text between the HTML tags.
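If the text may contain nested tags, a parser copes better than a regex. The following is a rough sketch using Python's standard html.parser module; the class name ReviewTextParser and the sample string are invented for illustration, and it assumes the text always sits in a span whose class list includes review-text.

```python
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collects the text of every <span> whose class list contains 'review-text'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching <span>
        self.reviews = []   # one string per matching <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "review-text" in classes:
            if not self.depth:
                self.reviews.append("")   # start a new review
            self.depth += 1               # also track nested <span> elements

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data


sample = '<div><span class="a-size-base review-text">I WANT THIS TEXT</span></div>'
review_parser = ReviewTextParser()
review_parser.feed(sample)
review_parser.close()
print(review_parser.reviews)
```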

Get content between HTML Tags | Special Character Meaning? [duplicate]

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Get content between a pair of HTML tags using Bash
(7 answers)
How to use sed/grep to extract text between two words?
(14 answers)
Closed 5 years ago.
Before someone marks this as duplicate, I want to remark that I have seen posts like this one:
HTML Tags
Sadly, the solution there doesn't work for me.
After a very helpful comment, I realize now that my initial problem was not getting the text between HTML tags, but using grep with non-ASCII characters. This question therefore seems to be a duplicate of this one. Thanks for all the helpful comments, and sorry for the duplication, I honestly googled for an hour before posting here!
I think the character " ’ " is the problem, because grep -E -o '(<strong>).*' only matches up to this character. I am not sure why that is, and would appreciate any help or hints on this.
The problem is as follows: I have a file, that for example looks like this, and I want to extract the text between the strong tags:
> <p>Here is something</p> <ul> <li>
> <p><strong>Here is something else</strong> And I keep typing here
The strongs are always in the same line, which should make it easier, at least I think so.
My own thoughts led me to
grep -E -o '\<strong\>.*\<\/strong\>' test.txt
Until I realized \< only looks for the beginning of a word (I wasn't sure < had a special meaning so I wanted to escape it).
I then went on to try grep -E -o '(<strong>).*(<\/strong>)' which, amazingly, works for the test file I gave you above.
Now, the truth is, the original file has book titles between the strong Tags, and book titles tend to contain apostrophes, and I think they mess things up.
Let's look at another sample file:
> <p>Here is something</p> <ul> <li>
> <p><strong>That`s a stupid</strong> And I keep typing here <p>>
> <strong>Complications: A Surgeon’s Notes on an Imperfect Science</strong> ?
> blablabla <p><strong>Another test, with this kind of ' apostrophe</strong> bla bla
Now the grep from before, grep -E -o '(<strong>).*(<\/strong>)', only returns the first and third match:
> <strong>That`s a stupid</strong>
> <strong>Another test, with this kind of ' apostrophe</strong>
I don't understand why
"<strong>Complications: A Surgeon’s Notes on an Imperfect Science</strong>"
isn't matched.
I am pretty sure the character " ’ " is the problem, because grep -E -o '(<strong>).*' only matches "Complications: A Surgeon".
Any ideas on why the character " ’ " would cause problems? I noticed that when printing the file with cat file.txt the character is also displayed incorrectly.
Also, on a similar note: Right now grep is still returning the tags. How do I disable this? I think there's just an argument I can use there (that's why I included the parentheses in my code), but I can't seem to find it.
Thank you all, any help is appreciated!
I am also very sorry for the bad formatting, I think the HTML tags in the example files spilled over into the question...
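Since the trouble turned out to be non-ASCII handling, one way to sidestep the locale question is to decode the file explicitly before matching. This is only a sketch: it assumes the file is UTF-8 encoded (the question does not say), and the names STRONG and strong_texts are made up here. The capture group returns the text without the tags, and the non-greedy .*? keeps two strong elements on the same line from being merged into one match.

```python
import re

# Hypothetical names for illustration.
# The parentheses capture only the text between the tags.
STRONG = re.compile(r"<strong>(.*?)</strong>")


def strong_texts(raw, encoding="utf-8"):
    """Decode raw bytes (UTF-8 is an assumption) and return the text of each <strong>."""
    return STRONG.findall(raw.decode(encoding))


# The second sample file from the question, encoded as it would be on disk.
sample = (
    "<p><strong>That`s a stupid</strong> And I keep typing here <p>>\n"
    "<strong>Complications: A Surgeon\u2019s Notes on an Imperfect Science</strong> ?\n"
    "blablabla <p><strong>Another test, with this kind of ' apostrophe</strong> bla bla\n"
).encode("utf-8")

titles = strong_texts(sample)
for title in titles:
    print(title)
```

For a real file, open(path, "rb").read() would supply the bytes. If decoding raises UnicodeDecodeError, the file is probably in another encoding, which would also explain cat displaying the character incorrectly.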

grep to extract out regular expression href and rel from html

The HTML I'm dealing with looks a little like this:
<a class="title may-blank" data-event-action="title" href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" tabindex="1" data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" data-inbound-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/?utm_content=title&utm_medium=hot&utm_source=reddit&utm_name=frontpage" rel="">We can play singleplayer games OFF THE INTERNET? Are they seriously that out of touch to advertise this?</a>
There are multiple lines like that.
I only want the stuff that's between the quotes in href="http://xxxxxxxx" and rel="">yyyyyyyyyy, the rest is unnecessary.
I'd like them to be output like this, with a new line for every block above:
yyyyyyyyyy
Any idea how I would get around doing this?
So here is a 10-second solution. It may be a little brittle, but it should work, assuming the string is in a file called html.txt:
cat html.txt | sed 's/class.*href/href/' | sed 's/data-in.*rel=/rel=/'
Your html example leads me to the following pattern to get the required values:
<a class=\"(.*) href=\"/(.*)\" tabindex=(.*) rel=\"\">(.*)</a>
Replace the matches by using the following pattern:
$4
You can try it out at regexe; for me it works as expected.
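A less brittle alternative is to let a parser pick out the attribute and the link text. Below is a minimal sketch with Python's standard html.parser; LinkCollector is a made-up name, the sample is a shortened version of the line from the question, and it assumes the links are not nested inside each other.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) tuples
        self._href = None
        self._text = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            # Only the attribute named exactly "href" is read, not data-href-url.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_a = False


# Shortened version of the sample line from the question.
sample = (
    '<a class="title may-blank" data-event-action="title" '
    'href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" '
    'tabindex="1" '
    'data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" '
    'rel="">We can play singleplayer games OFF THE INTERNET?</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()
for href, text in collector.links:
    print(href)
    print(text)
```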

How to copy text between 2 html tags?

I want to copy all the text in a website between tags:
<p> and </p>
using bash.
Do you have an idea how to do it?
As the comment above states: don't even try. There is no reliable way to parse HTML with Bash internals.
But when you're using a shell you may as well use third-party command line tools such as pup which are built for HTML parsing on the command line.
Yes, an HTML parser is a better choice. But if you are just trying to grab the text in between the first set of P tags quickly, you can use Perl:
perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
For example:
echo "
<p>A test
here
today</p>
<p>whatever</p>
" | perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
This will output:
A test
here
today
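If you want every paragraph rather than just the first, a parser-based sketch along the lines suggested above could look like this. It uses Python's standard html.parser instead of pup; ParagraphCollector is a hypothetical name, and fetching the page (for example with curl) is left out.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None   # list of text chunks while inside <p>, else None

    def _flush(self):
        if self._current is not None:
            self.paragraphs.append("".join(self._current))
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends where the next one starts
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


# Same input as the Perl example above.
page = "\n<p>A test\nhere\ntoday</p>\n<p>whatever</p>\n"
p_parser = ParagraphCollector()
p_parser.feed(page)
p_parser.close()
for paragraph in p_parser.paragraphs:
    print(paragraph)
```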

Regular expression to add a word between html-tags (newbie)

I can't seem to create a regular expression that would work in this situation:
I have hundreds of lines that look like this:
<a title="Match" href="http://mywebsite.com/category/Match"></a>
I would need to have the title word inserted between the html tags, like so:
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
Here's my feeble attempt at it (using Notepad++):
Find:
title="([A-Za-z][A-Za-z0-9]*?)"([A-Za-z][A-Za-z0-9]*?)><
Replace:
title="\1"\2>\1<
As you can see, I really suck at regular expressions :D
Any help would be appreciated!
EDIT:
I should clarify that this is a one-time operation carried out in Notepad++ with the find and replace panel.
I should also clarify that the word "Match" is going to be different on each line.
This works in Notepad++ 6.3.2
Find what :
(title\=")([^"]+)("[^>]+>)(<)
Replace with :
\1\2\3\2\4
Use Capture Groups and Back-References
You can capture parts of your match using capture groups, and then replace them with back-references. The specific syntax may vary by language and implementation. Here are two examples.
Ruby Example
str = %q{<a title="Match" href="http://mywebsite.com/category/Match"></a>}
# Use '\1'-style back-references: "#{$1}" would be interpolated before the match runs.
str.sub(/(Match)(">)</, '\1\2\1<')
# => "<a title=\"Match\" href=\"http://mywebsite.com/category/Match\">Match</a>"
GNU sed Example
$ echo '<a title="Match" href="http://mywebsite.com/category/Match"></a>' |
sed -r 's/(Match)(">)</\1\2\1</'
<a title="Match" href="http://mywebsite.com/category/Match">Match</a>
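Since the word differs on every line, the title value itself has to be captured rather than the literal "Match". Here is a small Python sketch that reuses the Notepad++ pattern from the first answer; the function name fill_link_text is invented for the example.

```python
import re

# Same pattern as the Notepad++ answer above (without the unneeded backslash before "=").
# Group 2 is the title value, group 3 runs to the end of the opening tag.
TITLE_LINK = re.compile(r'(title=")([^"]+)("[^>]+>)(<)')


def fill_link_text(line):
    """Copy the title attribute's value into the empty link text."""
    return TITLE_LINK.sub(r"\1\2\3\2\4", line)


line = '<a title="Match" href="http://mywebsite.com/category/Match"></a>'
filled = fill_link_text(line)
print(filled)
```

Note that the pattern only fires when the opening tag is immediately followed by another tag, so links that already have text are left alone.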