Bash: how to delete the text between two patterns, not inclusive? - html

I am trying to delete the text between the <pre></pre> HTML tags using:
sed -i '/<pre>/,/<\/pre>/d' file.html
But this deletes the <pre></pre> tags too. There is only one pre tag pair in the file.
How can I avoid deleting the pre tags?
Thanks.

This might work for you (GNU sed):
sed -n '/<pre>/{p;:a;n;/<\/pre>/!ba};p' file
Turn off implicit printing by using the -n option.
If a line contains <pre>, print it and fetch the next line.
If that line does not contain </pre> loop back and repeat.
Otherwise print all other lines.
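To see the loop in action, here is a minimal sketch that runs the same command on a small sample file. The file content is made up for illustration, and it assumes <pre> and </pre> sit on their own lines, as the range-based command in the question also does.

```shell
# Build a small sample file (hypothetical content, for illustration only).
tmp=$(mktemp)
printf '%s\n' 'before' '<pre>' 'line 1' 'line 2' '</pre>' 'after' > "$tmp"

# Print the <pre> line, silently skip lines until </pre>, then print as usual (GNU sed).
result=$(sed -n '/<pre>/{p;:a;n;/<\/pre>/!ba};p' "$tmp")
printf '%s\n' "$result"
# before
# <pre>
# </pre>
# after

rm -f "$tmp"
```

Once the output looks right, -i can be added to edit the file in place.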

Related

Bash: Content between two complex Patterns - html

I have tried multiple times to get digits between two html patterns.
Neither sed nor awk worked for me, since the examples in the internet were too easy to fit my task.
Here is the code I want to filter:
....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....
So I would need a command that output: I WANT THIS TEXT between ...review-text"> and </span>
Do you have a clue? Thanks for the effort and greetings from Germany.
Here is the plain code
Try:
tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f 1
It should work as long as there are no tags inside "I WANT THIS TEXT".
I can't see the problem here, supposing the text you want to extract contains neither < nor >.
For instance with POSIX REGEXP:
$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/p" "$HTML_FILE"
prints the text between HTML TAGS
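Both approaches can be tried on the snippet from the question. This is a small sketch that assumes the whole span is on one line and contains no nested tags; the variable names are only for illustration.

```shell
# Sample input taken from the question, wrapped in a <div> for the grep pattern.
html='<div><span class="a-size-base review-text">I WANT THIS TEXT</span></div>'

# sed: capture everything after review-text"> up to the next "<".
result_sed=$(printf '%s\n' "$html" | sed -n 's/.*review-text">\([^<]*\)<.*/\1/p')

# tr + grep + cut: join lines, isolate the span, then cut out the text.
result_grep=$(printf '%s\n' "$html" | tr '\n' ' ' \
  | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f1)

printf '%s\n' "$result_sed" "$result_grep"
# I WANT THIS TEXT
# I WANT THIS TEXT
```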

Format XML using command line

I have an HTML text file and I want to format it so that each paragraph is always on a single line, e.g.
<p>paragraph info here</p>
instead of
<p>paragraph
info here </p>
Is there a tool that enables me to do this?
You can use sed:
sed ':a;N;$!ba;s/\n/ /g' test.html | sed 's/<\/p> /<\/p>\n/g'
The first sed removes all line breaks, and the second adds one back after each closing paragraph tag.
It is not elegant, but it works.
While the requirement "paragraphs are always on the same line" would be met by simply joining the whole file into a single line, this solution is less radical:
perl -pe 'if (/<p>/../<\/p>/) { s/\n/ / unless /<\/p>/ }' test.html

Remove empty HTML tags from a file using sed

I have searched a lot but could not find a solution. I know how to remove all tags using sed, but I need to remove only those HTML tags that are empty or contain just tabs or spaces, removing those tags explicitly. For example:
<p></p> or <p> </p>
I used the following command to remove all the HTML tags. It works properly, but I don't want to remove all tags.
sed -e 's/<[^>]*>//g' myfile.html
The same command is used here. Kindly help me out.
You could use the below sed command to remove only the empty tags.
sed 's/<[^\/][^<>]*>[[:space:]]*<\/[^<>]*>//g' file
Through Perl,
perl -pe 's/<([^<>]*)>\s*<\/\1>//g' file
Or with GNU sed and a back-reference, so that the closing tag must match the opening one:
sed -r 's/<([a-zA-Z0-9]+)>[[:space:]]*<\/\1>//g' file
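Here is a small sketch of the back-reference approach on a made-up line, written with the POSIX class [[:space:]] so that both spaces and tabs count as empty content. It assumes tags without attributes; an element that only becomes empty after its children are removed would need a second pass.

```shell
# Hypothetical input: an empty <p>, a <p> holding a space and a tab, and two non-empty ones.
html=$(printf '<div><p></p><p> \t</p><p>s</p><p>text</p></div>')

# Remove a tag pair only when the names match and nothing but whitespace sits between them.
result=$(printf '%s\n' "$html" | sed -E 's/<([a-zA-Z0-9]+)>[[:space:]]*<\/\1>//g')
printf '%s\n' "$result"
# <div><p>s</p><p>text</p></div>
```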

File composition using command line tools (Linux / Mac)

I have a file containing some text and some kind of placeHolder, and another file with some other text
Eg:
myText.txt:
some text strings plus a {{myPlaceholderText}} and some more text
myPlaceholderText.txt:
more text here
I want to be able to create a 3rd file containing the string:
"some text strings plus a more text here and some more text"
Is it possible to do that using command line tools?
I think sed is the easiest way to do it:
$ sed "s/{{myPlaceholderText}}/$(<myPlaceholderText.txt)/g" myText.txt
some text strings plus a more text here and some more text
Yes. And bash is the safest common tool besides interpreted languages.
#!/bin/bash
R=$(<myPlaceholderText.txt)
while IFS= read -r LINE; do
printf '%s\n' "${LINE//'{{myPlaceholderText}}'/$R}"
done < myText.txt > another_file.txt
Output to another_file.txt:
some text strings plus a more text here and some more text
Another through awk:
awk 'BEGIN{getline r < ARGV[1];ARGV[1]=""}{gsub(/\{\{myPlaceholderText\}\}/,r)}1' myPlaceholderText.txt myText.txt > another_file.txt
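One caveat with the sed one-liner is that the file content is pasted straight into the sed script, so a replacement containing /, & or \ would break or change the substitution. The following is a minimal sketch of one way around that, assuming a single-line replacement; the sample text and the variable names (replacement, escaped) are made up for illustration.

```shell
# Recreate the two files in a scratch directory, with awkward characters in the replacement.
tmpdir=$(mktemp -d)
printf '%s\n' 'some text strings plus a {{myPlaceholderText}} and some more text' > "$tmpdir/myText.txt"
printf '%s\n' 'more/text & here' > "$tmpdir/myPlaceholderText.txt"

# Backslash-escape the characters that are special on the right-hand side of s///.
replacement=$(cat "$tmpdir/myPlaceholderText.txt")
escaped=$(printf '%s' "$replacement" | sed 's|[/&\\]|\\&|g')

result=$(sed "s/{{myPlaceholderText}}/$escaped/g" "$tmpdir/myText.txt")
printf '%s\n' "$result"
# some text strings plus a more/text & here and some more text

rm -rf "$tmpdir"
```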

Find specific tags in a HTML file

I have some html files and want to extract the contents between some tags:
The title of the page
some tagged content here.
<p>A paragraph comes here</p>
<p>A paragraph comes here</p><span class="more-about">Some text here</span><p class="en-cpy">Copyright © 2012 </p>
I just want these tags: head, p
But as can be seen in the second paragraph, the last tag also starts with p but is not one of my desired tags, and I don't want its content.
I used the following script to extract my desired text, but I can't filter out tags such as the last one in my example. How can I extract just the <p> tags?
grep "<p>" "$File" | sed -e 's/^[ \t]*//'
I should add that the last tag (which I don't want in the output) comes right after one of my desired tags, as in my example, so grep returns the whole content of that line. (This is my problem.)
Don't. Trying to use regex to parse HTML is going to be painful. Use something like Ruby and Nokogiri, or a similar language + library that you are familiar with.
To extract the text between <p> and </p>, try this:
perl -ne 'BEGIN{$/="</p>";$\="\n"}s/.*(<p>)/$1/&&print' < input-file > output-file
or
perl -n0l012e 'print for m|<p>.*?</p>|gs'
xmllint --html --xpath "//*[name()='head' or name()='p']" "$file"
If you're dealing with broken HTML you might need a different parser. Here's a "one-liner" that does basically the same thing using lxml. Just pass the script your file path or URL as the first argument.
#!/usr/bin/env python3
from lxml import etree
import sys
print('\n'.join(etree.tostring(x, encoding="utf-8", with_tail=False).decode("utf-8") for x in (lambda i: etree.parse(i, etree.HTMLParser(remove_blank_text=1, remove_comments=1)).xpath("//*[name()='p' or name()='head']"))(sys.argv[1])))