I am trying to insert some HTML code for my Telldus script page. I want to create an image link. I have tried many variations of this code, but nothing works.
AWK CODE to generate off.php
# awk '{print a$1b$1d}' a='' d='<img src='OFF.png'><br> ' off1.php > off.php
Result for off.php
# more off.php
Lampor<img src=OFF.png><br>
Result I want for off.php
<img src=OFF.png>Lampor<br>
So, I want the image to be a link instead of the word "Lampor".
this should do...
awk '{print "<img src=OFF.png>"$1"<br>"}' file
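If the goal is for the image itself to be clickable, the img tag can be wrapped in an anchor. This is only a sketch: it assumes the first field of each input line is the device name, and the link target `off.php?device=` is a made-up placeholder (the original post does not say where the link should point). The function name `make_links` is also just for illustration.

```shell
# Wrap the image in a link. The href pattern is a hypothetical placeholder,
# not something taken from the original question.
make_links() {
    awk -v img='OFF.png' -v target='off.php?device=' \
        '{ printf "<a href=\"%s%s\"><img src=\"%s\" alt=\"%s\"></a><br>\n", target, $1, img, $1 }' "$@"
}

printf 'Lampor\n' | make_links
```

Passing the strings in with -v keeps the HTML double quotes out of the shell quoting, which is usually where attempts like the one in the question go wrong.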
Related
So I have a text file containing part of some HTML code:
>>nano wynik.txt
with text:
1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"
and I want to print only: 24.6933
I tried using the cut command, but it does not work. Can anyone suggest a solution?
With GNU grep and Perl Compatible Regular Expressions:
grep -Po '(?<=data-usd=").*?(?=")' file
Output:
24.6933
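The -P option is specific to GNU grep, so on systems without it a plain sed substitution may be an alternative. A minimal sketch, assuming each line holds at most one data-usd attribute (the function name `extract_usd` is just for illustration):

```shell
# Print the data-usd value from each line that has one.
# With several data-usd attributes on one line, only the last is printed.
extract_usd() {
    sed -n 's/.*data-usd="\([^"]*\)".*/\1/p' "$@"
}

printf '%s\n' '1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"' | extract_usd
```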
I have tried multiple times to get digits between two html patterns.
Neither sed nor awk worked for me, since the examples on the internet were too simple to fit my task.
Here is the code I want to filter:
....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....
So I would need a command that outputs: I WANT THIS TEXT, i.e. the text between ...review-text"> and </span>
Do you have a clue? Thanks for the effort and greetings from Germany.
Here is the plain code
Try:
tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f1
It should work as long as there are no tags inside "I WANT THIS TEXT". Note that tr reads only standard input, so the file has to be redirected into it.
I can't see the problem here, supposing the text you want to extract contains neither < nor >.
For instance, with a POSIX basic regular expression:
$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/gp" "$HTML_FILE"
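One caveat with the sed approach is that the leading .* is greedy, so a line with several reviews yields only the last one. A rough sketch that prints every match instead, assuming the text contains no < and that grep supports -o (GNU and BSD grep do, but it is not in POSIX); the function name `review_texts` is just for illustration:

```shell
# Read HTML on stdin, join the lines, then print each review text on its own line.
review_texts() {
    tr '\n' ' ' |
        grep -o 'review-text">[^<]*</span>' |
        sed 's/^review-text">//; s/<\/span>$//'
}

printf '%s\n' '<span class="a-size-base review-text">I WANT THIS TEXT</span></div><span class="a-size-base review-text">AND THIS</span>' | review_texts
```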
prints the text between HTML TAGS
I want to copy all the text in a website between tags:
<p> and </p>
using bash.
Do you have an idea how to do it?
As the comment above states: don't even try. There is no reliable way to parse HTML with Bash internals.
But when you're using a shell you may as well use third-party command line tools such as pup which are built for HTML parsing on the command line.
Yes, an HTML parser is a better choice. But if you are just trying to grab the text in between the first set of P tags quickly, you can use Perl:
perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
For example:
echo "
<p>A test
here
today</p>
<p>whatever</p>
" | perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
This will output:
A test
here
today
So far I am using curl along with w3m and sed to extract portions of a webpage, like <body>....content....</body>. I want to ignore all the other tags (e.g. <a></a>, <div></div>). However, the way I am doing it right now is really slow.
curl -L "http://www.somewebpage.com" | sed -n -e '\:<article class=:,\:<div id="below">: p' > file.html
w3m -dump file.html > file2.txt
These two lines are really slow because curl first has to save the whole webpage into a file and parse it, then w3m parses it and saves it into another file. I just want to simplify this code. I was wondering if there is a way with lynx or html2text to extract webpage content between specified tags. For example, if I wanted to extract something from a webpage (www.badexample.com <---not actually the link) with this content:
<title>blah......blah...</title>
<body>
Some text I need to extract
</body>
more stuffs
Is there a program to which I can pass a parameter saying which content to extract? So I would specify someprogram <body></body> www.badexample.com and it would extract only the content inside those tags?
You can use a Perl one-liner for this (single quotes stop the shell from expanding $ARGV, and /s lets . match newlines):
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /<$ARGV[1]>(.*?)<\/$ARGV[1]>/s;' http://www.example.com/ title
Instead of the HTML tag, you can pass the whole regex as well:
perl -MLWP::Simple -e 'print get($ARGV[0]) =~ /$ARGV[1]/s;' "http://www.example.com/" "<body>(.*?)</body>"
Must it be in bash? What about PHP and DOMDocument()?
$dom = new DOMDocument();
$new_dom = new DOMDocument();
$url_value = 'http://www.google.com';
$html = file_get_contents($url_value);
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child){
$new_dom->appendChild($new_dom->importNode($child, true));
}
echo $new_dom->saveHTML();
I would like to match the contents of a paragraph tag using a Perl regex one-liner. The paragraph is something like this:
<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>
so I have been using something like this:
perl -nle 'm/<p>($.)<\/p>/ig; print $1' file.html
Any ideas appreciated
thanks
Mandatory link to what happens when you try to parse HTML with regular expressions.
David Dorward's comment, to use HTML::TreeBuilder, is a good one.
Another good way to do this, is by using HTML::DOM:
perl -MHTML::DOM -e 'my $dom = HTML::DOM->new(); $dom->parse_file("file.html"); my @p = $dom->getElementsByTagName("p"); print $p[0]->innerText();'
In the matching part, $ means 'end of the string', so ($.) cannot work. You also need to allow attributes inside the opening p tag and match the paragraph contents non-greedily:
perl -nle 'print $1 if m/<p\b[^>]*>(.+?)<\/p>/i' test.html