Replacing HTML tag content using sed - html

I'm trying to replace the content of some HTML tags in an HTML page using sed in a bash script. For some reason I'm not getting the proper result: nothing is replaced. It has to be something very simple/stupid I'm overlooking. Anyone care to help me out?
HTML to search/replace in:
Unlocked <span id="unlockedCount"></span>/<span id="totalCount"></span> achievements for <span id="totalPoints"></span> points.
sed command used:
cat index.html | sed -i -e "s/\<span id\=\"unlockedCount\"\>([0-9]\{0,\})\<\/span\>/${unlockedCount}/g" index.html
The point of this is to parse the HTML page and update the figures according to some external data. For a first run, the contents of the tags will be empty, after that they will be filled.
EDIT:
I ended up using a combination of the answers which resulted in the following code:
sed -i -e 's|<span id="unlockedCount">\([0-9]\{0,\}\)</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g' index.html
Many thanks to @Sorpigal, @tripleee, @classic for the help!

Try this:
sed -i -e "s/\(<span id=\"unlockedCount\">\)\(<\/span>\)/\1${unlockedCount}\2/g" index.html

What you say you want to do is not what you're telling sed to do.
You want to insert a number into a tag, or replace the number if one is already present. What you're telling sed to do is to replace the span tag and its contents (empty or a number) with the value of a shell variable.
You're also employing a lot of complex, annoying and error-prone escape sequences which are just not necessary.
Here's what you want:
sed -r -i -e 's|<span id="unlockedCount">([0-9]{0,})</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g' index.html
Note the differences:
Added -r to turn on extended expressions without which your capture pattern would not work.
Used | instead of / as the delimiter for the substitution so that escaping / would not be necessary.
Single-quoted the sed expression so that escaping things inside it from the shell would not be necessary.
Included the matched span tag in the replacement section so that it would not get deleted.
In order to expand the unlockedCount variable, closed the single-quoted expression, then later re-opened it.
Omitted cat | which was useless here.
I also put double quotes around the shell variable expansion. This is good practice, although it is not strictly necessary if the value contains no spaces.
It was not, strictly speaking, necessary for me to add -r. Plain old sed will work if you say \([0-9]\{0,\}\), but the idea here was to simplify.
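A minimal sketch of that substitution run twice, first against the empty span and then against the filled one. It assumes a shortened copy of the question's HTML held in a hypothetical variable `html` instead of index.html, and it writes the digits as `[0-9]*`, which matches the same text as `[0-9]{0,}` and needs neither -r nor a capture group:

```shell
# Shortened sample line from the question (assumption: held in a variable, not a file).
html='Unlocked <span id="unlockedCount"></span>/<span id="totalCount"></span> achievements'

# First run: the span is still empty.
unlockedCount=42
first=$(printf '%s\n' "$html" |
    sed 's|<span id="unlockedCount">[0-9]*</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g')

# Later run: the span already holds a number, which is replaced.
unlockedCount=57
second=$(printf '%s\n' "$first" |
    sed 's|<span id="unlockedCount">[0-9]*</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g')

printf '%s\n' "$first" "$second"
```

Because the tag is written out again in the replacement, the same command works on every run.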

sed -i -e 's%<span id="unlockedCount">([0-9]*)</span>%'"${unlockedCount}%g" index.html
I removed the Useless Use of Cat, took out a bunch of unnecessary backslashes, added single quotes around the regex to protect it from shell expansion, and fixed the repetition operator. You might still need to backslash the grouping parentheses; my sed, at least, wants \(...\).
Note the use of single and double quotes next to each other. Single quotes protect against shell expansion, so you can't use them around "${unlockedCount}" where you do want the shell to interpolate the variable.
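To see what sed actually receives when single-quoted and double-quoted pieces sit next to each other, the expression can be built in a variable and printed. This is only an illustration; the variable name `expr` and the value 7 are assumptions:

```shell
unlockedCount=7

# Three adjacent pieces become one word:
#   'single-quoted text' + "expanded variable" + 'single-quoted text'
expr='s|<span id="unlockedCount">[0-9]*</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g'

printf '%s\n' "$expr"
```

The shell joins the pieces before sed ever runs, so sed sees a single argument with the number already filled in.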

Related

Remove specific tag with its contents using sed

I would like to remove following tag from HTML including its constantly varying contents:
<span class="the_class_name">li4tuq734g23r74r7Whatever</span>
The following Bash script
.... | sed -e :a -re 's/<span class="the_class_name"/>.*</span>//g' > "$NewFile"
ends with error
sed: -e expression #2, char XX: unknown option to `s'
I tried to escape quotes, slashes and "less than" symbols in various combinations and still get this error.
I suggest using a sed separator other than / when / appears in the text you want to match. Also, prefer -E over -r for extended regex to be POSIX compatible. Note too that the first span in your regex contains a stray / that doesn't belong there.
Also, .* will make it overly greedy and eat up any </span> that follows the first </span> on the line. It's better to match on [^<]*. That is, any character that is not <.
sed -E 's,<span class="the_class_name">[^<]*</span>,,g'
A better option is of course to use a HTML parser for this.
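The difference between .* and [^<]* shows up as soon as a line holds more than one span. A small sketch, assuming a made-up sample line in which the second span should survive:

```shell
# Hypothetical input: the second span must be kept.
line='<p>a <span class="the_class_name">x1</span> b <span class="other">keep</span> c</p>'

# Greedy: .* runs to the LAST </span> on the line and removes too much.
greedy=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">.*</span>,,g')

# Bounded: [^<]* stops at the first <, so only the targeted span goes.
bounded=$(printf '%s\n' "$line" | sed -E 's,<span class="the_class_name">[^<]*</span>,,g')

printf '%s\n' "$greedy" "$bounded"
```

The bounded form only works while the span contains no nested tags, which is one more reason to reach for an HTML parser for anything less regular.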

Find and replace text in JSON with sed [duplicate]

I am trying to change the values in a text file using sed in a Bash script with the line,
sed 's/draw($prev_number;n_)/draw($number;n_)/g' file.txt > tmp
This will be in a for loop. Why is it not working?
Variables inside ' don't get substituted in Bash. To get string substitution (or interpolation, if you're familiar with Perl) you would need to change it to use double quotes " instead of the single quotes:
# Enclose the entire expression in double quotes
$ sed "s/draw($prev_number;n_)/draw($number;n_)/g" file.txt > tmp
# Or, concatenate strings with only variables inside double quotes
# This would restrict expansion to the relevant portion
# and prevent accidental expansion for !, backticks, etc.
$ sed 's/draw('"$prev_number"';n_)/draw('"$number"';n_)/g' file.txt > tmp
# A variable cannot contain arbitrary characters
# See link in the further reading section for details
$ a='foo
bar'
$ echo 'baz' | sed 's/baz/'"$a"'/g'
sed: -e expression #1, char 9: unterminated `s' command
Further Reading:
Difference between single and double quotes in Bash
Is it possible to escape regex metacharacters reliably with sed
Using different delimiters for sed substitute command
Unless you need it in a different file you can use the -i flag to change the file in place
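If the replacement value may contain characters that are special to sed, it can be escaped before it is interpolated. The following is a sketch only: the helper name `escape_replacement` is hypothetical, it handles single-line values (a value with an embedded newline, as in the example above, would still break), and it escapes just the three characters that matter on the replacement side when / is the delimiter:

```shell
# Hypothetical helper: backslash-escape \, / and & for use in the
# replacement part of an s/.../.../ command.
escape_replacement() {
    printf '%s\n' "$1" | sed 's:[\\/&]:\\&:g'
}

number='a/b&c'
safe=$(escape_replacement "$number")

result=$(printf '%s\n' 'draw(1;n_)' | sed "s/draw(1;n_)/draw(${safe};n_)/g")
printf '%s\n' "$result"
```

The pattern side needs a different set of escapes (regex metacharacters), which this sketch does not attempt.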
Variables within single quotes are not expanded, but within double quotes they are. Use double quotes in this case.
sed "s/draw($prev_number;n_)/draw($number;n_)/g" file.txt > tmp
You could also make it work with eval, but don’t do that!!
This may help:
sed "s/draw($prev_number;n_)/draw($number;n_)/g"
You can use variables as shown below. Here I wanted to put the hostname, a value taken from the system, into the file: I look for the string look.me and replace that whole line with look.me=<system_name>.
sed -i "s/.*look.me.*/look.me=$(hostname)/" file.txt
You can also store the system value in another variable first and use that variable in the substitution.
host_var=$(hostname)
sed -i "s/.*look.me.*/look.me=$host_var/" file.txt
Input file:
look.me=demonic
Output of file (assuming my system name is prod-cfm-frontend-1-usa-central-1):
look.me=prod-cfm-frontend-1-usa-central-1
I needed to insert the GitHub tag from my release within GitHub Actions, so that on release the code is automatically packaged and pushed to Artifactory.
Here is how I did it. :)
- name: Invoke build
  run: |
    # Gets the tag number from the release
    TAGNUMBER=$(echo "$GITHUB_REF" | cut -d / -f 3)
    # Sets up a string to be used by sed
    FINDANDREPLACE='s/${GITHUBACTIONSTAG}/'"$TAGNUMBER"/
    # Updates the setup.cfg file with the version number
    sed -i "$FINDANDREPLACE" setup.cfg
    # Installs prerequisites and pushes
    pip install -r requirements-dev.txt
    invoke build
In retrospect I wish I had done this in Python with tests. However, it was fun to do some Bash.
Another variant, using printf:
SED_EXPR="$(printf -- 's/draw(%s;n_)/draw(%s;n_)/g' "$prev_number" "$number")"
sed "${SED_EXPR}" file.txt
or in one line:
sed "$(printf -- 's/draw(%s;n_)/draw(%s;n_)/g' "$prev_number" "$number")" file.txt
Building the expression with printf keeps the whole template inside one single-quoted string, so the shell quoting stays simple, which is why I like this variant. It does not escape characters that are special to sed, so values containing /, & or \ still need care.

How to remove all content before and after two different HTML tags from the Linux command line

I have a file, links.html, and want to remove everything before:
<div id="data_div"
and everything after AND including
<div style="\"overflow:auto\"">
and save to a new file; links1.html.
Is it possible to complete this in a single operation, either with BASH string manipulation or sed?
If so, how?
Yes, it is entirely possible. Below are two examples; substitute "pattern" with the markers you referenced above.
To remove everything after a pattern, including the line with the pattern itself, you can use the following command:
sed -i '/pattern/,$d' filename
To remove everything before a pattern, including the line with the pattern itself, you can use the following command:
sed -i '1,/pattern/d' filename
If you would rather not edit the file in place, remove the "-i" option and redirect the output to a separate file.
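Both cuts can also be made in one sed invocation by printing only the range between the two markers. The sketch below makes several assumptions: each marker sits on its own line, the end marker can be recognised by the text overflow:auto, and the sample links.html written to a temporary directory is invented for the demonstration:

```shell
workdir=$(mktemp -d)

# Invented sample input.
cat > "$workdir/links.html" <<'EOF'
<html>
<body>
<div id="data_div" class="x">
<a href="one">one</a>
<a href="two">two</a>
<div style="overflow:auto">
footer
</html>
EOF

# -n suppresses default output; within the start,end range every line
# except the end-marker line is printed.
sed -n '/<div id="data_div"/,/overflow:auto/{/overflow:auto/!p;}' \
    "$workdir/links.html" > "$workdir/links1.html"

result=$(cat "$workdir/links1.html")
printf '%s\n' "$result"
rm -r "$workdir"
```

The start-marker line is kept and the end-marker line is dropped, which matches "everything before" and "everything after and including" from the question.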

using sed in script to add html tag to text

I'm trying to use sed in a shell script to add html hyperlink tags to a url in a plain text file.
This is the content of my newtext.txt:
www.example.com
And here is the desired content of newtext.txt that I would like after running my script:
www.example.com
Here is the content of my current script, addhtml.sh:
#!/bin/bash
newtextv='cat newtext.txt'
sed -i.bak 's|\(www.*\)|\1|' newtext.txt
But unfortunately, after running the script, the content of newtext.txt becomes:
www.example.com
I believe my error somehow relates to how my variable is being quoted?
I eventually want this script to also be able to convert full urls (containing http:// )... I obviously need to improve my sed knowledge a good deal (it's taken me a few days to get this far), but I can't wrap my head around this one.
Thank you!
If you want to put the file's content into a variable:
newtextv=$(cat newtext.txt)
But really, you probably want something like this (but with a better regex, obviously):
sed 's|www\.[^ ]*|&|g' <newtext.txt >newtext.html
Sed replaces every & with the matched string.
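A minimal sketch of how & can be used to wrap a match in markup. The anchor tag and the http:// prefix in the href are assumptions about the intended output, not something taken from the script above:

```shell
# & stands for the whole matched text and may be used more than once
# in the replacement.
link=$(printf '%s\n' 'www.example.com' |
    sed 's|www\.[^ ]*|<a href="http://&">&</a>|g')

printf '%s\n' "$link"
```

Using | as the delimiter avoids having to escape the slashes in http:// and in the closing tag.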
Why mess around with a variable?
sed -i 's|\(www.*\)|\1|' newtext.txt
or
sed -i 's|www.*|&|' newtext.txt
If you happen to have the URL in a variable you can also do it without sed:
newtextv=www.example.com
echo "$newtextv"
returns
www.example.com
In Bash you can manipulate variables as a subset of variable substitution.
Here ${newtextv#www.} basically means take $newtextv and cut "www." from its beginning
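The prefix-stripping expansion on its own, as a short illustration (the variable `stripped` is introduced here only for the demonstration):

```shell
newtextv=www.example.com

# ${var#pattern} removes the shortest match of pattern from the start of
# the value; $newtextv itself is left unchanged.
stripped=${newtextv#www.}

printf '%s\n' "$stripped"
```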
Your trouble comes down to two little syntax errors:
'cat newtext.txt' in single quotes is just a string and will never execute; to run the command you need backquotes ` or $().
Single quotes ' prevent variables from expanding. To allow variable expansion, use double quotes ".
Here is what you want to do:
#!/bin/bash
newtextv=$(cat newtext.txt)
sed -i.bak "s|\(www.*\)|\1|" newtext.txt

How can I trim "/" using shell scripting?

I've been playing around with a little shell script to get some info out of a HTML page downloaded with lynx.
My problem is that I get this string: <span class="val3">MPPTN: 0.9384</span></td>
I can trim the first part of that using:
trimmed_info=`echo ${info/'<span class="val3">'/}`
And the string becomes: "MPPTN: 0.9384"
But how can I trim the last part? It seems like the "/" is messing up the echo command... I tried:
echo ${finalt/'</span></td>'/};
Not sure if using sed is ok -- one way to extract out the number could be something like ...
echo '<span class="val3">MPPTN: 0.9384</span></td>' | sed 's/^[^:]*..//' | sed 's/<.*$//'
The behavior of ${VARIABLE/PATTERN/REPLACEMENT} depends on what shell you're using, and for bash what version. Under ksh, or under recent enough (I think ≥ 4.0) versions of bash, ${finalt/'</span></td>'/} strips that substring as desired. Under older versions of bash, the quoting is rather quirky; you need to write ${finalt/<\/span><\/td>/} (which still works in newer versions).
Since you're stripping a suffix, you can use the ${VARIABLE%PATTERN} or ${VARIABLE%%PATTERN} construct instead. Here, you're removing everything after the first </, i.e. the longest suffix that matches the pattern </*. Similarly, you can strip the leading HTML tags with ${VARIABLE##PATTERN}.
trimmed=${finalt%%</*}; trimmed=${trimmed##*>}
Added benefit: unlike ${…/…/…}, which is specific to bash/ksh/zsh and works slightly differently in all three, ${…#…} and ${…%…} are fully portable. They don't do as much, but here they're sufficient.
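The two expansions applied step by step to the string from the question, as a sketch. The last step, which pulls out just the number, goes beyond the answer above, and the variable name `value` is an assumption:

```shell
finalt='<span class="val3">MPPTN: 0.9384</span></td>'

# %% removes the longest suffix matching </* (from the first "</" to the end).
trimmed=${finalt%%</*}

# ## removes the longest prefix matching *> (through the last ">").
trimmed=${trimmed##*>}

# Optional extra step: keep only what follows ": ".
value=${trimmed##*: }

printf '%s\n' "$trimmed" "$value"
```

No external command is started, so this is also cheaper than piping through sed or cut inside a loop.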
Side note: although it didn't cause any problem in this particular instance, you should always put double quotes around variable substitutions, e.g.
echo "${finalt/'</span></td>'/}"
Otherwise the shell will expand wildcards and spaces in the result. The simple rule is that if you don't have a good reason to leave the double quotes out, you put them.
The solution largely depends on what exactly you want to do. If all your strings are of the fixed-width form <span class="val3">XXXXX: X.XXXX</span></td>, then the simplest solution is
echo "$info" | cut -c 20-32
If they're of the form <span class="val3">variable length</span></td>, then the simplest solution is
echo "$info" | sed 's/<span class="val3">//' | sed 's/<\/span><\/td>//'
If it's more general, you can use regexes like in Sai's answer.
I'd recommend using the sed command for this kind of thing:
echo "$string" | sed "s/$regex/$replace/"