How can I trim "/" using shell scripting?

I've been playing around with a little shell script to get some info out of an HTML page downloaded with lynx.
My problem is that I get this string: <span class="val3">MPPTN: 0.9384</span></td>
I can trim the first part of that using:
trimmed_info=`echo ${info/'<span class="val3">'/}`
And the string becomes: "MPPTN: 0.9384"
But how can I trim the last part? It seems like the "/" is messing with the echo command... I tried:
echo ${finalt/'</span></td>'/};

Not sure if using sed is OK -- one way to extract the number could be something like this:
echo '<span class="val3">MPPTN: 0.9384</span></td>' | sed 's/^[^:]*..//' | sed 's/<.*$//'
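Run on the sample string, that pipeline prints just 0.9384: the first sed deletes everything up to and including the colon and the space after it, and the second deletes from the first remaining < to the end of the line.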

The behavior of ${VARIABLE/PATTERN/REPLACEMENT} depends on what shell you're using and, for bash, on which version. Under ksh, or under recent enough (I think ≥ 4.0) versions of bash, ${finalt/'</span></td>'/} strips that substring as desired. Under older versions of bash, the quoting is rather quirky; you need to write ${finalt/<\/span><\/td>/} (which still works in newer versions).
Since you're stripping a suffix, you can use the ${VARIABLE%PATTERN} or ${VARIABLE%%PATTERN} construct instead. Here, you're removing everything after the first </, i.e. the longest suffix that matches the pattern </*. Similarly, you can strip the leading HTML tags with ${VARIABLE##PATTERN}.
trimmed=${finalt%%</*}; trimmed=${trimmed##*>}
Added benefit: unlike ${…/…/…}, which is specific to bash/ksh/zsh and works slightly differently in all three, ${…#…} and ${…%…} are fully portable. They don't do as much, but here they're sufficient.
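For instance, a minimal sketch applied to the sample string from the question:
finalt='<span class="val3">MPPTN: 0.9384</span></td>'
trimmed=${finalt%%</*}    # strip the longest suffix matching </*  ->  <span class="val3">MPPTN: 0.9384
trimmed=${trimmed##*>}    # strip the longest prefix matching *>   ->  MPPTN: 0.9384
echo "$trimmed"           # prints: MPPTN: 0.9384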
Side note: although it didn't cause any problem in this particular instance, you should always put double quotes around variable substitutions, e.g.
echo "${finalt/'</span></td>'/}"
Otherwise the shell will expand wildcards and spaces in the result. The simple rule is that if you don't have a good reason to leave the double quotes out, you put them.
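A quick throwaway illustration of what can go wrong without them:
var='MPPTN: *'
echo $var      # unquoted: the shell may expand * to the names of files in the current directory
echo "$var"    # quoted: prints MPPTN: * verbatim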

The solution largely depends on what exactly you want to do. If all your strings are going to be of the form <span class="val3">XXXXX: X.XXXX</span></td>, then the simplest solution is
echo "$info" | cut -c 20-32
(the opening tag is 19 characters long, so the payload MPPTN: 0.9384 occupies columns 20 through 32).
If they're of the form <span class="val3">variable length</span></td>, then the simplest solution is
echo "$info" | sed 's/<span class="val3">//' | sed 's/<\/span><\/td>//'
If it's more general, you can use regexes like in Sai's answer.

I'd recommend using the sed command for this kind of thing:
echo "$string" | sed "s/$regex/$replace/"

Related

How can I change specific recurring text on a very large HTML file?

I have a very big HTML file (talking about 20MB) and I need to remove from the file a large amount of nodes of the form:
<tr><td>SPECIFIC-STRING</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
The file I need to work on is basically made of thousands of these strings, and I only need to remove those that have a specific first string, for instance, all those with the first string being "banana":
<tr><td>banana</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
I tried achieving this by opening the file in Geany and using the replace feature with this regex:
<tr><td>banana<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr><tr><td(.*)<\/td><\/tr>
but the console output was that it removed X amount of occurrences, when I know there are way more occurrences than that in the file.
Firefox, Chrome and Brackets fail even to view the HTML code of the file due to its size. I can't think of another way to do this due to my inexperience with HTML.
You could use a stream editor, which, as the name suggests, streams the file content and thus never loads the whole file into main memory.
A popular stream editor is sed, and it supports regular expressions.
Your command would have the following structure.
sed -i -E 's/SEARCH_REGEX/REPLACEMENT/g' INPUTFILE
-E for support of extended RegEx
-i for in-place editing mode
s denotes that you want to replace values
g is for global. By default sed only replaces the first occurrence on each line, so to replace all occurrences you must provide g
SEARCH_REGEX is the RegEx you need to find the substrings you want to replace
REPLACEMENT is the value you want to replace all matches with
INPUTFILE is the file sed will read line by line, doing the replacement for you (a concrete sketch follows this list).
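Putting it together for the banana case, a minimal sketch (this assumes each node sits entirely on one line, since sed matches line by line; the REPLACEMENT is empty so matches are deleted, | is used as the delimiter so the slashes in </td> need no escaping, and [^<]* replaces .* so each group stops at the next tag):
sed -i -E 's|<tr><td>banana</td><td>[^<]*</td><td>[^<]*</td></tr><tr><td style="padding-top:0" colspan="3">[^<]*</td></tr>||g' INPUTFILE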
While regex may not be the best tool for this kind of job, try this adjustment to your pattern:
<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>
That makes your .* matches lazy; I am wondering if the greedy patterns are consuming too much.
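One caveat: lazy quantifiers like .*? are a PCRE feature, so this adjusted pattern applies to Geany's replace dialog but not to sed, whose POSIX regexes (even with -E) are always greedy. From the command line, a hedged sketch of the same lazy pattern under perl, which also processes the file line by line:
perl -i -pe 's|<tr><td>banana</td><td>(.*?)</td><td>(.*?)</td></tr><tr><td style="padding-top:0" colspan="3">(.*?)</td></tr>||g' INPUTFILE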

Replace JSON key using SED

I am trying to match a JSON key and change its value in the file using the -i flag with sed. The issue is, I cannot get this regex right. It works perfectly fine for the simple replace case, but I cannot get it working with this regex. For simplicity, I have just tried a simple echo rather than saving it to the file. Ideas?
x=0.0.179
echo "version: 0.0.178" | sed 's/^[ ]*\"version\"[ ]*:[ ]*\"\([0-9]+\.[0-9]+\.[0-9]+\)\".*$/\$x/'
version: 0.0.178
Why complicate it with sed? Just do:
x="0.0.179"
echo "version: 0.0.178" | sed "s/version: .*/version: $x/"
version: 0.0.179
And by the way, if your JSON input can be parsed and modified via jq, go for it; use sed ONLY as a last resort.
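For reference, a minimal jq sketch (assuming the version lives in a top-level "version" key of file.json; jq has no in-place flag, hence the temporary file):
x="0.0.179"
jq --arg v "$x" '.version = $v' file.json > file.json.tmp && mv file.json.tmp file.json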
I think your sed regexp is looking for the version number within double-quotes. Your input to sed above is not quoted as such, and hence doesn't get replaced (I'd expect your JSON to be double-quoted though, hence my interest above re. your real JSON input).
This works; note the single quotes around $x:
echo "version: 0.0.178" | sed 's/^[ ]*\"version\"[ ]*:[ ]*\"\([0-9]+\.[0-9]+\.[0-9]+\)\".*$/\'$x'/'
Bash variables are not expanded inside single-quoted (') strings. Use double quotes (") whenever possible to avoid this, or just concatenate your variable into the command as shown above.
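A two-line demonstration of the difference:
x="0.0.179"
echo 'version: $x'    # prints: version: $x
echo "version: $x"    # prints: version: 0.0.179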

using sed in script to add html tag to text

I'm trying to use sed in a shell script to add html hyperlink tags to a url in a plain text file.
This is the content of my newtext.txt:
www.example.com
And here is the desired content of newtext.txt that I would like after running my script:
<a href="www.example.com">www.example.com</a>
Here is the content of my current script, addhtml.sh:
#!/bin/bash
newtextv='cat newtext.txt'
sed -i.bak 's|\(www.*\)|<a href="$newtextv">\1</a>|' newtext.txt
But unfortunately, after running the script, the content of newtext.txt becomes:
<a href="$newtextv">www.example.com</a>
I believe my error somehow relates to how my variable is being quoted?
I eventually want this script to also be able to convert full URLs (containing http://)... I obviously need to improve my sed knowledge a good deal (it's taken me a few days to get this far), but I can't wrap my head around this one.
Thank you!
If you want to put the file's content into a variable:
newtextv=$(cat newtext.txt)
But really, you probably want something like this (but with a better regex, obviously):
sed 's|www\.[^ ]*|<a href="&">&</a>|g' <newtext.txt >newtext.html
Sed replaces every & with the matched string.
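So with newtext.txt containing the single line www.example.com, newtext.html ends up with:
<a href="www.example.com">www.example.com</a>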
Why mess around with a variable?
sed -i 's|\(www.*\)|<a href="\1">\1</a>|' newtext.txt
or
sed -i 's|www.*|<a href="&">&</a>|' newtext.txt
If you happen to have the URL in a variable you can also do it without sed:
newtextv=www.example.com
echo "$newtextv"
returns
www.example.com
In Bash you can manipulate variables as part of variable substitution (parameter expansion).
Here ${newtextv#www.} basically means: take $newtextv and cut "www." from its beginning.
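For example:
newtextv=www.example.com
echo "${newtextv#www.}"    # prints: example.com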
Your trouble is two little syntax errors:
cat newtext.txt will never execute there; it's just assigned as a literal string. To run it you need backquotes (`) or $()
using single quotes (') prevents variables from expanding; to allow variable expansion, use double quotes (")
here is what you want to do:
#!/bin/bash
newtextv=$(cat newtext.txt)
sed -i.bak "s|\(www.*\)|\1|" newtext.txt

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML; however, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parser that could handle that without failing or taking days. Furthermore, while the text content is largely HTML, it's not necessarily valid HTML, so it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex: avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "(src|href)[[:space:]]*=[[:space:]]*[\"'][[:alnum:]:/._-]+[\"']"
That should be super fast, no?

Replacing HTML tag content using sed

I'm trying to replace the content of some HTML tags in an HTML page using sed in a bash script. For some reason I'm not getting the proper result, as it's not replacing anything. It has to be something very simple/stupid I'm overlooking; anyone care to help me out?
HTML to search/replace in:
Unlocked <span id="unlockedCount"></span>/<span id="totalCount"></span> achievements for <span id="totalPoints"></span> points.
sed command used:
cat index.html | sed -i -e "s/\<span id\=\"unlockedCount\"\>([0-9]\{0,\})\<\/span\>/${unlockedCount}/g" index.html
The point of this is to parse the HTML page and update the figures according to some external data. For a first run, the contents of the tags will be empty, after that they will be filled.
EDIT:
I ended up using a combination of the answers which resulted in the following code:
sed -i -e 's|<span id="unlockedCount">\([0-9]\{0,\}\)</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g' index.html
Many thanks to @Sorpigal, @tripleee, @classic for the help!
Try this:
sed -i -e "s/\(<span id=\"unlockedCount\">\)\(<\/span>\)/\1${unlockedCount}\2/g" index.html
What you say you want to do is not what you're telling sed to do.
You want to insert a number into a tag, or replace the number if one is present. What you're actually telling sed to do is to replace the span tag and its contents (if any) with the value of a shell variable.
You're also employing a lot of complex, annoying and error-prone escape sequences which are just not necessary.
Here's what you want:
sed -r -i -e 's|<span id="unlockedCount">([0-9]{0,})</span>|<span id="unlockedCount">'"${unlockedCount}"'</span>|g' index.html
Note the differences:
Added -r to turn on extended expressions without which your capture pattern would not work.
Used | instead of / as the delimiter for the substitution so that escaping / would not be necessary.
Single-quoted the sed expression so that escaping things inside it from the shell would not be necessary.
Included the matched span tag in the replacement section so that it would not get deleted.
In order to expand the unlockedCount variable, closed the single-quoted expression, then later re-opened it.
Omitted cat | which was useless here.
I also used double quotes around the shell variable expansion, because this is good practice but if it contains no spaces this is not really necessary.
It was not, strictly speaking, necessary for me to add -r. Plain old sed will work if you say \([0-9]\{0,\}\), but the idea here was to simplify.
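To see the quote splicing from that last point in isolation (a throwaway example, not from the original post):
count=42
echo 'literal $count then '"$count"' again'    # prints: literal $count then 42 again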
sed -i -e 's%<span id="unlockedCount">([0-9]*)</span>%'"${unlockedCount}%g" index.html
I removed the Useless Use of Cat, took out a bunch of unnecessary backslashes, added single quotes around the regex to protect it from shell expansion, and fixed the repetition operator. You might still need to backslash the grouping parentheses; my sed, at least, wants \(...\).
Note the use of single and double quotes next to each other. Single quotes protect against shell expansion, so you can't use them around "${unlockedCount}" where you do want the shell to interpolate the variable.