Find a string between 2 other strings in document - html

I have found a ton of solutions do do what I want with only one exception.
I need to search a .html document and pull a string.
The line containing the string will look like this (1 line, no newlines)
<script type="text/javascript">g_initHeader(0);LiveSearch.attach(ge('oh2345v5ks'));var _ = g_items;_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};_[3070]={icon:'INV_Misc_Cape_01',name_enus:'Ensign Cloak'};</script>
The text I need to get is
INV_CHEST_LEATHER_09
When I use awk, grep, and sed, I extract the data between icon:' and ',name_
The problem is, all three of these scripts scan the entire line and use the last occurring ',name_ thus I end up with
INV_Chest_Leather_09',name_enus:'Layered
Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered
Pants'};_[3070]={icon:'INV_Misc_Cape_01
Here's the last one I tried
grep -Po -m 1 "(?<=]={icon:').*(?=',name_)"
I've tried awk and sed too, and I don't really have a preference of which one to use.
So basically, I need to search the entire html file, find the first occurrence of icon:', extract the text right after it until the first occurrence after icon:' of ',name_.

With GNU awk for the 3rd arg to match():
$ awk 'match($0,/icon:\047([^\047]+)/,a){print a[1]}' file
INV_Chest_Leather_09

Simple perl approach:
perl -ne 'print "$1\n" if /\bicon:\047([^\047]+)/' file
The output:
INV_Chest_Leather_09

The .* in your regular expression is a greedy matcher, so the pattern will match till the end of the string and then backtrack to match the ,name_ portion. You could try replacing the .* with something like [^,]* (i.e. match anything except comma):
grep -Po -m 1 "(?<=]={icon:')[^,]*(?=',name_)"

Related

Getting IPs from a .html file

Their is a site with socks4 proxies online that I use in a proxychains program. Instead of manually entering new IPs in, I was trying to automate the process. I used wget to turn it into a .html file on my home directory, this is some of the output if i cat the file:
</font></a></td><td colspan=1><font class=spy1>111.230.138.177</font> <font class=spy14>(Shenzhen Tencent Computer Systems Company Limited)</font></td><td colspan=1><font class=spy1>6.531</font></td><td colspan=1><TABLE width='13' height='8' CELLPADDING=0 CELLSPACING=0><TR BGCOLOR=blue><TD width=1></TD></TR></TABLE></td><td colspan=1><font class=spy1><acronym title='311 of 436 - last check status=OK'>71% <font class=spy1>(311)</font> <font class=spy5>-</font></acronym></font></td><td colspan=1><font class=spy1><font class=spy14>05-jun-2020</font> 23:06 <font class=spy5>(4 mins ago)</font></font></td></tr><tr class=spy1x onmouseover="this.style.background='#002424'" onmouseout="this.style.background='#19373A'"><td colspan=1><font class=spy14>139.99.104.233<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(a1j0e5^q7p6)+(m3f6f6^r8c3)+(a1j0e5^q7p6)+(t0b2s9^y5m3)+(w3c3m3^z6j0))</script></font></td><td colspan=1>SOCKS5</td><td colspan=1><a href='/en/anonymous-proxy-list/'><font class=spy1>HIA</font></a></td><td colspan=1><a href='/free-proxy-list/CA/'><font class=spy14>Canada</
As you can see the IP is usually followed by a spy[0-19]> . I tried to parse out the actual IP's with awk using the following code:
awk '/^spy/{FS=">"; print $2 } file-name.html
This is problematic because their would be a bunch of other stuff trailing after the IP, also I guess the anchor on works for the beginning of a line? Anyway I was wondering if anyone could give me any ideas on how to parse out the IP addresses with awk. I just started learning awk, so sorry for the noob question. Thanks
Using a proper XML/HTML parser and a xpath expression:
xidel -se '(//td[#colspan=1]/font[#class="spy1"])[1]/text()' file.html
 Output:
111.230.138.177
Or if it's not all the time the first xpath match:
xidel -se '//td[#colspan=1]/font[#class="spy1"]/text()' file.html |
perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'
AWK is great for hacking IP addresses:
gawk -v RS="spy[0-9]*" '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); if (ip) {print ip}}' file.html
Result:
111.230.138.177
139.99.104.233
Explanation.
You must use GAWK if you want the record break to contain a regular expression.
We divide the file into lines containing one IP address using regex in the RS variable.
The match function finds the second regex in the entire line. Regex is 4 groups from 1 to 3 numbers, separated by a dot (the IP address).
Then the substract function retrieves from the entire line ($0) a fragment of RLENGTH length starting from RSTART (the beginning of the searched regex).
IF checks if the result has a value and if so prints it. This protects against empty lines in the result.
This method of hulling IP addresses is independent of the correctness of the file, it does not have to be html.
There's already solutions provided here, I'm rather putting a different one for future readers using egrep utility.
egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html

Change column if some regex expression is true with awk or sed

I have a file (lets call it data.csv) similar to this
"123","456","ud,h-match","moredata"
with many rows in the same format and embedded commas. What I need to do is look at the third column and see if it has an expression. In this case I want to know if the third column has "match" anywhere (which it does). If there is any, then I to replace the whole column to something else like "replaced". So to relate it to the example data.csv file, I would want it to look this.
"123","456","replaced","moredata"
Ideally, I want the file data.csv itself to be changed (time is of the essence since I have a big file) but it's also fine if you write it to another file.
Edit:
I have tried using awk:
awk -F'","' -OFS="," '{if(tolower($3) ~ "stringI'mSearchingFor"){$3="replacement"; print}else print}' file
but it dosen't change anything. If I remove the OFS portion then it works but it gets separated by spaces and the columns don't get enclosed by double quotes.
Depending on the answer to my question about what you mean by column, this may be what you want (uses GNU awk for FPAT):
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file
"123","456","replaced","moredata"
Use awk -i inplace ... if you want to do "in place" editing.
With any awk (but slightly more fragile than the above since it leaves the leading/trailing " on the first and last fields, and has no -i inplace):
$ awk 'BEGIN{FS=OFS="\",\""} $3~/match/{$3="replaced"} 1' file
"123","456","replaced","moredata"

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: current PM2.5 and PM10 levels (the numbers in bold in the above paragraph).
CITY="beijing"
AQIDATA=$(wget -q 0 http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm100", "v":[ or a comma and some digits.
For pm25 this works, because it is the first, and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are some that are associated with pm25 ahead of it. So the second field contains the empty string between ,21 and ,112
#karakfa has a hack that seems to work -- but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and sets it to either of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and succeeded by a comma, also a separator.
Once it hits the matching content ("pm25") it sets a counter to 4. Then, for this and the next records, it counts this counter down. "pm25" itself, "v", the empty string between : and [, and finally reaches one when hitting the record with the number you want to output: 4 && ! 3 is false, 3 && ! 2 is false, 2 && ! 1 is false, but 1 && ! 0 is true. Since there is no execution block, awk simply prints this record, which is the value you want.
A more robust work would probably be using xpath to find the script, then use some json parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15

Expect: extract specific string from output

I am navigating a Java-based CLI menu on a remote machine with expect inside a bash script and I am trying to extract something from the output without leaving the expect session.
Expect command in my script is:
expect -c "
spawn ssh user#host
expect \"#\"
send \"java cli menu command here\r\"
expect \"java cli prompt\"
send \"java menu command\"
"
###I want to extract a specific string from the above output###
Expect output is:
Id Name
-------------------
abcd 12 John Smith
I want to extract abcd 12 from the above output into another expect variable for further use within the expect script. So that's the 3rd line, first field by using a double-space delimiter. The awk equivalent would be: awk -F ' ' 'NR==3 {$1}'
The big issue is that the environment through which I am navigating with Expect is, as I stated above, a Java CLI based menu so I can't just use awk or anything else that would be available from a bash shell.
Getting out from the Java menu, processing the output and then getting in again is not an option as the login process lasts for 15 seconds so I need to remain inside and extract what I need from the output using expect internal commands only.
You can use regexp in expect itself directly with the use of -re flag. Thanks to Donal on pointing out the single quote and double quote issues. I have given solution using both ways.
I have created a file with the content as follows,
Id Name
-------------------
abcd 12 John Smith
This is nothing but your java program's console output. I have tested this in my system with this. i.e. I just simulated your program's output with cat. You just replace the cat code with your program commands. Simple. :)
Double Quotes :
#!/bin/bash
expect -c "
spawn ssh user#domain
expect \"password\"
send \"mypassword\r\"
expect {\\\$} { puts matched_literal_dollar_sign}
send \"cat input_file\r\"; # Replace this code with your java program commands
expect -re {-\r\n(.*?)\s\s}
set output \$expect_out(1,string)
#puts \$expect_out(1,string)
puts \"Result : \$output\"
"
Single Quotes :
#!/bin/bash
expect -c '
spawn ssh user#domain
expect "password"
send "mypasswordhere\r"
expect "\\\$" { puts matched_literal_dollar_sign}
send "cat input_file\r"; # Replace this code with your java program commands
expect -re {-\r\n(.*?)\s\s}
set output $expect_out(1,string)
#puts $expect_out(1,string)
puts "Result : $output"
'
As you can see, I have used {-\r\n(.*?)\s\s}. Here the braces prevent any variable substitutions. In your output, we have a 2nd line with full of hyphens. Then a newline. Then your 3rd line content. Let's decode the regex used.
-\r\n is to match one literal hyphen and a new line together. This will match the last hyphen in the 2nd line and the newline which in turn make it to 3rd line now. So, .*? will match the required output (i.e. abcd 12) till it encounters double space which is matched by \s\s.
You might be wondering why I need parenthesis which is used to get the sub-match patterns.
In general, expect will save the expect's whole match string in expect_out(0,string) and buffer all the matched/unmatched input to expect_out(buffer). Each sub match will be saved in subsequent numbering of string such as expect_out(1,string), expect_out(2,string) and so on.
As Donal pointed out, it is better to use single quote's approach since it looks less messy. :)
It is not required to escape the \r with the backslash in case of double quotes.
Update :
I have changed the regexp from -\r\n(\w+\s+\w+)\s\s to -\r\n(.*?)\s\s.
With this way - your requirement - such as match any number of letters and single spaces until you encounter first occurrence of double spaces in the output
Now, let's come to your question. You have mentioned that you have tried -\r\n(\w+)\s\s. But, there is a problem here with \w+. Remember \w+ will not match space character. Your output has some spaces in it till double spaces.
The use of regexp will matter based on your requirements on the input string which is going to get matched. You can customize the regular expressions based on your needs.
Update version 2 :
What is the significance of .*?. If you ask separately, I am going to repeat what you commented. In regular expressions, * is a greedy operator and ? is our life saver. Let us consider the string as
Stackoverflow is already overflowing with number of users.
Now, see the effect of the regular expression .*flow as below.
* matches any number of characters. More precisely, it matches the longest string possible while still allowing the pattern itself to match. So, due to this, .* in the pattern matched the characters Stackoverflow is already over and flow in pattern matched the text flow in the string.
Now, in order to prevent the .* to match only up to the first occurrence of the string flow, we are adding the ? to it. It will help the pattern to behave as non-greedy manner.
Now, again coming back to your question. If we have used .*\s\s, then it will match the whole line since it is trying to match as much as possible. This is common behavior of regular expressions.
Update version 3:
Have your code in the following way.
x=$(expect -c "
spawn ssh user#host
expect \"password\"
send \"password\r\"
expect {\\\$} { puts matched_literal_dollar_sign}
send \"cat input\r\"
expect -re {-\r\n(.*?)\s\s}
if {![info exists expect_out(1,string)]} {
puts \"Match did not happen :(\"
exit 1
}
set output \$expect_out(1,string)
#puts \$expect_out(1,string)
puts \"Result : \$output\"
")
y=$?
# $x now contains the output from the 'expect' command, and $y contains the
# exit status
echo $x
echo $y;
If the flow happened properly, then exit code will have value as 0. Else, it will have 1. With this way, you can check the return value in bash script.
Have a look at here to know about the info exists command.

Grep for a pattern

I have an HTML file with the following code
<html>
<body>
Test #1 '<%aaa(x,y)%>'
Test #2 '<%bbb(p)%>'
Test #3 '<%pqr(z)%>'
</body>
</html>
Please help me with the regex for a command (grep or awk) which displays the output as follows:
'<%aaa(x,y)%>'
'<%bbb(p)%>'
'<%pqr(z)%>'
I think that sed is a better choice than awk, but it is not completely clear cut.
sed -n '/ *Test #[0-9]* */s///p' <<!
<html>
<body>
Test #1 '<%aaa(x,y)%>'
Test #2 '<%bbb(p)%>'
Test #3 '<%pqr(z)%>'
</body>
</html>
!
You can't use grep; it returns lines that match a pattern, but doesn't normally edit those lines.
You could use awk:
awk '/Test #[0-9]+/ { print $3 }'
The pattern matches the test lines and prints the third field. It works because there are no spaces after the test number third field. If there could be spaces there, then the sed script is easier; it already handles them, whereas the awk script would have to be modified to handle them properly.
Judging from the comments, the desired output is the material between '<%' and '%>'. So, we use sed, as before:
sed -n '/.*\(<%.*%>\).*/s//\1/p'
On lines which match 'anything-<%-anything-%>-anything', replace the whole line with the part between '<%' and '%>' (including the markers) and print the result. Note that if there are multiple patterns on the line which match, only the last will be printed. (The question and comments do not cover what to do in that case, so this is acceptable. The alternatives are tough and best handled in Perl or perhaps Python.)
If the single quotes on the lines must be preserved, then you can use either of these - I'd use the first with the double quotes surrounding the regex, but they both work and are equivalent. OTOH, if there were expressions involving $ signs or back-ticks in the regex, the single-quotes are better; there are no metacharacters within a single-quoted string at the shell level.
sed -n "/.*\('<%.*%>'\).*/s//\1/p"
sed -n '/.*\('\''<%.*%>'\''\).*/s//\1/p'
The sequence '\'' is how you embed a single quote into a single-quoted string in a shell script. The first quote terminates the current string; the backslash-quote generates a single quote, and the last quote starts a new single-quoted string.
the -o option for grep is what you want:
grep -o "'.*'" filename
grep -P "^Test" 1.htm |awk '{print $3}'