There is a site with SOCKS4 proxies online that I use with proxychains. Instead of manually entering new IPs, I was trying to automate the process. I used wget to save the page as a .html file in my home directory; this is some of the output if I cat the file:
</font></a></td><td colspan=1><font class=spy1>111.230.138.177</font> <font class=spy14>(Shenzhen Tencent Computer Systems Company Limited)</font></td><td colspan=1><font class=spy1>6.531</font></td><td colspan=1><TABLE width='13' height='8' CELLPADDING=0 CELLSPACING=0><TR BGCOLOR=blue><TD width=1></TD></TR></TABLE></td><td colspan=1><font class=spy1><acronym title='311 of 436 - last check status=OK'>71% <font class=spy1>(311)</font> <font class=spy5>-</font></acronym></font></td><td colspan=1><font class=spy1><font class=spy14>05-jun-2020</font> 23:06 <font class=spy5>(4 mins ago)</font></font></td></tr><tr class=spy1x onmouseover="this.style.background='#002424'" onmouseout="this.style.background='#19373A'"><td colspan=1><font class=spy14>139.99.104.233<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(a1j0e5^q7p6)+(m3f6f6^r8c3)+(a1j0e5^q7p6)+(t0b2s9^y5m3)+(w3c3m3^z6j0))</script></font></td><td colspan=1>SOCKS5</td><td colspan=1><a href='/en/anonymous-proxy-list/'><font class=spy1>HIA</font></a></td><td colspan=1><a href='/free-proxy-list/CA/'><font class=spy14>Canada</
As you can see, the IP usually appears right after a spy[0-9]+> pattern. I tried to parse out the actual IPs with awk using the following code:
awk '/^spy/{FS=">"; print $2}' file-name.html
This is problematic because there would be a bunch of other stuff trailing after the IP, and I guess the ^ anchor only works at the beginning of a line? Anyway, I was wondering if anyone could give me ideas on how to parse out the IP addresses with awk. I just started learning awk, so sorry for the noob question. Thanks.
Using a proper XML/HTML parser and an XPath expression:
xidel -se '(//td[@colspan=1]/font[@class="spy1"])[1]/text()' file.html
Output:
111.230.138.177
Or, if the IP is not always the first XPath match:
xidel -se '//td[@colspan=1]/font[@class="spy1"]/text()' file.html |
perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'
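For the sample above, this second pipeline should keep only the IPv4-shaped matches and drop the other spy1 cells (latency, percentages), printing just:
111.230.138.177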
AWK is great for extracting IP addresses:
gawk -v RS="spy[0-9]*" '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); if (ip) {print ip}}' file.html
Result:
111.230.138.177
139.99.104.233
Explanation:
You must use GNU awk (gawk) if you want the record separator to contain a regular expression.
We divide the file into records, each containing one IP address, using the regex in the RS variable.
The match function searches the whole record for the regex given as its second argument: four groups of one to three digits, separated by dots (the IP address).
Then the substr function retrieves from the whole record ($0) a fragment of length RLENGTH starting at RSTART (the beginning of the matched regex).
The if checks whether the result is non-empty and, if so, prints it. This protects against empty lines in the output.
This way of extracting IP addresses does not depend on the file being well-formed; it does not even have to be HTML.
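To see how a regex RS splits the input into records, here is a minimal, self-contained illustration (the input string is made up for demonstration):
$ printf 'spy1>111.230.138.177</font> junk spy14>139.99.104.233<script' |
  gawk -v RS='spy[0-9]*' 'NR > 1 {print NR-1 ": " $0}'
1: >111.230.138.177</font> junk
2: >139.99.104.233<script
Each record now contains at most one IP address, which match and substr then pull out.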
There are already solutions provided here; I'm rather adding a different one for future readers, using the egrep utility:
egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html
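Since the goal is to automate the proxychains setup, you could chain this with wget in one step. A rough sketch, where the proxy-list URL and output file are placeholders:
$ wget -qO- 'https://example.com/socks-proxy-list' |
  egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' |
  sort -u > ~/proxy-ips.txt
Note that this pattern also matches non-address digit groups such as 999.999.999.999; for a quick scrape of a proxy-list page that is usually acceptable.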
When I add the body to the output list, some wrong names get output. I expected it to output only names for the nfl subreddit in both examples. Feature or bug? How can I output only the tuples for the nfl subreddit?
The file:
{"author":"403and780","author_flair_css_class":"NHL-EDM4-sheet1-col01-row17","author_flair_text":"EDM - NHL","body":"Don't get why we do this but can't have a Grey Cup GDT.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn6","is_submitter":false,"link_id":"t3_7v9yqa","parent_id":"t3_7v9yqa","permalink":"/r/hockey/comments/7v9yqa/game_thread_super_bowl_lii_philadelphia_eagles_vs/dtqrsn6/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"hockey","subreddit_id":"t5_2qiel","subreddit_type":"public"}
{"author":"kygiacomo","author_flair_css_class":null,"author_flair_text":null,"body":"lol missed the extra wtf","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn7","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsn7/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"shitpostlord4321","author_flair_css_class":null,"author_flair_text":null,"body":"I really hope we get Bleeding Edge before we get the all new all different armor. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn8","is_submitter":false,"link_id":"t3_7v7whz","parent_id":"t3_7v7whz","permalink":"/r/marvelstudios/comments/7v7whz/a_great_new_look_at_iron_mans_avengers_infinity/dtqrsn8/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"marvelstudios","subreddit_id":"t5_2uii8","subreddit_type":"public"}
{"author":"namohysip","author_flair_css_class":null,"author_flair_text":null,"body":"Maybe. I mostly am just doing this to get a story out, and it\u2019s a huge one, so I\u2019m not sure that I\u2019ll be making another fic for many more months. I guess Pokemon Mystery Dungeon just isn\u2019t as popular with the older demographics of AO3.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn9","is_submitter":true,"link_id":"t3_7v9psr","parent_id":"t1_dtqrm3t","permalink":"/r/FanFiction/comments/7v9psr/how_do_you_deal_with_bad_reviews/dtqrsn9/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"FanFiction","subreddit_id":"t5_2r5kb","subreddit_type":"public"}
{"author":"SDsc0rch","author_flair_css_class":null,"author_flair_text":null,"body":"if it rates an upvote, I'll click it - I'm not gonna click on low quality \nnot gonna apologize for it either ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsna","is_submitter":false,"link_id":"t3_7vaam4","parent_id":"t3_7vaam4","permalink":"/r/The_Donald/comments/7vaam4/daily_reminderif_you_see_any_gray_arrows_on_the/dtqrsna/","retrieved_on":1518931297,"score":4,"stickied":false,"subreddit":"The_Donald","subreddit_id":"t5_38unr","subreddit_type":"public"}
{"author":"scarletcrawford","author_flair_css_class":null,"author_flair_text":null,"body":"Honestly, I wanted Takeshi to stay with Poe, but to each their own ship, I guess.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnb","is_submitter":false,"link_id":"t3_7upyc0","parent_id":"t1_dtppyry","permalink":"/r/alteredcarbon/comments/7upyc0/season_1_series_discussion/dtqrsnb/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"alteredcarbon","subreddit_id":"t5_3bzvp","subreddit_type":"public"}
{"author":"immortalis","author_flair_css_class":"vikings","author_flair_text":"Vikings","body":"The ghost of MN kickers will haunt this game.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnc","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnc/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"KryptoFreak405","author_flair_css_class":"48","author_flair_text":"","body":"His original backstory had him training to be an Imperial officer until a commanding officer ordered him to transport a shipment of slaves. He refused, freed the slaves, one of which was Chewie, and defected to become a smuggler","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnd","is_submitter":false,"link_id":"t3_7vanzc","parent_id":"t1_dtqr5q5","permalink":"/r/StarWars/comments/7vanzc/solo_a_star_wars_story_big_game_tv_spot/dtqrsnd/","retrieved_on":1518931297,"score":1102,"stickied":false,"subreddit":"StarWars","subreddit_id":"t5_2qi4s","subreddit_type":"public"}
{"author":"thwinks","author_flair_css_class":null,"author_flair_text":null,"body":"Oh. TIL","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsne","is_submitter":false,"link_id":"t3_7v8o0z","parent_id":"t1_dtqg97a","permalink":"/r/gifs/comments/7v8o0z/silly_walk_champion/dtqrsne/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"gifs","subreddit_id":"t5_2qt55","subreddit_type":"public"}
{"author":"Mimi108","author_flair_css_class":"lions","author_flair_text":"Lions","body":"The Big. The Dick. The Nick. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnf","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnf/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
Code example 1, which works OK:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .author'
kygiacomo
immortalis
Mimi108
Code example 2, whose output is wrong or unexpected to me:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .body, .author'
403and780
lol missed the extra wtf
kygiacomo
shitpostlord4321
namohysip
SDsc0rch
scarletcrawford
The ghost of MN kickers will haunt this game.
immortalis
KryptoFreak405
thwinks
The Big. The Dick. The Nick.
Mimi108
You can see that author 403and780 commented in the hockey subreddit, not nfl, unfortunately.
jq solution:
jq -r 'select(.subreddit == "nfl") as $o | $o.body, $o.author' head_rc.txt
... as $o assigns the filtered object to the variable $o, so both .body and .author are then taken from that same object.
The output:
lol missed the extra wtf
kygiacomo
The ghost of MN kickers will haunt this game.
immortalis
The Big. The Dick. The Nick.
Mimi108
I also need to add streaming once I get this syntax correct.
Some good news - you won't need to use the so-called "streaming parser", because your input has already been chopped up. The "streaming parser" is only needed when the input has one or more individually ginormous JSON entities, whereas you have a (long) stream of small JSON objects.
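To illustrate the point (this is just jq's default behavior, nothing extra to enable): jq applies the filter to each top-level JSON value in turn, so a long one-object-per-line file is processed object by object:
$ jq -r 'select(.subreddit == "nfl") | .author' head_rc.txt
kygiacomo
immortalis
Mimi108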
p.s.
As Charles Duffy suggested, the simplest solution to your selection problem is to use parentheses:
jq -r 'select(.subreddit=="nfl") | (.body, .author)' input.json
If CSV or TSV makes sense, then change the parentheses to brackets, and tack on @csv or @tsv, e.g.
select(.subreddit=="nfl") | [.body, .author] | @tsv
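Run against the file above, that gives (fields separated by a tab):
$ jq -r 'select(.subreddit=="nfl") | [.body, .author] | @tsv' head_rc.txt
lol missed the extra wtf	kygiacomo
The ghost of MN kickers will haunt this game.	immortalis
The Big. The Dick. The Nick.	Mimi108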
Is it possible to somehow grep all prices from a file and list them in the output? A price begins with "$" and may contain digits, "," and ".".
I've tried the best solutions from this question, but they output the whole file or the entire string containing a price.
The pattern I use is simple: \$
The web page I want to grep: http://www.ned.org/
Example of the page source:
<p><strong>Better Understanding Public Attitudes and Opinions</strong>
</p>
<p>Democratic Ideas and Values</p>
<p>$43,270</p>
<p>To monitor and better understand public views on key social, political, and economic developments. Citizens’ opinions will be tracked, documented, and studied ahead of and after the country’s September 2016 parliamentary elections. The results and accompanying analysis will be disseminated through print and electronic publications, a website, and independent media.</p>
<p><strong> </strong></p>
I want to get from this piece of HTML something like 43,270 or maybe 43270. Just too lazy to write a parser :)
Something like this seems to work fine for my tests:
$ echo "$prices"
tomato $30.10
potato $19.1
apples=$2,222.1
oranges:$1
peach="$22.1",discount 10%,final price=$20
$ egrep -o '\$[0-9]+([.,][0-9]+)*' <<<"$prices"
$30.10
$19.1
$2,222.1
$1
$22.1
$20
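And since you said the bare number would also do, stripping the dollar sign (and the thousands commas, if you like) is one more pipeline step:
$ egrep -o '\$[0-9]+([.,][0-9]+)*' <<<"$prices" | tr -d '$,'
30.10
19.1
2222.1
1
22.1
20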
Real test with your web page:
$ links -dump "http://www.ned.org/region/central-and-eastern-europe/belarus-2016/" |egrep -o '\$[0-9]+([.,][0-9]+)*'
$43,270
$25,845
$55,582
$14,940
$44,100
$35,610
$54,470
$60,200
$33,150
$15,720
$35,160
$45,500
$72,220
$26,330
$53,020
$27,710
$22,570
$40,145
# more prices follow below