Grep all prices from file - html

Is it possible to grep all prices from a file and list the output? A price begins with "$" and may contain digits, "," and ".".
I've tried the top solutions from this question, but they output either the whole file or the entire line containing a price.
The pattern I use is simple: \$
The page on the web I want to grep: http://www.ned.org/
Example of the page source:
<p><strong>Better Understanding Public Attitudes and Opinions</strong>
</p>
<p>Democratic Ideas and Values</p>
<p>$43,270</p>
<p>To monitor and better understand public views on key social, political, and economic developments. Citizens’ opinions will be tracked, documented, and studied ahead of and after the country’s September 2016 parliamentary elections. The results and accompanying analysis will be disseminated through print and electronic publications, a website, and independent media.</p>
<p><strong> </strong></p>
From this piece of HTML I want to output something like 43,270 or maybe 43270. I'm just too lazy to write a parser :)

Something like this seems to work fine for my tests:
$ echo "$prices"
tomato $30.10
potato $19.1
apples=$2,222.1
oranges:$1
peach="$22.1",discount 10%,final price=$20
$ egrep -o '\$[0-9]+([.,][0-9]+)*' <<<"$prices"
$30.10
$19.1
$2,222.1
$1
$22.1
$20
Real test with your web page:
$ links -dump "http://www.ned.org/region/central-and-eastern-europe/belarus-2016/" |egrep -o '\$[0-9]+([.,][0-9]+)*'
$43,270
$25,845
$55,582
$14,940
$44,100
$35,610
$54,470
$60,200
$33,150
$15,720
$35,160
$45,500
$72,220
$26,330
$53,020
$27,710
$22,570
$40,145
#more prices following below
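For readers who would rather script this than pipe through egrep, here is a minimal Python sketch of the same pattern. The function names extract_prices and to_number_string are made up for illustration, and the sample text is taken from the examples above. The second helper covers the "43270" form mentioned in the question.

```python
import re

# Same pattern as the egrep command above: a dollar sign, one or more digits,
# then any number of groups made of "." or "," followed by more digits.
PRICE_RE = re.compile(r"\$[0-9]+(?:[.,][0-9]+)*")


def extract_prices(text):
    """Return every price found in text, in order of appearance."""
    return PRICE_RE.findall(text)


def to_number_string(price):
    """Drop the leading "$" and the "," thousands separators."""
    return price.lstrip("$").replace(",", "")


sample = 'apples=$2,222.1\npeach="$22.1",discount 10%,final price=$20\n<p>$43,270</p>'
prices = extract_prices(sample)
print(prices)
print([to_number_string(p) for p in prices])
```

As with the egrep version, this only looks at text; it does not parse the HTML, so a "$" amount inside a script or attribute would be matched as well.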

Related

Getting IPs from a .html file

There is a site with SOCKS4 proxies online that I use in a proxychains program. Instead of manually entering new IPs, I was trying to automate the process. I used wget to save the page as a .html file in my home directory; this is some of the output if I cat the file:
</font></a></td><td colspan=1><font class=spy1>111.230.138.177</font> <font class=spy14>(Shenzhen Tencent Computer Systems Company Limited)</font></td><td colspan=1><font class=spy1>6.531</font></td><td colspan=1><TABLE width='13' height='8' CELLPADDING=0 CELLSPACING=0><TR BGCOLOR=blue><TD width=1></TD></TR></TABLE></td><td colspan=1><font class=spy1><acronym title='311 of 436 - last check status=OK'>71% <font class=spy1>(311)</font> <font class=spy5>-</font></acronym></font></td><td colspan=1><font class=spy1><font class=spy14>05-jun-2020</font> 23:06 <font class=spy5>(4 mins ago)</font></font></td></tr><tr class=spy1x onmouseover="this.style.background='#002424'" onmouseout="this.style.background='#19373A'"><td colspan=1><font class=spy14>139.99.104.233<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(a1j0e5^q7p6)+(m3f6f6^r8c3)+(a1j0e5^q7p6)+(t0b2s9^y5m3)+(w3c3m3^z6j0))</script></font></td><td colspan=1>SOCKS5</td><td colspan=1><a href='/en/anonymous-proxy-list/'><font class=spy1>HIA</font></a></td><td colspan=1><a href='/free-proxy-list/CA/'><font class=spy14>Canada</
As you can see, the IP is usually preceded by spy[0-19]> . I tried to parse out the actual IPs with awk using the following code:
awk '/^spy/{FS=">"; print $2 }' file-name.html
This is problematic because there would be a bunch of other stuff trailing after the IP. Also, I guess the anchor only works at the beginning of a line? Anyway, I was wondering if anyone could give me any ideas on how to parse out the IP addresses with awk. I just started learning awk, so sorry for the noob question. Thanks
Using a proper XML/HTML parser and a xpath expression:
xidel -se '(//td[@colspan=1]/font[@class="spy1"])[1]/text()' file.html
Output:
111.230.138.177
Or, if the IP is not always the first XPath match:
xidel -se '//td[@colspan=1]/font[@class="spy1"]/text()' file.html |
perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'
AWK is great for hacking IP addresses:
gawk -v RS="spy[0-9]*" '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); if (ip) {print ip}}' file.html
Result:
111.230.138.177
139.99.104.233
Explanation.
Use gawk if you want the record separator (RS) to be a regular expression; POSIX awk only guarantees a single-character separator.
We split the file into records, each containing one IP address, by using a regex in the RS variable.
The match function looks for the second regex anywhere in the record. The regex is 4 groups of 1 to 3 digits, separated by dots (the IP address).
Then the substr function extracts from the whole record ($0) a fragment of length RLENGTH starting at RSTART (the start of the matched text).
The if checks whether the result is non-empty and, if so, prints it. This avoids empty lines in the output.
This way of extracting IP addresses does not depend on the file being well-formed; it does not even have to be HTML.
There are already solutions provided here; I'm adding a different one, using the egrep utility, for future readers.
egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html
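The same regex idea can be sketched in Python. The function name extract_ips and the extra range check (each part must be 0 to 255) are my own additions, not part of the answers above; the sample string is a shortened piece of the HTML from the question.

```python
import re

# Four groups of 1 to 3 digits separated by dots, as in the egrep pattern.
# The \b anchors keep the match from starting or ending inside a longer number.
IPV4_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")


def extract_ips(text):
    """Return all dotted quads in text whose parts are all <= 255."""
    candidates = IPV4_RE.findall(text)
    return [ip for ip in candidates
            if all(int(part) <= 255 for part in ip.split("."))]


sample = ("<td colspan=1><font class=spy1>111.230.138.177</font></td>"
          "<td colspan=1><font class=spy1>6.531</font></td>"
          "<td colspan=1><font class=spy14>139.99.104.233<script>")
for ip in extract_ips(sample):
    print(ip)
```

Like the gawk and egrep versions, this treats the file as plain text, so it also works on input that is not valid HTML.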

Unexpected results when Select objects using jq

When I add the body to output list, some wrong names get output. I expected it to output only names for nfl subreddit in both examples. Feature or bug? How can I only output the tuples for subreddit nfl?
The file:
{"author":"403and780","author_flair_css_class":"NHL-EDM4-sheet1-col01-row17","author_flair_text":"EDM - NHL","body":"Don't get why we do this but can't have a Grey Cup GDT.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn6","is_submitter":false,"link_id":"t3_7v9yqa","parent_id":"t3_7v9yqa","permalink":"/r/hockey/comments/7v9yqa/game_thread_super_bowl_lii_philadelphia_eagles_vs/dtqrsn6/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"hockey","subreddit_id":"t5_2qiel","subreddit_type":"public"}
{"author":"kygiacomo","author_flair_css_class":null,"author_flair_text":null,"body":"lol missed the extra wtf","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn7","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsn7/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"shitpostlord4321","author_flair_css_class":null,"author_flair_text":null,"body":"I really hope we get Bleeding Edge before we get the all new all different armor. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn8","is_submitter":false,"link_id":"t3_7v7whz","parent_id":"t3_7v7whz","permalink":"/r/marvelstudios/comments/7v7whz/a_great_new_look_at_iron_mans_avengers_infinity/dtqrsn8/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"marvelstudios","subreddit_id":"t5_2uii8","subreddit_type":"public"}
{"author":"namohysip","author_flair_css_class":null,"author_flair_text":null,"body":"Maybe. I mostly am just doing this to get a story out, and it\u2019s a huge one, so I\u2019m not sure that I\u2019ll be making another fic for many more months. I guess Pokemon Mystery Dungeon just isn\u2019t as popular with the older demographics of AO3.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn9","is_submitter":true,"link_id":"t3_7v9psr","parent_id":"t1_dtqrm3t","permalink":"/r/FanFiction/comments/7v9psr/how_do_you_deal_with_bad_reviews/dtqrsn9/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"FanFiction","subreddit_id":"t5_2r5kb","subreddit_type":"public"}
{"author":"SDsc0rch","author_flair_css_class":null,"author_flair_text":null,"body":"if it rates an upvote, I'll click it - I'm not gonna click on low quality \nnot gonna apologize for it either ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsna","is_submitter":false,"link_id":"t3_7vaam4","parent_id":"t3_7vaam4","permalink":"/r/The_Donald/comments/7vaam4/daily_reminderif_you_see_any_gray_arrows_on_the/dtqrsna/","retrieved_on":1518931297,"score":4,"stickied":false,"subreddit":"The_Donald","subreddit_id":"t5_38unr","subreddit_type":"public"}
{"author":"scarletcrawford","author_flair_css_class":null,"author_flair_text":null,"body":"Honestly, I wanted Takeshi to stay with Poe, but to each their own ship, I guess.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnb","is_submitter":false,"link_id":"t3_7upyc0","parent_id":"t1_dtppyry","permalink":"/r/alteredcarbon/comments/7upyc0/season_1_series_discussion/dtqrsnb/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"alteredcarbon","subreddit_id":"t5_3bzvp","subreddit_type":"public"}
{"author":"immortalis","author_flair_css_class":"vikings","author_flair_text":"Vikings","body":"The ghost of MN kickers will haunt this game.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnc","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnc/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"KryptoFreak405","author_flair_css_class":"48","author_flair_text":"","body":"His original backstory had him training to be an Imperial officer until a commanding officer ordered him to transport a shipment of slaves. He refused, freed the slaves, one of which was Chewie, and defected to become a smuggler","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnd","is_submitter":false,"link_id":"t3_7vanzc","parent_id":"t1_dtqr5q5","permalink":"/r/StarWars/comments/7vanzc/solo_a_star_wars_story_big_game_tv_spot/dtqrsnd/","retrieved_on":1518931297,"score":1102,"stickied":false,"subreddit":"StarWars","subreddit_id":"t5_2qi4s","subreddit_type":"public"}
{"author":"thwinks","author_flair_css_class":null,"author_flair_text":null,"body":"Oh. TIL","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsne","is_submitter":false,"link_id":"t3_7v8o0z","parent_id":"t1_dtqg97a","permalink":"/r/gifs/comments/7v8o0z/silly_walk_champion/dtqrsne/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"gifs","subreddit_id":"t5_2qt55","subreddit_type":"public"}
{"author":"Mimi108","author_flair_css_class":"lions","author_flair_text":"Lions","body":"The Big. The Dick. The Nick. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnf","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnf/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
Code example 1, which works OK:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .author'
kygiacomo
immortalis
Mimi108
Code example 2, which is wrong or unexpected to me:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .body, .author'
403and780
lol missed the extra wtf
kygiacomo
shitpostlord4321
namohysip
SDsc0rch
scarletcrawford
The ghost of MN kickers will haunt this game.
immortalis
KryptoFreak405
thwinks
The Big. The Dick. The Nick.
Mimi108
You can see that author 403and780 commented to a hockey subreddit, not nfl, unfortunately.
jq solution:
jq -r 'select(.subreddit == "nfl") as $o | $o.body, $o.author' head_rc.txt
... as $o - assigns the filtered object to the variable $o
The output:
lol missed the extra wtf
kygiacomo
The ghost of MN kickers will haunt this game.
immortalis
The Big. The Dick. The Nick.
Mimi108
Also need to add streaming once I get this syntax correct.
Some good news - you won't need to use the so-called "streaming parser", because your input has already been chopped up. The "streaming parser" is only needed when the input has one or more individually ginormous JSON entities, whereas you have a (long) stream of small JSON objects.
p.s.
As Charles Duffy suggested, the simplest solution to your selection problem is to use parentheses
jq -r 'select(.subreddit=="nfl") | (.body, .author)' input.json
If CSV or TSV makes sense, then change the parentheses to brackets, and tack on @csv or @tsv, e.g.
select(.subreddit=="nfl") | [.body, .author] | @tsv
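The output of code example 2 is consistent with jq grouping the filter as (select(...) .body), .author, so that .author is evaluated for every input object while only .body is filtered. The Python sketch below mimics both groupings on three trimmed-down records; the function names ungrouped and grouped are made up for illustration, and the records keep only the three fields that matter here.

```python
import json

# Three records from the question, trimmed to the relevant fields.
LINES = [
    '{"author":"403and780","body":"Don\'t get why we do this","subreddit":"hockey"}',
    '{"author":"kygiacomo","body":"lol missed the extra wtf","subreddit":"nfl"}',
    '{"author":"thwinks","body":"Oh. TIL","subreddit":"gifs"}',
]


def ungrouped(objs):
    """Mimics: select(.subreddit=="nfl") .body, .author"""
    for o in objs:
        if o["subreddit"] == "nfl":
            yield o["body"]      # only this part is guarded by the select
        yield o["author"]        # emitted for every object


def grouped(objs):
    """Mimics: select(.subreddit=="nfl") | (.body, .author)"""
    for o in objs:
        if o["subreddit"] == "nfl":
            yield o["body"]
            yield o["author"]


objs = [json.loads(line) for line in LINES]
print(list(ungrouped(objs)))
print(list(grouped(objs)))
```

The first list contains the hockey and gifs authors, just like the unexpected jq output; the second contains only the nfl body and author.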

Converting the output of MediaWiki to plain text

Using the MediaWiki API, this gives me an output like so, for search term Tiger
https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1
Response:
{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}
How do I get an output as
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
Please can someone also tell me how to store everything in a text file? I'm a beginner here so please be nice. I need this for a project I'm doing in Bash, on a Raspberry Pi 2, with Raspbian
It's usually recommended to use a JSON parser for handling JSON; one that I like is jq:
% jq -r '.query.pages[].extract' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>
<p></p>
To remove the HTML tags you can do something like:
... | sed 's/<[^>]*>//g'
This removes HTML tags as long as each tag opens and closes on the same line:
% jq -r '.query.pages[].extract' file | sed 's/<[^>]*>//g'
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
file is the file the JSON is stored in, eg:
curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' > file
jq '...' file
or
jq '...' <(curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1')
You can install jq with:
sudo apt-get install jq
For your example input you can also use grep with -P (PCRE). But using a proper JSON parser as above is recommended
grep -oP '(?<=extract":").*?(?=(?<!\\)")' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>
If you're using PHP, you can do it fairly easily, as shown below.
Accessing the text
We know that the text is stored inside the extract property, so we need to access that.
The easiest way to do this would be to parse the string from the API into an object format, which is done with the json_decode method in PHP. You can then access the extract property from that object, and this will give you your string. The code would be something like this:
//Get the string from the API, however you've already done it
$JSONString = getFromAPI();
//Use the inbuilt method to create a JSON object
$JSONObject = json_decode($JSONString);
//Follow the structure to get the pages property
$Pages = $JSONObject->query->pages;
//Here, we don't know what the Page ID is (because the MediaWiki API returns a different number, depending on the page)
//Therefore we need to simply get the first key, and within it should be our desired 'extract' key
$Extract = "";
foreach($Pages as $value) {
$Extract = $value->extract;
break;
}
//$Extract now contains our desired text
Writing it to a file
Now we need to write the contents of $Extract to a file, as you mentioned. This can be done as follows, by utilizing the file_put_contents method.
//Can be anything you want
$file = 'APIResult.txt';
// Write the contents to the file,
// using the LOCK_EX flag to prevent anyone else writing to the file at the same time
file_put_contents($file, $Extract, LOCK_EX);
Aaand we're done!
Documentation
The documentation for these functions (json_decode and file_put_contents) can be found at:
http://php.net/manual/en/function.json-decode.php
http://php.net/manual/en/function.file-put-contents.php
You may find pandoc helpful, from http://pandoc.org/ - it understands a number of file formats on input including Mediawiki, and also has a bunch of file formats on output including plain text. It's more the "Swiss army knife" approach, and since Mediawiki is arbitrarily complicated to parse, you'll want to use something like this that's been through a big test suite.
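If Python is available on the Pi, the jq-plus-sed approach above can be sketched in one small script. This is only a sketch under assumptions: the function names extract_text and save_text are hypothetical, the sample response is a shortened copy of the one in the question, and the regex tag stripping is as naive as the sed version (it is not an HTML parser).

```python
import html
import json
import re

# Naive tag stripper, equivalent to sed 's/<[^>]*>//g'.
TAG_RE = re.compile(r"<[^>]*>")


def extract_text(api_response):
    """Return the plain-text extract(s) from a MediaWiki API JSON response."""
    data = json.loads(api_response)
    # The page ID key differs per page, so iterate over all values instead.
    pages = data["query"]["pages"]
    extracts = [page.get("extract", "") for page in pages.values()]
    text = "\n".join(TAG_RE.sub("", e) for e in extracts)
    # Turn entities such as &amp; back into characters and trim blank lines.
    return html.unescape(text).strip()


def save_text(path, text):
    """Write text to path as UTF-8, followed by a newline."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text + "\n")


# Shortened copy of the response shown in the question.
response = (r'{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,'
            r'"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> '
            r'(<i>Panthera tigris</i>) is a carnivorous mammal.</p>\n<p></p>"}}}}')
print(extract_text(response))
```

Calling save_text("tiger.txt", extract_text(response)) would store the result in a file; the file name is just an example.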

searching .CSV file with AWK - only working with first row

I've been trying to search through a specific column of a .csv file to find cells containing a particular word. However, it's only working for the first row (i.e. the headings) in my .csv file.
The file is a series of over 10,000 forum posts, with column 1 as the post key and column 2 as the post text. The headings as below are 'key', 'annotated sentence'.
key,annotated sentence
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
"(595, 0)",So you're actually claiming that it is a lie to say that the UK has a lower gun crime rate than the US? Even if the police were miscounting crimes it's still a huge and unjustified leap in logic to conclude from that that the UK does not have a lower gun crime rate.
"(736, 3)","The anti-abortionists claim a load of **** on many issues. I don't listen to them. To put the ""life"" of an unfertilized egg above that of a person is grotesquely sick IMO. I support any such stem cell research wholeheartedly."
The CSV separator is a comma, and the text delimiter is ".
If I try:
awk -F, '$1 ~ /key/ {print}' posts_file.csv > output_file.csv
it will output the headings row no problem. However, I have tried:
awk -F, '$1 ~ /212/ {print}' posts_file.csv > output_file.csv
awk -F, '$2 ~ /Canada/ {print}' posts_file.csv > output_file.csv
and neither of these work - no matches are found though there should be. I can't figure out why? Any ideas? Thanks in advance.
awk to the rescue!
In general, awk cannot handle complex CSV, but in your case the key and the annotated sentence have very distinct value types, so you can search the whole record instead of a single field. The trick is defining the record, which your format also allows: every record after the header starts with a double quote at the beginning of a line. Note that RT is a gawk extension. For example:
$ awk -v RS='\n"' '/Canada/{print RT $0}' csv
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
and this
$ awk -v RS='\n"' '/(212, 2)/{print RT $0}' csv
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
Python's CSV parsing supports your format out of the box.
Below is a simple script that you could call as follows:
# csvfilter <1-basedFieldNdx> <regexStr> < <infile> > <outfile>
csvfilter 1 'key' < posts_file.csv > output_file.csv
csvfilter 1 '212' < posts_file.csv > output_file.csv
csvfilter 2 'Canada' < posts_file.csv > output_file.csv
Sample script csvfilter:
#!/usr/bin/env python
# coding=utf-8
import csv, sys, re
# Assign arguments to variables.
fieldNdx = int(sys.argv[1]) - 1 # Index of field to filter; Python arrays are 0-based!
reStr = sys.argv[2] if (len(sys.argv) > 2) else '' # Filter regex
# Read from stdin...
reader = csv.reader(sys.stdin)
# ... write to stdout.
writer = csv.writer(sys.stdout, reader.dialect)
# Read each line...
for row in reader:
    # Match the target field against the filter regex and
    # print the row only if it matches.
    if re.search(reStr, row[fieldNdx]):
        writer.writerow(row)
OpenRefine could help with the search.
One way to use awk safely with complex CSV is to use a "csv2tsv" utility to convert the CSV file to a format that can be handled properly by awk.
Usually the TSV ("tab-separated values") format is just right for the job.
(If the final output must be CSV, then either a complementary "tsv2csv" utility can be used, or awk itself can do the job -- though some care may be required to get it exactly right.)
So the pipeline might look like this:
csv2tsv < input.csv | awk -F\\t 'BEGIN{OFS=FS} ....' | tsv2csv
There are several alternatives for csv-to-tsv conversion, ranging from roll-your-own scripts to Excel, but I'd recommend taking the time to check that whichever tool or toolset you select satisfies the "edge case" requirements that are of interest to you.
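As one roll-your-own possibility, a csv2tsv step could be sketched in Python as follows. This is not a specific existing utility: the function names are hypothetical, and the choice to escape tabs, newlines and backslashes inside fields as \t, \n and \\ is one common convention that I am assuming here. It keeps every record on a single line, which is what awk needs.

```python
import csv
import io


def tsv_escape(field):
    """Escape characters that would break one-record-per-line TSV."""
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_tsv(infile, outfile):
    """Read CSV (quoted fields, embedded newlines) and write escaped TSV."""
    for row in csv.reader(infile):
        outfile.write("\t".join(tsv_escape(f) for f in row) + "\n")


# Demo on the first record of the question's file.
src = io.StringIO('key,annotated sentence\n'
                  '"(212, 2)","Got evidence to back that up??\n-Wez"\n')
out = io.StringIO()
csv_to_tsv(src, out)
print(out.getvalue(), end="")
```

To use it as a filter in the pipeline above, call csv_to_tsv(sys.stdin, sys.stdout) instead of the demo; a field test such as $2 ~ /Canada/ with -F'\t' then sees the whole post text in one field.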

Windows - How to grep (or findstr) html files and showing the first matching expression

Using grep or findstr, I want to get the correct IMDb number when searching for a specific movie by its real name.
For example, the movie "Das Boot" is listed at IMDb with movie number tt0082096.
I'm trying to grep (or findstr) through HTML files that are generated by a search engine.
The generated html file contains several parts like this:
<div id="statbox">
<span class="uschr2">1. </span> Das Boot (1981) - IMDb <br>
<div id="descbox">
www.imdb.com/title/tt0082096/ - Im Cache - Ähnliche Seiten <BR>
</div>
The string I'm looking for is the one containing the URL of the movie. In this case it's:
http://www.imdb.com/title/tt0082096/
The string format is like:
http://www.imdb.com/title/tt???????/
Where '?' stands for a digit 0...9
My question is:
How can grep or findstr return only the first occurrence of the matching string itself and not the complete line containing a match?
Thank you a lot for your assistance!
Best regards
Windows findstr returns complete lines. You can avoid this with GNU sed:
sed -rn "\#http://www.imdb.com/title/tt#s#.*href=\"(.*)\"\s.*#\1#p" file
http://www.imdb.com/title/tt0082096/
In addition you can use grep -o:
-o, --only-matching show only the part of a line matching PATTERN
With grep you can do something like:
grep -oP '(?<=href=\")[^"]+(?=\")' html.file
This is not the ideal way of parsing an HTML file. However, if it is a one-off thing then you can probably get away with it. ?<=href=\" is a lookbehind. If the above returns a lot of unrelated matches, you can add something to the pattern that is unique to the URL lines.
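Where Python is available on the Windows machine, returning only the first matching string can also be sketched like this. The function name first_imdb_url is made up, the scheme is treated as optional because the sample HTML in the question shows the URL without http://, and the seven-digit ID follows the format described in the question.

```python
import re

# Optional scheme, then the IMDb title path with "tt" and exactly 7 digits.
IMDB_RE = re.compile(r"(?:https?://)?www\.imdb\.com/title/tt[0-9]{7}/")


def first_imdb_url(html_text):
    """Return the first IMDb title URL in html_text, or None if absent."""
    match = IMDB_RE.search(html_text)
    return match.group(0) if match else None


sample = """<div id="statbox">
<span class="uschr2">1. </span> Das Boot (1981) - IMDb <br>
<div id="descbox">
www.imdb.com/title/tt0082096/ - Im Cache <BR>
</div>"""
print(first_imdb_url(sample))
```

re.search stops at the first match, so later hits in the same file are ignored, which is the behaviour asked for.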