Converting the output of MediaWiki to plain text - json

Using the MediaWiki API, the following request gives me this output for the search term Tiger:
https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1
Response:
{"batchcomplete":"","query":{"pages":{"9796":{"pageid":9796,"ns":0,"title":"Tiger","extract":"<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>"}}}}
How do I get an output as
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
Could someone also tell me how to store everything in a text file? I'm a beginner here, so please be nice. I need this for a project I'm doing in Bash, on a Raspberry Pi 2 running Raspbian.

It's usually recommended to use a JSON parser for handling JSON; one that I like is jq:
% jq -r '.query.pages[].extract' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>
<p></p>
To remove the HTML tags you can do something like:
... | sed 's/<[^>]*>//g'
This will remove HTML tags as long as they do not span multiple lines:
% jq -r '.query.pages[].extract' file | sed 's/<[^>]*>//g'
The tiger (Panthera tigris) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.
file is the file the JSON is stored in, e.g.:
curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' > file
jq '...' file
or
jq '...' <(curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1')
You can install jq with:
sudo apt-get install jq
For your example input you can also use grep with -P (PCRE), but using a proper JSON parser as above is recommended:
grep -oP '(?<=extract":").*?(?=(?<!\\)")' file
<p>The <b>tiger</b> (<i>Panthera tigris</i>) is a carnivorous mammal. It is the largest living member of the cat family, the Felidae. It lives in Asia, mainly India, Bhutan, China and Siberia.</p>\n<p></p>
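Putting the pieces above together and also writing the result to a text file (as asked in the question), a sketch could look like this: curl fetches the JSON, jq pulls out the extract, sed strips the tags, and the redirect saves the result. The output file name tiger.txt is just an example:
curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' |
  jq -r '.query.pages[].extract' |
  sed 's/<[^>]*>//g' > tiger.txt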

If you're using PHP, you can do it fairly easily, as shown below.
Accessing the text
We know that the text is stored inside the extract property, so we need to access that.
The easiest way to do this would be to parse the string from the API into an object, which is done with the json_decode function in PHP. You can then access the extract property from that object, and this will give you your string. The code would be something like this:
//Get the string from the API, however you've already done it
$JSONString = getFromAPI();
//Use the inbuilt method to create a JSON object
$JSONObject = json_decode($JSONString);
//Follow the structure to get the pages property
$Pages = $JSONObject->query->pages;
//Here, we don't know what the Page ID is (because the MediaWiki API returns a different number, depending on the page)
//Therefore we need to simply get the first key, and within it should be our desired 'extract' key
$Extract = "";
foreach($Pages as $value) {
    $Extract = $value->extract;
    break;
}
//$Extract now contains our desired text
Writing it to a file
Now we need to write the contents of $Extract to a file, as you mentioned. This can be done as follows, by utilizing the file_put_contents method.
//Can be anything you want
$file = 'APIResult.txt';
// Write the contents to the file,
// using the LOCK_EX flag to prevent anyone else writing to the file at the same time
file_put_contents($file, $Extract, LOCK_EX);
Aaand we're done!
Documentation
The documentation for these functions (json_decode and file_put_contents) can be found at:
http://php.net/manual/en/function.json-decode.php
http://php.net/manual/en/function.file-put-contents.php

You may find pandoc helpful, from http://pandoc.org/ - it understands a number of file formats on input including Mediawiki, and also has a bunch of file formats on output including plain text. It's more the "Swiss army knife" approach, and since Mediawiki is arbitrarily complicated to parse, you'll want to use something like this that's been through a big test suite.
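Since the extract returned by the API above is HTML rather than wiki markup, a minimal sketch of that approach (assuming pandoc is installed, e.g. via sudo apt-get install pandoc on Raspbian) might be:
curl -so - 'https://simple.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Tiger&format=json&exintro=1' |
  jq -r '.query.pages[].extract' |
  pandoc -f html -t plain > tiger.txt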

Getting IPs from a .html file

There is a site with socks4 proxies online that I use in a proxychains program. Instead of manually entering new IPs, I was trying to automate the process. I used wget to turn it into a .html file in my home directory; this is some of the output if I cat the file:
</font></a></td><td colspan=1><font class=spy1>111.230.138.177</font> <font class=spy14>(Shenzhen Tencent Computer Systems Company Limited)</font></td><td colspan=1><font class=spy1>6.531</font></td><td colspan=1><TABLE width='13' height='8' CELLPADDING=0 CELLSPACING=0><TR BGCOLOR=blue><TD width=1></TD></TR></TABLE></td><td colspan=1><font class=spy1><acronym title='311 of 436 - last check status=OK'>71% <font class=spy1>(311)</font> <font class=spy5>-</font></acronym></font></td><td colspan=1><font class=spy1><font class=spy14>05-jun-2020</font> 23:06 <font class=spy5>(4 mins ago)</font></font></td></tr><tr class=spy1x onmouseover="this.style.background='#002424'" onmouseout="this.style.background='#19373A'"><td colspan=1><font class=spy14>139.99.104.233<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(a1j0e5^q7p6)+(m3f6f6^r8c3)+(a1j0e5^q7p6)+(t0b2s9^y5m3)+(w3c3m3^z6j0))</script></font></td><td colspan=1>SOCKS5</td><td colspan=1><a href='/en/anonymous-proxy-list/'><font class=spy1>HIA</font></a></td><td colspan=1><a href='/free-proxy-list/CA/'><font class=spy14>Canada</
As you can see, each IP usually comes right after a spy[0-19]>. I tried to parse out the actual IPs with awk using the following code:
awk '/^spy/{FS=">"; print $2 }' file-name.html
This is problematic because there would be a bunch of other stuff trailing after the IP; also, I guess the anchor only works for the beginning of a line? Anyway, I was wondering if anyone could give me ideas on how to parse out the IP addresses with awk. I just started learning awk, so sorry for the noob question. Thanks
Using a proper XML/HTML parser and an XPath expression:
xidel -se '(//td[@colspan=1]/font[@class="spy1"])[1]/text()' file.html
Output:
111.230.138.177
Or, if it isn't always the first XPath match:
xidel -se '//td[@colspan=1]/font[@class="spy1"]/text()' file.html |
perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'
AWK is great for extracting IP addresses:
gawk -v RS="spy[0-9]*" '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); if (ip) {print ip}}' file.html
Result:
111.230.138.177
139.99.104.233
Explanation:
You must use GNU awk (gawk) if you want the record separator to contain a regular expression.
We divide the file into records, each containing one IP address, by using a regex in the RS variable.
The match function then finds the second regex in the whole record: four groups of 1 to 3 digits separated by dots (the IP address).
The substr function retrieves from the whole record ($0) a fragment of length RLENGTH starting at RSTART (the beginning of the matched regex).
The if checks whether the result is non-empty and, if so, prints it. This protects against empty records producing blank lines in the result.
This way of extracting IP addresses does not depend on the file being well-formed; it does not even have to be HTML.
There are already solutions provided here; I'm adding a different one for future readers, using the egrep utility.
egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html
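If the page happens to list the same address more than once, a small optional addition is to pipe the result through sort -u to drop duplicates:
egrep -o '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.html | sort -u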

Is it possible to extract from a map JSON file a list of a city's neighborhoods in a tree-structure format?

Forgive my ignorance, I am not experienced with JSON files. I've been trying to get a tree structure list of all the neighborhoods and locations in the city of Cape Town and this seems to be my last resort.
Unfortunately, I can't even open the file that can be found on this website - http://odp.capetown.gov.za/datasets/official-suburbs?geometry=18.107%2C-34.187%2C19.034%2C-33.988
Could someone tell me if it's possible to extract such a list?
I'd be forever thankful if someone could help me. Thank you in advance
[I am making my comments an answer since I see no other suggestions and no information provided]
I am on a Unix/Linux shell, but the following tools can also be found for Windows. My solution for getting a quick list would be:
curl https://opendata.arcgis.com/datasets/8ebcd15badfe40a4ab759682aacf8439_75.geojson |\
jq '.features | .[] | .properties.OFC_SBRB_NAME'
Which gives you:
"HYDE PARK"
"SPRINGFIELD"
"NIEUW MAASTRECHT-2"
"CHARLESVILLE"
"WILDWOOD"
"MALIBU VILLAGE"
"TUSCANY GLEN"
"VICTORIA MXENGE"
"KHAYELITSHA"
"CASTLE ROCK"
"MANSFIELD INDUSTRIA"
...
Explanation:
curl https://... - curl downloads the JSON file from the API you are using
jq can process JSON in the terminal and extract information. I do this in three steps:
.features: GeoJSON seems to have a standard schema. All the returned entries are in the features array.
.[] returns all the elements in that array.
.properties.OFC_SBRB_NAME: each element of the array has a field called "properties", which from my understanding carries the metadata of that entry. One of those properties is OFC_SBRB_NAME, which looks like a name and is the only string in each element, so that is what I extract.
Hope it helps. If you add more detail about which platform or language you are using, I can update the answer; however, the methodology should remain the same, I think.
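As a small follow-up sketch (the file name suburbs.txt is just an example): jq's -r flag drops the surrounding quotes, and sort plus a redirect gives an alphabetical list in a text file:
curl https://opendata.arcgis.com/datasets/8ebcd15badfe40a4ab759682aacf8439_75.geojson |
  jq -r '.features | .[] | .properties.OFC_SBRB_NAME' |
  sort > suburbs.txt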

Unexpected results when Select objects using jq

When I add the body to the output list, some wrong names get output. I expected it to output only names from the nfl subreddit in both examples. Feature or bug? How can I output only the tuples for subreddit nfl?
The file:
{"author":"403and780","author_flair_css_class":"NHL-EDM4-sheet1-col01-row17","author_flair_text":"EDM - NHL","body":"Don't get why we do this but can't have a Grey Cup GDT.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn6","is_submitter":false,"link_id":"t3_7v9yqa","parent_id":"t3_7v9yqa","permalink":"/r/hockey/comments/7v9yqa/game_thread_super_bowl_lii_philadelphia_eagles_vs/dtqrsn6/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"hockey","subreddit_id":"t5_2qiel","subreddit_type":"public"}
{"author":"kygiacomo","author_flair_css_class":null,"author_flair_text":null,"body":"lol missed the extra wtf","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn7","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsn7/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"shitpostlord4321","author_flair_css_class":null,"author_flair_text":null,"body":"I really hope we get Bleeding Edge before we get the all new all different armor. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn8","is_submitter":false,"link_id":"t3_7v7whz","parent_id":"t3_7v7whz","permalink":"/r/marvelstudios/comments/7v7whz/a_great_new_look_at_iron_mans_avengers_infinity/dtqrsn8/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"marvelstudios","subreddit_id":"t5_2uii8","subreddit_type":"public"}
{"author":"namohysip","author_flair_css_class":null,"author_flair_text":null,"body":"Maybe. I mostly am just doing this to get a story out, and it\u2019s a huge one, so I\u2019m not sure that I\u2019ll be making another fic for many more months. I guess Pokemon Mystery Dungeon just isn\u2019t as popular with the older demographics of AO3.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsn9","is_submitter":true,"link_id":"t3_7v9psr","parent_id":"t1_dtqrm3t","permalink":"/r/FanFiction/comments/7v9psr/how_do_you_deal_with_bad_reviews/dtqrsn9/","retrieved_on":1518931297,"score":1,"stickied":false,"subreddit":"FanFiction","subreddit_id":"t5_2r5kb","subreddit_type":"public"}
{"author":"SDsc0rch","author_flair_css_class":null,"author_flair_text":null,"body":"if it rates an upvote, I'll click it - I'm not gonna click on low quality \nnot gonna apologize for it either ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsna","is_submitter":false,"link_id":"t3_7vaam4","parent_id":"t3_7vaam4","permalink":"/r/The_Donald/comments/7vaam4/daily_reminderif_you_see_any_gray_arrows_on_the/dtqrsna/","retrieved_on":1518931297,"score":4,"stickied":false,"subreddit":"The_Donald","subreddit_id":"t5_38unr","subreddit_type":"public"}
{"author":"scarletcrawford","author_flair_css_class":null,"author_flair_text":null,"body":"Honestly, I wanted Takeshi to stay with Poe, but to each their own ship, I guess.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnb","is_submitter":false,"link_id":"t3_7upyc0","parent_id":"t1_dtppyry","permalink":"/r/alteredcarbon/comments/7upyc0/season_1_series_discussion/dtqrsnb/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"alteredcarbon","subreddit_id":"t5_3bzvp","subreddit_type":"public"}
{"author":"immortalis","author_flair_css_class":"vikings","author_flair_text":"Vikings","body":"The ghost of MN kickers will haunt this game.","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnc","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnc/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
{"author":"KryptoFreak405","author_flair_css_class":"48","author_flair_text":"","body":"His original backstory had him training to be an Imperial officer until a commanding officer ordered him to transport a shipment of slaves. He refused, freed the slaves, one of which was Chewie, and defected to become a smuggler","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnd","is_submitter":false,"link_id":"t3_7vanzc","parent_id":"t1_dtqr5q5","permalink":"/r/StarWars/comments/7vanzc/solo_a_star_wars_story_big_game_tv_spot/dtqrsnd/","retrieved_on":1518931297,"score":1102,"stickied":false,"subreddit":"StarWars","subreddit_id":"t5_2qi4s","subreddit_type":"public"}
{"author":"thwinks","author_flair_css_class":null,"author_flair_text":null,"body":"Oh. TIL","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsne","is_submitter":false,"link_id":"t3_7v8o0z","parent_id":"t1_dtqg97a","permalink":"/r/gifs/comments/7v8o0z/silly_walk_champion/dtqrsne/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"gifs","subreddit_id":"t5_2qt55","subreddit_type":"public"}
{"author":"Mimi108","author_flair_css_class":"lions","author_flair_text":"Lions","body":"The Big. The Dick. The Nick. ","can_gild":true,"controversiality":0,"created_utc":1517788800,"distinguished":null,"edited":false,"gilded":0,"id":"dtqrsnf","is_submitter":false,"link_id":"t3_7vad8n","parent_id":"t3_7vad8n","permalink":"/r/nfl/comments/7vad8n/super_bowl_lii_game_thread_philadelphia_eagles/dtqrsnf/","retrieved_on":1518931297,"score":2,"stickied":false,"subreddit":"nfl","subreddit_id":"t5_2qmg3","subreddit_type":"public"}
Code example 1, which works OK:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .author'
kygiacomo
immortalis
Mimi108
Code example 2, which is wrong or unexpected to me:
$ cat head_rc.txt | jq -r 'select(.subreddit=="nfl") .body, .author'
403and780
lol missed the extra wtf
kygiacomo
shitpostlord4321
namohysip
SDsc0rch
scarletcrawford
The ghost of MN kickers will haunt this game.
immortalis
KryptoFreak405
thwinks
The Big. The Dick. The Nick.
Mimi108
You can see that author 403and780 commented in the hockey subreddit, not nfl, unfortunately.
jq solution:
jq -r 'select(.subreddit == "nfl") as $o | $o.body, $o.author' head_rc.txt
... as $o - assigns the filtered object to the variable $o
The output:
lol missed the extra wtf
kygiacomo
The ghost of MN kickers will haunt this game.
immortalis
The Big. The Dick. The Nick.
Mimi108
Also need to add streaming once I get this syntax correct.
Some good news - you won't need to use the so-called "streaming parser", because your input has already been chopped up. The "streaming parser" is only needed when the input has one or more individually ginormous JSON entities, whereas you have a (long) stream of small JSON objects.
p.s.
As Charles Duffy suggested, the simplest solution to your selection problem is to use parentheses:
jq -r 'select(.subreddit=="nfl") | (.body, .author)' input.json
If CSV or TSV makes sense, then change the parentheses to brackets, and tack on @csv or @tsv, e.g.:
select(.subreddit=="nfl") | [.body, .author] | @tsv
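As a complete command against the same file (a sketch; the -r flag is needed so @tsv emits plain tab-separated text rather than a JSON string):
jq -r 'select(.subreddit=="nfl") | [.body, .author] | @tsv' head_rc.txt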

Grep all prices from file

Is it possible to somehow grep all prices from a file and list the output? A price begins with "$" and may contain digits, "," and ".".
I've tried the best solutions from this question, but they output the whole file or the entire string containing a price.
The pattern I use is simple: \$
The page on the web I want to grep: http://www.ned.org/
Example of the page source:
<p><strong>Better Understanding Public Attitudes and Opinions</strong>
</p>
<p>Democratic Ideas and Values</p>
<p>$43,270</p>
<p>To monitor and better understand public views on key social, political, and economic developments. Citizens’ opinions will be tracked, documented, and studied ahead of and after the country’s September 2016 parliamentary elections. The results and accompanying analysis will be disseminated through print and electronic publications, a website, and independent media.</p>
<p><strong> </strong></p>
From this piece of HTML I want to output something like 43,270 or maybe 43270. Just too lazy to write a parser :)
Something like this seems to work fine for my tests:
$ echo "$prices"
tomato $30.10
potato $19.1
apples=$2,222.1
oranges:$1
peach="$22.1",discount 10%,final price=$20
$ egrep -o '\$[0-9]+([.,][0-9]+)*' <<<"$prices"
$30.10
$19.1
$2,222.1
$1
$22.1
$20
Real test with your web page:
$ links -dump "http://www.ned.org/region/central-and-eastern-europe/belarus-2016/" |egrep -o '\$[0-9]+([.,][0-9]+)*'
$43,270
$25,845
$55,582
$14,940
$44,100
$35,610
$54,470
$60,200
$33,150
$15,720
$35,160
$45,500
$72,220
$26,330
$53,020
$27,710
$22,570
$40,145
# more prices follow below
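If you want the figures without the dollar sign, or without the thousands separators as mentioned in the question, one option (a sketch) is to append tr to delete those characters:
links -dump "http://www.ned.org/region/central-and-eastern-europe/belarus-2016/" |
  egrep -o '\$[0-9]+([.,][0-9]+)*' | tr -d '$,'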

Windows - How to grep (or findstr) html files and showing the first matching expression

Using grep or findstr, I want to get the correct IMDb number when searching for a specific movie via its real name.
For example the movie "Das Boot" is listed at IMDB with movie number tt0082096.
Actually I'm trying to grep (or findstr) through HTML files that are generated by a search engine.
The generated html file contains several parts like this:
<div id="statbox">
<span class="uschr2">1. </span> Das Boot (1981) - IMDb <br>
<div id="descbox">
www.imdb.com/title/tt0082096/ - Im Cache - Ähnliche Seiten <BR>
</div>
The string I'm looking for is the one containing the URL of the movie. In this case it's:
http://www.imdb.com/title/tt0082096/
The string format is like:
http://www.imdb.com/title/tt???????/
Where '?' stands for a digit 0...9
My question is:
How can grep or findstr return only the first occurrence of the matching string itself and not the complete line containing a match?
Thank you a lot for your assistance!
Best regards
Windows findstr returns complete lines. You can avoid this with GNU sed:
sed -rn "\#http://www.imdb.com/title/tt#s#.*href=\"(.*)\"\s.*#\1#p" file
http://www.imdb.com/title/tt0082096/
In addition you can use grep -o:
-o, --only-matching show only the part of a line matching PATTERN
With grep you can do something like:
grep -oP '(?<=href=\")[^"]+(?=\")' html.file
This is not the ideal way of parsing an HTML file. However, if it is a one-off thing then you can probably get away with it. (?<=href=\") is a look-behind assertion. If the above returns a lot of extra matches, you can add something to the pattern that is unique to the URL lines.
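To return only the first occurrence, as asked, GNU grep can combine -o with -m 1 (stop after the first matching line); a sketch that matches the tt??????? pattern directly, where html.file is just a placeholder name and head -n 1 keeps a single match if the first matching line contains several:
grep -oP -m 1 '(http://)?www\.imdb\.com/title/tt[0-9]{7}/' html.file | head -n 1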