I would like to parse the content of a Wikipedia page, but I am missing something that I do not understand. Can someone help me?
Example:
I have a Wikipedia page:
https://it.wikipedia.org/wiki/Anni_690_a.C.
In this page a Chinese politician is mentioned: "Jin Wen Gong"
I tried to use the following web service to get the content, but the JSON contains no data about "Jin Wen Gong".
https://it.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&titles=Anni_690_a.C.&rvprop=content&format=json
How do I parse Wikipedia correctly?
The part you are looking for is not directly in the content of that page. You can see this if you start editing the page: the wikitext does not mention Jin Wen Gong either.
The part where you see it is generated from this piece of wiki-code:
{{Bio decennio a.C.|Morti|69}}
This code is in the JSON.
On Wikipedia that translates to a list of people (probably people who died in the mentioned year, if I read the Italian correctly).
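To make the difference concrete, here is a minimal Python sketch, not a definitive implementation. It assumes that `action=parse` with `prop=text` returns the rendered HTML (templates expanded), while the `prop=revisions` query above returns raw wikitext containing only the template call. The helper names `find_template_calls` and `rendered_html_url` are made up for this illustration.

```python
import re
from urllib.parse import urlencode

API = "https://it.wikipedia.org/w/api.php"


def find_template_calls(wikitext):
    """Return the raw contents of every non-nested {{...}} call in the wikitext."""
    return re.findall(r"\{\{([^{}]+)\}\}", wikitext)


def rendered_html_url(title):
    """Build an action=parse URL (assumed to return HTML with templates expanded)."""
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    return API + "?" + urlencode(params)


# The raw wikitext only contains the template call, not the names it renders to.
print(find_template_calls("{{Bio decennio a.C.|Morti|69}}"))
print(rendered_html_url("Anni_690_a.C."))
```

Fetching the printed URL should give the HTML in which the generated list of people appears, which can then be parsed like any other HTML.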
Related
Is there a way to get the intro content from a Wikipedia page into my MediaWiki page? I was thinking of using Wikipedia's API, but I don't know how to parse the URL on my page, or how to deal with templates. I just want a query that will display the introduction part of a Wikipedia page on my page.
I used the External_Data Extension and Wikipedia's api to achieve this.
The API
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=[title of wikipedia page]
How I used it
{{#get_web_data:
url=http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles={{PAGENAME}}
|format=JSON|data=extract=extract}}
How I displayed the extract on pages
{{#external_value:extract}}
However, I still need to figure out how to get only one paragraph from the returned text. I will probably use a parser function.
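Outside the wiki itself, the "only one paragraph" step could be sketched like this in Python, using only the standard library. This is an illustration under assumptions, not the parser-function solution mentioned above: it assumes the extract is an HTML string made of `<p>` elements, and the names `_FirstParagraph` and `first_paragraph` are invented for the example.

```python
from html.parser import HTMLParser


class _FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done and not self.in_p:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            if "".join(self.parts).strip():
                self.done = True   # found a paragraph with real text
            else:
                self.parts = []    # empty paragraph, keep looking

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def first_paragraph(html):
    """Return the plain text of the first non-empty paragraph in the HTML."""
    parser = _FirstParagraph()
    parser.feed(html)
    return "".join(parser.parts).strip()


print(first_paragraph("<p></p><p>A <b>railgun</b> is a device.</p><p>Second.</p>"))
```

Inline tags such as `<b>` are dropped and only their text is kept, so the result is plain text rather than HTML.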
I'm looking for a library to parse HTML pages, specifically Wikipedia articles, for example http://en.wikipedia.org/wiki/Railgun. I want to extract the article's text and images (the full-scale or original image, not the thumbnail).
Is there an HTML parser out there?
I would prefer not to use the Wikimedia API, since I can't seem to figure out how to extract an article's text and the full-size images with it.
Thanks and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the JSON string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the JSON.
I know that in JavaScript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
Or you could download the whole of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too. Check the info on https://www.mediawiki.org/wiki/Extension:TextExtracts/pt; it works like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
Depending on how many pages you need, you should consider using the public dumps if the volume of pages is high.
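For the Python 3 part of the question, a minimal sketch with only the standard library could look like the following. It mirrors the JavaScript expression `JSON.parse(...).parse.text["*"]` from the question and assumes the default JSON layout of `action=parse`. The function names and the User-Agent string are placeholders chosen for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def extract_html(json_text):
    """Return the rendered HTML from an action=parse JSON response."""
    data = json.loads(json_text)
    return data["parse"]["text"]["*"]


def fetch_page_html(pageid):
    """Download the parse result for a page ID and return its HTML."""
    params = {"action": "parse", "pageid": pageid, "prop": "text", "format": "json"}
    request = Request(
        API + "?" + urlencode(params),
        # Hypothetical User-Agent string; replace it with your own contact details.
        headers={"User-Agent": "example-script/0.1 (you@example.com)"},
    )
    with urlopen(request) as response:
        return extract_html(response.read().decode("utf-8"))
```

Calling `fetch_page_html(218930)` would request the page from the question; the returned string can then be passed to any HTML parser.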
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML of Wikipedia articles and gives you back structured JSON objects that describe the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do a quick extraction of text and sections and understand how things are laid out.
I've been going through the docs for the past few hours and simply can't seem to figure this out, though it is probably simple.
I have this link:
http://en.wikipedia.org/w/api.php?format=xml&action=expandtemplates&titles=Arabinose&text={{Chembox%20Elements}}&prop=wikitext
This obviously gives me the schema of the template (Template Chembox | Chembox Elements in this case).
All I want is to retrieve the molecular formula content/data/value for the given page/title without having to parse the entire wiki content at my end.
I understand that prop=wikitext returns wikitext in the above example, and that expandtemplates has no prop=text option. I've been back and forth with action=query, expandtemplates etc., with no joy.
MediaWiki's API won't do the work for you. You'll have to parse your own results.
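As a rough illustration of "parsing your own results", the Python sketch below pulls one named parameter out of a template call in raw wikitext. It is deliberately naive: it assumes `| name = value` parameters and does not handle nested templates or links inside the value. The function name `template_param` and the sample wikitext are hypothetical and are not the actual content of the Arabinose page.

```python
import re


def template_param(wikitext, name):
    """Return the value of the first '| name = value' parameter, or None."""
    pattern = r"\|\s*" + re.escape(name) + r"\s*=\s*([^|\n}]*)"
    match = re.search(pattern, wikitext)
    return match.group(1).strip() if match else None


# Hypothetical wikitext, only to show the call; real infobox markup may differ.
sample = "{{Chembox\n| Name = Example\n| Formula = C5H10O5\n}}"
print(template_param(sample, "Formula"))
```

For anything beyond simple cases, a dedicated wikitext parser is likely to be more robust than a regular expression.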
How do I convert HTML into Prolog?
I need to extract the tags from an HTML page and describe them in Prolog.
For example, if my file contains this HTML code
<title>Prove</title>
<select id="data_nastere_zi" name="data_nastere_zi">
I should get
title(Prove),
select(id(data_nastere_zi)).
I looked at various libraries but could not find one that does this.
Thanks.
You can parse well-formed HTML using SWI-Prolog's library(sgml), in particular load_html/2.
In my experience, scraping 'real world' websites isn't really pleasant, because of insufficient error handling.
Anyway, once you have loaded the page structure, library(xpath) is available to inspect such complex data.
Edit: getting a table inside a div:
xpath(Page, //div, Div),
xpath(Div, //table, Table)...
SWI-Prolog has a package for SGML/XML parsing based on the SWI-Prolog interface to SP by Anjo Anjewierden: "SWI-Prolog SGML/XML parser".
Hi, I need to retrieve the URL of the first article for a term I search for on nytimes.com.
So if I search for Apple, this link returns the result:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
You just replace Apple with the term you are searching for.
If you click on that link, you will see that the NY Times asks whether you mean Apple Inc.
I want to get the URL for this link and go to it.
Then you get a lot of information on Apple Inc.
If you scroll down, you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
I really do not know how to go about this. Do I use Java, or something else? Any help would be greatly appreciated. I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from the pages.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
# Python 2 code using BeautifulSoup 3.
import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and parse it.
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)

# Each incident is a table cell with width="90%".
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably the lesser challenge, is figuring out the mechanics of how to connect to a web page and get hold of the text within your program. The second and probably bigger challenge will be to figure out exactly how to extract the information you want from that text. I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures and the company logo and headlines and so on, and then there are going to be menus and advertisements and all sorts of stuff. I sincerely doubt that the NY Times or almost any other commercial web site is going to return a search page that includes nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe on line" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, and so on, until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but you use a lot of intuition to screen out the clutter, and that can be difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
// Requires the HtmlAgilityPack package and System.Linq.
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
// Take the first link inside <ul class="results"> and decode HTML entities in its href.
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.
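The same idea can be sketched in Python with only the standard library. This is an illustration, not a tested scraper for the live site: it assumes, as the C# snippet does, that the results sit in a `<ul>` whose class is "results", and the names `_FirstResultLink` and `first_result_link` are invented for the example. Unlike the C# code, it also accepts a `class` attribute that lists "results" among other class names.

```python
from html.parser import HTMLParser


class _FirstResultLink(HTMLParser):
    """Remember the href of the first <a> inside <ul class="...">."""

    def __init__(self, list_class):
        super().__init__()
        self.list_class = list_class
        self.ul_depth = 0   # >0 while inside the results list (counts nested <ul>)
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            if self.ul_depth > 0:
                self.ul_depth += 1
            elif self.list_class in (attrs.get("class") or "").split():
                self.ul_depth = 1
        elif tag == "a" and self.ul_depth > 0 and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1


def first_result_link(html, list_class="results"):
    """Return the first link URL inside the results list, or None if absent."""
    parser = _FirstResultLink(list_class)
    parser.feed(html)
    return parser.href
```

Links outside the results list (navigation, advertisements and so on) are skipped because they are not inside the matching `<ul>`, which is one simple way to screen out the clutter described above.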