Get an article summary from the MediaWiki API - mediawiki

I am looking for a MediaWiki API call that returns a short description for any query string. For example, if I search for Nicolas Cage, it should return a short description of him.
I tried http://en.wikipedia.org/w/api.php?%20format=json&action=query&titles=Nicolas%20Cage&prop=revisions&rvprop=content
I am not sure if prop=revisions is right. My intention is to get a short description from the latest version of the page.
I also need another API call that returns the link to the Wikipedia page (web / mobile) for the query string, i.e. for Nicolas Cage, http://en.wikipedia.org/wiki/Nicolas_cage should be returned.

There is no such thing as a page summary in MediaWiki by default, but you can get the lead section of a page (everything before the first heading) like this: http://en.wikipedia.org/w/api.php?action=parse&page=Nicolas_Cage&prop=text&section=0
If the wiki has the extension PageSummaries installed, you can use that to get exactly what you are asking for (like in this example from the extension description page).
To find pages matching a string, you use the open search function, like this: http://en.wikipedia.org/w/api.php?action=opensearch&search=Nicolas%20cage&namespace=0
Edit: @Bergi points out in the comments that open search also gives a summary of the page. I had somehow missed that.
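As a rough illustration of the open search call above, here is a minimal Python sketch. It assumes the JSON opensearch layout of four parallel arrays (query, titles, descriptions, URLs). The helper names and the sample response are made up for the example, and the limit and format parameters are additions not shown in the URL above. Note that a wiki may return empty descriptions.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def opensearch_url(term, limit=1):
    """Build an opensearch request URL for the given search term."""
    params = {
        "action": "opensearch",
        "search": term,
        "namespace": 0,
        "limit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def first_match(response_text):
    """Return (title, description, url) of the first hit, or None."""
    _query, titles, descriptions, urls = json.loads(response_text)
    if not titles:
        return None
    return titles[0], descriptions[0], urls[0]

# Hand-written sample in the opensearch layout (not a live response).
SAMPLE = json.dumps([
    "Nicolas cage",
    ["Nicolas Cage"],
    ["Placeholder summary text."],
    ["https://en.wikipedia.org/wiki/Nicolas_Cage"],
])

print(opensearch_url("Nicolas cage"))
print(first_match(SAMPLE))
```

The last element of the tuple is the page link the question asks for, so one request can cover both needs.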

Say you want the summary for the search string Nicolas Cage.
Step 1. Get the page id: "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Nicolas%20Cage&format=json&srlimit=1"
Step 2. Use this page id to get section 0 of the page:
"https://en.wikipedia.org/w/api.php?action=parse&section=0&pageid=21111&prop=text&format=json"
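Step 3. Parse the returned HTML as your requirements dictate.
Step 3 extended for Python: use BeautifulSoup to select the target tags; get_text() returns the plain text.
If you want the wikitext of the latest revision instead, use prop=revisions with rvprop=content; see the MediaWiki documentation for details.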
Alternate Solution:
Step 1. Get page title using step 1 above.
Step 2. Use the title as follows: https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Nicolas%20Cage
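The alternate solution can be sketched in Python roughly as follows. This is only an illustration: the function names are hypothetical, the two sample responses are hand-written in the default JSON layout (pages keyed by page id), and the actual HTTP requests are left out.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def search_url(term):
    """Step 1: URL that asks for the single best search hit."""
    return API + "?" + urlencode({
        "action": "query", "list": "search", "srsearch": term,
        "srlimit": 1, "format": "json"})

def top_hit(search_json):
    """Return (pageid, title) of the first search hit, or None."""
    hits = json.loads(search_json)["query"]["search"]
    return (hits[0]["pageid"], hits[0]["title"]) if hits else None

def extract_url(title):
    """Step 2: URL that asks for the plain-text intro of a page."""
    return API + "?" + urlencode({
        "format": "json", "action": "query", "prop": "extracts",
        "exintro": "", "explaintext": "", "titles": title})

def intro_text(extract_json):
    """Pull the extract out of the response; pages are keyed by page id."""
    pages = json.loads(extract_json)["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract")

# Hand-written samples, not live responses.
SEARCH_SAMPLE = '{"query": {"search": [{"ns": 0, "title": "Nicolas Cage", "pageid": 21111}]}}'
EXTRACT_SAMPLE = ('{"query": {"pages": {"21111": {"pageid": 21111, "ns": 0, '
                  '"title": "Nicolas Cage", "extract": "Placeholder intro text."}}}}')

pageid, title = top_hit(SEARCH_SAMPLE)
print(extract_url(title))
print(intro_text(EXTRACT_SAMPLE))
```

prop=extracts comes from the TextExtracts extension, so this assumes the target wiki has it installed, as English Wikipedia does.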

Trying to extract data from Google results pages for a specific domain

I'm trying to extract the URL, title and description from the SERPs for just one domain. This means I have to use the URL in some sort of "contains" function in order to get the corresponding title and description, right?
Since Google has the URL and the title within the same element, I could get these easily via XPath.
My issue right now is the description, which is outside the initial element where the URL is. So far I have tried XPath as well as regex and couldn't find a way to make it work.
Here is some code that didn't work:
Xpath:
//div[@class="jtfYYd"]/a[starts-with(@href,'https://www.example.com')]//*[@class="NJo7tc Z26q7c uUuwM"]/div
Regex:
A: ["']href="https://www.example.com["']<div class="NJo7tc Z26q7c uUuwM"["']>(.*?)
B: (?=["']https://www.example.com["'])(?=["']NJo7tc Z26q7c uUuwM["'])(.*?)
I can only use Xpath1.0 since the tool (Screaming Frog) doesn't support Xpath 2.0. I hope someone has a solution.

Send and receive data to and from a website using the TWebbrowser component in Delphi

I'm creating a VCL application with Delphi 10.3 and want to support some web functionality: the user enters the ISBN of a book into a TEdit component, and the application sends this value to the search field on this website: https://isbnsearch.org, after which the website looks up the ISBN and displays the author of the book. I want to somehow access the information (i.e. the author) presented in the search result and use it in my application.
This is my GUI, for a better idea of what I want to accomplish:
What code can I use for this? Any other feasible suggestions or approaches are acceptable.
When performing a search on that website, it simply loads a page with a specific URL query string...
https://isbnsearch.org/search?s=suess
The above example is when I search for "suess", so you can easily concatenate a search URL.
You can use any HTTP component, such as TIdHTTP, to load this search page, then use an HTML parser to scrape the page and read what you need. Much, much easier than trying to read through the TWebBrowser.
In the end, you won't actually display the HTML (I mean you can if you want to), but the idea is to read the data and display it in your own format.
On that specific page, start by locating the ul element with id searchresults. Each li element inside it contains an individual result. Unfortunately, this website uses pagination and only shows 10 results per page. To get the remaining results, request the page again with an extra parameter: &p=2 for the 2nd page, &p=3 for the 3rd page, and so on.
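The scraping approach is language-independent, so here is a minimal sketch of it in Python rather than Delphi, using only the standard library. The s and p parameters and the searchresults id come from the description above. The markup inside each li is an assumption (the sample HTML is hand-written), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

def search_page_url(term, page=1):
    """Build the search URL; the p parameter selects pages after the first."""
    params = {"s": term}
    if page > 1:
        params["p"] = page
    return "https://isbnsearch.org/search?" + urlencode(params)

class ResultItems(HTMLParser):
    """Collect the text of each <li> inside <ul id="searchresults">."""

    def __init__(self):
        super().__init__()
        self.in_list = False
        self.current = None  # text chunks of the <li> being read
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and dict(attrs).get("id") == "searchresults":
            self.in_list = True
        elif tag == "li" and self.in_list:
            self.current = []

    def handle_endtag(self, tag):
        if tag == "li" and self.current is not None:
            self.items.append(" ".join(self.current))
            self.current = None
        elif tag == "ul" and self.in_list:
            self.in_list = False

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current.append(data.strip())

# Hand-written sample; the real page's inner markup may differ.
SAMPLE = """
<ul id="searchresults">
  <li><h2><a href="/isbn/0000000000">Example Title</a></h2><p>Author: Example Author</p></li>
  <li><h2><a href="/isbn/1111111111">Another Title</a></h2><p>Author: Another Author</p></li>
</ul>
<ul><li>footer link</li></ul>
"""

parser = ResultItems()
parser.feed(SAMPLE)
print(search_page_url("suess", page=2))
print(parser.items)
```

This simple parser does not handle lists nested inside a result; a full HTML parsing library would be more robust.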
On the other hand, scraping is the worst way to acquire such information. What you should be doing is using a proper API which gives you machine-friendly data. The service you are referencing doesn't appear to offer one, but here's an example of one which does:
https://openlibrary.org/dev/docs/api/books - this also appears to provide you MUCH more information than the one you're using.
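For comparison, a lookup against the Open Library Books API might look roughly like this in Python. It is a sketch based on the bibkeys / jscmd=data form described on that documentation page. The sample response is hand-written and heavily abridged, the ISBN is a placeholder, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

def books_api_url(isbn):
    """Build a Books API request for one ISBN."""
    params = {"bibkeys": "ISBN:" + isbn, "format": "json", "jscmd": "data"}
    return "https://openlibrary.org/api/books?" + urlencode(params)

def author_names(response_text, isbn):
    """Return the author names for the ISBN, or an empty list if unknown."""
    record = json.loads(response_text).get("ISBN:" + isbn, {})
    return [author["name"] for author in record.get("authors", [])]

# Hand-written, abridged sample (placeholder values, not a live response).
SAMPLE = json.dumps({
    "ISBN:0000000000": {
        "title": "Example Title",
        "authors": [{"name": "Example Author"}],
    }
})

print(books_api_url("0000000000"))
print(author_names(SAMPLE, "0000000000"))
```

Because the response is JSON keyed by the requested bibkey, no HTML parsing is needed at all.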

Onenote page hierarchy

Let's say I have a notebook named 'MyNotebook'. This notebook has a section group 'Group1', and 'Group1' contains another section group, 'Group2'. Inside 'Group2' I have a section 'Section1', which has a page 'Page1'.
If we look at this like a directory structure, the path to the page will be MyNotebook/Group1/Group2/Section1/Page1
When I get the page using the get page API, I only get its immediate parent, i.e. Section1. How can I get the complete hierarchy?
What API specifically are you using to get pages?
If you are using GET https://www.onenote.com/api/v1.0/me/notes/pages, this will give you all the pages, though that API has limitations (For example, it is paginated, so it will only give you the most recent 20 pages. In addition, it won't work if the user has a big number of sections).
https://blogs.msdn.microsoft.com/onenotedev/2017/07/21/a-few-performance-tips-for-using-the-onenote-api/
See the section "When getting all pages for a user, do so for each section separately"
I recommend you make a call like:
GET https://www.onenote.com/api/v1.0/me/notes/Notebooks?$expand=sections,sectionGroups($expand=sections,sectionGroups($levels=max;$expand=sections))
To obtain all the sections, and then make a call like:
GET https://www.onenote.com/api/v1.0/me/notes/sections/{id}/pages
To obtain each section's pages.
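Once the expanded notebooks response is in hand, the nested structure can be flattened into paths. The Python sketch below assumes each notebook, section group and section carries name and id fields, and that expanded children appear under sections and sectionGroups. The sample data and the helper names are made up for illustration.

```python
def section_paths(container, prefix=""):
    """Yield (path, section_id) for every section under a notebook or section group."""
    here = prefix + container["name"]
    for section in container.get("sections", []):
        yield here + "/" + section["name"], section["id"]
    for group in container.get("sectionGroups", []):
        # Recurse into nested section groups, extending the path.
        yield from section_paths(group, here + "/")

def pages_url(section_id):
    """URL of the per-section pages call."""
    return "https://www.onenote.com/api/v1.0/me/notes/sections/" + section_id + "/pages"

# Hand-written sample shaped like the expanded notebooks response.
RESPONSE = {"value": [{
    "id": "nb-1", "name": "MyNotebook", "sections": [],
    "sectionGroups": [{
        "id": "g-1", "name": "Group1", "sections": [],
        "sectionGroups": [{
            "id": "g-2", "name": "Group2", "sectionGroups": [],
            "sections": [{"id": "s-1", "name": "Section1"}],
        }],
    }],
}]}

paths = [item for notebook in RESPONSE["value"] for item in section_paths(notebook)]
for path, section_id in paths:
    print(path, "->", pages_url(section_id))
```

Each page returned by the per-section call can then be appended to its section's path.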
In addition to what Jorge said, if you specifically want the upwards hierarchy (and not downwards), you could do:
GET https://www.onenote.com/api/v1.0/me/notes/pages?$expand=parentSection($expand=parentSectionGroup($expand=parentSectionGroup($expand=parentNotebook)))
But as Jorge said, be careful when using the GET pages API since it has some limitations
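With the upward $expand, the path can be rebuilt by following the parent links from a page. A minimal Python sketch, assuming pages expose title while sections, section groups and notebooks expose name (the sample page is hand-written and page_path is a hypothetical helper):

```python
def page_path(page):
    """Walk parentSection -> parentSectionGroup(s) -> parentNotebook and join the names."""
    parts = [page["title"]]
    node = page.get("parentSection")
    while node:
        parts.append(node["name"])
        # A node has either a parent section group or, at the top, a parent notebook.
        node = node.get("parentSectionGroup") or node.get("parentNotebook")
    return "/".join(reversed(parts))

# Hand-written sample shaped like one entry of the expanded pages response.
PAGE = {
    "title": "Page1",
    "parentSection": {
        "name": "Section1",
        "parentSectionGroup": {
            "name": "Group2",
            "parentSectionGroup": {
                "name": "Group1",
                "parentNotebook": {"name": "MyNotebook"},
            },
        },
    },
}

print(page_path(PAGE))
```

The walk only sees as many levels as the $expand clause requested, so deeper nesting needs a deeper expand.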

MediaWiki API: how to choose a page from the response

When I make an API query, I sometimes get a list of several pages. For example,
http://en.wikipedia.org/wiki/Ask gives a lot of pages, but I need the website "Ask.com, a web search engine, formerly Ask Jeeves".
Can I restrict the query to some category ("websites")?
How can I check the category of each page in the response?
Thanks
There is no trivial way to do what you're asking. You could do something like this:
Get the list of pages that the disambiguation page links to. You could do this by listing the links on that page (action=query&prop=links).
Get the categories of all the pages from the previous step and use them to decide which page is the one you're looking for. This is not that simple, because Ask.com is not directly in Category:Websites; it's in one of its subcategories.
I have list with few pages, for example http://en.wikipedia.org/wiki/Ask
The problem is that you're not getting a list of pages, you just are getting an ordinary page which is in the disambiguation pages category. To get the list, you need to get the links in that page.
can I make query only for some category ("websites")?
No, mediawiki does not support that.
How I can check category for each page in response?
Use the links property as a title list generator and get the categories of each page in the response. In your case, that would be http://en.wikipedia.org/w/api.php?action=query&titles=Ask&generator=links&prop=categories (don't forget to continue the query).
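To show how the generator query and its continuation might be handled, here is a hedged Python sketch. It assumes the JSON layout where pages are keyed by page id and continuation values arrive in a top-level continue object (older API versions used query-continue instead). The sample response, its titles and the category names are placeholders, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def link_categories_url(title, continuation=None):
    """Build the generator=links query; merge in continuation values if given."""
    params = {"action": "query", "titles": title, "generator": "links",
              "prop": "categories", "format": "json"}
    params.update(continuation or {})
    return API + "?" + urlencode(params)

def titles_in_category(response_text, category):
    """Return titles of linked pages that list the given category."""
    pages = json.loads(response_text).get("query", {}).get("pages", {})
    return sorted(
        page["title"] for page in pages.values()
        if any(c["title"] == category for c in page.get("categories", []))
    )

# Hand-written sample with placeholder titles and categories.
SAMPLE = json.dumps({
    "continue": {"clcontinue": "123|Example", "continue": "||"},
    "query": {"pages": {
        "100": {"title": "Ask.com",
                "categories": [{"ns": 14, "title": "Category:Placeholder websites"}]},
        "200": {"title": "Some other page",
                "categories": [{"ns": 14, "title": "Category:Placeholder other"}]},
        "300": {"title": "Page without categories yet"},
    }},
})

print(titles_in_category(SAMPLE, "Category:Placeholder websites"))
next_params = json.loads(SAMPLE).get("continue")
print(link_categories_url("Ask", next_params))
```

A page may show no categories in one batch and gain them in a later one, so results should be merged across all continued requests before deciding.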
If you are OK with "full-text search" for "ask",
you can do that like this:
http://en.wikipedia.org/w/api.php?format=json&action=query&generator=search&gsrsearch=ask%20incategory:%22Online%20companies%22&prop=info
As you can see, "search" text is [ask incategory:"Online companies"]
The same solution also can be seen at:
Wikipedia API: how to search for a term in a specific category

Parsing a website and getting the info I need

Hi, I need to retrieve the URL of the first article for a term I search for on nytimes.com.
So if I search for Apple, this link would return the result:
http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse
And you just replace Apple with the term you are searching for.
If you click on that link, you will see that the NY Times asks whether you mean Apple Inc.
I want to get the url for this link, and go to it.
Then you will just get a lot of information on Apple Inc.
If you scroll down you will see the articles related to Apple.
So what I ultimately want is the URL of the first article on this page.
So I really do not know how to go about this. Do I use Java, or what do I use? Any help would be greatly appreciated and I would put a bounty on this later, but I need the answer ASAP.
Thanks
EDIT: Can we do this in Java?
You can use Python with the standard urllib module to fetch the pages and the great HTML parser BeautifulSoup to obtain the information you need from the pages.
From the documentation of BeautifulSoup, here's sample code that fetches a web page and extracts some info from it:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
This is a nice and detailed article on the topic.
You certainly can do it in Java. Look at the HttpURLConnection class. Basically, you give it a URL, call the connect function, and you get back an input stream with the contents of the page, i.e. HTML text. You can then process that and parse out whatever information you want.
You're facing two challenges in the project you are describing. The first, and probably really the lesser challenge, is figuring out the mechanics of how to connect to a web page and get hold of the text within your program. The second and probably bigger challenge will be to figure out exactly how to extract the information you want from that text. I'm not clear on the details of your requirements, but you're going to have to sort through a ton of text to find what you're looking for. Without actually looking at the NY Times site at the moment, I'm sure it has all sorts of decorations like pretty pictures and the company logo and headlines and so on, and then there are going to be menus and advertisements and all sorts of stuff. I sincerely doubt that the NY Times or almost any other commercial web site is going to return a search page that includes nothing but a link to the article you are interested in. Somehow your program will have to figure out that the first link is to the "subscribe on line" page, the second is to an advertisement, the third is to customer service, the fourth and fifth are additional advertisements, the sixth is to the home page, etc. etc. until you finally get to the one you're actually interested in. How will you identify the interesting link? There are probably headings or formatting that make it recognizable to a human being, but you use a lot of intuition to screen out the clutter, and that can be difficult to reproduce in a program.
Good luck!
You can do this in C# using the HTML Agility Pack, or using LINQ to XML if the site is valid XHTML. EDIT: It isn't valid XHTML; I checked.
The following (tested) code will get the URL of the first search result:
var doc = new HtmlWeb().Load(@"http://query.nytimes.com/search/sitesearch?query=Apple&srchst=cse");
var url = HtmlEntity.DeEntitize(doc.DocumentNode.Descendants("ul")
    .First(ul => ul.Attributes["class"] != null
              && ul.Attributes["class"].Value == "results")
    .Descendants("a")
    .First()
    .Attributes["href"].Value);
Note that if their website changes, this code might stop working.
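For readers not using C#, roughly the same idea (take the first link inside the ul with class "results") can be sketched with Python's standard-library HTML parser. The class name comes from the C# code above; the sample HTML is hand-written, FirstResultLink is a hypothetical helper, and fetching the page is left out.

```python
from html.parser import HTMLParser

class FirstResultLink(HTMLParser):
    """Remember the href of the first <a> found inside <ul class="results">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth of <ul> tags inside the results list
        self.url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            # Start counting at the results list; keep counting nested lists.
            if self.depth or attrs.get("class") == "results":
                self.depth += 1
        elif tag == "a" and self.depth and self.url is None:
            # Attribute values arrive with entities such as &amp; already decoded.
            self.url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1

# Hand-written sample; the real search page markup may differ.
SAMPLE = """
<a href="/subscribe">Subscribe</a>
<ul class="nav"><li><a href="/home">Home</a></li></ul>
<ul class="results">
  <li><a href="http://example.com/article?id=1&amp;src=search">First article</a></li>
  <li><a href="http://example.com/article?id=2">Second article</a></li>
</ul>
"""

parser = FirstResultLink()
parser.feed(SAMPLE)
print(parser.url)
```

Unlike the C# version, this looks for the first link in any results list rather than only the first such list, which only matters if the first list contains no links.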