get internal links from Introduction in article - mediawiki

For a given article on Wikipedia, I would like to use the Mediawiki API to extract the internal links from the introduction section of an article.
Similar to the prop=extracts&exintro= setting but with the contents of the links.

Get the lead wikitext with either extracts or revisions for section 0, then pass it to parse and set prop=links. The extracts route has sanity checks against a very long intro or an article with no section headings at all, but the result might be cut off at an awkward position.
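A minimal sketch of the revisions route described above, in Python. It only builds the two requests and reads link titles out of a parse response; the helper names are hypothetical, the network calls are left out, and the response shape (a "links" list whose entries carry "ns" and either "*" or "title") is assumed from the usual JSON output of action=parse.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def lead_wikitext_url(title):
    """URL that asks for the wikitext of section 0 (the lead) of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": "0",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def parse_links_params(wikitext, title):
    """Parameters for action=parse; send them as a POST body, since the
    lead wikitext can be too long for a query string."""
    return {
        "action": "parse",
        "text": wikitext,
        "title": title,
        "contentmodel": "wikitext",
        "prop": "links",
        "format": "json",
    }


def link_titles(parse_response, namespace=0):
    """Titles of the links in a parse response, limited to one namespace.

    Assumption: each link entry has "ns" and its title under "*"
    (or under "title" when formatversion=2 is used)."""
    links = parse_response.get("parse", {}).get("links", [])
    return [l.get("title") or l.get("*") for l in links if l.get("ns") == namespace]
```

Namespace 0 keeps only links to articles; templates, categories and files used in the lead are dropped.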

Related

MediaWiki API to fetch all links from the See Also section of an article

https://www.mediawiki.org/wiki/API:Query
Based on the link above, please suggest an appropriate API call to fetch all the links in the See Also section of an article.
For Example:
I want a list of the above 5 links.
There is an API to get the external links associated with an article:
https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Pune&prop=extlinks
Now, coming back to the original question about See Also links: if there is no proper API, how can we extract the same links when we have the page's wikitext?
Example of wikitext:
https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Pune&prop=revisions&rvprop=content
As far as I know, there's no way to do this in a single call. You can use https://en.wikipedia.org/w/api.php?action=parse&page=Pune&format=json&prop=sections to list all of the sections in an article, then iterate through the results to find the index of the section where 'line' == 'See also' (42 in this case), and then use https://en.wikipedia.org/w/api.php?action=parse&page=Pune&format=json&section=42 to get just that section.
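The two calls above can be sketched in Python as follows. The function names are hypothetical and no request is sent; the sketch assumes the sections response has the shape {"parse": {"sections": [{"line": ..., "index": ...}, ...]}}. Adding prop=links to the second call is an assumption that goes beyond the answer: it asks the API to return only the links of that section instead of its full HTML.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def sections_url(page):
    """URL that lists every section of a page."""
    params = {"action": "parse", "page": page, "format": "json", "prop": "sections"}
    return API + "?" + urlencode(params)


def find_section_index(sections_response, heading="See also"):
    """Return the 'index' of the section whose 'line' equals the heading,
    or None if the page has no such section."""
    for section in sections_response.get("parse", {}).get("sections", []):
        if section.get("line") == heading:
            return section.get("index")
    return None


def section_links_url(page, index):
    """URL that parses a single section and returns only its links."""
    params = {
        "action": "parse",
        "page": page,
        "format": "json",
        "section": index,
        "prop": "links",  # assumption: not part of the original answer
    }
    return API + "?" + urlencode(params)
```

Looking the index up on every run matters because section numbers shift whenever the article is edited.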

Algorithm to develop an article extractor

I have undertaken a project to extract the main content from any webpage. For example, given the URL of a news article, it should return only the article part. The first step is getting the source code of the given URL; there are many ways to do that. After getting the HTML of the webpage, I keep only the part inside the <body> tag, because the article will obviously be somewhere inside the body.
After this, I select each div element and check how much text it contains. At the end, I select the div with the most text inside it.
The other way I am thinking of: for each <p> element, I check its parent. At the end, I select the div which has the most <p> elements as direct children. To understand it better, check this tree: Tree of an HTML
I know that these methods are basic, and that's why I am asking this question. I would like the community's suggestions. What approaches do you all use?
I like the idea of implementing your own 'News' crawler...
A few suggestions:
Check the source ('Right Click' > 'Inspect' in Chrome) of some popular sites (e.g. The New York Times); search for common HTML element names, ids or classes they use to identify the different blocks in the HTML; for instance, divs with 'story' or 'story-body' ids.
I would go with the word count, but also use a dictionary of common phrases, which are likely to appear in a news article.
I would search for the block within 'header' and 'footer', excluding comments section or advertisements (again, by searching the values of the object id or class names).
Start your crawling from the main page, it will probably have references to the sub pages or articles - once you have the reference (e.g. a header or article name), it will help you navigate in the sub page itself.
In any case, I suggest working with the Java jsoup library; it will make your life easier. Use it with its jQuery-like selectors.
Good luck.
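For illustration, the second heuristic from the question (pick the div with the most direct <p> children) can be sketched in Python. This uses the standard library's html.parser instead of jsoup; the class and function names are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ParagraphCounter(HTMLParser):
    """Counts, for every <div>, how many <p> elements are its direct children."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements as (tag, div_index or None)
        self.divs = []   # one [attrs, direct_p_count] entry per <div>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A still-open <p> is implicitly closed by the next <p>.
            if self.stack and self.stack[-1][0] == "p":
                self.stack.pop()
            if self.stack and self.stack[-1][0] == "div":
                self.divs[self.stack[-1][1]][1] += 1
        if tag in VOID:
            return
        index = None
        if tag == "div":
            index = len(self.divs)
            self.divs.append([dict(attrs), 0])
        self.stack.append((tag, index))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def best_div(html):
    """Return (attributes, direct_p_count) of the div with the most
    direct <p> children, or None if the page has no div."""
    parser = ParagraphCounter()
    parser.feed(html)
    parser.close()
    if not parser.divs:
        return None
    attrs, count = max(parser.divs, key=lambda d: d[1])
    return attrs, count
```

The returned attributes (id, class) make it easy to check the result against the block names suggested above, such as 'story' or 'story-body'.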

Extract content from Wikipedia to Mediawiki

Is there a way to get the intro content from a Wikipedia page onto my MediaWiki page? I was thinking of using Wikipedia's API, but I don't know how to parse the URL on my page, or how to do it with templates. I just want a query that will display the introduction part of a Wikipedia page on my page.
I used the External_Data Extension and Wikipedia's api to achieve this.
The API
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=[title of wikipedia page]
How I used it
{{#get_web_data:
url=http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles={{PAGENAME}}
|format=JSON|data=extract=extract}}
How I displayed the extract on pages
{{#external_value:extract}}
However, I still need to figure out how to get only one paragraph from the returned text. I will probably use a parser function.
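Outside the wiki, the same query and the "one paragraph only" step could look like the Python sketch below. This is not the External Data approach used above; the helper names are hypothetical. It adds the explaintext parameter so the extract comes back as plain text, and it assumes paragraphs in that text are separated by newlines.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def intro_extract_url(title):
    """URL for the plain-text intro extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "exintro": "",
        "explaintext": "",  # assumption: plain text instead of HTML
        "titles": title,
    }
    return API + "?" + urlencode(params)


def first_paragraph(response):
    """First non-empty paragraph of the extract in a query response.

    Assumption: the response looks like
    {"query": {"pages": {"<pageid>": {"extract": "..."}}}}."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for block in page.get("extract", "").split("\n"):
            if block.strip():
                return block.strip()
    return None
```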

Full urls of images of a given page on Wikipedia (only those I see on the page)

I want to extract the full URLs of all images on the "Google" page on Wikipedia.
I have tried with:
http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
but this way I also got images that are not related to Google, such as:
http://upload.wikimedia.org/wikipedia/en/a/a4/Flag_of_the_United_States.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg
http://upload.wikimedia.org/wikipedia/commons/f/fe/Crystal_Clear_app_browser.png
How can I extract only the images that I see on the Google page?
1. Retrieve the page source code: https://en.wikipedia.org/w/index.php?title=Google&action=raw
2. Scan it for substrings like [[File:Google web search.png|thumb|left|On February 14, 2012, Google updated its homepage with a minor twist. There are no red lines above the options in the black bar, and there is a tab space before the "+You". The sign-in button has also changed, it is no longer in the black bar, instead under it as a button.]]
3. Ask the API for all pictures on the page: http://en.wikipedia.org/w/api.php?action=query&titles=Google&generator=images&gimlimit=10&prop=imageinfo&iiprop=url|dimensions|mime&format=json
4. Filter out all URLs except those which match picture names found in step 2.
Steps 2 and 4 need more explanation.
#2. The regexp /\b(File|Image):[^]|\n\r]+/ should be enough. In Ruby's regexps, \b denotes a word boundary, which might be unsupported in the language of your choice. The regexp I proposed matches all the cases that come to mind: [[File:something.jpg]], gallery tags: <gallery>\nFile:one.jpg\nFile:two.jpg\n</gallery>, templates: {{Infobox|pic = File:something.jpg}}. However, it won't match filenames which contain ]. I'm not sure if they're legal, but if they are, they must be very uncommon, so it should not be a big deal.
If you want to match only constructs like [[File:something.jpg|thumb|description]], the following regexp will work better: /\[\[(File|Image):[^]|]+/
#4. I'd remove all characters from names which match /[^A-Za-z0-9]/. It's easier than escaping them and, in most cases, enough.
Icons are most often attached in templates, unlike pictures related to the article subject, which are most often attached directly ([[File:…]]). There are exceptions though: in some articles, pictures are attached with the {{Gallery}} template. There is also the <gallery> tag, which introduces a special syntax for galleries. You will have to tune my solution to your needs, and even then it won't be perfect, but it should be good enough.
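Steps 2 and 4 can be sketched in Python as follows. The regexp is the one from the answer, adapted to Python syntax with a capture group for the file name; the function names are hypothetical, and the API pages are assumed to be the values of query.pages, each with a "title" and an "imageinfo" list containing a "url". Stripping the File:/Image: prefix before comparing is an added assumption, so that an Image: reference in the wikitext still matches a File: title from the API.

```python
import re

# A File:/Image: prefix followed by the file name, stopping at "]", "|"
# or a line break.
FILE_RE = re.compile(r"\b(?:File|Image):([^\]|\n\r]+)")


def normalize(name):
    """Step 4: drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


def names_in_wikitext(wikitext):
    """Step 2: normalized file names referenced in the page source."""
    return {normalize(m.group(1)) for m in FILE_RE.finditer(wikitext)}


def filter_image_urls(wikitext, api_pages):
    """Keep only the URLs of images whose names appear in the wikitext."""
    wanted = names_in_wikitext(wikitext)
    urls = []
    for page in api_pages:
        # Drop the namespace prefix ("File:") before comparing.
        name = page["title"].split(":", 1)[-1]
        if normalize(name) in wanted:
            urls.append(page["imageinfo"][0]["url"])
    return urls
```

Because normalization removes spaces, underscores and stray template braces alike, "Google_logo.svg}}" in the source and "File:Google logo.svg" from the API compare as equal.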

MediaWiki LocalSettings.php $wgSiteName issue

So I have the wiki exactly as I want it, but the sitename variable always reverts to lowercase...
I assume there is a special character, maybe, like in C# where you have \t or \n to format text, but I cannot find anything of the sort.
This is directly from the wiki page describing sitename:
"Site name
The $wgSitename variable holds the name of your wiki setup. This name gets included many times throughout the system, such as via MediaWiki:Pagetitle. For instance, the Wikipedia tagline "'From Wikipedia, the free encyclopedia."' makes use of this setting."
It states that Wikipedia uses sitename to produce the line 'From Wikipedia, the free encyclopedia.'
That line has multiple capital letters.
Thanks ahead of time.
That text actually comes from the page MediaWiki:Tagline. Edit that page on your wiki and you can make that line say anything you want. You also have to edit MediaWiki:Common.css like it says in that link to make the tagline show up on pages.