Is there a MediaWiki API to convert MediaWiki text to HTML? - mediawiki

Is there any MediaWiki API to which I can submit an entire (potentially very large) block of MediaWiki text (for a Wikipedia article) and get back HTML that is exactly the same as the HTML of the article as viewed on the English Wikipedia?

You can use action=parse for this. I'm not sure what the limits are, though you might consider sending the text in the body of a POST request, instead of in the URL of a GET request.
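
A minimal sketch of the POST approach suggested above, using only the Python standard library. The endpoint (English Wikipedia), the User-Agent string, and the function names are assumptions made for illustration; the response layout assumed in wikitext_to_html is the one returned by format=json with the default format version.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the English Wikipedia API. Use your own wiki's api.php otherwise.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_request(wikitext):
    """Build a POST request asking action=parse to render the given wikitext."""
    params = {
        "action": "parse",
        "text": wikitext,
        "contentmodel": "wikitext",
        "prop": "text",
        "format": "json",
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,  # supplying data makes urllib send a POST, so the text is not in the URL
        headers={"User-Agent": "wikitext-to-html-example/0.1"},  # hypothetical UA string
    )


def wikitext_to_html(wikitext):
    """Send the request and return the rendered HTML (needs network access)."""
    with urllib.request.urlopen(build_parse_request(wikitext)) as response:
        payload = json.load(response)
    return payload["parse"]["text"]["*"]
```

Calling wikitext_to_html("''hello''") should return an HTML fragment; keeping the text in the request body avoids URL length limits for large articles.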

What svick said is the standard method. Alternatively, you can use the Parsoid API, which returns different HTML (but it will look the same).
Unlike action=parse, this is not part of MediaWiki; if you want to use it for your own wiki, see its documentation on how to set it up.

It's not a MediaWiki API, but you can use Pandoc to convert MediaWiki text to HTML or another format.
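
As a rough sketch, Pandoc can be driven from a script like this. It assumes the pandoc executable is installed and on the PATH; the function names are hypothetical.

```python
import subprocess


def pandoc_command():
    """Command line asking pandoc to read MediaWiki markup and write HTML."""
    return ["pandoc", "--from", "mediawiki", "--to", "html"]


def mediawiki_to_html(wikitext):
    """Pipe wikitext through pandoc via stdin/stdout (requires pandoc on PATH)."""
    result = subprocess.run(
        pandoc_command(),
        input=wikitext,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits with an error
    )
    return result.stdout
```

Note that Pandoc converts the markup on its own and has no access to a wiki's templates, so the output is unlikely to match Wikipedia's rendered HTML for template-heavy articles.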

Related

How to embed text from wikipedia?

I have pages on my site about some famous personalities. I want to embed a short description of each of them from Wikipedia (similar to what Google shows on the side when you search for a subject that exists on Wikipedia) and be able to style the text too. Is there a way to do that dynamically?
You can actually use the Freebase API (in particular, the Topic API) to do something like this. Basically, you want to fetch the /common/topic/description attribute, like this:
https://www.googleapis.com/freebase/v1/topic/m/02mjmr?filter=/common/topic/description
(You can also use Freebase to get most of the other attributes that display in the Knowledge Graph).

When using the Wikipedia API, how to retrieve the style sheets?

I am trying to use the Wikipedia API to get HTML content from Wikipedia. I can't find the correct way to embed this. Since styles are not expanded, I would imagine I need to specify the stylesheets used in the head of the HTML that embeds it.
How do I know what the correct CSS to include is?
The answer is to add headitems in the api request, like so:
http://en.wikipedia.org/w/api.php?action=parse&page=Doritos&prop=text|headitems
That gives you another xml node, the content of which you can plop in the HEAD element of your HTML.
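
For illustration, the same request URL can be assembled programmatically. This is only a sketch: the helper name is hypothetical, and the explicit format=xml parameter is an assumption added so the response format does not depend on the API default.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, props=("text", "headitems")):
    """Return an action=parse URL requesting the given props for a page."""
    query = urllib.parse.urlencode({
        "action": "parse",
        "page": page,
        "prop": "|".join(props),  # multiple props are separated by "|"
        "format": "xml",          # assumption: ask for XML explicitly
    })
    return API_URL + "?" + query


print(build_parse_url("Doritos"))
```

urlencode percent-encodes the "|" separator as %7C, which the API accepts in the same way as the literal character.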

Parsing Random Web Pages

I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions, but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content is usually (but not always) like:
Some Title
Text related to Title
I guess I don't need to extract the complete text, just some way to know where the Title/Paragraph is and extract the content from there. The content itself may have images/links that I would like to retain.
Thanks!
Please see this answer: RegEx match open tags except XHTML self-contained tags
Use Python. http://www.python.org/
Use Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
You need to use a proper HTML parser, and extract the elements you’re interested in via the parser’s API (or via the DOM).
Since I don’t know what language you’re programming in, it’s rather difficult to recommend a parser, but some well known ones are Jericho for Java, and Beautiful Soup for Python.
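
A minimal sketch of the parser-based approach, using Python's standard-library html.parser as a stand-in for Beautiful Soup so that it runs without extra installs. The class name, the choice of h1-h3 and p as the tags of interest, and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class TitleAndParagraphExtractor(HTMLParser):
    """Collect heading/paragraph text plus the link and image URLs seen."""

    TEXT_TAGS = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # (tag, text) pairs in document order
        self.resources = []   # href/src values of <a> and <img> tags
        self._current = None  # tag currently being collected, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TEXT_TAGS:
            self._current, self._buffer = tag, []
        elif tag in ("a", "img"):
            wanted = "href" if tag == "a" else "src"
            self.resources.extend(v for k, v in attrs if k == wanted and v)

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace before storing the text.
            text = " ".join("".join(self._buffer).split())
            self.blocks.append((tag, text))
            self._current = None


def extract(html):
    """Return (blocks, resources) for an HTML string."""
    parser = TitleAndParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks, parser.resources


sample = '<h1>Some Title</h1><p>Text related to <a href="/t">Title</a> <img src="x.png"></p>'
print(extract(sample))
```

Unlike a regular expression, the parser copes with nested tags and attribute ordering; Beautiful Soup offers a more forgiving and convenient API on top of the same idea.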

How will you customise an HTML page so that it accepts multiple languages?

How will you customise an HTML page so that it accepts multiple languages?
I will cite the W3C Internationalization Quick Tips for the Web:
Encoding. Use Unicode wherever possible for content, databases, etc. Always declare the encoding of content.
Escapes. Use characters rather than escapes (e.g. &aacute;, &#xE1; or &#225;) whenever you can.
Language. Declare the language of documents and indicate internal language changes.
Presentation vs. content. Use style sheets for presentational information. Restrict markup to semantics.
Images, animations & examples. Check for translatability and inappropriate cultural bias.
Forms. Use an appropriate encoding on both form and server. Support local formats of names/addresses, times/dates, etc.
Text authoring. Use simple, concise text. Use care when composing sentences from multiple strings.
Navigation. On each page include clearly visible navigation to localized pages or sites, using the target language.
Right-to-left text. For XHTML, add dir="rtl" to the html tag. Only re-use it to change the base direction.
Check your work. Validate! Use techniques, tutorials, and articles at http://www.w3.org/International/
For more information, follow the W3C recommendations: http://www.w3.org/International/
One way to do this would be to use a decent server-side web technology that contains support for internationalization; there are many to choose from. Essentially, it comes down to specifying the different pieces of text that the site needs to display, assigning a label to each message, creating different versions of each label in separate language files, and, in the server-side code, referencing the label name and a language code to display the text in the appropriate language.
The first step is to determine your requirements, your hosting environment and then figure out what options are available to you. If you can provide some more information we might be able to steer you in a better direction.
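
The label-and-language-file idea described above can be sketched in a few lines. This is an illustration, not a real framework: the in-memory MESSAGES dictionary stands in for separate language files, and all names and strings are hypothetical.

```python
# Each language maps message labels to translated text.
# In a real site these would typically live in one file per language.
MESSAGES = {
    "en": {"greeting": "Welcome", "farewell": "Goodbye"},
    "fr": {"greeting": "Bienvenue"},
}
DEFAULT_LANGUAGE = "en"


def translate(label, language):
    """Look up a label in the requested language, falling back to the default."""
    catalog = MESSAGES.get(language, {})
    if label in catalog:
        return catalog[label]
    # Fall back to the default language, then to the label itself.
    return MESSAGES[DEFAULT_LANGUAGE].get(label, label)


print(translate("greeting", "fr"))
print(translate("farewell", "fr"))
```

Most server-side frameworks ship an equivalent mechanism (often based on gettext-style catalogs), so a hand-rolled lookup like this is mainly useful for understanding the pattern.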
If I make a bunch of assumptions about what you are trying to achieve:
Serve the document as UTF-8
Browsers will tend to then return a UTF-8 response to the server when any forms are submitted (forms being the only way that a page is going to "accept" anything), and UTF-8 can handle the characters used in just about every language.
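
As a hedged sketch of what "serve the document as UTF-8" can look like, the helper below builds a page that declares its encoding and language both in the markup and in the HTTP header. The function name and page layout are assumptions; how the headers are actually sent depends on your server or framework.

```python
from html import escape


def render_page(title, body_text, lang="en"):
    """Return (headers, body_bytes) for an HTML page encoded as UTF-8."""
    document = (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang)}">\n'
        '<head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body><p>{escape(body_text)}</p></body>\n"
        "</html>\n"
    )
    body = document.encode("utf-8")
    headers = {
        # Declare the encoding in the HTTP header as well as in the markup.
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


headers, body = render_page("Grüße", "こんにちは", lang="ja")
print(headers["Content-Type"], len(body))
```

Declaring the same encoding in both places keeps the page and any submitted form data consistent.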

What is the best way to embed HTML in an RSS feed?

I am using Django's RSS capabilities to build an RSS feed. The <description> of the RSS feed items contains HTML markup. Currently, I am just injecting the HTML markup into the feed using the following template:
{{ obj.post }}
Django, of course, translates special characters (<, >, &, etc.) to their respective HTML entities.
I know I could just output the HTML and wrap all the HTML code in <![CDATA[...]]> sections. This page says that either method is acceptable. If that's true, is there a good reason to pick one method over the other? And if I use example #2, is there a filter for Django to automatically wrap the HTML text in CDATA tags, or should I just change my template to:
<![CDATA[
{{ obj.post|safe }}
]]>
Edit
It seems that Django autoescapes special characters in RSS feeds (or any XML for that matter) no matter what, regardless of whether you pass it through the safe filter or not (the issue is discussed in this ticket). However, general answers are welcome.
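To illustrate the two encodings being compared, here is a small framework-independent sketch using only the Python standard library. The function names are hypothetical, and this does not show how Django itself builds feeds; it only demonstrates that an XML parser recovers the same HTML from either form.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def description_escaped(html):
    """Entity-escape the HTML (&, <, >) inside <description>."""
    return "<description>" + escape(html) + "</description>"


def description_cdata(html):
    """Wrap the HTML in a CDATA section inside <description>."""
    # A literal "]]>" would end the section early, so split it across two sections.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return "<description><![CDATA[" + safe + "]]></description>"


post = "<p>Fish &amp; chips</p>"
for xml in (description_escaped(post), description_cdata(post)):
    # Both forms parse back to the original HTML string.
    assert ET.fromstring(xml).text == post
    print(xml)
```

Since a conforming parser treats both forms identically, the choice is mostly about readability of the raw feed; CDATA needs the extra care shown above if the content can contain "]]>".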
When I run into issues like this with Django my first instinct is to run off and find a normal Python lib that does what I want. In this case PyRSS2Gen might be your saviour.
It'll probably require a bit more fannying around (because it'll be unaware of what Django objects are) but it should be raw enough to let you do as you wish.
And if it isn't, it's just a script. You can hack it apart to allow raw HTML if you please =)
Embedding HTML in CDATA has troubled me in the past. I hope RSS readers have evolved to handle such embeds.
Instead of writing your own RSS XML feed, consider using the Django syndication framework from django.contrib.syndication:
https://docs.djangoproject.com/en/dev/ref/contrib/syndication/