Meta tags in skin from MediaWiki template - mediawiki

Let's say I have a template in my MediaWiki like this:
<includeonly>
<div id="custom-person">
* <span>Birthday:</span> {{#if: {{{birth date|}}} | <b>{{#ol-time:|{{{birth date}}}}}</b> | — }}
{{#if: {{{full name|}}} | * <span>full name:</span> <b>{{{full name}}}</b>}}
{{#if: {{{birth place|}}} | * <span>birth place:</span> <b>{{{birth place}}}</b>}}
{{#if: {{{age|}}} | * <span> age:</span> <b>{{{age}}}</b>}}
{{#if: {{{nationality|}}} | * <span> nationality:</span> <b>{{{nationality}}}</b>}}
<div class="clear"></div>
</div>
[[Category:Person]]
__NOTOC__
</includeonly>
All these pages are in one Namespace (0).
I need to generate head meta tags with data from this template.
I figured out how to filter such pages and add title tags in my SkinPerson.php:
if ( $out->getTitle()->getNamespace() == 0 ) {
$out->addMeta( "description", $out->getPageTitle());
$out->addHeadItem( 'og:description', '<meta property="og:description" content="' . $out->getPageTitle() . '">');
}
But I'm really stuck: how can I insert something like {{{full name}}} + {{{age}}} into, say, the 'og:description' tag?

That's simply not possible in the way you describe, and I wonder what your use case is. First, an explanation of why it can't work that way:
The template is evaluated by a piece of software called the Parser. The parser generates an HTML representation of your wikitext, including all the templates and so on. The result is then saved in the ParserOutput and probably cached in the ParserCache, so that the page does not need to be parsed again on every view.
However, the skin, where you want to add the head item, works directly with the output of the parser, so it no longer knows anything about the wikitext (including template parameters), and really shouldn't.
One possible solution is to extend the wikitext markup language with a tag extension that is evaluated while the wikitext is parsed and saves the values for the head items in the database. When the page is output, you can then retrieve these values from the database and add them to the head items as you want. See the documentation for more information about that.
There might be other ways, apart from the database, to carry information from parse time to output time that I'm not aware of.

Related

Formatting query result of Url as given string in Semantic MediaWiki

I want to embed the result of an ask query on a page in Semantic MediaWiki. The result column has the type Url, as in [[Has type:Url]]. However, since the URL is very long, I would like to display it not explicitly but as a fixed string (e.g. "Website"), as if I had typed:
[https://someURL.com Website]
I tried to set the name on the page in the property assignment itself by including:
[[Has website::https://someURL.com | Website]]
This is the basic structure of an example query.
{{#ask: [[Has website_example::true]]
|?Has website
|format=table
|limit=50 |offset=0
|link=all
|sort=
|order=asc
|headers=show
|searchlabel=... further results
|class=sortable wikitable smwtable
}}
Is it possible to render the ?Has website as a link with the text "Website" in the table?
You may have to go through the 'template' format and reconstruct your rows in that template.
Example:
|-
! [{{{1}}} Website]
Then just surround your 'ask' query with the table header and footer.

Inline query for listing all pages from a namespace without any subobjects

I need an inline query that lists all pages from a specific namespace, but without listing subobjects specified on these pages.
Restricting results to a namespace is possible like that:
{{#ask: [[ExampleNamespace:+]] }}
But it lists all subobjects, too.
Workarounds:
Specify a category on these pages (subobjects don’t inherit it) and query for the category instead:
{{#ask: [[ExampleCategory]] }}
Specify a property on these pages (and never on the subobjects) and query for the property (with a wildcard value) instead:
{{#ask: [[ExampleProperty::+]] }}
But both workarounds require editing, which I would like to avoid. Is there a better way to solve this?
I'm not sure it's a better way, but the array result format together with the Arrays functions #arraymap and #arrayunique looks like one way to trim the SMW subobject fragments and perform a DISTINCT operation. Unfortunately, the solution below suffers from the query result limit issue described in the notes (at least as far as I understand SMW). In general, it may look like the following, and I would appreciate it if someone suggested a nicer solution:
<!-- Fetch all pages from the "Live event" namespace -->
{{#arraydefine: QUERY_RESULT
| {{#ask: [[Live event:+]]
| format = array
| link = none <!-- NOTE: array item link -->
| limit = 10000 <!-- NOTE: limit -->
}}
}}
<!-- Store the mapped result into another array -->
{{#arraydefine: MAPPED_QUERY_RESULT
| {{#arraymap: {{#arrayprint: QUERY_RESULT}}
| ,
| $O <!-- NOTE: array map iterator value -->
| {{#explode: $O <!-- NOTE: explode by hash -->
| #
| 0
}}
}}
| ,
| unique
}}
<!-- Generate links markup -->
{{#arraymap: {{#arrayprint: MAPPED_QUERY_RESULT}}
| ,
| $O
| [[$O]] <!-- NOTE: plain links -->
}}
Notes on the code above:
NOTE: array item link - Not suppressing the links makes the mapper more complicated (it would have to parse HTML <span> tags and class attributes).
NOTE: limit - This is probably the biggest issue here, as the number of subobjects affects the query result. SMW limits query results by default, and the maximum query limit cannot be overridden as far as I know. If there are more rows than the limit allows, the 'further results' link appears. Honestly, I have no idea how to work around this nicely.
NOTE: array map iterator value - {{#arraymap}} seems to replace strings in the simplest possible way, like sed or a plain text editor does. So $O is used as the iterator value placeholder in the formula parameter, chosen so that it does not clash with other string tokens.
NOTE: explode by hash - #ask subobject results generate hashed links like PageA#_159c1f213de2fcaf165f2c9c5c56686b. This step just gets rid of the hash part. In case you need to strip wiki links, you might also play around with [[ or | (encoded like [<nowiki/>[ and <nowiki>|</nowiki> respectively).
NOTE: plain links - The generated links will have underscores instead of spaces. Unfortunately, [[{{#replace: $O | _ | <nowiki> </nowiki>}}]] didn't work for me: the underscores are simply consumed for some reason, even though this approach is recommended on the #replace function wiki page.
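If the result list can be post-processed outside the wiki (for example after exporting it), the same "explode by hash, then unique" idea is easy to sketch in Python. This is only an illustration of the logic, not SMW functionality: the function names are made up, and the second hash in the sample data is invented for the example.

```python
def strip_subobjects(results):
    """Cut everything from '#' onwards and keep each page once, in first-seen order."""
    seen = set()
    pages = []
    for item in results:
        page = item.split("#", 1)[0].strip()
        if page and page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


def to_wikilinks(pages):
    """Render plain [[links]], showing spaces instead of underscores."""
    return ["[[{}]]".format(page.replace("_", " ")) for page in pages]


# Illustrative data only: one page plus two subobject results.
sample = [
    "Live_event:PageA",
    "Live_event:PageA#_159c1f213de2fcaf165f2c9c5c56686b",
    "Live_event:PageB#_0123456789abcdef0123456789abcdef",
]
print(to_wikilinks(strip_subobjects(sample)))
```

Note that this does nothing about the query limit: if the query itself is truncated, the de-duplicated list is truncated too.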
Some links:
SMW array result format
SMW configuration
SMW further results
#arraymap:
#explode:
#replace:
Help:List the set of unique values for a property (pay attention at the "Limitations and issues" section)

REGEX in mysql table containing html data

I have a table that stores html templates in a mysql database. Now I have to perform some text replacement on them. However my target text is also present in some of the anchor tags and I don't want that to be replaced.
Example:
<body> ... (has huge html crap)... .........(Some more html crap) ... (a bit more of html crap) ... </body>
The task is to replace occurrences of "KEYWORD" with "NEW KEYWORD" in the body, but not in the URLs.
It would also be helpful if I could first find the cases where KEYWORD is part of a link in a given template.
MySQL is not capable of such advanced string manipulation.
However, if you have a one-time-use PHP script do the editing (i.e. select from the table, then process and update each row), you can do this:
// foreach row as $row
$newtext = preg_replace("(<a\b.*?>(*SKIP)(*FAIL)|KEYWORD)","NEW KEYWORD",$row['data']);
What this does is look for opening <a ...> tags (a very approximate regex, but it should suffice in almost all cases here) and skip over them. It then looks for KEYWORD everywhere else and replaces it with NEW KEYWORD.
You can use this to quickly and easily handle the replacement.
If that "almost all cases" thing above turns out not to be enough, you can use DOMDocument to load the HTML into a parser and process only the text nodes from there.
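For anyone doing the one-off pass in Python instead of PHP: Python's re module has no (*SKIP)(*FAIL) verbs, but the same effect can be approximated by matching the opening <a ...> tag as one alternative and handing it back unchanged from a callback. A rough sketch, where the function name is made up and the tag pattern is just as approximate as the one above:

```python
import re


def replace_outside_link_tags(html, keyword, replacement):
    """Replace keyword everywhere except inside opening <a ...> tags."""
    pattern = re.compile(
        r"(?P<tag>(?i:<a\b[^>]*>))"   # an opening anchor tag, matched first
        r"|" + re.escape(keyword)     # otherwise the keyword itself
    )

    def keep_or_replace(match):
        tag = match.group("tag")
        # A matched tag is returned untouched; a bare keyword is swapped.
        return tag if tag is not None else replacement

    return pattern.sub(keep_or_replace, html)


row = '<p>KEYWORD <a href="http://example.com/KEYWORD">KEYWORD</a></p>'
print(replace_outside_link_tags(row, "KEYWORD", "NEW KEYWORD"))
```

As with the PHP version, the keyword inside the href survives, while the visible link text is still replaced.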
Maybe you could find the cases where the KEYWORD is a part of a link with something like this:
SELECT * FROM tbl WHERE html REGEXP '<a[^>]*KEYWORD';

Can I replace placeholder text in a rendered HTML page dynamically?

I wish I could think of a better way to word my question, but basically here is what I want to do: in an HTML file, I would like to fill the body with a specific string multiple times. For example:
<div>
This is some content. XXX
</div>
<div>
This is some more content. XXX
</div>
<div>
This is even more content. XXX
</div>
Then, I would like some script to go through the page, and replace every instance of the string (in this case XXX but it could be anything) with an incrementing number, so, like:
<div>
This is some content. 001
</div>
<div>
This is some more content. 002
</div>
<div>
This is even more content. 003
</div>
This is a simple example of course, and you might be thinking well that's dumb, just type the numbers. But obviously this is simpler than what I'm intending to do, and right now what I'm building, the order of all the content has not been decided yet, so things could move up or down in their placement on the page, but I'd like all the numbers to be sequential in order of their appearance on the page.
So, final thoughts: I am super sure there's a way better way to do this than I'm even thinking of, methodology wise (i.e., make an XML table or something). I am definitely open to ANY suggestion on how to do this, but I am kind of an idiot so if your answer is "pff this would be super easy in Ruby just use Ruby", that's not gonna really get me where I need to be. Also if this has already been answered, it was hard to think of how to word the question to search for previous answers so I apologize in advance if I didn't find the pre-existing answer when I was searching.
You can easily do this with CSS counters, sample here:
CSS
ul {
counter-reset:list;
}
li:after {
counter-increment:list;
content: " (" counter(list) ")";
}
For some more advanced examples visit the MDN documentation page.
You could use PHP to achieve this. If you've had no experience with it, it integrates with HTML easily. Basically you write your HTML as usual, but you name the file .php instead of .html. Then you insert PHP snippets where needed, for example: <p>I can count to <?php nextNumber(); ?></p>.
At the top of the page, insert another script with the counter function:
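<?php
// Counter shared by every call to nextNumber() on this page.
$i = 1;
// Number of digits to print; shorter numbers are left-padded with zeros.
$places = 4;
function nextNumber() {
    global $i, $places;
    // Print the current value zero-padded, then increment the counter.
    print str_pad($i++, $places, '0', STR_PAD_LEFT);
}
?>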
This may be better than CSS, since it is not browser-dependent.
Change $places to the number of digits you'd like to have (for leading zeros).
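If neither CSS counters nor a PHP server fits, a third option is to number the placeholders once, as a pre-processing step before the page is published. A minimal Python sketch, assuming the placeholder is a literal string such as XXX that appears nowhere else in the file; the function name and defaults are made up for the example:

```python
import re
from itertools import count


def number_placeholders(html, placeholder="XXX", places=3, start=1):
    """Replace each placeholder with a zero-padded number, in page order."""
    counter = count(start)
    return re.sub(
        re.escape(placeholder),
        lambda match: str(next(counter)).zfill(places),
        html,
    )


page = "<div>First. XXX</div>\n<div>Second. XXX</div>\n<div>Third. XXX</div>"
print(number_placeholders(page))
```

Because the numbers are assigned in order of appearance each time the script runs, moving the blocks around and re-running it keeps the sequence intact.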

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes, script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but I don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or, even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here are some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even if a document doesn't contain any nested tags, regex still requires that every tag is properly closed.
If you are using PHP, for simplicity, I strongly recommend using the DOM (Document Object Model) to parse HTML documents and extract data from them. A DOM library exists for almost every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
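To give an idea of what the parser route looks like, here is a small sketch using Python's standard html.parser module rather than HTML Agility Pack (that is a swap on my part; the question is about VB.Net). The class and function names are made up. It collects the text a visitor would see and ignores anything inside <script> and <style>, as well as comments:

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


doc = ('<html><head><style>p{color:red}</style>'
       '<script>var hidden = "not text";</script></head>'
       '<body><!-- note --><p>Hello <b>world</b></p></body></html>')
print(extract_text(doc))
```

The tokenizing rules live in the library, so the calling code never has to decide where a tag ends.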
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
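For comparison, here is roughly what a parser-based version of the same replacement could look like. It is a sketch in Python with the standard html.parser module, not a VB.Net or DOM implementation, and the names are made up. It re-emits the markup as it streams past and only touches text nodes outside <script> and <style>. One known difference from the regex version: end tags are written back in lowercase, and processing instructions are not handled, so the output is not guaranteed to be byte-identical.

```python
from html.parser import HTMLParser


class TextNodeReplacer(HTMLParser):
    """Re-emits the document, replacing a word in text nodes only."""

    RAW = {"script", "style"}

    def __init__(self, old, new):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []
        self._raw = False

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in self.RAW:
            self._raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if tag in self.RAW:
            self._raw = False

    def handle_data(self, data):
        self.out.append(data if self._raw else data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_in_text_nodes(html, old, new):
    parser = TextNodeReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


src = ('<!DOCTYPE html><p class="Original word">Original word &amp; more</p>'
       '<script>var s = "Original word";</script><!-- Original word -->')
print(replace_in_text_nodes(src, "Original word", "Replacement word"))
```

It is longer than the regex version, but the attribute value, the script string and the comment are left alone without any pattern having to anticipate them.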
For your information:
Instead of regex, it's possible to extract just the text from HTML markup with jQuery. For that you can use the following pattern:
$("<div/>").html($("#elementId").html()).text()
You can refer to this JSFiddle.