Using Jython to print html title? - html

In jython, is there a way to create a function that takes in a url (html) as a parameter, and returns the title of the url (whatever's in between the <title> and </title>)?

Of course there is!
First, download the page you want to analyse. You can do that with the urllib2 module; the examples at the bottom of its documentation show how to read a page's content.
Once you have the page content, you need to locate the title in it. There are several ways to do this. There are modules for parsing HTML, but for such a simple task you can use a regular expression (the re module) or even plain string functions (the find() method).
Be aware that HTML tag names are case-insensitive, so the page may contain <TITLE> as well as <title>. If you use find() to locate the start and end of the title, search a lower()-cased copy of the page and use the resulting positions to slice the original text.
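The steps above can be sketched roughly as follows. This is only a minimal sketch, not a robust HTML parser: the names extract_title and fetch_title are made up for illustration, the page is assumed to be UTF-8, and the import fallback is there so the same code loads under Jython/Python 2 (urllib2) and Python 3 (urllib.request).

```python
import re

try:
    from urllib2 import urlopen  # Jython / Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3

# IGNORECASE handles <TITLE>, DOTALL lets the title span several lines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(page):
    """Return the text between <title> and </title>, or None if absent."""
    match = TITLE_RE.search(page)
    if match is None:
        return None
    # Collapse line breaks and runs of spaces inside the title.
    return " ".join(match.group(1).split())


def fetch_title(url):
    """Download the page at url and return its title (assumes UTF-8)."""
    response = urlopen(url)
    try:
        raw = response.read()
    finally:
        response.close()
    return extract_title(raw.decode("utf-8", "replace"))
```

Calling fetch_title with a URL should return the title as a string, or None when the page has no title element.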

Related

Trying to determine why my xpath is failing in Scrapy

I'm trying to run a Scrapy spider on pages like this:
https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
And I'd like the spider to retrieve the bullet points with qualifications and responsibilities. I can write an xpath expression that gets exactly that, and it works in my browsers:
//*/section/div/ul/li
But when I try to use the Scrapy shell:
response.xpath("//*/section/div/ul/li")
It returns an empty list. Based on copying the response.text and loading it in a browser, it seems like the text is accessible, but I still can't access those bullets.
Any help would be much appreciated!
Looking at the page you have linked, the list items you are targeting are not actually in the document response itself but later loaded into the DOM by JavaScript.
To access these I'd recommend looking at Scrapy's documentation on Selecting dynamically-loaded content. The section that applies here in particular is the Parsing JavaScript code section.
Following the second example, we can use chompjs (you'll need to first install it with pip) to extract the JavaScript data, unescape the html string, and then load it into scrapy for parsing. e.g.:
scrapy shell https://careers.mitre.org/us/en/job/R104514/Chief-Engineer-Technical-Analysis-Department
Then:
import html # Used to unescape the HTML stored in JS
import chompjs # Used to parse the JS
javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)
description_html = html.unescape(data['description'])
description = scrapy.Selector(text=description_html, type="html")
description.xpath("//*/ul/li")
This should output your desired list items:
[<Selector xpath='//*/ul/li' data='<li>Ensure the strength ...
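If you want to see the same idea without Scrapy or chompjs, here is a rough standard-library sketch of the pattern (pull the object out of the script text, unescape the HTML stored in it, then collect the list items). The SCRIPT_TEXT sample and the names jobData, ListItemCollector and list_items_from_script are invented for illustration and are not taken from the real page. It also assumes the embedded object is valid JSON, which real JavaScript often is not; that is the reason chompjs is used above.

```python
import html
import json
import re
from html.parser import HTMLParser

# Invented sample of what a <script> body might contain.
SCRIPT_TEXT = (
    'var jobData = {"title": "Example", "description": '
    '"&lt;ul&gt;&lt;li&gt;First item&lt;/li&gt;'
    '&lt;li&gt;Second item&lt;/li&gt;&lt;/ul&gt;"};'
)


class ListItemCollector(HTMLParser):
    """Collects the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data


def list_items_from_script(script_text):
    # Grab everything from the first "{" to the last "}" and parse it as JSON.
    match = re.search(r"\{.*\}", script_text, re.DOTALL)
    data = json.loads(match.group(0))
    # The HTML is stored entity-escaped inside the object, so unescape it first.
    fragment = html.unescape(data["description"])
    collector = ListItemCollector()
    collector.feed(fragment)
    collector.close()
    return [item.strip() for item in collector.items]
```

Calling list_items_from_script(SCRIPT_TEXT) returns the text of each list item in the sample.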

AEM Rich Text Source Editor Anchor Tag Stripping href formed like Sightly tag

In my AEM project, we have client-side dynamic variable functionality which checks for any strings that are formed inside of a ${ } wrapper. The dynamic variable values are coming from our cookies. Replacing this with a more friendly format that does not conflict with Sightly is not an option at the moment, so please don't tell me to do that :)
When creating an anchor tag in the source editor of the Text core component, I am setting the href as the following: href="/content/en/opt-in.html?hash=${/profile/hash}". The AntiSamy configuration is blocking the href attribute from being rendered on this element. I have tried adding the following to the overlaid file /apps/cq/xssprotection/config.xml:
<regexp name="expressionURLWithSpecialCharacters" value="(\$\{(\w|\/|:)+\})"/>
<regexp-list>
<regexp name="onsiteURL"/>
<regexp name="offsiteURL"/>
<regexp name="expressionURL"/>
<regexp name="expressionURLWithSpecialCharacters"/>
</regexp-list>
^ inside of the <attribute name="href"> block of common-attributes. Is there something else I need to do in order to make this not be filtered out so that it can be correctly parsed by the global variable replacement? Thanks!
There are two issues here:
1. The RTE will encode your URL and turn hash=${/profile/hash} into hash=$%7B/profile/hash%7D when storing into JCR.
2. Even if you pass 1, the expression you are trying to use will only match EXACTLY the URL of ${/profile/hash}. You would need to expand the expression to include everything else (scheme, domain/host, path, query etc.). Think onsiteURL and offsiteURL but allowing your expression as well in query parameters. Have a look at https://github.com/apache/sling-org-apache-sling-xss/blob/master/src/main/java/org/apache/sling/xss/impl/XSSFilterImpl.java#L115 to get a starting point.
Have you tried adding disableXSSFiltering="{Boolean}true"?
Vlad, your second point was helpful in that I hadn't considered that one of the regular expressions in the XSS Protection configuration href attribute block needed to match the ${/profile/hash} in addition to the rest of the URL preceding and following it. Although to your first point, the RTE actually did save the special characters as-is into the JCR and did not encode them, probably since I was using the source editor mode and not the inline text editor.
What I ended up doing was creating a new regular expression as follows:
<regexp name="onsiteURLWithVariableExpression"
value="(?!\s*javascript(?::|&colon;))(?:(?://(?:(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~])|(?:%\p{XDigit}\p{XDigit})|(?:[!$&&apos;()*+,;=]))*#)?(?:\[(?:(?:(?:\p{XDigit}{1,4}:){6}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:::(?:\p{XDigit}{1,4}:){5}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:\p{XDigit}{1,4}){0,1}::(?:\p{XDigit}{1,4}:){4}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,1}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){3}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,2}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){2}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(
?:(?:\p{XDigit}{1,4}:){0,3}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}:){1}(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,4}\p{XDigit}{1,4})?::(?:(?:\p{XDigit}{1,4}:\p{XDigit}{1,4})|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])))|(?:(?:(?:\p{XDigit}{1,4}:){0,5}\p{XDigit}{1,4})?::(?:\p{XDigit}{1,4}))|(?:(?:(?:\p{XDigit}{1,4}:){0,6}\p{XDigit}{1,4})?::))]|(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])\.(?:\p{N}|[\x31-\x39]\p{N}|1\p{N}{2}|2[\x30-\x34]\p{N}|25[\x30-\x35])|(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~])*|(?:%\p{XDigit}\p{XDigit})*|(?:[!$&&apos;()*+,;=])*))(?::\p{Digit}+)?(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#)+/?)*))|(?:/(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#)+(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#)+/?)*))?)|(?:(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#)+(?:/|(/(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#)+)*)))?(?:\?(?:(?:\p{L}\p{M}*)|(\$\{(\w|\/|:)+\})|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#|/|\?)*)?(?:#(?:(?:\p{L}\p{M}*)|[\p{N}-._~]|%\p{XDigit}\p{XDigit}|[!$&&apos;()*+,;=]|:|#|/|\?)*)?"/>
which is just the onsiteURL with my original expressionURLWithSpecialCharacters: (\$\{(\w|\/|:)+\}) value added as a group in the query string parameter section. This enabled AEM to accept this as an href value in my anchor tag.
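To illustrate why the variable group has to sit inside a pattern that covers the whole href, here is a small Python sketch. It assumes, as described above, that the value must match the expression in full. SIMPLE_ONSITE_URL is a hypothetical, heavily simplified stand-in for onsiteURL (a path plus an optional query string) and is nowhere near as strict as the real expression, so treat it only as a demonstration of the idea.

```python
import re

# The original variable expression, e.g. ${/profile/hash}
VARIABLE = r"\$\{(?:\w|/|:)+\}"

# Hypothetical simplified stand-in for onsiteURL: a path, then an optional
# query string whose characters may include the variable expression.
SIMPLE_ONSITE_URL = r"/[\w/.\-]*(?:\?(?:[\w=&.\-]|" + VARIABLE + r")*)?"

HREF = "/content/en/opt-in.html?hash=${/profile/hash}"


def is_allowed(href, pattern):
    """True only if the pattern matches the entire href value."""
    return re.fullmatch(pattern, href) is not None
```

With these definitions, is_allowed("${/profile/hash}", VARIABLE) is True and is_allowed(HREF, VARIABLE) is False, because the bare variable expression does not cover the path and the rest of the query string. is_allowed(HREF, SIMPLE_ONSITE_URL) is True because the combined pattern covers the whole value.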
I appreciate everyone's help!

How to display multiple links in a modal box in perl?

I am trying to display multiple links in a single modal box on onclick function like this
$cgi->img({ -src =>'/images/question.png',
-width=>10,
-border=>0,
-height=>10,
-alt=>'Redirect Link',
-onClick=>"image_click()"
}
),$cgi->div({-id=>"modal1",-class=>"modal"},$cgi->div({-class=>"modal2"},$cgi->span({-class=>"close",-onclick=>"span_click()"},'×'),$cgi->p({},$links),),)
It works fine when I display a single link, but when I try to display multiple links in the same box, I get the literal text instead of the links.
My links look something like this:
my $links="'select a link',\$cgi->a({-href=>somelink},'LINK1'),\$cgi->a({-href=>somelink},'LINK2');
What am I doing wrong?
In this line:
my $links="'select a link',\$cgi->a({-href=>somelink},'LINK1'),\$cgi->a({-href=>somelink},'LINK2');
$cgi->a(...) is a method call, you have it in a double quoted string, but you can't interpolate method calls in a double quoted string.
Try something like this instead:
my @links = ('select a link', $cgi->a({-href=>somelink},'LINK1'), ...);
This creates an array of things rather than trying to put all the things in a string (the parentheses are needed so that the whole list is assigned, not just its first element). Then change
$cgi->p({}, $links)
to:
$cgi->p({}, @links)
I haven't tested this - sorry.
Just because you're using CGI doesn't mean you have to use the CGI methods for generating HTML. I would strongly recommend looking at using a templating module like Template::Toolkit, or a framework like Mojolicious which can be run from CGI and includes a templating system (and has next to no dependencies).

Does media wiki support links inside highlighted code?

Assume I have the following code section:
<syntaxhighlight lang = "php">
function my_func($str) {
$arr = split($str, ' ');
}
</syntaxhighlight>
This would be highlighted with the help of the GeSHi extension. However, I would also like to make split a link to an external site with documentation explaining what this function does. Is there any way to do that in MediaWiki for the highlighted code?
Since GeSHi works like the <pre> tag, displaying the code as typed instead of parsing it as wikicode, MediaWiki can't parse anything inside it. It is therefore impossible to add a 'normal' link using wiki code.
The good news is that GeSHi already has exactly what you need!
First, you will need to set the following in LocalSettings.php:
$wgSyntaxHighlightKeywordLinks = true;
Once that is set, each function will be a link to http://www.php.net/<function name> (since your example uses PHP code).
If you want the link to point somewhere else (your own site, maybe), you will need to edit the 'URLS' array in $IP/SyntaxHighlight_GeSHi/geshi/geshi/php.php
(more information is available in GeSHi's documentation).
If you need links on functions for languages other than PHP, just edit the corresponding file instead. For example:
$IP/SyntaxHighlight_GeSHi/geshi/geshi/lolcode.php

How can I retrieve and parse just the html returned from an URL?

I want to be able to programmatically (without it displaying in the browser) send a URL such as http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=platypi&sprefix=platypi%2Caps&rh=i%3Aaps%2Ck%3Aplatypi and get back in a string (or some more appropriate data type?) the html results of the page (the interesting part, anyway) so that I could parse that and reformat selected parts of it as matched text and images (which link to the appropriate page). I want to do this with Razor/Web Pages, if that makes any difference.
IOW, this is sort of a screen-scraping question, but really a "behind-the-screen" scraping.
Is it possible? How? A 100 point post-answer-bonus will be awarded to the (or the most helpful) answer.
Use the WebClient class (or .NET 4.5's better HttpClient class) to download the HTML, then use the HTML Agility Pack to parse it.