Jsoup is not Selecting Script Tag - html

I am trying to select a script tag on a page by the text it contains:
Document doc = Jsoup.parse(somehtml);
Elements ele = doc.select("script:contains(accountIndex)");
The script tag on the page looks like this:
<script>(function() {var vm = ko.mapping.fromJS({
"accountIndex": 1,
"accountNumber": "*******",
"hideMoreDetailsText": "Hide More Details",
"viewAccountNumberText": "Show Account Number",
"hideAccountNumberText": "Hide Account Number",
});window.AccountDetails = vm;})();</script>
I am able to select this script tag if I pass its full CSS path, like this:
Elements ele = doc.select("body > script:nth-child(44)");
There are many script tags on the page, so the second approach is not generic and may break in the future.
Can somebody please tell me what is wrong with the first approach? I am able to select other tags on the page with jsoup's :contains.

The selector :contains(text) looks for an element that has that text value. A script doesn't have text, it has data (otherwise the JS would be visible in the browser). You can use the :containsData(data) selector instead.
E.g.:
Elements els = doc.select("script:containsData(accountIndex)");
Here's an example. The Selector documentation has all the handled query types (which is not just strict CSS).

jsoup only supports CSS selectors, and those only allow you to select based on CSS classes and properties of the DOM elements, not their text contents (CSS selector based on element text?). You could try using another framework for parsing and querying the HTML, for example XOM and TagSoup like described here: https://stackoverflow.com/a/11817487/7433999
Or you could add CSS classes to your script tags like this:
<script class="class1">
// script1
</script>
<script class="class2">
// script2
</script>
Then you can select the script tags again via CSS using jsoup:
Elements elements = document.select("script.class1");

Related

style auto generated html attributes with regex

I have an ionic/angular app which autogenerates a custom tag element with a different _ngcontent attribute each time e.g.:
<tag _ngcontent-hgr-c2>...</tag> (1st refresh)
<tag _ngcontent-agj-c7>...</tag> (2nd refresh)
<tag _ngcontent-cfx-c5>...</tag> (3rd refresh)
Is there a way to use regex to target the custom tag attribute?
This didn't work:
tag[^=_ngcontent-] {
color: red !important;
}
Nor did just targeting the tag, e.g.:
tag {
color: red !important;
}
According to this answer, CSS has a limited form of pattern matching (the ^=, $= and *= attribute selectors), but it applies only to an attribute's value, not to the attribute's name. The W3C documentation says the same, so because Angular generates custom attribute names, I'm afraid this is hard to achieve with pattern matching.
If you want to style your tag as in the second example, you can do it by defining its styles in the global styles.scss. This is not the best solution, but it should work.
This angular-blog article recently helped me understand the idea behind style encapsulation.
Unfortunately, there is no wildcarding support in CSS for attribute names.
If you have access to the application code which generates the custom tags, you should add classes to these elements (if the app supports it).
See also this question.

Scraping HTML elements between ::before and ::after with scrapy and xpath

I am trying to scrape some links from a webpage in python with scrapy and xpath, but the elements I want to scrape are between ::before and ::after so xpath can't see them as they do not exist in the HTML but are dynamically created with javascript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc
I can't replicate your exact document state.
However, if you load the page, you can see some template markup in the same format as your example data.
Also, if you check the XHR tab of the network inspector, you can see AJAX requests being made for JSON data.
So you can download all the data you are looking for in a handy JSON format from here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.text)
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
Very very easy!
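To pull specific fields out of that JSON in a spider or a plain script, something like the following could work. This is only a sketch: sample_body is a hand-made stand-in containing just the three keys visible in the shell output above, and extract_topic_ids is a hypothetical helper, not part of Scrapy.

```python
import json

# Stand-in payload shaped like the shell output above.
# The real response carries many more fields per topic.
sample_body = (
    '{"topicData": {"Topics": ['
    '{"topicId": 1259874, "ccm2Id": 31081390, "subCallId": 910867}'
    ']}}'
)


def extract_topic_ids(body):
    """Return the topicId of every entry under topicData -> Topics."""
    data = json.loads(body)
    return [topic["topicId"] for topic in data["topicData"]["Topics"]]


topic_ids = extract_topic_ids(sample_body)
print(topic_ids)
```

Inside a Scrapy callback the same helper could be called with response.text instead of sample_body.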
You can combine an "absolute XPath" and a "relative XPath" (https://www.guru99.com/xpath-selenium.html). With this trick you can get past ::before (and maybe ::after). For example, in your case, supposing that
//div[@id='"+FindField+"']//following::td[@class='KKKK'] comes before your "div":
FindField = 'your "id" associated with the "div"'
driver.find_element_by_xpath("//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
NOTE: only one "/" must be used.
You can also use only an "absolute XPath" for the whole address (note: it must start with "//").

What does it mean to set data-target attribute of a div to the id of that div?

I'm reading some code and there is a piece of html that reads:
<div id="uniqueId1234" data-target=".uniqueId1234">
...
</div>
and then earlier on in the same html file there is a span element that seems to use this div as a class:
<span class="uniqueId1234">
...
</span>
Can someone explain how this works? I thought that a class was something created in a css file. Sorry if this is a dumb question.
This is likely part of some piece of Javascript code or a library that listens for some type of change or event on your element with the data-target attribute.
When this event is triggered, it can then use the value of that attribute as a selector for performing some other logic as seen in this basic jQuery-based example below:
// When an element containing your data-target attribute is clicked
$('[data-target]').click(function(){
// Find the appropriate target (i.e. ".uniqueId1234")
var target = $(this).data('target');
// Then use it as a selector for some type of operation
$(target).toggle();
});
Classes are very common within CSS to style multiple elements, but they can also commonly be used as a mechanism in Javascript as well, which is likely the case in your scenario here.
What does it mean to set data-target attribute of a div to the id of that div?
Nothing standard. data-* attributes are designed to hold custom data for custom code (typically client side JS) to process.
I thought that a class was something created in a css file.
Classes are an HTML feature used to put elements into arbitrary groups. They are commonly used when writing CSS, but also client side JS and other code.

How do I find a reliable XPath for this html element (type is text, class is known, no id present)?

The element is similar to:
<input type="text" class="information">
There is no id for the element.
There is only one text type element inside the information class. I want to be able to enter text into this html element by using casperjs which works on top of phantomjs.
The XPath obtained from chrome developer tools is similar to:
//*[@id="abcid"]/div/div[1]/input
abcid is the id of the div element that contains the text box and a few other elements. But I need a more reliable XPath. I'm not very experienced with finding XPaths, so forgive me if the answer is too obvious.
If you want to use XPath selectors for nearly all CasperJS functions, you need to provide it as an object. If the selector is provided as a string it will be automatically assumed that it is a CSS selector.
You can build the XPath selector object yourself:
{
type: 'xpath',
path: '//input[@class="information"]'
}
or just use the XPath utility by first requiring it at the beginning of your script and then using it:
var x = require('casper').selectXPath;
// later ...
var text = casper.fetchText(x('//input[@class="information"]'));
Regarding your selector:
If there is only one input with the information class then you can use the XPath
//input[@class="information"]
or the CSS selector
input.information[type='text']
If the input has other classes too, the CSS selector will work as is, but the XPath selector must be changed to
//input[contains(@class,"information")]
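The difference between an exact match on the class attribute and a match on one class among several can be checked offline. The sketch below uses Python's xml.etree.ElementTree, which supports the [@class='...'] predicate but not contains(), so the second query is written as a plain token test. The markup is made up for illustration; the extra class names "wide" and "other" are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up markup: one input with only the class, one with an extra class,
# and one unrelated input.
markup = """
<div id="abcid">
  <input type="text" class="information"/>
  <input type="text" class="information wide"/>
  <input type="text" class="other"/>
</div>
"""

root = ET.fromstring(markup)

# Exact comparison of the whole attribute value, like //input[@class="information"].
exact = root.findall(".//input[@class='information']")

# Token test: the class list is split on whitespace and checked for membership.
by_token = [
    el for el in root.iter("input")
    if "information" in el.get("class", "").split()
]

print(len(exact), len(by_token))
```

Note that contains(@class, "information") is a substring test, so it would also match a class such as "information-extra"; the token test above is slightly stricter.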

HTML tag that causes other tags to be rendered as plain text [duplicate]

I'd like to add an area to a page where all of the dynamic content is rendered as plain text instead of markup. For example:
<myMagicTag>
<b>Hello</b> World
</myMagicTag>
I want the <b> tag to show up as just text and not as a bold directive. I'd rather not have to write the code to convert every "<" to an "&lt;".
I know that <textarea> will do it, but it has other undesirable side effects like adding scroll bars.
Does myMagicTag exist?
Edit: A jQuery or javascript function that does this would also be ok. Can't do it server-side, unfortunately.
You can do this with the script element (bolded by me):
The script element allows authors to include dynamic script and data blocks in their documents.
Example:
<script type="text/plain">
This content has the media type text/plain, so characters reserved in HTML have no special meaning here: <div> ← this will be displayed.
</script>
(Note that the allowed content of the script element is restricted, e.g. you can’t have </script> as text content (it would close the script element).)
Typically, script elements have display:none by default in the browser’s built-in CSS, so you’d need to override that in your CSS, e.g.:
script[type="text/plain"] {display:block;}
You can use a function to escape the < >, eg:
'span.name': function(){
return this.name.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}
Also take a look at <plaintext></plaintext>. I haven't used it myself, but it is known to render everything that follows as plain text (by everything I mean that it ignores the closing tag, so all of the following markup is rendered as text).
The tag used to be <XMP>, but it was already deprecated in HTML 4. Browsers don't seem to have dropped support for it, but I would not recommend it for anything beyond quick debugging. The MDN article about <XMP> lists two other tags, <plaintext> and <listing>, that were deprecated even earlier. I'm not aware of any current alternative.
Whatever, the code to encode plain text into HTML is pretty straightforward in most programming languages.
Note: the term similar means exactly that—all three are designed to inject plain text into HTML. I'm not implying that they are synonyms or that they behave identically—they don't.
There is no specific tag except the deprecated <xmp>.
But a script tag is allowed to store unformatted data.
Here is the only solution so far showing dynamic content, as you wanted.
<script id="myMagicTag" type="text/plain" style="display:block;">
<b>Hello</b> World
</script>
Use Visible Data-blocks
<script>
document.querySelector("#myMagicTag").innerHTML = "<b>Unformatted</b> dynamic content"
</script>
No, that's not possible, you need to HtmlEncode it.
If you're using a server-side language, that's not really difficult though.
In .NET you would do something like this:
string encodedtext = HttpContext.Current.Server.HtmlEncode(plaintext);
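The same encoding step can be sketched in Python with the standard library's html.escape. The helper name to_visible_markup is hypothetical and exists only for this illustration.

```python
import html


def to_visible_markup(raw):
    """Encode &, < and > so the browser shows the markup instead of rendering it."""
    # quote=False leaves " and ' untouched, which is fine for element content
    # (keep the default quote=True when the result goes into an attribute value).
    return html.escape(raw, quote=False)


encoded = to_visible_markup("<b>Hello</b> World")
print(encoded)  # &lt;b&gt;Hello&lt;/b&gt; World
```

html.unescape reverses the transformation if the original text is needed again.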
In my application, I need to prevent HTML from rendering
"if (a<b || c>100) ..."
and
"cout << ...".
Also the entire C++ code region HTML must pass through the GCC compiler with the desired effect. I've hit on two schemes:
First:
//<xmp>
#include <string>
//</xmp>}
For reasons that escape me, the <xmp> tag is deprecated. I find (2016-01-09) that Chrome and FF, at least, render the tag the way I want. While researching my problem, I saw a remark that <xmp> is required in HTML 5.
Second, in <head> ... </head>, insert:
<style type="text/css">
textarea { border: none; }
</style>
Then in <body> ... </body>, write:
//<br /> <textarea rows="4" disabled cols="80">
#include <stdlib.h>
#include <iostream>
#include <string>
//</textarea> <br />
Note: Set cols="80" to prevent the following text from appearing on the right. Set rows="..." to one more line than you enclose in the tag; this prevents scroll bars. This second technique has several disadvantages:
The "disabled" attribute shades the region
Incomprehensible, complex comments in the code sent to the compiler
Harder to understand
More typing
However, this method is neither obsolete nor deprecated. The gods of HTML will make their faces to shine unto you.