Partial HTML Selection Using Jsoup - html

I was wondering if there is a way to find an element from a string that you know appears somewhere in one of its attributes on an HTML page. For example, I know that "Apr-16-2015" is somewhere in an attribute on the page. When I look for it, it turns out to be part of a title attribute:
<a title="Apr-16-2015 5:04 AM"
However, I do not have the information about the exact time, i.e. the "5:04 AM". I was wondering if there is a way to partially search an attribute in order for it to return the full element.
This is my code:
org.jsoup.nodes.Element links = lastPage.select("[title=\"Apr-16-2015\"]").first();
Again, it doesn't work because I did not enter the full title attribute value, as given above. My question: is there any way to make this selector work without the full value, since I will not have the latter part of the attribute at my disposal?

You can use it in the following way:
lastPage.select("[title^=\"Apr-16-2015\"]").first();
As described in the jsoup documentation:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes
that start with, end with, or contain the value, e.g. [href*=/path/]
References:
http://jsoup.org/cookbook/extracting-data/selector-syntax

Related

What does a reference inside a div tag without an attribute do?

I would like to start using Attribute Selectors in my css. I am seeing div tags that contain a reference WITHOUT any attribute statement like:
<div class="container" data-footer>
All my searches (for the last hour) to find out how "data-footer" can be listed without the use of an attribute= (e.g., id= or class= or etc.) have resulted in no information. Dozens of SO and Google links without a single example of a reference inside a div tag without the use of an attribute. Because I do not know what this is (or what to call it) I'm probably not searching with the right keywords. Is this a short-form way to pass an id or ???
data-* attributes without a value can be used as Boolean flags. For example:
if(div.hasAttribute('data-footer')) {
// then do something
}
In css you can access it like:
div[data-footer] {
}

How to access <a> tag which is present inside <th> in VBA that doesn't have id or classname?

I'm trying to automate a web form that has an <a> tag inside a <th> tag. When I try getElementByTagName("a").innerText, I don't get the desired element/text. But when I write getElementByTagName("th").innerText, it shows exactly the text I'm pointing at. The issue is that I want to click the link that carries this text, i.e. the <a> tag, and getElementByTagName("th").Click is not working. Can someone please help?
There's no such method as getElementByTagName().
There are: document.getElementsByTagName() and Element.getElementsByTagName(). Both return a live HTMLCollection.
In the latter Element refers to a DOM element. It allows you to search for specific tags in children of that element.
Please refer to the following MDN documents:
Element.getElementsByTagName()
Document.getElementsByTagName()
Also, it's worth mentioning that, without document.* or anything else, the browser would assume you're trying to call window.getElementByTagName().
NOTE: I'm aware the question is tagged vba instead of javascript, but in this case it doesn't seem to matter.

Parsing awful HTML: How do I recognize boundaries with xpath?

This is almost going to sound like a joke, but I promise you this is real life. There is a site on the internet, one which you have all used, that does not believe in css classes. Everything is defined directly in the style tag on an element. It's horrifying.
My problem though is that it also makes the html extraordinarily difficult to parse. The structure that I've got to go on looks something like this:
<td>
<a name="<random_string>"></a>
<div style="generic-style, used by other elements">
<div style="similarly generic style">{some_stuff}</div>
</div>
<a name="<random_string>"></a>
...
</td>
Basically, I've got these a tags that form the boundaries of the reviews, and their only defining information is the random string that is their name. I don't actually care about the anchor tags, but I would like to grab the reviews between them using xpath.
I've looked into sibling queries, but they don't seem to be well suited for alternating boundaries. I also looked into the Kayessian method of xpath queries, which (aside from having an awesome name) only seems well suited to grab a particular div, rather than all divs between the anchor tags.
Any thoughts on how I could grab the divs here?
If //td/div[../a[@name]] works for you, then the following should also work:
//td[a/@name]/div
This way you don't need to go back and forth (or rather down and up). For a more specific selector, you may want to try the following:
//td/div[preceding-sibling::*[1][self::a/@name]][following-sibling::*[1][self::a/@name]]
This XPath selects div elements that have all of the following properties:
td/div : is a child of a <td> element
[preceding-sibling::*[1][self::a/@name]] : directly preceded by an <a> element that has a name attribute
[following-sibling::*[1][self::a/@name]] : directly followed by an <a> element that has a name attribute
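To see how the broad and the strict selectors differ, here is a minimal sketch using the XPath 1.0 engine that ships with the JDK (javax.xml.xpath). The class name ReviewXPathDemo, the helper selectText and the SAMPLE markup are made up for illustration. The sample is a well-formed stand-in for the structure in the question, with an extra trailing div that has no closing anchor. Real-world HTML is usually not well-formed XML, so in practice you would probably run the same expressions through whatever HTML-aware XPath tool you already use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ReviewXPathDemo {
    // Hypothetical, well-formed stand-in for the markup in the question.
    // The last div is NOT followed by an anchor.
    static final String SAMPLE =
        "<table><tr><td>"
        + "<a name=\"x1\"></a>"
        + "<div style=\"s\"><div style=\"t\">first review</div></div>"
        + "<a name=\"x2\"></a>"
        + "<div style=\"s\"><div style=\"t\">second review</div></div>"
        + "<a name=\"x3\"></a>"
        + "<div style=\"s\">trailing block</div>"
        + "</td></tr></table>";

    // Evaluates an XPath expression and returns the text of each matched node.
    static List<String> selectText(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(xpath, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Broad: every div child of a td that contains a named anchor.
        String broad = "//td[a/@name]/div";
        // Strict: only divs whose immediate neighbours are both named anchors.
        String strict = "//td/div[preceding-sibling::*[1][self::a/@name]]"
            + "[following-sibling::*[1][self::a/@name]]";
        System.out.println(selectText(SAMPLE, broad));
        System.out.println(selectText(SAMPLE, strict));
    }
}
```

On this sample the broad expression also returns the trailing div, while the strict one keeps only the two divs that sit between a pair of anchors.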
I figured it out! It turns out that xpath will allow for relative attribute assertions. I am not sure if this behavior is desired, but it happens to work in this case! Here's the xpath:
//td/div[../a[@name]]
Nice and clean, the ../a[@name] basically just says:
Go up a level, and make sure on that level of the hierarchy there's an a element with a name attribute

How can I use <time> tag in <input>'s value attribute?

I have an order form and I need to do something like this:
<input id="myInput" type="text" name="myInput" value='<time datetime="2015-01-01" itemprop="startDate">1.1.2015</time>' class="width-100" readonly />
but in the browser, the input shows the whole time tag, <time datetime="2015-01-01" itemprop="startDate">1.1.2015</time>, instead of just 1.1.2015 as I expected.
I don't know why, and I can't figure out how to make this work.
The time tag in input's value is based on date selected from DB and returned by function, like: return '<time datetime="'.date('Y-m-d', strtotime($from)).'" itemprop="startDate">'.date('j.n.Y', strtotime($from)).'</time>';
Any advice would be helpful. Thanks
You can write almost anything you want inside an HTML attribute, as long as you encode it properly (e.g. &lt; instead of <, &quot; instead of ", etc.). All decent programming languages provide built-in methods to take care of the dirty details. Whether those values are valid, meaningful or useful in the context of the precise attribute is a different thing.
<input> elements are designed to hold plain text. Whatever you write into the value attribute will be rendered to the user as-is. If you type HTML tags, you'll display HTML code, nothing else.
If you want to send form data to the server, you can simply use form fields, as you are already doing. The missing bit is that form fields do not need to be visible. There's a specific control for hidden data: <input type="hidden">. From MDN reference:
hidden: A control that is not displayed, but whose value is submitted to the server.
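As a rough illustration of both points (encoding attribute values, and pairing a visible text field with a hidden one), here is a minimal Java sketch. The question itself uses PHP, so this is not the asker's code. The class AttributeEscapeDemo, the hand-rolled escapeAttr helper and the hidden field name startDate are assumptions made for the example; in a real project you would normally rely on your language's or framework's own escaping function.

```java
class AttributeEscapeDemo {
    // Hand-rolled escaping for text placed inside a quoted HTML attribute.
    // Illustrative only; prefer a library or framework function in real code.
    static String escapeAttr(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds a visible read-only field with the human-readable date and a
    // hidden field (hypothetical name "startDate") with the ISO date.
    static String orderDateFields(String isoDate, String displayDate) {
        return "<input type=\"text\" name=\"myInput\" value=\""
            + escapeAttr(displayDate) + "\" readonly>\n"
            + "<input type=\"hidden\" name=\"startDate\" value=\""
            + escapeAttr(isoDate) + "\">";
    }

    public static void main(String[] args) {
        // Escaping keeps the attribute valid, but the user would still see
        // the literal tag text in the input.
        System.out.println(escapeAttr("<time datetime=\"2015-01-01\">1.1.2015</time>"));
        // Plain text for display, machine-readable value in a hidden field.
        System.out.println(orderDateFields("2015-01-01", "1.1.2015"));
    }
}
```

The visible field carries only the text the user should read, and the hidden field carries the value the server needs.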

How do I customise the HTML Filter from Power-Mezz?

I'm experimenting with the HTML Filter module from the PowerMezz library and would like to customise the filter rules for a particular instance of the function. Is this possible?
For example, by default the style attribute is permitted, however I'd like to have this attribute stripped:
>> filter-html {<p style="color:red">A Para</p>}
== {<p>A Para</p>}
As well as limiting some other tags/attributes that are otherwise allowed.
After studying the filter-html module, it looks like the immediate answer is no: there appears to be no way to change the filter options for a particular instance.
After some experimentation, however, I discovered that you can make a small change to make something like this possible. Most attribute handling can be customized by changing the attributes-map block, but inline style attributes are not handled in that block. They are dealt with specifically in the check-attributes function.
I commented out these lines in check-attributes, which causes all style attributes to be stripped out by default:
if value: select attributes 'style [
append style value
]
You would then need to add back the attributes you don't want filtered to the specific HTML tags in attribute-map. I make a copy of the original attribute-map, make my changes, run filter-html, then revert to the original before the next filtering instance.