The link I used is https://www.cnbc.com/quotes/XAU=
My procedure for getting the "price" XPath is:
go to the link using Chrome
open the Inspect Element tool
search for the price number
this turns up <span class="QuoteStrip-lastPrice">some_numbers</span>
right-click the element, then select Copy > Copy full XPath
Then I get /html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]
Problem:
Two days before starting this thread, I was using the full XPath I got from Chrome's inspector, and it worked just fine in xidel without a problem. But now,
xidel -se '/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]' ~/Desktop/document.html
or
xidel -se '/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]' https://www.cnbc.com/quotes/XAU=
gives me no output...
but user2703456 gave me the "correct" full XPath, as follows: /html/body/div[2]/div/div[1]/div[3]/div/div/div[1]/div[2]/div[3]/div/div[2]/span[1]
It solved the problem. However, I'm still confused: when I check them, both full XPaths point at the same element, the exact same element. Why? Why was the first one working before, but not now?
XPath inspection snapshot
I suppose its content is generated dynamically, and somehow the server or JavaScript changes the HTML.
To overcome this problem I would not use the copied XPath, but rely instead on a combination of @class and @id attributes, i.e. like this:
//*[@id='quote-page-strip']//*[@class='QuoteStrip-lastPrice']
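For example, rerunning the earlier xidel command with this attribute-based expression instead of the copied full XPath (a sketch; double quotes so the inner single quotes survive):
xidel -se "//*[@id='quote-page-strip']//*[@class='QuoteStrip-lastPrice']" https://www.cnbc.com/quotes/XAU=
If the site later adds extra classes to that span, a looser match like contains(@class, 'QuoteStrip-lastPrice') keeps working.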
I am trying to create a TOC for my Markdown blog.
The methods I am finding here (Markdown to create pages and table of contents?) do not work for me, because I am naming all of my headers like # _</>_ The Setup: I am using CSS to style the _</>_ part, giving each header a nice colored icon next to it. If I simply use # The Setup, it works great.
This causes issues whenever I try to use [The Setup](#The-Setup).
I tried a few things, like [The Setup](#_</>_-The-Setup) and others, but I cannot get it to work.
If someone can point me in the right direction I would greatly appreciate it. Also, if anyone has a better way of adding custom icons next to headers, I think that would be the better way to go about it.
As always, thanks in advance.
The general solution is to examine the rendered HTML output and see what the tool converts the special characters to in the heading's element ID. Every tool can handle the conversion differently (it might convert special characters to -, to _, or just remove them). Some examples:
<h1 id="_____the-setup">The Setup</h1>
<h1 id="-the-setup">The Setup</h1>
<h1 id="the-setup">The Setup</h1>
Once you have identified the exact id that the tool is using, then you use that value as the heading link in the markdown's table of contents. For example:
[The Setup](#_____the-setup)
Now, the tricky part is that not all Markdown tools will export the rendered HTML, including VS Code. The workaround for VS Code is:
Open the markdown preview mode (which renders to HTML internally).
Open the VS Code Developer Tools (Help > Toggle Developer Tools).
Use DevTools to inspect the element (in this case, the heading element for "The Setup").
I see that VS Code named the id the-setup, so in the markdown's table of contents I write [The Setup](#the-setup). Now the table of contents hyperlink works in VS Code. Caveat: it might not work in other Markdown tools if they render a different HTML element ID!
Another shortcut, available since VS Code 1.70 (July 2022), is that markdown links can autocomplete the header ID: you just type #, and it will list the valid IDs.
I want to extract data from a website, but it seems that the elements I want to extract are not "accessible". I also discovered that they seem to be pseudo-elements: I can see that their tags are marked with a # prefix in my web inspector.
Moreover, with XPath I can't extract the text I want to access. There is a point in the CSS "cascade tree" where I can no longer extract the content of a tag, as you can see below.
Here I can extract information down to the 'content fond' tag. But when I ask for the "fos_comment_thread" tag, which is the tag just below it, the result is empty. And it is precisely this tag that is a pseudo-element, along with the ones after it. However, the text I want to access is even deeper in this part of the CSS tree...
Input
response.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']").extract()
Output
['<div id="fos_comment_thread"></div>']
Input
response.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']/div[@id='fos_comment_thread']").extract()
Output
[]
I don't understand why I can't extract it. I think it is because the rest of my tags are pseudo-elements, but I haven't found a solution to the problem...
The first thing you need to do is stop relying on your web-inspector tool and look at the raw HTML of the website.
Web inspectors take into account the transformations made by JavaScript, and may show you updated HTML, after JavaScript execution, that Scrapy obviously can't see.
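You can check this with Scrapy's own shell (a sketch; the URL is a hypothetical stand-in for the actual page):
scrapy shell 'https://example.com/the-forum-page'
>>> response.xpath("//div[@id='fos_comment_thread']/div")  # empty if the comments are injected by JavaScript
>>> view(response)  # opens the HTML Scrapy actually received in your browser
If the comment thread is missing from that raw HTML, it is being filled in client-side.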
A W3C-validated HTML 5 web page contains this working, simple button inside a login form.
<input data-disable-with="Signing in, please wait…"
name="commit" type="submit" value="Sign in" />
I'm writing a largely pointless test :-) in a Rails 3.2.17 application, just to get the hang of Capybara, and I've already got completely stuck. Googling, reading documentation, and reading the test framework's source code have brought no joy: attempting to find this button by its name (i.e. "commit") fails.
click_button("commit")
find_button("commit")
Both result in Capybara::ElementNotFound: Unable to find button "commit". If I use the visible button text of Sign in then the element is found, i.e. these:
click_button("Sign in")
find_button("Sign in")
...both work fine, so it would appear that the XML parser isn't having any trouble finding the element.
Documentation for click_button says that the locator works on "id, text or value", with "text" being meaningless for an input element like this (the visible text is taken from the value attribute), but relevant perhaps for button elements. So we might expect that to fail; though if we view the code via the documentation, we find that it calls down to find in the same way as find_button. Yet find_button is documented differently; it says it locates by "id, name or value". So, sadly, we know from this that the documentation is broken, because it says two different things for what turns out to be an identical call at the back end.
Either way, the element isn't found by name, and that means the lower level find call isn't searching name attributes as far as I can see. This means Capybara (2.2.1, on Nokogiri 1.6.1) is rather broken in that respect. How come nobody has noticed? I've Googled for ages and it doesn't seem to come up. I seem to be rather missing the point :-)
Why don't I just search for the English text in the button, you might ask? Because of internationalisation. This old, Rails 1 -> 2 -> 3 upgraded app has some I18n parts and other static text parts. I don't want to be forced to put I18n into any view that Capybara tests, just so I can have the test use I18n.t() to ensure a match despite different languages or locale file updates. Likewise, it would clearly be very stupid in 2014 to write hard-coded English strings into my tests.
That's why we have names and IDs and such... The unique (in theory!) identifiers that are machine-read, not human-read.
I could hack up something that CSS-selected by "type=submit" but seriously, why isn't Capybara searching the name attribute when its documentation says it does, and why does the documentation disagree on what attributes are searched on two methods that call down to exactly the same back-end implementation with exactly the same parameters?
TIA :)
It turns out the docs are misleading for both calls, as neither look at the attributes listed. It's also clearly very confusing what exactly a "button" means, since a couple of people herein seemed to think it literally only meant an HTML button element but that's not the case.
If you view the source for the documentation of, say, click_button:
https://github.com/jnicklas/capybara/blob/a94dfbc4d07dcfe53bbea334f7f47f584737a0c0/lib/capybara/node/actions.rb#L36
...you will see that this just calls (as I've mentioned elsewhere) down to find with a type of :button, which in turn passes through to Capybara's Query engine, which in turn ends up just using the standard internal selection mechanism to find things. It's quite elegant; in the same way that an external client can add their own custom selectors to make finding things more convenient:
http://rubydoc.info/github/jnicklas/capybara/master/Capybara#add_selector-class_method
...so Capybara adds its own selectors internally, including, importantly, :button:
https://github.com/jnicklas/capybara/blob/a94dfbc4d07dcfe53bbea334f7f47f584737a0c0/lib/capybara/selector.rb#L133
It's not done by any special case magic, just some predefined custom selectors. Thus, if you've been wondering what custom selectors are available from the get-go in Capybara, that's the file to read (it's probably buried in the docs too but I've not found the list myself yet).
Here, we see that the button code is actually calling XPath::HTML.button, which is a different chunk of code in a different repository, with this documentation:
http://rdoc.info/github/jnicklas/xpath/XPath/HTML#button-instance_method
...which is at the time of writing slightly out of date with respect to the code, since the code shows quite a lot more stuff being recognised, including input types of reset and button (i.e. <input type="button"...> rather than <button...>...</button>, though the latter is also included of course).
https://github.com/jnicklas/xpath/blob/59badfa50d645ac64c70fc6a0c2f7fe826999a1f/lib/xpath/html.rb#L22
We can also see in this code that the finder method really only finds by id, value and title - i.e. not by "text" and not by name either.
So assuming XPath is behaving as intended, though it's not clear from docs, we can see that Capybara isn't documenting itself correctly but probably ought to make the link down to XPath APIs for more information, to avoid the current duplication of information and the problems this can cause for both maintainers and API clients.
In the mean time, I've filed this issue:
https://github.com/jnicklas/capybara/issues/1267
You can also use CSS selectors, which are the default Capybara locators. People say they are faster.
find('[name=commit]').click
Capybara does not look at the name attribute in its finders :(
You can use an XPath selector if you want:
find(:xpath, "//input[contains(@name, 'commit')]").click
If anyone wants, it is possible to add a find-by-name selector quite easily. In order to do so:
Add the following code to test/test_helper.rb (for minitest):
Capybara.add_selector(:name) do
  xpath { |name| XPath.descendant[XPath.attr(:name).contains(name)] }
end
Use it
Now in your tests you can use the following selector:
find(:name, 'part_of_the_name_attribute')
It will find every element whose name attribute contains the searched value.
Example
find(:name, 'user')
This will find elements like these (the element can be of any type):
<select name='user_name'>
<input name='name_of_user'>
<textarea name='some_user_info'>
You can use this selector to find a button on a page with RSpec and Capybara:
expect(page).to have_selector(:link_or_button, "Button text")
Check your gem dependencies. RSpec 3 or higher works with gem 'rspec-rails', '~> 3.7.1'; then the Capybara version must be gem 'capybara', '~> 2.18.0' and Poltergeist should be gem 'poltergeist', '~> 1.17.0'.
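In Gemfile form, that combination is simply the versions quoted above:
gem 'rspec-rails', '~> 3.7.1'
gem 'capybara', '~> 2.18.0'
gem 'poltergeist', '~> 1.17.0'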
I need to move away from using xsltproc command-line tools for deployment on Heroku, since they don't really support it. The Nokogiri gem looks like it should work for everything I need, although I am having trouble finding a representative image from HTML.
What I mean by a representative image is the first of all images under /html/body that has "://" in the link and doesn't have "ads.", "ad.", or "?" in the link. Is there a Nokogiri function that will do this, possibly returning an array of all images that I can then filter how I want?
The following XPath should select the first image that meets your stated criteria (the outer parentheses matter: without them, [1] would pick the first matching image within each parent element rather than the first in the whole document):
(/html/body//img[@src[contains(.,'://')
    and not(contains(.,'ads.')
        or contains(.,'ad.')
        or contains(.,'?')
       )
   ]
])[1]
You can use it like this:
doc.xpath("(/html/body//img[@src[contains(.,'://')
    and not(contains(.,'ads.') or contains(.,'ad.') or contains(.,'?'))]])[1]")
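Alternatively, since you mentioned wanting an array of all images that you can filter yourself, here is a rough Ruby sketch (assuming doc is the parsed Nokogiri document, as above):
# Collect every src under /html/body, then filter with plain Ruby.
srcs = doc.xpath('/html/body//img/@src').map(&:value)
representative = srcs.find do |src|
  src.include?('://') &&
    !src.include?('ads.') && !src.include?('ad.') && !src.include?('?')
end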
It seems you need to read up on XPath. Here is a pretty good (and simple) tutorial.
I am trying to parse the table from the URL below (Google Finance):
http://www.google.com/finance/historical?q=BOM:533278
I am trying to extract only the values in the Close column. But when I try the XPath
hd.DocumentNode.SelectSingleNode("//td[@class='rgt']")
I am getting all the nodes that have a class attribute with the value rgt in one Node.InnerText.
I need the values one by one, not all at the same time. I must be doing something silly here. Thank you.
The actual XPath found using Firebug is as follows:
/html/body/div/div/div[3]/div[2]/div/div[2]/div[2]/div/form/div[2]/table/tbody/tr[2]/td[5]
But somehow, after the form tag, HtmlAgilityPack returns a null node. I never thought this would take so long to implement.
If you're using Firebug or any Firefox extension (like XPather) to obtain the XPath of the elements you need to parse, you might need to remove the tbody tags from the XPath.
Take a look at the following answer here on SO: Why does firebug add <tbody> to <table>?
If you're using HtmlAgilityPack, the XPath returned by Firebug or by any other tool related with Firefox may differ, because the HTML source you're parsing can be different from the HTML source in Firefox.
Sometimes it might be useful to open the same page in Internet Explorer 8 and, using Developer Tools (F12), do the same thing you're doing with Firebug; failing that, use another tool like HAP Explorer, which can be downloaded from the HtmlAgilityPack page.
There are many ways to do it. Here is one solution, based on the Date td (the one with the 'lm' class):
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
... load the doc ...
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//td[@class='lm']/../td[5]"))
{
    Console.WriteLine("node=" + node.InnerText);
}
The XPath for the first cell in the Close column is //div[@id='prices']/table/tbody/tr[2]/td[5], for the second it's //div[@id='prices']/table/tbody/tr[3]/td[5], and so on.
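If you want the whole Close column at once rather than cell by cell, something along these lines should work in HtmlAgilityPack (a sketch; tbody is dropped, per the Firebug caveat above, and position() > 1 skips the header row):
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='prices']/table//tr[position() > 1]/td[5]"))
{
    Console.WriteLine(node.InnerText);  // one Close value per data row
}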