IMPORTXML in Google Sheets cannot seem to dive past multiple spans - html

I have a fun spreadsheet for playing with PowerBall (potential) winnings. I discovered that I could use IMPORTXML to auto-fill the cash value of the current jackpot. But something changed, and while trying to fix it I ran into a problem:
=IMPORTXML(A4,A?), where A4 = https://www.powerball.com/powerball-prize-estimate and A? = the XPath.
The problem is that the XPath comes to this: //*[@id="block-winningnumbersmodule"]/div[2]/div[2]/span[3]
The full XPath is: /html/body/div[2]/div/header/div[2]/div/div[2]/div[2]/span[3]
The relevant div[2] looks like this from the page:
<div class="estimated-jackpot">
<span class="estimated">Estimated Jackpot</span>
<span class="number"></span>
<span class="cash-value" data-value="Cash Value:"></span>
</div>
The second span is empty and I can't access the third span.
I solved this issue for myself, in that the same info turned out to be elsewhere on the page. The old XPath changed very little:
From:
/html/body/div[2]/div/main/div/div[1]/article/div/div[3]
To:
/html/body/div[2]/div/main/div/div[2]/article/div/div[3]
I switched to this one, which is presumably less likely to change at random:
//div[@class="field_prize_amount_cash"]
which lands me in the same place.
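Written out in full (rather than via the cell references), the working formula looks like this:
=IMPORTXML("https://www.powerball.com/powerball-prize-estimate","//div[@class='field_prize_amount_cash']")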
But I'll continue with the question because I think it's still valid: why can't I get past the first span?
These are the XPATHs that I tried and the results for each:
//*[#id="block-winningnumbersmodule"]/div[2]/div[2]/span[3] #N/A XPATH
/html/body/div[2]/div/header/div[2]/div/div[2]/div[2]/span[3] #N/A Full XPATH
//*[#id="block-winningnumbersmodule"]/div[2]/div[2]/span Estimated Jackpot
/html/body/div[2]/div/header/div[2]/div/div[2]/div[2]/span Estimated Jackpot
//div[#class="estimated-jackpot"] Estimated Jackpot
//span[#class="estimated"] Estimated Jackpot
//span[#class="number"] #N/A This is an empty span
//span[#class="cash-value"] #N/A This is what I am wanting
//*[#id="block-winningnumbersmodule"]/div[2]/div[2]/span[1] Estimated Jackpot For completeness' sake
//*[#id="block-winningnumbersmodule"]/div[2]/div[2]/span[2] #N/A Maybe it doesn't count the empty span? Poop!
So, help? Which I don't need anymore, except to scratch an itch :)
Well, I figured it out! The snippet above ("The relevant div[2] looks like this from the page:") showed empty spans! Looking at the Elements panel they were all filled in, but the values never made it to the page source: they are filled in by JavaScript after the page loads. The part I found later in the page was in the Elements panel AND in the source. "There's your problem, right there!"
<div class="field_next_draw_date">
<time datetime="2021-05-16T02:59:59Z">2021-05-16T02:59:59+0000</time>
</div>
<div class="field_prize_amount">
$183 Million
</div>
<div class="field_prize_amount_cash">
$127.4 Million
</div>
Look! Numbers! Not nothing! So when I was getting #N/A it was because I was trying to read nothing: IMPORTXML fetches the raw page source, not the JavaScript-rendered DOM. IMPORTXML is cool, and it will read nothing as nothing every time!
Now I leave this here as...a cautionary tale? ...a reminder? (of what?)
I know... a funny story about how asking a question thoroughly, giving those answering every conceivable piece of information they would need, can wind up leading you to your own answer.
I guess.
Thanks for your help! :)

Related

How to find XPath for the banner text?

The text is "Investors/ Lenders get access to creditworthy borrowers to lend funds as per their risk appetite and gain attractive stable returns or monthly income to create wealth."
How do I find the XPath for the mentioned text?
This is only possible with XPath 2.0 or above, because you need the fn:string-join function to merge the text() values in one XPath expression:
normalize-space(string-join(//p[@class='banner-text']//text(), ' '))
I haven't tested this expression, because you chose to include your code as an image and not as text. Probably the == $0 text is included in the result. You can fix this with the fn:substring-after function.
Here is the XPath:
//p[@class='banner-text']
Here is the CSS:
p.banner-text
Tested in the Chrome console with the XPath and CSS below.
XPath:
document.evaluate("//p[@class='banner-text']", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null).snapshotItem(0).innerText
CSS:
document.querySelector(".banner-text").innerText
Using the Chrome console's $x() helper:
$x("//p[@class='banner-text']")[0].innerText
Result for all three of the above:
Investors/ Lenders get access to creditworthy borrowers to lend funds as per their risk appetite and gain attractive stable returns or monthly income to create wealth.

Selecting some kind of closest child with jQuery seems not to work

I am trying to implement a nested tab module in the following way.
By clicking on a .tabs__menu item I want the next .tabs__contents to display the correct entry.
I've prepared a codepen with the markup and left out all unimportant code, so don't be irritated that it's not working. I don't understand why the variable debug2 is 0 while debug3 is 1. I expect debug2 to be 1 as well, since the following expression should find the element. Can anyone help me with this?
.find(".tabs__contents").not(".tabs__contents .tabs__contents");
https://codepen.io/anon/pen/JNLWQp
Thanks in advance and best wishes,
duc
OK, I have an assumption about why it's not working. It seems that the .not() method doesn't evaluate its selector relative to the given collection but globally, against the whole document. With this statement
.not(".tabs__contents .tabs__contents")
debug2 finds itself and excludes it from the collection; that's why the length is 0.
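One way to keep the exclusion scoped to the current module rather than the whole document is to filter each panel by its nearest module ancestor. A minimal sketch, assuming the click handler runs on the .tabs__menu item and the outer wrapper of each tab module has a class like .tabs__module (the real class name in the codepen may differ):
var $module = $(this).closest(".tabs__module"); // hypothetical wrapper class
var $contents = $module.find(".tabs__contents").filter(function () {
  // keep only panels whose nearest wrapper is this very module,
  // skipping panels that belong to a nested tab module
  return $(this).closest(".tabs__module")[0] === $module[0];
});
Because each panel is compared against this module's own DOM node, nested modules elsewhere in the document can no longer pull elements out of the collection.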

xpath scraping data from the second page

I am trying to scrape data from this webpage: http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33, and I specifically need data for fund number 26.
I have no problem getting data from the first page at this address (funds number 1-25), but for the hell of me I can't scrape anything from the second page. Can someone help?
Thanks!
Here is the code I use in Google Sheets:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33","/html/body/form[#id='MainForm']/table/tr/td/div[#id='main']/div[#id='tabResult']/div[#id='Prices']/table/thead/tr[26]/td[#class='Center'][1]")
You can do two things: one is to append PgIndex=2 onto the end of your URL, and then you can also significantly simplify your XPath to this:
//*[@id='Prices']//tr[2]/td[2]
This grabs the second row of the table (tr means table row) in order to bypass the header row, then grabs the second field, which is the table-data (td) cell.
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","//*[#id='Prices']//tr[2]/td[2]")
To get the second page, add &PgIndex=2 to your URL. Then adjust /table/thead/tr[26] to /table/thead/tr[2]. The result is:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","/html/body/form[#id='MainForm']/table/tr/td/div[#id='main']/div[#id='tabResult']/div[#id='Prices']/table/thead/tr[2]/td[#class='Center'][1]")

Obtaining site's HTML after GET that returns JSON

Introduction
The main question is way down, but I guess it'd help to have some background in this case.
Ok, so I'd like to start by saying that this is actually my first question on Stack Overflow. I have been using this site for ages, and I basically learned everything I know about coding from it, but that also means I have huge gaps in my knowledge: a lot of the methods I use are chosen just because they are the only ones I know. If you have any suggestions for how I can improve the process described below, do not hesitate to share them with me. I would also like to apologize in advance if my request is in any way unclear; I'll try to elaborate in any such case.
Method
I am trying to crawl individual players' performance in UEFA Champions League matches (I'm actually building a Fantasy Football app, a project I reckon I'll have many questions about in the future as well). My source is the UEFA statistics site, e.g. "http://www.uefa.com/uefachampionsleague/season=2016/matches/live/index.html?day=2&session=1&match=2015684". There, about midway down the page, you can choose "PLAYER" statistics and see a "table" (actually a div with lists) where, by default, statistics for the given (single) match are presented. When fully loaded as HTML there are two divs, one for match stats and one for overall; some elements of the lists are hidden (based on the position of the player, for example), but all are there. The part that I'm interested in looks like this:
<div id="matchTab" class="tab-panel rounded-down scrollable ui-tabs-panel ui-widget-content ui-corner-bottom" aria-labelledby="ui-id-7" role="tabpanel" aria-expanded="true" aria-hidden="false">
<div class="tab-content">
<div class="col-stats">
<ul class="goals-table stats">
<li style="display: list-item;">
Goals scored
<span class="value goals-scored">1</span>
</li>
<li style="display: none;">
<li class="bg-highlight" style="display: none;">
<li style="display: none;">
<li class="bg-highlight" style="display: none;">
</ul>
<ul class="attempts-table stats" style="display: block;">
<ul class="passes-attempt-table stats">
<ul class="fouls-table fouls-table-gk stats">
</div>
<div class="col-stats">
</div>
</div>
(Most of the code is collapsed)
Right now I am using WinHTTPRequest in VBA, hoping to populate an Excel range with data, but in the end I will move the project to VB.NET and use SQL. The problem is that I don't seem to be able to obtain the data presented in the list. Using:
whReq.Open "GET", "http://www.uefa.com/uefachampionsleague/season=2016/matches/live/index.html?day=2&session=1&match=2015684"
Returns only the main structure of the site, not the data from the "table". I therefore used Firebug, and then Wireshark, to inspect the data that's being transferred when a player is changed in the selection. One packet is general stats about the player, his age, name, etc. - useless. The second one looks like this:
GET /livecommon/match-centre/cup=1/season=2016/round=2000634/player=103697/overall.json?v=1448115572662
(The "v" parameter is actually useless, works the same way without it)
which actually does return a bunch of data. Apparently it's in the form of JSON, and looks for example like this:
{"Players":null,"OverallStat":{"250021048":{"PlayerId":250021048,"TeamId":0,"MatchesPlayed":3,"MinutesPlayed":188,"GoalsScored":0,"GoalsConceded":0,"GoalsByMinute":0,"GoalsByAttempts":0,"TotalAttempts":2,"Assist":0,"Saves":0,"SavesOnAttempts":0,"SavesByMinutes":0,"FoulsCommitted":3,"FoulsCommittedByMinute":63,"FoulsSuffered":0,"FoulsSufferedByMinute":0,"FoulsPenalty":0,"FoulsSuffPenalty":0,"YellowCard":0,"RedCard":0,"Passes":149,"PassingDistribution":0,"PassesCompleted":68,"PassesAttempted":81,"PassingAccuracy":84,"Delivery":0,"Run":2,"AttempsOn":0,"AttempsOff":2,"Offside":1,"ShotHittingPost":0,"HittingBar":0,"ShotBlocked":2,"Corners":0,"Attacks":0,"BigChance":0,"BallPossession":0,"DistanceCovered":18655,"ClearancesAttempted":4,"ClearancesCompleted":3,"Blocked":0,"TackleCompleted":0,"TackleWrong":0}},"LastUpdatedCET":"05 November 2015, 11:47 CET","LastUpdateDay":5,"LastUpdateMonth":11,"LastUpdateYear":2015,"LastUpdateHour":11,"LastUpdateMinute":47}
And this is the kind of response that is obtained whenever a player is changed, no matter whether single-match statistics or overall are chosen (of course, one does not actually have to be on the site; no cookies are needed, just "http://www.uefa.com/livecommon/match-centre/cup=1/season=2016/round=2000634/player=250021048/overall.json" and there you have the statistics).
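Retrieving that JSON with the same WinHTTPRequest object is itself no problem; a minimal, untested sketch reusing my whReq object:
whReq.Open "GET", "http://www.uefa.com/livecommon/match-centre/cup=1/season=2016/round=2000634/player=250021048/overall.json", False
whReq.Send
Dim jsonText As String
jsonText = whReq.ResponseText ' raw JSON string; the individual fields still need parsing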
The thing is, those are overall statistics. I can imagine there is some code that calculates the statistics for an individual match (I could actually do that myself: just refresh the data after each match and compute the per-match statistics from the differences). Nevertheless, they tend to update the statistics irregularly, so I would really prefer to be able to retrieve stats for individual matches directly. So:
Question
Given the above, is there a way to obtain the HTML of the site after the additional GET for player statistics, using WinHTTPRequest? If not, what could be the best way? I did try working with the InternetExplorer object, but was unable to produce any results at all (I actually had the same problem at work, where I needed to access our company's site: I couldn't even check for readyState; other sites worked, but I guess that's a topic for another question).
Thanks in advance, and again, sorry if anything I wrote is unclear.

Trouble with Xpath in Google Spreadsheets (ImportXML)

This is a great site, and I've already had a lot of questions answered simply by scrolling and searching through other postings. Unfortunately, I can't seem to track down an answer that specifically helps this problem, and figured I would try posting and looking for help.
I'm using ImportXML in Google Spreadsheets to 'scrape' a few product descriptions from a retail site. It's been working fine for the most part, and I have done it in two ways:
1) Specific call to the description part of a post:
=ImportXML(A1,"//div[#class='desc']")
2) Call to the entire 'product Card', which also returns info such as product title, price, time posted, and places these items in adjacent cells in my Google spreadsheet:
=ImportXML(A1,"//div[#class='productCard']")
Both have worked fine, but I've run into a different problem using each method. If I can resolve even one of these problems, then I'll happily scrap the other method; I just need one of them to work. The problems are:
Method 1) The website prohibits sellers from including contact information in product postings. When they include an email address anyway, the site automatically blocks it, so that in the posting it simply appears as "...you can reach me at [obscured]" or something like that. The [obscured] appears in differently coloured text and is obviously treated differently somehow. When I scrape these descriptions using Method 1, ImportXML appears to get 'bumped' when it hits the word [obscured], and it pushes the remaining text of that product description into the next cell over in my spreadsheet. This ruins the entire organization of the sheet, and I'd like to find a way to get ImportXML to just ignore the [obscured] and still place the entire text of the product description in one cell.
Method 2) My call for the entire 'product Card' is as follows:
=ImportXML(A1,"//div[#class='productCard']")
As mentioned, this works fine (for most products), and I don't mind the additional info (price, date, etc.) being posted in adjacent cells.
However, the website also allows certain products to be 'featured', where they appear in a different colour box on the site, and are therefore more likely to get a buyer's attention.
Using this method, the 'featured' products are not scraped or imported into my spreadsheet, but are simply passed over.
The source code on the actual site (via 'Inspect Element' in Safari) for both the description (Method 1) and the product card (Method 2) looks as follows, for a normal product (a) and a featured product (b):
(a)
<div id="productSearchResults">
<div class="productCard tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
(b)
<div id="productSearchResults">
<div class="productCard featured tracked">
<div>...</div>
<div class="stats">...</div>
<div class="desc collapsed descFull">...</div>
</div>
You can see in both (a) and (b) the 'desc' class that I call in Method 1, which seems to work fine.
From my reading on this site, I think I've learned that a given class can't have more than one word, and that therefore "desc collapsed descFull", "productCard tracked", and "productCard featured tracked" don't represent classes with three, two, and three words in the name, but are instead cases where multiple classes have been assigned?
Regardless, the call to 'desc' (Method 1) works fine and seems to get all descriptions.
In Method 2, therefore, I would have thought that a call to 'productCard' would get the info for all products, both featured and regular, as 'featured' is just an extra class assigned to some productCards. If I call all productCards, shouldn't the normal AND the featured ones be returned? This is currently not the case. I've tried calling just 'tracked' and just 'featured' as classes, and neither returns anything, so my logic that they are their own classes equivalent to 'productCard' may be flawed.
In summary, the 'desc' call in Method 1 works fine, and even gets descriptions for 'featured' products. However, when contact information is included in the description and is displayed as [obscured] it bumps my data into the next cell in the spreadsheet, immediately following the word. This throws off and ruins all organization.
In Method 2, I am not getting the featured products at all, which greatly weakens what I am trying to do. Can either (or both!) of these problems be fixed??
Thanks so so much for any help you can give me.
***UPDATE: As seen in the comments below, use of contains() as suggested improved Method 2 by retrieving both regular and featured products. However, featured product cards have extra text elements, and since the entire card is being scraped in this method, featured products do not match the cell alignment of regular products. If there is a way to fix Method 1, that would therefore be much better.
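(For reference, the contains()-based call presumably looks something like this; contains(@class,'productCard') tests the class attribute as a substring, so it matches "productCard tracked" as well as "productCard featured tracked", unlike the exact match @class='productCard':)
=ImportXML(A1,"//div[contains(@class,'productCard')]")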
As outlined in the comments below, the [obscured] text appears in a 'span' nested underneath/indented from the
<div class="desc descFull collapsed">
as
<span class="obscureText">[obscured]</span>
Is there any way that I can import the 'desc's as I have been, but tell the XPath to essentially 'ignore' the [obscured] span, or at least deal with it in a way that doesn't make description text immediately after [obscured] appear one cell over?
Thanks so much everyone!
You can wrap your function with the concatenate() function to make sure it all shows up in one cell:
=concatenate(ImportXML(A1,"//div[@class='productCard']"))
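The same wrapper should also rescue Method 1, joining the description pieces that the [obscured] span splits apart back into a single cell (a sketch, untested; note that if the XPath matches several products at once, concatenate() will join all of them into one cell too):
=concatenate(ImportXML(A1,"//div[@class='desc']"))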