xpath scraping data from the second page - html

I am trying to scrape data from this webpage: http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33, and I specifically need data for fund number 26.
I have no problem getting data from the first page at this address (funds 1-25), but for the life of me I can't scrape anything from the second page. Can someone help?
Thanks!
Here is the formula I use in Google Sheets:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[26]/td[@class='Center'][1]")

You can do two things: append &PgIndex=2 to the end of your URL, and significantly simplify your XPath to this:
//*[@id='Prices']//tr[2]/td[2]
This grabs the second row of the table (tr means table row) in order to skip the header row, then grabs the second field in that row (td means table data cell).
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","//*[@id='Prices']//tr[2]/td[2]")

To get the second page, add &PgIndex=2 to your URL. Then change /table/thead/tr[26] to /table/thead/tr[2]. The result is:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[2]/td[@class='Center'][1]")
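The page and row arithmetic behind both answers can be sketched in Python. This is only an illustration under assumptions: 25 funds per page (as the question says for page one), a single header row, and the PgIndex parameter behaving as described above. The helper names locate_fund and page_url are made up for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = (
    "http://webfund6.financialexpress.net/clients/zurichcp/"
    "PortfolioPriceTable.aspx?SchemeID=33"
)


def locate_fund(fund_number, per_page=25, header_rows=1):
    """Return (page, tr_index) for a 1-based fund number.

    Assumes per_page funds on every page and header_rows header rows
    before the first fund row (both are assumptions, not site facts).
    """
    page = (fund_number - 1) // per_page + 1
    row_on_page = (fund_number - 1) % per_page + 1
    return page, row_on_page + header_rows


def page_url(base_url, page):
    """Return base_url with the PgIndex query parameter set to page."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query["PgIndex"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


page, tr_index = locate_fund(26)
url = page_url(BASE_URL, page)
xpath = f"//*[@id='Prices']//tr[{tr_index}]/td[2]"
print(url)
print(xpath)
```

For fund 26 this yields page 2 and tr[2], which matches the URL and XPath used in the formulas above.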


What does the "ved" parameter in a google search refer to?

I've spent like two hours or more trying to figure out what a "ved" parameter on a Google search means. A curious person I am.
My findings so far:
The ved value changes with:
1 - every different search (different keywords)
2 - every different result block (the URL blocks/boxes on the Google results page, although the values are quite similar, as shown below)
3 - every different geolocation, perhaps
Consider these tests or lookups:
1-
Diff keywords, but first block/position in list:
&ved=2ahUKEwidsaSd4M_1AhVlk_0HHUxOCQYQFnoECAsQAg
&ved=2ahUKEwj2pZyN5s_1AhVRmuYKHZ5IB5EQFnoECAcQAg
I thought the "ved" value refers to the block/position of a url in the result list, but no.
2-
Three different URLs: the first and second from the 1st and 2nd blocks of the first page, then a third from a block much farther down the list:
ved=2ahUKEwjq1-Wb1s_1AhW6SWwGHZwpBMwQFnoECD8QAQ
ved=2ahUKEwjq1-Wb1s_1AhW6SWwGHZwpBMwQFnoECCAQAQ
ved=2ahUKEwiZ2NDe1s_1AhVaTmwGHThIA5U4PBAWegQIGRAB
3-
The same website URL, from different countries (not considering blocks or position in list):
&ved=2ahUKEwiopK2X08_1AhUgxzgGHQEbDkcQFnoECBIQAQ
&ved=2ahUKEwjpueqC1M_1AhWJq3IEHYEDAfc4FBAWegQIDBAB
&ved=2ahUKEwih09Wz08_1AhUY7WEKHQYdBB8QFnoECEIQAQ
Very similar they are.
I'd really love to know what they mean. Any ideas are appreciated too!
I found an interesting article explaining the subject: https://moz.com/blog/inside-googles-ved-parameter
TL;DR:
A ved code contains up to five separate parameters, each of which tells you something about the link that was clicked on:
1st (parameter 1: Link index) gives you an idea of where the link was on the page.
2nd (parameter 2: Link type) is a number that corresponds to the 'type' of the link that was clicked.
3rd (parameter 7: Start result position) is the cumulative result position of the first result on the page.
4th (parameter 6: Result position) indicates the position of your page in the search results.
5th (parameter 5: Sub-result position) is like parameter 6, except that it tells you the position in a list of sub-results, such as breadcrumbs or one-page sitelinks.
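The parameter list above can be turned into a rough decoder. This is only a sketch under assumptions that are not stated in this thread: that the first character of a ved value is a version prefix, that the rest is URL-safe base64 of a protobuf-style message, and that the five parameters are plain varint fields numbered 1, 2, 5, 6 and 7. The sample value is synthetic and built inside the code. The sketch stops at the first field that is not a varint, so for values starting with "2" like the ones in the question it may decode nothing at all.

```python
import base64

# Field numbers taken from the parameter list above; names are illustrative.
FIELD_NAMES = {
    1: "link_index",
    2: "link_type",
    5: "sub_result_position",
    6: "result_position",
    7: "start_result_position",
}


def read_varint(data, pos):
    """Read one base-128 varint from data starting at pos."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_ved(ved):
    """Decode leading varint fields of a ved value (assumed layout)."""
    payload = ved[1:]  # drop the assumed one-character version prefix
    payload += "=" * (-len(payload) % 4)  # restore base64 padding
    data = base64.urlsafe_b64decode(payload)
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            break  # not a varint field; this sketch does not parse it
        value, pos = read_varint(data, pos)
        fields[FIELD_NAMES.get(field_number, f"field_{field_number}")] = value
    return fields


# Synthetic sample: field 1 = 5, field 2 = 22, field 6 = 3.
raw = bytes([0x08, 5, 0x10, 22, 0x30, 3])
sample = "0" + base64.urlsafe_b64encode(raw).decode().rstrip("=")
print(sample, decode_ved(sample))
```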

Extract a single row from a table

I’m trying to extract a single row from a table.
I'm using Google Sheets to create the links, and cell D3 contains this URL:
https://www.wsj.com/market-data/quotes/AAPL/options
I have several links in cell D3 to go through.
The words "Last Trade" appear several times in different tables, but I'm only interested in the very first table from the top.
Once that text is found, I want to extract the row just above it.
Below is my IMPORTXML formula. It needs to be modified so that it pulls that row.
=IMPORTXML(D3,"//tr[td1/@class='acenter inthemoney'][last()]")
Any help would be greatly appreciated.
Thanks.
For that row you will need:
(//tr[@class='last_trade_row'])[1]/preceding-sibling::tr[1]
And then pick the right td. It's unclear which td you want, so if you wanted the third td the XPath would be:
(//tr[@class='last_trade_row'])[1]/preceding-sibling::tr[1]/td[3]
It's always the first table that ends with the words LAST TRADE, and the row above it, that I'm looking to extract. So in this case this is the row I'm looking to extract; below is the picture.
https://www.wsj.com/market-data/quotes/AAPL/options
In the above case, where you want the first td, the XPath will then be:
(//tr[@class='last_trade_row'])[1]/preceding-sibling::tr[1]/td[1]
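To see what the XPath selects, here is the same logic written out in plain Python on a small made-up table. The sample markup and the helper name row_before_first_marker are hypothetical; only the class name last_trade_row comes from the answer above. The standard library's ElementTree does not support the preceding-sibling axis, so the sketch walks the rows by index instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far larger.
SAMPLE = """<table>
<tr><td>170</td><td>1.25</td></tr>
<tr><td>175</td><td>0.80</td></tr>
<tr class="last_trade_row"><td>Last Trade</td></tr>
<tr><td>180</td><td>0.40</td></tr>
<tr class="last_trade_row"><td>Last Trade</td></tr>
</table>"""


def row_before_first_marker(table, marker_class="last_trade_row"):
    """Return the row just above the first row carrying marker_class."""
    rows = table.findall("tr")
    for index, row in enumerate(rows):
        if row.get("class") == marker_class:
            # Equivalent of preceding-sibling::tr[1] for the first marker.
            return rows[index - 1] if index > 0 else None
    return None


table = ET.fromstring(SAMPLE)
row = row_before_first_marker(table)
cells = [td.text for td in row.findall("td")]
print(cells)
```

Here cells[0] corresponds to td[1] in the XPath and cells[2], if the row had one, to td[3].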

Is there a way to access the first element in a column on a website using VBA?

Here is a screenshot of a column in a website page.
It is located in that way in the website page :
As you can see, every row has a 'Completed' button you can press, followed by a number of lines. These rows refer to exports, so the column is not static and is constantly changing.
However, every time I run the macro I want to access the first row of the column.
Here is a sample of the HTML code for the first 'Completed' button in the screenshot above:
I have many elements with the same class name. Look at the highlighted rows in the picture below as an example:
I really have no idea how to write VBA code that always accesses the first 'Completed' button in this column.
PS: In the HTML code, the onclick="....." attribute of the "a" tag is constantly changing, so I cannot use it to locate the desired field and click the desired button.
If anyone could help me figure out how to do this, I would be really happy.
Thank you :)
If you want to click the 'Completed' button in the first column, you can use the code below:
Set doc = objIE.Document
doc.getElementsByTagName("tr")(0).getElementsByTagName("td")(0).getElementsByTagName("a")(0).Click
The code gets the first <tr>, then the first <td> inside it, then the first <a> inside that cell.
<tr> tags are rows, <td> tags are cells inside those rows. You did not provide enough code to show the entire table, but generally speaking to access the first row of a table, you would need to refer to the collection object and use the index number you want.
.getElementsByTagName("tr")(0)
This will refer to the first row of a table. Same with getting the first column in the first row of your table:
.getElementsByTagName("tr")(0).getElementsByTagName("td")(0)
Once you have tracked down the particular cell, you want to click the link inside it. You can use the same method as above.
.getElementsByTagName("tr")(0).getElementsByTagName("td")(0).getElementsByTagName("a")(0).Click
As a final note, the first row of a table could be a header, so you may actually want the second row, which is index (1), instead.
Thanks for updating with more HTML code. I am going to slightly switch gears and use querySelector() to grab the main table.
doc.querySelector("#divPage > table.advancedSearch_table > tbody"). _
getElementsByTagName("tr")(3).getElementsByTagName("td")(3).Children(0).Click
See if this works for you.

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use an attribute = value CSS selector with the $ (ends with) operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values (use select, which returns every match, rather than select_one):
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
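The ends-with rule behind that selector can also be sketched with only the standard library, which may help when checking what the selector is expected to match. The sample markup below is hypothetical: it assumes the gender text sits inside an <a> tag whose href ends in typeid=19 or typeid=15, as the question describes, and the thread links are invented. The class name GenderLinkParser is likewise made up for this example.

```python
from html.parser import HTMLParser

# Same rule as [href$='typeid=19'], [href$='typeid=15'].
GENDER_SUFFIXES = ("typeid=19", "typeid=15")


class GenderLinkParser(HTMLParser):
    """Collect the text of every <a> whose href ends with a gender suffix."""

    def __init__(self):
        super().__init__()
        self.genders = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(GENDER_SUFFIXES):
            self._inside = True
            self.genders.append("")

    def handle_data(self, data):
        if self._inside:
            self.genders[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


# Hypothetical markup modelled on the description in the question.
SAMPLE = (
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191'
    '&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> '
    '<a href="thread-1.html">first title</a>'
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191'
    '&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> '
    '<a href="thread-2.html">second title</a>'
)

parser = GenderLinkParser()
parser.feed(SAMPLE)
parser.close()
genders = parser.genders
print(genders)
```

The thread-title links are skipped because their href values do not end with either suffix.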
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Get tabledata from html, JSOUP

What is the best way to extract data from a table from an url?
In short I need to get the actual data from the these 2 tables at: http://www.oddsportal.com/sure-bets/
In this example the data would be "Paddy power" and "3.50"
See this image:
(Sorry for posting image like this, but I still need reputation, i will edit later)
http://img837.imageshack.us/img837/3219/odds2.png
I have tried Jsoup, but I don't know if this is the best way.
I also can't seem to navigate down the tables correctly. I have tried things like this:
tables = doc.getElementsByAttributeValueStarting("class", "center");
link = doc.select("div#col-content > title").first();
String text1 = doc.select("div.odd").text();
The tables call seems to get some data, but it doesn't include the text in the table.
Sorry, man. The second field you want to retrieve is filled by JavaScript. Jsoup does not execute JavaScript.
To select title of first row you can use:
Document doc = Jsoup.connect("http://www.oddsportal.com/sure-bets/").get();
Elements tables = doc.select("table.table-main").select("tr:eq(2)").select("td:eq(2)");
System.out.println(tables.select("a").attr("title"));
The select calls are chained only to make each step visible.