Get table data from HTML with Jsoup

What is the best way to extract data from a table at a URL?
In short, I need to get the actual data from the two tables at: http://www.oddsportal.com/sure-bets/
In this example the data would be "Paddy power" and "3.50".
See this image:
(Sorry for posting the image like this; I still need reputation and will edit later.)
http://img837.imageshack.us/img837/3219/odds2.png
I have tried Jsoup, but I don't know whether it is the best approach.
I also can't seem to navigate down the tables correctly. I have tried things like this:
tables = doc.getElementsByAttributeValueStarting("class", "center");
link = doc.select("div#col-content > title").first();
String text1 = doc.select("div.odd").text();
The tables selection seems to return some data, but it doesn't include the text in the table.

Sorry, man. The second field you want to retrieve is filled by JavaScript. Jsoup does not execute JavaScript.
To select the title in the first row, you can use:
Document doc = Jsoup.connect("http://www.oddsportal.com/sure-bets/").get();
Elements tables = doc.select("table.table-main").select("tr:eq(2)").select("td:eq(2)");
System.out.println(tables.select("a").attr("title"));
The chained select() calls are only there to make each step visible; they can be combined into a single selector.

Is there a way to access the first element in a column on a website using VBA?

Here is a screenshot of a column on a web page.
It is located on the page like this:
As you can see, every row has a 'Completed' button you can press, followed by a number of lines. These rows refer to exports, so the column is not static and changes constantly.
However, every time I run the macro I want to access the first row of the column.
Here is the HTML of the first 'Completed' button in the screenshot above:
Many elements share the same class name. Look at the highlighted rows in the picture below as an example:
I really have no idea how to write VBA code that always accesses the first 'Completed' button in this column.
PS: In the HTML, the onclick="....." attribute of the "a" tag changes constantly, so I cannot use it to locate and click the desired button.
If anyone could help me figure out how to do this, I would be really happy.
Thank you :)
If you want to click the 'Completed' button in the first column, you can use the code below:
Set doc = objIE.Document
doc.getElementsByTagName("tr")(0).getElementsByTagName("td")(0).getElementsByTagName("a")(0).Click
The code gets the first <tr>, then the first <td> inside it, then the first <a> inside that cell.
<tr> tags are rows, <td> tags are cells inside those rows. You did not provide enough code to show the entire table, but generally speaking to access the first row of a table, you would need to refer to the collection object and use the index number you want.
.getElementsByTagName("tr")(0)
This will refer to the first row of a table. Same with getting the first column in the first row of your table:
.getElementsByTagName("tr")(0).getElementsByTagName("td")(0)
Once you have tracked down the particular cell, you want to click the link inside it. You can use the same method as above.
.getElementsByTagName("tr")(0).getElementsByTagName("td")(0).getElementsByTagName("a")(0).Click
One final note: the first row of a table could be a header, so you may actually want the second row, index (1), instead.
Thanks for updating with more HTML code. I am going to slightly switch gears and use querySelector() to grab the main table.
doc.querySelector("#divPage > table.advancedSearch_table > tbody"). _
getElementsByTagName("tr")(3).getElementsByTagName("td")(3).Children(0).Click
See if this works for you.

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but only the text that is wrapped in one of two specific links, plus another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list, then move on to the next instance of this code, scrape those two text snippets, store them as another list, and so on. I'm building a list of lists in which each row (nested list) contains two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
It sounds like you could use a CSS attribute selector with the $= (ends with) operator.
If there can only be one occurrence per page
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values:
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
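As a minimal sketch of building the list of lists the question asks for: the HTML in the question appears to have lost its <a> tags, so the markup below is assumed purely for illustration. The div.row wrapper, the a.title class, the thread-*.html links and the second row's placeholder title are all hypothetical. Only BeautifulSoup's select() and select_one() are used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wrapper <div class="row">, the "title" class and
# the thread links are assumptions, not taken from the real page.
html = '''
<div class="row">
  <em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em>
  <a class="title" href="thread-1.html">寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179</a>
</div>
<div class="row">
  <em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em>
  <a class="title" href="thread-2.html">placeholder title</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Matches a link whose href ends with either typeid value.
gender_selector = "[href$='typeid=19'], [href$='typeid=15']"

rows = []
for row in soup.select("div.row"):
    gender = row.select_one(gender_selector)
    title = row.select_one("a.title")
    # Fall back to a marker string when either part is missing in a row.
    rows.append([
        gender.get_text(strip=True) if gender else "Not found",
        title.get_text(strip=True) if title else "Not found",
    ])

print(rows)
```

Searching inside each row container, rather than across the whole page, keeps each gender paired with the title that belongs to it even if one of the two is missing.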
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

Example to combine headers footers paragraphs of html and Tables

The jspdf-autotable examples['header-footer'] example gets me most of what I need for my task.
I am trying to add rich text (a constant font, with some bold and underlined words) before and after a table. Looking at examples.content did not make it clear how to do this.
So a complete PDF might be:
1. some paragraphs of text
2. a table on more than one page
3. some paragraphs of text
4. another table on more than one page
How do I combine all of this in one var doc = new jsPDF(); ?
Example code would be much appreciated.
The key is to use doc.autoTable.previous.finalY to get the final y position where the table finished drawing. You can then use that value to position text with doc.text(). If you want further guidance, please update your question with more info on what you have tried and what didn't work.

xpath scraping data from the second page

I am trying to scrape data from this webpage: http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33, and I specifically need data for fund number 26.
I have no problem getting data from the first page with this address (funds number 1-25), but for the life of me I can't scrape anything from the second page. Can someone help?
Thanks!
Here is the formula I use in Google Sheets:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[26]/td[@class='Center'][1]")
You can do 2 things - one is to append the PgIndex=2 onto the end of your URL, and then you can also significantly simplify your xpath to this:
//*[@id='Prices']//tr[2]/td[2]
This specifically grabs the second row on the table (tr which means table-row), in order to bypass the header row, then grabs the second field which is the table-data cell.
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","//*[@id='Prices']//tr[2]/td[2]")
To get the second page, add &PgIndex=2 to your url. Then adjust the /table/thead/tr[26] to /table/thead/tr[2]. The result is:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[2]/td[@class='Center'][1]")
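To see what the shorter XPath selects, here is a rough sketch using Python's standard-library xml.etree.ElementTree, which supports a subset of XPath. The tiny document below is a made-up stand-in for the real page (its structure, "Example Fund" and "1.23" are placeholders), and the leading "." is needed only because ElementTree evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page; the values are placeholders.
page = """
<html><body>
  <div id="Prices">
    <table>
      <tr><td>No.</td><td>Fund</td><td>Price</td></tr>
      <tr><td>26</td><td>Example Fund</td><td>1.23</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(page)

# Any element with id='Prices', then any descendant <tr> that is the second
# <tr> under its parent (skipping the header row), then its second <td>.
cells = root.findall(".//*[@id='Prices']//tr[2]/td[2]")
values = [cell.text for cell in cells]
print(values)
```

Note that the attribute test is written with @ (as in @id); a # in that position is not valid XPath.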

AngularJS - Conditionally display key and value if they exist

This may be a little confusing to describe.
Basically, I am parsing multiple external JSON feeds that display in different views depending on the 'active tab' displayed. They both share the same partial template, so they both look exactly the same, just different content.
The problem that I am facing now is, that in some feeds, some keys are placed in an array and others are not.
For example, the feeds contain this kind of data:
JSON Feed 1 - One 'attributes' inside of 'link'
"link": {
    "attributes": {
        "href": "www.link1.com"
    }
}
JSON Feed 2 - Two 'attributes' inside of 'link'
"link": [
    {
        "attributes": {
            "href": "www.link1.com"
        }
    },
    {
        "attributes": {
            "href": "www.link2.com"
        }
    }
]
The only way I am able to get the value "www.link1.com" is via:
For Feed 1:
link1
And for Feed 2:
link1
I am trying to figure out what would be the best way to do:
1) If link[0] exists - display it, else if [link] exists, display that instead.
2) Or if targeting the activeTab would be safer? For instance, if activeTab = view2 or view4, use [link][0], else if activeTab = view1 or view3 use [link], else if I do not want it to be displayed, do not display anything.
Also, a related question: if I am on view2, can I display [link][0] only on that view?
Any feedback would be appreciated. Thanks!
In your model controller, you can reconstruct the JSON objects to make them similar. The value of link in both feeds should be an array.
Then in your template you can simply use ngRepeat to get the items from inside the array.
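The normalization step can be sketched as follows. This is written in Python only to show the logic; in the AngularJS controller the same check would be done in JavaScript. The helper name normalize_links is made up for illustration, and the two feeds are the samples from the question.

```python
import json

def normalize_links(item):
    """Return the item's 'link' value as a list, whatever shape the feed used."""
    link = item.get("link")
    if link is None:
        return []          # no link at all: nothing to display
    if isinstance(link, list):
        return link        # Feed 2 style: already a list
    return [link]          # Feed 1 style: wrap the single object

feed1 = json.loads('{"link": {"attributes": {"href": "www.link1.com"}}}')
feed2 = json.loads(
    '{"link": [{"attributes": {"href": "www.link1.com"}},'
    ' {"attributes": {"href": "www.link2.com"}}]}'
)

for item in (feed1, feed2):
    hrefs = [link["attributes"]["href"] for link in normalize_links(item)]
    print(hrefs)
```

Once both feeds have the same shape, the template no longer needs to know which feed an item came from.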
Okay - so I found a solution to one of the questions above: "How to only display [link][0] in a specific view"
Pro: It's a simple code that depends on the activeTab / view that is being displayed.
Con(?): Since I am really a newbie to AngularJS - not sure if this is the best solution.
Basically:
Depending on which ng-view is currently displayed, a specific JSON object is shown, for example:
<a ng-show="activeTab == 'view1' || activeTab == 'view3'" ng-href="{{item['link'][0]['attributes']['href']}}">
<h6>Link1 from Feed2</h6>
</a>
The primary question is still unresolved, though: how to switch between JSON objects (keys and values) when one exists and the other does not. I am still trying to find a solution, so any help is appreciated.
Please let me know what you think, or how I can improve the solution to the problem!
Thanks!
Roc.