I am trying to pull specific columns from a table on a webpage into my Excel file using VBA. I can open the webpage, log in, and navigate to the table, but I am unable to get specific columns from it. I don't know how to pull only certain columns from a table within a web page.
I use Chrome for the automation. Below is sample HTML for reference.
<table class="Performed-Detailes-Mac">
<thead class="table-head-basic">
<tr>
<th>File</th>
<th>Name</th>
<th>Date</th>
<th>Wait 1</th>
<th>Wait 2</th>
<th>Status</th>
<th class="text-right">Machines</th>
<th class="text-right">Usage</th>
</tr>
</thead>
<tbody>
<tr class="table-row">
<td data-bind="text: id">File12</td>
<td data-bind="text: Name">JCB</td>
<td data-bind="text: Date">02/01/2022</td>
<td data-bind="text: check1">10:55 </td>
<td data-bind="text: check2">12:30</td>
<td data-bind="text: Status">Completed</td>
<td class="text-right" data-bind="text: Machines">2</td>
<td class="text-right" data-bind="text: Str">100 Percent</td>
</tr>
<tr class="table-row" data-bind="visible : $root.isEditItemsOnDetailsEnabled || $root.Items().length > 0">
<td class="text-right" data-bind="text: TotalDuration">1.75</td>
</tr>
</tbody>
</table>
For reference, I have provided only one data row (tr) along with the header details.
From the above HTML I would like to extract only the "Date" and "Machines" columns, for all rows.
The code I tried is below. I have tried various changes in the For loop, but no luck so far.
Sub GetTable()
Dim Dr As New Selenium.ChromeDriver
Dim hTable, hBody, hTR, hTD, tb As Object
Dim bb, tr, td As Object
Dr.Get "My Webpage Url"
Dr.Wait 2000
With Sheet1
Set hTable = Dr.FindElementsByCss(".Performed-Detailes-Mac")
For Each tb In hTable
Set hBody = tb.FindElementsByTag("tbody")
For Each bb In hBody
Set hTR = bb.FindElementsByTag("tr")
For r = 1 To hTR.Count - 2
Set hTD = hTR(r).FindElementsByTag("td")
If hTD.Count = 0 Then Set hTD = hTD(r).FindElementsByTag("th")
Lastrow = .Cells(Rows.Count, 1).End(xlUp).Row + 1
For c = 1 To hTD.Count
.Cells(Lastrow, c).Value = hTD(c - 1).Text
Next c
Next r
Next bb
Exit For
Next tb
End With
End Sub
This is my first question; my apologies if I have got anything wrong.
Thanks Gold
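A language-neutral way to think about picking columns: read the header row once, find the positions of the headers you want, then read only those positions from each body row. The sketch below illustrates that idea in Python with the standard-library parser, using a trimmed copy of the HTML from the question. `TableRows` and `extract_columns` are hypothetical names introduced for illustration, and skipping rows that are too short (such as the single-cell totals row) is an assumption about the desired behaviour. In the VBA loop the same idea would mean comparing each header's text against "Date" and "Machines" and writing only the cells at those positions.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> currently open
        self._cell = None  # text fragments of the <th>/<td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_columns(html, wanted):
    """Return, for each body row, only the cells under the wanted headers."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    idx = [header.index(name) for name in wanted]
    # Assumption: rows too short to hold every wanted column are skipped.
    return [[row[i] for i in idx] for row in body if len(row) > max(idx)]


SAMPLE = """
<table class="Performed-Detailes-Mac">
<thead><tr><th>File</th><th>Name</th><th>Date</th><th>Wait 1</th><th>Wait 2</th>
<th>Status</th><th class="text-right">Machines</th><th class="text-right">Usage</th></tr></thead>
<tbody>
<tr class="table-row"><td>File12</td><td>JCB</td><td>02/01/2022</td><td>10:55 </td>
<td>12:30</td><td>Completed</td><td class="text-right">2</td><td class="text-right">100 Percent</td></tr>
<tr class="table-row"><td class="text-right">1.75</td></tr>
</tbody>
</table>
"""

result = extract_columns(SAMPLE, ["Date", "Machines"])
print(result)
```

Looking the positions up by header text, rather than hard-coding column numbers, keeps the extraction working if the page later adds or reorders columns.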
I am using a StringBuilder to create an HTML file from my DataTable. The file is created, but when I open it in the web browser I have to scroll all the way down to see the table. In other words, there is a big blank area at the top of the page with nothing in it.
Public Function ConvertToHtmlFile(ByVal myTable As DataTable) As String
Dim myBuilder As New StringBuilder
If myTable Is Nothing Then
Throw New System.ArgumentNullException("myTable")
Else
'Open tags and write the top portion.
myBuilder.Append("<html xmlns='http://www.w3.org/1999/xhtml'>")
myBuilder.Append("<head>")
myBuilder.Append("<title>")
myBuilder.Append("Page-")
myBuilder.Append("CLAS Archive")
myBuilder.Append("</title>")
myBuilder.Append("</head>")
myBuilder.Append("<body>")
myBuilder.Append("<br /><table border='1px' cellpadding='5' cellspacing='0' ")
myBuilder.Append("style='border: solid 1px Silver; font-size: x-small;'>")
myBuilder.Append("<br /><tr align='left' valign='top'>")
For Each myColumn As DataColumn In myTable.Columns
myBuilder.Append("<br /><td align='left' valign='top' style='border: solid 1px blue;'>")
myBuilder.Append(myColumn.ColumnName)
myBuilder.Append("</td><p>")
Next
myBuilder.Append("</tr><p>")
'Add the data rows.
For Each myRow As DataRow In myTable.Rows
myBuilder.Append("<br /><tr align='left' valign='top'>")
For Each myColumn As DataColumn In myTable.Columns
myBuilder.Append("<br /><td align='left' valign='top' style='border: solid 1px blue;'>")
myBuilder.Append(myRow(myColumn.ColumnName).ToString())
myBuilder.Append("</td><p>")
Next
Next
myBuilder.Append("</tr><p>")
End If
'Close tags.
myBuilder.Append("</table><p>")
myBuilder.Append("</body>")
myBuilder.Append("</html>")
'Get the string for return. myHtmlFile = myBuilder.ToString();
Dim myHtmlFile As String = myBuilder.ToString()
Return myHtmlFile
End Function
A sample HTML table (from the MDN docs):
<table>
<thead>
<tr>
<th colspan="2">The table header</th>
</tr>
</thead>
<tbody>
<tr>
<td>The table body</td>
<td>with two columns</td>
</tr>
</tbody>
</table>
If you study the "permitted content" of the various table elements (dive deeper too, for instance into <tr>), you will see that a <br> or <p> cannot appear between <table>, <tr> or <td> elements; only table-related elements are allowed there.
A <tr> is already a row in the table, so you don't need breaks or paragraphs to move it to a separate row.
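To make the point concrete, here is a minimal sketch (in Python, purely for illustration) of building the table markup with nothing but table elements between <table>, <tr> and the cells. `table_to_html` is a hypothetical helper, and the column names and rows are made-up sample data. Cell text is HTML-escaped, which the original builder does not do; that is an addition of this sketch.

```python
from html import escape


def table_to_html(columns, rows):
    """Build a table string with no <br> or <p> between table elements."""
    parts = ["<table border='1' cellpadding='5' cellspacing='0'>"]
    # Header row: one <th> per column, opened and closed inside a single <tr>.
    header = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    parts.append(f"<tr>{header}</tr>")
    # Data rows: every <tr> is closed inside the same loop iteration.
    for row in rows:
        cells = "".join(f"<td>{escape(str(v))}</td>" for v in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return "\n".join(parts)


html_text = table_to_html(["Id", "Name"], [[1, "A & B"], [2, "C"]])
print(html_text)
```

Note that each row is closed in the same iteration that opens it; in the code from the question the closing </tr> is appended only once, after the row loop has finished.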
FWIW, I find using XElement to build HTML pages easier than using strings.
Dim myHtml As XElement
'XML literals
' https://learn.microsoft.com/en-us/dotnet/standard/linq/xml-literals
'note lang and xmlns missing. see below
myHtml = <html>
<head>
<meta charset="utf-8"/>
<title>Put title here</title>
</head>
<body>
<table border="1px" cellpadding="5" cellspacing="0" style="border: solid 1px Silver; font-size: x-small;">
<thead>
<tr>
<th colspan="4">The table header</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</body>
</html>
'test. five rows, four columns
For r As Integer = 1 To 5
Dim tr As XElement = <tr align="left" valign="top"></tr>
For c As Integer = 1 To 4
Dim td As XElement
' XML embedded expressions
' https://learn.microsoft.com/en-us/dotnet/standard/linq/xml-literals#use-embedded-expressions-to-create-content
td = <td align="left" valign="top"><%= "Row:" & r.ToString & " Col:" & c.ToString %></td>
tr.Add(td)
Next
myHtml.<body>.<table>.<tbody>.LastOrDefault.Add(tr)
Next
Dim s As String = myHtml.ToString
'add lang and xmlns to string!!
s = s.Replace("<html>", "<html lang='en' xmlns='http://www.w3.org/1999/xhtml'>")
I am working on an Excel VBA project to scrape some specific information from a website. The view of this data on the website is as such:
Website View:
What I am looking to do is extract text based on two criteria: name and post date. For example, given the name Kaelan and the post date 11/16/2016, I want to extract the amount of $365.
This is the HTML code:
<div class="familyLedgerAmountCategory" id="id_4541278">
<table>
<tr>
<td class="tdCategoryRow">
<div class="cmFloatLeft divExpandToggle expanded" id="divCategoryToggle_id_4541278"></div>
<div class="cmFloatLeft" id="divCategoryLabel_id_4541278" style="width: 430px;">
Kaelan
</div><span style="margin-left: 5px;">$ 465.00</span>
</td>
</tr>
<tbody>
<tr class="trListTableBody LedgerExisting" id="CamperFamilyLedgerRowControl_14816465">
<td class="tdCamperFamilyLedgerTableColumnDescription tdBorderTop" id="tdCamperFamilyLedgerTableColumnDescription_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnDescriptionCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
<a class="aColumnDescriptionCell" id="aColumnDescriptionCell_CamperFamilyLedgerRowControl_14816465" name="aColumnDescriptionCell_CamperFamilyLedgerRowControl_14816465" target="_self" title="Click to view details">2017 Super Early Bird Teen Camp - Tuition</a>
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnPostDate tdBorderTop" id="tdCamperFamilyLedgerTableColumnPostDate_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnPostDateCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
11/16/2016
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnEffective tdBorderTop" id="tdCamperFamilyLedgerTableColumnEffective_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnEffectiveCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
11/15/2016
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnQty tdBorderTop" id="tdCamperFamilyLedgerTableColumnQty_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnQtyCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
1
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnAmount tdBorderTop" id="tdCamperFamilyLedgerTableColumnAmount_CamperFamilyLedgerRowControl_14816465">
<div class="divListTableBodyCell" id="tdColumnAmountCell">
<table class="tblListTableBodyCell">
<tr>
<td>
<div class="divListTableBodyLabel">
$ 365.00
</div>
</td>
</tr>
</table>
</div>
</td>
<td class="tdCamperFamilyLedgerTableColumnAction tdBorderTop" id="tdCamperFamilyLedgerTableColumnAction_CamperFamilyLedgerRowControl_14816465"></td>
</tr>
</tbody>
</table>
</div>
My attempt to pull the amount is as follows:
Sub Test()
Dim ie As Object
Dim oElement As Object
Dim wsTarget As Worksheet
Dim i As Integer
Dim NewWB As Workbook
Set NewWB = ActiveWorkbook
Set wsTarget = NewWB.Sheets(1)
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.navigate website...
Wait 6
ie.document.All.Item("txtUserName").Value = "User"
ie.document.All.Item("pswdPassword").Value = "Pass"
Wait 1
ie.document.getElementById("btnLogin").Click
Wait 5
ie.navigate website...
i = 1
For Each oElement In ie.document.getElementsByClassName("cmFloatLeft")
If oElement.innerText = "Kaelan" Then
extract1 = oElement.getElementsByClassName("divListTableBodyLabel").innerText
MsgBox extract1
Else
End If
Next
However, I get an error when running the code above. Can I find the cmFloatLeft element I am looking for and then immediately query the divListTableBodyLabel class, even though that element is not nested directly under the cmFloatLeft element?
Sorry, I'm still pretty new to scraping web data.
Thanks
That structure is a bit difficult to scrape. You could try going "up" from the "Kaelan" node to the parent table, and then looping over that table to extract the various pieces of information. If the post structures are consistent, that could provide one approach.
Set doc = IE.document
Set els = doc.getElementsByClassName("cmFloatLeft")
i = 1
For Each oElement In els
Debug.Print oElement.innerText
If Trim(oElement.innerText) = "Kaelan" Then
Set tbl = GetParent(oElement, "table") '<< find the parent table
If Not tbl Is Nothing Then
'loop over the parent table
For Each rw In tbl.Rows
For Each cl In rw.Cells
Debug.Print cl.innerText
Next cl
Next rw
End If
End If
Next
Function to find a named parent (by tag name):
Function GetParent(el, tagParent)
Dim rv As Object
Set rv = el
Do While Not rv.parentElement Is Nothing
Set rv = rv.parentElement
If UCase(rv.tagName) = UCase(tagParent) Then
Set GetParent = rv
Exit Function
End If
Loop
Set GetParent = Nothing
End Function
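The same "climb to the parent table, then match a row" idea can be sketched outside the browser. The Python example below is only an illustration under stated assumptions: `SNIPPET` is a simplified, well-formed stand-in for the posted markup (the nested tables and ids are removed), and `text_of`, `nearest_ancestor` and `amount_for` are hypothetical helper names. It finds the cmFloatLeft div holding the name, walks up to the enclosing table, and returns the amount from the row whose post date matches.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the posted HTML (an assumption, not the real page).
SNIPPET = """
<div class="familyLedgerAmountCategory">
  <table>
    <tr><td class="tdCategoryRow"><div class="cmFloatLeft"> Kaelan </div><span>$ 465.00</span></td></tr>
    <tr class="LedgerExisting">
      <td class="tdCamperFamilyLedgerTableColumnPostDate"><div> 11/16/2016 </div></td>
      <td class="tdCamperFamilyLedgerTableColumnAmount"><div> $ 365.00 </div></td>
    </tr>
  </table>
</div>
"""


def text_of(el):
    """All text inside an element, trimmed (similar in spirit to innerText)."""
    return "".join(el.itertext()).strip()


def nearest_ancestor(el, tag, parents):
    """Walk up the parent map until an element with the given tag is found."""
    while el in parents:
        el = parents[el]
        if el.tag == tag:
            return el
    return None


def amount_for(root, name, post_date):
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for div in root.iter("div"):
        classes = (div.get("class") or "").split()
        if "cmFloatLeft" in classes and text_of(div) == name:
            table = nearest_ancestor(div, "table", parents)
            if table is None:
                continue
            for row in table.iter("tr"):
                cells = {td.get("class", ""): text_of(td) for td in row.findall("td")}
                dates = [v for k, v in cells.items() if "PostDate" in k]
                amounts = [v for k, v in cells.items() if "ColumnAmount" in k]
                if dates and amounts and dates[0] == post_date:
                    return amounts[0]
    return None


root = ET.fromstring(SNIPPET.strip())
amount = amount_for(root, "Kaelan", "11/16/2016")
print(amount)
```

Matching cells by a fragment of their class name (PostDate, ColumnAmount) relies on the class names shown in the question staying stable; if they change, the lookup would need adjusting.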
I have been looking everywhere for any possible workaround to this issue.
All of the data at my company is accessed via a web portal that produces static HTML pages. Unfortunately our department cannot be given direct access to the server, which would make my life easy, so I need to scrape this portal to find the data I need. My navigation is fine and I am quite experienced with scraping where elements are named or given an ID; however, this table has neither.
Anyway, background out of the way.
I want to grab a table from the page that has a unique style of "empty-cells: show;":
<TABLE cellspacing=10 cellPadding=10 border="1" style="empty-cells: show;">
</TABLE>
Or, failing that, there is a heading in the first row that always contains the same text string. Once I have that table I can extract the data I need from it. The data here is highly sensitive, so unfortunately I can't provide the full page code.
I know that there have been many posts regarding GetElementByRegex, but I cannot find a post or website that actually explains how to use it. Instead they all want me to install their add-on, which isn't an option (I need to learn this to sate my thirst for knowledge).
To help, I have added the full table code below with the sensitive data removed:
<TABLE cellspacing=10 cellPadding=10 border="0" width=100%>
<tr>
<td>
<TABLE cellspacing=10 cellPadding=10 border="1" style="empty-cells: show;">
<TR class="row0">
<TD style="width: 25%; background-color: #A3DCF5;"><strong>TITLE:</strong></TD>
<TD>LINE1</TD>
</TR>
<TR class="row1">
<TD> </TD><td>LINE2</td>
</TR>
<TR class="row0">
<TD> </TD><td>LINE3</td>
</TR>
<TR class="row1">
<TD> </TD><td>LINE4</td>
</TR>
<TR class="row0">
<TD> </TD><td>LINE5</td>
</TR>
</TABLE>
</td>
</tr>
</TABLE>
There are many other tables, though, so a Len check will not help me to sift through the TD tags.
Dim tbls, tbl, tr, j, td, row, sht
Set tbls = IE.document.getElementsByTagName("table")
For Each tbl in tbls
'item indexes are zero-based (AFAIR)
If tbl.Rows(0).Cells(1).innerText = "LINE1" Then
'EDIT: extracting the table contents
Set sht = ActiveSheet
row = 3
For Each tr In tbl.getElementsByTagName("TR")
j = 1
For Each td In tr.getelementsbytagname("TD")
sht.Cells(row + 1, j).Value = td.innerText
j = j + 1
Next
row = row + 1
Next
Exit For 'stop looping
End If
Next
Answer by Glitch_Doctor edited out of question:
Thank you for all the help Tim!
The below worked perfectly for me:
Dim tbls, tbl
Dim L1 As String, L2 As String, L3 As String, L4 As String, L5 As String
Set tbls = IE.Document.getElementsByTagName("table")
For Each tbl In tbls
If tbl.Rows(0).Cells(0).innerText = "Card Address:" Then
On Error Resume Next
L1 = tbl.Rows(0).Cells(1).innerText
L2 = tbl.Rows(1).Cells(1).innerText
L3 = tbl.Rows(2).Cells(1).innerText
L4 = tbl.Rows(3).Cells(1).innerText
L5 = tbl.Rows(4).Cells(1).innerText
Exit For
End If
Next
Worksheets("Sheet2").Range("A1").Value = L1
Worksheets("Sheet2").Range("A2").Value = L2
Worksheets("Sheet2").Range("A3").Value = L3
Worksheets("Sheet2").Range("A4").Value = L4
Worksheets("Sheet2").Range("A5").Value = L5
End Sub
I am using VBA automation to get some information from a ticket system at my job. I am trying to get the values into the generated table, but the only information that doesn't go to column "A" on sheet "Plan1" is the <td> elements that have the overflow: hidden CSS attribute. I don't know whether the two are related, but coincidentally those are the only data that don't appear. Can someone help me?
HTML code:
<div id="posicionamentoContent">
<table class="grid">
<thead>...</thead>
<tbody>
<tr id="937712" class="gridrow">
<td width="200px"> Leonardo Peixoto </td>
<td width="200px"> 23/12/2015 09:45 </td>
<td width="200px"> SIM </td>
<td width="200px"> Telhado da loja com pontos de vazamento.</td>
<td width="200px" align="center"></td>
<td width="200px" align="center"></td>
</tr>
...
...
...
The complete code: http://i.stack.imgur.com/4BsFo.png
I need to get the text of the first 4 <td> elements (Leonardo Peixoto, 23/12/2015 09:45, SIM and Telhado da loja com pontos de vazamento.), but those are exactly the texts I can't get.
Note: When I use the developer tools (F12) to inspect each element, they show the information I need inside the <td> elements perfectly. But when I open the page's source code to check the HTML, it looks like this:
<div id="tabPosicionamento" style="padding: 5px 0 5px 0;" class="ui-tabs-hide">
<div id="posicionamentoContent"></div>
</div>
Example VBA:
Sub extractTablesData1()
'we define the essential variables
Dim IE As Object, obj As Object
Dim ticket As String
Set IE = CreateObject("InternetExplorer.Application")
ticket= InputBox("Enter the ticket code")
With IE
.Visible = False
.navigate ("https://www.example.com/details/") & ticket
While IE.ReadyState <> 4
DoEvents
Wend
ThisWorkbook.Sheets("Plan1").Range("A1:K500").ClearContents
Set data = IE.document.getElementsByClassName("thead")(0).getElementsByTagName("td")
i = 0
For Each elemCollection In data
ThisWorkbook.Sheets("Plan1").Range("A" & i + 1) = data(i).innerText
i = i + 1
Next elemCollection
End With
IE.Quit
Set IE = Nothing
....
....
End Sub
This function returns in column "A" of sheet Plan1 only <td class="info3"></td> and <td class="info4"></td>, but I also need <td class="info1"></td> and <td class="info2"></td>.
I wasn't able to read the page code because the proxy is blocking me, but I faced a similar issue a while ago, and the solution I found was to put all the data on the clipboard and paste it. After that I cleaned up the data on the sheet.
Here is the code I used to do that:
Set ieTable = ie.document.getElementById("ID")
If Not ieTable Is Nothing Then
'DataObject requires a reference to the Microsoft Forms 2.0 Object Library
Set clip = New DataObject
clip.SetText "<html>" & ieTable.outerHTML & "</html>"
clip.PutInClipboard
Sheet1.Range("A1").Select
ActiveSheet.PasteSpecial Format:="Unicode Text", link:=False, DisplayAsIcon:=False, NoHTMLFormatting:=True
End If
Since you need to isolate the 4 td cells, you can do that with a loop for each lookup.
Your sample enumerates the data but never uses the loop variable. Also, the cell assignment should be Cells(x, y).Value. Here is the working code.
Sub extractTablesData1()
'we define the essential variables
Dim IE As Object, Data As Object
Dim ticket As String
Set IE = CreateObject("InternetExplorer.Application")
With IE
.Visible = False
.navigate ("put your data url here")
While IE.ReadyState <> 4
DoEvents
Wend
Set Data = IE.document.getElementsByTagName("tr")(0).getElementsByTagName("td")
i = 1
For Each elemCollection In Data
ActiveWorkbook.Sheets(1).Cells(1, i).Value = elemCollection.innerHTML
i = i + 1
Next elemCollection
End With
IE.Quit
Set IE = Nothing
End Sub
It doesn't bring back the information I need (the last <td> elements).
<div id="posicionamentoContent">
<table class="grid">
<thead>...</thead>
<tbody>
<tr id="937712" class="gridrow">
<td width="200px"> Leonardo Peixoto </td>
<td width="200px"> 23/12/2015 09:45 </td>
<td width="200px"> SIM </td>
<td width="200px"> Telhado da loja com pontos de vazamento.</td>
<td width="200px" align="center"></td>
<td width="200px" align="center"></td>
</tr>
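One possible explanation, going by the observation that the page source shows an empty posicionamentoContent div while the developer tools show populated rows, is that a script fills the table after the initial load, so the ReadyState check can pass before the cells exist. If that is the case (an assumption, not something confirmed in this thread), polling until the rows appear before reading them may help. The sketch below shows the polling pattern in Python; `wait_until` and `rows_loaded` are hypothetical names, and `rows_loaded` is only a stand-in for whatever check reads the rows from the browser.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() until it returns something truthy or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


# Stand-in for "are there rows inside the content div yet?".
# It pretends the rows show up on the third check.
calls = {"n": 0}


def rows_loaded():
    calls["n"] += 1
    return ["row"] if calls["n"] >= 3 else []


rows = wait_until(rows_loaded, timeout=2.0, interval=0.01)
print(rows, calls["n"])
```

A bounded timeout matters here: without it, a page that never loads the rows would leave the macro looping forever.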
So what I'm trying to do here is write a simple HTML table to an xlsx (Excel) file using EPPlus. The code I've got so far is:
controller:
public void saveToExcel(string tbl)
{
using (ExcelPackage p = new ExcelPackage())
{
p.Workbook.Worksheets.Add("demo");
ExcelWorksheet ws = p.Workbook.Worksheets[1];
ws.Name = "Demo";
Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
Response.AddHeader("content-disposition", "attachment; filename=ExcelDemo.xlsx");
Response.BinaryWrite(p.GetAsByteArray());
}
}
Now this creates an empty workbook. All I want to do right now is write this table I have in my
View:
<Table id="tbl" name="tbl">
<tr>
<td>
Title 1
</td>
<td >
Title 1
</td>
<td>
Title 1
</td>
</tr>
<tr>
<td >
Row 1
</td>
<td>
Row 1
</td>
<td>
Row 1
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
<tr>
<td >
Row 2
</td>
<td>
Row 2
</td>
<td>
Row 2
</td>
</tr>
</table>
@Html.ActionLink("saveToExcel", "saveToExcel")
to the workbook. But I just don't know how or where to start.
Thankful for any pointers in the right direction.
My guess:
First, you have to convert your HTML table to a .NET DataTable.
This can be found here: Convert Table
Next, use this code (assuming the DataTable you created is called 'data'):
Dim attachment As String = "attachment; filename=MyExcelPage.xlsx"
Dim epackage As ExcelPackage = New ExcelPackage
Dim excel As ExcelWorksheet = epackage.Workbook.Worksheets.Add("ExcelTabName")
excel.Cells("A1").LoadFromDataTable(data, True)
HttpContext.Current.Response.Clear()
HttpContext.Current.Response.ClearHeaders()
HttpContext.Current.Response.ClearContent()
HttpContext.Current.Response.AddHeader("content-disposition", attachment)
HttpContext.Current.Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
HttpContext.Current.Response.BinaryWrite(epackage.GetAsByteArray())
HttpContext.Current.Response.End()
epackage.Dispose()