Is encoding/xml the best library for parsing HTML table files like this one, and are there any examples of how to do it?
<html><head>
<meta charset="utf-8">
</head>
<body>
<a name="Test1">
<center>
<b>Test 1</b> <table border="0">
<tbody><tr>
<th> Type </th>
<th> Region </th>
</tr>
<tr>
<td> <table border="0">
<thead>
<tr>
<th><b>Type</b></th>
<th> </th>
<th> Count </th>
<th> Percent </th>
</tr>
</thead>
<tbody><tr>
<td> <b>T1</b> </td>
<th> </th>
<td class="numeric" bgcolor="#ff0000"> 34,314 </td>
<td class="numeric" bgcolor="#ff0000"> 31.648% </td>
</tr>
<tr>
<td> <b>T2</b> </td>
<th> </th>
<td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
<td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
</tr>
<tr>
<td> <b>T3</b> </td>
<th> </th>
<td class="numeric" bgcolor="#24da00"> 4,871 </td>
<td class="numeric" bgcolor="#24da00"> 4.493% </td>
</tr>
</tbody></table><br>
</td>
<td> <table border="0">
<thead>
<tr>
<th><b> Type</b></th>
<th> </th>
<th> Count </th>
<th> Percent </th>
</tr>
</thead>
<tbody><tr>
<td> <b>T4</b> </td>
<th> </th>
<td class="numeric" bgcolor="#ff0000"> 34,314 </td>
<td class="numeric" bgcolor="#ff0000"> 31.648% </td>
</tr>
<tr>
<td> <b>T5</b> </td>
<th> </th>
<td class="numeric" bgcolor="#53ab00"> 11,187 </td>
<td class="numeric" bgcolor="#53ab00"> 10.318% </td>
</tr>
<tr>
<td> <b>T6</b> </td>
<th> </th>
<td class="numeric" bgcolor="#bf3f00"> 25,820 </td>
<td class="numeric" bgcolor="#bf3f00"> 23.814% </td>
</tr>
</tbody></table><br>
</td>
</tr>
</tbody></table>
</center>
</a>
</body></html>
Thank you in advance.
Depends on your HTML.
Strictly speaking, the only kind of HTML that a conforming XML parser is guaranteed to parse is XHTML. XHTML was once expected to become the HTML standard, but it never really took off, and these days it is considered obsolete (in favor of the much-hyped "HTML5" and the ecosystem around it). The basic problem with HTML is that, while it looks like XML, it follows different rules. One glaring difference is that <br> is perfectly legal HTML but an unterminated element in XML (where it has to be spelled <br/>), and there are many more differences.
On the other hand, your particular example looks quite XML-ish to me, apart from the unterminated <meta> and <br> elements, which a strict XML parser would reject. So if you can guarantee that your data, while being HTML, will always be well-formed XML at the same time, you can just use the encoding/xml package. Otherwise go for go.net/html (now golang.org/x/net/html), as suggested by @elithrar, or find some other package.
Example
This is a "T-account" as shown in the book Principles of Macroeconomics by Gregory Mankiw:
Code
To render this, I took the approach of using nested tables:
<table class="table">
<thead>
<tr>
<th>ASSETS</th>
<th>LIABILITIES AND OWNERS' EQUITY</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<table>
<tr>
<td>Reserves</td>
<td style="text-align: right;">200</td>
</tr>
<tr>
<td>Loans</td>
<td style="text-align: right;">700</td>
</tr>
<tr>
<td>Securities</td>
<td style="text-align: right;">100</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Deposits</td>
<td style="text-align: right;">800</td>
</tr>
<tr>
<td>Debt</td>
<td style="text-align: right;">150</td>
</tr>
<tr>
<td>Capital</td>
<td style="text-align: right;">50</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
In my application (with some CSS that comes with Blazor) here's the result:
Question
This seems to work OK, although it seems a bit odd. There's the assumption that the rows in each table will always align, for example.
Is there a better or more idiomatic way to implement the T-account in HTML?
I have a huge HTML table, part of which I've included below:
<thead>
<tr class="tableizer-firstrow">
<th>Name</th>
<th>Language</th>
<th>Pages</th>
<th>Author</th>
<th>Publisher</th>
<th>Category</th>
<th>Class 1</th>
<th>Class 2</th>
<th>Class 3</th>
<th>Class 4</th>
<th>Class 5</th>
<th>Class 6</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sarvanna Shikshan- Swapna Navhe Hakka!</td>
<td>Marathi</td>
<td>64</td>
<td>Vinaya Deshpande</td>
<td>Bharat Gyan Vigyan Samuday (BGVS) Maharashtra</td>
<td>Uncategorized</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Apalya Gavat Aple Arogya</td>
<td>Marathi</td>
<td> </td>
<td>-</td>
<td>Cehat Pune</td>
<td>Uncategorized</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
I have thousands of rows, and what I want for each row is something like this:
<tr>
<td class="name">Sarvanna Shikshan- Swapna Navhe Hakka!</td>
<td class="Language">Marathi</td>
<td class="Pages">64</td>
<td class="Author">Vinaya Deshpande</td>
<td class="Publisher">Bharat Gyan Vigyan Samuday (BGVS) Maharashtra</td>
<td class="Category">Uncategorized</td>
<td class="Class 1"> </td>
<td class="Class 2"> </td>
<td class="Class 3"> </td>
<td class="Class 4"> </td>
<td class="Class 5"> </td>
<td class="Class 6"> </td>
</tr>
Is there any way I can insert this for all of the cells? Maybe a search and replace that acts on every 13th occurrence, or something like that? There's no way I'll be able to do this manually. Sorry if this is in the wrong topic; I'm not very familiar with Stack Overflow.
MAKE A BACKUP OF THE HTML CODE AS THIS MIGHT FAIL!
First, open the HTML file in Notepad++.
Open "Replace" from the Search menu (Ctrl+H).
Select the "Regular expression" search mode and check the ". matches newline" option.
This is the search text:
<tr>(.*?)\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)<td>(.*?)</td>\r\n(.*?)</tr>
Enter the replacement text below, then click "Replace All" or step through the matches one at a time with "Replace".
This is the replacement text:
<tr>\1\r\n\2<td class=\"name\">\3</td>\r\n\4<td class=\"Language\">\5</td>\r\n\6<td class=\"Pages\">\7</td>\r\n\8<td class=\"Author\">\9</td>\r\n$10<td class=\"Publisher\">$11</td>\r\n$12<td class=\"Category\">$13</td>\r\n$14<td class=\"Class 1\">$15</td>\r\n$16<td class=\"Class 2\">$17</td>\r\n$18<td class=\"Class 3\">$19</td>\r\n$20<td class=\"Class 4\">$21</td>\r\n$22<td class=\"Class 5\">$23</td>\r\n$24<td class=\"Class 6\">$25</td>\r\n$26</tr>
I'm not the best programmer, and I have only tested this on the small snippet you gave us, so I hope it works for all of your rows.
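For anyone who would rather not maintain a 26-group regular expression, the same per-column replacement can be sketched as a small JavaScript function that works on the HTML string. This is only a sketch: addCellClasses and the classNames array are names made up for illustration, and it assumes body rows open with a plain <tr> and cells are written exactly as <td>, as in the snippet from the question.

```javascript
// Hypothetical helper: adds a class to each plain <td> in a row, by column position.
// Assumes rows are delimited by <tr>...</tr> and cells carry no attributes yet.
function addCellClasses(html, classNames) {
  // Handle one row at a time so the column counter restarts for every row.
  return html.replace(/<tr>([\s\S]*?)<\/tr>/g, function (row) {
    var column = 0;
    return row.replace(/<td>/g, function (cell) {
      var name = classNames[column++];
      // Cells beyond the end of the list are left untouched.
      return name ? '<td class="' + name + '">' : cell;
    });
  });
}

var classNames = ['name', 'Language', 'Pages'];
var input = '<tr>\n<td>Apalya Gavat Aple Arogya</td>\n<td>Marathi</td>\n<td> </td>\n</tr>';
console.log(addCellClasses(input, classNames));
```

Because the header row in the question starts with <tr class="tableizer-firstrow"> and contains only <th> cells, this sketch leaves it unchanged.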
Here is a JavaScript solution. You can run it in your browser, then copy the resulting markup from DevTools.
Note: a class name cannot contain a space. If the class attribute contains one, the element gets two classes: for example, Class and 1 rather than a single class named Class 1.
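// Class names to apply to the cells of each row, in column order (no spaces allowed).
var cls = ['name', 'Language', 'Pages', 'Author', 'Publisher', 'Category', 'Class-1', 'Class-2', 'Class-3', 'Class-4', 'Class-5', 'Class-6'];
// For every body row, give each cell the class that matches its column index.
[].forEach.call(document.querySelectorAll('tbody > tr'), function(row) {
  [].forEach.call(row.querySelectorAll('td'), function(cell, index) {
    cell.classList.add(cls[index]);
  });
});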
table {
border-collapse:collapse;
}
td {
border:1px solid;
}
<table>
<thead>
<tr class="tableizer-firstrow">
<th>Name</th>
<th>Language</th>
<th>Pages</th>
<th>Author</th>
<th>Publisher</th>
<th>Category</th>
<th>Class 1</th>
<th>Class 2</th>
<th>Class 3</th>
<th>Class 4</th>
<th>Class 5</th>
<th>Class 6</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sarvanna Shikshan- Swapna Navhe Hakka!</td>
<td>Marathi</td>
<td>64</td>
<td>Vinaya Deshpande</td>
<td>Bharat Gyan Vigyan Samuday (BGVS) Maharashtra</td>
<td>Uncategorized</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Apalya Gavat Aple Arogya</td>
<td>Marathi</td>
<td> </td>
<td>-</td>
<td>Cehat Pune</td>
<td>Uncategorized</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</tbody>
</table>
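Following the note above about spaces in class names, one way to avoid typing the list by hand is to derive the class names from the header text. This is a minimal sketch: toClassName is a hypothetical helper, and replacing whitespace with hyphens is just one possible convention.

```javascript
// Hypothetical helper: turns header text such as "Class 1" into a single class token
// by trimming it and replacing each run of whitespace with one hyphen.
function toClassName(headerText) {
  return headerText.trim().replace(/\s+/g, '-');
}

// In a browser the header texts could be read from the <th> cells;
// they are hard-coded here so the sketch also runs outside a page.
var headers = ['Name', 'Language', 'Class 1', ' Class 2 '];
var derived = headers.map(toClassName);
console.log(derived.join(', '));
```

The resulting array could then be used in place of the hand-written cls list.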
I am having difficulty using print.xtable to insert an HTML table inside another table.
DF=data.frame(A=c("a","b"),B=c("This is a text
<table border=1>
<tr> <th> </th> <th> x </th> <th> error </th> </tr>
<tr> <td align=\"right\"> 1 </td> <td> element1 </td> <td> thing1 </td> </tr>
<tr> <td align=\"right\"> 2 </td> <td> element2 </td> <td> thing2 </td> </tr>
<tr> <td align=\"right\"> 3 </td> <td> element3 </td> <td> thing3 </td> </tr>
</table>","ok"))
This seems to work fine: (the html tags of the inner table are similar to the html tags of the outer table)
xtable(DF,digits=2)
but print.xtable(xtable(DF,digits=2), type="html") is converting the inner table tags to &lt; and &gt; :
<!-- html table generated in R 3.1.2 by xtable 1.7-4 package -->
<!-- Mon Feb 16 05:55:32 2015 -->
<table border=1>
<tr> <th> </th> <th> A </th> <th> B </th> </tr>
<tr> <td align="right"> 1 </td> <td> a </td> <td> This is a text
&lt;table border=1&gt;
&lt;tr&gt; &lt;th&gt; &lt;/th&gt; &lt;th&gt; x &lt;/th&gt; &lt;th&gt; error &lt;/th&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td align="right"&gt; 1 &lt;/td&gt; &lt;td&gt; element1 &lt;/td&gt; &lt;td&gt; thing1 &lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td align="right"&gt; 2 &lt;/td&gt; &lt;td&gt; element2 &lt;/td&gt; &lt;td&gt; thing2 &lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td align="right"&gt; 3 &lt;/td&gt; &lt;td&gt; element3 &lt;/td&gt; &lt;td&gt; thing3 &lt;/td&gt; &lt;/tr&gt;
&lt;/table&gt; </td> </tr>
<tr> <td align="right"> 2 </td> <td> b </td> <td> ok </td> </tr>
</table>
Hence my question: is there a way to make sure all tags are kept intact?
The problem is that the default sanitize.text.function escapes the HTML tags. You can replace it with a function that does not change anything (setting it to NULL will call the default):
print.xtable(xtable(DF,digits=2), type="html",sanitize.text.function=function(x){x})
Let's say I have a data frame in R. I'd like to write it to a file as a simple HTML table. Just the <table>, <tr>, and <td> tags.
So far this seems harder than it should be. Right now I'm trying to use R2HTML like so:
HTML(dataframe, file=outpath, append=FALSE)
But then I get an ugly, heavily styled HTML file that looks something like this:
<table cellspacing=0 border=1>
<caption align=bottom class=captiondataframe></caption>
<tr><td>
<table border=0 class=dataframe>
<tbody>
<tr class= firstline >
<th> </th>
<th>name </th>
<th>donations </th>
<th>clicks </th>
...
</tr>
<tr>
<td class=firstcolumn>1
</td>
<td class=cellinside>Black.text
</td>
...
</tbody>
</table>
</td></table>
<br>
Is there a way to get simpler output (without specifying border, headings, captions, etc., and without outputting a table inside another table)? Or is this as good as it gets?
The xtable package can generate HTML output as well as LaTeX output.
# install.packages("xtable")
library("xtable")
sample_table <- mtcars[1:3,1:3]
print(xtable(sample_table), type="html", file="example.html")
gives, in the file example.html:
<!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
<!-- Fri Jul 19 09:08:15 2013 -->
<TABLE border=1>
<TR> <TH> </TH> <TH> mpg </TH> <TH> cyl </TH> <TH> disp </TH> </TR>
<TR> <TD align="right"> Mazda RX4 </TD> <TD align="right"> 21.00 </TD> <TD align="right"> 6.00 </TD> <TD align="right"> 160.00 </TD> </TR>
<TR> <TD align="right"> Mazda RX4 Wag </TD> <TD align="right"> 21.00 </TD> <TD align="right"> 6.00 </TD> <TD align="right"> 160.00 </TD> </TR>
<TR> <TD align="right"> Datsun 710 </TD> <TD align="right"> 22.80 </TD> <TD align="right"> 4.00 </TD> <TD align="right"> 108.00 </TD> </TR>
</TABLE>
This could be further simplified with more options to xtable and print.xtable:
print(xtable(sample_table, align="llll"),
type="html", html.table.attributes="")
gives
<!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
<!-- Fri Jul 19 09:13:33 2013 -->
<TABLE >
<TR> <TH> </TH> <TH> mpg </TH> <TH> cyl </TH> <TH> disp </TH> </TR>
<TR> <TD> Mazda RX4 </TD> <TD> 21.00 </TD> <TD> 6.00 </TD> <TD> 160.00 </TD> </TR>
<TR> <TD> Mazda RX4 Wag </TD> <TD> 21.00 </TD> <TD> 6.00 </TD> <TD> 160.00 </TD> </TR>
<TR> <TD> Datsun 710 </TD> <TD> 22.80 </TD> <TD> 4.00 </TD> <TD> 108.00 </TD> </TR>
</TABLE>
(which could be directed to a file with the file argument to print.xtable as in the previous example.)
You could also have a look at the tableHTML package, which was developed for this purpose.
library(tableHTML)
mtcars %>%
tableHTML()
And to print the HTML on the console:
tableHTML(mtcars[1:2, 1:3]) %>%
print(viewer = FALSE)
# <table style="border-collapse:collapse;" class=table_9302 border=1>
# <thead>
# <tr>
# <th id="tableHTML_header_1"> </th>
# <th id="tableHTML_header_2">mpg</th>
# <th id="tableHTML_header_3">cyl</th>
# <th id="tableHTML_header_4">disp</th>
# </tr>
# </thead>
# <tbody>
# <tr>
# <td id="tableHTML_rownames">Mazda RX4</td>
# <td id="tableHTML_column_1">21</td>
# <td id="tableHTML_column_2">6</td>
# <td id="tableHTML_column_3">160</td>
# </tr>
# <tr>
# <td id="tableHTML_rownames">Mazda RX4 Wag</td>
# <td id="tableHTML_column_1">21</td>
# <td id="tableHTML_column_2">6</td>
# <td id="tableHTML_column_3">160</td>
# </tr>
# </tbody>
# </table>
The table can also be styled with CSS using the add_css_ family of functions, if needed.
Details of the package and tutorials (vignettes) are here
A prettier but slower option:
library(htmlTable)
htmlTable(iris)
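library(htmltools)  # provides the tags$... builders (shiny re-exports them too)

# Build a bare <table> from a data frame: one header row, then one row per record.
to_html_table <- function(dataframe) {
  tags$table(
    tags$thead(tags$tr(lapply(colnames(dataframe), function(x) tags$th(x)))),
    tags$tbody(
      # apply() passes each row as a character vector; each value becomes a <td>
      apply(dataframe, 1, function(x) tags$tr(lapply(x, function(y) tags$td(y))))
    )
  )
}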
The answer is actually quite simple, if you use xtable. (Thanks to Señor O for the tip.)
install.packages("xtable")
library(xtable)
out_table_x <- xtable(out_table)
print(out_table_x, type='html', file="./example.html")
I am trying to format a table to look like this...
Basically, I want the "Dates" header to have two columns under it ("To" and "From"), each 50% of the width of "Dates". However, when I try to format it, "To" takes up all of "Dates" and "From" takes up all of "Name"; they aren't locked under "Dates".
Any help will be appreciated. Thank you.
<th width="100%">Dates</th><th>Name</th><th>Age</th>
<tr>
<tr>
<td width="50%">To</td>
<td width="50%">From</td>
</tr>
</tr>
Change your table to something like this:
<table border="1">
<tr class="heading"> <td colspan="4">Information</td> </tr>
<tr>
<th width="15" colspan="2">Dates</th><th>Name</th><th>Age</th>
</tr>
<tr>
<td width="2">From</td>
<td width="2">To</td>
<td></td>
<td></td>
</tr>
<tr>
<td width="5">
<input type="text" class="input" name="1fdate" /></td>
<td width="2">
<input type="text" class="input" name="1fdate" /></td>
</tr>
</table>
I hope this is what you need. You use colspan and rowspan to merge cells. Setting colspan to "2" on the Date cell makes it span two cells (columns) of the row. Setting rowspan to "2" on the cells next to Date makes them span both rows taken up by the Date section.
<table width="600" border="0">
<tr>
<th width="200" colspan="2" scope="col">Date</th>
<th width="200" rowspan="2" scope="col">Name</th>
<th width="200" rowspan="2" scope="col">Age</th>
</tr>
<tr>
<th width="100">To</th>
<th width="100">From</th>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
</table>
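The colspan/rowspan rule described above can be sketched as a small JavaScript function that builds the two header rows from a column description. buildHeader and the shape of the columns array are assumptions made for this illustration, not part of any library.

```javascript
// Hypothetical helper: builds a two-row table header as an HTML string.
// A column with `children` spans them horizontally (colspan);
// a column without children spans both header rows vertically (rowspan="2").
function buildHeader(columns) {
  var top = '';
  var bottom = '';
  columns.forEach(function (col) {
    if (col.children && col.children.length > 0) {
      top += '<th colspan="' + col.children.length + '">' + col.label + '</th>';
      col.children.forEach(function (child) {
        bottom += '<th>' + child + '</th>';
      });
    } else {
      top += '<th rowspan="2">' + col.label + '</th>';
    }
  });
  return '<tr>' + top + '</tr>\n<tr>' + bottom + '</tr>';
}

console.log(buildHeader([
  { label: 'Date', children: ['To', 'From'] },
  { label: 'Name' },
  { label: 'Age' }
]));
```

Each body row then needs one cell per leaf column, four in this case, which matches the empty rows in the table above.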
Change
<th width="100%">Dates</th>
so that it has a colspan value, like this:
<th colspan="2">Dates</th>
Replace the first line with the following:
<th width="100%" colspan="2">Dates</th><th>Name</th><th>Age</th>