Remove Block of HTML Table Data Linux Bash - html

I have an HTML file that I process with a bash script, and I want to remove the empty tables from it. The file is generated from a SQL statement, and a table's header is still emitted when the query returns no records. I want to remove those header-only tables.
<table border="1">
<caption>Table with data</caption>
<tr>
<th align="center">type</th>
<th align="center">column1</th>
<th align="center">column2</th>
<th align="center">column3</th>
<th align="center">column4</th>
</tr>
Data rows exists here
</table>
<table border="1">
<caption>Empty Table To Remove</caption>
<tr>
<th align="center">type</th>
<th align="center">column1</th>
<th align="center">column2</th>
<th align="center">column3</th>
<th align="center">column4</th>
<th align="center">column5</th>
<th align="center">column6</th>
<th align="center">column7</th>
</tr>
</table>
<table border="1">
<caption>Table with data</caption>
<tr>
<th align="center">type</th>
<th align="center">column1</th>
<th align="center">column2</th>
<th align="center">column3</th>
<th align="center">column4</th>
</tr>
Data rows exists here
</table>
I tried a combination of grep and sed to remove the empty table. This worked while every table had the same number of columns: I looped over the tables by caption, did a count, and then removed the block.
It no longer works now that the number of columns varies from table to table.

Like this, using xmlstarlet and xpath:
$ xmlstarlet format -H file.html | sponge file.html
$ xmlstarlet ed -d '//table[./caption/text()="Empty Table To Remove"]' file.html
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<table border="1"><caption>Table with data</caption><tr><th align="center">type</th><th align="center">column1</th><th align="center">column2</th><th align="center">column3</th><th align="center">column4</th></tr>
Data rows exists here
</table>
<table border="1"><caption>Table with data</caption><tr><th align="center">type</th><th align="center">column1</th><th align="center">column2</th><th align="center">column3</th><th align="center">column4</th></tr>
Data rows exists here
</table>
</body>
</html>
To edit in place like sed -i, use
xmlstarlet edit -L ...
As a general rule, don't use sed or regular expressions to parse HTML/XML; use a real parser such as xmlstarlet instead.
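The command above targets one table by its caption. If the goal is to drop every table that has only a header row, the condition can be "no data cells" instead; in XPath terms that would be something like //table[not(.//td)]. Here is a minimal Python sketch of the same idea. It assumes the input is already well-formed XML (for example after the xmlstarlet format -H step) and that data rows use td cells. The function name and the sample markup are illustrative and do not come from the original script.

```python
import xml.etree.ElementTree as ET


def remove_empty_tables(xml_text: str) -> str:
    """Drop every <table> that contains no <td> cells (header-only tables)."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so visit each element as a
    # potential parent and inspect its direct <table> children.
    for parent in list(root.iter()):
        for table in parent.findall("table"):
            if table.find(".//td") is None:
                parent.remove(table)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: one table with a data row, one with only a header.
sample = (
    "<body>"
    "<table><caption>Table with data</caption>"
    "<tr><th>type</th></tr><tr><td>a</td></tr></table>"
    "<table><caption>Empty Table To Remove</caption>"
    "<tr><th>type</th></tr></table>"
    "</body>"
)
cleaned = remove_empty_tables(sample)
print(cleaned)
```

Because the test is structural rather than based on a line count, it does not matter how many columns each header has.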

Related

Make a table row span multiple columns using kable and kableExtra

I am trying to create an HTML table using R and the kable and kableExtra packages. I am having problems creating a row that spans several columns. I want to create a table where the last row contains the same values for all the columns without actually repeating this value. I've created a small example of what I am trying to do below.
library(kableExtra)
library(knitr)
summary_stats <- matrix(c(51,43,22,22),ncol=2,byrow=TRUE)
colnames(summary_stats) <- c("Mean","SD")
rownames(summary_stats) <- c("Age","Observations")
summary_stats
kable_table <- kable(summary_stats) %>%
kable_styling()
Instead of repeating the number 22 on the last row for the two columns, I'd like to center it between the two columns.
I am able to achieve what I want with the following HTML code using the colspan argument:
<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> Mean </th>
<th style="text-align:right;"> SD </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Age </td>
<td style="text-align:right;"> 51 </td>
<td style="text-align:right;"> 43 </td>
</tr>
<tr>
<td style="text-align:left;"> Observations </td>
<td style="text-align:center;" colspan = "2"> 22 </td>
</tr>
</tbody>
</table>
Note that the HTML code is just the output of the kable_table object I created in R where I've manually edited the HTML code to include the colspan argument. I would like to do this programmatically within R instead of having to manually change the code.
I've tried to use the row_spec function from the kableExtra package to add the necessary code but I am limited by the fact that the add_css option (as expected) only accepts arguments related to styling. In other words, I cannot pass the colspan argument to the option.
My question is whether there is a reasonable way of adding the necessary HTML to the table after I've created it, or whether there is an option within the kable/kableExtra framework that I've missed that allows me to do this.

Accessible Table with Sub Headings / Category Separation

EDIT: To the person who tagged this as having nothing to do with ADA. This question has everything to do with ADA. I have tons of websites with tables formatted like that which I am trying to figure out how to make them understandable to someone using a screen reader.
Hello I am trying to figure out a way to make a table which has subheadings / separator rows to announce the proper headings when being read by a screen reader.
The first table works as I would like, announcing the rowgroup's TH and then the column heading. However, the second table isn't announced as I was hoping. For example, Jill is announced as "Field Techs, Name, Jill" instead of "Office, Name, Jill" as I had expected.
I've tried scope="col" and scope="colgroup" but neither helped. Is this even possible, or is it just a badly structured table?
Thank you for reading and any help/advice you may offer!
table thead, table th { background:#d3d3d3; }
table { margin-bottom:40px; }
<!-- This table's headings seem to work properly -->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<td> </td>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
</thead>
<tbody>
<tr>
<th id="office_row" scope="rowgroup" rowspan="2">Office</th>
<td headers="office_row name_col">Jill</td>
<td headers="office_row position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_row name_col">Robert</td>
<td headers="office_row position_col">Project Manager</td>
</tr>
<tr>
<th id="field_row" scope="rowgroup" rowspan="2">Field Techs</th>
<td headers="field_row name_col">Jason</td>
<td headers="field_row position_col">Tech</td>
</tr>
<tr>
<td headers="field_row name_col">Mike</td>
<td headers="field_row position_col">Tech</td>
</tr>
</tbody>
</table>
<!-- This table's headings don't announce correctly. Jill announces "Field Techs, Name, Jill"-->
<table width="100%" cellspacing="0" cellpadding="4" >
<thead>
<tr>
<th id="name_col" scope="col" width="50%">Name</th>
<th id="position_col" scope="col" width="50%">Position</th>
</tr>
<tr>
<th id="office_group" colspan="2">Office</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="office_group name_col">Jill</td>
<td headers="office_group position_col">Office Manager</td>
</tr>
<tr>
<td headers="office_group name_col">Robert</td>
<td headers="office_group position_col">Project Manager</td>
</tr>
</tbody>
<thead>
<tr>
<th id="field_group" colspan="2">Field Techs</th>
</tr>
</thead>
<tbody>
<tr>
<td headers="field_group name_col">Jason</td>
<td headers="field_group position_col">Tech</td>
</tr>
<tr>
<td headers="field_group name_col">Mike</td>
<td headers="field_group position_col">Tech</td>
</tr>
</tbody>
</table>
A table can have at most one thead element (see documentation).
Permitted contents : An optional caption element, followed by zero or more colgroup elements, followed by an optional thead element
Because your table has multiple thead elements, only the last one is considered by your browser and your screen reader. You can use ARIA attributes and roles to handle multiple separate heading rows (for instance, using the aria-labelledby attribute to specify the heading).
One example from WCAG:
ARIA9: Using aria-labelledby to concatenate a label from several text nodes
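Since the question mentions many existing sites with tables built this way, a quick check for the "more than one thead" problem may help find them. This is a rough sketch using only Python's standard html.parser. The class name, helper function, and sample strings are made up for illustration, and nested tables are ignored for simplicity.

```python
from html.parser import HTMLParser


class TheadCounter(HTMLParser):
    """Count <thead> elements inside each top-level <table>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <table> nesting depth
        self.counts = []  # one entry per top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.counts.append(0)
        elif tag == "thead" and self.depth == 1:
            self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_theads(html_text: str) -> list:
    parser = TheadCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical inputs: the first mirrors the two-thead layout from the question.
two_heads = "<table><thead><tr><th>Name</th></tr></thead><tbody></tbody><thead><tr><th>Field Techs</th></tr></thead><tbody></tbody></table>"
one_head = "<table><thead><tr><th>Name</th></tr></thead><tbody></tbody></table>"
print(count_theads(two_heads), count_theads(one_head))
```

Any table whose count is greater than 1 is a candidate for restructuring.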
You are using both the scope method and the headers/id method in one table, which will create problems. Also, as others have pointed out, you're using multiple <thead> and <tbody> elements, which isn't good either.
I've prepared some code samples here on how to correctly code this table using both the scope method and header/id's method:
https://jsfiddle.net/oody1b8x/
It's worth noting that <thead> and <tbody> are not accessibility-related elements, even though they appear to be. They are essentially only used when printing: they let the printer know that the header rows can be repeated on the next page if the table requires pagination.
Also -- don't use ARIA for this purpose; it will only create more problems. The native HTML semantics are perfectly capable of communicating how this data is structured.

How to extract only the 1st table tag from a html page having various nested table tag

I have the following HTML page. I want to extract only the data within the first table tag, using C#. The HTML page code is:
<table cellpadding=2 cellspacing=0 border=0 width=100%>
<tbody>
<tr>
<td align=right><b>11/09/2013 at 09:48</b></td>
</tr>
</tbody>
</table>
<center>
<table border="1" bordercolor="silver" cellpadding="2" cellspacing="0" width="100%">
<thead>
<tr>
<th width=100>ETA</th>
<th width=100>Ship Name</th>
<th width=80>From port</th>
<th width=80>To berth</th>
<th width=130>Agent</th>
</tr>
</thead>
<tbody>
<tr><td>11/09/2013 at 09:00 </td>
<td>SONANGOL KALANDULA </td>
<td>Cabinda </td>
<td>Valero 6 </td>
<td>Graypen </td>
</tr>
</tbody>
</table>
To be more specific, I want to extract only the row containing the date 11/09/2013 at 09:48, which sits inside the first table tag. I am using this regex:
"<table[^>]*>([^<]*(?:(?!</table)<[^<]*)*)[</table>]*"
With this, however, I get the whole page source: the match covers the data between all of the table tags, whereas I only want the text inside the first table tag.
Can anyone tell me regular expression with which I can only extract this particular portion from the whole html page?
When I tried your version here, it seemed to work for me on the input you specified, though [</table>]* should really be just </table> ([</table>]* is a character class, so it matches any number of characters from the set <, /, t, a, b, l, e, >).
The pattern could be simplified, though. This should also work, provided the regex is run with RegexOptions.Singleline so that . also matches line breaks:
<table[^>]*>.*?</table>
All bets are off if you have nested tables, of course.
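For comparison, here is the same non-greedy pattern as a small Python sketch (the question itself is about C#). re.DOTALL plays the role that RegexOptions.Singleline plays in .NET. The function name and the shortened sample page are illustrative only, and the nested-table caveat applies just the same.

```python
import re

# Non-greedy match from the first <table ...> to the nearest </table>.
# re.DOTALL lets "." cross line breaks; re.IGNORECASE tolerates <TABLE>.
FIRST_TABLE = re.compile(r"<table[^>]*>.*?</table>", re.DOTALL | re.IGNORECASE)


def first_table(html_text: str):
    """Return the first <table>...</table> block, or None if there is none."""
    match = FIRST_TABLE.search(html_text)
    return match.group(0) if match else None


# Shortened, hypothetical version of the page from the question.
page = """<table cellpadding=2>
<tbody><tr><td align=right><b>11/09/2013 at 09:48</b></td></tr></tbody>
</table>
<center>
<table border="1">
<tbody><tr><td>11/09/2013 at 09:00 </td></tr></tbody>
</table>"""

result = first_table(page)
print(result)
```

Because search stops at the first match and .*? stops at the nearest closing tag, only the first table is returned as long as it contains no nested table.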

How can I make a HTML table with headers in one vertical column?

I want to make an HTML file that has the headers in one vertical column and the data in the column to the right. There will only be 2 columns in total. I've looked at the HTML docs and seen material about scope, but I'm not entirely sure how to use it in this context.
The HTML is pretty straightforward, just be sure to use the [scope] attribute to specify the correct orientation of the table.
<table>
<tbody>
<tr>
<th scope="row">City</th>
<td>$city</td>
</tr>
<tr>
<th scope="row">Latitude</th>
<td>$latitude</td>
</tr>
<tr>
<th scope="row">Longitude</th>
<td>$longitude</td>
</tr>
<tr>
<th scope="row">Country</th>
<td>$country</td>
</tr>
</tbody>
</table>
From the docs for the [scope] attribute:
The row state means the header cell applies to some of the subsequent cells in the same row(s).
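The $city-style placeholders suggest the table is filled in by a template. As one possible way to produce that markup, here is a hedged Python sketch that builds the same structure from label/value pairs. The function name and the sample values are assumptions made for illustration.

```python
from html import escape


def vertical_table(fields: dict) -> str:
    """Render label/value pairs as rows with the label in a <th scope="row">."""
    rows = [
        f'<tr>\n<th scope="row">{escape(label)}</th>\n<td>{escape(str(value))}</td>\n</tr>'
        for label, value in fields.items()
    ]
    return "<table>\n<tbody>\n" + "\n".join(rows) + "\n</tbody>\n</table>"


# Hypothetical values standing in for $city, $latitude, and so on.
markup = vertical_table(
    {"City": "Oslo", "Latitude": 59.91, "Longitude": 10.75, "Country": "Norway"}
)
print(markup)
```

Escaping both labels and values keeps the generated table well-formed even if the data contains characters such as < or &.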
You can create the table with <th> elements followed by <td> elements, like so:
<table>
<tr>
<th scope="row">Category 1</th><td>data1</td>
</tr>
<tr>
<th scope="row">Category 2</th><td>data2</td>
</tr>
<tr>
<th scope="row">Category 3</th><td>data3</td>
</tr>
</table>

split html table

I have an HTML table which looks like this:
<table>
<thead>
<tr>
<th >title1</th>
<th >title2</th>
<th >title3</th>
<th >title4</th>
<th >title5</th>
<th >title6</th>
<th >title7</th>
</tr>
</thead>
<tbody>
<tr>
<td>data1</td>
...
<td>data7</td>
</tr>
</tbody>
</table>
The issue I am having is that I only have around 300px of width to fit all this information in. Is there some way to tell the table to split once it reaches the 300px limit? Is this even possible, or should I just go back to using divs?
I'm not sure what 'splitting' is, but a good alternative would be to wrap the table in a container with overflow-x: auto set. That will make it scrollable.