C++: How to recursively/iteratively search HTML file (using Boost C++)?

I'm working on an application where I need to fetch an HTML file (from the web) and obtain a piece of information by searching for a string.
I reckon it is more effective and easier to treat the HTML file as an XML file, iterate over its tags, and match their content against a string.
Here is the HTML table I'm interested in:
<table width='100%' class='datatable' cellspacing='0' cellpadding='0'>
<tr>
<td>
</td>
<td width='30px'>
</td>
<td width='220px'>
</td>
<td width='50px'>
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Aktiv tid: <!--This is a string I will search for.-->
</td>
<td colspan='3'>
1 dag, 17:03:46 <!--This is a piece of information I need to obtain.-->
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Bandbredd (upp/ned) [kbps/kbps]:
</td>
<td colspan='3'>
1.058 / 21.373
</td>
</tr>
<tr>
<td height='7' colspan='4'>
<img src='/images/spacer.gif' width='1' height='7' border='0' alt=''>
</td>
</tr>
<tr>
<td width='170'>
Överförda data (skickade/mottagna) [GB/GB]: <!--This is another string I will search for.-->
</td>
<td colspan='3'>
1,67 / 42,95 <!--This is another piece of information I need to obtain.-->
</td>
</tr>
</table>
So I will search for the <td> tags containing either of the following strings:
Aktiv tid:
Överförda data (skickade/mottagna) [GB/GB]:
After that I need to select the next <td> tag (in the same <tr>), which contains the piece of information I want.
I successfully fetched the HTML file using cURL but need a little help with the XML search algorithm.
Thank you in advance!
(EDIT: Here is the pseudocode for my desired application (should be very self-explanatory):
extern "C" {
#include "url.h"
}
#include <string>
#include <iostream>
std::string xmlSearch(std::string fn, std::string str);
int main(void)
{
/* download HTML file from URL to file */
url("http://myurl.com/","page.html");
/* search page.html for "Aktiv tid:" and return the content of the next <td> tag. */
std::string data0 = xmlSearch("page.html","Aktiv tid:");
/* search page.html for "Överförda data (skickade/mottagna) [GB/GB]:" and return the content of the next <td> tag. */
std::string data1 = xmlSearch("page.html","Överförda data (skickade/mottagna) [GB/GB]:");
/* process results */
}
std::string xmlSearch(std::string fn, std::string str){
/* perform search algorithm */
/* return content of the next <td> tag. */
}
)

I could see myself doing this with a quick-and-dirty script, not with C++, really.
In one line:
(tidy -asxml input.xml | xmllint --xpath 'descendant-or-self::*[starts-with(text(), "Aktiv tid:")]/following-sibling::*/text()' -) 2>/dev/null
Here
tidy converts quirky html to xml
xmllint queries it:
from * (any element) which [starts-with(text(), "Aktiv tid:")]
select the text() from the following sibling
2>/dev/null is there to suppress any warning from tidy and xmllint
Presto, it prints:
1 dag, 17:03:46
For the precise input from your question.
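If a script is acceptable, the same lookup (find the cell that starts with the label, then take the next cell in the same row) can also be sketched in Python with only the standard library. This is a minimal sketch, not a Boost or C++ solution: the names CellCollector, find_value and SAMPLE are made up for illustration, SAMPLE is a trimmed copy of the table from the question, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # HTML comments are reported separately, so they never end up here.
        if self._in_cell:
            self.rows[-1][-1] += data


def find_value(html, label):
    """Return the cell after the first cell starting with label, or None."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    for row in parser.rows:
        cells = [cell.strip() for cell in row]
        for i, cell in enumerate(cells[:-1]):
            if cell.startswith(label):
                return cells[i + 1]
    return None


# Trimmed copy of the table from the question (assumed input).
SAMPLE = """
<table>
<tr><td width='170'>Aktiv tid: <!-- label --></td>
<td colspan='3'>1 dag, 17:03:46 <!-- value --></td></tr>
<tr><td height='7' colspan='4'><img src='/images/spacer.gif' alt=''></td></tr>
<tr><td width='170'>Överförda data (skickade/mottagna) [GB/GB]:</td>
<td colspan='3'>1,67 / 42,95</td></tr>
</table>
"""

print(find_value(SAMPLE, "Aktiv tid:"))  # 1 dag, 17:03:46
```

The parser is event-based, so it tolerates the unclosed <img> tags that would trip a strict XML reader.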

Related

Xpath grep elements

I'm using Scrapy (Python) to try to extract data from a site.
How can I extract this structure with XPath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I write an XPath selection that produces this structure?
I can select all the "h3" elements, all the "a" elements, and all the tags with a given class separately, but how can I match them up?
Grep, you say? Strictly speaking that is the wrong word, so to keep the jargon clean: what you are doing is parsing/extracting. Are you new to Scrapy, or to the web-dev side of things? No matter. There is no way I could teach you XPath and regex in one answer here; the only way to get good is to keep at it, but here is my input.
First of all, XPath is especially useful for websites that are not necessarily built to standard, which doesn't make them bad per se. The HTML snippet you gave is structured well enough, though, so I'd recommend CSS selectors. These are the values:
year = response.css('td.tmp_year::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
Pro tip: whenever you find it necessary, you can save a web page as an HTML file and point scrapy shell at it by its file path. So I saved your HTML snippet to a file on my desktop, then ran:
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
Anyway, as for XPath, since you asked: it's a piece of cake. Let's compare the XPath with the CSS and see whether you can spot the correspondence.
response.css('td.tmp_outcome b::text').extract()
So: it is a td tag whose class name is tmp_outcome, then the next node is a bold tag, which is where the text lives; ::text declares that we want the text.
response.xpath('//td[@class="tmp_outcome"]/b/text()').extract()
So the XPath basically says: start with a pattern anywhere in the page, a td tag with class="tmp_outcome", then the bold tag. In XPath, /text() selects the text, and /@href selects, you guessed it, the href attribute.
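The selectors above return three flat lists, which still leaves the question of how to tie each row to its heading. One way to think about it is to walk the document in order and file every completed row under the most recent h3. Here is a minimal sketch of that idea using only Python's standard library rather than Scrapy's own API; SectionParser, group_by_heading and SAMPLE are made-up names, and SAMPLE is a shortened copy of the snippet from the question.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """File each [year, outcome, category] row under the latest <h3>."""

    def __init__(self):
        super().__init__()
        self.sections = {}     # heading text -> list of rows
        self._heading = None   # heading currently in effect
        self._field = None     # "heading", or index 0/1/2 into the row
        self._row = None       # row being filled, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "h3":
            self._heading = ""
            self._field = "heading"
        elif tag == "td" and cls == "tmp_year":
            self._row = ["", "", ""]
            self._field = 0
        elif tag == "b" and self._row is not None:
            self._field = 1
        elif tag == "span" and cls == "tmp_category":
            self._field = 2

    def handle_endtag(self, tag):
        if tag == "h3" and self._field == "heading":
            self._heading = self._heading.strip()
            self.sections.setdefault(self._heading, [])
        elif tag == "span" and self._field == 2 and self._row is not None:
            # The category is the last field, so the row is complete.
            self.sections.setdefault(self._heading, []).append(self._row)
            self._row = None
        if tag in ("h3", "td", "b", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field == "heading":
            self._heading += data
        elif self._field is not None and self._row is not None:
            self._row[self._field] += data.strip()


def group_by_heading(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


# Shortened copy of the question's markup (assumed input).
SAMPLE = """
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename"><tbody>
<tr><td class="tmp_year"> 45767 </td>
<td class="tmp_outcome"><b>Win_1</b><br><span class="tmp_category">TEST_1</span></td></tr>
<tr><td class="tmp_year"> 1232004 </td>
<td class="tmp_outcome"><b>Win_2</b><br><span class="tmp_category">TEST_2</span></td></tr>
</tbody></table>
<h3>Need this text_2</h3>
<table class="thesamename"><tbody>
<tr><td class="tmp_year"> 234 </td>
<td class="tmp_outcome"><b>Win_E</b><br><span class="tmp_category">TEST_E</span></td></tr>
</tbody></table>
</div>
"""

for heading, rows in group_by_heading(SAMPLE).items():
    print(heading, rows)
```

Because rows are keyed off the tmp_year and tmp_category classes rather than off tr, the sketch does not depend on the tables being properly closed.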

How to populate an array with text from html webscraping in ruby

I have used the nokogiri ruby gem to webscrape an html file for only the text under the tableData class. The html code is set up like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop that iterates over either the html code itself or the vt variable to produce an array of arrays. I know there are some push statements involved after the for loop, but it's the actual structure of the for loop that I am having trouble putting together. If you could explain in your answer how the for loop works in this situation, it would be much appreciated.
This is the basic structure you need; the key is map, which builds a new array from the result of its block, so no explicit for loop or push is required:
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all "tr" elements via XPath, and collecting the stripped text of each "td" child:
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
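Since the question asks how the loop itself is structured, it may help to see the nested map spelled out as two explicit loops: the outer one visits each row, the inner one visits each cell in that row and pushes its text. Here is a rough sketch of that shape, written in Python with the standard library instead of Ruby and nokogiri; SNIPPET is a copy of the markup from the question and rows is a made-up name.

```python
import xml.etree.ElementTree as ET

# Copy of the markup from the question (well-formed, so an XML parser accepts it).
SNIPPET = """<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>"""

root = ET.fromstring(SNIPPET)

rows = []                          # the outer array
for tr in root.iter("tr"):         # one pass per table row
    cells = []                     # the inner array for this row
    for td in tr.iter("td"):       # one pass per cell in the row
        cells.append((td.text or "").strip())
    rows.append(cells)             # push the finished row

print(rows)
# [['Jane Doe', '01/01/2017', '01/09/2017', 'Vacation'], ['John Doe', '01/01/2017', '01/09/2017', 'Vacation']]
```

Each map call in the Ruby answers corresponds to one of these loops plus its append.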

How to match and replace multiline html file with sed

I have a text file something like this.
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want.
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my BUGGY code.
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string (I can split those strings later using that as the delimiter). I used the solution given in this URL. Output does not come as I expected.
This is the actual code:
<tbody>
<tr>
<td>
120.52.72.58:80
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>
Your sed code does not work because the <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td> due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with ungreedy quantifiers; thus, some other tool would be more appropriate.
I wanted to remove all <td> and </td> tags and get all the strings delimited by some string …
If those tags are always (as in your examples) on a separate line, a simple sed command will do:
sed '/<\/*td.*>/d'
The remaining strings are then delimited by a newline (\n) followed by spaces.
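To illustrate the point about greediness with a tool that does have non-greedy quantifiers, here is a small sketch in Python. It is not a sed solution: join_cells and SAMPLE are made-up names, and the ";" separator is taken from the desired output in the question. The pattern uses .*? so that each match stops at the nearest </td> instead of the last one.

```python
import re


def join_cells(html, sep=";"):
    """Join the contents of all <td> cells with sep (hypothetical helper)."""
    # .*? is non-greedy: it stops at the first </td> it can.
    # re.S lets "." match newlines, since each cell spans several lines.
    cells = re.findall(r"<td[^>]*>(.*?)</td>", html, flags=re.S)
    return sep.join(cell.strip() for cell in cells)


# Shortened input in the shape shown in the question (assumed).
SAMPLE = """<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td class="small">
String3
</td>
</tr>
</tbody>"""

print(join_cells(SAMPLE))  # String1;String2;String3
```

Markup inside a cell (the span and abbr tags in the real data) is kept as-is by this sketch and would need a further pass.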

Unable to select only first occurrence of multiple attributes with same name?

Here is my html code:
<table id="laptop_detail" class="table">
<tbody>
<tr>
<td style="padding-left:18px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
<tr>
<td style="padding-left:36px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
</tbody>
and my xpath:
$x('//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")]/following-sibling::td[1]/text()')
My problem is that I cannot find a working way to select only the first occurrence.
Enclose the part that locates the "Camera Pixels" td element in parentheses. Without them, a positional predicate such as [1] is evaluated relative to each parent tr, so it matches once per row; with them, it indexes the combined result set:
(//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td
Demo:
$ xmllint index.html --xpath '(//*[@id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td'
<td class="val">8 megapixel camera</td>

Rails 3 string contains HTML code need to loop through the code in string

I am working on a Rails 3 application where I need to put HTML code into a string variable and pass it to a web service as a parameter.
I have the following code with a loop inside, but since it is declared inside a string, the <% %> and #{} tags do not work.
@emaildata = "<H3>FLOOR VIEW ACTION REQUEST</H3>
<table border='0' cellspacing='4'>
<tr>
<td>Submitted On:</td>
<td align='left'><strong>#{Date.today}</strong></td>
</tr>
<tr>
<td> Originator: </td>
<td align='left'><strong>#{session[:user_name]}</strong></td>
</tr>
</table>
<table border=0 width=100%>
<tr bgcolor='##006699'>
<td align='center'><font color='##FFFFFF'><strong>ACTION CODE</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>PART<BR />NUMBER</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>LOCATION</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>BIN QTY</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>PACK QTY</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>UM</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>SCAN CODE</strong></font></td>
<td align='center'><font color='##FFFFFF'><strong>REASON / COMMENTS</strong></font></td>
</tr>
<% (1..PartNoListInEmail.length).each_index do |i|%>
<tr bgcolor='##E0E5E5'>
<td align='center'>#{@ActionCodeListInEmail[i]}</td>
<td align='center'>#{@PartNoListInEmail[i]}</td>
<td align='center'>#{@SendToListInEmail[i]}</td>
<td align='center'>#{@OrderQtyListInEmail[i]}</td>
<td align='center'>#{@PackQtyListInEmail[i]}</td>
<td align='center'>#{@UMListInEmail[i]}</td>
<td align='center'>#{@ScancodeListInEmail[i]}</td>
<td align='center'>#{@reasonForActionIn[i]}</td>
</tr>
<%end%>
</table>"
Please help me.
Save your HTML as a partial (an .html.erb file):
@emaildata = "<%= escape_javascript(render :partial=>'some_partial_name', :locals => {:PartNoListInEmail => @PartNoListInEmail}).html_safe %>"
For combining strings with HTML, you want to use a template system like Erb or Haml. If you don't intend to immediately render the template back to a browser, you can still use Erb to do this by calling Erb directly, having it parse the HTML string and variables and return the result as a string.
Once you go down this road, be extra careful of user provided content and escape anything untrustworthy. When you render erb templates normally in rails, rails does a fair amount of work for you to avoid these sorts of problems, but once you do something like what your example showed, or if you use Erb directly to parse it, you no longer benefit from Rails' safety checks, and therefore will need to put in your own checks.