How to match and replace multiline html file with sed - html

I have a text file something like this.
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want.
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my BUGGY code.
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimitered by some string (I can filter those strings later using that as the delimiter charater). I used the solution given in this URL. Output does not come as I expected.
This is the actual Code
<tbody>
<tr>
<td>
120.52.72.58:80
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>

Output does not come as I expected.
Your sed code does not work because the <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td> due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with ungreedy quantifiers; thus, some other tool would be more appropriate.
I wanted to remove all <td> and </td> tags and get all the strings delimitered by some string …
If those tags are always (as in your examples) on a separate line, we can do with a simple sed command:
sed '/<\/*td.*>/d'
All the strings are thereafter delimited by some string which is \n followed by spaces.

Related

Xpath grep elements

I`m using Scrapy Python to try to grep data from the site.
How I can grep this structure with Xpath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I create the proper xpath select to take this structure?
I can take separately all "h3" , all "a" then all tags with class but how I can match?
GREP YOU SAY?! LOL Well, You would be entirely wron to name it so but for the sake ofkeeping the jargon cleanfor understanding your just parsing/extracting.... So new to scrapy? or web dev sideof things? No matter... Theres no way I couldexpect to teach you in one answer here how to xpth/regex like a pro... only wayis for you to keep at but I throw in my input.
First of all, xpath is amazingly usefull wen it comes to websites that are necessarily build to stadard, which doesnt make them bad per say but in the html snipet you gave... its structured all right soo.. Id recommend css extract .. THESE ARE THE VALUES...
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
PRO-TIP: For what ever case you deem it neccesary, you can save a web page asan HTML file and use scrapy shell by referencing the direct file path to it... So I save you html snippet to a file on my desktop then ran...
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
ANYWAYS... as far as xpath... since you asked lol... cake. lets compare the xpath with the cssand tell me you can see... it? lol
response.css('td.tmp_outcome b::text').extract()
so is a td tag....and the class name is tmp_outcome, thn the next node is a bold tag... of which where the text is thusly declaring it as text with the ::text
response.xpath('//td[#class="tmp_outcome"]/b/text()').extract()
So xpath is basically saying we star with a patter inthe entire site of the td tag... and class= tmp_outcome, then the bold, then in xpath to declare type /text() is for text.... /#href is for.. yeah you guessedit

how to create html table in fpdf

I am newbie.
I have a problem how to convert html table with fpdf.
This is my html code
<html>
<body>
<table border=1 align=center>
<tr>
<tr>
<td> </td>
<td>PO1</td>
<td>PO2</td>
</tr>
<tr>
<td>LO1</td>
<td> "THIS WILL TAKE THE VALUE FROM DATABASE " </td>
<td> "THIS WILL TAKE THE VALUE FROM DATABASE " </td>
</tr>
<tr>
<td>LO2</td>
<td> "THIS WILL TAKE THE VALUE FROM DATABASE " </td>
<td> "THIS WILL TAKE THE VALUE FROM DATABASE " </td>
</tr>
</table>
</body>
</html>
Can anyone tell me how to do it?
please take a look the example at http://www.fpdf.org/en/script/script41.php

Extract a value from HTML source code using jmeter

I know it might be a duplicate but I am not able to extract a value from this HTML source. Any help would be greatly appreciated.
So what I am trying to do is get the pid of the project from page.
The names of the project are being read from a csv file and I need to get the pid.
For example if the project here is "AA project", just the project key "AA" can also be used, the pid that needs to be extracted is 10441.
Since the values are not a label, I cannot figure out how to extract these.
Update : just using pid=(\d....) gives all the pid without any reference to the project name or key.
<table id="project-list" class="aui">
<thead>
<tr>
<th></th>
<th>Name</th>
<th>Key</th>
<th class="project-list-type">Project Type</th>
<th>URL</th>
<th>Project Lead</th>
<th>Default Assignee</th>
<th>Operations</th>
</tr>
</thead>
<tbody>
<tr data-project-key="AA">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10441&amp;avatarId=10011&amp;size=small" alt="Project Avatar for 10441" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
</td>
<td data-cell-type="key">AA</td>
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AA_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
<li><a class="edit-project" id="edit-project-10441" href="/secure/project/EditProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Edit</a></li>
<li><a id="change_project_type_10441" class="change-project-type-link" data-project-id="10441" href="#">Change project type</a></li>
<li><a id="delete_project_10441" href="/secure/project/DeleteProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Delete</a></li>
</ul>
</td>
</tr>
<tr data-project-key="AAL">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10442&amp;avatarId=10011&amp;size=small" alt="Project Avatar for 10442" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
</td>
<td data-cell-type="key">AAL</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AAL_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
I wouldn't recommend using regular expressions to parse HTML data as it will be a headache to develop and maintain and it will be very sensitive to markup changes hence very fragile, see https://stackoverflow.com/a/1732454/2897748 for details.
Go for XPath Extractor instead, the relevant configuration would be:
Reference Name: anything meaningful, i.e. id
XPath Query: substring-after(//tr[#data-project-key='AA']/td[#data-cell-type='name']/a/#id,'view-project-')
Check Use Tidy if your response is not XHTML-compliant
Demo:
References:
XPath Tutorial
XPath Language Reference

Generating all data HTML

I want to webcrawl this link using the XML package. Problem is the data are not automatically generated. This piece of HTML generates the table:
<table width="1280px" id="maintable">
<tr id="tabletoggles">
<td> </td>
<td id="tablelabel"> </td>
<td id="abovestats" class="abovestats" align="right">
<span class="revscore likelink"></span>
<b>Stats:</b>
<span class="statso stattab">Serve</span> | <span class="statsr stattab likelink">Return</span> | <span class="statsw stattab likelink">Raw</span>
</td></tr>
<tr>
<td id="footer" class="footer"> </td>
<td colspan="2" id="stats" class="stats"><table id="matches"></table></td>
</tr>
<tr>
<td id="belowmenus"> <br/> <br/> <br/> <br/> </td>
<td colspan="2" id="belowmatches"> </td>
</tr>
</table></div>
</div>
When using the function readHTMLTable in XML on this piece of HTML I just get nonsensical values:
readHTMLTable("http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic&f=ACareerqq",which = 3)
V1 V2
1 Â
2 Â Â Â Â Â Â
How can I retrieve the "full link" containing all data? I can do it manually for each page using Firebug but I'd like to have a solution which can retrieve multiple urls at the same time.
I believe that this is due to lack of UTF8 encode.
What language are you using for get this data?
If you are using PHP to take the data, I recommend using
header('Content-Type: text/html; charset=utf-8');
before the entire code.

string replace in phpmyadmin while keeping inner text

i need to change quite some html entries in a mysql database. my problem is that some tags need to be replaced while the surrounded code needs to stay the same. in detail: all td-tags in tr-tags with the class "kopf" need to be changed to th-tags (and the addording closing for the tags)
it would not be a problem without the closing tags..
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
this would work
from what i found the %-sign is used, but how exactly?:
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>');
i guess this would replace all the code within the old td tags by a %-sign?? how can i achive the needed replacement?
edit: just to clarify things here is a possible entry in the db:
<table class="techDat" > <tbody> <tr class="kopf"> <td colspan="2"> <p><strong>Technical data:</strong></p> </td> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>
after the mysql replacement it should look like
<table class="techDat" > <tbody> <tr> <th colspan="2"> <p><strong>Technical data:</strong></p> </th> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>
Try two replaces
update `tt_content` set `bodytext` =
replace(replace(`bodytext`,
'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">'),
'</td></tr>','</th></tr>')
Try updating your records with two queries :
1) for without % sign:
updatett_contentsetbodytext= replace(bodytext,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
2) for % sign
updatett_contentsetbodytext= replace(bodytext,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>')
where instr(bodytext,'%') > 0 ;