I`m using Scrapy Python to try to grep data from the site.
How I can grep this structure with Xpath?
<div class="foo">
<h3>Need this text_1</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
45767
</td>
<td class="tmp_outcome">
<b>Win_1</b><br>
<span class="tmp_category">TEST_1</span>
</td>
</tr>
<tr>
<td class="tmp_year">
1232004
</td>
<td class="tmp_outcome">
<b>Win_2</b><br>
<span class="tmp_category">TEST_2</span>
</td>
</tr>
<tr>
<td class="tmp_year">
122004
</td>
<td class="tmp_outcome">
<b>Win_3</b><br>
<span class="tmp_category">TEST_3</span>
</td>
</tr>
</tbody>
<h3>Need this text_2</h3>
<table class="thesamename">
<tbody>
<td class="tmp_year">
234
</td>
<td class="tmp_outcome">
<b>Win_E</b><br>
<span class="tmp_category">TEST_E</span>
</td>
</tr>
<tr>
<td class="tmp_year">
3476
</td>
<td class="tmp_outcome">
<b>Win_C</b><br>
<span class="tmp_category">TEST_C</span>
</td>
</tr>
</tbody>
<h3>Need this text_3</h3>
<table class="thesamename">
<tbody>
<tr>
<td class="tmp_year">
85567
</td>
<td class="tmp_outcome">
<b>Win_T</b><br>
<span class="tmp_category">TEST_T</span>
</td>
</tr>
<tr>
<td class="tmp_year">
435656
</td>
<td class="tmp_outcome">
<b>Win_A</b><br>
<span class="tmp_category">TEST_A</span>
</td>
</tr>
<tr>
<td class="tmp_year">
980
</td>
<td class="tmp_outcome">
<b>Win_Z</b><br>
<span class="tmp_category">TEST_Z</span>
</td>
</tr>
</tbody>
I would like to have output with this structure:
"Section": {
Need this text_1 :
[45767 : Win_1 : TEST_1]
[1232004 : Win_2 : TEST_2]
[122004: Win_3 : TEST_3]
,
Need this text_2:
[234 : Win_E : TEST_E]
[3476 : Win_C : TEST_C]
,
Need this text_3:
[85567 : Win_T : TEST_T]
[435656 : Win_A : TEST_A]
[980: Win_Z : TEST_Z]
}
How can I create the proper xpath select to take this structure?
I can take separately all "h3" , all "a" then all tags with class but how I can match?
GREP YOU SAY?! LOL Well, You would be entirely wron to name it so but for the sake ofkeeping the jargon cleanfor understanding your just parsing/extracting.... So new to scrapy? or web dev sideof things? No matter... Theres no way I couldexpect to teach you in one answer here how to xpth/regex like a pro... only wayis for you to keep at but I throw in my input.
First of all, xpath is amazingly usefull wen it comes to websites that are necessarily build to stadard, which doesnt make them bad per say but in the html snipet you gave... its structured all right soo.. Id recommend css extract .. THESE ARE THE VALUES...
year = response.css('td.tmp_year a::text').extract()
outcome = response.css('td.tmp_outcome b::text').extract()
category= response.css('span.tmp_category::text').extract()
PRO-TIP: For what ever case you deem it neccesary, you can save a web page asan HTML file and use scrapy shell by referencing the direct file path to it... So I save you html snippet to a file on my desktop then ran...
scrapy shell file:///home/scriptso/Desktop/letsGREPlol.html
ANYWAYS... as far as xpath... since you asked lol... cake. lets compare the xpath with the cssand tell me you can see... it? lol
response.css('td.tmp_outcome b::text').extract()
so is a td tag....and the class name is tmp_outcome, thn the next node is a bold tag... of which where the text is thusly declaring it as text with the ::text
response.xpath('//td[#class="tmp_outcome"]/b/text()').extract()
So xpath is basically saying we star with a patter inthe entire site of the td tag... and class= tmp_outcome, then the bold, then in xpath to declare type /text() is for text.... /#href is for.. yeah you guessedit
Related
I have used the nokogiri ruby gem to webscrape an html file for only the text under the tableData class. The html code is setup like so:
<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>
and the code I used to webscrape looks like this:
vt = page.css("td[class='tableData']").text
puts vt
Which gives this output:
Jane Doe 01/01/201701/09/2017 VacationJohn Doe 01/01/201701/09/2017 Vacation
I want to populate an array within an array with only the 4 text values pertaining to each person. Which should look like this:
[[Jane Doe, 01/01/2017, 01/09/2017, Vacation], [John Doe, 01/01/2017, 01/09/2017, Vacation]]
I am new to coding and I'm not sure how to create a for loop to iterate over either the html code itself or the vt variable to produce an array of arrays. I know there are some push statements involved following the for loop but its the actual structure of the for loop that I am having trouble putting together. If you could provide some explanation in your answer for how the for loop works in this situation it would be much appreciated.
This is the basic structure you need. map is needed :
html=%q(<div class="table-wrap">
<table class="table">
<tbody>
<tr>
<td class="tableData"> Jane Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
<tr>
<td class="tableData">John Doe</td>
<td class="tableData"> 01/01/2017</td>
<td class="tableData">01/09/2017 </td>
<td class="tableData">Vacation</td>
</tr>
</tbody>
</table>
</div>)
require 'nokogiri'
doc = Nokogiri::XML(html)
array = doc.xpath('//tr').map do |tr|
tr.xpath('td').map{ |td| td.text }
end
p array
# [[" Jane Doe", " 01/01/2017", "01/09/2017 ", "Vacation"], ["John Doe", " 01/01/2017", "01/09/2017 ", "Vacation"]]
Try parsing the snippet as XML, finding all "tr" elements via XPath, and collecting their "td//text()" children:
require 'nokogiri'
doc = Nokogiri::XML(get_html_snippet)
data = doc.xpath('//tr').map do |tr|
tr.xpath('td').map { |td| td.text.strip }
end
data # => [["Jane Doe", "01/01/2017", "01/09/2017", "Vacation"], ["John Doe", "01/01/2017", "01/09/2017", "Vacation"]]
I have a text file something like this.
<tbody>
<tr>
<td>
String1
</td>
<td>
String2
</td>
<td>
String3
</td>
...
...
<td>
StringN
</td>
</tr>
</tbody>
This is the output that I want.
<tbody>
<tr>
String1;String2;String3;... ...;StringN
</tr>
</tbody>
Here is my BUGGY code.
sed '{
:a
N
$!ba
s|<td.*>\(.*\)</td>|\1|
}'
I wanted to remove all <td> and </td> tags and get all the strings delimitered by some string (I can filter those strings later using that as the delimiter charater). I used the solution given in this URL. Output does not come as I expected.
This is the actual Code
<tbody>
<tr>
<td>
120.52.72.58:80
</td>
<td>
HTTP
</td>
<td>
<span class="text-danger">Transparent</span>
</td>
<td>
<abbr title="2016-12-15 00:07:46">12h ago</abbr>
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
<td>
<img src="/flags/png/cn.png" alt="China (CN)" title="China (CN)" onerror="this.style.display='none'"> <abbr title="China">CN</abbr>
</td>
<td class="small">
Beijing
</td>
<td class="small">
Beijing
</td>
<td class="small">
China Unicom IP network
</td>
<td class="small">
<span class="text-muted">—</span>
</td>
</tr>
</tbody>
Output does not come as I expected.
Your sed code does not work because the <td.*>\(.*\)</td> matches the part of the pattern space from the first <td up to the last </td> due to the greediness of the * quantifier. Unfortunately, sed doesn't support a more modern regex flavor with ungreedy quantifiers; thus, some other tool would be more appropriate.
I wanted to remove all <td> and </td> tags and get all the strings delimitered by some string …
If those tags are always (as in your examples) on a separate line, we can do with a simple sed command:
sed '/<\/*td.*>/d'
All the strings are thereafter delimited by some string which is \n followed by spaces.
I know it might be a duplicate but I am not able to extract a value from this HTML source. Any help would be greatly appreciated.
So what I am trying to do is get the pid of the project from page.
The names of the project are being read from a csv file and I need to get the pid.
For example if the project here is "AA project", just the project key "AA" can also be used, the pid that needs to be extracted is 10441.
Since the values are not a label, I cannot figure out how to extract these.
Update : just using pid=(\d....) gives all the pid without any reference to the project name or key.
<table id="project-list" class="aui">
<thead>
<tr>
<th></th>
<th>Name</th>
<th>Key</th>
<th class="project-list-type">Project Type</th>
<th>URL</th>
<th>Project Lead</th>
<th>Default Assignee</th>
<th>Operations</th>
</tr>
</thead>
<tbody>
<tr data-project-key="AA">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10441&avatarId=10011&size=small" alt="Project Avatar for 10441" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10441" href="/plugins/servlet/project-config/AA/summary">AA project</a>
</td>
<td data-cell-type="key">AA</td>
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AA_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
<li><a class="edit-project" id="edit-project-10441" href="/secure/project/EditProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Edit</a></li>
<li><a id="change_project_type_10441" class="change-project-type-link" data-project-id="10441" href="#">Change project type</a></li>
<li><a id="delete_project_10441" href="/secure/project/DeleteProject!default.jspa?pid=10441&returnUrl=ViewProjects.jspa">Delete</a></li>
</ul>
</td>
</tr>
<tr data-project-key="AAL">
<td class="cell-type-icon" data-cell-type="avatar">
<div class="aui-avatar aui-avatar-small aui-avatar-project jira-system-avatar"><span class="aui-avatar-inner"><img src="/secure/projectavatar?pid=10442&avatarId=10011&size=small" alt="Project Avatar for 10442" /></span></div>
</td>
<td data-cell-type="name">
<a id="view-project-10442" href="/plugins/servlet/project-config/AAL/summary">AAL project</a>
</td>
<td data-cell-type="key">AAL</td>
<td class="cell-type-project-type">
<span>Software</span>
</td>
<td class="cell-type-url" data-cell-type="url">
No URL
</td>
<td class="cell-type-user" data-cell-type="lead">
<a class="user-hover" rel="localadmin" id="view_AAL_projects_localadmin" href="/secure/ViewProfile.jspa?name=localadmin">Atlassian Administrator</a>
</td>
<td class="cell-type-user" data-cell-type="default-assignee">
Unassigned
</td>
<td data-cell-type="operations">
<ul class="operations-list">
I wouldn't recommend using regular expressions to parse HTML data as it will be a headache to develop and maintain and it will be very sensitive to markup changes hence very fragile, see https://stackoverflow.com/a/1732454/2897748 for details.
Go for XPath Extractor instead, the relevant configuration would be:
Reference Name: anything meaningful, i.e. id
XPath Query: substring-after(//tr[#data-project-key='AA']/td[#data-cell-type='name']/a/#id,'view-project-')
Check Use Tidy if your response is not XHTML-compliant
Demo:
References:
XPath Tutorial
XPath Language Reference
Here is my html code:
<table id="laptop_detail" class="table">
<tbody>
<tr>
<td style="padding-left:18px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
<tr>
<td style="padding-left:36px" class="ha">Camera Pixels</td>
<td class="val">8 megapixel camera</td>
</tr>
</tbody>
and my xpath:
$x('//*[#id="laptop_detail"]//tr/td[contains(., "Camera Pixels")]/following-sibling::td[1]/text()')
My problem is I am unable to find any working way of selecting only one occurrence of attribute.
Enclose the part locating the "Camera Pixels" td element into parenthesis:
(//*[#id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td
Demo:
$ xmllint index.html --xpath '(//*[#id="laptop_detail"]//tr/td[contains(., "Camera Pixels")])[1]/following-sibling::td'
<td class="val">8 megapixel camera</td>
i need to change quite some html entries in a mysql database. my problem is that some tags need to be replaced while the surrounded code needs to stay the same. in detail: all td-tags in tr-tags with the class "kopf" need to be changed to th-tags (and the addording closing for the tags)
it would not be a problem without the closing tags..
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
this would work
from what i found the %-sign is used, but how exactly?:
update `tt_content` set `bodytext` = replace(`bodytext`,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>');
i guess this would replace all the code within the old td tags by a %-sign?? how can i achive the needed replacement?
edit: just to clarify things here is a possible entry in the db:
<table class="techDat" > <tbody> <tr class="kopf"> <td colspan="2"> <p><strong>Technical data:</strong></p> </td> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>
after the mysql replacement it should look like
<table class="techDat" > <tbody> <tr> <th colspan="2"> <p><strong>Technical data:</strong></p> </th> </tr> <tr> <td> <p>Operating time depending on battery chargeBetriebszeit je Akkuladung</p> </td> <td> <p>Approx. 4 h</p> </td> </tr> <tr> <td> <p>Maximum volume</p> </td> <td> <p>Approx. 120 dB(A)</p> </td> </tr> <tr> <td> <p>Weight</p> </td> <td> <p>Approx. 59 g</p> </td> </tr> </tbody> </table>
Try two replaces
update `tt_content` set `bodytext` =
replace(replace(`bodytext`,
'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">'),
'</td></tr>','</th></tr>')
Try updating your records with two queries :
1) for without % sign:
updatett_contentsetbodytext= replace(bodytext,'<tr class="kopf"><td colspan="2">','<tr><th colspan="2">');
2) for % sign
updatett_contentsetbodytext= replace(bodytext,'<tr class="kopf"><td colspan="2">%</td></tr>','<tr><th colspan="2">%</th></tr>')
where instr(bodytext,'%') > 0 ;