I am trying to match multi-line HTML source code with a regular expression (using AutoIt). HTML source code to match:
<li class="mission">
<div>
<div class="missionTitle">
<h3>Eat a quarter-pounder with cheese</h3>
<div class="missionProgress">
<span>100%</span>
<div class="missionProgressBar" style="width: 100%;"></div>
</div>
</div>
<div class="missionDetails">
<ul class="missionRewards">
<li class="rewardCash">5,000–8,000</li>
<li class="rewardXP">XP +5</li>
</ul>
<div class="fightItems clearfix">
<h5><span>Prerequisites:</span></h5>
<div class="fightItemsWrap">
<div class="fightItem tooltip" title="Sunglasses" data-attack="Attack: 2" data-defence="Defence: 2">
<img src="/img/enhancement/3.jpg" alt="">
<span>× 1</span>
</div>
<div class="fightItem tooltip" title="Broad Shoulders" data-attack="Attack: 0" data-defence="Defence: 3">
<img src="/img/enhancement/1003.jpg" alt="">
<span>× 1</span>
</div>
<div class="fightItem tooltip" title="Irish Fond Anglia" data-attack="Attack: 4" data-defence="Defence: 8">
<img src="/img/enhancement/2004.jpg" alt="">
<span>× 1</span>
</div>
</div>
</div>
<form action="/quest/index/i/kdKJBrgjdGWKqtfDrHEkRM2duXVn1ntH/h/c0b2d58642cd862bfad47abf7110042e/t/1336917311" method="post">
<input type="hidden" id="id" name="id" value="17"/>
<button class="button buttonIcon btnEnergy"><em>5</em></button>
</form>
</div>
</div>
</li>
It is present multiple times on a single page (but items within <div class="fightItems clearfix">...</div> vary).
I need to match
<h3>Eat a quarter-pounder with cheese</h3>,
the first span <span>100%</span> and
<input type="hidden" id="id" name="id" value="17"/>.
Expected result (for every occurrence on a page):
$a[0] = "Eat a quarter-pounder with cheese"
$a[1] = "100%"
$a[2] = "17"
What I came up with:
(?U)(?:<div class="missionTitle">\s+<h3>(.*)</h3>\s+<div class="missionProgress">\s+<span>(.*)</span>)|(?:<form .*\s+.*<input\stype="hidden"\sid="id"\sname="id"\svalue="(\d+)"/>\s+.*\s+</form>)
But that leaves some array items empty. I also tried the (?s) flag, but then it only captures the first occurrence (and stops matching after that).
I should not have used . to match words or integers, because with the (?s) flag it also matches newlines. The working regex is:
(?U)(?s)<div class="missionTitle">\s+<h3>([^<]+)</h3>(?:.*)<div class="missionProgress">\s+<span>(\d+)%</span>(?:.*)<input.* value="(\d+)"/>
Regular expression to match multi-line HTML source code:
As per the documentation:
\R matches newline characters (?>\r\n|\n|\r),
dot . does not (unless (?s) is set).
\s matches white space characters.
Generally some combination is required (like \R\s*?).
Non-capturing groups are redundant (match without capturing instead).
If the text is uniquely delimited, a negated character class may be used instead (like attribute="([^"]*?)" for text between double quotes).
Example (contains double-quotes; treat as per Documentation - FAQ - double quotes):
(?s)<div class="missionTitle">.*?<h3>(.*?)</h3>.*?<div class="missionProgress">.*?<span>([^<]*?)</span>.*?<input type="hidden" id="id" name="id" value="([^"]*?)"/>
Whether regular expressions should be used on HTML at all (beyond simple listings like this) is a different question (been there, done that, got the T-shirt).
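As a rough cross-check outside AutoIt, the same lazy pattern can be sketched in TypeScript. This is only an illustration: [\s\S]*? stands in for .*? under (?s), the trimmed sample markup and the second mission ("Rob a bank", 40%, 18) are made up, and the names sampleHtml and extractMissions are hypothetical.

```typescript
// Trimmed sample markup; the second mission is invented for illustration.
const sampleHtml = `
<li class="mission">
<div class="missionTitle">
<h3>Eat a quarter-pounder with cheese</h3>
<div class="missionProgress">
<span>100%</span>
</div>
</div>
<form action="/quest" method="post">
<input type="hidden" id="id" name="id" value="17"/>
</form>
</li>
<li class="mission">
<div class="missionTitle">
<h3>Rob a bank</h3>
<div class="missionProgress">
<span>40%</span>
</div>
</div>
<form action="/quest" method="post">
<input type="hidden" id="id" name="id" value="18"/>
</form>
</li>`;

// Returns one [title, progress, id] triple per mission block.
function extractMissions(source: string): string[][] {
  // [\s\S]*? is a lazy "any character, newlines included", like .*? with (?s).
  const pattern =
    /<div class="missionTitle">[\s\S]*?<h3>([\s\S]*?)<\/h3>[\s\S]*?<div class="missionProgress">[\s\S]*?<span>([^<]*?)<\/span>[\s\S]*?<input type="hidden" id="id" name="id" value="([^"]*?)"\/>/g;
  const results: string[][] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    results.push(match.slice(1, 4)); // keep only the three captured groups
  }
  return results;
}

console.log(extractMissions(sampleHtml));
```

Because every quantifier is lazy, each match stops at the first input that follows its own title, so the loop yields one triple per mission rather than one match spanning the whole page.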
The problem is that my Angular code triggers an error on the form controls when I add a white space to the text input. I would like the regex to allow white spaces. I've tried several different regex patterns. I believe the one I'm currently using should allow letters and white spaces.
TypeScript
form = this.fb.group({
title: [,[Validators.required,Validators.pattern("[a-zA-Z\s]+")]],
author: [,[Validators.required,Validators.pattern('/^[a-zA-Z\s]*$/')]],
description: [,Validators.required],
date: [new Date]
})
HTML
<div class="form-group">
<label for="title"> Article Title </label>
<span
style="color: red;font-style: italic"
*ngIf="(mouseOverSubmit || form.controls.title?.touched)
&& form.controls.title?.errors?.required">
Required
</span>
<span
style = "color:red;font-style: italic"
*ngIf= "form.controls.title?.touched
&& form.controls.title?.errors?.pattern">
Only letters and numbers allowed
</span>
<input (ngModel)="title"
name="title"
formControlName="title"
class="form-control"
type="text"
id="title">
</div>
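Here is a validator example for you.
\s Any whitespace character
\S Any non-whitespace character
Inside a TypeScript string literal the backslash is consumed, so "\s" reaches the validator as a plain s. Write "\\s" instead, or pass a RegExp literal such as /^[a-zA-Z\s]*$/ (without the surrounding quotes).
To allow letters and spaces, use Validators.pattern("^[a-zA-Z ]*$")
To allow only one space between two words, use
Validators.pattern("^[a-zA-Z]+( [a-zA-Z]+)*$")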
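One likely cause of the error in the question is the string literal itself: in "[a-zA-Z\s]+" the \s collapses to a plain s before any regex is built. The sketch below shows this without Angular. matchesPattern is a hypothetical stand-in that only imitates how a string pattern gets anchored with ^ and $; it is not Angular's own code.

```typescript
// Hypothetical helper: anchors a string pattern with ^ and $ if they are missing,
// and uses a RegExp as given. An approximation, not Angular's implementation.
function matchesPattern(pattern: string | RegExp, value: string): boolean {
  if (typeof pattern !== "string") {
    return pattern.test(value);
  }
  let source = pattern;
  if (source.charAt(0) !== "^") source = "^" + source;
  if (source.charAt(source.length - 1) !== "$") source = source + "$";
  return new RegExp(source).test(value);
}

const brokenTitlePattern = "[a-zA-Z\s]+";        // "\s" in a string literal is just "s"
const brokenAuthorPattern = '/^[a-zA-Z\s]*$/';   // slashes become literal characters
const escapedPattern = "[a-zA-Z\\s]+";           // "\\s" keeps the backslash
const literalPattern = /^[a-zA-Z\s]*$/;          // RegExp literal, no quotes
const singleSpacePattern = "^[a-zA-Z]+( [a-zA-Z]+)*$";

console.log(brokenTitlePattern);                                // [a-zA-Zs]+
console.log(matchesPattern(brokenTitlePattern, "My Title"));    // false
console.log(matchesPattern(brokenAuthorPattern, "My Title"));   // false
console.log(matchesPattern(escapedPattern, "My Title"));        // true
console.log(matchesPattern(literalPattern, "My Title"));        // true
console.log(matchesPattern(singleSpacePattern, "My  Title"));   // false (two spaces)
```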
I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt="" attribute (the name of the person), as the img tags are nested under a few divs.
Currently, using div.consultant, it gets all the names plus the roles, i.e. Person 1Founder, Chairman & CTO, instead of just the person's name in alt=.
How could I get just the value of alt?
Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet, and text against a NodeSet concatenates the text of every node in it. That often mangles the content and forces you to figure out how to pull it apart again, which, in the end, makes for horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
Get the inner text of all contained Node objects
Instead, use text (AKA inner_text or content) against the individual nodes. Node#text is documented as "Returns the content for this Node", so you get the exact text for each node, which you can then join as you want:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
I tried to use this XPath:
//*[contains(normalize-space(text()),'Jira')]
Also tried:
//*[contains(text(),'Jira')]
In the below HTML example, there is space before and after text "Jira". I am not able to click on the link:
<a href="#/crm/usergroup-edit?id=572a3c84e4b07f6189958700"
ng-repeat="gp in groups | filter : userGroupSearch | orderBy:'-name':1"
class="ng-scope">
<div class="inventoryPanel" ng-style="myStyle" style="width: 15.8%;">
<h4 class="ng-binding">
<div class="groupIcon G">
<div class="text ng-binding">P</div>
</div>Jira
</h4>
</div>
</a>
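Both attempts fail because, in XPath 1.0, contains(text(),'Jira') only examines the first text node of each element, and here that node is the whitespace before the nested div; the text Jira sits in a later text node. The following XPath tests the whole string value instead, so it selects all a elements whose string value contains the substring Jira: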
//a[contains(.,'Jira')]
My goal is to search my notes by title and truncate the body to a limit of letters or words. I don't know if it's possible to do that, or how. I don't want to mess with any custom filters, directives, etc.
This is the idea of the HTML:
<section class="notespage" ng-controller="NotesController">
<h2>Notes list</h2>
<div>
Search:
<input type="text" ng-model="searchText">
</div>
<div class="notesleft">
<div class="notelist">
<div ng-repeat="note in notes | filter:searchText | truncate:{body:letterLimit}">
<div><h3>{{note.title}}</h3></div>
<div>{{note.body}}</div>
</div>
</div>
</div>
</section>
Where should this go? In NotesController?
$scope.letterLimit = 20 ;
Since you don't want to "mess up" with custom filters etc, you can use the built-in limitTo filter to limit the number of characters displayed as note.body:
$scope.letterLimit = 20;
<div>Search:<input type="search" ng-model="searchText" /></div>
<div ng-repeat="note in notes | filter:searchText">
<div><h3>{{note.title}}</h3></div>
<div>{{note.body | limitTo:letterLimit}}</div>
</div>
See, also, this short demo.
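To make the effect of the two filters concrete, here is a minimal sketch in plain TypeScript, outside AngularJS. limitTo and searchByTitle are hypothetical stand-ins that only approximate the built-in filters for strings, and the sample notes are made up. In the template itself, the object form filter:{title: searchText} is the usual way to restrict the search to the title.

```typescript
interface Note {
  title: string;
  body: string;
}

// Approximates limitTo for strings: first `limit` characters,
// or the last |limit| characters when the limit is negative.
function limitTo(input: string, limit: number): string {
  return limit >= 0 ? input.slice(0, limit) : input.slice(limit);
}

// Approximates filter:{title: searchText}: case-insensitive substring match on the title only.
function searchByTitle(notes: Note[], searchText: string): Note[] {
  const needle = searchText.toLowerCase();
  return notes.filter((note) => note.title.toLowerCase().indexOf(needle) !== -1);
}

// Made-up sample data.
const sampleNotes: Note[] = [
  { title: "Shopping", body: "Milk, eggs, bread and a very long list of other things" },
  { title: "Work", body: "Call Bob" },
];
const letterLimit = 20;

const shownNotes = searchByTitle(sampleNotes, "shop").map((note) => ({
  title: note.title,
  body: limitTo(note.body, letterLimit),
}));
console.log(shownNotes);
```

A body shorter than the limit is returned unchanged, so no extra length check is needed.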
This is my html page.
<body>
<div id="form1">
<root xmlns:NS='www.yembi.com'>
<NS:ENTRY Action='Users'>
<div class="colm1">
<NS:DISPLAY datafield="FirstName"></NS:DISPLAY>
</div>
<div class="colm2">
<NS:Text datafield='FirstName'/>
</div>
<div class="colm1">
<NS:DISPLAY datafield="MiddleName"></NS:DISPLAY>
</div>
<div class="colm2">
<NS:Text datafield='MiddleName'/>
</div>
<div class="colm1">
<NS:DISPLAY datafield="LastName"></NS:DISPLAY>
</div>
<div class="colm2">
<NS:Text datafield='LastName'/>
</div>
<div class="colm1">
<NS:DISPLAY datafield="Phone"></NS:DISPLAY>
</div>
<div class="colm2">
<NS:Text datafield='Phone'/>
</div>
<div class="colm3">
<NS:BUTTON ></NS:BUTTON>
</div>
</NS:ENTRY>
</root>
</div>
I want to get this HTML into a temporary variable, so I tried the following:
var Text = $("#form1").html();
I am getting the full HTML content, but with small exceptions.
It does not keep the double quotes around the div attributes.
I am getting the below output:
<ROOT xmlns:NS='www.yembi.com'>
<NS:ENTRY Action='Users'>
<DIV class=colm1>
<NS:DISPLAY datafield="FirstName"></NS:DISPLAY>
</DIV>
<DIV class=colm2> // Here it automatically removes the double quotes. how to retain this double quotes ?
<NS:Text datafield='FirstName'/>
</DIV>
<DIV class=colm1>
<NS:DISPLAY datafield="MiddleName"></NS:DISPLAY>
</DIV>
<DIV class=colm2>
<NS:Text datafield='MiddleName'/>
</DIV>
<DIV class=colm1>
<NS:DISPLAY datafield="LastName"></NS:DISPLAY>
</DIV>
<DIV class=colm2>
<NS:Text datafield='LastName'/>
</DIV>
<DIV class=colm1>
<NS:DISPLAY datafield="Phone"></NS:DISPLAY>
</DIV>
<DIV class=colm2>
<NS:Text datafield='Phone'/>
</DIV>
<DIV class=colm3>
<NS:BUTTON ></NS:BUTTON>
</DIV>
</NS:ENTRY>
</ROOT>
As mentioned above, it automatically removes the double quotes in the DIV tags. How can I retain these double quotes?
I suspect you are using jQuery or an equivalent:
See here from the documentation:
"This method uses the browser's innerHTML property. Some browsers may not return HTML that exactly replicates the HTML source in an original document. For example, Internet Explorer sometimes leaves off the quotes around attribute values if they contain only alphanumeric characters."
http://api.jquery.com/html/
So this is really browser dependent. You could use a JavaScript regex to make sure the quotes are where you need them to be:
Regex to put quotes for html attributes
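A minimal sketch of such a regex, written in TypeScript, might look like the following. quoteAttributes is a hypothetical helper, not part of jQuery, and it makes simplifying assumptions: no attribute value contains > and no unquoted value is directly followed by />.

```typescript
// Hypothetical helper: wraps unquoted attribute values in double quotes.
function quoteAttributes(html: string): string {
  // Only look inside tags, so text such as "x=1" between tags is left alone.
  return html.replace(/<[^<>]+>/g, (tag: string) =>
    tag.replace(
      // Group 2 consumes already-quoted values so their contents are never re-scanned;
      // group 3 captures a bare (unquoted) value.
      /(\s[\w:-]+)=(?:("[^"]*"|'[^']*')|([^\s"'>]+))/g,
      (whole: string, name: string, _quoted: string | undefined, bare: string | undefined) =>
        bare === undefined ? whole : name + '="' + bare + '"'
    )
  );
}

const ieOutput = "<DIV class=colm1><NS:Text datafield='FirstName'/></DIV>";
console.log(quoteAttributes(ieOutput));
```

For anything beyond a quick cleanup like this, building the string from the DOM nodes yourself is likely to be more robust than patching the browser's innerHTML output.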