Parsing HTML with NSRegularExpression

I'm trying to parse an HTML page using NSRegularExpression.
The page is a repetition of this HTML code:
<div class="fact" id="fact66">STRING THAT I WANT</div> <div class="vote">
#106
<span id="p106">246080 / 8.59 </span>
<span id="f106" class="vote2">
(+++)
(++)
(+)
(-)</span>
<span id="ve106"></span>
</div>
So, I'd like to get the string inside this div:
<div class="fact" id="fact66">STRING THAT I WANT</div>
So I made a regex that looks like this:
<div class="fact" id="fact[0-9].*\">(.*)</div>
Now, in my code, I use it like this:
NSString *htmlString = [NSString stringWithContentsOfURL:[NSURL URLWithString:@"http://www.myurl.com"] encoding:NSASCIIStringEncoding error:nil];
NSRegularExpression *myRegex = [[NSRegularExpression alloc] initWithPattern:@"<div class=\"fact\" id=\"fact[0-9].*\">(.*)</div>\n" options:0 error:nil];
[myRegex enumerateMatchesInString:htmlString options:0 range:NSMakeRange(0, [htmlString length]) usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
    NSRange range = [match rangeAtIndex:1];
    NSString *string = [htmlString substringWithRange:range];
    NSLog(@"%@", string);
}];
But it returns nothing. I tested my regex in Java and PHP and it works great. What am I doing wrong?
Thanks

Try using this regex:
@"<div class=\"fact\" id=\"fact[0-9]*\">([^<]*)</div>"
Regex:
fact[0-9].*
means: "fact" followed by a single digit between 0 and 9, followed by any character repeated any number of times.
I also suggest using:
([^<]*)
instead of
(.*)
to match the text between the opening and closing div tags, which sidesteps regex greediness, or alternatively:
(.*?)
(The ? makes the quantifier non-greedy, so it stops at the first instance of </div>.)
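As a rough sketch of how the suggested pattern might be wired up, the function below collects every captured string into an array. The name ExtractFacts is made up for illustration; the pattern is the one from this answer, which also drops the trailing \n that the question's pattern required after </div>.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of every
// <div class="fact" id="factN">...</div> element found in html.
static NSArray<NSString *> *ExtractFacts(NSString *html) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<div class=\"fact\" id=\"fact[0-9]*\">([^<]*)</div>"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        // The pattern failed to compile; report why instead of failing silently.
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *facts = [NSMutableArray array];
    [regex enumerateMatchesInString:html
                            options:0
                              range:NSMakeRange(0, html.length)
                         usingBlock:^(NSTextCheckingResult *match, NSMatchingFlags flags, BOOL *stop) {
        // Capture group 1 is the text between the opening and closing div tags.
        [facts addObject:[html substringWithRange:[match rangeAtIndex:1]]];
    }];
    return facts;
}
```

Passing &error rather than nil makes a malformed pattern visible, which helps when a regex that works elsewhere silently returns nothing.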

Related

iOS parse HTML but get weird value " "

Hi, I'm working on an assignment and I want to get some information from this website:. I used TFHpple from the Ray Wenderlich tutorial. Everything went fine until I tried to get the view count (this number: 8.024.835), but my code returns this value instead: " ". When I NSLog the element's raw property, I see this code:
<p>
Số lượt xem:
<span class="color-fuchsia" id="PageViews"/>    
Yêu thích:
<span class="color-hotpink" id="LikeCount"/>
</p>
But when I inspect the HTML with Firebug, it displays like this:
<p>
Số lượt xem:
<span id="PageViews" class="color-fuchsia">8.024.835</span>
     Yêu thích:
<span id="LikeCount" class="color-hotpink">1.565</span>
</p>
How can I get the correct value? Please help.
This is my code to parse and NSLog the HTML:
-(void) GetBookViewCount{
    NSURL *url = [NSURL URLWithString:@"http://blogtruyen.com/truyen/conan"];
    NSData *htmlData = [NSData dataWithContentsOfURL:url];
    TFHpple *parser = [TFHpple hppleWithHTMLData:htmlData];
    NSString *XpathQueryString = @"//div[@class='description']/p";
    NSArray *Nodes = [parser searchWithXPathQuery:XpathQueryString];
    for (TFHppleElement *element in Nodes) {
        NSLog(@"%@", element.raw);
    }
}
It looks like there are a bunch of odd whitespace characters in between the two spans there.
    
This number here:
Looks like an ASCII code for a symbol (though I can't find one that matches), so the parser might be breaking when it hits those characters. I'm not familiar with TFHpple, but you may need to implement some input sanitization (stripping out those characters).

iOS: Find the end of a specific paragraph in an HTML NSString

So I receive an NSString with html code like this:
<p class="img"><img src="blahblahblah"></p><p>This is some text</p>
I would like to find the end of the img-classed paragraph, so I can insert a heading in between the two paragraphs. Please note:
that the img-classed paragraph is not necessarily the first paragraph in the string.
there can be multiple img-classed paragraphs in the string but I only need to insert something after the first one
I would like to find the character-position after the first img-classed </p> in the string, and not parse it.
You do want to parse; there is really no other option. But make sure you pick a criterion that is truly unique.
Here is the Cocoa + NSString solution:
NSScanner *scanner = [NSScanner scannerWithString:originalString];
[scanner scanUpToString:@"<p class=\"img\">" intoString:nil];
[scanner scanString:@"<p class=\"img\">" intoString:nil];
[scanner scanUpToString:@"</p>" intoString:nil];
[scanner scanString:@"</p>" intoString:nil];
NSInteger insertionPoint = scanner.scanLocation;
NSMutableString *modifiedString = [[NSMutableString alloc] initWithString:originalString];
[modifiedString insertString:insertedString atIndex:insertionPoint];
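The snippet above assumes the img-classed paragraph is always present. A slightly more defensive sketch, wrapped in a hypothetical function named InsertAfterFirstImgParagraph, might check the scanner's return values and hand back the original string untouched when the opening tag or its closing </p> is missing:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: inserts `insertion` right after the first
// <p class="img">...</p> paragraph, or returns `html` unchanged if none is found.
static NSString *InsertAfterFirstImgParagraph(NSString *html, NSString *insertion) {
    NSString *openTag = @"<p class=\"img\">";
    NSString *closeTag = @"</p>";

    NSScanner *scanner = [NSScanner scannerWithString:html];
    scanner.charactersToBeSkipped = nil; // do not silently skip whitespace

    // Move to the first img-classed paragraph and step over its opening tag.
    [scanner scanUpToString:openTag intoString:NULL];
    if (![scanner scanString:openTag intoString:NULL]) {
        return html; // no img-classed paragraph in the string
    }

    // Move to the end of that paragraph and step over the closing tag.
    [scanner scanUpToString:closeTag intoString:NULL];
    if (![scanner scanString:closeTag intoString:NULL]) {
        return html; // paragraph is never closed
    }

    NSMutableString *result = [html mutableCopy];
    [result insertString:insertion atIndex:scanner.scanLocation];
    return result;
}
```

Like the original, this only matches the exact opening tag text, so a paragraph written with extra attributes or different quoting would not be found.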

Getting the HTML tags in hpple as well as text?

The code below takes all of the text from a certain div. Is it possible to get the HTML tags as well as the text, so that all of the <p> </p>'s and <br> </br>'s are also added to the string, mystring?
//trims string from previous page
NSString *trimmedString = [stringy stringByTrimmingCharactersInSet:
[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSData *data = [[NSString stringWithContentsOfURL:[NSURL URLWithString:trimmedString]] dataUsingEncoding:NSUTF8StringEncoding];
TFHpple *xpathParser = [[TFHpple alloc] initWithHTMLData:data];
NSArray *elements = [xpathParser searchWithXPathQuery:@"//div[@class='field-item even']"];
TFHppleElement *element = [elements lastObject]; //may need to change this number?!
NSString *mystring = [self getStringForTFHppleElement:element];
trimmedTextView.text = [trimmedTextView.text stringByAppendingString:mystring];
Method here:
-(NSString*) getStringForTFHppleElement:(TFHppleElement *)element
{
NSMutableString *result = [NSMutableString new];
// Iterate recursively through all children
for (TFHppleElement *child in [element children])
[result appendString:[self getStringForTFHppleElement:child]];
// Hpple creates a <text> node when it parses texts
if ([element.tagName isEqualToString:@"text"])
[result appendString:element.content];
return result;
}
Any ideas would be appreciated. Cheers.
Try this:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.documentElement.outerHTML"];
This returns all of the page's HTML as a string. You can then parse it in your native code and find the div you are interested in, as you did in the example above.
You can also do this for any single DOM element in your HTML, for example:
NSString *htmlDataString = [webView stringByEvaluatingJavaScriptFromString:@"document.getElementById('mydiv').outerHTML"];
which is more efficient but requires a bit of JavaScript skill.

iOS: Strip <img...> from NSString (a html string)

So I have an NSString which is basically an html string with all the usual html elements. The specific thing I would like to do is to just strip it from all the img tags.
The img tags may or may not have max-width, style or other attributes so I do not know their length up front. They always end with />
How could I do this?
EDIT: Based on nicolasthenoz's answer, I came up with a solution that requires less code:
NSString *HTMLTagss = @"<img[^>]*>"; //regex to remove img tag
NSString *stringWithoutImage = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTagss withString:@""];
You can use the NSString method stringByReplacingOccurrencesOfString with the NSRegularExpressionSearch option:
NSString *result = [html stringByReplacingOccurrencesOfString:@"<img[^>]*>" withString:@"" options:NSCaseInsensitiveSearch | NSRegularExpressionSearch range:NSMakeRange(0, [html length])];
Or you can also use the replaceMatchesInString method of NSRegularExpression. Thus, assuming you have your html in a NSMutableString *html, you can:
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<img[^>]*>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:nil];
[regex replaceMatchesInString:html
                      options:0
                        range:NSMakeRange(0, html.length)
                 withTemplate:@""];
I'd personally lean towards one of these options over the stringByReplacingOccurrencesOfRegex method of RegexKitLite. There's no need to introduce a third-party library for something as simple as this unless there was some other compelling issue.
Use a regular expression: find the matches in your string and remove them.
Here is how:
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"<img[^>]*>"
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:nil];
NSMutableString *mutableString = [yourStringToStripFrom mutableCopy];
NSInteger offset = 0; // tracks how far later ranges have shifted because of earlier deletions
for (NSTextCheckingResult *result in [regex matchesInString:yourStringToStripFrom
                                                    options:0
                                                      range:NSMakeRange(0, [yourStringToStripFrom length])]) {
    // the match ranges refer to the original string, so shift them by the offset
    NSRange resultRange = [result range];
    resultRange.location += offset;
    // delete the matched img tag
    [mutableString replaceCharactersInRange:resultRange withString:@""];
    // the mutable string is now shorter by the length of the removed tag
    offset -= resultRange.length;
}
You can use the function below in Swift 4 and 5:
func filterImgTag(text: String) -> String{
return text.replacingOccurrences(of: "<img[^>]*>", with: "", options: String.CompareOptions.regularExpression)
}
Hope this helps! Comment below if it works for you. Thanks.

HTML from NSAttributedString

Rather than converting HTML to an attributed string, I need to convert it back to HTML. This can easily be done on Mac as can be seen here: http://www.justria.com/2011/01/18/how-to-convert-nsattributedstring-to-html-markup/
Unfortunately, the method dataFromRange:documentAttributes: is only available on Mac via the NSAttributedString AppKit Additions.
My question is how can you do this on iOS?
Not the 'easy' way, but what about iterating through the attributes of the string using:
- (void)enumerateAttributesInRange:(NSRange)enumerationRange
options:(NSAttributedStringEnumerationOptions)opts
usingBlock:(void (^)(NSDictionary *attrs, NSRange range, BOOL *stop))block
Have an NSMutableString variable to accumulate the HTML (let's call it 'html'). In the block, you would construct the HTML manually using strings. For instance, if the text attributes 'attrs' specify red, bold text:
[html appendFormat:@"<span style='color:red; font-weight: bold;'>%@</span>", [originalStr substringWithRange:range]];
EDIT: Stumbled across this yesterday:
NSAttributedString+HTMLFromRange category from "UliKit"
(https://github.com/uliwitness/UliKit/blob/master/NSAttributedString+HTMLFromRange.m)
Looks like it will do what you want.
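To make the enumeration approach a little more concrete, here is a minimal sketch assuming UIKit and handling only two attributes, bold fonts and foreground color. The function names HTMLFromAttributedString and EscapeHTML are invented for this example, and a real converter would need to cover many more attributes (italics, underline, links, paragraphs).

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: escapes the characters that are special in HTML text.
static NSString *EscapeHTML(NSString *text) {
    NSMutableString *escaped = [text mutableCopy];
    // Ampersands first, so the entities added below are not escaped again.
    [escaped replaceOccurrencesOfString:@"&" withString:@"&amp;" options:0 range:NSMakeRange(0, escaped.length)];
    [escaped replaceOccurrencesOfString:@"<" withString:@"&lt;" options:0 range:NSMakeRange(0, escaped.length)];
    [escaped replaceOccurrencesOfString:@">" withString:@"&gt;" options:0 range:NSMakeRange(0, escaped.length)];
    return escaped;
}

// Converts a color component in 0...1 to an integer in 0...255.
static int ColorByte(CGFloat component) {
    CGFloat clamped = MAX(0.0, MIN(1.0, component));
    return (int)(clamped * 255.0 + 0.5);
}

// Hypothetical helper: builds an HTML fragment with one <span> per attribute run.
static NSString *HTMLFromAttributedString(NSAttributedString *attributed) {
    NSMutableString *html = [NSMutableString string];
    [attributed enumerateAttributesInRange:NSMakeRange(0, attributed.length)
                                   options:0
                                usingBlock:^(NSDictionary<NSAttributedStringKey, id> *attrs, NSRange range, BOOL *stop) {
        NSMutableArray<NSString *> *styles = [NSMutableArray array];

        UIColor *color = attrs[NSForegroundColorAttributeName];
        CGFloat r = 0, g = 0, b = 0, a = 0;
        if (color != nil && [color getRed:&r green:&g blue:&b alpha:&a]) {
            [styles addObject:[NSString stringWithFormat:@"color: rgb(%d, %d, %d)",
                               ColorByte(r), ColorByte(g), ColorByte(b)]];
        }

        UIFont *font = attrs[NSFontAttributeName];
        if (font != nil && (font.fontDescriptor.symbolicTraits & UIFontDescriptorTraitBold)) {
            [styles addObject:@"font-weight: bold"];
        }

        NSString *text = EscapeHTML([attributed.string substringWithRange:range]);
        if (styles.count > 0) {
            [html appendFormat:@"<span style=\"%@\">%@</span>",
             [styles componentsJoinedByString:@"; "], text];
        } else {
            [html appendString:text]; // unstyled run: emit the text as-is
        }
    }];
    return html;
}
```

Escaping the run text matters here because the attributed string's characters are plain text, so a literal < in the content would otherwise be read as markup.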
Use the code below (dataFromRange:documentAttributes:error: is available on iOS 7 and later). It works well.
NSAttributedString *s = ...;
NSDictionary *documentAttributes = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType};
NSData *htmlData = [s dataFromRange:NSMakeRange(0, s.length) documentAttributes:documentAttributes error:NULL];
NSString *htmlString = [[NSString alloc] initWithData:htmlData encoding:NSUTF8StringEncoding];